ANDS DIMER

Overview

The ANDS Diffraction Image Experiment Repository (DIMER) project developed enhancements to the UQ repository of X-ray diffraction images, providing support for automated capture of images and metadata, syndication of published diffraction image datasets, and improvements to the robustness, scalability, and usability of the repository.

Funding
Australian National Data Service (ANDS) Data Capture funding
Period
2011-2012
Structure
The DIMER project is a collaboration between the UQ eResearch Group and the UQ Remote Operation Crystallization and X-Ray Diffraction Facility (UQ ROCX)
Project team
Prof Jane Hunter, Project Director, UQ eResearch Group
Prof Jenny Martin, Project Advisor, Institute for Molecular Bioscience
Prof Bostjan Kobe, Project Advisor, School of Chemistry and Molecular Biosciences
Karl Byriel, Project Advisor, Institute for Molecular Bioscience
Simon McNaughton, Software Developer, UQ eResearch Group
Jonathan Ellis, Bioinformatician, School of Chemistry and Molecular Biosciences
Charles Brooking, Project Manager, UQ eResearch Group
Dr Nigel Ward, Project Manager, UQ eResearch Group

Significance

Structural biology has emerged as one of the most powerful approaches for defining the functions of proteins. The strong predictive power of structure in functional annotation has resulted in the rapid growth of the field of structural genomics, which has enabled the discovery of novel drug targets and advanced our understanding of protein evolution. The field promises to have a major impact on the life sciences, biotechnology, and medicine.

High-throughput or parallel processing approaches have been developed for producing protein samples for structural biology and functional studies. Crystals that form successfully in these samples are subjected to X-ray diffraction, which is the most widely used approach for protein structure determination, accounting for approximately 85% of structures in the Protein Data Bank. At UQ, the Structural Genomics group has established such a high-throughput processing pipeline at the Remote Operation Crystallization and X-ray Diffraction Facility (UQ ROCX) that applies parallel processing to hundreds of protein targets.

Drivers and Goals

An ARC Discovery project (funded from 2006 to 2011), a collaboration between the Structural Genomics group and the UQ eResearch Lab, enabled the development of prototype e-research services to capture the data resulting from the protein crystallization and structural analysis pipeline: TIMTAM, a laboratory information management system for target selection and crystallization experiments; and DIMER, a repository for X-ray diffraction images that are processed to determine crystal structure. The present ANDS project was established to deliver several enhancements that DIMER needed.

It was recognised that support for automated capture of images and metadata had to be added in order to maximise the number of datasets in DIMER: this removes a significant barrier for users, given the time pressures felt by scientists and the difficulty of managing the large file sizes of diffraction image sets. It also assists researchers in fulfilling their data management obligations, which include the storage and backup of research outputs and granting access rights to appropriate colleagues and supervisors.

DIMER was also extended to allow published datasets to be accessible and shareable both via UQ DataSpace and the ANDS Research Data Australia discovery services. This boosts the profile of datasets hosted by DIMER and fills a gap in publishing the outputs of X-ray crystallography studies: in addition to journal articles, captured by services such as PubMed, and protein structures, captured by the Protein Data Bank, syndication to UQ DataSpace and ANDS RDA allows the diffraction image datasets leading to solved structures to be accessible and discoverable online.

In addition to these two major features, this project also involved a range of improvements to the robustness, scalability, and usability of the repository. These changes represented a transition of DIMER from a prototype to a production system.

Data

Datasets from DIMER have been made available under the Creative Commons Attribution 3.0 Australia License, and can be viewed on Research Data Australia under the ‘Diffraction Image Experiment Repository’ collection, located at http://researchdata.ands.org.au/diffraction-image-experiment-repository-httpdataspaceuqeduaucollections1w. These datasets come from a variety of projects undertaken at UQ that have the potential to inform the design of novel drugs; for example through the development of a new class of antivirulence compounds to combat antibiotic-resistant infection. Further collections will come online as datasets are automatically captured from UQ ROCX and more researchers are encouraged to make their datasets publicly accessible.

The diagram below illustrates the flow of data from the UQ ROCX and Australian Synchrotron facilities into DIMER, which supports the management, search, publication, and download of data. Data collections are publicised through UQ DataSpace and ANDS Research Data Australia.

DIMER application

DIMER provides tools for creating and managing projects, experiments, and datasets: these provide a structure for organising diffraction image data, adding descriptive metadata, and setting user access permissions. Users are able to upload datasets through the web interface; there is also an automatic data capture component that harvests datasets produced at the UQ ROCX facility. Projects are unpublished by default, but can be marked as published and publicised via UQ DataSpace and ANDS Research Data Australia. Each of these features of DIMER is described in the sections that follow.

The DIMER application is online at http://dimer.uq.edu.au/.

Projects, Experiments, and Datasets

DIMER defines the concepts of projects, experiments, and datasets as a structure within which diffraction image datasets are represented. At the highest level, projects link together a group of researchers who are collaborating on a series of experiments. User accounts are assigned to each project with one of the following permissions: reader (read-only access); writer (ability to upload and edit data); or manager (ability to upload/edit data and assign permissions). Restricting access prior to the publication of research findings is extremely important for users of DIMER, given the competitive nature of the research.
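The three permission levels can be pictured as a small access model. The following is a minimal sketch only, not DIMER's actual code: the names `ProjectPermission` and `ProjectAccess`, and the rule that users without an entry cannot read a project, are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the three project permission levels described above. */
enum ProjectPermission {
    READER(false, false),   // read-only access
    WRITER(true, false),    // may upload and edit data
    MANAGER(true, true);    // may upload/edit data and assign permissions

    private final boolean canWrite;
    private final boolean canAssign;

    ProjectPermission(boolean canWrite, boolean canAssign) {
        this.canWrite = canWrite;
        this.canAssign = canAssign;
    }

    boolean canWrite() { return canWrite; }
    boolean canAssignPermissions() { return canAssign; }
}

/** Hypothetical per-project access table: user name -> permission. */
class ProjectAccess {
    private final Map<String, ProjectPermission> members = new HashMap<>();

    void grant(String user, ProjectPermission permission) {
        members.put(user, permission);
    }

    /** Assumption: a user with no entry cannot read the project. */
    boolean canRead(String user) {
        return members.containsKey(user);
    }

    boolean canWrite(String user) {
        ProjectPermission p = members.get(user);
        return p != null && p.canWrite();
    }

    boolean canAssignPermissions(String user) {
        ProjectPermission p = members.get(user);
        return p != null && p.canAssignPermissions();
    }
}
```

In this sketch every level implies read access, and only a manager can change who belongs to the project.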

The screenshot below shows a project, PaDsba, which is a collaboration between two researchers and their laboratory group. Nested within the image is an experiment, SelenoMet PaDsba, contained in the project. An experiment is simply a container for a series of datasets, with a title and optional description.

Diffraction images are uploaded to datasets, as shown in the example below. Each dataset contains a range of bibliographic metadata, including: a title and description; the authors to whom the dataset is attributed (these may be different from the users assigned to the project, particularly if not all researchers have user accounts in DIMER); the date of creation and publication; and links to related publications via PubMed. There are also several metadata fields specific to diffraction images: the facility and specific light source from which the dataset was gathered; the attributes of diffraction images contained in the dataset (e.g. exposure time, oscillation range, detector distance); and a link to the PDB entry for any resulting structure. Dataset pages, as with projects and experiments, have a Reference URL, which is a short URL intended for citation from manuscripts and elsewhere.

The representation of diffraction images in a dataset is shown in the screenshots below. On the left, the diffraction image metadata is listed in a table, with JPEG image thumbnails shown below. Diffraction image metadata and JPEG images are extracted from raw diffraction image files using the CCP4 Diffraction Image Library. On the right, DIMER provides two file listings: the Image Files section contains the raw diffraction image files in the dataset; the Associated Files section contains files closely related to the diffraction images, such as processing log files and other files generated while solving a structure.

Uploading Datasets

Files can be uploaded to DIMER through a variety of means. As part of the ANDS DIMER project, we developed a Java applet, embedded in the DIMER website, that supports the upload of diffraction images and associated files from a user's local file system through a graphical interface. This applet uses the data compression features of HTTP (i.e. gzip content encoding) to minimise transfer time. As an alternative to using this applet, users can transfer files using the WebDAV protocol: there are several WebDAV clients available, including support in most operating systems for configuring WebDAV shares as a remote drive.
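To illustrate the gzip content encoding mentioned above, the sketch below compresses a payload as a client might before sending it with a `Content-Encoding: gzip` header, and restores it as a server would. It uses only the standard `java.util.zip` classes; the class name `GzipUploadSketch` is hypothetical and this is not the applet's actual code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper showing gzip compression of an upload body. */
class GzipUploadSketch {

    /** Compresses the raw bytes, as a client would before an upload. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Restores the original bytes, as the receiving server would. */
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

How much transfer time is saved depends on how compressible the image format is; files that are already compressed gain little.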

Automatic Data Capture

DIMER includes a File Monitor component that continually monitors the directories on the UQ ROCX file server into which each diffractometer streams detector images. A settings page is provided in DIMER that maps each user's DIMER account to their username in Crystal Clear, the software used to control the diffractometers in UQ ROCX. This allows images found on the file server to be automatically uploaded to DIMER under a staging area for each user.
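The username mapping can be sketched as a lookup from a file's location to a staging area. This is an illustration under stated assumptions rather than the File Monitor's real logic: it assumes the Crystal Clear username is the first directory below the monitored root, and the `StagingResolver` class and the `/staging/<account>` naming are hypothetical.

```java
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical resolver from a detector image path to a DIMER staging area. */
class StagingResolver {
    private final Path monitoredRoot;
    // Crystal Clear username -> DIMER account, as entered on the settings page.
    private final Map<String, String> crystalClearToDimer = new HashMap<>();

    StagingResolver(Path monitoredRoot) {
        this.monitoredRoot = monitoredRoot.normalize();
    }

    void mapUser(String crystalClearUsername, String dimerAccount) {
        crystalClearToDimer.put(crystalClearUsername, dimerAccount);
    }

    /**
     * Returns the staging area for an image file, or empty if the file is
     * outside the monitored root or its owner has no mapped DIMER account.
     * Assumed layout: <root>/<crystalClearUsername>/.../<image file>.
     */
    Optional<String> stagingAreaFor(Path imageFile) {
        Path normalized = imageFile.normalize();
        if (!normalized.startsWith(monitoredRoot)) {
            return Optional.empty();
        }
        Path relative = monitoredRoot.relativize(normalized);
        if (relative.getNameCount() < 2) {
            return Optional.empty(); // no user directory above the file
        }
        String crystalClearUser = relative.getName(0).toString();
        return Optional.ofNullable(crystalClearToDimer.get(crystalClearUser))
                       .map(account -> "/staging/" + account);
    }
}
```

Files whose owner has no mapping are simply skipped in this sketch, which mirrors the need for each user to register their Crystal Clear username before capture can work for them.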

An email notification containing a link is sent to the user when a new dataset is detected. Upon clicking this link, the user is guided to enter the required metadata and publish the dataset. If the user does not follow the link, the dataset will not be published to DIMER: a reminder email is sent after 2 weeks and, if no action is taken, the harvested files are removed after a further 2 weeks. The status of harvest requests can be monitored by users, as shown in the figure below.
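The reminder and removal schedule can be expressed as a small decision function. The sketch below is illustrative only: `HarvestTimeline`, its `Action` values, and the choice to count both intervals in whole days from the detection date (14 days to the reminder, 28 days to removal) are assumptions, not DIMER's implementation.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/** Hypothetical schedule for a harvested dataset the user has not acted on. */
class HarvestTimeline {
    enum Action { WAIT, SEND_REMINDER, REMOVE_FILES }

    static final int REMINDER_AFTER_DAYS = 14; // reminder after 2 weeks
    static final int REMOVAL_AFTER_DAYS = 28;  // removal after a further 2 weeks

    /**
     * Decides what to do with a pending harvest on a given day.
     * Only meaningful while the user has taken no action on the dataset.
     */
    static Action nextAction(LocalDate detectedOn, LocalDate today, boolean reminderSent) {
        long ageInDays = ChronoUnit.DAYS.between(detectedOn, today);
        if (ageInDays >= REMOVAL_AFTER_DAYS) {
            return Action.REMOVE_FILES;
        }
        if (ageInDays >= REMINDER_AFTER_DAYS && !reminderSent) {
            return Action.SEND_REMINDER;
        }
        return Action.WAIT;
    }
}
```

A periodic job could call `nextAction` for each pending harvest and act on the result.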

Publishing Datasets

Researchers are encouraged to provide open access to their datasets and to cite their raw diffraction image datasets from journal articles. To facilitate this, DIMER provides the ability to publish datasets to UQ DataSpace and ANDS Research Data Australia. The records in UQ DataSpace and ANDS RDA are shown below; click each image to visit the website for that record. Metadata contained in these records includes: title, description, authors, link, data access and rights statement, keywords (e.g. Field of research, Socio-economic impact codes), and links to related publications (e.g. journal articles, PDB entries).

Source code and system architecture

The source code for DIMER is on SourceForge at http://sourceforge.net/projects/dimer/.

DIMER Deployment

The overall deployment environment for DIMER is shown below, with arrows indicating the flow of data. DIMER is deployed on server infrastructure within the UQ Institute for Molecular Bioscience (IMB). It is connected directly to the IMB mass-storage facility and, through the DIMER File Monitor component for automatic data capture, to the UQ ROCX facility (UQ ROCX is housed within the IMB). Users of DIMER may be researchers from the IMB, the School of Chemistry and Molecular Biosciences (SCMB), or other groups inside or outside UQ. Image files reach DIMER either automatically, via the DIMER File Monitor, or manually, via researchers uploading images obtained from other facilities, such as the Australian Synchrotron. The DIMER Web Application can be accessed from any Internet-connected device with a Web browser.

Use of the “S-drive” mass-storage facility within the IMB is vital due to the large volumes of data required to be stored: a single diffraction dataset of 360 images consumes over 6 GB, and thousands of datasets are generated each year by UQ researchers using UQ ROCX and other facilities. The S-drive facility provides for large storage volumes and for off-site backups. This fulfils a significant role of DIMER in improving the management of research data, ensuring that diffraction image datasets are not simply stored on portable disks or local workstations but in a backed-up, secure location.
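The storage figures above imply a rough per-image size and annual volume. The sketch below works through that arithmetic, treating 6 GB as a lower bound and using decimal units (1 GB = 1000 MB); the class name `StorageEstimate` and the figure of 1000 datasets per year used in the usage note are illustrative assumptions, not measurements.

```java
/** Back-of-envelope storage estimate from the figures quoted above. */
class StorageEstimate {
    static final double DATASET_GB = 6.0;       // "over 6 GB", so a lower bound
    static final int IMAGES_PER_DATASET = 360;

    /** Approximate size of one diffraction image in megabytes. */
    static double megabytesPerImage() {
        return DATASET_GB * 1000.0 / IMAGES_PER_DATASET;
    }

    /** Approximate yearly volume in terabytes for a given dataset count. */
    static double terabytesPerYear(int datasetsPerYear) {
        return datasetsPerYear * DATASET_GB / 1000.0;
    }
}
```

On these assumptions each image is about 17 MB, and `terabytesPerYear(1000)` gives at least 6 TB per year, which is well beyond what portable disks or local workstations can hold reliably.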

DIMER Software Components

The software components that make up DIMER are shown below. The DIMER Web Application is implemented in Java and is based predominantly on the Jersey JAX-RS REST framework and the Apache Jackrabbit Java Content Repository, both of which are freely available and run on all major operating systems.

Jackrabbit, in particular, is a good fit for this project: as opposed to a relational database containing tables, the Jackrabbit JCR implementation manages data as a tree of nodes (matching the projects – experiments – datasets – files model of DIMER) and provides support for node-based authentication and Lucene-indexed file storage out-of-the-box. Under the hood, Jackrabbit stores its data in a MySQL database, and has been configured in this project to store files on the IMB S-drive facility via a SAMBA mount.
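To show why a tree of nodes matches the projects, experiments, and datasets model, the sketch below builds such a hierarchy in plain Java. It is a stand-in for illustration and does not use the JCR API: `RepoNode` and the dataset name `dataset-1` are hypothetical, while the project and experiment names are taken from the example earlier on this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a content repository node (not the JCR API). */
class RepoNode {
    private final String name;
    private final RepoNode parent;
    private final Map<String, RepoNode> children = new LinkedHashMap<>();

    private RepoNode(String name, RepoNode parent) {
        this.name = name;
        this.parent = parent;
    }

    static RepoNode root() {
        return new RepoNode("", null);
    }

    /** Adds a child node and returns it, so calls can be chained downwards. */
    RepoNode addChild(String childName) {
        RepoNode child = new RepoNode(childName, this);
        children.put(childName, child);
        return child;
    }

    /** Absolute path of this node, e.g. /project/experiment/dataset. */
    String path() {
        if (parent == null) {
            return "/";
        }
        String parentPath = parent.path();
        return parentPath.equals("/") ? "/" + name : parentPath + "/" + name;
    }

    /** Walks a slash-separated relative path; returns null if any step is missing. */
    RepoNode resolve(String relativePath) {
        RepoNode current = this;
        for (String segment : relativePath.split("/")) {
            if (segment.isEmpty()) {
                continue;
            }
            current = current.children.get(segment);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}
```

For example, `RepoNode.root().addChild("PaDsba").addChild("SelenoMet PaDsba").addChild("dataset-1").path()` yields `/PaDsba/SelenoMet PaDsba/dataset-1`. Because each dataset sits beneath its experiment and project, access rules attached to a project node can naturally cover everything below it.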

The CCP4 Diffraction Image Library (a C++ library compiled and accessed using JNI) is used to extract metadata headers and JPEG images from raw diffraction image files. Alongside the DIMER Web Application is the File Monitor component, which is triggered using the Quartz Java Scheduler.

Users access DIMER through a Web browser and can upload files either by using the DIMER Uploader Java Applet, which is embedded within the website, or by connecting to the DIMER Web Application with a WebDAV client. DIMER provides a full implementation of the WebDAV protocol via the Jackrabbit WebDAV module.

The DIMER server components are deployed on a CentOS Linux server running Apache Tomcat.


This project is supported by the Australian National Data Service (ANDS). ANDS is supported by the Australian Government through the National Collaborative Research Infrastructure Strategy Program and the Education Investment Fund (EIF) Super Science Initiative.