The Phenomics Ontology Driven Data Management Project (PODD) is a National e-Research Architecture Taskforce (NeAT) project co-funded by ANDS and ARCS. The aim of the project is to develop data management solutions that meet the needs of researchers working at the Australian Plant Phenomics Facility (APPF) and the Australian Phenomics Network (APN). Both research communities need to gather and annotate data from high- and low-throughput phenotyping devices.
Gavin Kennedy is the Project Manager of the data repository project. The development team comprises Lead Developer Dr Yuan-Fang Li, Developers Faith Davies and Philip Wu, and two bioinformaticians, Dr James Eddes and Dr Kai Xu. James Eddes works on the bioinformatics for the plant phenomics clients at the APPF. Kai Xu is a mouse bioinformatician at the APN. The development team will define the data models and workflows for the phenotypic data to be captured in the repository, and will implement the repository backend together with its user and web services interfaces.
The image files generated by this research range from simple RGB to infrared and 3D and are typically quite large. A single investigation can generate tens of gigabytes. In a high-throughput environment, several terabytes are expected to be generated each year.
The bioinformaticians need to identify which metadata must be captured. Sufficient metadata must be generated to ensure that the data is organised, can be made accessible to researchers, and provides enough contextual information for experiments to be understood and reproduced. Metadata capture is done on a case-by-case basis. Mouse microscopy (histopathology) data is mostly generated automatically, and a parallel EIF project at the University of Melbourne is supporting this metadata capture.
Fedora Commons repository software is being used to manage, preserve and link data. Data is either left in situ on servers or deposited into a large data store. The objective of PODD is to collect and maintain the relevant data and store it in perpetuity. This means that data curation and data archiving activities will also need to be considered.
PODD is not only a data management tool but also a data publishing tool. PODD will support open access to data where possible, and we expect to engage with the phenomics community about their data sharing practices. A data management workshop is planned for early next year with a focus on data management for bioinformatics. Some of the issues to be explored are access sharing policies within the PODD repository, data usage and attribution of data sources. PODD will also support data discovery through the Atlas of Living Australia and ultimately through the Australian Research Data Commons.
Y-F. Li, G. Kennedy, F. Davies, P. Wu and J. Hunter, ''An ontology-centric architecture for extensible scientific data management systems'', Future Generation Computer Systems, In Press, Available online 8 July 2011.
Y-F. Li, G. Kennedy, F. Davies and J. Hunter, ''PODD - Towards An Extensible, Domain-agnostic Scientific Data Management System'', IEEE eScience 2010. Brisbane, Australia, December 8 - 10, 2010.
Y-F. Li, G. Kennedy, F. Davies and J. Hunter, ''Towards A Semantic & Domain-agnostic Scientific Data Management System'', Proceedings of the Workshop on Semantic Repositories for the Web (SERES-2010). ISWC, Shanghai, China, November 7, 2010.
Y-F. Li, G. Kennedy, F. Davies and J. Hunter, ''PODD: An Ontology-driven Data Repository for Collaborative Phenomics Research'', International Conference on Digital Libraries (ICADL). Gold Coast, Australia, June 21 - 25, 2010.