Data Integration and Annotation Services for Biodiversity
The DIAS-B project is a NeAT-funded effort to provide metadata and annotation services, primarily in support of the Atlas of Living Australia (ALA), that may also be reused by other Australian research initiatives. The development is being undertaken by teams within CSIRO and The University of Queensland. Having the ALA as an actual user with identified needs and requirements helps ensure that the outcome is of high quality and of use to a diverse range of communities with similar needs.
These pages describe DANNO, the Annotation Service and associated tools developed by the eResearch Unit at the University of Queensland for the DIAS-B project. The development team at UQ has been able to leverage more than ten developer-years of experience in on-line annotation and metadata services to build a set of open-source tools and facilities based on established standards and up-to-date technologies.
Further information and project updates are available on the DANNO blog.
The benefits of being able to electronically annotate on-line documents shared by communities of researchers and others have been well described and documented [add refs]. Through this mechanism, understanding can be enhanced and errors and omissions addressed. In devising the architecture for the ALA, it was recognized that a large amount of on-line biodiversity data exists which is valuable despite containing missing or incorrect values. A frequently encountered example is a sighting of native Australian flora or fauna recorded with a positive latitude value, placing the location somewhere in the ocean, north of the equator. If instances like this could be annotated to highlight the error, the biodiversity community in general would benefit. Further, if the annotations could be routinely harvested, categorized, and fed back to the owners of the collection, permanent corrections might be made, improving data quality through collective community wisdom. The ALA therefore decided to incorporate a service in their architecture which would give their users the opportunity to store comments on various kinds of data items. Examples include:
To understand the basic service, we will first consider the simple case where a small, closed community of users share a set of resource pages that are authored to provide an annotation capability. This permits the group to effectively insert annotations, replies to annotations, and replies to replies directly into the pages. As we shall see, no actual change to the HTML source files of the resources stored on the web server takes place, but to the user community, the integration of pages retrieved by a URL and any associated annotations and replies will appear largely transparent. In the following paragraphs, the term annotation should be read to include annotation replies, and replies to replies.
Links on the page and in the pop-ups provide Create, Update, and Delete functionality over annotations and replies. The user highlights page text, or selects an annotation, then clicks a link to open an HTML form retrieved from a web service on the host. When completed and submitted, the form builds an RDF/XML object from the user data, which is sent as an HTTP POST to the Annotation Server. An asynchronous AJAX callback automatically inserts the new annotation into the user's viewed page. Other users viewing the same page will be unaware of this, but their next visit to the page, or a page refresh, will re-fetch all annotations.
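The posted RDF/XML can be sketched as follows. The namespace URIs are those published by the Annotea schema, but the element layout, field values, and example URLs here are illustrative assumptions, not DANNO's precise wire format:

```python
import xml.etree.ElementTree as ET

# Namespace URIs from the Annotea schema and Dublin Core.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ANN = "http://www.w3.org/2000/10/annotation-ns#"
DC = "http://purl.org/dc/elements/1.1/"

def build_annotation(page_url, context, title, creator, body_url):
    """Assemble an RDF/XML annotation description as a string.

    context carries the XPointer locating the selection within the
    annotated page; body_url points at the stored annotation body.
    """
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("a", ANN)
    ET.register_namespace("dc", DC)
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF}}}Description")
    ET.SubElement(desc, f"{{{RDF}}}type",
                  {f"{{{RDF}}}resource": ANN + "Annotation"})
    ET.SubElement(desc, f"{{{ANN}}}annotates",
                  {f"{{{RDF}}}resource": page_url})
    ET.SubElement(desc, f"{{{ANN}}}context").text = context
    ET.SubElement(desc, f"{{{DC}}}title").text = title
    ET.SubElement(desc, f"{{{DC}}}creator").text = creator
    ET.SubElement(desc, f"{{{ANN}}}body",
                  {f"{{{RDF}}}resource": body_url})
    return ET.tostring(root, encoding="unicode")
```

A client would then POST this document to the server's annotation endpoint and insert the returned annotation into the viewed page.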
The choice of XPath/XPointer to locate annotations leverages an existing standard, but introduces a cross-browser problem because web browsers lack built-in support for this facility. This is compounded by differences between browsers in how they represent the formatted structure of a loaded page. Many use the standardized W3C Document Object Model (DOM), while others have taken a proprietary route. Firefox and Netscape (Mozilla-based), together with Safari and the new Google Chrome browser, follow the standard. Sadly, Microsoft's Internet Explorer (IE) employs a proprietary DOM which makes creating portable XPaths a rather heuristic process. A more robust mechanism would use easily located DOM element identifiers (a mechanism supported by all web browsers) to locate the annotation insertion points. However, this requires that the source document be authored with annotation in mind. Very few pages are authored this way, so the more flawed but more universally applicable approach must be taken.
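The core of the technique is generating a positional XPath by walking from the selected node up to the document root. A minimal sketch (using Python's DOM module rather than browser JavaScript, and ignoring the browser-specific quirks discussed above):

```python
from xml.dom import minidom

def xpath_for(node):
    """Build a simple positional XPath, e.g. /html[1]/body[1]/p[2],
    by walking from the node up to the document root and counting how
    many earlier siblings share each element's tag name."""
    steps = []
    while node is not None and node.nodeType == node.ELEMENT_NODE:
        pos = 1
        sib = node.previousSibling
        while sib is not None:
            if sib.nodeType == sib.ELEMENT_NODE and sib.tagName == node.tagName:
                pos += 1
            sib = sib.previousSibling
        steps.append(f"{node.tagName}[{pos}]")
        node = node.parentNode
    return "/" + "/".join(reversed(steps))

doc = minidom.parseString(
    "<html><body><p>first</p><p>second <b>bold</b></p></body></html>")
b = doc.getElementsByTagName("b")[0]
# xpath_for(b) → "/html[1]/body[1]/p[2]/b[1]"
```

In a browser the same walk is done over the live DOM, which is exactly where IE's proprietary representation causes the portability problems described above.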
In Utopia, the illustration above would apply to the entire World Wide Web. Any user would be able to annotate any page, and all users would have access to the annotations made by others. Because this is Utopia, all users would be benign and co-operative, so policing of who can do what to whom would not be an issue. But re-authoring every single page in existence to be "annotation-friendly" is just as impossible as expecting all users to play nicely together. So, taking a step back towards the pragmatic case, we can define some requirements:
This list is a big ask.
The User may now create annotations using another Bookmarklet which communicates with the Repeater host—perhaps acting as a proxy—to store, edit, or delete annotations. Asynchronous callbacks take care of inserting newly created annotations. As before, new annotations related to the resource currently being viewed which are created by other users cannot be automatically distributed; their insertion requires a refresh of the repeated page, thus causing a full reload of all annotations from the shared Annotation Service.
This simple scheme is not without problems. The Repeater is not a true proxy, so if the User follows a link to another page, it will be retrieved directly from the real host, requiring the User to again click their Repeater bookmarklet to view any associated annotations. Some pages will defeat the annotation display mechanism through complex use of Cascading Style Sheets (CSS). Pages which use a <frame>-based structure are not currently supported, and some Web 2.0 pages with sophisticated AJAX triggers may cause unpredictable behaviour. However, for the majority of straightforward HTML page structures, the prototype behaves quite well.
The Annotation Server and repository are based on the W3C Annotea Project. This was chosen because:
We have chosen to name the new implementation DANNO. The server is implemented in the Java ™ language using the Spring framework. Isolation layers enable key components to be replaced with relative ease. Currently, annotations are persisted in a MySQL database using the Jena RDF triplestore framework. The server fully and faithfully implements the Annotea protocol and has been benchmarked for throughput and scalability. An extensive test suite was built in conjunction with the server; it is included in the source distribution and integrated with the Maven build environment. The server supports harvesting (see below).
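Conceptually, persisting an annotation in a triplestore means decomposing it into (subject, predicate, object) statements keyed by the annotation's URI. A toy in-memory sketch of that idea (DANNO actually uses Jena over MySQL; the URIs and prefixed predicate names below are illustrative):

```python
class TripleStore:
    """Minimal in-memory triple store illustrating how an annotation
    is stored and queried as RDF statements."""

    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return triples matching the given pattern; None is a wildcard."""
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

store = TripleStore()
ann = "http://danno.example.org/annotation/1"  # hypothetical URI
store.add(ann, "a:annotates", "http://example.org/specimen/42")
store.add(ann, "dc:creator", "A. User")
# Find every annotation attached to the specimen page:
hits = store.match(p="a:annotates", o="http://example.org/specimen/42")
```

The isolation layers mentioned above mean the real persistence backend can be swapped without disturbing the Annotea protocol handling.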
The client code supports three levels of annotation granularity. The broadest scope is "whole of document", where the annotation mark and any associated replies appear as floating tags in the top left corner of the view. Moving the mouse pointer over the marker triggers a hover text box displaying the annotation title, creator, and text. At a finer level of granularity, the user may select the text to which the annotation refers. The marker tag is injected at the end of the selection, like a footnote, and scrolls with the page text. The full text of the selection will display in the hover popup. This selection process also applies to regions of images; in this case, the marker appears in the corner of a rectangle defining the selected region, and the hover popup displays the clipped image region and the user's annotation text.
DANNO implements the Open Archives Initiative's Protocol for Metadata Harvesting (OAI-PMH) through the open source OAICat web service. The implementation allows annotations to be harvested in the mandatory oai_dc format through an RDF-to-DC crosswalk, or in the native RDF format. The crosswalk references the Annotea URL for the annotation body rather than the body content itself; de-referencing this URL ensures the current body content is retrieved rather than a point-in-time snapshot of the item. In contrast, when harvested in the RDF format, the text of the annotation body is nested within the associated annotation or reply instance. This is better suited to harvesting for reporting purposes, where the data will be formatted for immediate display and refreshed periodically.
The current release implements resumption tokens. Harvesting by sets is not yet supported. A future release will superimpose multiple sets over the repository to allow, for example, all annotations for a base host URL to be harvested as a single set.
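From the harvester's side, resumption tokens drive a simple request loop: issue ListRecords, collect records, and repeat with the returned token until none is supplied. A sketch of the two helpers such a loop needs (the metadataPrefix value "rdf" for the native format, and the example base URL in the test, are assumptions):

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def list_records_url(base_url, token=None, prefix="rdf"):
    """Build a ListRecords request URL. After the first page, OAI-PMH
    requires resumptionToken to be the only argument besides the verb."""
    params = {"verb": "ListRecords"}
    if token:
        params["resumptionToken"] = token
    else:
        params["metadataPrefix"] = prefix
    return base_url + "?" + urllib.parse.urlencode(params)

def resumption_token(response_xml):
    """Extract the resumptionToken from a ListRecords response, or
    return None when the list is complete (token absent or empty)."""
    el = ET.fromstring(response_xml).find(f".//{OAI}resumptionToken")
    return el.text if el is not None and el.text else None
```

The harvest scripts described below would wrap these in a loop that fetches each URL, processes the records, and stops when `resumption_token` returns None.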
As well as the benefits conveyed by the annotations themselves, knowledge of which resources have the most annotations, which annotations have attracted the most replies, which page was most recently annotated, and so on, informs both web site administrators and other users of the topics the community at large is currently interested in. Among the simple tools provided with DANNO are scripts that can be scheduled for periodic execution. One performs an OAI-PMH harvest, placing summary data in an SQL table. A second script generates tabular reports from this data, formatting the results as an HTML page for use by administrators, who may choose to provide it to users of the service as well. This live link provides an example statistics page from a live DANNO server.
At this time, the DANNO Annotation Server and the associated Dannotate client components should be considered a pre-production release. The core functionality is complete, and sufficient information is provided in the project README files to allow the system to be configured and deployed from the binary Web Archive (WAR) file, or built from source. Future development will extend the currently rudimentary and optional security model to make use of emerging Australian Identity Servers, and will implement OAI-PMH sets using a query-based structure.