Overview

Data Integration and Annotation Services for Biodiversity

The DIAS-B project is a NeAT-funded effort to provide metadata and annotation services primarily in support of the Atlas of Living Australia (ALA), but which may be reused by other Australian research initiatives. The development is being undertaken by teams within CSIRO and The University of Queensland. Having the ALA as an actual user with identified needs and requirements will help ensure that the outcome is of a high quality that will be of use to a diverse range of communities with similar needs.

These pages describe DANNO, the Annotation Service, and the associated tools developed by the eResearch Unit at the University of Queensland for the DIAS-B project. The development team at UQ has been able to leverage more than ten developer-years of experience in on-line annotation and metadata services to build a set of open-source tools and facilities that are based on established standards and use up-to-date technologies.

Further information and project updates are available on the DANNO blog.

Scope

The benefits of being able to electronically annotate on-line documents shared by communities of researchers and others have been well described and documented [add refs]. Through this mechanism, understanding can be enhanced and errors and omissions addressed. In devising the architecture for the ALA, it was recognized that a large amount of on-line biodiversity data exists which is valuable despite having missing or incorrect values. A frequently encountered example is a sighting of native Australian flora or fauna whose geospatial coordinates have a positive latitude value, placing the location somewhere in the ocean, north of the equator. If it were possible for instances like this to be annotated to highlight the error, the biodiversity community in general would benefit. Further, if the annotations could be routinely harvested, categorized, and fed back to the owners of the collection, permanent corrections might be made, improving data quality through collective community wisdom. The ALA therefore decided to incorporate a service in their architecture which would give their users the opportunity to store comments on various kinds of data items. Examples include:

  • Plain text annotations providing comments or proposed corrections for any data item
  • Structured annotations proposing corrections for data items with well-known structures and formats
  • Annotations providing links to other data items or vocabulary terms
  • Responses from data providers or other users to any annotation
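The positive-latitude example above can be sketched as a simple check that produces a proposed structured correction. This is only an illustration: the `Sighting` record, its field names, and the shape of the proposed annotation are hypothetical and are not taken from the ALA or DANNO data models.

```python
from dataclasses import dataclass


@dataclass
class Sighting:
    """Hypothetical occurrence record; field names are assumptions."""
    record_id: str
    latitude: float
    longitude: float


def suspect_latitude(sighting: Sighting) -> bool:
    # Australian territory lies south of the equator, so a positive
    # latitude on an Australian record is almost certainly a sign error.
    return sighting.latitude > 0


def propose_correction(sighting: Sighting) -> dict:
    """Build a structured annotation proposing a sign flip (illustrative shape)."""
    return {
        "annotates": sighting.record_id,
        "field": "latitude",
        "current_latitude": sighting.latitude,
        "proposed_latitude": -sighting.latitude,
        "comment": "Latitude is north of the equator; the sign may be wrong.",
    }
```

A harvesting process could feed proposals like this back to the collection owner, who decides whether to apply the correction.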

Objectives

  1. Build a highly scalable, responsive, and robust Annotation Repository based on existing standards using modern, cross-platform, open-source technology. This service should be easily deployed by simple configuration, allowing it to be used by a diverse variety of communities.
  2. Include provisions for annotations held in the Repository to be harvested using OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting). As well as supporting general reporting and archiving, harvesting by sets will enable individual data owners to see and review entries pertaining to their data alone.
  3. Provide simple client-side annotation tool libraries that permit web site authors to integrate annotation support directly into their resources.
  4. Develop a means by which basic annotation services can be overlaid on existing web resources without the need for the underlying resource to be modified in any way.
  5. Strive for the lowest possible client common denominator. That is, make the client tools available for use with all major Web Browser technologies and design them such that, at the simplest level, a user does not need to perform any installation step that requires downloading or administrator privileges, or that exposes them to potential exploitation or harm.

Conceptual Architecture

The Simple Case

To understand the basic service, we will first consider the simple case where a small, closed community of users shares a set of resource pages that are authored to provide an annotation capability. This permits the group to effectively insert annotations, replies to annotations, and replies to replies directly into the pages. As we shall see, no actual change to the HTML source files of the resources stored on the web server takes place, but to the user community, the integration of pages retrieved by a URL and any associated annotations and replies will appear largely transparent. In the following paragraphs, the term annotation should be read to include annotation replies, and replies to replies.

In the simple case, a small community of users access resources provided by a shared Web Server. These pages are made "annotation-friendly" by referencing some common JavaScript code which is called by the page On-load trigger. This will fetch annotations and replies associated with the page URL from an Annotation Server running on the same host as the Web Server. This co-location is required to observe the restrictions placed by the Same Origin security model which is imposed by web browsers to prevent "man in the middle" type attacks. The script makes an HTTP request and receives an XML response from the Annotation Server using a synchronous AJAX call so no page reload takes place. The XML is parsed and XPath/XPointer strings used to locate the part of the page the annotation relates to. A marker is then inserted into the page at this point that allows the actual annotation to be viewed. There are many options for displaying the annotation text. In our prototype, this is done through the use of a hover-text pop-up window triggered by a mouse-over event at the inserted annotation marker. So although the page being viewed remains unchanged on the server, to the user, the annotations appear as part of the page.
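The parsing step described above can be sketched as follows. The sketch is in Python rather than the browser JavaScript the prototype uses, and it assumes the response is Annotea-style RDF/XML with one `rdf:Description` per annotation. The sample document, the example.org URLs, and the `parse_annotations` helper are illustrative assumptions, not the DANNO client code.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ANN = "http://www.w3.org/2000/10/annotation-ns#"  # Annotea annotation schema
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical server response for one annotation on one page.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:a="http://www.w3.org/2000/10/annotation-ns#"
    xmlns:d="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://annotations.example.org/Annotation/1">
    <a:annotates rdf:resource="http://www.example.org/page.html"/>
    <a:context>http://www.example.org/page.html#xpointer(/html[1]/body[1]/p[2])</a:context>
    <d:title>Latitude sign</d:title>
    <d:creator>A. User</d:creator>
    <a:body rdf:resource="http://annotations.example.org/Body/1"/>
  </rdf:Description>
</rdf:RDF>"""


def parse_annotations(rdf_xml):
    """Extract the fields a client needs to place and display each marker."""
    root = ET.fromstring(rdf_xml)
    results = []
    for desc in root.findall(f"{{{RDF}}}Description"):
        context = desc.findtext(f"{{{ANN}}}context", default="")
        # The context is the page URL plus an XPointer fragment.
        page_url, _, pointer = context.partition("#")
        body = desc.find(f"{{{ANN}}}body")
        results.append({
            "id": desc.get(f"{{{RDF}}}about"),
            "page": page_url,
            "pointer": pointer,
            "title": desc.findtext(f"{{{DC}}}title", default=""),
            "creator": desc.findtext(f"{{{DC}}}creator", default=""),
            "body_url": body.get(f"{{{RDF}}}resource") if body is not None else None,
        })
    return results
```

In the browser, the `pointer` value would then be resolved against the loaded page to find the element where the marker is inserted.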

Links on the page and in the pop-ups provide Create, Update, and Delete functionality over annotations and replies. The user highlights page text, or selects an annotation, then clicks a link to open an HTML form retrieved from a web service on the host. When completed and submitted, the form builds an RDF/XML object from the user data which is sent as an HTTP POST to the Annotation Server. An asynchronous AJAX callback automatically inserts the new annotation into the user's viewed page. Other users viewing the same page will be unaware of this, but their next view of the page, or a page refresh, will re-fetch all annotations.
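As a rough sketch of the create step, the following builds an Annotea-style RDF/XML description from form data and wraps it in a POST request. It is written in Python for illustration only. The function names, the server URL, the content type, and the choice to carry the body as a plain literal are simplifying assumptions, not the DANNO form code.

```python
import urllib.request
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ANN = "http://www.w3.org/2000/10/annotation-ns#"
DC = "http://purl.org/dc/elements/1.1/"

# Use readable prefixes when serializing.
ET.register_namespace("rdf", RDF)
ET.register_namespace("a", ANN)
ET.register_namespace("dc", DC)


def build_annotation(page_url, pointer, title, creator, text):
    """Serialize one annotation as RDF/XML (body kept inline for brevity)."""
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF}}}Description")
    ET.SubElement(desc, f"{{{RDF}}}type", {f"{{{RDF}}}resource": ANN + "Annotation"})
    ET.SubElement(desc, f"{{{ANN}}}annotates", {f"{{{RDF}}}resource": page_url})
    ET.SubElement(desc, f"{{{ANN}}}context").text = f"{page_url}#{pointer}"
    ET.SubElement(desc, f"{{{DC}}}title").text = title
    ET.SubElement(desc, f"{{{DC}}}creator").text = creator
    ET.SubElement(desc, f"{{{ANN}}}body").text = text
    return ET.tostring(root, encoding="unicode")


def make_post(server_url, rdf_xml):
    """Prepare (but do not send) the HTTP POST carrying the annotation."""
    return urllib.request.Request(
        server_url,
        data=rdf_xml.encode("utf-8"),
        headers={"Content-Type": "application/xml"},  # assumed content type
        method="POST",
    )
```

Passing the prepared request to `urllib.request.urlopen` would send it; the server's response would then drive the callback that inserts the new marker.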

The choice of XPath/XPointer to locate annotations leverages an existing standard, but introduces a cross-browser problem due to the lack of in-built support for this facility in web browsers. This is compounded by the differences between browsers in how they represent the formatted structure of a loaded page. Many use the standardized W3C Document Object Model (DOM), while others have taken a proprietary route. The Mozilla based browsers Firefox and Netscape, together with Safari and the new Google Chrome browser, follow the standard. Sadly, Microsoft's Internet Explorer (IE) employs a proprietary DOM which makes creating portable XPaths a rather heuristic process. A more robust mechanism would make use of easily located DOM element identifiers (a mechanism supported by all web browsers) to locate the annotation insertion points. However, this requires that the source document be authored with this in mind. Very few pages are authored this way, so the more flawed but more universally applicable approach must be taken.

The Ideal Case

In Utopia, the illustration above would apply to the entire World Wide Web. Any user would be able to annotate any page and all users would have access to the annotations made by others. Because this is Utopia, all users would be benign and co-operative, so policing of who can do what to whom would not be an issue. But re-authoring every single page in existence to be "annotation-friendly" is just as impossible as expecting all users to play nicely together. So taking a step back towards the pragmatic case, we can define some requirements:

  1. The system must provide a feature-rich annotation environment for pages that are created or modified to be annotation-friendly
  2. The system must also provide a way of merging annotations with arbitrary, existing web pages that does not require any change to the page source, nor the servers providing them
  3. Users should be able to create annotations for selected text or images, including selected regions of images, on the majority of web pages
  4. A level of security must be provided sufficient to control any sociopathic users who wish to offend or inconvenience others, or those who would abuse the system by using it to mount a denial of service attack
  5. The system should be sufficiently simple to use that anyone who understands rudimentary browser operation can create and view annotations without "reading the manual"
  6. The client side interface should not require any guru-level intervention for installation or configuration, neither should it introduce anything even the most paranoid system administrator would take objection to (we must assume however that client side scripting is enabled)
  7. As far as is practical, the service should function in a uniform, predictable, repeatable manner regardless of the host and client hardware/software environment and version, including the users' choice of browser (or lack of choice in some cases).

This list is a big ask.

The Pragmatic Approach

To merge the Simple Case approach with the Ideal Case requirements, the first challenge is injecting the JavaScript code and an on-load trigger into arbitrary web pages without making any change to those pages, nor the servers that provide them. This can be accomplished through a proxy-like repeater service.

The user needs a Bookmarklet that will request the page currently being viewed from a Repeater Service. Bookmarklets are easily installed by drag and drop (Firefox and others), or "add to favourites" (IE). This action is no more onerous than bookmarking a page and requires only that the browser have JavaScript execution enabled in order to function. On receipt of the request from the bookmarklet, the Repeater Service does an HTTP GET of the page from the host web server. It then adds links to the required JavaScript files and style sheets into the <head> section and an On-load call to the JavaScript code to the <body>. It also inserts a <base> element into the page head section which references the page host URL. This instructs browsers to retrieve any relative links for images etc from the actual page host, relieving the Repeater of the responsibility for supplying them. Somewhat surprisingly, this does not trigger a Same Origin violation. This modified page is returned to the User's browser.
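The rewriting performed by the Repeater can be sketched as a pure function over the fetched HTML. This is a minimal illustration under stated assumptions: the function name, the script and style sheet URLs, and the `loadAnnotations()` on-load call are hypothetical, and regular expressions stand in for whatever HTML handling the real service uses.

```python
import html
import re


def repeat_page(page_html, page_url, script_url, css_url,
                onload_call="loadAnnotations()"):
    """Return page_html with a <base>, annotation assets, and an on-load call added."""
    additions = (
        f'<base href="{html.escape(page_url, quote=True)}">'
        f'<link rel="stylesheet" href="{html.escape(css_url, quote=True)}">'
        f'<script src="{html.escape(script_url, quote=True)}"></script>'
    )
    # Insert right after the opening <head> tag so that <base> takes effect
    # before any relative link in the original head is resolved.
    page_html, found = re.subn(
        r"(<head\b[^>]*>)",
        lambda m: m.group(1) + additions,
        page_html, count=1, flags=re.IGNORECASE,
    )
    if not found:
        raise ValueError("page has no <head> element to extend")
    # Append an onload attribute to the opening <body> tag. A page that
    # already defines its own onload handler would need merging instead.
    page_html, _ = re.subn(
        r"<body\b([^>]*)>",
        lambda m: f'<body{m.group(1)} onload="{onload_call}">',
        page_html, count=1, flags=re.IGNORECASE,
    )
    return page_html
```

Because the inserted `<base>` changes how relative URLs resolve, the script and style sheet references must be absolute URLs pointing at the Repeater host.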

When the page loads, the trigger requests any associated annotations from the Repeater host using URLs set in the JavaScript code which has been loaded from a web server running on the Repeater host. This is valid because, to the User's browser, it appears that this is the host which supplied the page, so no Same Origin violation occurs. The Repeater might also host the Annotation Server, or alternatively, it may intercept these requests and proxy them to the actual server. Regardless, the annotations, if any, are retrieved and injected into the DOM by the User's browser in the same manner as we saw in the Simple Case above. If the User needs any indication that the page is being relayed, it is provided by the change in the URL visible in the address bar of the browser.

The User may now create annotations using another Bookmarklet which communicates with the Repeater host—perhaps acting as a proxy—to store, edit, or delete annotations. Asynchronous callbacks take care of inserting newly created annotations. As before, new annotations related to the resource currently being viewed which are created by other users cannot be automatically distributed; their insertion requires a refresh of the repeated page, thus causing a full reload of all annotations from the shared Annotation Service.

This simple scheme is not without problems. The Repeater is not a real Proxy, so if the User follows a link to another page, this will be retrieved direct from the real host, requiring the User to again click their Repeater bookmarklet to view any associated annotations. Some pages will defeat the annotation display mechanism through complex application of Cascading Style Sheets (CSS). Pages which make use of a <frame> based structure are not currently supported, and some Web 2.0 type pages with sophisticated AJAX triggers may cause unpredicted behaviour. However, for the majority of straightforward HTML page structures, the prototype behaves quite well.

Collaborative Annotation Service

The Annotation Server and repository are based on the W3C Annotea Project. This was chosen because:

  1. It is a simple, mature, established standard that can be extended as required for multi-media
  2. We have experience with it from earlier advanced prototype implementations
  3. Compatible third party, open-source, client-side annotation tools exist
  4. It is HTTP based, simplifying client-side requirements
  5. It can be secured to any degree required.

We have chosen to name the new implementation DANNO. The server is implemented in the Java™ language using the Spring framework. Isolation layers enable key components to be replaced with relative ease. Currently, annotations are persisted in a MySQL database using the Jena RDF triplestore framework. The server fully and faithfully implements the Annotea protocol and has been benchmarked for throughput and scalability. An extensive test suite was built in conjunction with the server. This is included in the source distribution and integrated with the Maven build environment. The server supports harvesting (see below).

Annotation Authoring Tools

As it is based on the W3C Annotea standard, DANNO may be used with a number of existing client-side annotation authoring tools. The most mature, open-source tool of this type is Annozilla. As implied by the name, this project provides annotation and reply authoring and viewing for Mozilla based Web Browsers such as Firefox and Netscape Navigator. Annozilla is provided as an add-on service to the browser and, as it essentially becomes part of the browser, it is able to ignore certain security provisions, provided the user chooses to accept the risk and has the authority to install such components on their computer.

For architecture and standards non-compliance reasons, Annozilla cannot be used with Microsoft's Internet Explorer (IE). This is unfortunate as, in the main, IE is still the most widely used Browser. The UQ eResearch Lab has developed an advanced prototype plug-in for IE which provides similar capabilities to Annozilla. While extremely successful as a research project, there are significant issues providing this facility as a general, prime-time service, not the least of which is the cost of maintaining the code against IE evolutions. And like Annozilla, installation of the IE side-bar plug-in requires the user to accept on good faith that, while they are installing a service capable of breaking sand-box security, it will not expose them to harm or loss through its ability to silently communicate with a third party host. Such communication could open them to a man in the middle security attack, which is precisely the reason browsers implement the Same Origin security policy. For this reason, the DIAS-B project set a goal at the outset of providing as rich an annotation authoring environment as is possible without the need for a user to install any special plug-in code.

While meeting this goal is not difficult on a per-browser basis, providing a browser-neutral facility is a significant technical challenge. Prototyping has so far yielded a reliable client-side JavaScript based approach which gives repeatable results in most cases on browsers which implement the W3C Document Object Model (DOM). Those tested are:

  • Firefox 2 and 3 (Linux, Mac and Win32)
  • Microsoft Internet Explorer 8
  • Safari 3 (Mac and Win32)
  • Google Chrome
  • Opera 9

Surveys and server logs show that of these, the most widely used is Internet Explorer (IE). This browser uses a proprietary DOM which has required extensive work-arounds in order to make it respond in the same way as the other, standards based browsers. The problem is compounded by the high prevalence of aged versions of IE, with some studies suggesting that IE6 is still the most widely used Internet browser on the planet. Due to extensive problems and quirks in the older versions, we have elected to support only IE8.

The call to the annotation creation code can be embedded as a hyperlink on a web page, or made permanently available as a bookmarklet which can be saved as a "Favourite" and called from there as required (see the Demo page). The authoring code defaults to a simple, relatively unstructured format which is nevertheless compliant with the basic Annotea RDF schema. More advanced, context driven, structured annotation creation forms, such as those which have been prototyped for the ALA, can be easily introduced without the need to modify the provided library code. This is seen as being of very high importance as the cost of maintenance is always the most significant expense, hence the ability to update or upgrade with minimal effort is a prime consideration.

The client code supports three levels of annotation granularity. The broadest scope is "whole of document" where the annotation mark and any associated replies appear as floating tags in the top left corner of the view. Moving the mouse pointer over the marker triggers a hover text box displaying the annotation title, creator, and text. At a finer level of granularity, the user may select the text to which the annotation refers. The marker tag is injected at the end of the selection like a "footnote" and scrolls with the page text. The full text of the selection will display in the hover popup. This selection process also applies to regions of images. In this case, the marker appears in the corner of a rectangle defining the selected region. The hover popup will display the clipped image region and the user's annotation text.

Harvesting Provision

DANNO implements the Open Archives Initiative's Protocol for Metadata Harvesting (OAI-PMH) through the open source OAICat web service. The implementation allows annotations to be harvested in the mandatory oai_dc format through a DC to RDF crosswalk, or in the native RDF format. The crosswalk references the Annotea URL for the annotation body rather than the body content itself. De-referencing this URL ensures the current body content is retrieved rather than a temporal snapshot of the item. In contrast, when harvested in the RDF format, the text of the annotation body is nested within the associated annotation or reply instance. This is more suited to harvesting for reporting purposes where the data will be formatted for immediate display and refreshed periodically.

The current release implements resumption tokens. Harvesting by sets is not yet supported. A future release will superimpose multiple sets over the repository to allow, for example, all annotations for a base host URL to be harvested as a single set.
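A harvester working against such a service follows the standard OAI-PMH pattern: issue a `ListRecords` request, then keep re-issuing it with the returned resumption token until the token comes back empty or absent. The sketch below shows that loop in Python. The base URL and the `fetch` callable are assumptions, with `fetch` injected so that the loop does not depend on a particular HTTP client.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI = "http://www.openarchives.org/OAI/2.0/"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request; resumptionToken is an exclusive argument."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml):
    """Return the resumption token, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{{{OAI}}}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


def harvest(base_url, fetch, metadata_prefix="oai_dc"):
    """Yield each response page; fetch(url) must return the response XML text."""
    token = None
    while True:
        page = fetch(list_records_url(base_url, metadata_prefix, token))
        yield page
        token = next_token(page)
        if token is None:
            break
```

The default prefix above requests Dublin Core. The prefix string that selects the native RDF format depends on the server configuration and is not assumed here.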

Reporting - Data Quality through Collaboration

As well as the benefits conveyed by the annotations themselves, knowledge of which resources have the most annotations, which annotations have attracted the most replies, which page was most recently annotated, etc, informs both web site administrators and other users of what topics the community at large is currently interested in. Among the simple tools provided with DANNO are scripts that can be scheduled for periodic execution. One performs an OAI-PMH harvest, placing summary data in an SQL table. A second script generates tabular reports from this data, formatting the results as an HTML page for use by administrators, who may choose to provide it to users of a service as well. This live link provides an example statistics page from a live DANNO server.
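The two-script arrangement can be illustrated with a small sketch that loads harvested summary rows into an SQL table and renders a "most annotated resources" report. It uses SQLite from the Python standard library for self-containment. The table name, the columns, and the report layout are hypothetical and do not describe the scripts shipped with DANNO.

```python
import html
import sqlite3


def load_summary(conn, rows):
    """Store (annotation_id, annotates, in_reply_to, created) summary rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annotation_summary ("
        "annotation_id TEXT PRIMARY KEY, annotates TEXT NOT NULL, "
        "in_reply_to TEXT, created TEXT NOT NULL)"
    )
    # Re-harvesting the same annotation replaces its earlier row.
    conn.executemany(
        "INSERT OR REPLACE INTO annotation_summary VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()


def most_annotated(conn, limit=10):
    """Count top-level annotations (not replies) per annotated resource."""
    cur = conn.execute(
        "SELECT annotates, COUNT(*) AS n FROM annotation_summary "
        "WHERE in_reply_to IS NULL GROUP BY annotates "
        "ORDER BY n DESC, annotates ASC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


def render_report(rows):
    """Format (resource, count) pairs as a simple HTML table."""
    cells = "".join(
        f"<tr><td>{html.escape(url)}</td><td>{count}</td></tr>"
        for url, count in rows
    )
    return f"<table><tr><th>Resource</th><th>Annotations</th></tr>{cells}</table>"
```

Similar queries over the same table could answer the other questions mentioned above, such as which annotations attracted the most replies or which page was annotated most recently.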

Future Directions

At this time, the DANNO Annotation Server and the associated Dannotate client components should be considered as a pre-production release. The core functionality is complete and sufficient information is provided in the project README files to allow it to be configured and deployed from the binary Web Archive (war) file, or built from the source. Future development will extend the currently rudimentary and optional security model to make use of emerging Australian Identity Servers and implement OAI-PMH sets using a query-based structure.


The DIAS-B project is funded from 2008 to 2011 by the National Collaborative Research Infrastructure Strategy (NCRIS) Platforms for Collaboration, through the National eResearch Architecture Taskforce (NeAT), and by the University of Queensland.
