Overview

The ANDS On-line GIS-Enabled Infrastructure for Spatially Integrated Social Science (SISS) project developed online tools for statistical modelling and visualising spatial relationships in socio-spatial datasets derived from Australian voting outcomes and census data.

Funding
Australian National Data Service (ANDS) Data Capture funding
Period
2010-2011
Structure
The Spatially Integrated Social Science project was a collaboration between the UQ eResearch Group and the UQ Queensland Centre for Population Research (QCPR).
Project team
Prof Jane Hunter, Project lead, UQ eResearch Group
Prof Bob Stimson, Project advisor, UQ School of Geography, Planning and Environmental Management
Prof Martin Bell, Project advisor, Queensland Centre for Population Research
Dr Paul Shyy, Data analyst, Queensland Centre for Population Research
Irfan Azeezullah, Software developer, Queensland Centre for Population Research
Friska Pambudi, Software developer, Queensland Centre for Population Research
Dr Nigel Ward, Project Manager, UQ eResearch Group

Significance

The field of Spatially Integrated Social Science (SISS) recognises that much of the data social scientists examine has an associated geographic location (for example, surveys may be associated with the geo-location of respondents). SISS systems use such geographic information as the basis both for integrating heterogeneous social science data sets and for visualising the results of analyses.

Building a SISS system, however, involves a number of time-consuming and highly skilled processes such as sourcing data sets, understanding and encoding relationships between data and geography, and implementing appropriate statistical analysis techniques. The UQ SISS project aimed to reduce this burden by building spatial and statistical analysis tools for social scientists investigating Australian demographic, socio-economic and voting data.

Goals

Throughout 2011 the SISS project focussed on developing online tools that allow researchers to quickly access rich Australian socio-spatial datasets related to voting outcomes and census data, conduct statistical modelling, and visualise spatial relationships between the results. The project aimed to:

  • Create a repository of data and variables derived from Australian Bureau of Statistics Census Data and Australian Electoral Commission voting data;
  • Develop a set of statistical analysis and geospatial visualisation services for analysing this data;
  • Expose the services through a Web portal interface that enables researchers to easily analyse the data;
  • Share RIF-CS research data collection descriptions of derived data via ANDS Research Data Australia (RDA).

Data

The project collected data from the 1996, 2001 and 2006 Australian Censuses of Population and Housing (sourced from the Australian Bureau of Statistics, ABS), and the 2010 Australian Federal Election (sourced from the Australian Electoral Commission). Our data analyst derived variables relevant to socio-spatial scientists from this data and mapped them to three levels of geography:

Variable                                               | Polling booth catchments | Statistical local areas | Local government areas
-------------------------------------------------------+--------------------------+-------------------------+------------------------
population & demographic structure (age groups)        | 2006 census              | 2006 census             | 2006 census
industry & occupation distributions                    | 2006 census              | 2006 census             | 2006 census
household structure / family type                      | 2006 census              | 2006 census             | 2006 census
birthplace, ethnic categories, & religious affiliation | 2006 census              | 2006 census             | 2006 census
housing types & tenure                                 | 2006 census              | 2006 census             | 2006 census
household income                                       | 2006 census              | 2006 census             | 2006 census
human capital indices                                  | 2006 census              | 1996-2001-2006 censuses | 1996-2001-2006 censuses
social capital index                                   | n/a                      | 2006 census             | 2006 census
shift-share of employment change                       | n/a                      | n/a                     | 1996-2001-2006 censuses
% primary votes                                        | 2010 federal election    | n/a                     | n/a
% two party preferred votes                            | 2010 federal election    | n/a                     | n/a
% primary vote for Coalition vs Labor                  | 2010 federal election    | n/a                     | n/a

Definitions for the Statistical Local Area and Local Government Area levels of geography come directly from the Australian Standard Geographical Classification. The Queensland Centre for Population Research (QCPR) created the Polling Booth Catchment level of geography by geo-coding polling booths (based on Australian Electoral Commission data) and spatially allocating each Census Collection District (from ABS) to its nearest polling booth location, forming polling booth catchments within each of the 150 Electoral Divisions.
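The nearest-booth allocation described above can be illustrated with a minimal Python sketch. It is not the QCPR procedure: it assumes each Census Collection District is represented by a single point (such as a centroid) in a projected coordinate system, uses straight-line distance, and ignores the constraint that catchments sit within an Electoral Division. The function name, identifiers and coordinates are made up for illustration.

```python
import math
from collections import defaultdict


def allocate_to_nearest_booth(districts, booths):
    """Group district ids under the id of their nearest polling booth.

    districts, booths: dicts mapping an id to an (x, y) point.
    Planar (Euclidean) distance is an assumption of this sketch.
    """
    catchments = defaultdict(list)
    for district_id, point in districts.items():
        # Pick the booth whose location is closest to this district's point.
        nearest = min(booths, key=lambda booth_id: math.dist(point, booths[booth_id]))
        catchments[nearest].append(district_id)
    return dict(catchments)


# Hypothetical coordinates, for illustration only.
booths = {"booth_a": (0.0, 0.0), "booth_b": (10.0, 0.0)}
districts = {"cd_1": (1.0, 1.0), "cd_2": (9.0, 2.0), "cd_3": (4.0, 0.0)}
catchments = allocate_to_nearest_booth(districts, booths)
```

Each resulting catchment is the set of districts that share a nearest booth; a production version would more likely do this inside the spatial database and restrict candidate booths to the district's own Electoral Division.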

Many of the datasets also include location quotients comparing each variable's local value against the national benchmark for that variable.
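The exact formula behind the project's location quotients is not given here. As a sketch, the conventional definition divides a variable's share of the local total by its share of the national total, so values above 1 indicate local over-representation. The function name and the figures below are illustrative assumptions.

```python
def location_quotient(local_count, local_total, national_count, national_total):
    """Conventional location quotient: local share divided by national share."""
    local_share = local_count / local_total
    national_share = national_count / national_total
    return local_share / national_share


# Hypothetical figures: 25% of local households against 12.5% nationally.
lq = location_quotient(250, 1_000, 1_250_000, 10_000_000)
```

In this made-up case the quotient is 2, meaning the category is twice as concentrated locally as in the national benchmark.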

SISS portals

The project developed and deployed portals for analysing and visualising the data at each of the three levels of geography.

Geospatial classification

Comparing classification of % primary vote for Coalition vs Labor (centroid) with an overlay of % low income households (polygon) in the Melbourne Ports electoral division
After choosing the appropriate portal, a researcher first navigates to a region of interest (for example, a particular electoral division). The system then allows the researcher to classify two variables of interest using the equal interval, quantile, or natural breaks approach. The interface shows the results of classification on a map by colouring either the centroid of a region or the polygon representing the region. This allows the researcher to compare two variables by representing one classification using the centroid and the other using the polygon.
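To make two of these classification approaches concrete, the following Python sketch computes class breaks for the equal interval and quantile methods and assigns a value to a class. It is an illustration only, not the portal's Java implementation; natural breaks is omitted, and the function names and sample values are assumptions.

```python
import bisect
import statistics


def equal_interval_breaks(values, n_classes):
    """Interior breaks that split the value range into equal-width classes."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    return [lo + width * i for i in range(1, n_classes)]


def quantile_breaks(values, n_classes):
    """Interior breaks that put roughly equal numbers of regions in each class."""
    return statistics.quantiles(values, n=n_classes, method="inclusive")


def classify(value, breaks):
    """0-based class index; a value equal to a break falls in the lower class."""
    return bisect.bisect_left(breaks, value)


# Hypothetical variable values for five regions.
values = [0, 10, 20, 30, 100]
ei_breaks = equal_interval_breaks(values, 4)
q_breaks = quantile_breaks(values, 4)
```

With a skewed variable like this one, equal interval breaks leave most regions in the lowest class, while quantile breaks spread them evenly, which is why the choice of method changes how a map reads.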

Once a researcher has finished their analysis, they can download a PDF representation of their generated classification results.

Statistical analysis
Plot showing goodness of fit of voting for the Labor party versus the Liberal or Greens Parties for selected variables in Melbourne Ports

Researchers can perform regression analysis on variables within the geographic region selected during the classification phase. The researcher firstly selects the dependent and independent variables for analysis. The system then sends the selected data to a regression analysis routine written in R. The researcher can see the results as both an adjusted R-Squared plot and as data statistics.

For example, the plot above shows the adjusted R² statistic for the percentage of the primary vote for the Labor Party, the Liberal Party and the Greens Party in the Melbourne Ports Electoral Division for low income, unemployed couples with children who identify as extractive/transformative industry workers with routine production occupations. This statistic measures the proportion of variability in the voting outcome that the selected variables explain, adjusted for the number of variables in the model. Here it shows a better goodness of fit for the Labor Party vote than for the Liberal and Greens Party votes. The results are also available as the textual statistical output of the regression analysis function. To enable further inspection of the results at a later stage, the system allows researchers to download a PDF representation of the statistical results.
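The adjusted R² statistic can be sketched as follows. The portals compute it in R; this Python version is only an illustration of the standard formula, 1 - (1 - R²)(n - 1)/(n - p - 1) for n observations and p predictors, applied to a one-predictor least-squares fit. The function names and data are made up.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def adjusted_r_squared(observed, predicted, n_predictors):
    """R-squared penalised for the number of predictors in the model."""
    n = len(observed)
    mean = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    r_squared = 1 - ss_res / ss_tot
    return 1 - (1 - r_squared) * (n - 1) / (n - n_predictors - 1)


# Hypothetical data: x is an explanatory variable, y a vote percentage.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
intercept, slope = fit_line(x, y)
predicted = [intercept + slope * value for value in x]
adj_r2 = adjusted_r_squared(y, predicted, n_predictors=1)
```

Because the adjustment penalises extra predictors, the statistic is a fairer basis than plain R² for comparing models of different parties' votes that use the same set of variables.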

Although the portals currently only support regression analysis, they have been designed to allow easy integration of other statistical analysis tools written in R.

Access to data and metadata

Description of the polling booth census data in Research Data Australia
Although the original census and voting data is freely available, the "joined" data, the derived data, and some of the region definitions contain significant intellectual property developed prior to this project which we did not have permission to share. Despite this restriction, the project aimed to allow open analysis of the underlying and derived data, and so provides:
  1. mediated access to the underlying data (researchers are encouraged to contact the project team to discuss access);
  2. open access to analysis tools through the web portals;
  3. publicity for the data by syndicating data descriptions to the UQ Data Collections Registry and ANDS Research Data Australia.

We feel this approach balances our obligation to protect the intellectual property of the research teams who created the derived data with promoting the data (through RDA) and allowing the socio-spatial research community to perform data analyses.

Source code and system architecture

The source code implementing the SISS portals is available on SourceForge under the GNU General Public License version 2 (GPL-2.0).

High-level system architecture for the SISS portals

Data variables and geographies are stored in a PostgreSQL database, extended with PostGIS support for geographic objects. The application uses JDBC to isolate the application code from the underlying database.

The database also stores metadata about each of the variables available for analysis. The system exposes this metadata as XML via a Tomcat-hosted Java servlet using the XStream library. The system uses jQuery to dynamically generate a web interface based on metadata and location results.
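As a rough illustration of exposing variable metadata as XML, the sketch below serialises a list of variable descriptions with Python's standard library. The portals do this in Java with XStream; the element names, attribute names and sample record here are hypothetical and do not reflect the project's actual schema.

```python
import xml.etree.ElementTree as ET


def variables_to_xml(variables):
    """Serialise variable metadata dicts into a simple XML document."""
    root = ET.Element("variables")  # hypothetical element names throughout
    for var in variables:
        node = ET.SubElement(root, "variable", id=var["id"])
        ET.SubElement(node, "label").text = var["label"]
        ET.SubElement(node, "source").text = var["source"]
    return ET.tostring(root, encoding="unicode")


# Made-up metadata record, for illustration only.
xml_text = variables_to_xml(
    [{"id": "low_income", "label": "% low income households", "source": "2006 census"}]
)
```

A browser-side script could then read such a document to populate variable pickers without hard-coding the list of variables in the interface.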

A Java servlet implements the data classification algorithms, serving XML output that is translated into a Styled Layer Descriptor (SLD) and displayed using a combination of the OpenLayers and GeoExt JavaScript libraries and Web Map Service (WMS) layers from GeoServer.
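The step from classification results to a Styled Layer Descriptor can be sketched as below. This is a simplified assumption of what such a translation might look like, going straight from class boundaries to SLD 1.0.0 rules with one polygon fill per class; the layer name, property name, boundaries and colours are made up, and the portals' intermediate XML format is not reproduced.

```python
import xml.etree.ElementTree as ET

SLD = "http://www.opengis.net/sld"
OGC = "http://www.opengis.net/ogc"
ET.register_namespace("sld", SLD)
ET.register_namespace("ogc", OGC)


def _sub(parent, ns, tag, text=None, **attrib):
    """Append a namespaced child element, optionally with text content."""
    node = ET.SubElement(parent, f"{{{ns}}}{tag}", attrib)
    if text is not None:
        node.text = str(text)
    return node


def classes_to_sld(layer_name, property_name, bounds, colours):
    """Build one SLD rule per class; bounds has len(colours) + 1 ascending edges."""
    root = ET.Element(f"{{{SLD}}}StyledLayerDescriptor", {"version": "1.0.0"})
    layer = _sub(root, SLD, "NamedLayer")
    _sub(layer, SLD, "Name", layer_name)
    style = _sub(_sub(layer, SLD, "UserStyle"), SLD, "FeatureTypeStyle")
    for lower, upper, colour in zip(bounds, bounds[1:], colours):
        rule = _sub(style, SLD, "Rule")
        # PropertyIsBetween is inclusive at both ends, so a value exactly on a
        # shared edge matches two rules; a real style would need to handle that.
        between = _sub(_sub(rule, OGC, "Filter"), OGC, "PropertyIsBetween")
        _sub(between, OGC, "PropertyName", property_name)
        _sub(_sub(between, OGC, "LowerBoundary"), OGC, "Literal", lower)
        _sub(_sub(between, OGC, "UpperBoundary"), OGC, "Literal", upper)
        fill = _sub(_sub(rule, SLD, "PolygonSymbolizer"), SLD, "Fill")
        _sub(fill, SLD, "CssParameter", colour, name="fill")
    return ET.tostring(root, encoding="unicode")


# Hypothetical layer, attribute, class edges and colours.
sld_text = classes_to_sld(
    "catchments", "pct_low_income", [0, 25, 50, 100], ["#ffffcc", "#fd8d3c", "#bd0026"]
)
```

A WMS server that accepts SLD can apply a document like this to colour regions by class; styling centroids instead of polygons would use a point symbolizer in place of the polygon symbolizer.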

The system calls R routines to perform statistical analysis of variables using an Rserve TCP/IP server. The R routines produce XML that can be used for both graphical and textual representations of the results. The XML results are rendered as graphs using the Processing.js browser-side library. The system generates a PDF representation of the results using the GeoServer MapFish Printing Module and the iText and Processing Java libraries.

The system currently supports an R regression analysis routine, but has been designed so that other R statistical analysis routines can be incorporated in the future.

The system exposes collection-level metadata about the variables used in each portal via a profile of the Atom Syndication Format. These descriptions are ingested into the UQ Data Collections Registry (dataspace.uq.edu.au), which then syndicates the descriptions to ANDS Research Data Australia as RIF-CS transmitted using OAI-PMH.
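For readers unfamiliar with Atom, the sketch below builds a bare Atom entry describing a data collection. It shows only the core elements the Atom format requires of an entry (id, title, updated) plus a summary; the additional elements defined by the registry's Atom profile are not covered, and the identifier, text and date are invented for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)


def collection_entry(entry_id, title, summary, updated):
    """Build a minimal Atom entry describing one data collection."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = updated.isoformat()
    ET.SubElement(entry, f"{{{ATOM}}}summary").text = summary
    return ET.tostring(entry, encoding="unicode")


# Hypothetical identifier, text and timestamp.
entry_xml = collection_entry(
    "urn:example:siss:polling-booth-catchments",
    "Polling booth catchment variables",
    "Census and voting variables mapped to polling booth catchments.",
    datetime(2011, 11, 6, tzinfo=timezone.utc),
)
```

A registry that harvests such entries can map them to its own record format before passing collection descriptions on to downstream services.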

Presentations and posters

N. Ward, T-K Shyy, S. Irfanullah, F. D. Ungkara, "Analysis and Visualisation Tools for Spatially Integrated Social Science", eResearch Australasia 2011, Melbourne, 6-10 Nov 2011

N. Ward, T-K Shyy, S. Irfanullah, F. D. Ungkara, "Spatially Integrated Social Science: 2010 Voting Data & 2006 Census Data Analysis Tool", unpublished poster for the eResearch Group booth at eResearch Australasia 2011, Melbourne, 6-10 Nov 2011

Future work

The University of Queensland is currently working on a project with the AURIN (Australian Urban Research Infrastructure Network) to expand and integrate the existing SISS infrastructure into the AURIN e-research infrastructure. The project aims to develop advanced Socio-spatial Statistical Analysis, Modeling and Visualisation tools for communities who analyse and interpret AURIN datasets. It will expand the tools developed within the ANDS SISS project, integrate them into the AURIN technical architecture and configure them to support data analysis within the following AURIN Lenses:

  1. Population and Demographic Futures and Benchmarked Social Indicators
  2. Economic Activity and Urban Labour Markets
  3. Urban Health, Well-being and Quality of Life


This project is supported by the Australian National Data Service (ANDS). ANDS is supported by the Australian Government through the National Collaborative Research Infrastructure Strategy Program and the Education Investment Fund (EIF) Super Science Initiative.