Informal talk presented at DSTC Symposium 11-12 July, 1996
A Power User in Cyberspace: A Database Perspective
Robert M. Colomb, Cooperative Research Centre for Distributed Systems Technology, Department of Computer Science and Electrical Engineering, The University of Queensland, QLD 4072, Australia
Abstract
To a power user, cyberspace has three parts: their own computing environment, the organizational computing environments in which the user participates because of their organizational relationships, and the rest of the world on the Net. This paper considers the database-oriented facilities required for this view of cyberspace to work smoothly, the present state of technology and some research issues.
- Introduction
- Personal Computing Environment
- Intraorganizational Computing Environment
- The Internet
- Complex interconnect problem
- Variety problem
- Collaboration problem
- Conclusion
Introduction
Cyberspace is the entire worldwide computing environment available to a person. The sort of person considered here is someone who understands the structure of the environment, who manipulates it regularly in a deep and creative way, and who has access to and use rights for substantial computing power including a workstation - in short, a power user.
It is useful to divide the computing environment into three parts:
- the user's own environment;
- the organizational computing environments in which the user participates by virtue of their organizational relationships;
- the rest of the world, accessed through the public network.
From a database perspective, the user's own environment consists of a terminal device, a large amount of persistent storage, and sufficient computing power. The persistent storage and computing power may reside either in a personal workstation or on a host computer. In the persistent storage, the user keeps a collection of various sorts of data objects, whose classifications and interrelationships are determined by the user's (possibly idiosyncratic) activities and view of the world. The workspace is equipped with an object management system which permits the user to employ the terminal to search the object store and to display selected contents in a variety of ways. It also includes facilities to manipulate various kinds of object. A user can change the internal structure of their own environment at will, limited mainly by the inertia of actually accomplishing change.
The organizational environment looks something like the personal environment, but on a larger scale: it consists of persistent object stores and computing resources of various kinds. The internal structure of the object stores is much more rigid, being determined by organizational means of obtaining consensus and therefore in principle more consistent, and available in a standard way to a large number of participating individuals. The structure may consist of islands of high coherence (e.g. particular information systems) held together with links reflecting intraorganizational formal communication paths and standardized messages. An individual user's connection into this environment is based on standardized messages and standard views of the object stores which are established through the individual's formal and informal organizational relationships. The structure of the organizational environment changes in a more-or-less disciplined way, with announcements to participants and concern shown for maintaining organizational coherence, such as access to legacy data.
In contrast, the rest of the world is chaotic. It consists of a very large number of personal and organizational environments, each of which has its own internal structure and a face it presents to the world. These sites are loosely coupled through some of the sites offering services which index or catalog other sites. A user connects to possibly several of these sites in a more or less regular way, possibly using stored procedures, and often will connect to a large number of sites casually during a resource discovery exercise. There are arrangements for notification of changes in structure from regularly accessed sites.
This world we have described has many failures of coherence and many difficulties in establishing and maintaining appropriate connections for a user, but these problems are not due to failures in the computing environment, but to the limited coherence and unstable nature of the society to which the user belongs. Ideally, the power user would be able to manipulate the three environments with the same set of tools.
People work today in environments which are embryonic versions of that described. Most of the problems encountered are problems with technology, particularly as that technology is integrated into the available system. The remainder of this paper looks at potential changes to this system, based partly on existing but not widely used or appreciated technological components, and partly on the outcome of research and standardization activities, existing or proposed.
Personal Computing Environment
A personal computing environment today generally has several gigabytes of disk, and is equipped with a number of personal productivity tools, like database, word processor, spreadsheet, graphics package etc. The objects stored include numerous word processing files, spreadsheet files, graphics files, and databases. A database is itself a persistent object store, the objects stored being traditionally very regular collections of very small elements (numbers, short text strings, dates). One of the advantages of the regularity of structure of a database's contents is that the database manager is able to have a variety of powerful query languages like SQL or QBE (Query by Example), and is able to visualize the results of queries in sophisticated ways. Since the elements stored are of such restricted type, however, the large number of more complex objects are stored not in a database, but in a file system managed by the operating system's directory handler.
More recent database systems, such as Microsoft's Access or Aldus's Fetch, are capable of storing a much greater variety of objects, using an object's native handler (such as a word processor or spreadsheet) when needed to display or modify it. Linkages between the various tools are mediated by operating system constructs such as Windows OLE or Macintosh Publish/Subscribe. Images and other multimedia objects can be stored. A Brisbane systems house, 2i Corporation, has for example implemented a multi-user office environment using Access. These more advanced databases permit a single overarching interface to the workstation's persistent object store, so that the user has a consistent method of finding and presenting sets of objects, no matter what their type.
The most advanced of these new database systems, such as Informix's Illustra, permit a tight coupling between the query language characteristic of the overarching database (SQL say) and the procedures characteristic of the individual object types, permitting selection of image objects by features of their content, for example. A workspace equipped with such a database system populated by a full set of methods for each class of object will be a very powerful tool for a power user to manage a personal persistent object store. This technology is rapidly maturing.
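The coupling between a query language and type-specific procedures can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in, not Illustra itself; the `objects` table, the sample rows and the `word_count` procedure are all hypothetical and chosen only to show a per-type method being called from inside an SQL predicate.

```python
import sqlite3

# Hypothetical content-based method for a "text document" object type:
# it counts the words in the stored body. A richer system would register
# image- or spreadsheet-specific procedures in the same way.
def word_count(body):
    return len(body.split())

conn = sqlite3.connect(":memory:")
# Make the Python procedure callable from SQL under the name word_count.
conn.create_function("word_count", 1, word_count)

conn.execute("CREATE TABLE objects (name TEXT, kind TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?)",
    [
        ("memo", "text", "budget meeting moved to Friday"),
        ("note", "text", "call Anna"),
        ("report", "text", "quarterly sales rose in every region this year"),
    ],
)

# The type-specific procedure appears directly in the query's predicate,
# so objects are selected by a feature of their content.
long_objects = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM objects WHERE word_count(body) > 4 ORDER BY name"
    )
]
print(long_objects)
```

The same pattern would apply to selecting images by content, given a procedure that extracts the relevant feature from an image object.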
Further advances will be necessary, providing research opportunities. Several such come to mind:
- Existing query languages are computationally incomplete, permitting great flexibility in dealing with populations of structured objects but not in dealing with the structure itself. They are one-dimensional. Spreadsheets, on the other hand, permit calculations in two dimensions. It would be useful to augment database query languages with spreadsheet capabilities. In addition, we need to be able to create and copy complex structures.
- Computational processes take collections of data as input and produce collections of data as output. If the collections of data are seen as databases then it makes sense to consider the computational processes as functions from databases to databases. Addition of functional capabilities to databases is therefore strongly desirable.
- Very complex structures are not well served by the basically relational schemes of databases. The advances considered so far are essentially extensions of the type of data which can be the value set of an attribute. The contents of the store are still essentially tables, some of whose cells can be very complex. Many large objects are usefully seen as having a much more large scale structure than a table, for example a graph. A software engineering environment might manage what amounts to a collection of word processing files, but the persistent store needs to be able to manipulate the various kinds of inter-module relationships. Hypertext and hypermedia systems generally are objects of this sort. It is possible to build such persistent stores on top of the fundamentally relational databases considered so far, with user interfaces developed from graph editors.
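As a rough illustration of the last point, a graph of inter-module relationships can be held in an ordinary table of edges and traversed with a recursive query. The sketch below assumes Python's built-in sqlite3 module and a hypothetical `depends_on` table; recursive queries belong to later SQL dialects than those discussed here, and this is only one possible encoding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Each row is one edge of the graph: `module` uses `uses`.
conn.execute("CREATE TABLE depends_on (module TEXT, uses TEXT)")
conn.executemany(
    "INSERT INTO depends_on VALUES (?, ?)",
    [("app", "ui"), ("app", "db"), ("ui", "widgets"), ("db", "storage")],
)

def reachable(start):
    """Return every module that `start` depends on, directly or indirectly."""
    rows = conn.execute(
        """
        WITH RECURSIVE closure(name) AS (
            SELECT uses FROM depends_on WHERE module = ?
            UNION
            SELECT d.uses FROM depends_on AS d
            JOIN closure AS c ON d.module = c.name
        )
        SELECT name FROM closure ORDER BY name
        """,
        (start,),
    )
    return [row[0] for row in rows]

print(reachable("app"))
```

Because the recursive step uses UNION rather than UNION ALL, duplicate nodes are discarded, so the traversal also terminates on graphs that contain cycles.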
Preliminary work has been done in each of these areas.
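The second item in the list above, treating a computational process as a function from databases to databases, can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the `sales` and `totals` tables and the `summarize` function are hypothetical.

```python
import sqlite3

def make_db(rows):
    """Build a small in-memory database of (region, amount) sales rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return db

def summarize(source):
    """A function from databases to databases.

    Reads the `sales` table of `source` and returns a new database holding
    one total per region. The source database is not modified.
    """
    result = sqlite3.connect(":memory:")
    result.execute("CREATE TABLE totals (region TEXT, total INTEGER)")
    per_region = source.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    result.executemany("INSERT INTO totals VALUES (?, ?)", per_region)
    return result

sales_db = make_db([("north", 10), ("south", 5), ("north", 7)])
totals_db = summarize(sales_db)
totals = totals_db.execute(
    "SELECT region, total FROM totals ORDER BY region"
).fetchall()
print(totals)
```

Since the output is itself a database, it can be queried, stored, or passed to a further function of the same kind.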
Intraorganizational Computing Environment
The sorts of objects available in the user's organizational environment are the same sorts as available in the personal workspace - databases and computational resources, but on a larger scale. They are managed by similar tools, but not typically the same tools.
Considerable progress has been made in making it possible for the user to employ the same software interfaces to the organizational resources as to their personal resources. A key feature of Microsoft's Access, for example, is its ability to form queries on an external database. For another example, spreadsheets have long been able to include SQL queries on remote databases. The sort of software which makes this connection possible is called middleware.
The scale of the organizational environment introduces new problems, however. First, a large organization typically has a large number (hundreds) of information systems supported by databases. These information systems were typically developed in isolation, and are only just being modified to communicate with each other without extraordinary difficulty. Intermediate structures such as multidatabase schemas and transactional workflows are being introduced to tie the environment together. These additional structures appear to the user as additional services. The Distributed Database Unit is working on these kinds of problems.
Routine use of intraorganizational resources is well-served by the existing infrastructure. The appropriate views and message interfaces are made available by the organization, supported by the middleware platforms. Ad hoc use is much more difficult. The scale of the services is so much greater than one person can build, or even comprehend, that to find the information in the system responding to a query can be very difficult. There is a need for resource discovery tools which take advantage of the documentation of the individual information systems and all the interconnection and overview information structures.
Sheer data volume is also a problem. Even though a workspace for a single person has persistent storage now measured in gigabytes, a large organization's data resources are measured in terabytes. Also, the size of the databases means that a typical ad hoc query might consume a large amount of computing resource in an unpredictable way, thereby interfering with the routine use of the information systems which operate the organization from day to day. It is therefore becoming more common to insulate the power user from the organization's information systems of record, and to confine ad hoc queries to a special data warehouse.
Both resource discovery (Andrew Goodchild) and data warehousing (Ralf Muhlberger, Anna Andrusiewicz) are being studied in PhD projects.
The Internet
Like the intraorganizational computing environment, the internet is outside the control of a single user. It also consists of a large number of different sorts of computing resources. However, it differs from a large intraorganizational environment in two main ways. First, but less important, is scale. An organization may have hundreds of information systems; the internet has hundreds of thousands of sites. Second, and critically, the internet is beyond organization: it is a society.
An organization may be large and diverse, but it has a more or less rational and more or less coherent structure, built, maintained and enforced by its management. A society is governed, not managed. The internet is so new and so rapidly growing that it has little government even. Its structure is, if not chaotic, at least incoherent.
This has several important consequences for the power user.
First, the power user does not have so good a coupling with the net as with their organization. An Access connection can allow a user to make very complex queries spanning a number of different databases. The same user's connection with the net is a web browser which allows simple navigation, retrieval of single objects of a wide variety of types, and the sending of simple messages via HTML forms. We will call this the complex interconnect problem, since we must interact with services which require exchange of complex structured data.
Second, the resources on the Net are different from and more varied than those in a typical organizational environment. The dominant sort of resource in an organization is a structured (SQL-style) database. On the Net, it is much more common to have electronic mail, news, hypermedia sites, ftp sites and text-style databases. Applications where a user can get some sort of service by filling in an electronic form are rapidly increasing (electronic commerce broadly speaking). We will call this the variety problem.
Third, it is difficult to do things involving structured coordination among a number of autonomous sites. There is no reason in principle why purchasing a house, say, should not be carried out electronically. Coordination among a number of decision support services for primary producers is an emerging issue. We will call this the collaboration problem.
Complex interconnect problem
Solving the complex interconnect problem requires high-level communication protocols which permit the exchange of structured data in dialogues which may span sessions. Perhaps the best known such protocol is Z39.50, which is designed to access text databases. An early version of Z39.50 is the protocol used by WAIS servers on the Net. It offers three kinds of facilities:
- introduction to the server and its data schemas, and the ability to pose queries and retrieve responses;
- management of intermediate results in the server, not the client;
- management of state across sessions, supporting for example alerters, which are queries on the stream of new documents coming into the server's database.
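The three kinds of facility listed above can be made concrete with a toy model. The class below is a hypothetical in-memory stand-in written in Python: it does not implement or speak Z39.50, and every name in it (`ToyTextServer`, `describe`, `search`, `refine`, `register_alerter` and so on) is invented for illustration only.

```python
class ToyTextServer:
    """Hypothetical stand-in for a text-database server (not Z39.50)."""

    def __init__(self):
        self.documents = []    # the text database
        self.result_sets = {}  # intermediate results, held on the server
        self.alerters = {}     # standing queries that outlive a session
        self.pending = {}      # new matches waiting for each alerter

    def describe(self):
        # Facility 1: the server introduces itself and its schema.
        return {"fields": ["text"], "query_types": ["all-words"]}

    @staticmethod
    def _matches(words, document):
        tokens = document.lower().split()
        return all(word in tokens for word in words)

    def search(self, name, words):
        # Facility 2: hits stay on the server; only a count is returned.
        self.result_sets[name] = [
            d for d in self.documents if self._matches(words, d)
        ]
        return len(self.result_sets[name])

    def refine(self, name, source, words):
        # Narrow a saved result set without shipping it to the client.
        self.result_sets[name] = [
            d for d in self.result_sets[source] if self._matches(words, d)
        ]
        return len(self.result_sets[name])

    def present(self, name):
        return list(self.result_sets[name])

    def register_alerter(self, alert_id, words):
        # Facility 3: a query on the stream of incoming documents.
        self.alerters[alert_id] = words
        self.pending[alert_id] = []

    def add_document(self, document):
        self.documents.append(document)
        for alert_id, words in self.alerters.items():
            if self._matches(words, document):
                self.pending[alert_id].append(document)

    def collect(self, alert_id):
        # Hand over the matches accumulated since the last collection.
        hits, self.pending[alert_id] = self.pending[alert_id], []
        return hits


server = ToyTextServer()
server.add_document("Distributed database systems")
server.add_document("Text database retrieval")
server.add_document("Distributed operating systems")

n = server.search("r1", ["database"])
m = server.refine("r2", "r1", ["distributed"])

server.register_alerter("a1", ["database"])
server.add_document("Object database design")
server.add_document("Network protocols")
alerts = server.collect("a1")
```

In this sketch the client sees only counts and named result sets until it asks for records with `present`, and the alerter keeps accumulating matches between sessions.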
The Resource Discovery Unit is working on Z39.50 technology. The Distributed Database Unit is working on the extension of Z39.50 technology to structured SQL databases (the ZINC project).
Making an SQL database available on an open network environment requires solution of technical problems analogous to those of making a text database available on that environment:
- the server must have a standard way of making its system catalogues available to the client, of agreeing on the query language dialects and types of data available, and a standard means to verify version compatibility between client and server.
- the query language must support the ability of the server to not only process a query but to retain it as the basis for further queries, in the same sort of way as managing saved result sets in text databases.
- the server must support a standard means to manage and account for software actions (such as triggers) originating at a client but residing at a server site.
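The first of these requirements, exposing the system catalogue and agreeing on dialect and version, might look roughly like the sketch below. It uses SQLite's own catalogue (via Python's built-in sqlite3 module) as a stand-in for a server's system catalogues. The version number, the dialect label and the `negotiate` function are hypothetical and are not taken from Z39.50 or from the ZINC design.

```python
import sqlite3

SERVER_PROTOCOL_VERSION = 3         # hypothetical version number
SERVER_DIALECTS = {"sql-92-entry"}  # hypothetical dialect label

def describe_catalogue(conn):
    """Return {table: [(column, declared type), ...]} for a SQLite database."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    catalogue = {}
    for (table,) in tables:
        # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalogue[table] = [(col[1], col[2]) for col in columns]
    return catalogue

def negotiate(client_version, client_dialects):
    """Return an agreed dialect, or None if client and server cannot agree."""
    if client_version != SERVER_PROTOCOL_VERSION:
        return None
    common = SERVER_DIALECTS & set(client_dialects)
    return sorted(common)[0] if common else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, room INTEGER)")

catalogue = describe_catalogue(conn)
agreed = negotiate(3, ["sql-92-entry", "vendor-x"])
refused = negotiate(2, ["sql-92-entry"])
```

A client holding the returned catalogue can build queries against the server's tables without any prior knowledge of its schema, which is the point of the requirement.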
All of these facilities are available in the Z39.50 protocol. The ZINC project is building an adaptation of that protocol to an open SQL environment. The work can be seen as an extension of the existing Z39.50-1995 (Version 3) protocol, uniting the advantages of SQL RDBMSs with those of Z39.50. It will result in a Net database access facility comparable to using Microsoft Access in an organizational computing environment. The power user will be able to use the same interfaces and facilities when accessing databases in their personal, organizational and Net environments.
Further development of the protocol will be towards more complex data types based on concepts such as the datablade of Illustra, and towards other complex data structures, in particular graph data structures. (A datablade is essentially an abstract data type packaged as a set of data structures and methods associated with a standardized type of complex object.)
Variety problem
To a large extent, the variety problem is similar to the analogous problem in the personal workstation environment. In the latter environment, the trend has been towards the use of database technology to store a variety of objects, and to record information about these objects to assist in their location and in the recording of their relationships. Since the problem in the Net environment is similar, it is plausible that similar technology can contribute to its solution.
A side effect of the ZINC project is that using the Z39.50 protocol a user will be able to query either a text database or a structured database using either a text-style boolean query or an SQL query. Naturally, using a boolean query a user will not be able to query a structured database so richly as with SQL, and a user will have to restrict the sorts of SQL statements directed to text databases, but they will be able to employ the same interface in both cases, say WAIS or Query by Example. This reduces variety.
Some services are usefully regarded as mostly databases. X.500 is an obvious example. News and mail are also close to being databases - most of the facilities offered by advanced mail handlers such as Pine are database functions. If these sorts of services were implemented behind Z39.50 servers, it would be possible to access them as databases, using the same interfaces. This also reduces variety.
Note that most of the facilities of Pine are database facilities which are conceptually in the user's workspace. A more radical implementation of mail would allow all the facilities relating to storage, retrieval and editing of historical collections of messages to be performed by the database and text handler facilities on the user's workstation. All the mail handler would have to do is accept a message for posting and deliver sets of new messages to the user's workstation. This is essentially the same as the database-valued functions taking databases as arguments described above in the personal workstation section. The Z39.50 protocol would still be required. A system along this line using INGRES as a workstation data manager has been built in a student project.
Following this line of reasoning, any service with a forms interface can be viewed as a database. For example, consider a service posting error reports to a help desk. The error report itself has a structure, represented by a form. A form is a standard sort of interface to a database, so a filled-in form can be viewed as a database record, or perhaps a collection of related records. If the help-desk service were able to present its schema to the user's workstation database manager, the user's own software could manage the data acquisition task for the service. Note that use of HTML forms is a step in this direction. The proposal would be for the service to be implemented behind a perhaps lightweight Z39.50 server.
Pursuing the analogy with mail a little further, note that most of the persistent storage requirements in the mail application are generated by the user's desire to keep categorized historical records of mail sent and received. It would be very likely that the user would in the same way wish to keep categorized historical records of help requests and the help desk's responses. The same would likely be true of many forms-oriented services. Implementing these services behind Z39.50 servers would reduce variety and allow the services to be integrated more tightly with the user's personal environment.
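The idea of a filled-in form as a database record, kept in the user's own store, can be sketched as follows. The published schema, the table name and both functions are hypothetical; the sketch uses Python's built-in sqlite3 module as the workstation database manager and says nothing about how the schema would actually be transmitted.

```python
import sqlite3

# Hypothetical schema a help-desk service might publish: field -> SQL type.
HELP_DESK_SCHEMA = {
    "reporter": "TEXT",
    "system": "TEXT",
    "description": "TEXT",
}

def create_local_table(conn, table, schema):
    """Create a local table whose columns mirror the published schema."""
    columns = ", ".join(f'"{name}" {kind}' for name, kind in schema.items())
    conn.execute(f'CREATE TABLE "{table}" ({columns})')

def file_form(conn, table, schema, form):
    """Check a filled-in form against the schema and store it as a record."""
    missing = [name for name in schema if name not in form]
    extra = [name for name in form if name not in schema]
    if missing or extra:
        raise ValueError(f"missing={missing} extra={extra}")
    names = ", ".join(f'"{name}"' for name in schema)
    marks = ", ".join("?" for _ in schema)
    conn.execute(
        f'INSERT INTO "{table}" ({names}) VALUES ({marks})',
        [form[name] for name in schema],
    )

conn = sqlite3.connect(":memory:")
create_local_table(conn, "help_requests", HELP_DESK_SCHEMA)
file_form(
    conn,
    "help_requests",
    HELP_DESK_SCHEMA,
    {"reporter": "rmc", "system": "mail", "description": "cannot send"},
)

# The user's categorized history is now an ordinary local query.
history = conn.execute(
    "SELECT reporter, description FROM help_requests WHERE system = 'mail'"
).fetchall()
```

Because the record lands in the user's own database, the history of requests can be searched and categorized with the same tools as the rest of the personal object store.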
Collaboration problem
In an intraorganizational computing environment, structured coordination among a number of applications residing perhaps on different computing systems can be accomplished by introducing an additional layer of software and an additional set of data structures, under the direction of the management of the relevant portion of the organization, and using its resources. Achievement of the infrastructure to support coordination therefore depends critically on the existence of a management structure commanding resources which spans the parts of the organization which need to cooperate to provide the coordination. This same management structure can ensure that the coordinating applications remain sufficiently stable so that the cooperation can continue.
On the Net, such a management structure does not exist - the sites are autonomous. The Net operates by agreements to follow standards. A complex interaction among a number of specific sites can be achieved by negotiation among those sites, provided that some way can be found to resource the necessary infrastructure and some way can be found to ensure sufficient stability in the applications being coordinated. An ad hoc complex interaction, such as purchase of a house, involves a number of initially unspecified sites, which are gradually recruited into the network as the interaction proceeds. This sort of interaction requires business-level standards such as those provided by the EDI (Electronic Data Interchange) community, in particular Open EDI.
Assuming that the problems of authentication, security and electronic funds transfer are solved, ad hoc coordination requires repositories of standards for standard commercial messages, and also database technology in the individual sites to manage the dossiers of messages accumulated during the course of an interaction, and workflow technology to keep track of the state of the interaction. This level of standardization of the content of messages is at a higher level than the interface standards offered by DCE and CORBA. A PhD project (Hung Wing) is investigating this area.
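A dossier of messages together with a record of the interaction's state might be modelled as in the sketch below. The message types, states and transition table are hypothetical and greatly simplified; they are not drawn from any EDI standard, and authentication, security and payment are assumed to be handled elsewhere.

```python
from dataclasses import dataclass, field

# Hypothetical transition table: (current state, message type) -> next state.
TRANSITIONS = {
    ("start", "offer"): "offered",
    ("offered", "acceptance"): "accepted",
    ("offered", "rejection"): "closed",
    ("accepted", "payment"): "paid",
    ("paid", "transfer"): "closed",
}

@dataclass
class Interaction:
    state: str = "start"
    dossier: list = field(default_factory=list)  # every message received

    def receive(self, sender, message_type, body=""):
        """File a message in the dossier and advance the workflow state."""
        key = (self.state, message_type)
        if key not in TRANSITIONS:
            raise ValueError(
                f"{message_type!r} not allowed in state {self.state!r}"
            )
        self.dossier.append((sender, message_type, body))
        self.state = TRANSITIONS[key]
        return self.state

    def participants(self):
        """Sites recruited into the interaction so far, in order of arrival."""
        seen = []
        for sender, _, _ in self.dossier:
            if sender not in seen:
                seen.append(sender)
        return seen

deal = Interaction()
deal.receive("buyer", "offer", "offer on the house")
deal.receive("seller", "acceptance")
deal.receive("bank", "payment")
```

The list of participants grows as messages arrive, which mirrors the way sites are gradually recruited into an ad hoc interaction, and the transition table plays the role that an agreed business-level message standard would play between autonomous sites.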
A further example of a collaboration problem is the third party value-added service. This kind of service is relevant to power users in two ways: first, such services can evolve from a power user's collection of interactions with Net sites which evolve into a service through being tied together by workstation macro scripts; and second, it will require third party services in what may be described as user space to bring together as useful services the raw net sites, which are in what may be described as resource space. Several net sites may need to be accessed in a coordinated fashion to perform a class of service. Value added services require agreements about stability of functionality of autonomous component sites, and standard means for communication of changes.
Conclusion
This note has argued that database and related information systems technologies are essential in developing the infrastructure needed to support a power user in an environment including both intraorganizational and open Net facilities. It has also indicated that many of the problems raised are or have been under study by members of the research group in the Distributed Database Unit and in the Department with which the author is associated.
