The Capture and Tracking of 'Pieces of Information': Necessary Requirements for 'Educational' and Rich Repurposing Architectures?

Paul Shabajee* and Dave Reynolds**

*Graduate School of Education and Institute for Learning and Research Technology (ILRT), University of Bristol, Bristol, UK Email: paul.shabajee@bristol.ac.uk

**Hewlett Packard Labs, Bristol, UK Email: der@hplb.hpl.hp.com

Contents

1. Introduction

2. Asset and Content Management

3. Basic Knowledge Management and ARKive Type Systems

4. Capturing, Tracking and Indexing of Pieces of information

5. Describing Pieces of Information - Descriptive Metadata & Knowledge Bases

6. Short Term Solutions and Implementation - Pieces of Information as Objects?

7. Flexible Metadata and RDF

8. Medium Term Solutions - use of explicit ontologies

9. A Comprehensive and Long Term Solution?

10. Conclusions

Acknowledgements

References

Notes

Abstract

Existing multimedia and educational re-purposing systems can provide effective means of making use of individual multimedia assets in multiple contexts. This paper argues that where very rich repurposing of assets and information is planned (e.g. educational resources aiming to meet the needs of a diverse range of users from different contexts and backgrounds), not only the use of media objects but also the 'pieces of information' used to produce the resources need to be 'indexed' and tracked. This requirement is based on the observation that a single 'piece of information' may be used in many different places and contexts, for example as a piece of text, re-worded in a video narration, or as a statistic in a graph.

The implications and requirements for systems that could provide the tools for the capture, retrieval and tracking of these 'pieces of information' are discussed, including the use of Semantic Web and Artificial Intelligence based systems and the existing barriers to the creation of robust and comprehensive solutions that meet the requirements identified. A short term means of capturing and retrieving pieces of information is introduced, which could meet the immediate needs of existing projects.

1. Introduction

The repurposing or reuse of individual digital multimedia assets in multiple contexts affords developers, users and funders of digitisation projects opportunities to make more effective and efficient use of digital multimedia resources for learning and teaching. For example, the same images may be used in on-line resources that have different educational goals, or that have similar goals but take a different pedagogical approach or are tailored to individuals with different levels of background knowledge or preferred learning styles.

The creation of computer based educational repurposing systems can be seen as central to the development of the effective and efficient use of digital learning resources, as evidenced by, for example, the X4L funding initiative in the UK (JISC 2002) and the development of a wide variety of CMS (Content Management Systems) (Mumford and Grout 2002) and LCMS (Learning Content Management Systems) (Brennan et al 2001). The ARKive-ERA (Educational Repurposing of Assets) (ARKive-ERA 2002) project was set up at the University of Bristol in January 2001, funded by Hewlett-Packard Labs. It aims to investigate the requirements for the design of the technological infrastructure for the creation of large on-line multimedia database systems for use in educational contexts, in particular the requirements related to the effective and efficient repurposing of individual assets to meet the needs of a diverse range of learners and educators.

The focus for the research activity of the ARKive-ERA project is the ARKive project, an initiative of the Wildscreen Trust (Wildscreen Trust, 2002). ARKive is a large Web based multimedia database focused on animal, plant and fungi species and their habitats. During its initial phase of development it will contain data profiling some 1500 globally endangered and native UK species and their habitats. This will take the form of approximately 9,000 minutes of digitised video and 30,000 still images along with hours of audio, maps, textual information and other supporting media and educational materials. The underlying data storage architecture is being developed by a research team based at HP Labs, Bristol (HP Labs 2002).

This paper reviews work done and findings related to the requirements for the authoring, production and maintenance of online multimedia learning resources designed explicitly with repurposing in mind, in particular the capture and management of the 'knowledge' that underlies the development of these resources.

2. Asset and Content Management

Multimedia object repurposing, where a single image may be used in many different places and in many different forms, and may be embedded in composite objects such as Web pages, is now commonplace in Content Management Systems (CMS - see Browning and Lowndes 2001) and Learning Content Management Systems (LCMS - see Brennan et al 2001). It is well supported by the underlying 'asset management' systems, variously called MAM (Multimedia Asset Management), DAM (Digital Asset Management) and RMAM (Rich Multimedia Asset Management) systems. The tracking of these objects can easily be carried out automatically via 'administrative metadata' or 'meta-metadata', that is, data about metadata (Gilliland-Swetland, 1998), e.g. who created the metadata, when, and whether it has been checked for accuracy (see below for more detail).

3. Basic Knowledge Management and ARKive Type Systems

The ARKive project is creating encyclopaedic information content and associated learning resources that utilise the digital multimedia assets stored in its database. The accuracy and currency of all data, and of all presentations of it, are extremely important to ARKive, as scientific criteria are fundamental to its preservation and educational aims and also central to its relationships with its partner organisations, such as conservation and bioscience organisations and specialists and wildlife media companies. This is equally true of many other similar projects, such as those working to digitise and make available museum, cultural and scientific collections. More generally, user surveys related to ARKive indicate that accuracy and currency are highly important for potential users.

Integral to the creation of the ARKive system and data is research activity that collects data related to the species and their habitats. This information is used to create the components of many multimedia assets such as text on Web pages, narration over video, captions for images, on-line tutorials, maps and in many cases the underlying metadata associated with a species, habitat or multimedia objects. Figure 1 shows an example where a piece of information 'no one has found a golden toad since 1989' may be re-used or repurposed in the ARKive system.

e.g. golden toad: 'no one has found a golden toad since 1989'

Figure 1 - example of the repurposing of a 'piece of information'

This highly simplified diagram illustrates the basic ideas behind the repurposing of pieces of information. In this case the piece of information, gained from the source 'document', is used to produce a number of media and 'information' objects, e.g. text in a database, graphic data, narration over video images, etc. These basic objects are then themselves reused in a number of higher level objects, e.g. various Web pages (perhaps focused on different aspects of the species, group of species, habitats or conservation stories), on-line tutorials, lecture presentations, downloadable files, etc. All of these and many others might be part of resources used directly by the primary organisation (e.g. ARKive in this case) or by third parties (e.g. a university lecturer/department as part of a 'virtual learning environment').

In general information or 'knowledge' has not traditionally been rigorously captured by rich media centric content or digital asset management systems. Once used, the original pieces of information tend to be left in research notes and not directly linked to the places where they have been 'used'. Exceptions include formal metadata elements and what might be called 'core data' (e.g. a piece of text, or data elements in a database that are used (as is) directly to produce output on a Web page) for example:

Data Element | Data Value
species text about conservation status for audience type 1 (general public) | "Unless there are other undiscovered populations of the golden toad in the more inaccessible parts of the Monteverde Preserve, then this species may be extinct. …"
latin.name | Bufo periglenes
size | approx 5cm
... | ...

Table 1 - examples of data held in database used to generate on-line content

Each of these pieces of data may be held in the database for each species. The data can be used to automatically and dynamically create any Web pages or other downloadable resources. Any change in the values of the data can be made to automatically update any resources that use or reference those data elements.
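The dynamic generation described above can be illustrated with a minimal sketch. It assumes a record holding two of the data elements from Table 1 and two hypothetical resources built from them; the template names, the underscore form of the field name and the revised size value are illustrative assumptions, not part of the ARKive system.

```python
from string import Template

# Hypothetical record mirroring two data elements from Table 1.
species_record = {
    "latin_name": "Bufo periglenes",
    "size": "approx 5cm",
}

# Two hypothetical resources that draw on the same data elements.
templates = {
    "species_page": Template("$latin_name (size: $size)"),
    "image_caption": Template("A golden toad, $latin_name"),
}


def render_all(record, templates):
    """Regenerate every resource from the current data values."""
    return {name: t.substitute(record) for name, t in templates.items()}


pages = render_all(species_record, templates)

# Changing one stored value and re-rendering updates every resource
# that references it (the new value is purely illustrative).
species_record["size"] = "about 5cm (revised)"
updated = render_all(species_record, templates)
```

Only the species page changes after the update, because the image caption does not reference the size element. This is the 'direct re-use' case; it does not cover information that has been re-worded or embedded in other media.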

Existing CMS (Content Management System) and DAM (Digital Asset Management) or RMAM (Rich Media Asset Management) tools, reviewed as part of the HP Labs based research to develop the ARKive media system (HP Labs 2002), do not explicitly provide the ability to capture pieces of information and track their use, other than in the case of direct re-use above.

The reasons for this are unclear. One possible reason is that, in practice, the 'information content' of existing on-line multimedia information resources is produced using workflows similar to those of print based media or broadcast documentary production, in which research is a complex but generally short term activity undertaken by a relatively small team. As yet it is also rare to find projects that aim to produce and communicate extensive subject domain information to a wide variety of users in a wide variety of ways, using different media. However, this situation is changing rapidly, for example with the much closer linking of TV, Web, paper based and educational course content production processes, as in the case of the BBC series Blue Planet (BBC 2002). Learning Content Management Systems (LCMS - see Brennan et al, 2001 for a brief review) do appear to offer more facilities for repurposing, and provide tools to author and manage more flexible production of learning experiences, and hence content, customised for the individual learner. However, those reviewed for this project appear to use traditional metadata and resource retrieval approaches and technologies.

It is important to note that metadata, as currently implemented in metadata standards, e.g. Dublin Core (Dublin Core Metadata Initiative, 2002a) and IEEE LOM (Learning Object Metadata) (IEEE LTSC 2002), is a sub-set of pieces of information as the term is used here. Metadata standards provide a standardised way of representing data about data, whilst pieces of information may or may not be in a standard form and may or may not be about anything held within the system. Most fundamentally, pieces of information can stand alone in the system whilst metadata by definition cannot.

If there is little or no direct linking to places where a piece of information is used, this causes a number of significant problems and issues for any system aiming to implement complex repurposing such as ARKive. For example:

3.1 During the Capture & Authoring Stages

a) In the ARKive accessioning and content production workflow, different researchers may work on the creation of any particular resource. If authors find conflicting pieces of information while conducting research, they need to be able to trace the source of the original piece of information so that they can start a process of comparison and validation. Basic prose research notes can help if they include references to sources; however, such notes are not in general searchable in anything but a free-text manner. Given that the original notes will have a particular focus (e.g. a species), they will not necessarily be accessible if information on a habitat or particular behaviour is sought.

b) As part of the quality control processes, subject experts may review and audit the accuracy and other aspects of the data presented on the ARKive website or in any particular multimedia object or composite resource. If they have doubts about the validity of a particular piece of information presented on the site or in other media, and it is not linked to the original source, then the expert will have to conduct their own research to validate it.

c) If an author (of any kind of multimedia resource) is producing a resource focused on a particular species, group of species, type of behaviour or habitat, it is useful to have all relevant information to hand in a searchable database along with the source of the pieces of information so that they can ensure for example, that the level of scientific accuracy is appropriate to the intended audience along with any necessary qualifications e.g. "in 199? the 'x' survey conducted by 'a' found that 'y' was 'z'.", rather than simply "'y' is 'z'."

3.2 System Maintenance

d) The tracking of any Intellectual Property (IP), rights management and citation issues related to the pieces of information will be very problematic. If, however, the source and use of any pieces of information are linked, such tracking becomes trivial.

e) If any piece of information is found to have changed or to be inaccurate, all assets that utilise that piece of information need to be updated. Without tracking of the 'use' of the original pieces of information it may be impossible to locate and update all the multimedia resources. This is particularly the case for time-based media, where manually finding the point at which a piece of information is used may involve watching or listening to a whole 'programme'.

f) Even if it were possible to find all the resources that utilised the piece of information by 'hand', the updating of resources could not be automated (e.g. in the case of direct use of text) or semi-automated (e.g. in the case where the piece of information is embedded in another object such as an audio file).

g) As new pieces of information are found they need to be captured and indexed so that they become available for any future authoring. An added advantage of this would be that, as new information is indexed, the system could automatically highlight resources that may need updating to accommodate the new information.

This list is not exhaustive and there may be many other issues. As these examples show, an ARKive type system that makes full use of the ability to present information in different modes and contexts will have significant resource, authoring and maintenance overheads and problems if it is not able to efficiently retrieve and track the use of the original pieces of information.
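The maintenance issues in (e) and (f) suggest a simple usage index linking each piece of information to the assets that use it. The following is a minimal sketch under stated assumptions: the class names, identifiers, asset paths and the 95 second offset are all hypothetical, and a real system would hold this in a database rather than in memory.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Usage:
    asset_id: str                          # hypothetical asset identifier
    form: str                              # e.g. "text", "narration", "graph"
    offset_seconds: Optional[int] = None   # position within time-based media


class UsageIndex:
    """Maps a piece of information to every place it has been used."""

    def __init__(self):
        self._uses = defaultdict(list)

    def record(self, info_id, usage):
        self._uses[info_id].append(usage)

    def affected_assets(self, info_id):
        # .get avoids creating an empty entry for unknown identifiers.
        return list(self._uses.get(info_id, []))


index = UsageIndex()
index.record("info:golden-toad-last-seen",
             Usage("web/species/golden-toad", "text"))
index.record("info:golden-toad-last-seen",
             Usage("video/clip-0042", "narration", offset_seconds=95))

# If the piece of information changes, list everything needing attention.
to_review = index.affected_assets("info:golden-toad-last-seen")
automatic = [u for u in to_review if u.form == "text"]   # direct text use
manual = [u for u in to_review if u.form != "text"]      # embedded use
```

Recording an offset for time-based media means an editor can go straight to the relevant point in a narration instead of reviewing the whole programme.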

4. Capturing, Tracking and Indexing of Pieces of Information

The capture and tracking of pieces of information is not new, for example the references at the end of this document are examples of such tracking; the quote or statement in the text is the 'piece of information' and the citation is the link to the reference which is the source to which it is attributed by the author.

However, tracking alone is not sufficient. For example, in the case of references and citations, as researchers we may have hundreds or thousands of references on different aspects of our research subject domains. It quickly becomes necessary to classify references so that we can find and retrieve them when we wish to author a new paper or report, or refer to a particular methodology. Without such classification (or possibly the ability to search the original source documents directly, using sophisticated search tools) it quickly becomes impossible to locate references (i.e. resources) effectively or efficiently.

In the case of a large and information rich multimedia repurposing architecture, however, the tracking of pieces of information is somewhat more complex, especially in the case of time-based media, e.g. a piece of information may be used to script a narration of a video clip.

In a project such as ARKive there will be many millions of pieces of information, most often embedded in larger information objects such as books, videos and Web sites. Each needs to be scientifically validated, or at minimum linked to its source, and each may be used in many different contexts. For example, a species may exhibit a particular behaviour; that piece of information may be used to produce text about the species, a text article about that kind of behaviour, or material about a superset of that type of species. An author producing any new resource would want to be able to call up all the pieces of information about the topic(s) on which they must author materials.

This type of retrieval is based on the content of the media, i.e. the subject matter. There are many practical and theoretical issues over exactly how to produce and manage such descriptive metadata (Shabajee 2002 and Shabajee et al 2002), and there are other types of metadata (see Gilliland-Swetland, 1998) that would be required to keep effective track of the pieces of information, e.g. administrative, preservation, technical and use. Existing standards in these areas are likely to meet the tracking needs (see for example OCLC/RLG Working Group on Preservation Metadata 2001 and Digital Library Federation 2002), and complementary developments such as the ABC Ontology developed by Lagoze and Hunter (2002) provide a very interesting means of representing and tracing many aspects of 'change' and evolution of the use of an object over time.

One serious resource issue that leads on from these approaches is that there is a high initial overhead in entering and indexing the pieces of information, especially when compared to traditional systems. However, this is the same for any kind of metadata: it is a matter of balancing short term investment against likely long term benefit and savings in maintenance, support and development costs. Further, if pieces of information are viewed as important 'objects' necessary to ensure that the collection can fulfil its full potential, then the costs of their accessioning, indexing and storage would be built into the initial project planning, as is common practice now with 'traditional' metadata (see for example NOF technical standards and guidelines, UKOLN 2002). Of course, the basic linking of a piece of information to its original source is essential to the integrity of any information system. As we discuss below, evolving technologies can help support the accessioning process, hopefully making the process more efficient.

As Shabajee et al (2002) point out, direct indexing via metadata is likely to be only one of a set of tools used to support information retrieval. Free text searching, concept extraction (from text or audio files via voice recognition), Content Based Image Retrieval (CBIR) and other means will probably need to be combined to provide a suite of tools to support truly comprehensive retrieval of any kind of information.

5. Describing Pieces of Information - Descriptive Metadata & Knowledge Bases

One critical question with regard to any implementation of the capture, re-purposing and tracking system outlined above is whether descriptive metadata would be sufficient, with respect to content description, to enable effective and efficient capture, re-purposing and tracking of the pieces of information. If so, a further question is whether that metadata is the same as, a sub-set of, or a super-set of that used to describe other content in a system. Below is a brief review of key issues in the use of descriptive metadata and its limitations in re-purposing.

Vocabularies for descriptive metadata used to catalogue and index are critical to effective repurposing, and there are many existing descriptive metadata standards focused on particular subject domains (Shabajee et al 2002). The use of existing external standards has many advantages, in particular that of interoperability, i.e. the ability to inter-operate with other information systems and tools which are 'aware' of the metadata standard(s) used. Thus the publication of meaningful metadata provides the basis for effective third party services, e.g. cross searching with other related resources and data aggregation. Many such technologies and services have been and are being developed; the Open Archives Initiative (Open Archives Initiative, 2002) is a good example of this type of technology and approach.

However in many cases a single descriptive metadata standard will not be sufficient for the needs of a particular project, in which case an application profile approach (Heery and Patel 2000) can provide a means of balancing the needs for effective rich descriptive vocabularies and interoperability, by utilising terms from a number of external standards.

Early experimentation with developing a comprehensive descriptive metadata vocabulary for ARKive led to the realisation that existing external metadata standards and thesauri would not be sufficient to meet the needs of the project or its potential users. For example, a small survey of university lecturers identified the desire to be able to locate media related to particular animal behaviours of interest to a small specialist community who might use a very specialist vocabulary. Not only would it be impractical in time and resource terms to index media (especially time-based media) to that level of detail, but only the specialists themselves would be able or qualified to accurately identify the particular behaviours for classification. There is a direct link here with the idea and potential benefits of enabling, and providing systems and tools to support, specialist communities of users to annotate and thus add value to media with terms or pieces of information from their own vocabularies and domains (Shabajee et al 2002).

The issues related to the creation of appropriate vocabularies, the relationships between them and their 'meaning' or semantics, are complex and the implementation yet more problematic, in particular creating extensible vocabularies or ontologies (see below). However it is clear from the study of the ARKive authoring processes that as content based research is conducted, new 'terms' and concepts are identified that would usefully be used to 'index' (i.e. enable retrieval of) the pieces of information and media objects for use as sources to author new materials. Thus the information research and creation of the descriptive vocabularies to describe the pieces of information are cyclic and iterative (see below for more detail).

6. Short Term Solutions and Implementation - Pieces of Information as Objects?

In the short term it may be that simple structured metadata vocabularies could provide a basis for the indexing of pieces of information and, with simple query tools, the ability to locate them. One way to do this at a basic level would be to treat pieces of information in essentially the same way as any multimedia object. Pieces of information are in many ways similar to other types of media object; this can be illustrated by studying some of their characteristics that mirror those of media objects.

1) They must be accessioned or sourced (i.e. located and obtained) and the sources documented.

2) They can be 'digitised', insofar as they can be written, drawn or represented in some kind of media, either digital or that can be digitised.

3) They are used in many different places, in many forms and 'embedded' in many different objects.

4) Re-use may require that the piece of information is manipulated in some way, e.g. a textual piece of information may be translated from one language to another, reworded to suit different target audiences, or turned into graphical or audio data.

5) They may, as with cropped images or time based media clips, be part of a larger object, e.g. a sentence in a text document or a line on a graph.

6) They must have metadata linked to them in order to facilitate retrieval from the system/database i.e. it is not viable to simply enter facts into a database without some categorisation e.g. this is about x, or this was created by 'abc'. …

7) If the management and maintenance of data is to be automated or semi-automated the use of pieces of information needs to be tracked in any large information rich repurposing system, e.g. ARKive.

In the case of reuse or repurposing (see 3. above), the way in which pieces of information are 'used' is qualitatively different to that of other media objects i.e. an original piece of information may be in the form of a graph and the actual representation that a user sees might be a table of data or a piece of prose text. Whereas in the case of an image, even if it is manipulated, it is some part (albeit possibly unrecognisable) of the actual image that is eventually shown to the user. However that difference is not significant with regard to tools used to describe or track the use of multimedia objects.

Broadly speaking the metadata elements (see 4, above) that can be applied to the pieces of information are the same as those of other multimedia objects, with the possible exception of descriptive metadata as discussed above - a combination of existing external standards and static or extensible project specific descriptive vocabularies might provide a workable basis for these (see below for a more comprehensive and flexible approach).

As with other media, some metadata elements are not appropriate, e.g. 'colour management' is appropriate for images but not for audio files or pieces of information. The precise metadata required to accession, track and maintain pieces of information depends on the application. If pieces of information can be treated in the same way as other multimedia objects, the problem of 'representation' will be similar to that for other objects. Indeed, if the above approach is taken, the pieces of information are themselves other types of media or pointers to them, e.g. a piece of text, a graph or a URL (Uniform Resource Locator) pointing to a resource. This is complicated by the need to link directly to (in general) a particular part of an object, e.g. a sentence in a larger text document.

There might be no need to create or add new technology to existing CMS and repurposing architectures. It may even be possible to use existing systems. For example:

By creating a metadata element for 'pieces of information' with values equal to the text of a piece of information or pointer to another resource and then using existing descriptive and other metadata elements to describe it - the only requirement is that the system allows the application of all types of metadata to other pieces of metadata. Such an approach would enable implementation of the capture of pieces of information and tracking with existing systems as is.
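As a rough illustration of treating a piece of information as an object with its own metadata, the sketch below stores the content (or a pointer to it), its source, and descriptive terms, and offers a simple query. The class, field and function names are assumptions made for this example and do not describe any existing CMS; the source string is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class PieceOfInformation:
    info_id: str
    content: str            # the text itself, or a pointer (URL) to a resource
    source: str             # where the information was obtained
    source_part: str = ""   # part of the source, e.g. a page or sentence
    metadata: dict = field(default_factory=dict)  # element -> list of terms


def find(pieces, **criteria):
    """Return pieces whose metadata contains every requested term."""
    return [
        p for p in pieces
        if all(term in p.metadata.get(element, ())
               for element, term in criteria.items())
    ]


pieces = [
    PieceOfInformation(
        info_id="info-1",
        content="no one has found a golden toad since 1989",
        source="(placeholder for the source document)",
        metadata={
            "subject": ["golden toad", "conservation status"],
            "created_by": ["abc"],
        },
    ),
]

hits = find(pieces, subject="golden toad")
misses = find(pieces, subject="island habitats")
```

Because the descriptive terms are held separately from the content, the same query tools used for media objects could in principle be applied to pieces of information.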

Another approach could be to produce all the pieces of information as individual 'documents' or objects, which are then treated as other media objects. However, this would require an authoring environment for the pieces of information, either as part of the existing system or independent of it. Also, extracting a piece of information from a larger object (e.g. a text document) constitutes a significant time and resource overhead when compared to a system of annotating parts of the original object.

However such a solution is likely to be problematic to use in practice, since existing interfaces are not in general designed explicitly to support the retrieval of metadata (piece of information) via descriptive or other metadata, which is then applied as metadata to a media or text object or part of an object. It may be possible to customise some existing systems to do this however.

A more bottom up approach might be to design a system that provides such tools as an integral element of the user interface. As with any existing system, different classes of users are likely to require different 'views' onto the data. For example, a researcher conducting primary research will require an interface tailored to collecting, indexing and annotating new pieces of information and to marking and logging the sources and the particular part of the source object, while an educational content author will need an interface which affords easy retrieval of the pieces of information and a different, educational content focused, authoring environment.

The integration of these tools with existing MAM and CMS systems could provide the necessary functionality and ease of use for the initial capture of the pieces of information. However the design of this and other interfaces will depend on the particular tasks, class of users, environments and contexts of individual projects.

As stated above, other complementary means such as free text searching and CBIR would also facilitate the retrieval of pieces of information; the particular set of tools depends on the types of media and the nature of the subject matter.

7. Flexible Metadata and RDF

This approach of treating pieces of information as atomic assets within a system, indexed by conventional metadata, is unlikely to solve the problem completely. Later we will consider how we can remove the limitation that pieces of information treated this way are impenetrable black boxes. For now let us just consider the issues of capture and management of the metadata for these pieces of information.

If this approach is to be viable we must make it as easy as possible to discover and reuse relevant pieces of information. The content authors will need to identify, isolate, capture and index all relevant pieces of information while performing their main job of content authoring. If the tools are not adequate then this will become an unacceptable overhead, and important pieces of information will either never be captured at all or be captured for one use only, with other manifestations of the same information left unindexed. Facilitating discovery and reuse, so that the information captured is exploited maximally, is a key to making the additional burden manageable. This means that the descriptive metadata used for this indexing and discovery needs to be particularly effective.

However, there is a problem here. The concepts relevant to the pieces of information are, as discussed above, likely to be richer than those needed to describe the content assets. In the ARKive example, they are likely to include information on complex subjects such as habitats, climate change, behaviour types, demographic trends and so forth - not just species and media information. Furthermore, the pieces of information will need to be captured from the very start of a system development, before the overall structure and terminologies for the content data have been fully worked out. Thus the descriptive terms will evolve over time.

One very important, and rapidly evolving, approach to representing such very rich and extensible metadata is RDF (Resource Description Framework - W3C 2001b). In more conventional metadata approaches a fixed schema is defined which enumerates the set of descriptive terms that can be used and their legal values. Storage systems are designed to store metadata that conforms to the appropriate schemas. In contrast, in RDF the metadata is broken down into a set of atomic assertions such as <info-item-1, aboutSpecies, mammal>. Each assertion attaches a property and its value to some subject (for example an instance of a piece of information). In RDF the identity of the items being described and the properties used to describe them are both URIs (Uniform Resource Identifiers - W3C 2002a). This use of URIs to provide a single global namespace is critical to the open extensibility of RDF. At a stroke this enables different metadata schemes to be freely composed and allows the metadata itself to be gathered from multiple sources and easily integrated and merged. It also allows descriptive terms to be added and refined over time, supporting the evolution of metadata which we see as critical.
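The idea of atomic assertions that merge freely can be sketched without any RDF library by modelling each assertion as a (subject, property, value) tuple of URIs. The namespace http://example.org/arkive/ and the aboutHabitat property are hypothetical, introduced only for this illustration; a production system would more likely use an RDF toolkit.

```python
# Hypothetical namespace; real URIs would be chosen by the project.
EX = "http://example.org/arkive/"


def uri(local_name):
    """Build a full URI from a local name in the example namespace."""
    return EX + local_name


# Assertions gathered from two independent sources.
source_a = {
    (uri("info-item-1"), uri("aboutSpecies"), uri("mammal")),
}
source_b = {
    (uri("info-item-1"), uri("aboutHabitat"), uri("islandHabitat")),
    (uri("info-item-1"), uri("aboutSpecies"), uri("mammal")),  # duplicate
}

# Because names are global URIs, merging is a plain set union:
# identical assertions collapse and new properties need no schema change.
merged = source_a | source_b


def properties_of(graph, subject):
    """Return all (property, value) pairs asserted about a subject."""
    return {(p, o) for s, p, o in graph if s == subject}


described = properties_of(merged, uri("info-item-1"))
```

The second source introduces a property the first never used, yet no fixed schema had to be revised to accommodate it.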

RDF in turn is but the first step in a larger movement towards a "Semantic Web" (W3C 2001a). This aims to extend the web to support greater semantic interoperability of data and information systems. As well as the representation of base level facts that RDF provides, the Semantic Web aims to support explicit capture of domain models (ontologies, discussed further below) and traceability of the provenance and basis for assertions.

If we use an RDF-enabled metadata system for the description and indexing of our pieces of information then we can enable authors to easily pick descriptive terms and concepts from any relevant vocabularies across the (semantic) web. Furthermore, we can use the self-descriptive powers of RDF (reification) to capture provenance information concerning this metadata thus simplifying future validation and correction of the metadata.
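As a rough sketch of how reification can carry provenance, the following Python fragment describes an assertion as an rdf:Statement resource and then attaches provenance properties to that resource. The rdf:type, rdf:subject, rdf:predicate and rdf:object terms are part of the RDF vocabulary; the choice of Dublin Core creator and date properties for provenance, the example namespaces, and the author and date values are assumptions made purely for illustration.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core elements, used here by assumption
EX = "http://example.org/arkive/"         # hypothetical namespace
VOCAB = "http://example.org/vocab/"       # hypothetical namespace


def reify(statement_id, triple):
    """Return the four triples that describe `triple` as an rdf:Statement."""
    s, p, o = triple
    return {
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", s),
        (statement_id, RDF + "predicate", p),
        (statement_id, RDF + "object", o),
    }


assertion = (EX + "info-item-1", VOCAB + "aboutSpecies", EX + "mammal")
stmt = EX + "statement-1"

graph = {assertion}
graph |= reify(stmt, assertion)
graph |= {
    (stmt, DC + "creator", "A. N. Author"),  # placeholder value
    (stmt, DC + "date", "2002-06-01"),       # placeholder value
}


def provenance_of(graph, triple):
    """Collect non-RDF properties attached to any reification of `triple`."""
    result = {}
    statements = {x for (x, p, v) in graph
                  if p == RDF + "type" and v == RDF + "Statement"}
    for st in statements:
        parts = {p: v for (x, p, v) in graph if x == st}
        described = (parts.get(RDF + "subject"),
                     parts.get(RDF + "predicate"),
                     parts.get(RDF + "object"))
        if described == triple:
            result.update({p: v for p, v in parts.items()
                           if not p.startswith(RDF)})
    return result


print(provenance_of(graph, assertion))
```

A later validation tool could use a lookup of this kind to find who asserted a piece of metadata and when, before deciding whether to correct it.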

8. Medium Term Solutions - use of explicit ontologies

The metadata that we use to index our pieces of information in the above approach is, in fact, itself an important part of the domain information that we are capturing. The vocabulary of properties and descriptive values (concepts) that we develop for indexing the pieces of information is the beginning of a deeper conceptual domain model. We can go one stage further and begin to explicitly capture the relationships between the properties and concepts. For example, we might note that all felines are also mammals, or that Pacific atolls are special cases of island habitats and the Chagos Islands are Pacific atolls. Then, if a piece of information is later associated with island habitats (for example that they are at risk from sea level rises due to global warming), we know that it will also apply to Pacific atolls such as the Chagos Islands.

Such a structured vocabulary of terms with explicit representation of their interrelationships is called an ontology. Thus we can see that the process of capturing and annotating our pieces of information involves the development of a suitable domain ontology to support that annotation.
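The inheritance of pieces of information down a concept hierarchy can be sketched as follows. This is a simplified illustration that reuses the examples from the text; it treats "is a special case of" and "is an instance of" as the same relation, which a real ontology language would distinguish, and the concept identifiers are our own.

```python
# Each concept maps to its immediately broader concepts (from the text's examples).
BROADER = {
    "feline": {"mammal"},
    "pacific_atoll": {"island_habitat"},
    "chagos_islands": {"pacific_atoll"},
}

# Pieces of information are attached once, at the most general applicable level.
INFO = {
    "island_habitat": ["at risk from sea level rises due to global warming"],
}


def ancestors(concept, broader=BROADER):
    """Return every concept reachable by following 'broader' links."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in broader.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def info_for(concept, info=INFO, broader=BROADER):
    """Return the pieces of information attached to a concept or any ancestor."""
    concepts = {concept} | ancestors(concept, broader)
    return sorted(piece for c in concepts for piece in info.get(c, []))


print(info_for("chagos_islands"))
# ['at risk from sea level rises due to global warming']
```

Nothing was recorded directly against the Chagos Islands; the information reaches them only through the hierarchy.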

The information embodied in our domain ontology is itself part of the information that we wish to capture. For example:

Consider the fact that most birds fly. This is not a piece of information that would be likely to be used in any text about a particular bird; however, it is very useful, and possibly necessary, if you wish to query the database to find 'all the animals that fly', or to help to index media objects or new pieces of information. Without such a searchable knowledge base, a query of this kind would be likely to return many species with the word 'fly' in their name or in the descriptive text about them, for example the 'flying fox', or 'penguins' (ironically because they don't fly) or other special cases, while missing the birds that do actually fly but whose descriptions never say so.

This type of ability is clearly useful for end users but also for authors of educational or other content, especially where the topic about which the work is being authored is not one that would traditionally have had appropriate metadata terms or vocabularies available to describe them e.g. where the topic is not directly related to the 'core' subject of a collection such as fashion in a science focused collection.
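The contrast between a text search and a query over captured knowledge can be sketched as below. The three species descriptions and the `canFly` property are invented for illustration; they are not drawn from any ARKive data.

```python
# Illustrative free-text descriptions (invented for this sketch).
DESCRIPTIONS = {
    "flying fox": "A large fruit bat found in tropical forests.",
    "penguin": "A seabird that does not fly but swims well.",
    "barn swallow": "A small migratory bird that feeds on insects.",
}

# Explicitly captured assertions; 'canFly' is a hypothetical property name.
FACTS = {
    ("flying fox", "canFly", True),
    ("penguin", "canFly", False),
    ("barn swallow", "canFly", True),
}


def keyword_search(descriptions, word):
    """Return species whose name or description contains the word."""
    word = word.lower()
    return sorted(name for name, text in descriptions.items()
                  if word in name.lower() or word in text.lower())


def structured_query(facts, prop, value):
    """Return subjects for which the given property has the given value."""
    return sorted(s for (s, p, v) in facts if p == prop and v == value)


print(keyword_search(DESCRIPTIONS, "fly"))
# ['flying fox', 'penguin']
print(structured_query(FACTS, "canFly", True))
# ['barn swallow', 'flying fox']
```

The keyword search returns the penguin precisely because its description says it cannot fly, and misses the swallow because its description never mentions flight.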

Using an explicit domain ontology to structure our metadata and explicitly capture some of the pieces of information gives us several benefits.

Firstly, we can use simple inferencing, based on the property and concept hierarchies, to support indexing and retrieval, as in the flying animal example above.

Secondly, this same capability can be seen as a benefit at capture time. We need only capture a piece of information at the higher level in the hierarchy and let the inheritance processing machinery complete the task of attaching it to lower levels - as in the pacific atoll example above. This reduction in the capture workload is critical to rendering explicit capture of pieces of information practical.

A third benefit is the ability to use the explicit ontology constraints to improve the user interface of the information capture tools by only offering elements and terms that are appropriate to the object being indexed and by providing default values where possible. A relevant example of this type of approach can be found in Schreiber (2001).

Finally, in directly capturing some of the relevant information in the ontology itself, rather than in the black box objects that we are indexing, we are laying the foundations for more explicit representation of the information - we are starting to open up the black boxes.
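To illustrate the third benefit listed above, the sketch below shows how a capture tool might use ontology constraints to decide which fields to offer. All concept and property names, the single-parent hierarchy and the default value are hypothetical; the approach shown (attaching a domain to each property and filtering on it) is one simple possibility, not a description of any particular tool.

```python
# Hypothetical single-parent concept hierarchy.
PARENT = {"mammal": "species", "island_habitat": "habitat"}

# Hypothetical properties: name -> (domain concept, default value or None).
PROPERTIES = {
    "conservationStatus": ("species", None),
    "diet": ("species", None),
    "gestationPeriod": ("mammal", None),
    "climateZone": ("habitat", None),
    "surroundedBy": ("island_habitat", "sea"),
}


def is_a(concept, target):
    """True if `concept` equals `target` or lies below it in the hierarchy."""
    current = concept
    while current is not None:
        if current == target:
            return True
        current = PARENT.get(current)
    return False


def applicable_properties(concept):
    """Return the fields to offer for an object of this concept, with defaults."""
    return {name: default
            for name, (domain, default) in sorted(PROPERTIES.items())
            if is_a(concept, domain)}


print(applicable_properties("mammal"))
print(applicable_properties("island_habitat"))
```

An author indexing a mammal would be offered the species-level fields plus `gestationPeriod`, but never `climateZone`; an author indexing an island habitat would find `surroundedBy` already filled in.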

The notion of capturing explicit structured domain models using ontologies is not a new one, but substantial progress in the standardisation of tools and languages for this has been made recently - partly as a result of the W3C Semantic Web initiative noted above. Several ontology languages which are compatible with the Semantic Web RDF layer have been developed over recent years - notably OIL (Ontology Interchange Language), developed by the European On-to-knowledge project (Fensel et al 2000), and DAML (DARPA Agent Markup Language) (DARPA 2002), developed under the DARPA program in the US. These languages have merged into a single candidate standard ontology language (termed DAML+OIL) and form the basis for a new W3C working group which is defining the standard web ontology language (currently termed OWL; W3C 2002b).

Difficulties still remain, however, and we should see this use of explicit ontology-based indexing as promising leading edge technology rather than common industry practice. In particular, choosing how to best structure the ontology for a given domain remains a specialist task and the engineering advice for guiding and evaluating the construction of such ontologies is still being developed. This is true even for stable well understood domains and so handling evolving ontologies in ill-structured domains is harder still - raising issues of how to codify conflicting or controversial knowledge and how to maintain consistency and integrity of any ontology that is being created by a diverse range of individuals over an extended period of time. For further discussion on some of these issues see (Shabajee et al 2002).

Even for stable, well understood domains the representational power of ontology languages is, deliberately, limited. Even the sort of default reasoning suggested by the 'most birds fly' example above is beyond the built-in capabilities of the emerging standards and would require either custom processing tools or a shift to a richer representation mechanism.
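As an indication of what such custom processing might look like, the following sketch implements one simple strategy for default reasoning: a property value is taken from the nearest concept in the hierarchy that states one, so that an exception recorded on a specific concept overrides the default inherited from a general one. This is our own illustrative assumption, not a feature of DAML+OIL or OWL, and the sparrow concept and `canFly` property are invented for the example.

```python
# Hypothetical single-parent hierarchy.
PARENT = {"penguin": "bird", "sparrow": "bird", "bird": "animal"}

# Defaults and exceptions: 'most birds fly', but penguins do not.
DEFAULTS = {
    "bird": {"canFly": True},
    "penguin": {"canFly": False},
}


def lookup(concept, prop):
    """Return the value from the nearest concept that states one, else None."""
    current = concept
    while current is not None:
        values = DEFAULTS.get(current, {})
        if prop in values:
            return values[prop]
        current = PARENT.get(current)
    return None


print(lookup("sparrow", "canFly"))  # True (inherited default)
print(lookup("penguin", "canFly"))  # False (exception overrides default)
print(lookup("animal", "canFly"))   # None (nothing is known)
```

Even this small example goes beyond strict subclass inheritance, since the value attached to a general concept can be withdrawn further down the hierarchy.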

9. A Comprehensive and Long Term Solution?

So far we have moved from representing pieces of information as simple indexed objects through to capturing the domain models behind those indexes (and thus some of the information itself) using explicit ontologies. The final stage in this progression is to attempt to encode the knowledge directly using some richer knowledge representation language, replacing our opaque objects completely with machine-processable knowledge structures.

If possible this would support substantially more sophisticated queries for navigating our repositories and would enable far greater reuse of codified information in related repositories.

Formalisms for knowledge representation and techniques for eliciting and processing such knowledge have been studied in the Artificial Intelligence field for many decades (and in the fields of philosophy and mathematical logic for much longer than that). These include formal approaches based either directly or indirectly on mathematical logic (for example, predicate calculus, conceptual graphs, semantic networks, frame systems) through to less formal approaches such as forward-chaining production rule systems or case-based reasoning systems (see for example Sowa 2000, Brachman and Levesque 1985 and Negnevitsky 2001). Many of these approaches stumble when dealing with the uncertainty of the real world - problems which can sometimes be sidestepped by moving to alternative formalisms which can handle such uncertainty (such as non-monotonic, paraconsistent or fuzzy logics) or by directly adopting probabilistic modelling techniques (e.g. Bayesian belief networks).

However, practical application of such knowledge representation techniques to the sorts of pieces of information that we have been discussing remains very challenging.

Firstly, there is the issue of simple representational power - even the default reasoning capabilities implied by our apparently simple 'most birds fly' example are a challenge to most representations. Such challenges are compounded in the applications we have been discussing by the instability in the domains and the amount of partial, changing, conflicting and controversial knowledge that is to be captured.

Secondly, there is the question of the sheer cost of capturing such knowledge. AI researchers have developed a solid body of techniques for eliciting knowledge from domain experts, and engineering guidelines for the overall process (KADS for example - Schreiber et al 1993). Nevertheless, this remains a difficult and time-consuming process typically requiring the assistance of knowledge elicitation specialists. Such an overhead may not be justified by the benefits (authoring, ease of maintenance, ease of navigation and retrieval) that we are attempting to achieve. Some have suggested that if we could capture a sufficient body of basic world knowledge then the task of building specific knowledge bases as extensions to that common foundation would be much simplified. However, despite the impressive work by groups such as CYC (Lenat and Guha 1990 and Cycorp 2002) practical demonstrations of such common sense reasoning actually reducing the cost of useful knowledge engineering are rather rare.

Fortunately, a compromise is possible. In taking the approach we advocate above, of concentrating initially on the metadata used to index opaque information objects, we argued that we can begin to build explicit ontologies for describing our domains. These ontologies, in turn, can form the foundations for deeper knowledge representation when useful. The domain ontology provides the backbone of agreed vocabulary of concepts and relations that can be referenced in rules or other knowledge representation structures. This allows us to add incrementally to the representational power of our system rather than having to choose up front an all-or-nothing representation approach.

10. Conclusions

The capturing of pieces of information is an essential aspect in the production and maintenance of information rich projects where information is re-presented in many ways using diverse media. While not essential, the ability to track and index these offers a range of significant advantages in content authoring and maintenance. However the authors would argue strongly that any new large rich media projects which aim to do large-scale re-purposing of their assets for educational or information provision to diverse audiences should consider building such requirements into their system design.

It may be possible for existing MAM (Multimedia Asset Management) and CMS (Content Management Systems) to be customised to provide this kind of functionality by treating pieces of information in a similar manner to other media objects. However, any such solution is likely to be problematic as 1) the underlying knowledge representation does not solve the basic problems or complexities outlined above and 2) interfaces have not been designed with this in mind.

The integration of these kinds of 'knowledge management' tools, alongside other information retrieval technologies such as concept extraction tools and CBIR, into existing MAM and CMS systems, either as standard or as optional components, would provide developers and users with a more complete and appropriate solution.

An intermediate solution could utilise the evolving Semantic Web technologies to represent the indexing terms and the relationships between them. These would provide the basis for efficient tools to support the capture and indexing of the pieces of information and the ability to conduct more semantically rich querying of a system. However, this still has some limitations, especially with respect to queries that require a sophisticated level of inferencing over a knowledge base represented collectively by the pieces of information.

Central to any totally comprehensive solution to capture, retrieve, reuse and track these pieces of information is the ability to represent the 'knowledge' in a machine-readable form. We have argued that full machine-processable representation is difficult and unlikely to be cost effective but that an incremental approach is possible - starting from a domain ontology for indexing opaque information items, adding simple ontology-based inferencing and only moving to full representation of this knowledge as the technologies and the domain understanding mature. While such technologies exist, and are under constant development, there are still a number of fundamental issues that remain unsolved, especially the means for inexperienced end users to continually add and integrate new 'knowledge' and a number of problems related to robust, comprehensive and useful inferencing over inconsistent and partial knowledge bases.

Acknowledgements

The ARKive-ERA project is funded by HP Labs, Bristol. The authors would like to thank Andy Morgan and Neil MacDougall of HP Labs Bristol and Andy Dingley of Codesmiths, Bristol, for their help and support in exploring the ideas outlined in this paper.

References

ARKive-ERA. (2002) ARKive ERA (Educational Repurposing of Assets) Homepage. Available: http://www.ilrt.bris.ac.uk/projects/project?search=arkive_era.

BBC (2002) Blue Planet Homepage, Available: http://www.bbc.co.uk/nature/programmes/tv/blueplanet/.

Berners-Lee, T., Hendler, J. and Lassila, O. (2001) The Semantic Web, Scientific American, May 2001.

Brachman, R. J. and Levesque, H. J. (Eds.) (1985) Readings in knowledge representation, Morgan Kaufmann, Los Altos, CA.

Brennan, Michael., Funke, Susan. and Anderson, Cushing (2001) The Learning Content Management System, A New eLearning Market Segment Emerges, An IDC White Paper, IDC, Available: http://www.internettime.com/itimegroup/lcms/IDCLCMSWhitePaper.pdf

Browning, P. and Lowndes, M. (2001) Technology and Standards Watch Reports: Content Management Systems, JISC.

Cycorp (2002) Cycorp Company Overview Available: http://www.cyc.com/overview.html

DARPA. (2002) The DARPA Agent Markup Language Homepage. DARPA. Available: http://www.daml.org/.

Dublin Core Metadata Initiative. (2002a) Dublin Core Metadata Initiative (DCMI) Homepage. Available: http://dublincore.org/.

Dublin Core Metadata Initiative. (2002b) DCMI Citation Working Group Homepage. Available: http://dublincore.org/groups/citation/.

Fensel, D., Horrocks, I., Van Harmelen, F., Decker, S., Erdmann, M. and Klein, M. (2000) OIL in a nutshell, In Knowledge Acquisition, Modeling, and Management, Proceedings of the European Knowledge Acquisition Conference (EKAW-2000), Lecture Notes in Artificial Intelligence (Eds, Dieng, R. et al.) Springer-Verlag, Juan-les-Pins, France. Available: http://www.cs.vu.nl/~ontoknow/oil/downl/oilnutshell.pdf.

Gilliland-Swetland, Anne J. (1998) "Setting the Stage: Defining Metadata" in Introduction to Metadata: Pathways to Digital Information, Murtha Baca, ed. Los Angeles: Getty Information Institute, Available on-line: http://www.getty.edu/research/institute/standards/intrometadata/2_articles/index.html.

Heery, R. and Patel, M. (2000) Application profiles: mixing and matching metadata schemas, Ariadne 25.

HP Labs (2002) HP Labs - ARKive Homepage, Available: http://www.hpl.hp.co.uk/arkive/

IEEE Learning Technology Standards Committee (LTSC). (2002) Draft Standard for Learning Object Metadata, ver. 6.4. Available: http://ltsc.ieee.org/doc/wg12/LOM_WD6_4.pdf.

JISC. (2002) JISC Circular 2/02: Exchange for Learning Programme (X4L). JISC. Available: http://www.jisc.ac.uk/pub02/c02_02.html.

Lagoze, C. and Hunter, J. (2001) The ABC Ontology and Model, Journal of Digital information, 2(2) Available: http://jodi.ecs.soton.ac.uk/Articles/v02/i02/Lagoze/

Lenat, D. B. and Guha, R. V. (1990) Building Large Knowledge Based Systems, Addison Wesley, Reading, Massachusetts.

Mumford, A. and Grout, C. (2002) X4L Programme: "Exchange for Learning": TOWN MEETING: 1st February 2002, 10.30am. JISC. Available: http://www.caret.cam.ac.uk/pdfs_ppts/X4L_TownMeeting.ppt.

Negnevitsky, Michael. (2001) Artificial Intelligence: a Guide to Intelligent Systems, Addison Wesley, London

OCLC/RLG Working Group on Preservation Metadata. (2001) Preservation Metadata for Digital Objects: A Review of the State of the Art: A White Paper by the OCLC/RLG Working Group on Preservation Metadata. January 31, 2001. OCLC/RLG Working Group on Preservation Metadata. Available: www.oclc.org/research/pmwg/presmeta_wp.pdf.

Open Archives Initiative. (2002) Open Archives Initiative Homepage. Open Archives Initiative. Available: http://www.openarchives.org/.

Shabajee, P. (2002) 'Educational Metadata' a Fundamental Dilemma for Developers of Multimedia Archives, D-Lib Magazine, 8(6). Available: http://www.dlib.org/dlib/june02/shabajee/06shabajee.html

Shabajee, P., Miller, L. and Dingley, A. (2002) Adding value to large multimedia collections through annotation technologies and tools: Serving communities of interest., In Museums and the Web 2002: Selected Papers from an International Conference(Eds, Bearman, D. and Trant, J.) Archives & Museums Informatics, Boston, USA. Available: http://www.archimuse.com/mw2002/papers/shabajee/shabajee.html

Schreiber, A. T., Dubbeldam, B., Wielemaker, J. and Wielinga, B. (2001) Ontology-Based Photo Annotation, IEEE Intelligent Systems, May/June 2001, 2-10. Available: http://www.swi.psy.uva.nl/usr/Schreiber/papers/Schreiber01a.pdf

Schreiber, A., Wielinga, B. and Breuker, J. (Eds.) (1993) KADS: A principled approach to knowledge-based system development, Academic Press, London.

Sowa, J. F. (2000) Knowledge Representation: Logical, Philosophical, and Computational Foundations, Brooks Cole Publishing Co., Pacific Grove, CA.

UKOLN (2002) nof-digitise Technical Standards and Guidelines, New Opportunities Fund (NOF), Available: http://www.peoplesnetwork.gov.uk/content/ts_index.asp.

W3C. (2001a) Semantic Web. http://www.w3.org/2001/sw/.

W3C. (2001b) Resource Description Framework (RDF). Available: http://www.w3.org/RDF/

W3C. (2002a) Naming and Addressing: URIs, URLs, ... W3C. Available: http://www.w3.org/Addressing/

W3C. (2002b) Web-Ontology (WebOnt) Working Group Homepage. W3C. Available: http://www.w3.org/2001/sw/WebOnt/.

Wildscreen Trust (2002) ARKive, Available online: http://www.wildscreen.org.uk/arkive.htm

Notes

[1] Other terms for these might be facts, beliefs, assumptions or assertions; however, each of these has additional connotations which we would prefer to avoid in this initial exploration - hence our use of the longer but more neutral term pieces of information.