
Data sharing and retrieval using OAI-PMH

Ranjeet Devarakonda 1, Giri Palanisamy 1, James M. Green 2, Bruce E. Wilson 1

1 Oak Ridge National Laboratory, PO Box 2008, MS 6407, Oak Ridge, TN 37831, USA
2 Information International Associates, Oak Ridge, TN 37831, USA

ABSTRACT

There is a growing consensus on the need to store and archive digital data, particularly data from publicly funded research. Long-term preservation of that data generally requires some form of institutional archive, such as those directed to particular scientific communities of practice. Given that data is often of use to multiple communities of practice, which may have differing norms for data and metadata structure and semantics, effective standards for data and metadata exchange are important factors for users to be able to find and retrieve data. Toolsets that provide a coherent presentation of information across multiple standards are important for data search and access. One such toolset, Mercury, is an open-source metadata harvesting, data discovery, and access system, built for researchers to search for, share, and obtain spatiotemporal data used across a range of climate and ecological sciences.

Mercury is used across multiple projects to provide a coherent search interface for spatiotemporal data described in any of several metadata formats. Mercury has recently been extended to enable harvesting and distribution of metadata using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). In this paper we describe Mercury’s capabilities with multiple metadata formats in general and, more specifically, the results of our OAI-PMH implementations and the lessons learned.

KEYWORDS

Mercury Search System, Scientific data search, OAI-PMH, jOAI, Data sharing, Metadata, Ecological informatics, Climate change, Environmental informatics, Spatiotemporal data

1. INTRODUCTION

A key conclusion in a recent United States Interagency Working Group on Digital Data (IWGDD) report on harnessing the power of digital data for science and society is the role of communities of practice in effective quality control, preservation, distribution, interpretation, and use of digital data (NSTC, 2009). A given researcher may, however, participate in multiple communities of practice, and may also need to draw on data from communities outside those he or she normally participates in. The data generated by a researcher, or by any other data generator, is potentially of use to multiple communities of practice and scientific disciplines. While that data may be archived in a particular repository serving one or more particular communities, it may need to be discoverable and usable by multiple communities.

General-purpose search engines, such as Google and Bing, are currently generally ineffective at discovering scientific data, in part because of the specific semantics associated with a particular search and because those search engines generally perform full-text searches rather than fielded searches of structured documents. A Google search on the term “Eagles” lacks the context to distinguish between multiple different meanings, whereas a data repository serving a biology community can presume that the searcher is referring to members of the genus Haliaeetus.

Advances in the practice of scientific data management, the tools for managing data, the standards for data formats and metadata formats, and the understanding of the value of digital data have created a wide range of digital repositories focused on different applications. Nor are these repositories necessarily distinct. There may be a number of different repositories serving field ecologists, with distinctions based on funding agency, country, organizational affiliation, or other artifacts of historical origin. These repositories generally have search tools that work within their particular holdings, but are often unable to search across the holdings of other repositories, due to various technical and sociological factors. From the end user perspective, this situation is problematic, as a comprehensive search for available digital data relevant to a research topic is nearly impossible, requiring knowledge of multiple repositories and the particular search interfaces of those repositories.

Multiple approaches have been used to enable search across multiple repositories, such as the Z39.50 distributed search method (Lynch, 1997). Distributed searches, however, can be problematic, both for response time and uptime. Search results can be presented to the user only as quickly as the slowest search agent returns (plus some processing time if the results are to be integrated), and the composite uptime is the product of the individual uptimes.
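The two costs of distributed search described above can be illustrated with a small calculation. The following is a minimal sketch, assuming that repository failures are independent and using hypothetical availability and response-time figures; the function names are illustrative and are not part of Mercury or Z39.50.

```python
from math import prod


def composite_uptime(uptimes):
    """Fraction of time that every repository is reachable at once.

    Assumes independent failures, so the individual uptimes multiply.
    """
    return prod(uptimes)


def distributed_response_time(agent_times, merge_time=0.0):
    """A federated query completes only when the slowest agent has answered,
    plus any time spent integrating the result sets."""
    return max(agent_times) + merge_time


# Hypothetical figures: five repositories, each available 99% of the time.
uptimes = [0.99] * 5
print(f"composite uptime: {composite_uptime(uptimes):.4f}")

# Hypothetical per-repository response times, in seconds.
times = [0.4, 0.7, 3.2, 0.5, 0.9]
print(f"response time: {distributed_response_time(times, merge_time=0.3):.1f} s")
```

Under these assumptions, adding repositories can only lower the composite uptime and can only raise the response time, which is the motivation for harvesting metadata ahead of time.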

As a result of the problems with distributed searches, repositories have turned to a harvest-and-index approach as a means to ensure rapid response, enable full integration of metadata from multiple sources, and provide acceptable uptime. However, harvesting can be an inefficient process, particularly if the metadata are completely reharvested regularly as a means to ensure that source changes are propagated into the search results.

Mercury (Devarakonda, 2010) is an open-source toolset for metadata authoring, harvesting, indexing, and searching that implements a variety of harvesting protocols and provides a coherent view of metadata across a range of metadata standards, including the Federal Geographic Data Committee Content Standard for Digital Geospatial Metadata (FGDC CSDGM), Ecological Metadata Language (EML), the Global Change Master Directory’s Directory Interchange Format (GCMD DIF), Dublin Core, and ISO 19115. Mercury’s architecture includes 1) a harvesting engine to collect metadata records from publicly available folders, web sites, FTP sites, and other network-accessible locations; 2) a powerful indexing engine based on Apache Lucene and Solr that can index billions of records; and 3) a search engine based on a service-oriented architecture, which can perform searches and distribute results through web user interfaces, web services, RSS feeds, and portlets.

Recently, we added the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH; OAI, 2010; Van de Sompel, 2004) as a means for both harvesting metadata from other repositories and enabling the distribution and reuse of metadata from repositories using the Mercury toolset. This new feature is an extension to Mercury’s harvester. OAI-PMH is a standard that is seeing increased use as a means for exchanging structured metadata. OAI-PMH providers must support Dublin Core as a metadata standard, with other metadata formats as optional. We have developed tools that enable Mercury to both consume and distribute metadata using OAI-PMH services in any of the metadata formats we support. By implementing these tools, we seek to significantly lower the technical barriers for users to find and use relevant data, regardless of the particular repository that is the authority for that data and the associated metadata.

2. METHODS AND TECHNIQUES

Mercury harvests metadata records from several data providers around the world, builds a centralized index, and makes it searchable via Mercury’s search interface (see Figure 1). Once the records are harvested by the default harvester, they are then exposed to the new OAI-PMH-based Mercury harvester.

Figure 1. Mercury Metadata Search

The Mercury OAI-PMH Handler is implemented using jOAI, a Java-based, open-source Open Archives Initiative software package developed by Digital Learning Sciences at the University Corporation for Atmospheric Research (UCAR, 2010). This package allows metadata records from a file system to be exposed as items in an OAI data repository and made available for harvesting. Remote harvesters can monitor the OAI data repository and can effectively mirror the files or harvest them incrementally. For example, NASA’s Global Change Master Directory (GCMD) PMH handler consumes this structured metadata via ORNL’s OAI harvester service. Figure 2 describes the high-level metadata flow from ORNL to GCMD and also shows other potential metadata distribution standards.

For a number of reasons, our OAI-PMH provider is generally configured to expose only the metadata for which that repository is authoritative. A given repository may be harvesting from multiple different locations for purposes of providing a coherent view to the user. That repository may not, however, have permission to redistribute the harvested metadata. Further, redistribution brings in a number of technical challenges. If repository A harvests from B, which harvests from C, which harvests from D, then an update to a metadata record at repository D will take three update cycles to reach users of repository A, which could be a significant delay, depending on the harvest frequency. Furthermore, if repository A harvests from B and from C, while both B and C harvest from D and D harvests from A, avoiding a perpetual update cycle and determining the authoritative instance for a particular metadata record may prove problematic. While this type of cyclic harvesting arrangement may seem contrived, the authors are familiar with a number of cases where such situations could occur.
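The chained and cyclic arrangements described above can be modeled as a directed graph of harvesting relationships. The following is a minimal sketch, assuming a hypothetical `harvests_from` mapping from each repository to the repositories it harvests; it is not how Mercury decides what to expose, only an illustration of the propagation delay and of the perpetual-update cycle.

```python
from collections import deque


def update_cycles_needed(harvests_from, origin, consumer):
    """Fewest harvest hops for a record at `origin` to reach `consumer`.

    Breadth-first search upstream from the consumer; None if unreachable.
    """
    seen = {consumer}
    queue = deque([(consumer, 0)])
    while queue:
        repo, hops = queue.popleft()
        if repo == origin:
            return hops
        for upstream in harvests_from.get(repo, []):
            if upstream not in seen:
                seen.add(upstream)
                queue.append((upstream, hops + 1))
    return None


def find_harvest_cycle(harvests_from):
    """Return one harvesting loop as a list of repositories, or None."""
    in_progress, done, path = set(), set(), []

    def visit(repo):
        in_progress.add(repo)
        path.append(repo)
        for upstream in harvests_from.get(repo, []):
            if upstream in in_progress:
                # Found a back edge: slice out the loop and close it.
                return path[path.index(upstream):] + [upstream]
            if upstream not in done:
                found = visit(upstream)
                if found:
                    return found
        path.pop()
        in_progress.discard(repo)
        done.add(repo)
        return None

    for repo in harvests_from:
        if repo not in done:
            found = visit(repo)
            if found:
                return found
    return None


# The chain from the text: A harvests B, B harvests C, C harvests D.
chain = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
print(update_cycles_needed(chain, origin="D", consumer="A"))

# The loop from the text: A harvests B and C, both harvest D, D harvests A.
loop = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["A"]}
print(find_harvest_cycle(loop))
```

Exposing only authoritative records, as described above, sidesteps both problems because every record then has a single upstream source.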
2.1 OAI-PMH Overview

Metadata are exchanged between data providers and service providers as XML documents transmitted over HTTP. There can be multiple data providers and service providers; each service provider harvests the metadata from several different data providers. These transfers are carried out by simple HTTP requests and responses. There are six different types of requests (see Figure 3). It is not mandatory for the harvester to use all of them.
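The six request types are the OAI-PMH verbs Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, and GetRecord, each sent as an ordinary HTTP request against the repository's base URL. The following is a minimal sketch of how such request URLs can be assembled; the base URL and record identifier are placeholders, and `build_request` is an illustrative helper rather than part of jOAI or Mercury.

```python
from urllib.parse import urlencode

# The six request types (verbs) defined by OAI-PMH 2.0.
OAI_PMH_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def build_request(base_url, verb, arguments=None):
    """Return the GET URL for one OAI-PMH request.

    `arguments` is a dict because some protocol argument names, such as
    "from", are reserved words in Python.
    """
    if verb not in OAI_PMH_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    params.update(arguments or {})
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint; a real provider publishes its own base URL.
BASE_URL = "https://repository.example.org/oai"

print(build_request(BASE_URL, "Identify"))
print(build_request(BASE_URL, "ListRecords",
                    {"metadataPrefix": "oai_dc", "from": "2010-05-01"}))
print(build_request(BASE_URL, "GetRecord",
                    {"identifier": "oai:example.org:dataset-1",
                     "metadataPrefix": "oai_dc"}))
```

A harvester that only mirrors records may need little more than ListRecords, which is consistent with the point that not every request type has to be used.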

Figure 2. ORNL’s OAI-PMH Metadata flow (diagram: ORNL DAAC metadata records; ORNL’s OAI-PMH Handler Service, serving DIF, FGDC, ISO, and DC; GCMD’s OAI-PMH; GCMD Data Discovery Service)

2.2 ORNL’s OAI-PMH Harvester

Generally, an OAI-PMH provider stores metadata in an autonomous OAI-PMH repository. This repository has a unique, persistent base URL, the HTTP address BaseURL(n). To monitor metadata revisions, an OAI-PMH harvester can read when a record was added, modified, or deleted, which helps in synchronization between the data provider and the harvester. It typically uses the datestamp for this purpose, which by definition is the date and time of creation or modification of the Dublin Core metadata record. However, updating a resource does not necessarily result in a modification of the Dublin Core record, so the datestamp might not always be the most reliable basis for incremental harvesting. In the previous harvesting approach, incremental harvesting was unavailable, resulting in long network connections and slowing down processing until the entire load of metadata was downloaded.

Figure 3. OAI-PMH Overview

In general, however, the OAI-PMH protocol provides reliable information on the revision date for metadata records, which ensures that the harvester retrieves only the records that have changed since the last harvest. This places less strain on both the PMH provider and the PMH client, and allows for more rapid update cycles.
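An incremental harvest of this kind amounts to reading the header of each record returned by a ListRecords request and applying additions, modifications, and deletions to a local copy. The following is a minimal sketch using Python's standard XML parser; the response document, identifiers, and titles are invented for illustration, and the functions are not Mercury's harvester code.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH 2.0 responses and by unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented response to a ListRecords request with from=2010-05-01.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2010-05-20T12:00:00Z</responseDate>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:dataset-1</identifier>
        <datestamp>2010-05-03</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Hypothetical soil moisture dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:example.org:dataset-2</identifier>
        <datestamp>2010-05-10</datestamp>
      </header>
    </record>
    <resumptionToken>token-123</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""


def parse_list_records(xml_text):
    """Return (changes, resumption_token) for one ListRecords response."""
    root = ET.fromstring(xml_text.strip())
    container = root.find("oai:ListRecords", NS)
    if container is None:
        return [], None
    changes = []
    for record in container.findall("oai:record", NS):
        header = record.find("oai:header", NS)
        changes.append({
            "identifier": header.findtext("oai:identifier", namespaces=NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=NS),
            "deleted": header.get("status") == "deleted",
            "title": record.findtext("oai:metadata/oai_dc:dc/dc:title",
                                     namespaces=NS),
        })
    # An absent or empty token means there are no further chunks to request.
    token = container.findtext("oai:resumptionToken", namespaces=NS)
    return changes, (token or None)


def apply_changes(local_index, changes):
    """Update a local identifier-to-record mapping in place."""
    for change in changes:
        if change["deleted"]:
            local_index.pop(change["identifier"], None)
        else:
            local_index[change["identifier"]] = change


local_index = {"oai:example.org:dataset-2": {"title": "Stale record"}}
changes, token = parse_list_records(SAMPLE_RESPONSE)
apply_changes(local_index, changes)
print(sorted(local_index), token)
```

In a real harvest the returned resumption token would be sent back in a follow-up ListRecords request until no token remains, and the date of the harvest would be stored for use as the next `from` argument.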

A metadata crosswalk module manages the available metadata formats and converts one metadata format to another. Though it supports multiple formats, Dublin Core is mandatory for interoperability and standards compliance. This is another new value-added module in the Mercury system. Data providers or data sources have more flexibility in choosing a metadata standard and can concentrate more on the content than on the style of presentation.
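At its simplest, a crosswalk is a table that maps elements of one metadata standard onto elements of another. The following is a minimal sketch, assuming a record that has already been flattened into a dictionary keyed by FGDC CSDGM short element names; the mapping table and the sample record are illustrative assumptions and do not reproduce Mercury's crosswalk module.

```python
# Illustrative mapping from FGDC CSDGM short names to Dublin Core elements.
FGDC_TO_DC = {
    "title": "title",
    "origin": "creator",
    "abstract": "description",
    "pubdate": "date",
    "themekey": "subject",
}


def crosswalk(record, mapping):
    """Translate a flat source record into target-format fields.

    Source elements without a mapping are dropped, and repeated or
    list-valued elements accumulate under the target element.
    """
    converted = {}
    for source_field, value in record.items():
        target_field = mapping.get(source_field)
        if target_field is None:
            continue
        values = value if isinstance(value, list) else [value]
        converted.setdefault(target_field, []).extend(values)
    return converted


# Hypothetical FGDC-style record.
fgdc_record = {
    "title": "Hypothetical soil moisture dataset",
    "origin": ["A. Researcher", "B. Researcher"],
    "abstract": "Daily soil moisture observations at an example site.",
    "pubdate": "2010",
    "themekey": ["soil moisture", "ecology"],
    "westbc": "-84.5",  # bounding coordinate with no entry in this mapping
}

print(crosswalk(fgdc_record, FGDC_TO_DC))
```

The dropped bounding coordinate in this example shows why a crosswalk into Dublin Core is usually lossy, and why richer formats remain useful alongside it.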

Once metadata are harvested, by whatever means, Mercury extracts the available information from the metadata records to form a common representation used as the basis for the Lucene indexing. The full metadata record is also full-text indexed and available for the end user to examine as part of the search results.

3. FUTURE DIRECTIONS

While OAI-PMH significantly improves the ability of repositories to consolidate metadata from multiple different sources, providing users at least discovery-level metadata to enable scientific research, users are still likely to need to search multiple repositories. And once records have been located, the data access mechanisms for various repositories can be quite different. Some metadata specifications, such as ISO 19115 and GCMD DIF, provide means for data providers to indicate data access services for standard methods, such as Data Access Protocol (DAP) methods or Open Geospatial Consortium (OGC) web services. As data providers expand metadata to provide such information, and as metadata becomes more machine processable, users will be better able to directly access data without needing as much understanding of differing data locations and access methods. The Mercury development team is actively working with data providers to enable more transparent data access using these types of service descriptors.

When working across communities of practice, there are also issues where different terms are used for the same concept and the same term is used for different concepts. Semantic mediation is one method for addressing this type of problem.

4. CONCLUSION

The OAI-PMH enhancement was a useful addition to the harvesting protocols in use for distributing metadata. Metadata exchanges between agencies are now carried out more readily. In our implementation, which is based on the standard and on open-source tools, we are able to supply metadata in multiple formats, based on transformations from our internal metadata structure. This enables efficient distribution to multiple collaborating repositories in a way that best uses the native capabilities of those collaborating institutions.

OAI-PMH focuses on the transfer of metadata between data providers, and other common services, such as metadata searching, are outside its scope. Integration of Mercury with OAI-PMH is filling a key gap in searching, sharing, and obtaining spatiotemporal data across the scientific community, thus boosting its overall performance and usage.

While the specification requires that Dublin Core metadata be an option, this can be a very limited metadata structure, particularly for complex scientific datasets. Metadata exchanges are carried out asynchronously via simple HTTP requests and responses, which also demonstrates the simplicity of the protocol.

5. ACKNOWLEDGMENTS

Mercury development has been funded by multiple different projects from the National Aeronautics and Space Administration (NASA), the United States Geological Survey (USGS), and the Department of Energy (DOE). Oak Ridge National Laboratory is managed by UT-Battelle, LLC, for the U.S. Department of Energy under contract DE-AC05-00OR22725.

6. REFERENCES

Devarakonda, R., Palanisamy, G., Wilson, B. E., and Green, J. M. (2010) Mercury: reusable metadata management, data discovery and access system. Earth Science Informatics 3:87-94. doi:10.1007/s12145-010-0050-7.

NAS (2010) Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age (ISBN 0-309-13685-7). The National Academies Press.

NSTC (2009) “Harnessing the Power of Digital Data for Science and Society.” Report of the Interagency Working Group on Digital Data to the Committee on Science of the National Science and Technology Council. Washington, D.C. Available at: http://www.nitrd.gov/About/Harnessing_Power_Web.pdf

OAI (2010) Open Archives Initiative Protocol for Metadata Harvesting. Interoperability through Metadata Exchange. Retrieved May 2010. Available at: http://www.openarchives.org/pmh/

Suleman, H., and Fox, E. A. (2001) “A Framework for Building Open Digital Libraries.” D-Lib Magazine 7#12.

Lynch, C. A. (1997) “The Z39.50 Information Retrieval Standard. Part I: A Strategic View of Its Past, Present and Future.” D-Lib Magazine, April 1997.

UCAR (2010) jOAI software, developed by Digital Learning Sciences (DLS) (http://www.dlsciences.org/) at the University Corporation for Atmospheric Research (http://www.ucar.edu/). Retrieved May 2010. Available at: http://www.dlese.org/dds/services/joai_software.jsp

Van de Sompel, H., et al. (2004) “Resource Harvesting within the OAI-PMH Framework.” D-Lib Magazine 10#2.
