1 Oak Ridge National Laboratory, PO Box 2008, MS 6407, Oak Ridge, TN 37831, USA
2 Information International Associates, Oak Ridge, TN 37831, USA
ABSTRACT
There is a growing consensus on the need to store and archive digital data, particularly for publicly funded research.
Long-term preservation of that data generally requires some form of institutional archive, such as those directed to particular
scientific communities of practice. Given that data is often of use to multiple communities of practice, which may have
differing norms for data and metadata structure and semantics, effective standards for data and metadata exchange are
important factors for users to be able to find and retrieve data. Toolsets that provide a coherent presentation of information
across multiple standards are important for data search and access. One such toolset, Mercury, is an open-source metadata
harvesting, data discovery, and access system, built for researchers to search for, share and obtain spatiotemporal data used
Mercury is used across multiple projects to provide a coherent search interface for spatiotemporal data described in any of
several metadata formats. Mercury has recently been extended to enable harvesting and distribution of metadata using the
Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). In this paper we describe Mercury’s capabilities with
multiple metadata formats in general and, more specifically, the results of our OAI-PMH implementations and the lessons
learned.
KEYWORDS
Mercury Search System, Scientific data search, OAI-PMH, jOAI, Data sharing, Metadata, Ecological Informatics, Climate
1. INTRODUCTION
A key conclusion in a recent United States Interagency Working Group on Digital Data (IWGDD) report on harnessing the
power of digital data for science and society is the role of communities of practice in effective quality control, preservation,
distribution, interpretation, and use of digital data (NSTC, 2009). A given researcher may, however, participate in multiple
communities of practice, and may also need to draw on data from communities outside those he or she normally participates
in. The data generated by a researcher, or by any other data generator, is potentially of use to multiple communities of
practice and scientific disciplines. While that data may be archived in a particular repository serving one or more particular
communities, that data may need to be discoverable and usable by multiple communities.
General-purpose search engines, such as Google and Bing, are currently ineffective at discovering scientific data, in
part because of the specific semantics associated with a particular search and because those search engines generally perform
full-text searches rather than fielded searches of structured documents. A Google search on the term “Eagles” lacks the context to
distinguish between multiple different meanings, whereas a data repository serving a biology community can presume that
Advances in the practice of scientific data management, the tools for managing data, the standards for data formats and
metadata formats, and the understanding of the value of digital data have created a wide range of digital repositories focused
on different applications. Nor are these repositories necessarily distinct. There may be a number of different repositories
serving field ecologists, with distinctions based on funding agency, country, organizational affiliation, or other artifacts of
historical origin. These repositories generally have search tools that work within their particular holdings, but are often
unable to search across the holdings of other repositories, due to various technical and sociological factors. From the end
user perspective, this situation is problematic, as a comprehensive search for available digital data relevant to a research topic
is nearly impossible, requiring knowledge of multiple repositories and the particular search interfaces of those repositories.
Multiple approaches have been used for enabling search across multiple repositories, such as the Z39.50 (Information
Retrieval Standard, 1997) distributed search method. Distributed searches, however, can be problematic, both for response
time and uptime. Search results can only be presented to the user as quickly as the slowest search agent returns (plus some
processing time if the results are to be integrated) and the composite uptime is the product of the individual uptimes.
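The arithmetic behind this argument can be sketched in a few lines. The example below is illustrative only: it assumes repositories fail independently, and the uptime and response-time figures are hypothetical values, not measurements from any real repository.

```python
from math import prod


def composite_uptime(uptimes):
    """Probability that every repository is up, assuming independent failures."""
    return prod(uptimes)


def composite_response_time(response_times, merge_time=0.0):
    """A distributed result is ready only after the slowest agent answers,
    plus any time spent integrating the result sets."""
    return max(response_times) + merge_time


# Hypothetical figures for five federated repositories.
uptimes = [0.99, 0.99, 0.98, 0.995, 0.97]
response_seconds = [0.4, 1.2, 0.8, 3.5, 0.6]

print(f"composite uptime: {composite_uptime(uptimes):.3f}")
print(f"response time: {composite_response_time(response_seconds, 0.2):.1f} s")
```

Even though every repository in this sketch is individually available at least 97% of the time, the composite availability drops to roughly 93%, and the slowest repository alone sets the response time.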
As a result of the problems with distributed searches, repositories have turned to a harvest and index approach as a means to
ensure rapid response, enable full integration of metadata from multiple sources, and provide acceptable uptime. However,
harvesting can be an inefficient process, particularly if the metadata are completely reharvested regularly as a means to
ensure that source changes are propagated into the search results.
Mercury (Devarakonda et al., 2010) is an open-source toolset for metadata authoring, harvesting, indexing, and searching which
implements a variety of harvesting protocols and provides a coherent view of metadata across a range of metadata standards,
including Federal Geographic Data Committee Content Standard for Digital Geospatial Metadata (FGDC CSDGM),
Ecological Metadata Language (EML), Global Change Master Directory’s Directory Interchange Format (GCMD DIF),
Dublin Core, and ISO 19115. Mercury’s architecture includes 1) a harvesting engine to collect various metadata records from
publicly available folders, web sites, FTP sites, and other network-accessible locations; 2) a powerful indexing engine based
on Apache Lucene and Solr that can index billions of records; and 3) a search engine based on a service-oriented architecture,
which can perform searches and distribute results through web user interfaces, web services, RSS feeds, and portlets.
Recently, we added the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH; OAI, 2010; Van de Sompel et al.,
2004) as a means for both harvesting metadata from other repositories and enabling the distribution and reuse of metadata
from repositories using the Mercury toolset. This new feature is an extension to Mercury’s harvester. OAI-PMH is a
standard that is seeing increased use as a means for exchanging structured metadata. OAI-PMH providers must support
Dublin Core as a metadata standard, with other metadata formats as optional. We have developed tools that enable Mercury
to both consume and distribute metadata using OAI-PMH services in any of the metadata formats we support. By
implementing these tools, we seek to significantly lower the technical barriers for users to find and use
relevant data, regardless of the particular repository that is the authority for that data and the associated metadata.
Mercury harvests metadata records from several data providers around the world, builds a centralized index, and makes it
searchable via Mercury’s search interface (see Figure 1). Once the records are harvested by the default harvester, they are
Figure 1. Mercury Metadata Search
The Mercury OAI-PMH Handler is implemented using a Java-based, open source Open Archives Initiative software package
(jOAI), developed by Digital Learning Sciences at the University Corporation for Atmospheric Research (UCAR, 2010).
This package allows metadata records from a file system to be exposed as items in an OAI data repository and made available
to the data provider for harvesting. Remote harvesters that monitor the OAI data repository can effectively mirror the files or
harvest them incrementally. For example, NASA’s Global Change Master Directory (GCMD) PMH handler consumes this
structured metadata via ORNL’s OAI harvester service. Figure 2 describes the high level metadata flow from ORNL to
For a number of reasons, our OAI-PMH provider is generally configured to expose only the metadata for which that
repository is authoritative. A given repository may be harvesting from multiple different locations, for purposes of providing
a coherent view to the user. That repository may not, however, have permission to redistribute the harvested metadata.
Further, redistribution brings in a number of technical challenges. If repository A harvests from B, which harvests from C,
which harvests from D, then an update to a metadata record at repository D will take three update cycles to reach users of
repository A, which could be a significant delay, depending on the harvest frequency. Furthermore, if repository A harvests
from B and from C, while both B and C harvest from D and D harvests from A, avoiding a perpetual update cycle and
determining the authoritative instance for a particular metadata record may prove problematic. While this type of cyclic
harvesting arrangement may seem contrived, the authors are familiar with a number of cases where such situations
could occur.
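Both redistribution problems can be checked mechanically once the harvesting relationships are written down as a graph. The following is a minimal sketch, not part of Mercury: the repository names A to D come from the text, while the function names and the dictionary layout (each repository mapped to the repositories it harvests from) are assumptions made for illustration.

```python
from collections import deque


def update_cycles(harvests_from, repo, source):
    """Fewest harvest hops between `repo` and `source` (breadth-first search).

    Each hop costs one harvest cycle before an update at `source` is visible.
    Returns None when `repo` never receives records from `source`.
    """
    queue = deque([(repo, 0)])
    seen = {repo}
    while queue:
        node, hops = queue.popleft()
        if node == source:
            return hops
        for upstream in harvests_from.get(node, []):
            if upstream not in seen:
                seen.add(upstream)
                queue.append((upstream, hops + 1))
    return None


def has_harvest_cycle(harvests_from):
    """True if some repository harvests, directly or indirectly, from itself."""
    in_progress, finished = set(), set()

    def visit(node):
        in_progress.add(node)
        for upstream in harvests_from.get(node, []):
            if upstream in in_progress:
                return True  # back edge: we returned to a repository on the current path
            if upstream not in finished and visit(upstream):
                return True
        in_progress.discard(node)
        finished.add(node)
        return False

    return any(node not in finished and visit(node) for node in list(harvests_from))


# The linear chain from the text: A <- B <- C <- D.
chain = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
# The cyclic arrangement from the text: A harvests B and C, both harvest D, D harvests A.
loop = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["A"]}

print(update_cycles(chain, "A", "D"), has_harvest_cycle(chain), has_harvest_cycle(loop))
```

Exposing only authoritative records, as described above, removes the redistribution edges from such a graph, so neither the multi-cycle delay nor the loop can arise.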
2.1 OAI-PMH Overview
Metadata are exchanged among the data or service providers as XML documents transmitted over HTTP. There can be
multiple data providers and service providers; each service provider harvests the data from several different data providers.
These transfers are carried out via simple HTTP requests and responses. There are six different types of requests (see
Figure 3). It is not mandatory for the harvester to use all the requests.
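The six request types correspond to the six OAI-PMH verbs: Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, and GetRecord. As a minimal sketch of how a harvester might form these requests, the helper below builds the request URLs with the Python standard library. The base URL, the record identifier, and the helper name `oai_request_url` are hypothetical; only the verb and argument names come from the protocol.

```python
from urllib.parse import urlencode

# The six request types (verbs) defined by OAI-PMH 2.0.
OAI_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH GET request URL.

    `from` is a Python keyword, so callers pass it as `from_`; a single
    trailing underscore is stripped from argument names.
    """
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    for name, value in arguments.items():
        if value is not None:
            params[name.rstrip("_")] = value
    return f"{base_url}?{urlencode(params)}"


# Hypothetical provider address, used only for illustration.
BASE_URL = "https://repository.example.org/oai/provider"

print(oai_request_url(BASE_URL, "Identify"))
print(oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc", from_="2010-05-01"))
print(oai_request_url(BASE_URL, "GetRecord",
                      identifier="oai:repository.example.org:rec-1",
                      metadataPrefix="oai_dc"))
```

A harvester that only mirrors records could get by with Identify and ListRecords alone, which is consistent with the observation that not every request type has to be used.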
Generally, an OAI-PMH provider stores metadata in an autonomous OAI-PMH repository. This repository has a unique,
persistent baseURL, the HTTP address BaseURL(n). To monitor metadata revisions, an OAI-PMH harvester can read
when a record was added, modified, or deleted, which helps in synchronization between the data provider and the harvester. It
typically uses the datestamp for this purpose, which by definition is the date and time of creation or modification of the Dublin
Core metadata record. However, updating a resource does not necessarily result in a modification of the Dublin Core record, thus
the datestamp might not be the most reliable basis for an incremental harvesting approach. In the previous harvesting approach,
incremental harvesting was unavailable, resulting in long network connections and slowing down the processes until entire
Figure 3. OAI-PMH Overview
In general, however, the OAI-PMH protocol provides reliable information on the revision date for metadata records, which
ensures that the harvester only retrieves the records which have changed since the last harvest. This places less strain on both
the PMH provider and the PMH client, and allows for more rapid update cycles.
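To make the incremental case concrete, the sketch below reads the record headers of a ListRecords response and separates updated records from deleted ones. The sample response is hand-written for illustration and the identifiers are hypothetical; the element names, the `status="deleted"` attribute, and the XML namespace follow the OAI-PMH 2.0 schema. Reusing the latest datestamp as the next `from` value is one possible policy, assumed here rather than taken from the Mercury implementation.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample response to a request with from=2010-05-01 (illustrative only).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2010-05-20T12:00:00Z</responseDate>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:rec-1</identifier>
        <datestamp>2010-05-03</datestamp>
      </header>
      <metadata/>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:repository.example.org:rec-2</identifier>
        <datestamp>2010-05-17</datestamp>
      </header>
    </record>
    <resumptionToken/>
  </ListRecords>
</OAI-PMH>
""".strip()


def read_changes(xml_text):
    """Return (updated ids, deleted ids, latest datestamp, resumption token)."""
    root = ET.fromstring(xml_text)
    updated, deleted, stamps = [], [], []
    for header in root.iterfind(".//oai:record/oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        stamps.append(header.findtext("oai:datestamp", namespaces=OAI_NS))
        if header.get("status") == "deleted":
            deleted.append(identifier)
        else:
            updated.append(identifier)
    # An empty resumptionToken element means the list is complete.
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    # ISO 8601 datestamps of equal granularity sort correctly as strings.
    return updated, deleted, max(stamps, default=None), token or None


updated, deleted, latest, token = read_changes(SAMPLE_RESPONSE)
print(updated, deleted, latest, token)
```

Only the two records changed since the `from` date cross the network, and the deleted-record header lets the harvester remove its stale copy without a full reharvest.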
A metadata crosswalk module manages the available metadata formats. This component converts one metadata
format to another. Though it supports multiple formats, Dublin Core is mandatory for interoperability and standards
compliance. This is another new value-added module in the Mercury system. Data providers or data sources can have
more flexibility in choosing the metadata standard. They can concentrate more on the content than the style of presentation.
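A crosswalk of this kind can be pictured as a field mapping followed by serialization. The sketch below is not Mercury's crosswalk module: the internal field names and the mapping table are hypothetical, and only the `oai_dc` and Dublin Core namespace URIs and element names are taken from the published schemas.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)

# Hypothetical mapping from internal field names to Dublin Core elements.
FIELD_TO_DC = {
    "title": "title",
    "originator": "creator",
    "abstract": "description",
    "keywords": "subject",
    "publication_date": "date",
}


def to_dublin_core(record):
    """Convert an internal record (dict) into an oai_dc:dc element.

    Fields without a Dublin Core counterpart in the mapping are dropped,
    which is why Dublin Core output is usually less detailed than the source.
    """
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for field, dc_name in FIELD_TO_DC.items():
        values = record.get(field, [])
        if isinstance(values, str):
            values = [values]
        for value in values:
            ET.SubElement(root, f"{{{DC}}}{dc_name}").text = value
    return root


# Hypothetical internal record; "sensor_model" has no Dublin Core mapping.
record = {
    "title": "Soil moisture observations",
    "originator": "Example Field Station",
    "keywords": ["soil moisture", "ecology"],
    "sensor_model": "XYZ-100",
}

dc_element = to_dublin_core(record)
print(ET.tostring(dc_element, encoding="unicode"))
```

Keeping the mapping in a table rather than in code means that supporting another output format is mostly a matter of adding another table.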
Once metadata are harvested, by whatever means, Mercury then extracts available information from the metadata records, to
form a common representation used as the basis for the Lucene indexing. The full metadata record is also full-text indexed
and available for the end user to examine as part of the search results.
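The difference between the fielded index built from the common representation and the full-text index can be shown with a toy in-memory index. This is an illustration of the idea only; it does not use Lucene or Solr, and the class name, field names, and sample records are hypothetical.

```python
from collections import defaultdict


class TinyIndex:
    """Toy index keeping fielded terms and full-text terms side by side."""

    def __init__(self):
        self.fielded = defaultdict(set)    # (field, term) -> record ids
        self.full_text = defaultdict(set)  # term -> record ids

    def add(self, record_id, common_fields, raw_text):
        # Terms from the common representation are indexed per field.
        for field, value in common_fields.items():
            for term in value.lower().split():
                self.fielded[(field, term)].add(record_id)
        # The complete metadata record is indexed without field context.
        for term in raw_text.lower().split():
            self.full_text[term].add(record_id)

    def search_field(self, field, term):
        return sorted(self.fielded.get((field, term.lower()), set()))

    def search_text(self, term):
        return sorted(self.full_text.get(term.lower(), set()))


index = TinyIndex()
index.add(
    "rec-1",
    {"title": "Bald eagles nesting survey", "keywords": "birds raptors"},
    "Bald eagles nesting survey along the river corridor",
)
index.add(
    "rec-2",
    {"title": "Stadium noise measurements", "keywords": "acoustics"},
    "Sound levels recorded at Eagles home games",
)

print(index.search_text("eagles"))            # matches both records
print(index.search_field("title", "eagles"))  # matches only the survey
```

The full-text query returns both records, while the fielded query on the title returns only the nesting survey, which is the kind of context a general-purpose search engine lacks.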
3. FUTURE DIRECTIONS
While OAI-PMH provides a significant improvement in the ability of repositories to consolidate metadata from multiple different sources,
providing users at least discovery-level metadata to enable scientific research, users are still likely to need to search multiple
repositories. And once records have been located, the data access mechanisms for various repositories can be quite different.
Some metadata specifications, such as ISO 19115 and GCMD DIF, provide means for data providers to indicate data access
services for standard methods, such as Data Access Protocol (DAP) methods or Open Geospatial Consortium (OGC) web
services. As data providers expand metadata to provide such information, and as metadata becomes more machine
processable, users will be better able to directly access data without having to have as much understanding of differing data
locations and access methods. The Mercury development team is actively engaged in working with data providers to enable
When working across communities of practice, there are also issues where different terms are used for the same concept and
the same term is used for different concepts. Semantic mediation is one method for addressing this type of problem.
4. CONCLUSION
The OAI-PMH enhancement was a useful addition to the harvesting protocols in use for distributing metadata. Metadata
exchanges between agencies are now being carried out more readily. In our implementation, which is based on the standard
and on open-source tools, we are able to supply metadata in multiple formats, based on transformations from our internal
metadata structure. This enables distribution to multiple collaborating repositories in an efficient method and one which best
OAI-PMH focuses on the transfer of metadata between data providers, and other common services like metadata searching
are outside its scope. Integration of Mercury with OAI-PMH is filling a key gap in searching, sharing and obtaining
spatiotemporal data across the scientific community, thus boosting its overall performance and usage.
While the specification requires that Dublin Core metadata be an option, this can be a very limited metadata structure,
particularly for complex scientific datasets. Metadata exchanges are asynchronously carried out via simple HTTP requests
5. ACKNOWLEDGMENTS
Mercury development has been funded by multiple different projects from the National Aeronautics and Space Administration
(NASA), the United States Geological Survey (USGS), and the Department of Energy (DOE). Oak Ridge National Laboratory
is managed by UT-Battelle, LLC, for the U.S. Department of Energy under contract DE-AC05-00OR22725.
6. REFERENCES
Devarakonda, R., Palanisamy, G., Wilson, B. E., and Green, J. M. (2010) Mercury: reusable metadata management, data
NAS (2010) Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age (ISBN 0-309-13685-
NSTC (2009) “Harnessing the Power of Digital Data for Science and Society” Report of the Interagency Working Group on
Digital Data to the Committee on Science of the National Science and Technology Council. Washington, D.C. Available
at: http://www.nitrd.gov/About/Harnessing_Power_Web.pdf
OAI (2010) Open Archives Initiative Protocol for Metadata Harvesting. Interoperability through Metadata Exchange.
Suleman, H., and Fox, E. A. (2001) “A Framework for Building Open Digital Libraries” D-Lib Magazine 7#12.
Lynch, C. A. (1997) “The Z39.50 Information Retrieval Standard. Part I: A Strategic View of Its Past, Present and Future.” D-Lib
UCAR (2010) jOAI software, developed by Digital Learning Sciences (DLS) (http://www.dlsciences.org/) at the University
Corporation for Atmospheric Research (http://www.ucar.edu/). Retrieved May 2010. Available at:
http://www.dlese.org/dds/services/joai_software.jsp
Van de Sompel, H., et al. (2004) “Resource Harvesting within the OAI-PMH Framework,” D-Lib Magazine 10#2.