
Int J Digit Libr (2008) 9:163-171
DOI 10.1007/s00799-008-0046-9

REGULAR PAPER

Knowledge discovery in digital libraries of electronic theses and dissertations: an NDLTD case study
W. Ryan Richardson · Venkat Srinivasan · Edward A. Fox

Published online: 7 October 2008
© Springer-Verlag 2008

Abstract Many scholarly writings today are available in electronic formats. With universities around the world choosing to make digital versions of their dissertations, theses, project reports, and related files and data sets available online, an overwhelming amount of information is becoming available on almost any particular topic. How will users decide which dissertation, or subsection of a dissertation, to read to get the required information on a particular topic? What kind of services can such digital libraries provide to make knowledge discovery easier? In this paper, we investigate these issues, using as a case study the Networked Digital Library of Theses and Dissertations (NDLTD), a rapidly growing collection that already has about 800,000 Electronic Theses and Dissertations (ETDs) from universities around the world. We propose the design for a scalable, Web Services-based tool, KDWebS (Knowledge Discovery System based on Web Services), to facilitate automated knowledge discovery in NDLTD. We also provide some preliminary proof-of-concept results to demonstrate the efficacy of the approach.

Keywords Large digital libraries · NDLTD · Knowledge discovery · Web services · Concept maps · Electronic theses and dissertations · Cross-language information retrieval

1 Introduction

Since the late 1990s many universities have required students to submit electronically the theses and dissertations produced as part of their graduate studies, to be archived by their home institution. NDLTD [1] was formed in 1996, and incorporated in 2003, to support the collection, dissemination, and searching/browsing of ETDs. Currently NDLTD has over 70 fully participating members (see Table 1), and contains ETDs in at least 12 languages. Since the global yearly production of theses and dissertations is well over 100,000 (adding at least 100 gigabytes of text information, and terabytes of multimedia content), and since more and more universities are involved in ETD initiatives, the collection is poised to grow dramatically. ETDs today often contain much valuable information that is not in text form, such as full-color images and links to multimedia files. While NDLTD has made access to ETDs much easier than before, there are many hurdles to overcome in order to make it more useful. A large number of ETDs available on a particular topic can make it difficult for users to determine which dissertations (or sections thereof) to read. Add to this the fact that ETDs generally contain a lot of content, typically 100 or more pages, and thus browsing through them can be time consuming. Moreover, ETDs are written in many different languages, and so it can be difficult for a user to determine if a given ETD (written in a language the user is not familiar with) should be translated to meet his or her information needs. In an effort to provide additional services to make ETDs more widely usable, we propose the design for a web services-based knowledge discovery system, KDWebS, in order to address these issues in the context of NDLTD. We also present some preliminary proof-of-concept results, especially to demonstrate the scalability of this approach.

W. R. Richardson · V. Srinivasan · E. A. Fox (corresponding author)
Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA
e-mail: fox@vt.edu
W. R. Richardson e-mail: ryanr@vt.edu
V. Srinivasan e-mail: svenkat@vt.edu


Table 1 List of 12 leading universities/organizations from around the world participating in NDLTD, with counts from 2008

Institution: No. of ETDs
OCLC: 287153
Australasian Digital Theses Program: 152552
IBICT Brazilian ETDs: 71296
Library & Archives Canada ETDs Repository: 51489
M.I.T. Theses and Dissertations: 28123
OhioLINK: 16473
Hong Kong University: 16088
University of Tennessee Library: 14271
Virginia Tech: 11254
NSYSU: 11159
CCSD thèses-EN-ligne, France: 10633
Uppsala University: 9575


The approach and the underlying principles can easily be adapted to other information retrieval and digital library systems as well.

2 Research underpinnings

2.1 Document summarization

Attempts at automatic text summarization date back to the 1950s. They have shown that high-quality text summarization is difficult since it requires discourse understanding, abstraction, and language generation [2]. Most modern systems have adopted a simplified approach, which involves identifying spans of related text in the document, and extracting sentences representative of each text span, producing what is called an extractive summary. In this context summarization can be defined as the selection of a subset of the document sentences which is representative of the document content [3]. This is done by ranking document sentences based on relevance to a particular topic, and also by assessing redundancy against the other sentences in the document. The selected sentences are the ones with the highest relevance and lowest redundancy. This is often computed using a Maximal Marginal Relevance (MMR) measure, first used in this context by Carbonell and Goldstein [4]. The text spans are usually sentences, but paragraphs also have been considered [5]. While from the reader's perspective, the quality of an automatically generated abstract summary might not be as good as that of an extractive summary (which is in the author's own words), it is often good enough for the reader to understand the main ideas of the document [3]. However, McDonald and Chen [6] have noted that even with indicative summaries (which only aim to indicate the general topic of a document), poor readability can detract from information content.

2.2 Concept maps and their applications

As envisaged by Novak [7], concept maps are a special case of a semantic network [8], differentiated from semantic networks in that their links can have any labels or semantics that the author wishes. There are many studies on the use of concept maps as learning tools. The key idea is that the learner assimilates new concepts and propositions into his or her existing conceptual and propositional frameworks. Malone and Dekkers [9] aptly described concept maps as windows to the mind of students: for seeing in (by the teacher and other students), for seeing out (by the student), and for reflecting on one's own perceptions (by everybody). Concept maps encourage students to use meaningful-mode learning patterns [10]. Wallace and Mintzes have shown that they can be useful for demonstrating the changes that occur in a student's knowledge structure as students integrate new knowledge with existing knowledge [11]. Concept maps are now widely used as a learning tool in the United States and by Spanish speakers in Latin America [12]. One possible reason for the effectiveness of concept mapping in promoting learning and understanding is that it takes advantage of the fact that a large percentage of the population are visual learners (between 52 and 94%, depending on the discipline [13]). Studies have shown that presenting information to learners in their preferred learning style can improve comprehension [14]. The first work involving the automatic generation of concept maps was done by Gaines and Shaw in 1994 [15]. Their system, called GNOSIS, produced concept maps purely based on term co-occurrence, a technique commonly used in information retrieval systems [16,17]. Their system was able to find sets of related terms that would have been discovered by a careful reader, but which typically differ from the list of important topics found in the paper's table of contents. Since then, other teams, notably Rajaraman and Tan [18], have produced semantic maps which are similar in some ways to concept maps. The main difference between these systems and the one we present is that we make a serious effort to provide link labels, not just node labels, using information that is based largely on Natural Language Processing (NLP). The concept map generation draws ideas from research in document summarization (as described in Sect. 2.1) to ensure that a good representative summary of the dissertation is presented to the user so as to facilitate relevance judgment. In some cases, the generated concept maps also link back to the original dissertation so that the user can simply navigate to, and just read, the relevant portion of the dissertation (see Sect. 4).
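The MMR-style extractive selection described in Sect. 2.1 can be sketched as follows; the similarity function (word overlap) and the weighting are simplified stand-ins for illustration, not Carbonell and Goldstein's exact formulation, and the sample sentences are invented.

```python
def word_set(sentence):
    return set(sentence.lower().split())

def mmr_select(sentences, query, k=2, lam=0.8):
    """Greedy Maximal Marginal Relevance extraction: at each step pick the
    sentence that balances relevance to the query against redundancy with
    already-selected sentences (Jaccard word overlap as a crude similarity)."""
    def sim(a, b):
        wa, wb = word_set(a), word_set(b)
        return len(wa & wb) / max(1, len(wa | wb))
    selected = []
    remaining = list(sentences)
    while remaining and len(selected) < k:
        def mmr(s):
            redundancy = max((sim(s, t) for t in selected), default=0.0)
            return lam * sim(s, query) - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected

sents = [
    "concept maps summarize electronic theses",
    "concept maps summarize electronic theses and dissertations",
    "the weather was pleasant in blacksburg",
]
picked = mmr_select(sents, "summarize theses", k=2)
```

With a lower lambda, the redundancy penalty dominates and the near-duplicate second sentence would be skipped; the trade-off is exactly what the MMR measure controls.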


Fig. 1 Architecture diagram for the web services-based KDWebS system. The diagram shows NDLTD feeding an OAI-PMH harvester; an Access Layer of web services and a Web UI handling ETD and data requests, polls, and alerts; a Logic Layer comprising the CMap creator, citation parser, recommender system, metadata parser, and storage/access routines; and a Storage Layer holding processed ETD data (concept maps, citations, metadata, and other data)
2.3 Web services

The Web services paradigm [19] came into existence to facilitate interoperability between machines over a network. Web Services can simply be, and often are, a collection of APIs hosted on a remote system that are callable over the Internet. The caller can be using a different programming language from the one in which the Web Services are implemented, or can be calling from a different operating system. Web Services thus also attempt to achieve standardization, by facilitating inter-working of vastly different software systems. Web Services-based architectures have been applied to some large digital library systems of late, most notably to ensure component reuse and standardization and to achieve semantic interoperability. A classic case in point is the development of a Service-Oriented Architecture (SOA) for CiteSeer [20], wherein many different functionalities provided by CiteSeer, like citation extraction, are implemented as remotely callable Web Services. We propose a similar approach for NDLTD, specifically from the point of view of developing and deploying Web Services for electronic dissertations. NDLTD-supported services provide basic search and browse functionality. Our approach can provide considerable added value to NDLTD by creating and deploying services that make the knowledge discovery task easier, whilst also ensuring that these services are amenable to being used by others.
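As a minimal illustration of this paradigm (not the actual KDWebS or CiteSeer interface), the hypothetical sketch below exposes a toy metadata-extraction function as a remotely callable service using Python's standard-library XML-RPC support; a client in any language with an XML-RPC library could invoke it over HTTP.

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Toy service function: in a KDWebS-like system this might be citation or
# metadata extraction; here we just report the first non-blank line as the title.
def extract_metadata(etd_text):
    lines = [ln.strip() for ln in etd_text.splitlines() if ln.strip()]
    return {"title": lines[0] if lines else ""}

# Host the function as a web service on an OS-assigned local port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(extract_metadata)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# A client (possibly written in another language) calls it over the network.
proxy = ServerProxy("http://127.0.0.1:%d/" % port)
result = proxy.extract_metadata("A Study of ETD Archives\nChapter 1 ...")
server.shutdown()
```

The point is the decoupling: the caller only sees the service's interface, not its implementation language or host platform.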

3 System architecture

We propose a multi-layered architecture for the KDWebS system (see Fig. 1). The Storage Layer is the storehouse for different kinds of data extracted from ETDs in the NDLTD collection. The kinds of data that can be housed are concept maps, citations, metadata related to dissertations, and other forms of data like images, multimedia files, scientific data sets, etc. The Logic Layer consists of parsers to extract different kinds of information from ETDs. The CMap creator consists of parsers to understand different ETD file formats (e.g., .pdf or .ps) and generate concept maps accordingly. The Citation and Metadata parsers respectively extract citations (references) in different formats, and metadata (author, institution, keywords, etc.), from ETDs. A Recommender System is also in place, which would provide a list of dissertations of possible interest based on a user query. The Access Layer hosts web services which facilitate interaction with the system in an automated fashion. The web services make the KDWebS system amenable to access through software agents, like the OAI-PMH harvester [21,22], which can then harvest the NDLTD metadata. The interaction with NDLTD is also mediated through these Web Services. The system polls the NDLTD Union Catalog (the superset of all ETDs contained at all NDLTD member sites) periodically to check for additional or changed ETDs, or other NDLTD software can itself alert the system of any new works that become available. The KDWebS




Fig. 2 Plot of processing time versus size of ETD

system then obtains the dissertation and parses it to extract the relevant information. Users can interact with the system either through the Web user interface (from some NDLTD-related system or at our KDWebS system) or by calling the Web Services. We have implemented the concept mapping system, and conducted user testing of it for both English and Spanish (see Sects. 4 and 5). This represents an important first step towards implementing a complex digital library system like KDWebS. Our preliminary results with the concept mapping system are encouraging and we plan to implement the other KDWebS services too. In the next section we describe our concept mapping system in detail.
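The periodic polling of the Union Catalog described above can be sketched as a datestamp comparison against a previously harvested snapshot; the record structure below is a simplified stand-in for OAI-PMH metadata, not the actual NDLTD schema.

```python
# Simplified stand-in records: id -> last-modified datestamp (OAI-PMH exposes
# such per-record datestamps; the real metadata is much richer).
def find_new_or_changed(previous, current):
    """Return ids of ETDs that must be (re)fetched after polling the catalog."""
    changed = []
    for etd_id, datestamp in current.items():
        if previous.get(etd_id) != datestamp:
            changed.append(etd_id)  # new record, or datestamp moved forward
    return sorted(changed)

previous = {"etd-1": "2008-01-10", "etd-2": "2008-02-01"}
current = {"etd-1": "2008-01-10", "etd-2": "2008-03-15", "etd-3": "2008-04-02"}
print(find_new_or_changed(previous, current))  # etd-2 changed, etd-3 is new
```

An alert-driven variant would push the same id list to the system instead of having it diff snapshots.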

4 Concept mapping system

Our system works by taking a collection of ETDs from NDLTD, which are converted to plain text via pdftotext, and then processed by custom scripts to remove common PDF conversion errors. At this point we are only using born-digital ETDs, so we do not have to deal with OCR errors. The plain-text ETDs are then fed into a natural language processing tool, called RelEx [23]. In initial experiments to produce concept maps, we tried using minimal NLP knowledge, and relied almost entirely on mathematical measures such as the t-score, χ2, and association rules to find keywords in the ETDs [24]. However, the results were less than satisfactory. Therefore we adopted the approach of using an ontology of terms related to a specific field, and doing more NLP processing. Currently we only process ETDs in the computing field, since the RelEx tool uses an ontology of computing science developed at Villanova University as part of an ACM curriculum effort [25]. To support another field of study, we could substitute an ontology for that field into RelEx. RelEx performs part-of-speech tagging, dependency parsing using a link grammar [26], anaphora resolution using the

Hobbs algorithm [27], and named entity recognition on the ETDs. Named entity recognition of people, organizations, places, and computing concepts is done using the GATE toolkit [28,29]. RelEx and GATE communicate with each other via IBM Unstructured Information Management Architecture (UIMA) components [30]. The concept maps are displayed in CMapTools [31], which we selected since it provides many useful features. Examples are a list view, in which the concept map is flattened so that it can be viewed as a series of propositions, and the ability to embed tables, figures, and multimedia objects in the nodes of the map. Using a tool like TableSeer [32] and the capabilities of CMapTools, we can extract tables from the ETDs and insert them directly into the concept maps to give more information about a particular node. Figure 2 shows the computation time as a function of the size of the ETD. The relation is roughly linear. Note that the largest ETD shown on this graph is over 300 KB, which is very large (the average size of an ETD in our collection is 188 KB). We combine this with information extracted directly from the ETDs, such as title, chapter headings, and section headings, to produce concept maps for each ETD. Due to the size of the concept maps produced, our software produces an overview map providing a link to a concept map for each chapter. We selected this arrangement based on a user study comparing concept maps produced to represent an entire document, versus using multiple concept maps which each represented a part of a document (in our case, chapters). Users preferred the concept maps based on chapters by a statistically significant margin [24]. Thus, Fig. 3 is the overview map produced for a thesis that has five chapters. There also is a map produced for each of those chapters. Figure 4 gives the concept map for Chapter 3. The chapter-level concept maps include information about where in the original ETD each displayed relationship came from.
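As an illustrative sketch (the names, structures, and example propositions are our own inventions, not the system's actual code), chapter-level maps plus an overview map can be represented as labeled propositions that carry a link back to the source sentence, supporting the hover behavior described here:

```python
from collections import OrderedDict

def build_maps(chapters):
    """chapters: OrderedDict of chapter title -> list of
    (concept, link_label, concept, source_sentence) tuples."""
    chapter_maps = {}
    for title, props in chapters.items():
        # Each proposition keeps its source sentence so the UI can show
        # the original ETD text (e.g., on hover over the link label).
        chapter_maps[title] = [
            {"from": a, "link": label, "to": b, "source": sent}
            for (a, label, b, sent) in props
        ]
    # Overview map: one node per chapter, each linking to its chapter map.
    overview = [{"from": "Thesis", "link": "has chapter", "to": t}
                for t in chapters]
    return overview, chapter_maps

chapters = OrderedDict()
chapters["Chapter 3"] = [
    ("crawler", "discovers", "hidden-web databases",
     "The crawler discovers hidden-web databases automatically."),
]
overview, maps = build_maps(chapters)
```

Keeping the source sentence alongside each triple is what makes the link-back navigation to the relevant portion of the dissertation cheap to implement.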
By hovering over the link text, users are shown the original sentence in the ETD. Taking the concept maps one step further, we have used machine translation to translate the maps into Spanish, so that Spanish-speakers who cannot (easily) read English can look at them to determine if the original ETDs are relevant. Since off-the-shelf machine translation tools will not support the vocabulary for technical CS-related ETDs, we have used text mining to search for Spanish translations of English phrases in a Spanish ETD collection [33]. Figure 5 illustrates the results of our processing, showing the Spanish version of the English concept map that was presented in Fig. 4. Since the translation is done automatically, it does contain errors, so we conducted a user study to determine how badly those errors degrade understanding. This cross-language (English-to-Spanish) study was conducted at the University of the Americas, Mexico, in which we compared concept maps automatically generated



from ETDs, half of which had been translated by a human translator, and the other half by our machine translation approaches. Both types of concept maps were found to be more effective than translated abstracts in helping users make relevance determinations [34].

5 English-to-Spanish experiment

To test our key hypotheses about concept maps, we conducted a cross-language experiment with the Universidad de las Américas (UDLA) in Puebla, Mexico [3]. This cross-language experiment was a study involving automatically generated concept maps and also human-written abstracts, conducted with students in a class taught by Dr. Alfredo Sánchez. For the experiment, our three hypotheses were as follows:

1. Automatically generated concept maps can be a useful summary of an ETD.
2. Automatically generated concept maps can augment abstracts in helping subjects determine if a document is relevant to an information need.
3. Automatically generated concept maps can be translated via MT well enough that they can be used as a cross-language discovery tool.

5.1 Design of experiment

Twenty-two subjects from UDLA participated. There were six treatment conditions (see Table 2), with all subjects seeing all six conditions, making it a fully within-subjects design. Subjects were asked six questions, presented in varying order, with one question in each treatment condition. The experimental materials were based on 30 dissertations from the Virginia Tech CS-ETD collection, all originally in English. For 15 of these, the abstracts and concept maps were translated from English to Spanish by a professional translator familiar with IT/computing. For the other 15, the abstracts and concept maps were processed using machine translation software. For the machine translation condition, the abstracts were translated solely by Systran. For this experiment, we decided not to use the CMapTools representations of the concept maps, but instead chose a simpler implementation using AT&T's Graphviz software. This produced JPEG images, embedded in HTML pages. The main reason for this was that the subjects at UDLA had never used CMapTools before, so they would require a training period before they could use the tool. Since we could not be physically present to conduct a training session and answer the questions they would have, we decided that using a more familiar JPEG/HTML interface would avoid errors caused by lack of familiarity with the interface. Another difference between the concept maps used in this experiment and in the English-only experiment was that these did not include the hover-text feature. Although it is possible to make clickable regions of a JPEG image which point to the sentence from the original document (in fact we implemented such a feature as a test), we opted not to do this since the original document is in English, and we did not have sufficient resources available to pay to have all of these sentences translated into Spanish. The design of this cross-language experiment required that everything shown to the subjects be in Spanish, so that subjects who can read English fluently did not have an advantage.

5.2 Experimental conditions

Subjects were asked six questions about which of five dissertations are relevant to a particular topic. A domain expert (a graduate student in the Digital Library Research Laboratory at VT) came up with these questions, based on the 30 English ETDs, and listed which ETDs he considered relevant. Two more domain experts made their own independent relevance determinations about these dissertations for comparison purposes. The domain experts looked only at the English

Fig. 3 Overview concept map generated for a sample thesis (by Daniel Rocco of Georgia Tech)


Fig. 4 Concept map for chapter 3 of Rocco's thesis. The textbox shows the sentence in the thesis where the relationship was found


Fig. 5 Automatic Spanish translation of chapter 5 of Rocco's ETD

versions of the documents, plus the concept maps. In order for the Mexican students to answer the questions, for each question they were provided one of six types of summaries of the dissertations. Each subject answered six questions, one in each of the treatment conditions. For each question, they were allowed to pick from five dissertations. For instance, they were given a relevance question, were presented with five machine-translated concept maps (condition B2 in Table 2), one per dissertation, and were asked which dissertations are relevant to that particular question based just on the concept maps.
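The JPEG/HTML presentation described in Sect. 5.1 can be illustrated with a small sketch that emits Graphviz DOT text for a translated concept map (DOT files can then be rendered to JPEG with the standard `dot` tool); the map content below is invented for illustration.

```python
def concept_map_to_dot(propositions):
    """Build Graphviz DOT text from (concept, link_label, concept) triples.
    Rendering to an image would then be e.g.: dot -Tjpg map.dot -o map.jpg"""
    lines = ["digraph conceptmap {", "  node [shape=box];"]
    for a, label, b in propositions:
        lines.append('  "%s" -> "%s" [label="%s"];' % (a, b, label))
    lines.append("}")
    return "\n".join(lines)

# Invented Spanish-translated propositions, for illustration only.
dot = concept_map_to_dot([
    ("rastreador", "descubre", "bases de datos"),
])
print(dot)
```

A static image generated this way trades CMapTools' interactivity for an interface the subjects already knew, which is exactly the trade-off the experiment design made.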

Each question had between one and three ETDs relevant to it and the same five ETDs as possible answers. The six sets of ETDs are disjoint (6 × 5 = 30, making a total of 30 ETDs). Thus a subject never saw abstracts/concept maps of the same ETD for different questions. Each subject was presented with each ETD exactly once. We randomized (using a computer program) the presentation order of the six questions and six treatment conditions into seven possible presentation orders. Thus each presentation order was seen by about three subjects. These were randomized with the constraint that questions 1, 2, 3 always relate to treatment conditions A1, B1, or


Table 2 Cross-language experiment treatment conditions

Experimental conditions (Presentation type / Translation type)
1. Human translated abstract (A1)
2. Human translated concept map (B1)
3. Human translated abstract + human translated concept map (C1)
4. Machine translated abstract (A2)
5. Machine translated concept map (B2)
6. Machine translated abstract + machine translated concept map (C2)

Table 3 ANOVA main effect results

Source                                Test type            Significance
PresentationType                      Sphericity assumed   0.002
                                      Lower-bound          0.013
TranslationType                       Sphericity assumed   0.171
                                      Lower-bound          0.171
PresentationType × TranslationType    Sphericity assumed   0.935
                                      Lower-bound          0.799

Computed using alpha = 0.05. Tests of within-subjects effects; measure: agreement

C1 (see Table 2 above), while questions 4, 5, 6 always relate to conditions A2, B2, and C2. This is necessary because, as explained earlier, funding limits led to a design in which machine translation conditions were selected. The experiment was designed to allow a within-subjects ANOVA on the results. All subjects saw six experimental conditions. The numbers compared in the ANOVA were the numbers of agreements between the consensus of the three expert judgments and the subjects' judgments. We were careful to select question/ETD sets in which there was perfect inter-judge agreement on which ETDs were relevant and which were not. For each question, subjects were presented with five dissertations, so agreement could be either 0 (did not agree with the judges on the relevance of any dissertation), 1, 2, 3, 4, or 5 (agreed with the experts perfectly on relevance). Since the English proficiency of the students could conceivably affect their answers, the subjects were given a pre-task question asking how many years they studied English, at the secondary school, university, and private language school levels. We also asked them to rate their English proficiency on a scale from 1 to 5. Further, we asked the students to list how many years they had studied CS, and whether they are familiar with digital libraries (the area to which most of the ETDs are related). Years of computer science study, years of English study, and self-assessment of English proficiency were treated as covariates.
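The 0-5 agreement score can be computed directly; the helper and judgments below are hypothetical stand-ins used only to show the scoring rule.

```python
def agreement_score(expert_relevant, subject_relevant, candidates):
    """Count the candidates on which the subject's relevance judgment
    matches the expert consensus (0-5 when there are five dissertations)."""
    score = 0
    for etd in candidates:
        # A match means both sides called the ETD relevant, or both did not.
        if (etd in expert_relevant) == (etd in subject_relevant):
            score += 1
    return score

candidates = ["e1", "e2", "e3", "e4", "e5"]
experts = {"e1", "e4"}          # consensus: relevant ETDs
subject = {"e1", "e2", "e4"}    # subject also picked e2 (a disagreement)
print(agreement_score(experts, subject, candidates))  # 4
```

These per-question scores are the dependent values the within-subjects ANOVA compares across treatment conditions.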

5.3 ANOVA results

As stated before, 21 participants answered all of the six relevance questions (N = 21). An ANOVA was performed testing two factors, Presentation Type and Translation Type. The three levels of Presentation Type were Abstract, Concept Map, and Abstract plus Concept Map. The two levels of Translation Type were human and machine. The ANOVA revealed that the Presentation Type had a significant effect on expert/subject agreement in relevance judgments. The Translation Type did not have a significant effect, nor did the Presentation Type × Translation Type interaction. See the ANOVA results in Table 3.

Note that there was no significant difference between the levels of Translation Type (human and machine). Thus it appears that our machine translation is good enough for Spanish speakers to get the gist of the concept map for making relevance determinations. Concerning the covariates, the PresentationType × YearsEnglish covariate relationship was significant, the PresentationType × SelfAssessEng relationship was significant, and the TranslationType × YearsEnglish relationship was significant at the p = 0.05 level. Years of studying CS did not have a significant covariate effect with any of the main effects.

6 Discussion

6.1 Scalability analysis

As can be seen in Fig. 2, with our current test code the average time for producing a concept map for an ETD (average size for our collection is 2.4 MB for the PDF files, or 188 KB when converted to plain text) is 15 to 20 min, on a single-processor Pentium 4 class machine. Due to the time required to process one ETD, we plan to pre-compute the concept maps for the ETD collection beforehand, instead of doing it on demand in response to user requests. Obviously we will have to speed up our code considerably if we plan to use it for a collection with millions of book-size documents. We are working on a new version of the code which eliminates the UIMA components and interfaces RelEx and GATE directly. This will save memory as well as speed up ETD processing by up to an order of magnitude. However, even if we can process an entire ETD in 1 min, it would take about 2 years to process 1 million ETDs on a uniprocessor machine. Thus we need to consider other techniques, such as parallelization. Gross-level parallelization is straightforward, since we divide each ETD into chapters, and process each chapter separately. Thus for an ETD with 10 chapters, we could simply run 10 copies of our code on 10 machines and get a 10× speedup. Reaching greater levels of parallelization, however, will be difficult due to parallelization limitations of GATE. Thus we can consider using larger numbers of machines to overcome this limitation. Fortunately, Virginia Tech has a number of high-speed clusters, the most well-known of which is System X [35], consisting of over 2000 processors, so we could analyze about 200 ETDs (of 10 chapters length) at the same time.

6.2 Searching

An ETD search system could work as follows:

1. Users enter some keywords into a search box; perhaps they could specify key propositions in which they have interest.
2. These keywords are matched against the concept map collection using a network search algorithm like that used in the MARIAN system [36]. This returns a small set of ETDs (10 or so).
3. The concept maps for these 10 ETDs are then displayed to the user, who can peruse them and determine which ETDs she would like to download, or have translated if the original ETDs are not in her preferred language.
4. Optionally, other search methods could be deployed, for example searching against the metadata in the NDLTD Union Catalog, or employing other services supported by NDLTD (see http://www.ndltd.org/browse.en.html). From one of these, or several, or a fusion of the results of a group, additional ETDs related to the user's query could be identified. Then, concept maps for those could be provided, as in step 3 above.
5. Many standard digital library systems, like DSpace [37], lack support for such packaging of a variety of services, so our KDWebS system would need to support enhanced user-centered searching covering all these functions.
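The matching in steps 1-3 of the search workflow can be sketched as a simple proposition-overlap ranking; this is a toy stand-in for the MARIAN network search algorithm, with invented concept maps.

```python
def rank_etds(query_terms, etd_maps, k=10):
    """Score each ETD by how many query terms appear among the node and
    link labels of its concept map; return the top k matching ETD ids."""
    terms = {t.lower() for t in query_terms}
    scores = {}
    for etd_id, propositions in etd_maps.items():
        words = set()
        for a, label, b in propositions:
            for text in (a, label, b):
                words.update(text.lower().split())
        scores[etd_id] = len(terms & words)
    # Highest score first; ties broken by id; drop ETDs with no matches.
    ranked = sorted(scores, key=lambda e: (-scores[e], e))
    return [e for e in ranked[:k] if scores[e] > 0]

etd_maps = {
    "etd-7": [("crawler", "discovers", "hidden-web databases")],
    "etd-9": [("ontology", "supports", "concept extraction")],
}
print(rank_etds(["crawler", "databases"], etd_maps))  # ['etd-7']
```

The returned ids would then drive step 3: fetching and displaying the pre-computed concept maps for the short list.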

7 Conclusion

Digital libraries may serve individuals, members of a profession, or even global communities. The Networked Digital Library of Theses and Dissertations (NDLTD) aims to support graduate students, as well as researchers at all levels who are interested in electronic theses and dissertations (ETDs), worldwide. Though its current collection now stands at about 800,000 works, the collection should grow rapidly, ultimately with an increase of 100,000 or more works per year. While today most works are PDF files that are roughly a megabyte in size, already there are a substantial number of works with large files containing images, audio recordings, videos, and data sets. As progress towards institutional repositories grows, so should the size of ETDs, as researchers better understand the value of preserving a full packaging of scholarly data and reporting. Thus, it is reasonable to assume that the NDLTD collection eventually will grow at the rate of a terabyte per year. While this is not enormous when compared to video collections, or astrophysics data, it is large when compared to many of the projects underway to provide access to book collections. Since NDLTD-related efforts are undertaken in an open manner, researchers should find our work on this collection to be of keen interest. When one considers how to manage such a large collection, for serious in-depth scholarly study, that requires substantial processing of each work, one sees that there are many very large digital library challenges. This paper surveys our preliminary work in that regard, our Web Services-based approach to developing a suitable architecture, and some of the specific services under development. Since applying NLP methods, especially summarization, e.g., to prepare multilingual concept maps, requires very large amounts of processing, we have described solutions that are scalable. We hope this paper can help the digital library community in its planning, engagement in advanced research, specification of architectures, and development of future systems and services. We hope our preliminary work and the proposed KDWebS system can provide guidance in those enterprises.

References

1. Fox, E., Hall, R., Kipp, N.: NDLTD: preparing the next generation of scholars for the information age. The New Review of Information Networking (NRIN), pp. 59-76 (1997)
2. Sparck-Jones, K.: Discourse modeling for automatic summarizing. University of Cambridge Computer Laboratory, Cambridge, Technical Report 290 (1993)
3. Amini, M., Gallinari, P.: The use of unlabeled data to improve supervised learning for text summarization. In: Proceedings of SIGIR'02, Tampere, Finland, pp. 105-112 (2002)
4. Carbonell, J., Goldstein, J.: The use of MMR, diversity-based reranking for reordering documents and producing summaries. In: Proceedings of SIGIR, Melbourne, pp. 335-336 (1998)
5. Strzalkowski, T., Wang, J., Wise, B.: A robust practical text summarization system. In: Proceedings of the Fifteenth National Conference on AI, pp. 26-30 (1998)
6. McDonald, D., Chen, H.: Using sentence-selection heuristics to rank text segments in TXTTRACTOR. In: Proceedings of International Conference on Digital Libraries, Portland, OR, pp. 28-35 (2002)
7. Novak, J.D., Gowin, D.B.: Learning How To Learn. Cambridge University Press, Cambridge, UK (1984)
8. Sowa, J.F.: Semantic networks. In: Shapiro, S.C. (ed.) Encyclopedia of Artificial Intelligence. Wiley, New York (1992)
9. Malone, J., Dekkers, J.: The concept map as an aid to instruction in science and mathematics. Sch. Sci. Math. 84(3), 220-231 (1984)
10. Mintzes, J.J., Wandersee, J.H., Novak, J.D.: Assessing Science Understanding: A Human Constructivist View. Academic Press, San Diego (2000)
11. Wallace, J., Mintzes, J.J.: The concept map as a research tool: exploring conceptual change in biology. J. Res. Sci. Teach. 27(10), 1033-1052 (1990)
12. Cañas, A.J., Novak, J.D.: Facilitating the adoption of concept mapping using CmapTools to enhance meaningful learning. Knowledge Cartography: Software Tools and Mapping Techniques (2008, to appear)


13. Felder, R.M., Spurlin, J.: Applications, reliability and validity of the Index of Learning Styles. Int. J. Eng. Educ. 21(1), 103-112 (2005)
14. Zywno, M.S., Waalen, J.K.: The effect of hypermedia instruction on achievement and attitudes of students with different learning styles. In: Proceedings of American Society for Engineering Education Conference, Albuquerque (2001)
15. Gaines, B.R., Shaw, M.L.G.: Using knowledge acquisition and representation tools to support scientific communities. In: AAAI'94: Proceedings of the Twelfth National Conference on Artificial Intelligence, Menlo Park, California, pp. 707-714 (1994)
16. Sparck-Jones, K.: Automatic Keyword Classification for Information Retrieval. Archon Books, London (1971)
17. Callon, M., Law, J.: Mapping the Dynamics of Science and Technology. MacMillan, Basingstoke, UK (1986)
18. Rajaraman, K., Tan, A.: Knowledge discovery from texts: a concept frame graph approach. In: Proceedings of International Conference on Information and Knowledge Management, McLean, VA, pp. 669-671 (2002)
19. W3C Web Services Activity Homepage. http://www.w3.org/2002/ws/ (2008)
20. Petinot, Y., Giles, C.L., Bhatnagar, V., Teregowda, P.B., Han, H., Councill, I.G.: A service-oriented architecture for digital libraries. In: Proceedings of the 2nd International Conference on Service Oriented Computing, pp. 263-268 (2004)
21. Lagoze, C., Van de Sompel, H.: The Open Archives Initiative: building a low-barrier interoperability framework. In: Proceedings of the First ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 54-62 (2001)
22. Lynch, C.A.: Metadata harvesting and the Open Archives Initiative. ARL: A Bimonthly Report, no. 217, pp. 1-9 (2001)
23. Ross, M., Pinto, H., Pennachin, C., Goertzel, B., Looks, M., Senna, A., Silva, W.: INLINK: an interactive knowledge-entry and querying tool. Presented at the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, New York (2006)
24. Richardson, R., Goertzel, B., Pinto, H., Fox, E.A.: Automatic creation and translation of concept maps for computer science-related theses and dissertations. In: Proceedings of 2nd Concept Mapping Conference 2006, San José, Costa Rica, pp. 160-163 (2006)
25. Cassel, L.N.: The Ontology Project. http://what.csc.villanova.edu/twiki/bin/view/Main/OntologyProject (2006)
26. Sleator, D., Temperley, D.: Parsing English with a link grammar. In: Proceedings of the Third International Workshop on Parsing Technologies, Tilburg, Netherlands & Durbuy, Belgium (1993)


27. Hobbs, J.: Pronoun resolution. Lingua 44, 339-352 (1978)
28. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: a framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia (2002)
29. Bontcheva, K., Tablan, V., Maynard, D., Cunningham, H.: Evolving GATE to meet new challenges in language engineering. Nat. Lang. Eng. 10(3/4), 349-373 (2004)
30. IBM: Unstructured Information Management Architecture (UIMA). http://www.alphaworks.ibm.com/tech/uima (2006)
31. Cañas, A.J., Hill, G., Carff, R., Suri, N., Lott, J., Gómez, G., Eskridge, T.C., Arroyo, M., Carvajal, R.: CmapTools: a knowledge modeling and sharing environment. In: First International Conference on Concept Mapping, Pamplona, Spain, pp. 16-24 (2004)
32. Liu, Y., Bai, K., Mitra, P., Giles, C.L.: TableSeer: automatic table metadata extraction and searching in digital libraries. In: Proceedings of Joint Conference on Digital Libraries, Vancouver, BC, pp. 91-100 (2007)
33. Richardson, R., Fox, E.A.: Using bilingual ETD collections to mine phrase translations. In: Proceedings of Joint Conference on Digital Libraries, Vancouver, British Columbia, Canada, pp. 352-353 (2007)
34. Richardson, R., Fox, E.A.: Using concept maps in NDLTD as a cross-language summarization tool for computing-related ETDs. In: Proceedings of 10th International Symposium on Electronic Theses and Dissertations, Uppsala, Sweden, pp. 1-8 (2007)
35. Ribbens, C.J., Varadarajan, S., Chinnusamy, M., Swaminathan, G.: Balancing computational science and computer science research on a terascale computing facility. In: Proceedings of International Conference on Computational Science, pp. 60-67 (2005)
36. Goncalves, M.A., France, R.K., Fox, E.A., Doszkocs, T.E.: MARIAN: searching and querying across heterogeneous federated digital libraries. In: Proceedings of First DELOS Workshop on Information Seeking, Searching and Querying in Digital Libraries, Zurich, Switzerland (2000)
37. Smith, M., Barton, M., Bass, M., Branschofsky, M., McClellan, G., Stuve, D., Tansley, R., Walker, J.H.: DSpace: an open source dynamic digital repository. D-Lib Magazine 9(1) (2003)
