

Comparison of existing open-source tools for Web crawling and indexing of free Music
André Ricardo and Carlos Serrão
Abstract: This paper presents a portrait of existing open-source web crawler tools that also have an indexing component. The goal is to understand which tool is best suited to crawl and index a large collection of MP3 music files freely available on the Internet. In this study each piece of software is briefly described, with an overview, the identification of some of its users, and its main advantages and disadvantages. In order to better understand the most significant differences between the tools, a summary of features is presented: the programming language in which they are written, the platform used for deployment, the type of index used, database integration, front-end capabilities, the existence of a plugin system, and MP3 and Adobe Flash (SWF file) parsing support. Finally, the tools were classified according to the prospective collection size, being divided into tools to mirror small collections, tools for medium collections, and software capable of handling large amounts of data. In conclusion, an assessment is made of which tools are best suited to handle large collections in a distributed way.

Index Terms: Content Analysis and Indexing, Information Storage and Retrieval, Information Filtering, Retrieval Process, Selection Process, Open Source, Creative Commons, Music, MP3.

1 INTRODUCTION

The objective of this paper is to identify and study the tools that can be used to create an index similar to those used by existing commercial music recommendation systems, but with the purpose of indexing all freely available music on the Internet. The paper is primarily focused on the discovery and indexing of free music over the Internet as a way to create a large distributed database capable of offering meta-information and recommendation systems. The first section of this paper provides an overview of the existing open-source tools that can be used for crawling and indexing content on the Internet (Table 1). This section also presents data about the most important characteristics of such tools, such as the programming language in which they were developed, the type of index created, database integration, front-end, plugin structure, and MP3 and Flash parsing support. Concluding the analysis, the most relevant advantages and drawbacks of each tool are stated, followed by an overview of how adequate the tool is to solve the problem addressed by this work. Finally, some conclusions and future work are presented, taking into account the major objective of this work: the ability to develop an open and free music recommendation system.

2 TOOLS OVERVIEW
This section presents a summary of the different characteristics of each of the tools that were considered, condensed into a set of tables to facilitate the comparison process. First, each tool in analysis is introduced with a short description, stating the most notable users operating each piece of software, followed by an overview of its advantages and drawbacks. Considering all the software tools in analysis, Table 1 states the programming language used for their development (language), the platforms on which they run (platform), whether some type of index is produced by the web crawling tool (index) and, finally, possible connections to databases (database).

2.1 ASPseek
ASPseek (http://www.aspseek.org/) consists of an indexing robot, a search daemon, and a CGI search front-end. It is an outdated tool, and its applicability in this scenario is not a reliable option. Its major advantage is that it supports external parsers. However, as referred to before, the tool is outdated and cannot scale to global web crawling, since it is based on a relational database.

2.2 Bixo
Bixo (http://openbixo.org) is a web-mining toolkit that runs as a series of cascading pipes on top of Hadoop (it is used by companies and services such as Bebo, EMI Music, ShareThis and Bixo Labs). Bixo is a tool that might be very interesting to projects looking for a web-mining
framework that can be integrated with existing information systems, for example to inject data into a data warehouse system. Based on the Cascading API and running on a Hadoop cluster, Bixo is suitable for crawling large collections. In a project that needs to handle large collections and to feed data into existing systems, Bixo is a tool worth a close look. Bixo's major advantages are its orientation towards data mining and its capability to handle large data sets, as it was tested with the Public Terabyte Dataset [19][18]. The major drawback of Bixo is its limited built-in support for creating an index.

2.3 Crawler4j
Crawler4j (http://code.google.com/p/crawler4j/) is a Java crawler that provides a programmable interface for crawling. It is a piece of source code to incorporate in a project, but there are more suitable tools to index content. Its main advantage is that it is easy to integrate into Java projects that need a crawling component (a minimal, generic sketch of such a component is given after section 2.6). On the other hand, it offers support neither for robots.txt nor for pages without UTF-8 encoding, and it is necessary to build the entire complementary indexing framework.

2.4 DataparkSearch
DataparkSearch (http://www.dataparksearch.org/) is a web crawler and search engine (used, for instance, by News Lookup). DataparkSearch benefits from MP3 and Flash parsers but, due to lack of development, it still uses outdated technology such as CGI and does not have a modular architecture, making it difficult to extend. The index is not in a format that could be used by other frameworks. The major advantage of this tool is its support for MP3 and Flash parsing. On the other hand, it still uses outdated technology and its development seems to have stopped.

2.5 Ebot
Ebot (http://www.redaelli.org/matteo-blog/projects/ebot/) is a web crawler written in Erlang. There is no proof of concept that Ebot would scale well to index the desired collection. Because Erlang and CouchDB were used to solve the crawl and search problem, people keen on these technologies might find this tool attractive. Ebot is distributed and scalable [8]; however, there is only one developer active in the project and there is no proven working system deployed.

2.6 GNU Wget
GNU Wget (http://www.gnu.org/software/wget/) is a non-interactive command-line tool to retrieve files over the most widely used Internet protocols. Wget is a really useful command-line tool to download a simple HTML website, but it does not offer indexing support; it is limited to the mirroring and downloading process. Its main advantage is that, with simple commands, it is easy to mirror an entire website or explore a whole site structure. However, there is the need to create the entire indexing infrastructure, and it is primarily built for pages working with HTML, with no Flash or Ajax support.
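The crawling component offered by crawler4j, and the mirroring behaviour of Wget, both reduce to the same fetch-parse-enqueue loop. The sketch below is a hypothetical, minimal breadth-first crawler in plain Java (JDK only) that follows HTML links and records any URL ending in .mp3. It is not crawler4j's actual API, and it deliberately omits robots.txt handling, politeness delays and character-encoding issues that a real crawler must address; the seed URL is a placeholder.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical minimal breadth-first crawler that collects MP3 links.
// It is NOT crawler4j's API; it ignores robots.txt, politeness and encodings.
public class Mp3LinkCrawler {

    private static final Pattern HREF =
            Pattern.compile("href=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        frontier.add("http://example.org/");                  // placeholder seed URL

        int fetched = 0;
        while (!frontier.isEmpty() && fetched < 100) {        // small crawl budget
            String url = frontier.poll();
            if (!seen.add(url)) continue;                     // skip already visited pages

            String html;
            try {
                HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
                html = client.send(req, HttpResponse.BodyHandlers.ofString()).body();
            } catch (Exception e) {
                continue;                                     // unreachable or non-text page
            }
            fetched++;

            Matcher m = HREF.matcher(html);
            while (m.find()) {
                try {
                    // Resolve relative links against the current page.
                    String link = URI.create(url).resolve(m.group(1)).toString();
                    if (link.toLowerCase().endsWith(".mp3")) {
                        System.out.println("MP3 found: " + link);  // a real system would index it here
                    } else if (link.startsWith("http")) {
                        frontier.add(link);                   // enqueue for breadth-first expansion
                    }
                } catch (IllegalArgumentException ignored) {
                    // malformed href value; skip it
                }
            }
        }
    }
}

A tool such as crawler4j wraps this loop in a configurable, multi-threaded component; the indexing side, as noted above, still has to be built separately.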

2.7 GRUB
GRUB (http://grub.org/) is a web crawler with distributed crawling. GRUB's distributed approach still requires a proof of concept that it is suitable for a large-scale index; it also requires proof that distributed crawling is a better solution than centralized crawling. GRUB tries a new approach to searching by distributing the crawling process. However, the documentation is incomplete and it was banned from Wikipedia for bad crawling behavior. According to the Nutch FAQ, distributed crawling may not be a good deal: while it saves bandwidth, in the long run this saving is not significant, and because it requires more bandwidth to upload query result pages, making the crawler use less bandwidth does not reduce overall bandwidth requirements. The dominant expense of operating a large search engine is not crawling, but searching. The project development looks to have halted, with no news since 2009.

2.8 Heritrix
Heritrix (http://crawler.archive.org/) is an extensible, web-scale, archival-quality web crawler project (it is used by the Internet Archive and by Arquivo da Web Portuguesa). Heritrix is the piece of software used and written by the Internet Archive to make copies of the Internet. The disadvantage of Heritrix is the lack of indexing capabilities; the content is stored in ARC files [2]. It is a really good solution for archiving websites and making copies for future reference. The Heritrix software is a use case proven by the Internet Archive and is well adjusted to making copies of websites. However, it needs to process ARC files, and its architecture is more monolithic and not designed for adding parsers and extensibility.

2.9 ht://Dig
ht://Dig (http://www.htdig.org/) is a search engine and web crawler. ht://Dig is a system oriented towards adding search to an existing website, for example a site already built in HTML that needs searching functionality. Until 2004, the date of the last release, it was one of the most popular web crawlers and search engines, enjoying a large user base with notable sites such as the GNU Project and the Mozilla Foundation, but with no updates over time it slowly lost most of that user base to newer solutions. Its development ceased in 2004.

2.10 HTTrack
HTTrack (http://www.httrack.com/) is a website mirror tool. HTTrack is designed to create mirrors of existing sites, not for indexing. It is a good tool for users unfamiliar with web crawling who enjoy a good GUI. HTTrack can follow links that are generated with basic JavaScript and inside applets or Flash [11]. However, HTTrack does not have integration with indexing systems.

2.11 Hyper Estraier
Hyper Estraier (http://fallabs.com/hyperestraier/index.html) is a full-text search engine system (used by the GNU Project). Hyper Estraier has characteristics such as high-performance search and P2P support, making it an interesting solution for adding search to an existing website. The GNU Project uses Hyper Estraier to search its large number of documents, making it a good solution for collections of approximately 8 thousand documents. This tool is useful for adding search functionality to a site and it offers P2P support. However, it has only one core developer.

2.12 mnoGoSearch
mnoGoSearch (http://www.mnogosearch.org/) is a web search engine (one of the users of this tool is MySQL). mnoGoSearch is a solution for a small enterprise appliance, to add search capability to an existing site or intranet. The project is a bit outdated and, due to the dependency on a specific vendor, other solutions should be considered. One of its major advantages is that MySQL uses it. On the other hand, there is little information about scalability and extensibility, and it is extremely dependent on the vendor Lavtech for future development.

2.13 Nutch
Nutch (http://nutch.apache.org/) is a web search engine, crawler, link-graph database, set of parsers and plugin system (it is used on sites such as Creative Commons and Wikia Search). Nutch is one of the most developed and active projects in the web crawling field. The need to scale and distribute Nutch led Doug Cutting, the project creator, to start developing Hadoop, a framework for reliable, scalable and distributed computing. This means that not only is the project developing itself, but it also works with Hadoop, Lucene, Tika and Solr. The project is seeking to integrate other pieces of software, such as HBase, too [5]. Another strong point for Nutch is the set of deployed systems with published case studies [14] and [16]. The biggest drawback of Nutch is the configuration and tuning process, combined with the need to understand how the crawler works to get the desired results. For large-scale web crawling, Nutch is a stable and complete framework. The major advantages of Nutch can be summarized as follows:
- Nutch has a highly modular architecture allowing developers to create plugins for media-type parsing, data retrieval, querying and clustering [12].
- Nutch works under the Hadoop framework, so it features cluster capabilities, distributed computation (using MapReduce) and a distributed filesystem (HDFS) if needed.
- Built with scalability and cost effectiveness in mind [6].
- Support to parse and index a diverse range of documents using Tika, a toolkit to detect and extract metadata (a small stand-alone sketch of this kind of extraction is given at the end of this subsection).
- Integrated Creative Commons plugin.
- The ability to use other languages, such as Python, to script Nutch.
- There is an adaptation of Nutch called NutchWAX (Nutch Web Archive eXtensions) allowing Nutch to open the ARC files used by Heritrix.
- Top-level Apache project, with a high level of expertise and visibility around the project.
However, Nutch has some complexity, and the integrated MP3 parser, based on the Java ID3 Tag Library, is deprecated and did not work when tested.
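Since Nutch delegates content parsing to Tika, the following stand-alone sketch shows the kind of metadata extraction an MP3-aware crawl relies on. It uses Tika's AutoDetectParser and simply prints every metadata field the parser reports, because the exact property names (title, artist, duration, and so on) vary with the Tika version; the input file name is a placeholder.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Minimal sketch: let Tika detect the file type (e.g. an MP3) and dump the
// metadata it extracts. The set of property names varies with the Tika version.
public class TikaMetadataDump {
    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0] : "song.mp3";   // placeholder input file

        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();  // collects any extracted text
        Metadata metadata = new Metadata();

        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            parser.parse(in, handler, metadata, new ParseContext());
        }

        // Print every metadata field Tika reported (title, artist, duration, ...).
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}

In a crawl pipeline, fields extracted this way would be added to the index alongside the URL of the MP3 file.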

2.14 Open Search Server
Open Search Server (http://www.open-search-server.com/) is a search engine with support for business clients. Open Search Server is a good solution for small appliances. Unfortunately, it is not well documented in terms of how extensible it is. This tool is quite easy to install and set running. However, it is dependent on the commercial component for development, has a small community and scarce documentation, has some problems handling special characters, and there is little information on extending the software.

2.15 OpenWebSpider
OpenWebSpider (http://www.openwebspider.org/) is a web spider for the .NET platform. This is an interesting project, based on the .NET framework and C#, for those intending to build a small to middle-sized data collection. It supports MP3 indexing and offers crawling and database integration. However, it has only one developer; the source is disclosed, but since no one else is working on the project and there is no source code repository, it does not behave as a real open-source project. The Mono framework might constitute a problem for those concerned with patent issues, there is no proof of concept, and the use of a relational database might not scale well.

2.16 Pavuk
Pavuk (http://pavuk.sourceforge.net/) is a web crawler. Pavuk is a complement to tools like Wget, but it does not offer indexing functionality. Its main advantage is that it complements solutions like Wget and HTTrack with filters based on regular expressions and similar functions. However, development has been stopped since 2007 and it has no indexing features.

2.17 Sphider
Sphider (http://www.sphider.eu/) is a PHP search engine. Sphider is a complete solution, with crawler and web search, that can run on a server with just PHP and MySQL. For adding integrated search functionality to an existing web application with few requirements it may be a good solution. It is easy to set up and integrate into an existing solution. However, the index is a relational database and might not scale well to millions of documents.

2.18 Xapian
Xapian (http://xapian.org/) is a search engine that uses ht://Dig for crawling. If a project has no problem in using CGI and relying on an outdated crawler, and would rather put the effort into having packages in Linux distributions, then this software can be an option. Xapian's major advantages include:
- It currently indexes over 50 million mail messages in Gmane lists, proving that it can handle a collection at least that size;
- Scaling to large document collections;
- Still in active development;
- Packages for some Linux distributions.
The major disadvantages of Xapian are that the index can only be used by Xapian, that it uses CGI and that it is totally dependent on ht://Dig for crawling.

2.19 YaCy
YaCy (http://yacy.net/) is a free distributed search engine built on principles of peer-to-peer (P2P) networks (one of its notable users is ScienceNet). For scientific projects like ScienceNet, which have several machines across the world with different architectures, it can be considered a good solution. YaCy is a distributed search engine working on the P2P model: it is decentralized, so even if one node goes down the search engine continues to work. It is easy to get YaCy working and it is quick to set up a P2P search network. Nevertheless, it is hard to understand how customizable YaCy is outside the existing parameters, and P2P search can be slow according to the Nutch FAQ [9].

3 FEATURES COMPARISON
Considering all the different software tools in analysis, Table 1 presents the programming language used for their development and the different platforms on which they run. It also indicates whether the tool builds some type of index while crawling (index) and whether connections to databases are possible (database). Table 2 summarizes, for each tool, its front-end capabilities, its support for plugins, and its MP3 and Adobe Flash parsing support. Flash support is an important issue to address: due to the architecture of sites that use this technology exclusively, with no HTML link structure to navigate or with links to content placed directly inside Flash files (SWF files), it is necessary to be able to parse this content (a naive link-extraction sketch over SWF files is shown below). The goal is to understand the extensibility, flexibility and maintainability of each of the different solutions considered.
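To illustrate why SWF parsing matters, the sketch below scans a Flash file for absolute URLs. It only handles the uncompressed (FWS) and zlib-compressed (CWS) SWF variants and merely greps the decompressed bytes for http:// strings, so it is a naive heuristic for illustration, not the approach implemented by any of the surveyed tools; the input file name is a placeholder.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.InflaterInputStream;

// Naive SWF link scanner: decompress the SWF body if needed and grep it for URLs.
// Real SWF parsing walks the tag structure; this heuristic only finds plain-text links.
public class SwfLinkScanner {

    private static final Pattern URL = Pattern.compile("https?://[\\x21-\\x7e]+");

    public static void main(String[] args) throws Exception {
        byte[] swf = Files.readAllBytes(Paths.get(args.length > 0 ? args[0] : "movie.swf"));

        // SWF header: 3-byte signature ("FWS" = uncompressed, "CWS" = zlib-compressed),
        // 1 version byte, 4-byte length; the body starts at offset 8.
        byte[] body;
        if (swf[0] == 'C') {
            try (InputStream in = new InflaterInputStream(
                    new ByteArrayInputStream(swf, 8, swf.length - 8))) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                in.transferTo(out);
                body = out.toByteArray();
            }
        } else {
            // "FWS": the body is stored uncompressed after the 8-byte header.
            body = Arrays.copyOfRange(swf, 8, swf.length);
        }

        // Treat the bytes as Latin-1 text so the regex can run over them unchanged.
        Matcher m = URL.matcher(new String(body, StandardCharsets.ISO_8859_1));
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}

Links embedded in ActionScript bytecode or built at runtime would not be found this way, which is why proper SWF parsing support in a crawler is valuable.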

Table 1. Open source tools for web crawling and indexing: programming language, index type and database
Name                 Language          Platform         Index           Database
ASPseek              C++               Linux            Relational DB   MySQL
Bixo                 Java              Cross-platform   N/A             -
crawler4j            Java              Cross-platform   N/A             -
DataparkSearch       C                 Cross-platform   SQL             MySQL, PostgreSQL, SQLite
Ebot                 Erlang            Linux            NoSQL           CouchDB
GNU Wget             C                 Linux            File mirror     -
GRUB                 C#                Cross-platform   Relational DB   -
Heritrix             Java              Unix             ARC files       -
Hounder              Java              Cross-platform   Lucene          -
ht://Dig             C++               Unix             Disk files      -
HTTrack              C/C++             Cross-platform   Mirror files    -
Hyper Estraier       C/C++             Cross-platform   QDBM            -
mnoGoSearch          C                 Windows          Relational DB   MySQL
Nutch                Java              Cross-platform   Lucene          Possible integration
Open Search Server   C/C++, Java, PHP  Cross-platform   Lucene          -
OpenWebSpider        C#, PHP           Cross-platform   Relational DB   MySQL, PostgreSQL
Pavuk                C                 Unix             Mirror files    -
Sphider              PHP               Cross-platform   Relational DB   MySQL
Xapian               C++               Cross-platform   Omega           -
YaCy                 Java              Cross-platform   NoSQL           -

4 OVERVIEW
From all the open-source tools considered and analyzed, the ones with recent development also tend to be the ones where scalability is a core issue. Tools like Bixo, Heritrix, Nutch and YaCy are designed to handle large data collections as the Web grows bigger. According to their functionalities and capabilities, the web crawler tools can be grouped into three different categories:
- Mirroring a collection, with tools that do not do indexing but only produce integral copies of websites;
- Medium collection crawling;
- Large collection crawling and indexing.

Table 2. Open source tools for web crawling and indexing: front-end, plugin support and MP3/Flash parsing
Name                 Front-end            Plugins / MP3 and Flash parsing
ASPseek              CGI                  External parsers
Bixo                 Cascading pipes API  -
crawler4j            N/A                  N/A
DataparkSearch       CGI                  MP3 and Flash parsers
Ebot                 Web services         -
GNU Wget             CLI                  -
GRUB                 PHP                  -
Heritrix             CLI, JSP             -
Hounder              JSP                  Uses Nutch plugins
ht://Dig             CGI                  External parsers
HTTrack              GUI                  Follows links inside Flash and applets
Hyper Estraier       CGI                  External converter programs
mnoGoSearch          CGI, PHP, Perl       External parsers
Nutch                CLI, JSP             Plugin system; MP3 parser deprecated
Open Search Server   Web based            -
OpenWebSpider        CLI, Web based       MP3 (UltraID3Lib)
Pavuk                CLI                  -
Sphider              PHP                  -
Xapian               CGI, XML             -
YaCy                 Web based            -

It is important to note that the distinction between a medium and a large collection is hard to make. In this study it was considered that a large collection means near-whole web crawling (more than 200 million documents), while a medium collection means a subset of the Web (50 to 200 million documents). The differentiation between medium and large also took into account the largest known deployed system for each tool, because some tools claimed the ability to perform large-scale web crawling but lacked a proof of concept. The following classification was therefore made:
- Mirroring a collection: GNU Wget, Heritrix, HTTrack, Pavuk;
- Medium collection: ASPseek, crawler4j, DataparkSearch, Ebot, GRUB, Hounder, ht://Dig, Hyper Estraier, mnoGoSearch, Open Search Server, OpenWebSpider, Sphider, Xapian;
- Large collection: Bixo, Nutch and YaCy.

5 CONCLUSION
For most situations where only one enterprise intranet or a small, specific subset of the web needs to be processed (referred to in this paper as a medium collection), lighter tools with faster and easier configuration can be sufficient. In this case the range of open-source tools available to choose from is broad and no single piece of software is clearly more suitable than the others. Programming language and indexing system tend to be the two key factors in choosing the right software for the task, hence the comparison in Table 1.
When looking for solutions for large collections, YaCy, with its P2P framework, is an option. It is interesting software when speed is not crucial and the focus goes into a distributed architecture and an easy setup. To provide reliable, fast and scalable computing, Bixo and Nutch are the best answer. This is supported in part by the fact that both rely on Hadoop, an industry-wide adopted and proven framework, with several success cases such as the Yahoo! clusters [22], [20], [21], Facebook [10], Last.fm [7] and Spotify [4], [13]. These are just a few examples of organizations using Hadoop; the list is much more extensive [17]. The main difference between them is that Bixo relies on Cascading to complete the workflow and does not do indexing, while Nutch indexes using Lucene. In general, solutions using the Lucene index tend to have fast retrieval times and to require little space on disk (good characteristics for a search engine) in comparison to other solutions [15] (see the indexing sketch below). If the choice has to be made between Bixo and Nutch, it depends on the goal: to integrate with an existing system and workflow in order to do data mining or related jobs, choose Bixo; to build a system with a search engine that handles a massive document collection, Nutch is the tool of choice [1], [3].
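The claim about Lucene-based indexes can be made concrete with a short, self-contained sketch. The example below is written against a recent Lucene release (class names such as ByteBuffersDirectory differ from the Lucene 3.x generation that Nutch used at the time of this survey); it indexes two hypothetical track descriptions and runs a full-text query over them. The field names, URLs and query string are illustrative only.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Minimal Lucene example: build an in-memory inverted index of two "tracks" and search it.
public class LuceneTrackIndex {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();            // in-memory index for the demo
        StandardAnalyzer analyzer = new StandardAnalyzer();

        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            writer.addDocument(track("http://example.org/a.mp3", "Free jazz improvisation"));
            writer.addDocument(track("http://example.org/b.mp3", "Creative Commons folk song"));
        }

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query q = new QueryParser("title", analyzer).parse("creative commons");
            for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                Document doc = searcher.doc(hit.doc);
                System.out.println(doc.get("url") + "  score=" + hit.score);
            }
        }
    }

    private static Document track(String url, String title) {
        Document doc = new Document();
        doc.add(new StringField("url", url, Field.Store.YES));   // stored identifier, not tokenized
        doc.add(new TextField("title", title, Field.Store.YES)); // tokenized full-text field
        return doc;
    }
}

A production deployment would use an on-disk FSDirectory and far richer documents, but the retrieval path, an inverted index queried term by term, is what gives Lucene-based tools their speed and compact footprint.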

REFERENCES
[1] Tutorial T6: IR Prototypes and Web Search Hacks with Open Source Tools. SIGIR'09.
[2] Internet Archive ARC files. http://crawler.archive.org/articles/developer_manual/arcs.html. Accessed: 09-29-2010.
[3] A Comparison of Open Source Search Engines - Vik Singh. http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/. Accessed: 09-26-2010.
[4] Bernhardsson, E. 2009. Implementing a Scalable Music Recommender System.
[5] Bialecki, A. 2009. Nutch, web-scale search engine toolkit. ApacheCon 2009, Oakland.
[6] Cafarella, M. and Cutting, D. 2004. Building Nutch: Open source search. Queue. 2, 2 (2004), 61.
[7] Dittus, M. 2008. Hadoop at Last.fm.
[8] Ebot | Matteo Redaelli. http://www.redaelli.org/matteo-blog/projects/ebot/. Accessed: 09-21-2010.
[9] FAQ - Nutch Wiki. http://wiki.apache.org/nutch/FAQ#Will_Nutch_be_a_distributed.2C_P2P-based_search_engine.3F. Accessed: 09-17-2010.
[10] HDFS: Facebook has the world's largest Hadoop cluster! http://hadoopblog.blogspot.com/2010/05/facebook-has-worlds-largest-hadoop.html. Accessed: 09-25-2010.
[11] HTTrack Website Copier - Offline Browser. http://www.httrack.com/html/faq.html. Accessed: 01-09-2011.
[12] Khare, R., Cutting, D. et al. 2004. Nutch: A flexible and scalable open-source web search engine. Oregon State University. (2004).
[13] Kreitz, G. and Niemela, F. 2010. Spotify - Large Scale, Low Latency, P2P Music-on-Demand Streaming. Peer-to-Peer Computing (P2P), 2010 IEEE Tenth International Conference on (2010), 1-10.
[14] Michael, M., Moreira, J.E. et al. 2007. Scale-up x scale-out: A case study using Nutch/Lucene. IEEE International Parallel and Distributed Processing Symposium, 2007 (IPDPS 2007), 1-8.
[15] Middleton, C. and Baeza-Yates, R. A Comparison of Open Source Search Engines.
[16] Moreira, J.E., Michael, M.M. et al. 2007. Scalability of the Nutch search engine. Proceedings of the 21st Annual International Conference on Supercomputing (2007), 12.
[17] PoweredBy - Hadoop Wiki. http://wiki.apache.org/hadoop/PoweredBy. Accessed: 09-27-2010.
[18] Public Terabyte Dataset Project - Elastic Web Mining | Bixo Labs. http://bixolabs.com/datasets/public-terabyte-dataset-project/. Accessed: 09-25-2010.
[19] San Francisco Bay Area ACM, Archive DM SIG - ACM Silicon Valley Data Mining Camp on November 1, 2009. http://www.sfbayacm.org/?p=894. Accessed: 09-25-2010.
[20] Scalability of the Hadoop Distributed File System (Yahoo! Hadoop Blog). http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html. Accessed: 09-25-2010.
[21] Shvachko, K., Kuang, H. et al. 2010. The Hadoop Distributed File System. 26th IEEE Symposium on Massive Storage Systems and Technologies (MSST 2010). (2010).
[22] Yahoo! Launches World's Largest Hadoop Production Application (Yahoo! Hadoop Blog). http://developer.yahoo.net/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html. Accessed: 09-25-2010.

André Ricardo holds a MSc. in Management and Computer Science (2010) from ISCTE-IUL and a BSc. in Management and Computer Science (2008), also from ISCTE-IUL.
Carlos Serrão holds a PhD. in Distributed Systems and Computer Architecture (2008) from UPC, a MSc. in Information Systems Management (2004) from ISCTE-IUL and a BSc. in Management and Computer Science (1997) from ISCTE-IUL. Currently he is an Assistant Professor at ISCTE-IUL.
