
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR), ISSN 2249-6831, Vol. 2, Issue 2, June 2012, 57-63

WEB INFORMATION EXTRACTION USING DEPTA


A. SURESH BABU1, SADIA NAUREEN2, M. UMAMAHESWARA RAO3 & G. SIRISHA4

1Assistant Professor, Department of Computer Science and Engineering, JNTU-Pulivendula, AP, India.

2,3,4B.Tech (4-2), Department of Computer Science and Engineering, JNTU-Pulivendula, AP, India.

ABSTRACT
Information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information from unstructured or semi-structured machine-readable documents. It produces structured data ready for post-processing, which is crucial to many applications of web mining and search tools. Information extraction systems are software tools designed to generate wrappers. A wrapper usually performs a pattern-matching procedure that relies on a set of extraction rules. IE systems are classified into four classes: manually constructed, supervised, semi-supervised, and unsupervised. Unsupervised IE systems use no labeled training examples and require no user interaction to generate a wrapper. DEPTA (Data Extraction Based on Partial Tree Alignment) is an unsupervised IE system designed for record-level extraction tasks. The objective is to segment the data records, extract data items (or data fields) from them, and put them in a database table. DEPTA consists of two steps: identifying individual data records in a page, and aligning and extracting data items from the identified data records. Data record extraction is done using the MDR algorithm, which works in three steps: building the HTML tag tree of the page, mining data regions in the page using the tree, and identifying data records from each data region. Partial alignment aligns the data fields in a pair of data records that match with certainty and makes no commitment on the rest of the data fields.

KEYWORDS: IE, DEPTA, Data Extraction

INTRODUCTION


Information Extraction (IE) is the name given to any process which selectively structures and combines data which is found, explicitly stated or implied, in one or more texts. The final output of the extraction process varies; in every case, however, it can be transformed so as to populate some type of database. Information analysts working long term on specific tasks already carry out information extraction manually with the express goal of database creation. Information extraction is the identification, and consequent or concurrent classification and structuring into semantic classes, of specific information found in unstructured data sources, such as natural language text, making the information more suitable for information processing tasks. Information Extraction refers to the automatic extraction of structured information such as entities, relationships between entities, and attributes describing entities from unstructured sources. This enables much richer forms of queries on the abundant unstructured sources than possible with keyword searches alone.


Fig. 1 : A list of products

Web Document

A Web document is defined as something that has a URI and can return representations (i.e., responses in a format such as HTML, JPEG, or RDF) of the identified resource in response to HTTP requests.

Types of Web Documents

There are three basic types of web documents: static, dynamic, and active.

Static: A static web document resides in a file associated with a web server. The author of a static document determines its contents at the time the document is written. Because the contents do not change, each request for a static document results in exactly the same response.

Dynamic: A dynamic web document does not exist in a predefined form. When a request arrives, the web server runs an application program that creates the document. The server returns the output of the program as a response to the browser that requested the document. Because a fresh document is created for each request, the contents of a dynamic document can vary from one request to another.

Active: An active web document consists of a computer program that the server sends to the browser and that the browser must run locally. When it runs, the active document program can interact with the user and change the display continuously.
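The dynamic case above can be illustrated with a minimal sketch using Python's standard-library HTTP server (an illustration only, not part of the paper; the page contents and port are hypothetical). The server runs a program for each request and returns its output, so two requests can receive different documents:

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_dynamic_page():
    """Build a fresh HTML document for each request; the contents
    (here, the server's current time) can vary between requests."""
    return f"<html><body>Server time: {time.ctime()}</body></html>"

class DynamicHandler(BaseHTTPRequestHandler):
    """A minimal 'dynamic web document': the server runs a program
    (render_dynamic_page) per request and returns its output."""
    def do_GET(self):
        body = render_dynamic_page().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 8000), DynamicHandler).serve_forever()
```

A static document, by contrast, would simply be read from a file, and an active document would ship the program (e.g., JavaScript) to the browser instead of running it on the server.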

EXISTING METHODS
Web data extraction related works can be classified into three categories: 1) wrapper programming languages, 2) wrapper induction, and 3) automatic extraction. The first approach provides some specialized pattern specification languages to help the user construct extraction programs. Visual platforms are also provided to hide their complexities under simple graphical wizards and interactive processes. The second approach is wrapper induction, which uses supervised learning to learn data extraction rules from a set of manually labeled examples. Manual labeling of data is labor intensive and time consuming. Furthermore, for different sites or even pages in the same site, the manual labeling process needs to be repeated because they may follow different templates.


The third approach is automatic extraction. Embley et al. propose using a set of heuristics and domain ontologies to automatically identify data record boundaries. Buttler et al. propose additional heuristics for the task without using domain ontologies. However, experimental results show that the performance of this approach is not satisfactory.

PROPOSED METHOD
We propose a novel technique to perform automatic data extraction given a single page with lists of data records. Most existing methods require multiple pages; in our view, requiring multiple pages is a limitation. If a page contains a list of data records, a single page is sufficient for finding patterns from the data records in the list and extracting data from them. We will discuss this further in the related work section. In addition, our method is able to extract data from noncontiguous data records, which cannot be done with existing techniques. By noncontiguous data records, we mean that the HTML codes of these data records intertwine with one another, but when they are displayed on a Web browser, they appear contiguous to the human user.

Our objective is twofold: 1) to automatically identify data records in a page and 2) to automatically align and extract data items from the records. Given a page, the method first segments the page using an enhanced tree matching algorithm to identify each data record without extracting its data items. That is, the algorithm identifies data records by analyzing the DOM tree of the page. Visual information is used in this step in three ways.

First, it is used to build the DOM tree. A straightforward way to build a DOM tree is to follow the nested tag structure in the HTML code. However, sophisticated analysis has to be incorporated to handle errors in the HTML code (e.g., missing or ill-formatted tags). The visual information, which can be obtained after the HTML code is rendered by a Web browser, also contains information about the hierarchical structure of the tags. In this work, rather than analyzing and correcting the HTML code, visual information is utilized to infer the structural relationships among tags and to construct a DOM tree. This method leads to more robust tree construction due to the high error tolerance of the rendering engines of Web browsers (e.g., Internet Explorer).
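The idea of inferring tree structure from rendering information can be sketched as follows (a simplified illustration, not the system's implementation; the Box representation is hypothetical, and real coordinates would come from a browser's rendering engine). Each rendered element becomes a child of the smallest element whose bounding box encloses it:

```python
from dataclasses import dataclass, field

@dataclass
class Box:
    """A rendered element: tag name plus its on-screen bounding box
    (hypothetical representation)."""
    tag: str
    left: int
    top: int
    right: int
    bottom: int
    children: list = field(default_factory=list)

def contains(outer, inner):
    """True if outer's bounding box fully encloses inner's."""
    return (outer.left <= inner.left and outer.top <= inner.top and
            outer.right >= inner.right and outer.bottom >= inner.bottom)

def build_tree(boxes):
    """Infer a DOM-like tree from containment: each box becomes a child
    of the smallest box that encloses it; boxes enclosed by nothing
    become roots."""
    def area(b):
        return (b.right - b.left) * (b.bottom - b.top)
    boxes = sorted(boxes, key=area)  # smallest first
    roots = []
    for i, b in enumerate(boxes):
        # the smallest strictly-larger box containing b is its parent
        parent = next((p for p in boxes[i + 1:] if contains(p, b)), None)
        if parent is not None:
            parent.children.append(b)
        else:
            roots.append(b)
    return roots
```

Because this works on rendered boxes rather than raw HTML, it is insensitive to missing or ill-formatted tags, which is the robustness property argued for above.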
As long as a page can be rendered correctly by a browser, its DOM tree can be built correctly. Second, visual information helps reduce the tree matching computation, as two objects of very different visual sizes are unlikely to match. Finally, visual information helps improve the accuracy of data record segmentation by giving the system access to a valuable piece of information that is not available in the DOM tree: the space gaps between data records. This information can be used to segment data records, as any gap within a data record is typically smaller than the gap between data records.

A novel partial tree alignment method is proposed to align corresponding data items from the discovered data records and put the data items in a database table. Using tree alignment is natural because of the nested (or tree-structured) organization of the HTML code. Specifically, after all data records have been identified, the sub-trees of each data record are rearranged into a single tree, as each data record may be contained in more than one sub-tree in the original DOM tree of the page, and each data record may not be contiguous.
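The gap-based segmentation idea can be sketched as follows (a simplified illustration under stated assumptions, not the system's actual algorithm: the input coordinates and the 1.5x-average threshold are hypothetical). A vertical run of rendered blocks is split into records wherever the gap between consecutive blocks is unusually large:

```python
def segment_records(blocks, ratio=1.5):
    """Group a vertical run of rendered blocks into data records by
    splitting at unusually large vertical gaps.  `blocks` is a list of
    (top, bottom) coordinates sorted top-to-bottom (hypothetical input;
    a real system would take these from the browser's layout)."""
    gaps = [blocks[i + 1][0] - blocks[i][1] for i in range(len(blocks) - 1)]
    if not gaps:
        return [blocks]
    avg = sum(gaps) / len(gaps)
    records, current = [], [blocks[0]]
    for gap, block in zip(gaps, blocks[1:]):
        if gap > ratio * avg:  # a between-record gap
            records.append(current)
            current = []
        current.append(block)
    records.append(current)
    return records
```

This exploits exactly the observation above: within-record gaps cluster around a small value, so a gap well above the average marks a record boundary.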


The DOM trees of all the data records are then aligned using our partial tree alignment method. By partial tree alignment, we mean that, for each pair of trees (or data records), we only align those nodes (or data items) that can be aligned with certainty, making no commitment on the locations of the unaligned data items. Early uncertain commitments can result in undesirable effects for later alignment involving other data records. This method turns out to be very effective for multiple tree alignment. The resulting alignment enables us to extract items from all data records in the page. It can also serve as a pattern to be used to extract data items from other similar pages.

Fig. 2 : An example DOM tree of a page segment

Data Regions

A group of data records that contains descriptions of a set of similar objects is typically presented in a contiguous region of a page and formatted using similar HTML tags. Such a region is called a data record region. The nested structure of HTML tags in a Web page naturally forms a tag tree.

Step 1: Building an HTML tag tree of the page. In the new system, visual (rendering) information is used to build the tag tree.

Step 2: Mining data regions in the page using the tag tree. A data region is an area in the page that contains a list of similar data records. Instead of mining data records directly, which is hard, MDR mines data regions first and then finds data records within them. For example, in Figure 2, we first find the single data region below node TBODY. In our new system, visual information is again used in this step to produce better results.

Step 3: Identifying data records from each data region. For example, in Figure 2, this step finds data record 1 and data record 2 in the data region below node TBODY.
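Step 2 can be sketched as follows. This is a much-simplified illustration, not the MDR implementation: real MDR compares generalized nodes made of one or more adjacent subtrees using string edit distance, whereas here single adjacent subtrees are compared with Python's SequenceMatcher ratio, and the lightweight tree format and similarity threshold are hypothetical:

```python
from difflib import SequenceMatcher

def tag_string(node):
    """Flatten a subtree into its tag sequence.  A node is a
    (tag, children) pair -- a hypothetical lightweight tree format."""
    tag, children = node
    seq = [tag]
    for c in children:
        seq += tag_string(c)
    return seq

def similar(a, b, threshold=0.8):
    """Similarity of two subtrees via a sequence-matching ratio over
    their tag strings (a stand-in for MDR's string edit distance)."""
    return SequenceMatcher(None, tag_string(a), tag_string(b)).ratio() >= threshold

def mine_data_regions(children):
    """Group maximal runs of mutually similar adjacent child subtrees;
    runs of length >= 2 are candidate data regions."""
    regions, run = [], [children[0]]
    for node in children[1:]:
        if similar(run[-1], node):
            run.append(node)
        else:
            if len(run) >= 2:
                regions.append(run)
            run = [node]
    if len(run) >= 2:
        regions.append(run)
    return regions
```

On a page like Figure 2, the repeated TR subtrees under TBODY would form one run and thus one data region, while dissimilar surrounding nodes (headings, footers) would not.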


Fig. 3 : An illustration of generalized nodes and data regions

Partial Tree Alignment

Our proposed approach aligns multiple DOM trees by progressively growing a seed tree. The seed tree, denoted by Ts, is initially picked to be the tree with the maximum number of data fields. The reason for this choice is that such a tree is more likely to have good alignments with the data fields in the other data records. Then, for each Ti (i != s), the algorithm tries to find for each node in Ti a matching node in Ts. When a match is found for node Ti[j], a link is created from Ti[j] to its matching node Ts[k] to record the match in the seed tree. If no match can be found for node Ti[j], the algorithm attempts to expand the seed tree by inserting Ti[j] into Ts. The expanded seed tree Ts is then used in subsequent matching. An unmatched sequence of sibling nodes nj...nm can be inserted in three cases:

1. If nj...nm have two neighboring siblings in Ti, one on the left and one on the right, that are matched with two consecutive siblings in Ts, then nj...nm can be inserted between those two siblings in Ts. For example, consecutive sibling nodes c and d in Ti can be inserted into Ts between node b and node e, because node b and node e in Ts and Ti match; this yields the new (extended) Ts. It should be noted that nodes a, b, c, d, and e may also have their own children, which are not drawn to save space. This applies to all the cases below.

2. If nj...nm have only one left neighboring sibling x in Ti and x matches the rightmost node x in Ts, then nj...nm can be inserted after node x in Ts.

3. If nj...nm have only one right neighboring sibling x in Ti and it matches the leftmost node x in Ts, then nj...nm can be inserted before node x in Ts.
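The seed-growing idea above can be sketched in a much-simplified, flat-sequence form (treating each record as a list of field names rather than a DOM tree; the field names are hypothetical, and field names are assumed unique within the seed). An unmatched field is inserted only when its position is certain, i.e., its neighbors are matched or absent, mirroring the three insertion rules:

```python
def partial_align(records):
    """Flat-sequence sketch of partial tree alignment: the seed is the
    longest record; unmatched fields from other records are inserted
    into the seed only when their position is unambiguous."""
    seed = list(max(records, key=len))
    for rec in records:
        i = 0  # current insertion position in the seed
        for j, item in enumerate(rec):
            if item in seed:
                i = seed.index(item) + 1  # advance past the match
            else:
                # insert only when position is certain: the left
                # neighbour (if any) and right neighbour (if any)
                # are matched in the seed
                left_ok = j == 0 or rec[j - 1] in seed
                right_ok = j == len(rec) - 1 or rec[j + 1] in seed
                if left_ok and right_ok:
                    seed.insert(i, item)
                    i += 1
                # otherwise make no commitment on this field
    return seed

def to_table(records, schema):
    """Put the aligned records into a table over the seed's schema;
    missing fields become None."""
    return [[f if f in rec else None for f in schema] for rec in records]
```

A full implementation would repeat the pass over records whose uncertain fields were skipped (the iterative alignment of Figure 4) and operate on trees rather than flat lists.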

Fig. 4 : Iterative tree alignment with two iterations


Fig. 5 : Final Data Table (1 Indicates a Data Item)

FUTURE WORK

Labeling Extracted Data

Most current work is deficient in providing users the meaning of the attributes of the extracted data. This problem has been addressed, but the solution proposed is not general enough: queries are sent out through complex search forms, and the search results are used for data extraction and labeling. However, most Web sites do not provide complex search forms, so the use of this method is limited. The problem of labeling extracted data can be formally defined as follows: each extracted data record is denoted as R = <i1, i2, ..., iN>, where ik (1 <= k <= N) represents the kth data item. We need to assign an attribute name from the attribute set A = {a1, a2, ..., aM} (M <= N) to each data item ik to determine the corresponding label sequence L = <l1, l2, ..., lN>.

Extracting Data from Detail Pages

There are mainly two types of "data-rich" pages: list pages (pages containing at least two objects with similar patterns) and detail pages (pages containing detailed information about one object). Our current algorithm is only applicable to list pages. Extracting data from a single detail page depends heavily on the page domain. For example, in the news domain, we can extract the news title and news body by utilizing domain characteristics such as:

1. News bodies usually contain comparatively fewer hyperlinks and longer text.

2. News titles are usually located (geometrically) on top of news bodies.

It is very difficult to extract data from a single detail page without training or domain knowledge (what do the data describe). Most existing techniques need multiple pages that conform to a common schema as the input, and attempt to infer the schema by comparing these pages. The inferred schema is then used to extract data. How to construct a framework that can be used to extract data from a single page, and how to design algorithms that can be conveniently applied to Web pages in various domains is an interesting research problem. We are currently investigating this problem and will keep pursuing this direction in the future.

CONCLUSIONS
We proposed a new approach to extract structured data from Web pages. Although the problem has been studied by several researchers, existing techniques either are inaccurate or make several assumptions. Our method does not make these assumptions; it only requires that the page contain more than one data record. Our technique consists of two steps: 1) identifying data records without extracting data items in the records and 2) aligning corresponding data items from multiple data records and putting the data items in a database. We proposed an enhanced method based on visual cues for step 1. For step 2, we proposed a novel partial tree alignment technique to align corresponding data fields of multiple data records. Empirical results using a large number of Web pages demonstrated the effectiveness of the proposed technique.

REFERENCES
[1] A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. In SIGMOD, 2003.

[2] G. J. Barton and M. J. Sternberg. A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Mol. Biol., 198(2):327-337, 1987.

[3] D. Buttler, L. Liu, and C. Pu. A fully automated object extraction system for the world wide web. In ICDCS '01: Proceedings of the 21st International Conference on Distributed Computing Systems, page 361, Washington, DC, USA, 2001. IEEE Computer Society.

[4] H. Carrillo and D. Lipman. The multiple sequence alignment problem in biology. SIAM J. Appl. Math., 48(5):1073-1082, 1988.

[5] C. Chang and S. Lui. IEPAD: Information extraction based on pattern discovery. In WWW '01: Proceedings of the tenth international conference on World Wide Web. ACM Press, 2001.

[6] W. Chen. New algorithm for ordered tree-to-tree correction problem. J. Algorithms, 40(2):135-158, 2001.
