
iMiner Introduction

2002 Paula Matuszek

iMiner from IBM


Text Mining tool with multiple components. Text Analysis tools include:

Language Identification Tool
Feature Extraction Tool
Summarizer Tool
Topic Categorization Tool
Clustering Tools
http://www-4.ibm.com/software/data/iminer/fortext/index.html
http://www-4.ibm.com/software/data/iminer/fortext/presentations/im4t23engl/im4t23engl1.htm


iMiner for Text 2

Basic technology includes:


authority file with terms
heuristics for extracting additional terms
heuristics for extracting other features
Dictionaries with parts of speech
Partial parsing for part-of-speech tagging
Significance measure for terms: Information Quotient (IQ)

Knowledge base cannot be directly expanded by end user
Strong machine-learning component

Language Identification

Can analyze
an entire document
a text string input from the command line

Currently handles about a dozen languages
Can be trained; ML tool with input in language to be learned

Determines approximate proportion in bilingual documents



Language Identification

Basically treated as a categorization problem, where each language is a category
Training documents are processed to extract terms
Importance of terms for categorization is determined statistically
Dictionaries of weighted terms are used to determine the language of new documents
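The approach above can be sketched as a small program: per-language dictionaries of statistically weighted terms, with each language scored against a new document. The term lists and weights here are illustrative, not iMiner's.

```python
# Minimal sketch of language identification as categorization.
# Each language is a category with a dictionary of weighted terms
# (weights would come from statistical training on sample documents).

LANGUAGE_TERMS = {
    "english": {"the": 3.0, "and": 2.0, "of": 2.5},
    "german":  {"der": 3.0, "und": 2.0, "die": 2.5},
}

def identify_language(text):
    """Score each language by summing weights of its terms found in the text."""
    tokens = text.lower().split()
    scores = {}
    for lang, terms in LANGUAGE_TERMS.items():
        scores[lang] = sum(terms.get(tok, 0.0) for tok in tokens)
    # Return the highest-scoring language.
    return max(scores, key=scores.get)
```

With richer dictionaries, the same scoring can also estimate the proportion of each language in a bilingual document.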

Feature Extraction

Locate and categorize relevant features in text
Some features are themselves of interest
Also a starting point for other tools like classifiers and categorizers
Features may or may not be meaningful to a person
Goal is to find aspects of a document which somehow characterize it


Name Extraction

Extracting Proper Names


People, places, organizations
Valuable clues to subject of text

Dictionaries of canonical forms
Additional names extracted from documents


Parsing finds tokens

Additional parsing groups tokens into noun phrases


Rules identify tokens which are names
Variant groups are assigned a canonical name which is the most explicit variant found in document

Examples for Name Extraction

This subject is taught by Paula Matuszek.


Recognize Paula as a first name of a person
Recognize Matuszek as a capitalized word following a first name
Therefore Paula Matuszek is probably the name of a person.

This subject is taught by Villanova University.


Recognize Villanova as a probable name based on capitalization
Recognize University as a term which normally names an institution
Therefore Villanova University is probably the name of an institution.

This subject is taught by Howard University.


BOTH of these sets of rules could apply. So rules need to be prioritized to determine the more likely parse.
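A minimal sketch of prioritized rules for the "Howard University" ambiguity above; the dictionaries and priority values are illustrative, not iMiner's actual knowledge base.

```python
# Sketch of prioritized name-extraction rules. "Howard" is both a
# first name and part of an institution name, so rule priority
# decides the parse. Dictionaries and priorities are hypothetical.

FIRST_NAMES = {"Paula", "Howard"}
INSTITUTION_WORDS = {"University", "College", "Institute"}

def classify_name(phrase):
    """Apply rules in descending priority; the institution rule outranks
    the first-name rule, so 'Howard University' parses as an institution."""
    words = phrase.split()
    rules = [
        # (priority, test over the word list, label)
        (50, lambda w: w[0] in FIRST_NAMES and all(x[0].isupper() for x in w),
         "person"),
        (60, lambda w: w[-1] in INSTITUTION_WORDS, "institution"),
    ]
    for _, test, label in sorted(rules, key=lambda r: -r[0]):
        if test(words):
            return label
    return "unknown"
```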


Other Rule Examples

Dr., Mr., Ms. are titles, and titles followed by capitalized words frequently indicate names. If followed by only one word, it's the last name

Capitalized word followed by single capitalized letter followed by capitalized word is probably FN MI LN. Nouns can be names. Verbs can't.

Abbreviation/Acronym Extraction

Fruitful source of variants for names and terms
Existing dictionary of common terms
Name followed by ( [A-Z]+ ) probably gives an abbreviation
Conventions regarding word-internal case and prefixes: MSDOS matches MicroSoft DOS, GB matches gigabyte
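The "name followed by ( [A-Z]+ )" heuristic can be sketched with a regular expression; this is a simplified illustration, and the word-internal case conventions (MSDOS matching MicroSoft DOS) would need extra matching logic beyond it.

```python
import re

# Sketch of the abbreviation heuristic: an all-caps token in
# parentheses following a run of capitalized words is probably
# an abbreviation for that name.

ABBREV_RE = re.compile(r"((?:[A-Z][\w-]*\s+)+)\(\s*([A-Z]+)\s*\)")

def find_abbreviations(text):
    """Return (name, abbreviation) pairs found in the text."""
    return [(m.group(1).strip(), m.group(2))
            for m in ABBREV_RE.finditer(text)]
```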

Number Extraction

Useful primarily to improve performance of other extractors. Variant expressions of numbers


One thousand three hundred and twenty seven
thirteen twenty seven
1327

Other numeric expressions


twenty-seven percent
27%

Base forms are easy; most of the effort is in variants and determining canonical form based on rules

Date Extraction

Absolute and relative dates
Produces canonical form


March 27, 1997 → 1997/03/27
tomorrow → ref+0000/00/01
a year ago → ref-0001/00/00

Similar techniques and issues as for numbers
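Canonicalizing absolute and relative dates into the forms shown above can be sketched as follows; only a few patterns are handled here, and a real extractor needs many more rules (the slides later note that dates alone take roughly 200 rules).

```python
import re

# Sketch of date canonicalization into the slide's forms:
# 1997/03/27 for absolute dates, ref+/-offsets for relative ones.

MONTHS = {"january": 1, "february": 2, "march": 3, "april": 4, "may": 5,
          "june": 6, "july": 7, "august": 8, "september": 9,
          "october": 10, "november": 11, "december": 12}

def canonical_date(expr):
    """Return a canonical date string, or None if no rule matches."""
    expr = expr.lower().strip()
    if expr == "tomorrow":
        return "ref+0000/00/01"          # one day after the reference date
    if expr == "a year ago":
        return "ref-0001/00/00"          # one year before the reference date
    m = re.match(r"(\w+) (\d{1,2}), (\d{4})$", expr)
    if m and m.group(1) in MONTHS:
        return f"{m.group(3)}/{MONTHS[m.group(1)]:02d}/{int(m.group(2)):02d}"
    return None
```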


Money Extraction

Recognizes currencies and produces canonical representation
Uses number extractor
Examples:
twenty-seven dollars → 27.000 dollars USA
DM 27 → 27.000 marks Germany


Term Extraction

Identify other important terms found in text
Other major lexical clue for subject, especially if repeated
May use output from other extractors in rules

Recognizes common lexical variants and reduces to canonical form -- stemming


Machine learning is much more important here


Term Extraction

Dictionary with parts of speech info for English

Pattern matching to find noun phrase structure typical of technical terms.


Feature repositories:
Authority dictionary: canonical forms, variants, correct feature map. Used BEFORE heuristics
Residue dictionary: complex feature type (name, term, pattern). Used AFTER heuristics

Authority and residue dictionaries are trained



Information Quotient

Each feature (word, phrase, name) extracted is assigned an information quotient
Represents the significance of the feature in the document
Based on TF-IDF (Term Frequency-Inverse Document Frequency), position information, and stop words
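The TF-IDF component of a significance score like the Information Quotient can be sketched as below; the exact IQ formula (which also folds in position information) is not given in the slides, so this is only the standard TF-IDF part.

```python
import math

# Sketch of TF-IDF: term frequency in the document times the log
# inverse document frequency across the collection. Documents are
# represented as lists of tokens.

def tf_idf(term, doc, collection):
    """Higher scores mean the term is frequent here but rare elsewhere."""
    tf = doc.count(term)
    df = sum(1 for d in collection if term in d)   # document frequency
    if df == 0:
        return 0.0
    return tf * math.log(len(collection) / df)
```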


Feature Extraction Demo

Tool may be used for highlighting, etc., on documents to be displayed
Features extracted also form basis for other tools
Note that this is not full information extraction, although it is a starting point
http://www-4.ibm.com/software/data/iminer/fortext/extract/extractDemo.html


Other Features

Feature Extractor also identifies other features used by other text analysis tools:
sentence boundaries
paragraph boundaries
document tags
document structure
collection statistics

Summarizer Tools

Collection of sentences extracted from document
Characteristic of document content
Works best for well-structured documents
Can specify length
Must apply feature extraction first


Summarizer

Feature extractor run first

Words are ranked


Sentences are ranked
Highest ranked sentences are chosen
Configurable: for length of sentence, for word salience

Works best when document is part of a collection



Word Ranking

A word is scored if it:
appears in structures such as titles and captions
occurs more often in document than in collection (word salience)
occurs more than once in a document

Score is
salience if > threshold: tf*idf (by default)
plus a weighting factor if the word occurs in a title, heading, or caption


Sentence Ranking

Scored according to relevance in document and position in document. Sum of


Scores of individual words
Proximity of sentence to beginning of its paragraph
Bonus for final sentence in long paragraph and final paragraph in long documents
Proximity of paragraph to beginning of document

All configurable
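The word- and sentence-ranking pipeline above can be sketched as follows; the position bonus and weights are illustrative choices, not iMiner's actual parameters.

```python
# Sketch of extractive summarization: score each sentence by summing
# word scores, add a bonus for proximity to the document start, and
# keep the top-ranked sentences in their original order.

def summarize(sentences, word_scores, n=1, first_sentence_bonus=2.0):
    """sentences: list of token lists. Returns top-n sentences as
    strings, in original document order."""
    scored = []
    for i, sent in enumerate(sentences):
        score = sum(word_scores.get(w, 0.0) for w in sent)
        if i == 0:                      # proximity to beginning of document
            score += first_sentence_bonus
        scored.append((score, i, sent))
    top = sorted(scored, reverse=True)[:n]
    return [" ".join(s) for _, _, s in sorted(top, key=lambda t: t[1])]
```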

Summarization Examples

Examples from IBM documentation

http://www-4.ibm.com/software/data/iminer/fortext/summarize/summarizeDemo.html


Some Common Statistical Measures


(a brief digression)

TF x IDF

Pairwise and multiple-word phrase counts


Some other common statistical measures:
information gain: how many bits of information we gain by knowing that a term is present in a document
mutual information: how much knowing that a term is present tells us about a document's category
term strength: likelihood that a term will occur in both of two closely-related documents


Topic Categorization Tool

Assign documents to predetermined categories
Must first be trained


Training tool creates category scheme

Dictionary that stores significant vocabulary statistics

Output is list of possible categories and probabilities for each document
Can filter initial schema for faster processing

Features Used for Categorizing

Linguistic Features
Uses the features extracted by Feature Extraction tool

N-Grams
letter groupings and short words.

Can be used for non-English, because it doesn't depend on heuristics


Used by Language categorizer


Document Categorizing

Individual document is analyzed for features Features are compared to those determined for categories:
terms present/absent
IQ of terms
frequencies
document structure


Document Categorization

Important issue is determining which features! High dimensionality is expensive. Ideally you want a small set of features which are:
present in all documents of one category
absent in all other documents

In actuality, not that clean. So:
use features with relatively high separation
eliminate features which correlate very highly with other features (to reduce dimension space)
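Picking features with high separation can be sketched as below: keep terms whose presence rate differs sharply between a category's documents and the rest of the collection. The threshold value is illustrative.

```python
# Sketch of discriminative feature selection. Documents are sets of
# terms; a term is kept if its presence-rate gap between the category
# and the rest of the collection exceeds the threshold.

def discriminative_features(category_docs, other_docs, threshold=0.5):
    """Return terms that separate the category from other documents."""
    vocab = set().union(*category_docs, *other_docs)
    chosen = []
    for term in vocab:
        in_cat = sum(term in d for d in category_docs) / len(category_docs)
        in_other = sum(term in d for d in other_docs) / len(other_docs)
        if in_cat - in_other > threshold:
            chosen.append(term)
    return sorted(chosen)
```

Dropping one of each pair of highly correlated surviving features would then further reduce the dimension space.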


Categorization Demo

Typically categorization is a component in a system which then does something with the categorized documents
Ambiguous documents (not assigned to any one category with high probability) often indicate a new category evolving.
http://www-4.ibm.com/software/data/iminer/fortext/categorize/categorize.


Clustering Tools

Organize documents without pre-existing categories

Hierarchical clustering
creates a tree where each leaf is a document, each cluster is positioned under the most similar cluster one step up

Binary Relational clustering


Creates a flat set of clusters with each document assigned to its best fit and relations between clusters captured

Hierarchical Clustering

Input is a set of documents

Output is a dendrogram
Root
Intermediate levels
Leaves, which link to actual documents

Slicing is used to create a manageable HTML tree



Steps in Hierarchical Clustering

Select linguistic preprocessing technique: determines similarity
Cluster documents: create dendrogram based on similarity
Define shape of tree with slicing technique and produce HTML output


Linguistic Preprocessing

Determining similarity between documents and clusters: how do we define similar?
Lexical affinity: does not require any preprocessing
Linguistic features: requires that feature extractor be run first

iMiner is either/or; you cannot combine the two methods of determining similarity


Clustering: Lexical Affinities

Lexical affinities: groups of words which appear frequently close together


created on the fly during a clustering task
word pairs
stemming and other morphological analysis
stop words

Results in documents with textual similarity being clustered together
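Finding lexical affinities on the fly can be sketched as counting word pairs that co-occur within a small window; the window size, minimum count, and stop-word list are illustrative.

```python
from collections import Counter

# Sketch of lexical affinities: pairs of content words that appear
# close together repeatedly. Stop words are filtered first.

STOP_WORDS = {"the", "a", "of", "and", "in"}

def lexical_affinities(tokens, window=5, min_count=2):
    """Return word pairs co-occurring within the window at least
    min_count times, each pair as a sorted tuple."""
    pairs = Counter()
    content = [t for t in tokens if t not in STOP_WORDS]
    for i, w in enumerate(content):
        for other in content[i + 1 : i + window]:
            pairs[tuple(sorted((w, other)))] += 1
    return [p for p, c in pairs.items() if c >= min_count]
```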



Clustering: Linguistic Features

Linguistic features: Use features extracted by the feature extraction tool


Names of organizations
Domain technical terms
Names of individuals

Can allow focusing on specific areas of interest
Best if you have some idea what you are interested in

Hierarchical Clustering Steps

Put each document in a cluster, characterized by its lexical or linguistic features
Merge the two most similar clusters
Continue till all clusters are merged
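The agglomerative procedure above can be sketched as follows. Jaccard overlap of feature sets is an illustrative similarity choice, not iMiner's actual measure.

```python
# Sketch of bottom-up hierarchical clustering: one cluster per
# document, repeatedly merging the most similar pair until one
# cluster (the dendrogram root) remains.

def jaccard(a, b):
    """Set-overlap similarity between two feature sets."""
    return len(a & b) / len(a | b)

def agglomerate(docs):
    """docs: list of feature sets. Returns the merge history as a
    list of (i, j) cluster-index pairs; merged clusters get new indices."""
    clusters = [set(d) for d in docs]
    active = list(range(len(clusters)))
    merges = []
    while len(active) > 1:
        # Find the most similar pair of active clusters.
        i, j = max(((a, b) for a in active for b in active if a < b),
                   key=lambda p: jaccard(clusters[p[0]], clusters[p[1]]))
        clusters.append(clusters[i] | clusters[j])   # merged cluster
        active = [k for k in active if k not in (i, j)] + [len(clusters) - 1]
        merges.append((i, j))
    return merges
```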


Hierarchical Clustering: Slicing


The dendrogram is often too big to be useful
Slicing reduces the size of the tree by merging clusters if they are similar enough:
top threshold: collapse any tree which exceeds it
bottom threshold: group under root any cluster which is lower
remaining clusters make a new tree
# of steps sets depth of tree

Typical Slicing Parameters

Bottom
start around 5% or 10% similar
90% would mean only virtually identical documents get grouped

Top
good default is 90%
if you want really identical documents, set to 100%

Depth:
Typically 2 to 10
Two would give you duplicates and the rest

Binary Relational Clustering

Binary Relational clustering


Creates a flat set of clusters
Each document assigned to its best fit
Relations between clusters captured

Similarity based on features extracted by Feature Extraction tool


Relational Clustering: Document Similarity

Based on comparison of descriptors


Frequent descriptors across collection given more weight: priority to wide topics
Rare descriptors given more weight: large number of very focused clusters
Both, with rare descriptors given slightly higher weight: relatively focused topics but fewer clusters

Descriptors are binary: present or absent
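The weighting options above can be sketched as follows: descriptors are binary (present or absent), and shared descriptors contribute weight based on how common they are across the collection. The weighting formulas are illustrative; descriptors are assumed to occur somewhere in the collection.

```python
# Sketch of descriptor-based document similarity with configurable
# weighting: 'frequent' favors wide topics, 'rare' favors focused ones.

def descriptor_similarity(d1, d2, collection, mode="rare"):
    """d1, d2: sets of descriptors. Sum weights of shared descriptors,
    weighted by collection frequency or rarity."""
    n = len(collection)
    def weight(desc):
        df = sum(desc in d for d in collection)   # document frequency
        return df / n if mode == "frequent" else n / df
    return sum(weight(desc) for desc in d1 & d2)
```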



Relational Clustering

Descriptors are features extracted by feature extraction tool
Similarity threshold: at 100%, only identical documents are clustered
Max # of clusters: overrides similarity threshold to get the number of clusters specified


Binary Relational Clustering Outputs

Outputs are
clusters: topics found, importance of topics, degree of similarity in cluster
links: sets of common descriptors between clusters


Clustering Demo

Patents from class 395: information processing system organization
10% for top, 1% for bottom, total of 5 slices

lexical affinity
http://www-4.ibm.com/software/data/iminer/fortext/cluster/clusterDemo.html


Summary

iMiner has a rich set of text mining tools
Product is well-developed, stable
No explicit user-modifiable knowledge base -- uses automated techniques and built-in KB to extract relevant information
Can be deployed to new domains without a lot of additional work
BUT not as effective in many domains as a tool with a good KB
No real information extraction capability

Information Extraction Overview

Given a body of text: extract from it some well-defined set of information
MUC conferences
Typically draws heavily on NLP
Three main components:
Domain knowledge base Extraction Engine Knowledge model

Information Extraction Domain Knowledge Base

Terms: enumerated list of strings which are all members of some class.
January, February
Smith, Wong, Martinez, Matuszek
lysine, alanine, cysteine

Classes: general categories of terms


Monthnames, Last Names, Amino acids
Capitalized nouns, Verb Phrases

Domain Knowledge Base

Rules: LHS, RHS, salience
Left Hand Side (LHS): a pattern to be matched, written as relationships among terms and classes
Right Hand Side (RHS): an action to be taken when the pattern is found
Salience: priority of this rule (weight, strength, confidence)

Some Rule Examples:


<Monthname> <Year> => <Date>
<Date> <Name> => print Birthdate, <Name>, <Date>
<Name> <Address> => create address database record
<daynumber> / <monthnumber> / <year> => create date database record (50)
<monthnumber> / <daynumber> / <year> => create date database record (60)
<capitalized noun> <single letter> . <capitalized noun> => <Name>
<noun phrase> <to be verb> <noun phrase> => create relationship database record
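The LHS/RHS/salience format above can be sketched as a tiny rule engine over class-tagged tokens. The classes, patterns, and salience values are illustrative, not a real MUC-style system.

```python
import re

# Sketch of a rule engine: tokens are tagged with classes, and rules
# match patterns of class tags in descending salience order.

MONTHNAMES = {"January", "February", "March"}

def tag(token):
    """Assign a class to a token."""
    if token in MONTHNAMES:
        return "Monthname"
    if re.fullmatch(r"\d{4}", token):
        return "Year"
    return "Token"

RULES = [
    # (salience, LHS pattern over class tags, RHS label)
    (50, ("Monthname", "Year"), "Date"),
]

def extract(tokens):
    """Return (label, matched text) pairs for every rule match."""
    tags = [tag(t) for t in tokens]
    found = []
    for salience, lhs, rhs in sorted(RULES, reverse=True):
        for i in range(len(tags) - len(lhs) + 1):
            if tuple(tags[i:i + len(lhs)]) == lhs:
                found.append((rhs, " ".join(tokens[i:i + len(lhs)])))
    return found
```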

Generic KB

Generic KB: KB likely to be useful in many domains


names
dates
places
organizations

Almost all systems have one
Limited by cost of development: it takes about 200 rules to define dates reasonably well, for instance.

Domain-specific KB

We mostly can't afford to build a KB for the entire world. However, most applications are fairly domain-specific. Therefore we build domain-specific KBs which identify the kind of information we are interested in.
protein-protein interactions
airline flights
terrorist activities

Domain-specific KBs

Typically start with the generic KBs
Add terminology
Figure out what kinds of information you want to extract
Add rules to identify it
Test against documents which have been human-scored to determine precision and recall for individual items

Knowledge Model

We aren't looking for documents, we are looking for information. What information?
Typically we have a knowledge model or schema which identifies the information components we want and their relationships
Typically looks very much like a DB schema or object definition
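A knowledge model as an object definition can be sketched as below, using the personal-records example from the slides that follow; the field types are illustrative assumptions.

```python
from dataclasses import dataclass

# Sketch of a knowledge model schema for personal records:
# name (first, middle initial, last), birthdate, and address.

@dataclass
class Name:
    first: str
    middle_initial: str
    last: str

@dataclass
class PersonalRecord:
    name: Name
    birth_month: int
    birth_day: int
    birth_year: int
    address: str
```

An extraction engine would fill instances of this schema from text, much like populating rows in a database.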

Knowledge Model Examples

Personal records
Name
First name
Middle initial
Last name

Birthdate
Month
Day
Year

Address


Knowledge Model Examples

Protein Inhibitors
Protein name (class?)
Compound name (class?)
Pointer to source
Cache of text
Offset into text


Knowledge Model Examples

Airline Flight Record


Airline
Flight

Number
Origin
Destination
Date
Status
departure time
arrival time


Summary

Text mining below the document level
NOT typically interactive, because it's slow (1 to 100 meg of text/hr)
Typically builds up a DB of information which can then be queried
Uses a combination of term- and rule-driven analysis and NLP parsing
AeroText: very good system developed by LMCO; we will get a complete demo on March 26
