Text mining tool with multiple components. Text analysis tools include:
Language Identification Tool
Knowledge base cannot be directly expanded by end user
Strong machine-learning component
2002 Paula Matuszek
Language Identification
Can analyze
an entire document
a text string input from the command line
Currently handles about a dozen languages
Can be trained; ML tool with input in language to be learned
Language Identification
Basically treated as a categorization problem, where each language is a category
Training documents are processed to extract terms
Importance of terms for categorization is determined statistically
Dictionaries of weighted terms are used to determine language of new documents
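The categorization scheme above can be sketched in a few lines: each language is a category backed by a dictionary of weighted terms, and a new document goes to the highest-scoring category. The dictionaries and weights below are invented for illustration; in the real tool they are learned statistically from training documents.

```python
# Each language category has a weighted-term dictionary (illustrative
# values; real weights come from training).
LANGUAGE_DICTS = {
    "english": {"the": 3.0, "and": 2.0, "of": 2.5, "is": 1.5},
    "german":  {"der": 3.0, "und": 2.0, "die": 2.5, "ist": 1.5},
}

def identify_language(text):
    """Score text against each language's weighted-term dictionary."""
    words = text.lower().split()
    scores = {lang: sum(weights.get(w, 0.0) for w in words)
              for lang, weights in LANGUAGE_DICTS.items()}
    # The best-scoring category is the predicted language.
    return max(scores, key=scores.get)
```
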
Feature Extraction
Locate and categorize relevant features in text
Some features are themselves of interest
Also starting point for other tools like classifiers, categorizers
Features may or may not be meaningful to a person
Goal is to find aspects of a document which somehow characterize it
Name Extraction
Dr., Mr., Ms. are titles, and titles followed by capitalized words frequently indicate names. If followed by only one word, it's the last name
Capitalized word followed by single capitalized letter followed by capitalized word is probably FN MI LN. Nouns can be names; verbs can't.
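The two heuristics above (title + capitalized words, and FN MI LN) translate directly into regular expressions. This is a minimal sketch, not the actual extractor; the patterns and title list are assumptions.

```python
import re

# Title followed by capitalized words frequently indicates a name.
TITLE_NAME = re.compile(r"\b(?:Dr|Mr|Ms|Mrs)\.\s+((?:[A-Z][a-z]+\s?)+)")
# Capitalized word + single capital + period + capitalized word: FN MI LN.
FN_MI_LN = re.compile(r"\b([A-Z][a-z]+)\s+([A-Z])\.\s+([A-Z][a-z]+)\b")

def extract_names(text):
    names = [m.group(1).strip() for m in TITLE_NAME.finditer(text)]
    names += [" ".join(m.groups()) for m in FN_MI_LN.finditer(text)]
    return names
```
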
Abbreviation/Acronym Extraction
Fruitful source of variants for names and terms
Existing dictionary of common terms
Name followed by ( [A-Z]+ ) probably gives an abbreviation
Conventions regarding word-internal case and prefixes: MSDOS matches MicroSoft DOS, GB matches gigabyte
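The "name followed by ( [A-Z]+ )" convention above can be sketched with a single regex that maps each acronym to the capitalized phrase preceding it. The pattern is an assumption for illustration, not the tool's actual rule set.

```python
import re

# A run of capitalized words immediately followed by "( CAPS )" probably
# defines an acronym for that phrase.
ACRONYM = re.compile(r"((?:[A-Z][A-Za-z-]*\s+)+)\(\s*([A-Z]{2,})\s*\)")

def extract_acronyms(text):
    """Map each acronym to the capitalized phrase preceding it."""
    return {m.group(2): m.group(1).strip() for m in ACRONYM.finditer(text)}
```
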
Number Extraction
Base forms are easy; most of the effort is variants and determining canonical form based on rules
Date Extraction
Money Extraction
Recognizes currencies and produces canonical representation
Uses number extractor
Examples:
twenty-seven dollars → 27.000 dollars USA
DM 27 → 27.000 marks Germany
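A toy version of the money extractor's canonicalization step: number words plus a currency table, producing an (amount, currency code) pair. The tables below are tiny illustrative stand-ins for a full number grammar.

```python
# Illustrative number-word and currency tables (assumptions, not iMiner's).
UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
CURRENCIES = {"dollars": "USD", "dollar": "USD",
              "marks": "DEM", "mark": "DEM", "dm": "DEM"}

def canonical_money(phrase):
    """Return (amount, currency code) for a simple money phrase."""
    amount, currency = 0, None
    for w in phrase.lower().replace("-", " ").split():
        if w in TENS:
            amount += TENS[w]
        elif w in UNITS:
            amount += UNITS[w]
        elif w.isdigit():
            amount += int(w)
        elif w in CURRENCIES:
            currency = CURRENCIES[w]
    return amount, currency
```
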
Term Extraction
Identify other important terms found in text
Other major lexical clue for subject, especially if repeated
May use output from other extractors in rules
Term Extraction
Information Quotient
Each feature (word, phrase, name) extracted is assigned an information quotient
Represents the significance of the feature in the document
TF-IDF: Term Frequency-Inverse Document Frequency
Position information
Stop words
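TF-IDF, the core ingredient of the information quotient, can be written down directly: a term's weight is its frequency in the document times the log of the inverse of its document frequency across the collection, so terms frequent here but rare elsewhere score highest.

```python
import math

# TF-IDF weight of a term in one document, relative to a collection.
# doc and collection entries are token lists.
def tf_idf(term, doc, collection):
    tf = doc.count(term)                               # term frequency
    df = sum(1 for d in collection if term in d)       # document frequency
    if df == 0:
        return 0.0
    return tf * math.log(len(collection) / df)         # tf * idf
```

Note that a stop word appearing in every document gets idf = log(1) = 0, which is why stop words drop out of the quotient automatically.
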
Tool may be used for highlighting, etc., on documents to be displayed
Features extracted also form basis for other tools
Note that this is not full information extraction, although it is a starting point
http://www-4.ibm.com/software/data/iminer/fortext/extract/extractDemo.html
Other Features
Feature Extractor also identifies other features used by other text analysis tools:
sentence boundaries
paragraph boundaries
document tags
document structure
collection statistics
Summarizer Tools
Collection of sentences extracted from document
Characteristic of document content
Works best for well-structured documents
Can specify length
Must apply feature extraction first
Summarizer
Word Ranking
Words are scored if they:
Appear in structures such as titles and captions
Occur more often in document than in collection (word salience)
Score is:
salience if above threshold; tf*idf by default
Sentence Ranking
All configurable
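The word-ranking and sentence-ranking steps above can be sketched as extractive summarization: score each sentence by its words' scores, keep the top n, and return them in document order. Plain word frequency stands in here for iMiner's salience/tf*idf scoring.

```python
# Extractive summarization sketch: sentence score = sum of word scores.
# Word scoring here is plain frequency, a stand-in for salience/tf*idf.
def summarize(sentences, n=2):
    words = [w.lower() for s in sentences for w in s.split()]
    freq = {w: words.count(w) for w in set(words)}
    def score(s):
        return sum(freq[w.lower()] for w in s.split())
    top = sorted(sentences, key=score, reverse=True)[:n]
    # Return the chosen sentences in original document order.
    return [s for s in sentences if s in top]
```
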
Summarization Examples
http://www-4.ibm.com/software/data/iminer/fortext/summarize/summarizeDemo.html
TF x IDF
term strength: likelihood that a term will occur in both of two closely-related documents
Output is list of possible categories and probabilities for each document
Can filter initial schema for faster processing
Linguistic Features
Uses the features extracted by Feature Extraction tool
N-Grams
letter groupings and short words.
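Character n-grams (the "letter groupings" above) are just overlapping length-n slices of the text; a minimal generator:

```python
# Overlapping character n-grams: cheap, language-revealing features.
def char_ngrams(text, n=3):
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```
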
Document Categorizing
Individual document is analyzed for features Features are compared to those determined for categories:
terms present/absent
IQ of terms
frequencies
document structure
Document Categorization
Important issue is determining which features! High dimensionality is expensive. Ideally you want a small set of features which are
present in all documents of one category
absent in all other documents
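The comparison step can be sketched as overlap between a document's feature set and each category's profile. Real profiles carry IQ weights and frequencies; plain set overlap stands in here, and the profile contents are invented for illustration.

```python
# Categorize by comparing document features to each category profile.
def categorize(doc_features, category_profiles):
    """Return (category, overlap score) pairs, best first."""
    scored = [(cat, len(doc_features & feats) / len(feats))
              for cat, feats in category_profiles.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)
```
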
Categorization Demo
Typically categorization is a component in a system which then does something with the categorized documents Ambiguous documents (not assigned to any one category with high probability) often indicate a new category evolving.
http://www-4.ibm.com/software/data/iminer/fortext/categorize/categorize.
Clustering Tools
Hierarchical clustering
creates a tree where each leaf is a document, each cluster is positioned under the most similar cluster one step up
Hierarchical Clustering
Output is a dendrogram
Root
Intermediate levels
Leaves: link to actual documents
Select linguistic preprocessing technique: determines similarity
Cluster documents: create dendrogram based on similarity
Define shape of tree with slicing technique and produce HTML output
Linguistic Preprocessing
Determining similarity between documents and clusters: how do we define similar?
Lexical affinity: does not require any preprocessing
Linguistic features: requires that feature extractor be run first
iMiner is either/or; you cannot combine the two methods of determining similarity
Can allow focusing on specific areas of interest Best if you have some idea what you are interested in
Put each document in a cluster, characterized by its lexical or linguistic features
Merge the two most similar clusters
Continue till all clusters are merged
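The agglomerative procedure above fits in a short function: start with singleton clusters, repeatedly merge the most similar pair, and record the merge order as a nested tuple (a dendrogram). Jaccard similarity on feature sets stands in for the lexical/linguistic similarity measures.

```python
# Jaccard similarity between two feature sets, a stand-in for the
# lexical-affinity / linguistic-feature measures.
def jaccard(a, b):
    return len(a & b) / len(a | b)

def agglomerate(docs):
    """docs: list of feature sets. Returns a nested-tuple dendrogram."""
    clusters = [(i, fs) for i, fs in enumerate(docs)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = jaccard(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        # Merge the two most similar clusters into one node.
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]
```
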
The dendrogram is too big to be useful. Slicing reduces the size of the tree by merging clusters if they are similar enough.
top threshold: collapse any tree which exceeds it
bottom threshold: group under root any cluster which is lower
Remaining clusters make a new tree
# of steps sets depth of tree
Bottom
start around 5% or 10% similar
90% would mean only virtually identical documents get grouped
Top
good default is 90%; if you want only really identical documents, set to 100%
Depth:
Typically 2 to 10. Two would give you duplicates and the rest.
Relational Clustering
Descriptors are features extracted by feature extraction tool
Similarity threshold: at 100% only identical documents are clustered
Max # of clusters: overrides similarity threshold to get the number of clusters specified
Outputs are
clusters: topics found, importance of topics, degree of similarity in cluster
links: sets of common descriptors between clusters
Clustering Demo
Patents from class 395: information processing system organization
10% for top, 1% for bottom, total of 5 slices
lexical affinity
http://www-4.ibm.com/software/data/iminer/fortext/cluster/clusterDemo.html
Summary
iMiner has a rich set of text mining tools
Product is well-developed, stable
No explicit user-modifiable knowledge base -- uses automated techniques and built-in KB to extract relevant information
Can be deployed to new domains without a lot of additional work
BUT not as effective in many domains as a tool with a good KB
No real information extraction capability
Given a body of text: extract from it some well-defined set of information
MUC conferences
Typically draws heavily on NLP
Three main components:
Domain knowledge base
Extraction engine
Knowledge model
Terms: enumerated list of strings which are all members of some class.
January, February
Smith, Wong, Martinez, Matuszek
lysine, alanine, cysteine
Rules: LHS, RHS, salience
Left Hand Side (LHS): a pattern to be matched, written as relationships among terms and classes
Right Hand Side (RHS): an action to be taken when the pattern is found
Salience: priority of this rule (weight, strength, confidence)
<Monthname> <Year> => <Date>
<Date> <Name> => print Birthdate, <Name>, <Date>
<Name> <Address> => create address database record
<daynumber> / <monthnumber> / <year> => create date database record (50)
<monthnumber> / <daynumber> / <year> => create date database record (60)
<capitalized noun> <single letter> . <capitalized noun> => <Name>
<noun phrase> <to be verb> <noun phrase> => create relationship database record
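A toy engine for rules like these pairs a pattern (LHS) with an action label (RHS) and a salience, firing higher-salience rules first. The two date rules below echo the (50)/(60) examples; the regexes and action labels are illustrative assumptions.

```python
import re

# (salience, LHS pattern, RHS action label); higher salience fires first.
RULES = [
    (60, re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b"),
     "create date record (month/day/year)"),
    (50, re.compile(r"\b([A-Z][a-z]+) (\d{4})\b"),
     "<Monthname> <Year> => <Date>"),
]

def apply_rules(text):
    """Fire rules in descending salience; return (RHS, matched text) pairs."""
    fired = []
    for salience, lhs, rhs in sorted(RULES, key=lambda r: r[0], reverse=True):
        for m in lhs.finditer(text):
            fired.append((rhs, m.group(0)))
    return fired
```
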
Generic KB
Almost all systems have one
Limited by cost of development: it takes about 200 rules to define dates reasonably well, for instance
Domain-specific KB
We mostly can't afford to build a KB for the entire world. However, most applications are fairly domain-specific. Therefore we build domain-specific KBs which identify the kind of information we are interested in.
Protein-protein interactions
airline flights
terrorist activities
Domain-specific KBs
Typically start with the generic KBs
Add terminology
Figure out what kinds of information you want to extract
Add rules to identify it
Test against documents which have been human-scored to determine precision and recall for individual items
Knowledge Model
We aren't looking for documents, we are looking for information. What information?
Typically we have a knowledge model or schema which identifies the information components we want and their relationships
Typically looks very much like a DB schema or object definition
Personal records
Name
First Name
Middle Initial
Last Name
Birthdate
Month
Day
Year
Address
Protein Inhibitors
Protein name (class?)
Compound name (class?)
Pointer to source
Cache of text
Offset into text
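The two schemas above look very much like record definitions, and can be sketched as such. Field names follow the slides; the types are assumptions.

```python
from dataclasses import dataclass

# Personal-records schema from the slides.
@dataclass
class Name:
    first: str
    middle_initial: str
    last: str

@dataclass
class PersonRecord:
    name: Name
    birth_month: int
    birth_day: int
    birth_year: int
    address: str

# Protein-inhibitors schema from the slides.
@dataclass
class ProteinInhibitor:
    protein_name: str
    compound_name: str
    source: str       # pointer to source document
    text_cache: str   # cache of text
    offset: int       # offset into text
```
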
Summary
Text mining below the document level
NOT typically interactive, because it's slow (1 to 100 meg of text/hr)
Typically builds up a DB of information which can then be queried
Uses a combination of term- and rule-driven analysis and NLP parsing
AeroText: very good system developed by LMCO; we will get a complete demo on March 26.