Text mining tool with multiple components. Text analysis tools include:
Language Identification Tool
Knowledge base cannot be directly expanded by end user
Strong machine-learning component
2002 Paula Matuszek
Language Identification
Can analyze
an entire document
a text string input from the command line
Currently handles about a dozen languages
Can be trained; ML tool with input in language to be learned
Language Identification
Basically treated as a categorization problem, where each language is a category
Training documents are processed to extract terms
Importance of terms for categorization is determined statistically
Dictionaries of weighted terms are used to determine language of new documents
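The categorization scheme above can be sketched in a few lines: each language is a category backed by a dictionary of weighted terms, and a new document goes to the highest-scoring category. The dictionaries and weights below are invented for illustration; in the real tool they are learned statistically from training documents.

```python
# Each language category has a weighted-term dictionary (illustrative
# values; real weights come from training).
LANGUAGE_DICTS = {
    "english": {"the": 3.0, "and": 2.0, "of": 2.5, "is": 1.5},
    "german":  {"der": 3.0, "und": 2.0, "die": 2.5, "ist": 1.5},
}

def identify_language(text):
    """Score text against each language's weighted-term dictionary."""
    words = text.lower().split()
    scores = {lang: sum(weights.get(w, 0.0) for w in words)
              for lang, weights in LANGUAGE_DICTS.items()}
    # The best-scoring category is the predicted language.
    return max(scores, key=scores.get)
```
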
Feature Extraction
Locate and categorize relevant features in text
Some features are themselves of interest
Also starting point for other tools like classifiers, categorizers
Features may or may not be meaningful to a person
Goal is to find aspects of a document which somehow characterize it
Name Extraction
Dr., Mr., Ms. are titles, and titles followed by capitalized words frequently indicate names. If followed by only one word, it's the last name
Capitalized word followed by single capitalized letter followed by capitalized word is probably FN MI LN. Nouns can be names; verbs can't.
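The two heuristics above (title + capitalized words, and FN MI LN) translate directly into regular expressions. This is a minimal sketch, not the actual extractor; the patterns and title list are assumptions.

```python
import re

# Title followed by capitalized words frequently indicates a name.
TITLE_NAME = re.compile(r"\b(?:Dr|Mr|Ms|Mrs)\.\s+((?:[A-Z][a-z]+\s?)+)")
# Capitalized word + single capital + period + capitalized word: FN MI LN.
FN_MI_LN = re.compile(r"\b([A-Z][a-z]+)\s+([A-Z])\.\s+([A-Z][a-z]+)\b")

def extract_names(text):
    names = [m.group(1).strip() for m in TITLE_NAME.finditer(text)]
    names += [" ".join(m.groups()) for m in FN_MI_LN.finditer(text)]
    return names
```
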
Abbreviation/Acronym Extraction
Fruitful source of variants for names and terms
Existing dictionary of common terms
Name followed by ( [A-Z]+ ) probably gives an abbreviation
Conventions regarding word-internal case and prefixes: MSDOS matches MicroSoft DOS, GB matches gigabyte
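The "name followed by ( [A-Z]+ )" convention above can be sketched with a single regex that maps each acronym to the capitalized phrase preceding it. The pattern is an assumption for illustration, not the tool's actual rule set.

```python
import re

# A run of capitalized words immediately followed by "( CAPS )" probably
# defines an acronym for that phrase.
ACRONYM = re.compile(r"((?:[A-Z][A-Za-z-]*\s+)+)\(\s*([A-Z]{2,})\s*\)")

def extract_acronyms(text):
    """Map each acronym to the capitalized phrase preceding it."""
    return {m.group(2): m.group(1).strip() for m in ACRONYM.finditer(text)}
```
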
Number Extraction
Base forms are easy; most of the effort is variants and determining canonical form based on rules
Date Extraction
Money Extraction
Recognizes currencies and produces canonical representation
Uses number extractor
Examples:
twenty-seven dollars → 27.000 dollars USA
DM 27 → 27.000 marks Germany
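A toy version of the money extractor's canonicalization step: number words plus a currency table, producing an (amount, currency code) pair. The tables below are tiny illustrative stand-ins for a full number grammar.

```python
# Illustrative number-word and currency tables (assumptions, not iMiner's).
UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
CURRENCIES = {"dollars": "USD", "dollar": "USD",
              "marks": "DEM", "mark": "DEM", "dm": "DEM"}

def canonical_money(phrase):
    """Return (amount, currency code) for a simple money phrase."""
    amount, currency = 0, None
    for w in phrase.lower().replace("-", " ").split():
        if w in TENS:
            amount += TENS[w]
        elif w in UNITS:
            amount += UNITS[w]
        elif w.isdigit():
            amount += int(w)
        elif w in CURRENCIES:
            currency = CURRENCIES[w]
    return amount, currency
```
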
Term Extraction
Identify other important terms found in text
Other major lexical clue for subject, especially if repeated
May use output from other extractors in rules
Term Extraction
Information Quotient
Each feature (word, phrase, name) extracted is assigned an information quotient
Represents the significance of the feature in the document
TF-IDF: Term Frequency-Inverse Document Frequency
Position information
Stop words
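TF-IDF, the core ingredient of the information quotient, can be written down directly: a term's weight is its frequency in the document times the log of the inverse of its document frequency across the collection, so terms frequent here but rare elsewhere score highest.

```python
import math

# TF-IDF weight of a term in one document, relative to a collection.
# doc and collection entries are token lists.
def tf_idf(term, doc, collection):
    tf = doc.count(term)                               # term frequency
    df = sum(1 for d in collection if term in d)       # document frequency
    if df == 0:
        return 0.0
    return tf * math.log(len(collection) / df)         # tf * idf
```

Note that a stop word appearing in every document gets idf = log(1) = 0, which is why stop words drop out of the quotient automatically.
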
Tool may be used for highlighting, etc., on documents to be displayed
Features extracted also form basis for other tools
Note that this is not full information extraction, although it is a starting point
http://www-4.ibm.com/software/data/iminer/fortext/extract/extractDemo.html
Other Features
Feature Extractor also identifies other features used by other text analysis tools:
sentence boundaries
paragraph boundaries
document tags
document structure
collection statistics
Summarizer Tools
Collection of sentences extracted from document
Characteristic of document content
Works best for well-structured documents
Can specify length
Must apply feature extraction first
Summarizer
Word Ranking
Words are scored if they:
Appear in structures such as titles and captions
Occur more often in document than in collection (word salience)
Score is:
salience if above threshold; tf*idf by default
Sentence Ranking
All configurable
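The word-ranking and sentence-ranking steps above can be sketched as extractive summarization: score each sentence by its words' scores, keep the top n, and return them in document order. Plain word frequency stands in here for iMiner's salience/tf*idf scoring.

```python
# Extractive summarization sketch: sentence score = sum of word scores.
# Word scoring here is plain frequency, a stand-in for salience/tf*idf.
def summarize(sentences, n=2):
    words = [w.lower() for s in sentences for w in s.split()]
    freq = {w: words.count(w) for w in set(words)}
    def score(s):
        return sum(freq[w.lower()] for w in s.split())
    top = sorted(sentences, key=score, reverse=True)[:n]
    # Return the chosen sentences in original document order.
    return [s for s in sentences if s in top]
```
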
Summarization Examples
http://www-4.ibm.com/software/data/iminer/fortext/summarize/summarizeDemo.html
TF x IDF
term strength: likelihood that a term will occur in both of two closely-related documents
Output is list of possible categories and probabilities for each document
Can filter initial schema for faster processing
Linguistic Features
Uses the features extracted by Feature Extraction tool
N-Grams
letter groupings and short words.
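Character n-grams (the "letter groupings" above) are just overlapping length-n slices of the text; a minimal generator:

```python
# Overlapping character n-grams: cheap, language-revealing features.
def char_ngrams(text, n=3):
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```
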
Document Categorizing
Individual document is analyzed for features Features are compared to those determined for categories:
terms present/absent
IQ of terms
frequencies
document structure
Document Categorization
Important issue is determining which features! High dimensionality is expensive. Ideally you want a small set of features which are
present in all documents of one category
absent in all other documents
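The comparison step can be sketched as overlap between a document's feature set and each category's profile. Real profiles carry IQ weights and frequencies; plain set overlap stands in here, and the profile contents are invented for illustration.

```python
# Categorize by comparing document features to each category profile.
def categorize(doc_features, category_profiles):
    """Return (category, overlap score) pairs, best first."""
    scored = [(cat, len(doc_features & feats) / len(feats))
              for cat, feats in category_profiles.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)
```
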
Categorization Demo
Typically categorization is a component in a system which then does something with the categorized documents Ambiguous documents (not assigned to any one category with high probability) often indicate a new category evolving.
http://www-4.ibm.com/software/data/iminer/fortext/categorize/categorize.
Clustering Tools
Hierarchical clustering
creates a tree where each leaf is a document, each cluster is positioned under the most similar cluster one step up
Hierarchical Clustering
Output is a dendrogram
Root
Intermediate levels
Leaves: link to actual documents
Select linguistic preprocessing technique: determines similarity
Cluster documents: create dendrogram based on similarity
Define shape of tree with slicing technique and produce HTML output
Linguistic Preprocessing
Determining similarity between documents and clusters: how do we define similar?
Lexical affinity: does not require any preprocessing
Linguistic features: requires that feature extractor be run first
iMiner is either/or; you cannot combine the two methods of determining similarity
Can allow focusing on specific areas of interest Best if you have some idea what you are interested in
Put each document in a cluster, characterized by its lexical or linguistic features
Merge the two most similar clusters
Continue till all clusters are merged
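The agglomerative procedure above fits in a short function: start with singleton clusters, repeatedly merge the most similar pair, and record the merge order as a nested tuple (a dendrogram). Jaccard similarity on feature sets stands in for the lexical/linguistic similarity measures.

```python
# Jaccard similarity between two feature sets, a stand-in for the
# lexical-affinity / linguistic-feature measures.
def jaccard(a, b):
    return len(a & b) / len(a | b)

def agglomerate(docs):
    """docs: list of feature sets. Returns a nested-tuple dendrogram."""
    clusters = [(i, fs) for i, fs in enumerate(docs)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = jaccard(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        # Merge the two most similar clusters into one node.
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]
```
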
The dendrogram is too big to be useful. Slicing reduces the size of the tree by merging clusters if they are similar enough.
top threshold: collapse any tree which exceeds it
bottom threshold: group under root any cluster which is lower
Remaining clusters make a new tree
# of steps sets depth of tree
Bottom
start around 5% or 10% similar
90% would mean only virtually identical documents get grouped
Top
good default is 90%; if you want only really identical documents, set to 100%
Depth:
Typically 2 to 10. Two would give you duplicates and the rest.
Relational Clustering
Descriptors are features extracted by feature extraction tool
Similarity threshold: at 100% only identical documents are clustered
Max # of clusters: overrides similarity threshold to get the number of clusters specified
Outputs are
clusters: topics found, importance of topics, degree of similarity in cluster
links: sets of common descriptors between clusters
Clustering Demo
Patents from class 395: information processing system organization
10% for top, 1% for bottom, total of 5 slices
lexical affinity
http://www-4.ibm.com/software/data/iminer/fortext/cluster/clusterDemo.html
Summary
iMiner has a rich set of text mining tools
Product is well-developed, stable
No explicit user-modifiable knowledge base -- uses automated techniques and built-in KB to extract relevant information
Can be deployed to new domains without a lot of additional work
BUT not as effective in many domains as a tool with a good KB
No real information extraction capability
Given a body of text: extract from it some well-defined set of information
MUC conferences
Typically draws heavily on NLP
Three main components:
Domain knowledge base
Extraction engine
Knowledge model
Terms: enumerated list of strings which are all members of some class.
January, February
Smith, Wong, Martinez, Matuszek
lysine, alanine, cysteine
Rules: LHS, RHS, salience
Left Hand Side (LHS): a pattern to be matched, written as relationships among terms and classes
Right Hand Side (RHS): an action to be taken when the pattern is found
Salience: priority of this rule (weight, strength, confidence)
<Monthname> <Year> => <Date>
<Date> <Name> => print Birthdate, <Name>, <Date>
<Name> <Address> => create address database record
<daynumber> / <monthnumber> / <year> => create date database record (50)
<monthnumber> / <daynumber> / <year> => create date database record (60)
<capitalized noun> <single letter> . <capitalized noun> => <Name>
<noun phrase> <to be verb> <noun phrase> => create relationship database record
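A toy engine for rules like these pairs a pattern (LHS) with an action label (RHS) and a salience, firing higher-salience rules first. The two date rules below echo the (50)/(60) examples; the regexes and action labels are illustrative assumptions.

```python
import re

# (salience, LHS pattern, RHS action label); higher salience fires first.
RULES = [
    (60, re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b"),
     "create date record (month/day/year)"),
    (50, re.compile(r"\b([A-Z][a-z]+) (\d{4})\b"),
     "<Monthname> <Year> => <Date>"),
]

def apply_rules(text):
    """Fire rules in descending salience; return (RHS, matched text) pairs."""
    fired = []
    for salience, lhs, rhs in sorted(RULES, key=lambda r: r[0], reverse=True):
        for m in lhs.finditer(text):
            fired.append((rhs, m.group(0)))
    return fired
```
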
Generic KB
Almost all systems have one
Limited by cost of development: it takes about 200 rules to define dates reasonably well, for instance
Domain-specific KB
We mostly can't afford to build a KB for the entire world. However, most applications are fairly domain-specific. Therefore we build domain-specific KBs which identify the kind of information we are interested in.
Protein-protein interactions
airline flights
terrorist activities
Domain-specific KBs
Typically start with the generic KBs
Add terminology
Figure out what kinds of information you want to extract
Add rules to identify it
Test against documents which have been human-scored to determine precision and recall for individual items
Knowledge Model
We aren't looking for documents, we are looking for information. What information?
Typically we have a knowledge model or schema which identifies the information components we want and their relationships
Typically looks very much like a DB schema or object definition
Personal records
Name
First Name
Middle Initial
Last Name
Birthdate
Month
Day
Year
Address
Protein Inhibitors
Protein name (class?)
Compound name (class?)
Pointer to source
Cache of text
Offset into text
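The two schemas above look very much like record definitions, and can be sketched as such. Field names follow the slides; the types are assumptions.

```python
from dataclasses import dataclass

# Personal-records schema from the slides.
@dataclass
class Name:
    first: str
    middle_initial: str
    last: str

@dataclass
class PersonRecord:
    name: Name
    birth_month: int
    birth_day: int
    birth_year: int
    address: str

# Protein-inhibitors schema from the slides.
@dataclass
class ProteinInhibitor:
    protein_name: str
    compound_name: str
    source: str       # pointer to source document
    text_cache: str   # cache of text
    offset: int       # offset into text
```
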
Summary
Text mining below the document level
NOT typically interactive, because it's slow (1 to 100 meg of text/hr)
Typically builds up a DB of information which can then be queried
Uses a combination of term- and rule-driven analysis and NLP parsing
AeroText: very good system developed by LMCO; we will get a complete demo on March 26.