Reference Material:
Decision Support and Business Intelligence Systems by Efraim Turban, Ramesh Sharda and Dursun Delen, 9/e, Pearson, 2012
Learning Objectives
Describe text mining and understand the need for text mining
Differentiate between text mining, Web mining and data mining
Understand the different application areas for text mining
Know the process of carrying out a text mining project
Understand the different methods to introduce structure to text-based data
Describe Web mining, its objectives, and its benefits
Understand the three different branches of Web mining:
  Web content mining
  Web structure mining
  Web usage mining
Opening Vignette
Mining Text for Security and Counterterrorism
What is MITRE?
Problem description
Proposed solution
Results
Answer & discuss the case questions
Text mining first imposes structure on the data, then mines the structured data
Challenges
Information is in an unstructured textual form
Large textual databases
  Almost all publications are also in electronic form
Very high number of possible dimensions
  All possible word and phrase types in the language
Complex and subtle relationships between concepts in text
  "AOL merges with Time-Warner" vs. "Time-Warner is bought by AOL"
Word ambiguity and context sensitivity
  Apple (the computer) or Apple (the fruit)
Noisy data
  Example: spelling mistakes
What is Text-Mining?
finding interesting regularities in large textual datasets (adapted from Usama Fayyad)
where interesting means: non-trivial, hidden, previously unknown and potentially useful
finding semantic and abstract information from the surface form of textual data
(M. Hearst)
Abstract concepts are difficult to represent
Countless combinations of subtle, abstract relationships among concepts
Many ways to represent similar concepts
  E.g. space ship, flying saucer, UFO
Concepts are difficult to visualize
High dimensionality
  Tens or hundreds of thousands of features
(M.Hearst 97)
Just about any simple algorithm can get good results for simple tasks:
  Pull out important phrases
  Find meaningfully related words
  Create some sort of summary from documents
Semi-Structured Data
Text databases are, in general, semi-structured Example:
Title, Author, Publication_Date, Length, Category, Abstract, Content
Unstructured
Features Generation
Bag of words
Features Selection
Simple counting Statistics
Text/Data Mining
Classification Clustering Associations
Analyzing results
Computational Linguistics
Data Analysis
Techniques
from manual work, through learning, to reasoning
Tasks
from search, through (un-, semi-) supervised learning, to visualization, summarization, translation
Text-Mining
Character level
The character-level representation of a text consists of sequences of characters
A document is represented by a frequency distribution of sequences
Usually we deal with contiguous strings
Each character sequence of length 1, 2, 3, ... represents a feature with its frequency
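The character-level representation can be sketched in a few lines of Python. This is only an illustrative sketch: the function name char_ngrams and the cut-off max_n are assumptions, not part of the original slides.

```python
from collections import Counter


def char_ngrams(text: str, max_n: int = 3) -> Counter:
    """Count every contiguous character sequence of length 1..max_n.

    The name and the max_n default are illustrative assumptions.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n over the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Each distinct sequence becomes a feature; its count is the feature value.
features = char_ngrams("banana")
print(features["a"], features["an"], features["ana"])
```

Sequences that never occur simply have a count of zero, so the resulting vector is sparse.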
Word level
The most common representation of text used for many techniques
There are many tokenization software packages that split text into words
Important to know:
A word is a well-defined unit in Western languages; other languages, e.g. Chinese, have a different notion of semantic unit
Words Properties
Relations among word surface forms and their senses:
  Homonymy: same form, but different meaning (e.g. bank: river bank, financial institution)
  Polysemy: same form, related meaning (e.g. bank: blood bank, financial institution)
  Synonymy: different form, same meaning (e.g. singer, vocalist)
  Hyponymy: one word denotes a subclass of another (e.g. breakfast, meal)
Word frequencies in texts follow a power-law distribution:
  a small number of very frequent words
  a large number of low-frequency words
Document Representation
Stop word removal: many words are not informative and thus irrelevant for document representation
  the, and, a, an, is, of, that, ...
Stemming: reducing words to their root form
  A document may contain several occurrences of words like fish, fishes, fisher, fishers, ...
  but would not be retrieved by a query with the keyword fishing
  Different words share the same word stem and should be represented by that stem (fish) instead of the actual word
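A minimal Python sketch of these two preprocessing steps follows. The stop word list is the one given on the slide. The suffix-stripping rule is a deliberately naive toy assumed for illustration, not the Porter stemmer or any other published algorithm, and the names toy_stem and preprocess are hypothetical.

```python
import re

# Stop words taken from the slide; real lists are much longer.
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}

# Toy suffix list (an assumption for illustration). Longer suffixes are
# listed before their shorter variants so that "ers" is tried before "er".
SUFFIXES = ("ing", "ers", "er", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def preprocess(text: str) -> list:
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The fisher is fishing"))
```

With this toy rule, fish, fishes, fisher, fishers and fishing all map to the same stem, so a query for one form can match documents containing the others. A production system would use an established stemmer or lemmatizer instead.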
Phrase level
Instead of having just single words we can deal with phrases
We use two types of phrases:
  Phrases as frequent contiguous word sequences
  Phrases as frequent non-contiguous word sequences
Both types of phrases can be identified by a simple dynamic programming algorithm
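As a rough illustration of the first phrase type, the sketch below counts contiguous word sequences and keeps the frequent ones. It uses plain counting rather than the dynamic programming algorithm mentioned above, and the function name, the phrase length n and the threshold min_count are assumptions.

```python
from collections import Counter


def frequent_phrases(docs, n=2, min_count=2):
    """Return contiguous n-word phrases that occur at least min_count times.

    Plain counting over a sliding window; names and defaults are illustrative.
    """
    counts = Counter()
    for doc in docs:
        words = doc.lower().split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return {" ".join(p): c for p, c in counts.items() if c >= min_count}


docs = [
    "text mining finds patterns",
    "web mining and text mining",
    "data mining",
]
print(frequent_phrases(docs))
```

Non-contiguous phrases need a different search, because the words of a phrase may be separated by gaps.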
Part-of-Speech level
By introducing part-of-speech tags we introduce word types, which lets us differentiate the functions of words
For text analysis, part-of-speech information is used mainly for information extraction, where we are interested in e.g. named entities, which are noun phrases
Another possible use is reduction of the vocabulary (features)
  it is known that nouns carry most of the information in text documents
Part-of-speech taggers are usually trained with hidden Markov models (HMMs) on manually tagged data
Part-of-Speech Table
http://www.englishclub.com/grammar/parts-of-speech_1.htm
Part-of-Speech examples
http://www.englishclub.com/grammar/parts-of-speech_2.htm
Taxonomies/thesaurus level
The main function of a thesaurus is to connect different surface word forms with the same meaning into one sense (synonyms)
Additionally, we often use the hypernym relation to relate general-to-specific word senses
By using synonyms and the hypernym relation we compact the feature vectors
The most commonly used general thesaurus is WordNet, which also exists for many other languages (e.g. EuroWordNet)
http://www.illc.uva.nl/EuroWordNet/
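WordNet consists of four databases: nouns, verbs, adjectives, and adverbs
Each database consists of sense entries; each sense consists of a set of synonyms, e.g.:
  musician, instrumentalist, player
  person, individual, someone
  life form, organism, being
[Table residue: counts for the adjective and adverb databases (20170, 4546, 29881, 5677); column headers and the noun and verb rows are not recoverable]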
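[Figure: WordNet-style graph of senses connected by relations; the sense "bird" is linked to senses such as goose, hawk, beak, fly and sound through relations such as Is_a, Part, Typ_subj, Typ_obj and Location]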
WordNet relations
Each WordNet entry is connected with other entries in the graph through relations Relations in the database of nouns:
Relation     Definition                      Example
Hypernym     From lower to higher concepts   breakfast -> meal
Hyponym      From concepts to subordinates   meal -> lunch
Has-Member   From groups to their members    faculty -> professor
Member-Of    From members to their groups    copilot -> crew
Has-Part     From wholes to parts            table -> leg
Part-Of      From parts to wholes            course -> meal
Antonym      Opposites                       leader -> follower
Document Representation
A document representation aims to capture what the document is about
One possible approach:
  Each entry describes a document
  Attributes describe whether or not a term appears in the document

Term      Document 1   Document 2
Camera    1            1
Digital   1            1
Memory    0            0
Pixel     1            0
...
Document Representation
Another approach:
  Each entry describes a document
  Attributes represent the frequency with which a term appears in the document
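Both representations can be produced by the same small routine. The sketch below is illustrative only: the function name term_document_matrix, the whitespace tokenization and the two toy documents are assumptions.

```python
from collections import Counter


def term_document_matrix(docs, binary=False):
    """Build one row per document over a shared, sorted vocabulary.

    binary=True records presence (0/1); otherwise raw term frequency.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([min(counts[t], 1) if binary else counts[t] for t in vocab])
    return vocab, rows


docs = ["digital camera pixel camera", "digital camera"]
vocab, freq = term_document_matrix(docs)
_, presence = term_document_matrix(docs, binary=True)
print(vocab)     # column labels
print(freq)      # term frequencies per document
print(presence)  # term presence per document
```

The frequency variant keeps more information, while the binary variant is less sensitive to document length.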
Word weighting
In the bag-of-words representation each word is represented as a separate variable having a numeric weight (importance)
The most popular weighting scheme is normalized word frequency, TFIDF:

\[ \mathrm{tfidf}(w) = \mathrm{tf}(w) \cdot \log\left(\frac{N}{\mathrm{df}(w)}\right) \]

tf(w): term frequency (number of word occurrences in a document)
df(w): document frequency (number of documents containing the word)
N: number of all documents
tfidf(w): relative importance of the word in the document
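The weighting can be computed directly from these definitions. The following is a minimal sketch, assuming whitespace tokenization and the natural logarithm (the slide does not state the base); the function name tfidf is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {word: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for tokens in tokenized for w in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


weights = tfidf(["camera digital camera", "digital memory"])
print(weights[0])
```

A word that occurs in every document gets weight zero, because log(N / df) is log(1). This is why TFIDF suppresses common words without an explicit stop list.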
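Cosine similarity between two document vectors:

\[ \mathrm{Sim}(D_1, D_2) = \frac{\sum_i x_{1i}\, x_{2i}}{\sqrt{\sum_j x_{1j}^2}\;\sqrt{\sum_k x_{2k}^2}} \]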
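A minimal Python sketch of this similarity measure, assuming both documents are already represented as equal-length weight vectors; the function name and the zero-vector convention are assumptions.

```python
import math


def cosine_similarity(x1, x2):
    """Dot product of the two vectors divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    if norm1 == 0 or norm2 == 0:
        # Convention assumed here: an empty document is similar to nothing.
        return 0.0
    return dot / (norm1 * norm2)


d1 = [1, 1, 0, 1]
d2 = [1, 1, 0, 0]
print(cosine_similarity(d1, d2))
```

Because the measure is normalized by vector length, a long document and a short one on the same topic can still score as highly similar.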
Performance Measure
The set of retrieved documents can be formed by collecting the top-ranking documents according to a similarity measure
The quality of a collection can be compared by the two following measures:

\[ \mathrm{Precision} = \frac{|\mathrm{Relevant} \cap \mathrm{Retrieved}|}{|\mathrm{Retrieved}|} \qquad \mathrm{Recall} = \frac{|\mathrm{Relevant} \cap \mathrm{Retrieved}|}{|\mathrm{Relevant}|} \]

[Figure: Venn diagram of the relevant documents, the retrieved documents, and their overlap (relevant and retrieved)]
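Both measures reduce to set operations. The sketch below assumes documents are identified by hypothetical string IDs and returns 0.0 when a denominator would be empty; that convention is an assumption, not part of the slide.

```python
def precision_recall(relevant, retrieved):
    """Compute (precision, recall) from two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # relevant and retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d5"}
print(precision_recall(relevant, retrieved))
```

Retrieving more documents tends to raise recall and lower precision, so the two are usually reported together.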
Classification techniques
Decision Tree Classification
Bayesian Classifiers
Neural Networks
Statistical Analysis
Genetic Algorithms
Rough Set Approach
k-nearest neighbor classifiers
NLP is
  a very important concept in text mining
  a subfield of artificial intelligence and computational linguistics
  the study of "understanding" the natural human language
Dream of AI community
to have algorithms that are capable of automatically reading and obtaining knowledge from text
Sentiment Analysis
A technique used to detect favorable and unfavorable opinions toward specific products and services
See Application Case 7.3 for a CRM application
Security applications
ECHELON, OASIS
Deception detection (example coming up)
Academic applications
Research stream analysis - example coming up
[Figure: multi-level annotation of the sentence "... expression of Bcl-2 is correlated with insufficient white blood cell death and activation of p53." The layers shown are Word, POS (part-of-speech tags such as NN, IN, VBZ, JJ, CC), Shallow Parse (NP and PP chunks) and Ontology (numeric concept identifiers)]
Establish the Corpus: Collect & Organize the Domain Specific Unstructured Data
The inputs to the process include a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.
The output of Task 1 is a collection of documents in some digitized format for computer processing
The output of Task 2 is a flat file called a term-document matrix, where the cells are populated with the term frequencies
The output of Task 3 is a number of problem-specific classification, association and clustering models and visualizations
Web Mining
The term was coined by Oren Etzioni (1996)
Application of data mining techniques to automatically discover and extract information from Web data
Discovering useful information from the World-Wide Web and its usage patterns
Scale
Data generated per day is comparable to largest conventional data warehouses
Speed
Often need to react to evolving usage patterns in real-time (e.g., merchandising)
High linkage
10-20 links/page on average Power-law degree distribution
Bow-tie Structure
Power-laws galore
Structure
In-degrees Out-degrees Number of pages per site
Usage patterns
Number of visitors Popularity e.g., products, movies, music
http://www.simplyhired.com
http://www.fatlens.com
Interesting problems
What ads to show for a search?
If I'm an advertiser, which search terms should I bid on and how much should I bid?
Systems architecture
CPU
Systems Issues
Web data sets can be very large
Tens to hundreds of terabytes
Project
Lots of interesting project ideas
If you can't think of one, please come and discuss it with us
Infrastructure
Aster Data cluster on Amazon EC2 Supports both MapReduce and SQL
Data
Netflix ShareThis Google WebBase TREC
Web data
Semi-structured and unstructured readily available data rich in features and patterns
Web Data
Web Structure
Click here to Shop Online tag
Web Data
Web Usage
Application Server logs Http logs
Goes beyond key word extraction, or some simple statistics of words and phrases in documents.
Web-Structure Mining
Generate structural summary about the Web site and Web page
Categorizing Web pages, depending on their hyperlinks, and the related information at the inter-domain level
Discovering the Web page structure
Discovering the nature of the hierarchy of hyperlinks in the website and its structure.
Web-Structure Mining
Finding Information about web pages
Inference on Hyperlink
Retrieving information about the relevance and the quality of the web page
Finding the authoritative pages on the topic and content
The web page contains not only information but also hyperlinks, which carry a huge amount of annotation
A hyperlink identifies the author's endorsement of the other web page
Web-Structure Mining
Finding micro communities on the web e.g. Google (Brin and Page, 1998)
Schema Discovery in Semi-Structured Environment.
Web-Usage Mining
Analysis:
Web-Usage Mining
Customer   Transaction Time    Purchased Items
John       6/21/05 5:30 pm     Beer
John       6/22/05 10:20 pm    Brandy
60% of users who placed an online order in /company/product1 also placed an order in /company/product4 within 15 days
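A rule of this kind is summarized by its confidence: among users who did the first thing, the share who also did the second. The sketch below assumes each user is represented by the set of product paths they ordered from, ignores the 15-day time window for brevity, and uses made-up sessions chosen to reproduce the 60% figure. The function name rule_confidence is hypothetical.

```python
def rule_confidence(sessions, antecedent, consequent):
    """Share of sessions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [s for s in sessions if antecedent in s]
    if not with_antecedent:
        return 0.0
    both = sum(1 for s in with_antecedent if consequent in s)
    return both / len(with_antecedent)


# Illustrative data only: one set of ordered-from paths per user.
sessions = [
    {"/company/product1", "/company/product4"},
    {"/company/product1", "/company/product4"},
    {"/company/product1", "/company/product4"},
    {"/company/product1"},
    {"/company/product1", "/company/product2"},
    {"/company/product4"},
]
print(rule_confidence(sessions, "/company/product1", "/company/product4"))
```

Enforcing the time window would require timestamps on each order, which real server and application logs provide.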
Document Classification
Supervised Learning
Supervised learning is a machine learning technique for creating a function from training data.
Documents are categorized
The output can predict a class label of the input object (called classification).
Feature Selection
Removes terms in the training documents which are statistically uncorrelated with the class labels
Simple heuristics
  Stop words like a, an, the, etc.
  Empirically chosen thresholds for ignoring too frequent or too rare terms
  Discard too frequent and too rare terms
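The simple heuristics can be sketched as a filter on document frequency. This sketch covers only the stop word and threshold heuristics, not a statistical test against the class labels, and the function name and threshold values (min_df, max_df_ratio) are illustrative assumptions.

```python
from collections import Counter


def select_terms(docs, min_df=2, max_df_ratio=0.8,
                 stop_words=frozenset({"a", "an", "the"})):
    """Keep terms that are not stop words and are neither too rare nor too frequent."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    # Document frequency of each term.
    df = Counter(w for tokens in tokenized for w in tokens)
    max_df = max_df_ratio * len(docs)
    return sorted(
        w for w, c in df.items()
        if w not in stop_words and min_df <= c <= max_df
    )


docs = [
    "the camera has a zoom",
    "the camera has memory",
    "the phone has memory",
    "the phone has a screen",
    "the tablet has pixels",
]
print(select_terms(docs))
```

In this toy collection "has" is dropped as too frequent and "zoom" as too rare, even though neither is on the stop list.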
Document Clustering
Unsupervised learning: a data set of input objects is gathered
Goal: evolve measures of similarity to cluster a collection of documents/terms into groups such that similarity within a cluster is larger than across clusters.
Hypothesis: given a "suitable" clustering of a collection, if the user is interested in document/term d/t, he is likely to be interested in other members of the cluster to which d/t belongs.
Hierarchical
  Bottom-up
  Top-down
Partitional
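The hierarchical bottom-up variant can be sketched as follows. This is a minimal sketch under stated assumptions: documents are already term-frequency vectors, similarity is cosine, and clusters are merged by single link (the most similar pair of members). The slides do not specify these choices, and the function names are hypothetical.

```python
import math


def cosine(x, y):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0


def agglomerative(vectors, k):
    """Start with one cluster per document and merge the closest pair until k remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: similarity of the most similar pair of members.
                sim = max(cosine(vectors[i], vectors[j])
                          for i in clusters[a] for j in clusters[b])
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Four toy documents over a four-term vocabulary (illustrative data).
vectors = [[2, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
print(agglomerative(vectors, k=2))
```

Top-down methods work in the opposite direction, splitting one large cluster, while partitional methods assign documents to a fixed number of clusters without building a hierarchy.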
Semi-Supervised Learning
A collection of documents is available
A subset of the collection has known labels
Goal: to label the rest of the collection.
Approach
  Train a supervised learner using the labeled subset.
  Apply the trained learner on the remaining documents.
Idea
  Harness information in the labeled subset to enable better learning.
  Also, check the collection for emergence of new topics
Association
Example: Supermarket
[Table: Transaction ID and Items Purchased for transactions 1, 2 and 3; the item lists are not recoverable]
Q&A