Business Intelligence and Data Mining
By Dr. Atanu Rakshit Email: atanu.rakshit@iimrohtak.ac.in atanu.raks@gmail.com
Business Intelligence and Data Mining (BI & DM)

Text Book:
Business Intelligence: A Managerial Approach by Efraim Turban, Ramesh Sharda, Dursun Delen and David King, 2/e, Pearson, 2012
Reference Material:
Decision Support and Business Intelligence Systems by Efraim Turban, Ramesh Sharda and Dursun Delen, 9/e, Pearson, 2012
Business Intelligence Strategy: A Practical Guide for Achieving BI Excellence by John Boyer, Bill Frank, Brian Green and Tracy Harris, MC Press, 2010
Business Analytics for Managers by Gert H. N. Laursen and Jesper Thorlund, Wiley, 2010
Business Intelligence and Data Mining (BI & DM) Sessions Plan

Introduction to Business Intelligence
Decision Support Systems Concepts, Methodologies and Technologies
Data Warehousing
Business Performance Management
Data Mining for Business Intelligence
Text and Web Mining
Business Intelligence: Implementation and Emerging Trends
Introduction to Text and Web Mining
Learning Objectives
Describe text mining and understand the need for text mining
Differentiate between text mining, Web mining and data mining
Understand the different application areas for text mining
Know the process of carrying out a text mining project
Understand the different methods to introduce structure to text-based data
Learning Objectives
Describe Web mining, its objectives, and its benefits
Understand the three different branches of Web mining
  Web content mining
  Web structure mining
  Web usage mining
Understand the applications of these three mining paradigms
Opening Vignette
Mining Text For Security And Counterterrorism
What is MITRE?
Problem description
Proposed solution
Results
Answer & discuss the case questions
Opening Vignette: Mining Text For Security

Cluster 1: (L) Kampala, (L) Uganda, (P) Yoweri Museveni, (L) Sudan, (L) Khartoum, (L) Southern Sudan
Cluster 2: (P) Timothy McVeigh, (L) Oklahoma City, (P) Terry Nichols
Cluster 3: (E) election, (P) Norodom Ranariddh, (P) Norodom Sihanouk, (L) Bangkok, (L) Cambodia, (L) Phnom Penh, (L) Thailand, (P) Hun Sen, (O) Khmer Rouge, (P) Pol Pot
Text Mining Concepts

85-90 percent of all corporate data is in some kind of unstructured form (e.g., text).
Unstructured corporate data is doubling in size every 18 months.
Tapping into these information sources is not optional; it is necessary to stay competitive.
Answer: text mining
A semi-automated process of extracting knowledge from unstructured data sources
Also known as text data mining or knowledge discovery in textual databases
Data Mining versus Text Mining

Both seek novel and useful patterns
Both are semi-automated processes
The difference is the nature of the data:
  Structured versus unstructured data
  Structured data: databases
  Unstructured data: Word documents, PDF files, text excerpts, XML files, and so on
Text mining: first impose structure on the data, then mine the structured data
Text Mining Concepts

Benefits of text mining are obvious especially in text-rich data environments
e.g., law (court orders), academic research (research articles), finance (quarterly reports), medicine (discharge summaries), biology (molecular interactions), technology (patent files), marketing (customer comments), etc.
Electronic communication records (e.g., email)
  Spam filtering
  Email prioritization and categorization
  Automatic response generation
Challenges
Information is in unstructured textual form
Large textual databases: almost all publications are also in electronic form
Very high number of possible dimensions: all possible word and phrase types in the language
Complex and subtle relationships between concepts in text
  "AOL merges with Time-Warner" versus "Time-Warner is bought by AOL"
Word ambiguity and context sensitivity
  Apple (the computer) or apple (the fruit)
Noisy data
  Example: spelling mistakes
What is Text-Mining?
finding interesting regularities in large textual datasets (adapted from Usama Fayyad)
where interesting means: non-trivial, hidden, previously unknown and potentially useful
finding semantic and abstract information from the surface form of textual data
Why is dealing with text tough? (M. Hearst, 97)

Abstract concepts are difficult to represent
Countless combinations of subtle, abstract relationships among concepts
Many ways to represent similar concepts
  E.g., space ship, flying saucer, UFO
Concepts are difficult to visualize
High dimensionality
  Tens or hundreds of thousands of features
Why is dealing with text easy? (M. Hearst, 97)

Highly redundant data
  most of the methods count on this property
Just about any simple algorithm can get good results for simple tasks:
  Pull out important phrases
  Find meaningfully related words
  Create some sort of summary from documents
Semi-Structured Data
Text databases are, in general, semi-structured
Example:
Structured attribute/value pairs: Title, Author, Publication_Date, Length, Category
Unstructured: Abstract, Content
Text Mining Process

Text preprocessing: syntactic/semantic text analysis
Features generation: bag of words
Features selection: simple counting, statistics
Text/data mining: classification, clustering, associations
Analyzing results
Who is in the text analysis arena?

Knowledge Representation & Reasoning / Tagging
Search & DB
Computational Linguistics
Data Analysis
What dimensions are in text analytics?

Three major dimensions of text analytics:
Representations
from character-level to first-order theories
Techniques
from manual work, over learning to reasoning
Tasks
from search, over (un-, semi-) supervised learning, to visualization, summarization, translation
Text-Mining
How do we represent text?
Levels of text representations

Character (character n-grams and sequences)
Words (stop-words, stemming, lemmatization)
Phrases (word n-grams, proximity features)
Part-of-speech tags
Taxonomies / thesauri
Vector-space model
Language models
Full-parsing
Cross-modality
Collaborative tagging / Web2.0
Templates / Frames
Ontologies / First order theories
Character level
A character-level representation of a text consists of sequences of characters
a document is represented by a frequency distribution of sequences
usually we deal with contiguous strings
each character sequence of length 1, 2, 3, ... represents a feature with its frequency
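As a rough illustration of the character-level representation, the sketch below counts every contiguous character sequence of length 1 to 3 in a string. The function name `char_ngrams` and the sample string are illustrative assumptions, not part of the original slides.

```python
from collections import Counter


def char_ngrams(text, max_n=3):
    """Count all contiguous character sequences of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Hypothetical sample input; each n-gram becomes a feature with its frequency.
profile = char_ngrams("banana")
print(profile["a"], profile["an"], profile["ana"])
```

Two documents can then be compared through their n-gram frequency profiles without any tokenization or language-specific processing.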
Good and bad sides

The representation has several important strengths:
it is very robust, since it avoids language morphology (useful for, e.g., language identification)
it captures simple patterns at the character level (useful for, e.g., spam detection and copy detection)
because of the redundancy in text data, it can be used for many analytic tasks (learning, clustering, search)
it is used as a basis for string kernels, in combination with SVMs, for capturing complex character-sequence patterns
Weakness: for deeper semantic tasks, the representation is too weak

Word level
The most common representation of text, used by many techniques
there are many tokenization software packages that split text into words
Important to know:
The word is a well-defined unit in Western languages, whereas, e.g., Chinese has a different notion of semantic unit
Words Properties
Relations among word surface forms and their senses:
Homonymy: same form, but different meaning (e.g. bank: river bank, financial institution)
Polysemy: same form, related meaning (e.g. bank: blood bank, financial institution)
Synonymy: different form, same meaning (e.g. singer, vocalist)
Hyponymy: one word denotes a subclass of another (e.g. breakfast, meal)
Word frequencies in texts follow a power-law distribution:
a small number of very frequent words
a large number of low-frequency words
Document Representation
Stop word removal: many words are not informative and thus irrelevant for document representation
  the, and, a, an, is, of, that, ...
Stemming: reducing words to their root form
  A document may contain several occurrences of words like fish, fishes, fisher, fishers, ... but would not be retrieved by a query with the keyword fishing
  Different words share the same word stem and should be represented by the stem (fish) instead of the actual word
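The two preprocessing steps can be sketched as follows. This is a minimal sketch, not a real stemmer: the stop-word set repeats the examples from the slide, while the suffix list, the minimum stem length of 3, and the names `naive_stem` and `preprocess` are assumptions made for illustration.

```python
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}
SUFFIXES = ("ing", "ers", "er", "es", "s")  # checked in this order


def naive_stem(word):
    """Strip one known suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation and stop words, then stem each token."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return [naive_stem(t) for t in tokens if t and t not in STOP_WORDS]


print(preprocess("The fisher is fishing, and the fishers catch fishes."))
# ['fish', 'fish', 'fish', 'catch', 'fish']
```

After this step a query for "fishing" and a document containing "fishers" both map to the stem "fish", so they can match.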

Phrase level
Instead of having just single words, we can deal with phrases
We use two types of phrases:
Phrases as frequent contiguous word sequences
Phrases as frequent non-contiguous word sequences
Both types of phrases can be identified by a simple dynamic programming algorithm
The main effect of using phrases is to identify sense more precisely

Part-of-Speech level
By introducing part-of-speech tags we introduce word types, making it possible to differentiate the functions of words
For text analysis, part-of-speech information is used mainly for information extraction, where we are interested in, e.g., named entities, which are noun phrases
Another possible use is reduction of the vocabulary (features): it is known that nouns carry most of the information in text documents
Part-of-speech taggers are usually learned with an HMM algorithm on manually tagged data
Part-of-Speech Table
http://www.englishclub.com/grammar/parts-of-speech_1.htm
Part-of-Speech examples
http://www.englishclub.com/grammar/parts-of-speech_2.htm

Taxonomies/thesaurus level
The main function of a thesaurus is to connect different surface word forms with the same meaning into one sense (synonyms)
additionally, we often use the hypernym relation to relate general-to-specific word senses
by using synonyms and the hypernym relation we compact the feature vectors
The most commonly used general thesaurus is WordNet, which also exists for many other languages (e.g. EuroWordNet)
http://www.illc.uva.nl/EuroWordNet/
WordNet database of lexical relations

WordNet is the most well-developed and widely used lexical database for English
it consists of 4 databases (nouns, verbs, adjectives, and adverbs)
Each database consists of sense entries; each sense consists of a set of synonyms, e.g.:
musician, instrumentalist, player
person, individual, someone
life form, organism, being

Category    Unique Forms   Number of Senses
Noun        94474          116317
Verb        10319          22066
Adjective   20170          29881
Adverb      4546           5677
WordNet excerpt from the graph

[Figure: excerpt from the graph, with senses (e.g., chicken, hen, duck, goose, bird, hawk, poultry, animal, feather, wing, beak) connected by relations (e.g., Is_a, Part, Typ_obj, Typ_subj, Means, Purpose, Location); 26 relations, 116k senses]
WordNet relations
Each WordNet entry is connected with other entries in the graph through relations
Relations in the database of nouns:

Relation     Definition                       Example
Hypernym     From lower to higher concepts    breakfast -> meal
Hyponym      From concepts to subordinates    meal -> lunch
Has-Member   From groups to their members     faculty -> professor
Member-Of    From members to their groups     copilot -> crew
Has-Part     From wholes to parts             table -> leg
Part-Of      From parts to wholes             course -> meal
Antonym      Opposites                        leader -> follower
A document representation aims to capture what the document is about
One possible approach
Each entry describes a document
Attributes describe whether or not a term appears in the document

Term      Document 1   Document 2
Camera    1            1
Digital   1            1
Memory    0            0
Pixel     1            0

Another approach
Each entry describes a document
Attributes represent the frequency with which a term appears in the document
Example: Term frequency table

Term      Document 1   Document 2
Camera    3            0
Digital   2            4
Memory    0            0
Pixel     1            3
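Both representations can be derived from the same token counts. In the sketch below the two toy documents are invented so that their counts reproduce the term frequency table; the function name `term_document_tables` is likewise an assumption.

```python
from collections import Counter


def term_document_tables(docs, terms):
    """Return (binary, frequency) tables as {term: [value per document]}."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    frequency = {t: [c[t] for c in counts] for t in terms}
    binary = {t: [1 if v > 0 else 0 for v in row] for t, row in frequency.items()}
    return binary, frequency


# Hypothetical documents chosen to match the frequency table above.
docs = [
    "camera camera camera digital digital pixel",
    "digital digital digital digital pixel pixel pixel",
]
terms = ["camera", "digital", "memory", "pixel"]
binary, frequency = term_document_tables(docs, terms)
print(frequency["camera"], binary["camera"])
```

The binary table only records presence, so it discards the information that "digital" is twice as frequent in the second document.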

Vector-space model level

The most common way to deal with documents is first to transform them into sparse numeric vectors and then handle them with linear algebra operations
by this, we forget everything about the linguistic structure within the text
this is sometimes called the "structural curse", although forgetting about the structure does not harm the efficiency of solving many relevant problems
This representation is also referred to as bag-of-words or the vector-space model
Typical tasks on the vector-space model are classification, clustering, visualization, etc.
Bag-of-words document representation
Word weighting
In the bag-of-words representation each word is represented as a separate variable with a numeric weight (importance)
The most popular weighting scheme is normalized word frequency, TFIDF:
$$tfidf(w) = tf(w) \cdot \log\left(\frac{N}{df(w)}\right)$$
tf(w): term frequency (number of occurrences of the word in a document)
df(w): document frequency (number of documents containing the word)
N: number of all documents
tfidf(w): relative importance of the word in the document
The word is more important if it appears several times in a target document
The word is more important if it appears in fewer documents
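The TFIDF weighting can be computed directly from tokenized documents. This is a minimal sketch: the slide does not state the base of the logarithm, so the natural logarithm is an assumption, and the three toy documents and the function name `tfidf` are illustrative only.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {word: tf * log(N / df)} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


# Hypothetical corpus of three tokenized documents.
docs = [["camera", "digital", "camera"], ["digital", "pixel"], ["digital", "memory"]]
weights = tfidf(docs)
print(weights[0])
```

Here "digital" occurs in every document, so its weight is zero everywhere, while "camera" is both frequent in the first document and rare in the corpus, so it gets the highest weight.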
Distance Based Matching

In order to retrieve documents similar to a given document, one needs a measure of similarity
Euclidean distance
The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as
$$D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
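A short sketch of the Euclidean distance between two document vectors, assuming the standard definition (square root of the sum of squared coordinate differences). The two vectors are the Document 1 and Document 2 columns of the term frequency table shown earlier; the function name is an assumption.

```python
import math


def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Term order: camera, digital, memory, pixel.
doc1 = [3, 2, 0, 1]
doc2 = [0, 4, 0, 3]
d = euclidean_distance(doc1, doc2)
print(round(d, 3))  # 4.123
```

A smaller distance means more similar documents; note that the measure is sensitive to document length, because longer documents have larger counts.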
Similarity between document vectors

Each document is represented as a vector of weights D = <x>
Cosine similarity (normalized dot product) is the most widely used similarity measure between two document vectors
calculates the cosine of the angle between document vectors
efficient to calculate (sum of products of intersecting words)
similarity value between 0 (different) and 1 (the same)
$$Sim(D_1, D_2) = \frac{\sum_i x_{1i} x_{2i}}{\sqrt{\sum_j x_{1j}^2} \sqrt{\sum_k x_{2k}^2}}$$
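The cosine similarity can be sketched with sparse vectors stored as dictionaries, so that only words shared by both documents contribute to the numerator. The function name and the two example vectors (taken from the term frequency table, with zero entries omitted) are illustrative assumptions.

```python
import math


def cosine_similarity(d1, d2):
    """d1, d2: sparse document vectors as {word: weight} dicts."""
    shared = set(d1) & set(d2)
    dot = sum(d1[w] * d2[w] for w in shared)
    norm1 = math.sqrt(sum(v * v for v in d1.values()))
    norm2 = math.sqrt(sum(v * v for v in d2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm1 * norm2)


doc1 = {"camera": 3, "digital": 2, "pixel": 1}
doc2 = {"digital": 4, "pixel": 3}
sim = cosine_similarity(doc1, doc2)
print(round(sim, 3))  # 0.588
```

Because the vectors are normalized by their lengths, a long and a short document with the same word proportions get a similarity of 1.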
Performance Measure
The set of retrieved documents can be formed by collecting the top-ranking documents according to a similarity measure
The quality of a collection can be compared using the following two measures
$$\text{Precision} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Retrieved}|}$$
$$\text{Recall} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Relevant}|}$$
[Figure: Venn diagram of relevant documents and retrieved documents; their overlap is the set of relevant and retrieved documents]
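Precision and recall reduce to set operations on document identifiers. In this sketch the document ids and the function name `precision_recall` are made up for illustration; returning 0.0 for an empty set is a convention chosen here, not something the slides specify.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 documents retrieved, 3 relevant overall.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, round(r, 3))  # 0.5 0.667
```

Retrieving more documents tends to raise recall and lower precision, which is why the two measures are reported together.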
Classification techniques
Decision Tree Classification
Bayesian Classifiers
Neural Networks
Statistical Analysis
Genetic Algorithms
Rough Set Approach
k-nearest neighbor classifiers
Cluster Analysis for Data Mining

Analysis methods
Statistical methods (including both hierarchical and nonhierarchical), such as k-means, k-modes, and so on
Neural networks (adaptive resonance theory [ART], self-organizing map [SOM])
Fuzzy logic (e.g., fuzzy c-means algorithm)
Genetic algorithms
Divisive versus Agglomerative methods
Text Mining for Patent Analysis (see Applications Case 7.2)

What is a patent?
exclusive rights granted by a country to an inventor for a limited period of time in exchange for a disclosure of an invention
How do we do patent analysis (PA)? Why do we need to do PA?

What are the benefits? What are the challenges?
How does text mining help in PA?
Natural Language Processing (NLP)

Structuring a collection of text
Old approach: bag-of-words
New approach: natural language processing
NLP is
a very important concept in text mining
a subfield of artificial intelligence and computational linguistics
the study of "understanding" the natural human language
Syntax versus semantics based text mining

What is "understanding"?
Humans understand; what about computers?
Natural language is vague and context-driven
True understanding requires extensive knowledge of a topic
Can/will computers ever understand natural language the same/accurate way we do?

Challenges in NLP
Part-of-speech tagging
Text segmentation
Word sense disambiguation
Syntax ambiguity
Imperfect or irregular input
Speech acts
Dream of AI community
to have algorithms that are capable of automatically reading and obtaining knowledge from text

WordNet
A laboriously hand-coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets
A major resource for NLP
Needs automation to be completed
Sentiment Analysis
A technique used to detect favorable and unfavorable opinions toward specific products and services
See Application Case 7.3 for a CRM application
NLP Task Categories

Information retrieval
Information extraction
Named-entity recognition
Question answering
Automatic summarization
Natural language generation & understanding
Machine translation
Foreign language reading & writing
Speech recognition
Text proofing
Optical character recognition
Text Mining Applications

Marketing applications
  Enables better CRM
Security applications
  ECHELON, OASIS
  Deception detection (example coming up)
Medicine and biology
  Literature-based gene identification (example coming up)
Academic applications
  Research stream analysis (example coming up)
Text Mining Applications

(gene/protein interaction identification)
[Figure: multi-level analysis of the sentence "...expression of Bcl-2 is correlated with insufficient white blood cell death and activation of p53." with one row per level: gene/protein identifiers, ontology identifiers, words, part-of-speech tags (NN, IN, VBZ, JJ, CC), and shallow-parse chunks (NP, PP)]
Text Mining Process

Context diagram for the text mining process
Inputs: unstructured data (text), structured data (databases)
Process (A0): extract knowledge from available data sources
Output: context-specific knowledge
Constraints: software/hardware limitations, privacy issues, linguistic limitations
Mechanisms: domain expertise, tools and techniques
Text Mining Process

The three-step text mining process
Task 1: Establish the Corpus: Collect & Organize the Domain-Specific Unstructured Data
Task 2: Create the Term-Document Matrix: Introduce Structure to the Corpus
Task 3: Extract Knowledge: Discover Novel Patterns from the T-D Matrix
Feedback: Task 2 and Task 3 feed back into the earlier tasks
The inputs to the process include a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.
The output of Task 1 is a collection of documents in some digitized format for computer processing
The output of Task 2 is a flat file called the term-document matrix, where the cells are populated with the term frequencies
The output of Task 3 is a number of problem-specific classification, association, and clustering models and visualizations
Text Mining Process

Step 1: Establish the corpus
Collect all relevant unstructured data (e.g., textual documents, XML files, emails, Web pages, short notes, voice recordings)
Digitize and standardize the collection (e.g., all in ASCII text files)
Place the collection in a common place (e.g., in a flat file, or in a directory as separate files)
Text Mining Process

Step 2: Create the Term-by-Document Matrix
[Table: example term-by-document matrix with documents (Document 1 to Document 6, ...) as rows and terms (investment risk, project management, software engineering, development, SAP, ...) as columns; a few cells hold term frequencies such as 1, 2 or 3, and most cells are empty]
Text Mining Process

Step 2: Create the Term-by-Document Matrix (TDM)
Should all terms be included?
  Stop words, include words
  Synonyms, homonyms
  Stemming
What is the best representation of the indices (values in cells)?
  Raw counts; binary frequencies; log frequencies; inverse document frequency
Text Mining Process

Step 2: Create the Term-by-Document Matrix (TDM)
The TDM is a sparse matrix. How can we reduce the dimensionality of the TDM?
  Manual: a domain expert goes through it
  Eliminate terms with very few occurrences in very few documents (?)
  Transform the matrix using singular value decomposition (SVD)
  SVD is similar to principal component analysis
Text Mining Process

Step 3: Extract patterns/knowledge
Classification (text categorization)
Clustering (natural groupings of text)
  Improve search recall
  Improve search precision
  Scatter/gather
  Query-specific clustering
Association
Trend analysis
Web Mining
The term was coined by Oren Etzioni (1996)
Application of data mining techniques to automatically discover and extract information from Web data
What is Web Mining?
Discovering useful information from the World-Wide Web and its usage patterns
Web Mining v. Data Mining

Structure (or lack of it)
Textual information and linkage structure
Scale
Data generated per day is comparable to largest conventional data warehouses
Speed
Often need to react to evolving usage patterns in real-time (e.g., merchandising)
Web Mining topics

Web graph analysis
Power Laws and The Long Tail
Structured data extraction
Web advertising
Systems Issues
Size of the Web

Number of pages
Technically, infinite
Much duplication (30-40%)
Best estimate of unique static HTML pages comes from search engine claims
  Until last year, Google claimed 8 billion(?), Yahoo claimed 20 billion
  Google recently announced that their index contains 1 trillion pages
How to explain the discrepancy?
The web as a graph

Pages = nodes, hyperlinks = edges
  Ignore content
  Directed graph
High linkage
  10-20 links/page on average
  Power-law degree distribution
Structure of Web graph

Let's take a closer look at the structure
Broder et al. (2000) studied a crawl of 200M pages and other smaller crawls
Bow-tie structure
Not a small world
Bow-tie Structure
Source: Broder et al, 2000
What can the graph tell us?

Distinguish important pages from unimportant ones
Page rank
Discover communities of related pages

Hubs and Authorities
Detect web spam

Trust rank
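The slide only names page rank as a way to separate important pages from unimportant ones. As a rough sketch of the idea, the code below runs the commonly described damped power iteration on a tiny link graph; the damping factor of 0.85, the even redistribution of rank from pages without outgoing links, the fixed iteration count, and the four-page graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}; returns {page: score}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank in equal shares along its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical web graph: pages = nodes, hyperlinks = directed edges.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(web)
print(max(scores, key=scores.get))  # C
```

Page C ends up with the highest score because three pages link to it, while D, which nothing links to, keeps only the baseline share.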
Power-law degree distribution
Source: Broder et al, 2000
Power-laws galore
Structure
  In-degrees
  Out-degrees
  Number of pages per site
Usage patterns
  Number of visitors
  Popularity, e.g., products, movies, music
The Long Tail
Source: Chris Anderson (2004)
Extracting Structured Data
http://www.simplyhired.com
Extracting structured data
http://www.fatlens.com
Ads vs. search results

Search advertising is the revenue model
  Multi-billion-dollar industry
  Advertisers pay for clicks on their ads
Interesting problems
  What ads to show for a search?
  If I'm an advertiser, which search terms should I bid on, and how much should I bid?
Two Approaches to Analyzing Data

Machine Learning approach
  Emphasizes sophisticated algorithms, e.g., Support Vector Machines
  Data sets tend to be small and fit in memory
Data Mining approach
  Emphasizes big data sets (e.g., in the terabytes)
  Data cannot even fit on a single disk!
  Necessarily leads to simpler algorithms
Systems architecture
[Figure: three architectures. Machine learning and statistics: CPU and memory. Classical data mining: CPU, memory and disk. Very large-scale data mining: a cluster of commodity nodes, each with its own CPU, memory and disk]
Systems Issues
Web data sets can be very large
Tens to hundreds of terabytes
Cannot mine on a single server!

Need large farms of servers
How to organize hardware/software to mine multi-terabyte data sets

Without breaking the bank!
Project
Lots of interesting project ideas
If you can't think of one, please come and discuss with us
Infrastructure
Aster Data cluster on Amazon EC2
Supports both MapReduce and SQL
Data
Netflix ShareThis Google WebBase TREC
Data Mining vs. Web Mining

Traditional data mining
  data is structured and relational
  well-defined tables, columns, rows, keys, and constraints
Web data
  Semi-structured and unstructured
  readily available data
  rich in features and patterns
Web Data: Web Structure
"Click here to Shop Online" tag
Web Data: Web Usage
Application server logs
HTTP logs
Web Data: Web Content
Web Mining Categories

Web Content Mining
Discovering useful information from web contents/data/documents.
Web Structure Mining

Discovering the model underlying link structures (topology) on the Web
E.g. discovering authorities and hubs
Web Usage Mining

Make sense of data generated by surfers
Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.
Web Content Data Structure

Unstructured: free text
Semi-structured: HTML
More structured: table or database-generated HTML pages
Multimedia data receive less attention than text or hypertext
Web Content Mining

Process of information or resource discovery from content of millions of sources across the World Wide Web
E.g. Web data contents: text, Image, audio, video, metadata and hyperlinks
Goes beyond key word extraction, or some simple statistics of words and phrases in documents.
Web Content Mining

Pre-processing data before web content mining: feature selection (Piramuthu 2003)
Post-processing data can reduce ambiguous search results (Sigletos & Paliouras 2003)
Web Page Content Mining
  Mines the contents of documents directly
Search Engine Mining
  Improves on the content search of other tools like search engines.
Web Content Mining

Web content mining is related to data mining and text mining. [Bing Liu. 2005]
It is related to data mining because many data mining techniques can be applied in Web content mining.
It is related to text mining because much of the Web content is text.
Web data are mainly semi-structured and/or unstructured, whereas data mining deals mainly with structured data and text mining with unstructured text.
Web Content Mining: IR View

Unstructured Documents
Bag of words, or phrase-based feature representation
Features can be boolean or frequency based
Features can be reduced using different feature selection techniques
Word stemming, combining morphological variations into one feature
Web Content Mining: IR View

Semi-Structured Documents
Uses richer representations for features, based on information from the document structure (typically HTML and hyperlinks)
Uses common data mining methods (whereas unstructured might use more text mining methods)
Web Content Mining: DB View

Tries to infer the structure of a Web site or transform a Web site to become a database
  Better information management
  Better querying on the Web
Can be achieved by:
  Finding the schema of Web documents
  Building a Web warehouse
  Building a Web knowledge base
  Building a virtual database
Web-Structure Mining
Generate a structural summary about the Web site and Web page
Categorizing Web pages, based on their hyperlinks, and the related information at the inter-domain level
Discovering the Web page structure
Discovering the nature of the hierarchy of hyperlinks in the Web site and its structure
Finding information about Web pages
Inference on hyperlinks
Retrieving information about the relevance and the quality of the Web page
Finding the authoritative pages on the topic and content
A Web page contains not only information but also hyperlinks, which carry a huge amount of annotation
A hyperlink identifies the author's endorsement of the other Web page
More Information on Web Structure Mining

Web Page Categorization. (Chakrabarti 1998)
Finding micro communities on the web e.g. Google (Brin and Page, 1998)
Schema Discovery in Semi-Structured Environment.
Web Usage Mining

Tries to predict user behavior from interaction with the Web
Wide range of data (logs)
  Web client data
  Proxy server data
  Web server data
Two common approaches
  Map usage data into relational tables before using adapted data mining techniques
  Use log data directly by utilizing special pre-processing techniques
Web Usage Mining

Typical problems: distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers
Usage mining often uses some background or domain knowledge
E.g. site topology, Web content, etc.
Web Usage Mining
Two main categories:

Learning a user profile (personalized)
  Web users would be interested in techniques that learn their needs and preferences automatically
Learning user navigation patterns (impersonalized)
  Information providers would be interested in techniques that improve the effectiveness of their Web site or bias the users towards the goals of the site
Web-Usage Mining (cont.)
Analysis: Data Mining Techniques: Navigation Patterns

Example:
70% of users who accessed /company/product2 did so by starting at /company and proceeding through /company/new, /company/products and /company/product1
80% of users who accessed the site started from /company/products
65% of users left the site after four or fewer page references
Web-Usage Mining (cont.)
Data Mining Techniques: Sequential Patterns

Example: Supermarket

Customer   Transaction Time    Purchased Items
John       6/21/05 5:30 pm     Beer
John       6/22/05 10:20 pm    Brandy
Frank      6/20/05 10:15 am    Juice, Coke
Frank      6/20/05 11:50 am    Beer
Frank      6/20/05 12:50 am    Wine, Cider
Mary       6/20/05 2:30 pm     Beer
Mary       6/21/05 6:17 pm     Wine, Cider
Mary       6/22/05 5:05 pm     Brandy
Web-Usage Mining (cont.)

Example: Supermarket (cont.)
Customer Sequences

Customer   Customer Sequence
John       (Beer) (Brandy)
Frank      (Juice, Coke) (Beer) (Wine, Cider)
Mary       (Beer) (Wine, Cider) (Brandy)

Mining Result

Sequential Patterns with Support >= 40%   Supporting Customers
(Beer) (Brandy)                           John, Mary
(Beer) (Wine, Cider)                      Frank, Mary
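The support values in the supermarket example can be checked with a few lines of code. This is a sketch under assumptions: a customer is taken to support a pattern when the pattern's itemsets appear in order (not necessarily consecutively) in that customer's sequence, and the helper names `contains_sequence` and `pattern_support` are hypothetical.

```python
def contains_sequence(customer_sequence, pattern):
    """True if the pattern's itemsets occur, in order, within the sequence."""
    position = 0
    for itemset in customer_sequence:
        if position < len(pattern) and pattern[position] <= itemset:
            position += 1
    return position == len(pattern)


def pattern_support(sequences, pattern):
    """Return (supporting customers, fraction of customers supporting)."""
    supporters = [c for c, seq in sequences.items() if contains_sequence(seq, pattern)]
    return supporters, len(supporters) / len(sequences)


# Customer sequences from the supermarket example.
sequences = {
    "John": [{"Beer"}, {"Brandy"}],
    "Frank": [{"Juice", "Coke"}, {"Beer"}, {"Wine", "Cider"}],
    "Mary": [{"Beer"}, {"Wine", "Cider"}, {"Brandy"}],
}
print(pattern_support(sequences, [{"Beer"}, {"Brandy"}]))
print(pattern_support(sequences, [{"Beer"}, {"Wine", "Cider"}]))
```

Both patterns are supported by two of the three customers (about 67%), which clears the 40% threshold.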
Web-Usage Mining (cont.)

Web usage examples
In Google search, within the past week, 30% of users who visited /company/product/ had "camera" as text.
60% of users who placed an online order in /company/product1 also placed an order in /company/product4 within 15 days
Techniques for Web Content Mining
Classification
Clustering
Association
Document Classification
Supervised Learning
Supervised learning is a machine learning technique for creating a function from training data.
Documents are categorized
The output can predict a class label of the input object (called classification).
Techniques used are

Nearest Neighbor Classifier
Feature Selection
Decision Tree
Feature Selection
Removes terms in the training documents that are statistically uncorrelated with the class labels
Simple heuristics
  Stop words like a, an, the, etc.
  Empirically chosen thresholds for ignoring too frequent or too rare terms
  Discard too frequent and too rare terms
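The simple heuristics can be sketched as a filter on document frequency. Note that this covers only the stop-word and frequency-threshold heuristics, not the statistical correlation with class labels; the thresholds `min_df=2` and `max_df_ratio=0.9`, the toy documents, and the function name `select_terms` are assumptions chosen for illustration.

```python
from collections import Counter


def select_terms(docs, stop_words, min_df=2, max_df_ratio=0.9):
    """Keep terms that are not stop words and whose document frequency
    lies between min_df and max_df_ratio * number of documents."""
    df = Counter(term for doc in docs for term in set(doc))
    max_df = max_df_ratio * len(docs)
    return sorted(
        t for t, n in df.items()
        if t not in stop_words and min_df <= n <= max_df
    )


# Hypothetical tokenized training documents.
docs = [
    ["the", "camera", "has", "a", "zoom", "lens"],
    ["the", "camera", "has", "memory"],
    ["the", "lens", "has", "a", "cap"],
]
kept = select_terms(docs, stop_words={"a", "an", "the"})
print(kept)  # ['camera', 'lens']
```

Here "has" is dropped as too frequent (it occurs in every document), and "zoom", "memory" and "cap" are dropped as too rare.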
Document Clustering
Unsupervised learning: a data set of input objects is gathered
Goal: evolve measures of similarity to cluster a collection of documents/terms into groups such that similarity within a cluster is larger than across clusters.
Hypothesis: given a "suitable" clustering of a collection, if the user is interested in document/term d/t, he is likely to be interested in other members of the cluster to which d/t belongs.
Hierarchical
  Bottom-up
  Top-down
Partitional
Semi-Supervised Learning
A collection of documents is available
A subset of the collection has known labels
Goal: to label the rest of the collection.
Approach
  Train a supervised learner using the labeled subset.
  Apply the trained learner to the remaining documents.
Idea
  Harness information in the labeled subset to enable better learning.
  Also, check the collection for the emergence of new topics
Association
Example: Supermarket

Transaction ID   Items Purchased
1                butter, bread, milk
2                bread, milk, beer, egg
3                diaper

An association rule can be

If a customer buys milk, in 50% of cases, he/she also buys beer. This happens in 33% of all transactions.
50%: confidence
33%: support
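The support and confidence figures for the rule "milk implies beer" can be reproduced from the three supermarket transactions. This is a minimal sketch; the function name `rule_metrics` is an assumption, and support is computed as the fraction of all transactions containing both sides of the rule.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Transactions 1 to 3 from the supermarket example.
transactions = [
    {"butter", "bread", "milk"},
    {"bread", "milk", "beer", "egg"},
    {"diaper"},
]
support, confidence = rule_metrics(transactions, {"milk"}, {"beer"})
print(round(support, 2), confidence)  # 0.33 0.5
```

Milk appears in two transactions and beer accompanies it in one of them, giving 50% confidence; that one transaction is one of three overall, giving 33% support.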
Can also Integrate in Hyperlinks
Q&A
