CLUSTER ANALYSIS
Hal Hagood
u06a1
“Clustering or cluster analysis is a generic name for a group of related techniques (such as
constructions, Q-analysis, and so on) that automatically try to find natural groupings in the data. One
crucial difference between clustering and a typical classification model is the absence of any target
variable (where classes or groups are known a priori) in the data. In the context of textual data, this
means that no labeled training examples are needed before documents can be clustered into groups.
As a conceptual activity, the assignment of objects into groups is something humans do routinely
all through their lives to reduce the complexity of the environment that they have to work with. The natural
grouping of objects and observations is extremely important to many disciplines (such as statistics,
psychology, sociology, biology, engineering, economics, and business). Each of these disciplines, in turn,
has used its own label to describe cluster analysis. Although the names might differ across disciplines, all
disciplines share the fundamental concept of separating data suggested by the natural groupings in the
data. In essence, cluster analysis attempts to group objects so that each object in a cluster is similar to
the other objects in the same cluster. However, objects in different clusters are dissimilar to each other. In
the context of textual data, objects are the documents that must be assigned to clusters so that within a
cluster, documents are similar, but between clusters, documents are different.
The basic idea is that documents within a cluster should be similar to each other, and documents
in different clusters should be dissimilar to each other. The similarity between two documents is based on
the similarity of features (such as terms or words) between documents in the vector space model. In this
context, we discuss latent semantic indexing (LSI), which provides a method for determining the similarity
of words and passages by the analysis of large text corpora. Then, we discuss the concept of topic
extraction from a collection of documents. A topic is conceptualized as a collection of terms that capture
the main themes or ideas in the document. Unlike cluster groups, where each document is assigned to
only one cluster, the same document can be assigned to multiple topics, depending on how many ideas
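The vector space notion of similarity described in the quoted passage can be illustrated with a small sketch. It assumes each document is reduced to raw term counts and compared with cosine similarity; the sample comments and the function names `term_vector` and `cosine_similarity` are hypothetical, invented for illustration, and are not taken from the survey or from SAS.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text, split it into word tokens, and count each term."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical survey-style comments (assumption: not real survey data).
docs = [
    "great rewards and great discounts",
    "the rewards and discounts are great",
    "customer service was slow",
]
vectors = [term_vector(d) for d in docs]

sim_01 = cosine_similarity(vectors[0], vectors[1])  # many shared terms
sim_02 = cosine_similarity(vectors[0], vectors[2])  # no shared terms
print(sim_01, sim_02)
```

The first two comments share most of their terms and score well above the third, which shares none with the first. A clustering procedure would tend to place the first two in the same group for that reason.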
For this assignment, we worked with survey data that contains both structured and unstructured
fields. Analyzing it will help provide key insights into the survey responses. A project was created in
SAS Enterprise Miner, and the textual data from the survey was imported into it for analysis.
Tutorial Q: A Hands on Tutorial on Text Mining in SAS from the textbook for setting up the
One variable should be mined at a time. The first one selected is “Why_Best_Lylty_Card”, and the
value of Use is set to Yes for that variable. This setting applies to every node in this
mining procedure.
The Text Parsing node is used to parse the data. In the Text Filter node, the check spelling
option is set to Yes and the number of terms to display is set to All. The Text Filter node reduces the
total number of parsed terms or documents that are analyzed; here, all the inputs from the survey are
used. Next, the Text Cluster node is added. SAS Enterprise Miner clusters the documents into sets and
supplies a report on the descriptive terms for those clusters. The settings used were those outlined in
the tutorial. The output of the resulting diagram shows 10 clusters created with the nodes that were added.
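The parse, filter, and cluster sequence above can be sketched outside SAS in a few lines of Python. This is only an illustration under stated assumptions: the comments are hypothetical, the stop-word list and the `min_docs` threshold are invented, and a simple k-means-style assignment by cosine similarity is swapped in for the clustering step. It is not claimed to be what the Text Cluster node does internally.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-picked stop list, not the one SAS uses.
STOP_WORDS = {"the", "a", "an", "and", "is", "are", "was",
              "i", "it", "to", "of", "my", "for"}


def parse(text):
    """Parsing step: lowercase a comment and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def filter_terms(token_lists, min_docs=2):
    """Filtering step: drop stop words and terms in fewer than min_docs documents."""
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    keep = {t for t, n in doc_freq.items()
            if n >= min_docs and t not in STOP_WORDS}
    return [Counter(t for t in tokens if t in keep) for tokens in token_lists]


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster(vectors, k, iterations=10):
    """Clustering step: k-means-style loop seeded with the first k documents."""
    centroids = [Counter(vectors[i]) for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        # Rebuild each centroid as the summed term counts of its members
        # (cosine similarity ignores scale, so a sum works like a mean).
        for c in range(k):
            total = Counter()
            for v, lab in zip(vectors, labels):
                if lab == c:
                    total.update(v)
            if total:
                centroids[c] = total
    return labels, centroids


# Hypothetical loyalty-card comments (assumption: not the survey data).
comments = [
    "The rewards points are great",
    "Checkout is fast and the staff is friendly",
    "I love the rewards and the points",
    "Friendly staff and fast checkout",
    "Great rewards for my points",
    "The staff is fast",
]
vectors = filter_terms([parse(c) for c in comments])
labels, centroids = cluster(vectors, k=2)

# Report the most frequent terms per cluster as its descriptive terms.
for c, centroid in enumerate(centroids):
    print(c, [term for term, _ in centroid.most_common(2)])
```

Each cluster is summarized by its most frequent terms, which loosely mirrors the descriptive-terms report mentioned above. Seeding with the first k documents keeps the sketch deterministic, but it is a simplification; a real analysis would need a more careful choice of seeds and of the number of clusters.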
opinions. Examining the clusters individually can help in understanding customers’ sentiment
about the products and services provided. This can give insight into the connections between
positive and negative sentiments. Supplying stakeholders with this essential information can
help, and sometimes vastly improve, business decisions. This in turn relates directly to
maintaining customers’ satisfaction levels both now and in the future, contingent on whether they
Reference
Text Mining and Analysis. (2017). Text Mining and Analysis: Practical Methods, Examples, and Case
Studies Using SAS, Chapter 6 - Clustering and Topic Extraction. Retrieved August 25, 2017 from
http://viewer.books24x7.com/assetviewer.aspx?bookid=59026&chunkid=342485391&resume=yes&resumebookmarkid=dc367bed-ce7d-e711-a9c3-00505686029c