Sherwin A. Steffin
May 15, 2009
Introduction
Back in the days when I was founder and CEO of a Macintosh software
publishing company (BrainPower, Inc.), I designed a powerful text analysis
product, named ArchiText, for the Macintosh platform. An independent
contractor was employed for program coding. When the company closed in
1990, the program disappeared from public view, but to this day it contains
features that I have found in no other free or low-priced program.
Wikipedia provides a useful definition of proximity search. As you read that
page you will note that while Google provides some facility for this feature, it
is more than a little clumsy to use. Other attempts to implement proximity
search have been similarly awkward or very limited; one example is an API
designed to do only two-word proximity searches.
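As a sketch of the underlying idea, here is a within-k-words proximity test in Python. The whitespace tokenization is deliberately naive, and the sample sentence is invented for illustration:

```python
def within_k_words(tokens, term_a, term_b, k=10):
    """True if the two terms ever occur within k tokens of each other."""
    pos_a = [i for i, t in enumerate(tokens) if t.lower() == term_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t.lower() == term_b.lower()]
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)

tokens = "the pilot Seal reportedly met officials connected to Bush years later".split()
print(within_k_words(tokens, "Seal", "Bush", k=6))   # True
print(within_k_words(tokens, "Seal", "Bush", k=3))   # False
```

A real search engine would precompute the position lists in an index rather than scanning the text, but the distance test itself is this simple.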
Perhaps there is no greater frustration than searching multiple terms using the
AND operator, and finding thousands of pages containing both terms, yet with
the terms totally unrelated to each other. Depending on the search terms
employed, one can get either some very targeted results or page after page in
which all the terms appear but have nothing to do with one another.
Suppose you are searching for instances where George H.W. Bush AND CIA
drug-running pilot Barry Seal appear and are identified as having a
relationship to each other. You do a quick search: Bush AND Seal. Google
reveals a page count of 1,520,000 pages! Treating both terms as
quote-delimited phrases, “George Bush” AND “Barry Seal,” still yields a
count of 967 pages. Even after scanning link titles, the opening and scanning
of a substantial number of pages is required to determine whether there was
a specific commonality between these individuals.
If, however, the two terms can be linked as occurring within the same
sentence or paragraph, commonalities can be rapidly identified.
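The idea of restricting a search to co-occurrence within a sentence can be sketched in a few lines of Python. The period-based sentence splitter is a rough heuristic (it would mis-split abbreviations such as “H.W.”), and the sample text is invented:

```python
import re

def cooccurring_sentences(text, term_a, term_b):
    """Return only the sentences in which both terms appear."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences
            if term_a.lower() in s.lower() and term_b.lower() in s.lower()]

sample = ("Barry Seal flew for the CIA. "
          "George Bush headed the CIA in 1976. "
          "Some writers claim Bush and Seal crossed paths.")
print(cooccurring_sentences(sample, "Bush", "Seal"))
```

Only the third sentence survives the filter, which is exactly the narrowing effect described above: pages where the terms merely co-exist are discarded, pages where they co-occur in one sentence are kept.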
The following material is a three-part white paper introducing the reader to three
approaches to Data Mining, requiring little or no previous training and using low-
cost or free computer applications.
In Part I, the reader is introduced to some principles of text analysis, and then
walked through an example of how these principles can be applied using
ArchiText, the program described above.
Do-It-Yourself Data Mining – Part I
Text Analysis Using ArchiText
Before considering the existing program, and the mechanics necessary for
updating it, it is important that we share a common understanding of the
principles of text analysis that this author considered in developing the original
design of the program.
Concordancing
The earliest analysis of text began with a process going back to the development
of printing. Called concordancing, it consists of locating every word within a
text and counting the frequency of its appearance within the text. This process was
first applied to scholarly analyses of the Bible. The slide below illustrates a
screen shot of TextStat, a free concordancing program available on the Net.
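What a concordancer computes can be sketched in a few lines of Python; the regex tokenizer here is a simplification of what a real program such as TextStat does:

```python
import re
from collections import Counter

def word_frequencies(text):
    """Locate every word and count the frequency of its appearance."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

freq = word_frequencies("the cat sat on the mat and the dog sat too")
print(freq.most_common(2))   # [('the', 3), ('sat', 2)]
```

Real concordancers add case handling, hyphenation rules, and per-document counts, but the core is this tokenize-and-tally loop.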
Still, the mere presence of a word, or even its frequency of occurrence, provides
relatively limited information regarding its usage within the corpus (the totality of
all documents under analysis). The next step, therefore, is to view a word or
words of interest within a context. You will often find the acronym KWIC (Key
Words In Context) referring to this process.
While not shown here, proximity searching is made available in the Query Editor,
initiated with the button shown.
Expanding to a full citation provides a full view of just how and where each word
appears in the context of the total document. This is quickly accomplished by
clicking the Citation button.
The Complexity of Word Tags
Using a number of methods, individual words can be linked to each other as they
occur throughout the corpus of material under study. This is especially essential
when the document contains a large number of names of people, locations, or
events which can easily become confusing. Which person is linked to which
employer, location, or event? What about individuals sharing the same surname?
Separating into categories and clear identities brings clarity to confusion.
The next five slides, with text extracted from the 9/11 Commission Report,
illustrate this clarification process.
The first step in increasing the information value of individual words is to select a
category, in this case surnames of individuals. Using the Replace dialog in Word,
change all instances of a surname to ALL CAPS.
Any compound word which you wish to have listed completely must have the
space character replaced with a hyphen.
Prefixes serve to group words which are members of the same category together
so that they will appear in a group within a word listing.
Suffixing of the root word provides differentiation between identical names. Thus in
the example below, the two “AL-SHEHRIs” are identified as being siblings, but
can be separated with respect to individual activity.
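The capitalization, compounding, prefixing, and suffixing steps described above are performed manually in Word, but their combined effect can be sketched programmatically. The names are drawn from the 9/11 Commission Report example; the "p-" prefix and the sibling suffixes are purely illustrative conventions:

```python
def tag(text, target, prefix="", suffix=""):
    """Upper-case a name, hyphenate its internal spaces so it lists as
    one token, and attach an optional category prefix and suffix."""
    replacement = prefix + target.upper().replace(" ", "-") + suffix
    return text.replace(target, replacement)

doc = "Wail al Shehri boarded first. Waleed al Shehri followed from New York."
doc = tag(doc, "Wail al Shehri", prefix="p-", suffix="-SIB1")
doc = tag(doc, "Waleed al Shehri", prefix="p-", suffix="-SIB2")
doc = tag(doc, "New York")
print(doc)
```

After tagging, a frequency list groups both brothers under "p-" while the suffix keeps their individual activities separable, which is exactly the point of the scheme.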
Here are some rules for compound words, other than the name of people:
All of the techniques shown above enhance your capabilities for analysis of a
document or collection of documents, but none are essential to the full
employment of the program.
The first step in using ArchiText is to import the corpus of one or more
documents into the program. In the example shown below the text of the 9/11
Commission Report is going to be subjected to analysis.
In general, if the document is of any significant length you'll want to split it into
categories or sections which represent some logical division of the whole. After
creating a new ArchiText file, select the file you're going to use.
You can elect to import the entire file, or to split the file into elements, referred to
as “Nodes.” If you choose the former, you will have just one node, having the
document name.
If you decide to split into sections, you can use any symbolic character (or
combination of symbolic characters) as the target string by which sections are
split, as shown above.
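The splitting step can be sketched in Python. Here "@@" stands in for whatever symbolic target string you choose, and treating each chunk's first line as the node name is an assumption for illustration, since ArchiText's actual naming rules are not described here:

```python
def split_into_nodes(text, marker="@@"):
    """Split a document at each marker string; each piece becomes a
    'node', keyed by its first line."""
    nodes = {}
    for chunk in text.split(marker):
        chunk = chunk.strip()
        if chunk:
            name, _, body = chunk.partition("\n")
            nodes[name.strip()] = body.strip()
    return nodes

raw = "@@Chapter 1\nText of chapter one...\n@@Chapter 2\nText of chapter two..."
nodes = split_into_nodes(raw)
print(sorted(nodes))   # ['Chapter 1', 'Chapter 2']
```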
In this example, we have split the entire document into chapters, as illustrated in
the Node Directory shown below. Double clicking on any of the Nodes will open a
window in which the text of the node appears.
Node Selection
Regardless of the analysis you are going to perform, you can select all nodes,
an ordered set of nodes, or a discontinuous combination of nodes. Preferences
allow you to order nodes alphabetically or by time modified.
Keyword Lists
This is typically the first analysis you are going to do. Usually, you will have
done a lot of tag preparation in the original file, as described above. In the
selection above, our interest is in identifying the key terrorist players, so the
keyword search was restricted to those nodes where they were discussed.
After selecting the nodes whose words are to be listed, the keyword dialog sets
up the parameters for the listing.
Most of these choices are self-explanatory, but the “stop word list” requires some
discussion. ArchiText comes pre-loaded with a modifiable list of words – articles,
prepositions and auxiliary verbs, which ordinarily are irrelevant to content. Thus,
when this item is checked, these words are eliminated from the frequency
listings. However, there are times when these words have usefulness for a given
analysis, and they can then be included by un-checking this box.
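The effect of the stop-word option can be sketched as follows; the short stop list here is only a stand-in for ArchiText's pre-loaded, modifiable list:

```python
from collections import Counter

# Articles, prepositions, and auxiliary verbs -- editable, like ArchiText's list
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "were"}

def keyword_list(text, use_stop_list=True):
    """Frequency list, optionally suppressing stop words."""
    counts = Counter(text.lower().split())
    if use_stop_list:
        for word in STOP_WORDS:
            counts.pop(word, None)
    return counts.most_common()

text = "the plot was known to the cell and the cell was in place"
print(keyword_list(text))          # content words only
print(keyword_list(text, False))   # 'the' now tops the list
```

Unchecking the box corresponds to passing `use_stop_list=False`: the function words reappear and usually dominate the top of the list.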
In this partial view of the resulting frequency list, each person has been prefixed
with “p-”, which facilitates grouping everyone who fits into the category
“Person.” For those occurring with high frequency, we will proceed to extract all
information regarding them, and combine that information into a single new node
focused on each of them.
Extract and Combine to make new nodes
In our first search, two of the terrorists have been selected. Selecting the “S” tab
will automatically initiate the search dialog. Remember that the nodes have been
preselected when the keyword list was constructed.
After pressing the “Start Search” button, you will see the following results in the
Directory.
Notice that the number of occurrences of the names of the two terrorists, within
each node, is highlighted. The next step is to extract just this information, and
combine it into a new node. To do this, select “Combine Nodes” from the
“Analysis” menu.
In the example shown below, we have searched for George Bush, and are
extracting all occurrences of his name throughout the nodes.
Select “Embed Node Name” if you want the source nodes named in the new
node. After completion, a new node is created containing only those paragraphs
in which Bush is named. The result of this combination looks like this:
This illustration is, of course, only a small portion of the nodes in which the Bush
name occurs. If you wish, you can “drill down” further, building a keyword list for
this node alone, and searching for other combinations related to Bush, as they
occur within Paragraphs or sentences in which his name occurs. If desired, you
could build extracted nodes for any combinations of Bush and other words
included in your search.
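The extract-and-combine operation can be sketched as follows; the node texts are invented, and the bracketed labels mimic the “Embed Node Name” option:

```python
def combine_nodes(nodes, term, embed_node_name=True):
    """Gather every paragraph mentioning `term` into one new node."""
    pieces = []
    for name, text in nodes.items():
        for para in text.split("\n\n"):
            if term.lower() in para.lower():
                pieces.append(f"[{name}]\n{para}" if embed_node_name else para)
    return "\n\n".join(pieces)

nodes = {
    "Chapter 1": "Bush was briefed that morning.\n\nThe flights departed on time.",
    "Chapter 10": "The President, George Bush, addressed the nation.",
}
new_node = combine_nodes(nodes, "Bush")
print(new_node)
```

Drilling down then just means running the same function again, with a different term, over the new node alone.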
New maps are built in the same way as are new nodes -- by using the create
button for maps in the directory dialog. You will note that there are a number of
nodes which are not on the map, but which are available through selection and
pressing the "Add to Map" button. When nodes are deleted from the map, they
appear in the left column which is the "On-Call" list. Another way that nodes can
be added from the On-Call list is through a search which selects some of the
nodes in this list.
One way of visualizing the nodes found in a search is to change the size of the
nodes selected by that search. This option is available by selecting, "Change
Node Size," in the Map menu found on the main toolbar.
A far more powerful option is available. Using any of the eight available linking
tools, a "Parent Node" can be connected or linked to each of the
nodes to which it has a relationship. One example of this linking is shown in the
map below. In this case, Terrorist 001 (Osama bin Laden) is linked to each of the
chapters in which his name appears.
As you see below, every node which is linked to another is shown in the nodes
window. Double clicking on any node name opens the node window, and
depending on the preference settings, will either open the source node and
destination node, or simultaneously open the destination node while closing the
source node.
Implications for data mining
Some limitations
While the design of this program offers features which this author has found in no
other program, because it was designed in 1988, there are some limitations and
deficiencies which demand starting from the beginning and rebuilding the
program shell. Listed below are some of the current problems which must be
resolved for the program to reach its potential power for its users:
• By far the most serious deficiency in this program is the fact that it will
only operate on older Macintosh computers still running OS 9.x or
earlier. The search and linking functions are available in no other
program, except highly expensive, enterprise-level data mining
systems. Thus the program needs to be updated so that it is usable
on any platform.
What’s Next?
Do-It-Yourself Data Mining – Part II
Concepts and Display
Introduction
In beginning our consideration of Data Mining, readers will find many, if not all, of
the concepts involved to be foreign to their past experience. “Data mining
(DM),” also called Knowledge-Discovery in Databases (KDD) or Knowledge-
Discovery is the process of automatically searching large volumes of data for
patterns using tools such as classification, association rule mining, clustering,
etc.. Data mining is a complex topic and has links with multiple core fields such
as computer science and adds value to rich seminal computational techniques
from statistics, information retrieval, machine learning and pattern recognition.”
One reason that many avoid engaging in DM is the perception that it requires not
only training in statistics, but in database usage as well. Typically, those
employed as data analysts will have formal training and experience in database
programming languages, statistical programming, as well as research design.
This paper seeks to provide those who have competence in general computer
applications with the intellectual tools necessary to shortcut the heavy-duty
software and training used by professionals.
Assume that you hold a “Liberal” viewpoint with respect to the war. You will tend
to reject the statements of conservatives – in essence “screening out” the Red
view of any dispute. You will tend to accept, in fact even to receive, only that
information which is in agreement with these long-held views.
Conversely, if the view you hold is consistent with those held by conservatives,
you will tend to see and incorporate into your thinking the views held by members
of that group.
Looking at the names given to the disciplines and knowledges used by those who
are engaged in Data Mining, you are likely to be thinking that such training is far
beyond your own education. This article is designed to show you that, while
many of those who do DM have advanced academic training, the
core principles can be learned and employed by everyone – and in fact, can be
used by students in their early teens.
For those lacking a strong background in statistical analysis and the tools of
quantitative analysis such as Regression, Correlation and Cross Tabulation
(Contingency Tables), this material will initially appear to be very new. Most who
have little or no experience using and calculating statistics tend to think of
statistics as a discipline which uses only numeric values to reach conclusions or
results. While certainly this is very much the case, there are also a number of
statistical methods which use text exclusively, or as a part of the calculations.
As you proceed through this tutorial, you will find that the concepts introduced
are much easier to understand than you had previously believed to be the case.
From Numbers and Words to Conclusions – An
Example
Getting acquainted with statistical ideas
Before starting we need to define some terms used by statisticians when they
carry out research.
Two terms you will run across throughout this document are Sample and
Population. The 100 8th graders whose responses we are going to obtain are
a Sample of all boys and girls in the 8th grade going to school in the United
States. That total group is referred to as the 8th Grade Population. Whatever we
do with the sample, the greatest concern is that the results we find for the
sample are very similar to those which would be found if the entire Population
could be measured.
Variables
The height of each member (case) in the sample, as well as the grade point
average, are the numeric variables which will be used in this example.
Categorical (Text) variables are labels assigned to place each case in
one of two or more different Categories. “A,” “B,” “C,” “D,” and “F” are all
elements of the Categorical Variable named “Grade.” These categories are
either located and extracted in the text of documents being analyzed or derived
from equations, such as shown below. In this instance a formula was used to
derive a letter grade from the grade point average for each case.
Figure 1 Deriving Letter Grades from Performance Average
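A formula of the kind used in Figure 1 can be sketched as a small function; the 10-point cut points are inferred from the grade ranges described later in the text, with anything below 61 counting as failing:

```python
def letter_grade(average):
    """Map a numeric performance average to a letter grade."""
    if average >= 91:
        return "A"
    if average >= 81:
        return "B"
    if average >= 71:
        return "C"
    if average >= 61:
        return "D"
    return "F"

print([letter_grade(a) for a in (95, 84, 73, 65, 55)])   # ['A', 'B', 'C', 'D', 'F']
```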
First Steps – Defining the Analysis
This is the most important element of conducting your analysis. Your efforts are
done for some purpose, and this first step is where you define the questions for
which you seek answers. Here is an example of the kind of question that you
may seek to answer through your analysis.
If this sample had been actually collected, it would have been done by asking
each sampled student the following questions:
1. Are you a boy or a girl? ________
2. What is the most recent report card grade you received in this class? ____
3. How tall are you (in inches)? ____
4. If you have a choice of activity on the weekend, which would you prefer to
do? ___ Play Sports ___ Do something else [Put an “X” on the line next to
your choice.]
From a combined word count of a little over 500 words (100 sets of answers to
the 4 questions), you will find that a wealth of information can be obtained. You
will begin to reach some powerful answers to your questions and know the
likelihood that your answers are correct.
The first thing we want to know is whether there are an equal number of boys
and girls in our sample. As you can see both are equally represented in the
sample.
Figure 2 – Gender composition of sample
What about the academic performance of everyone in the sample? Since the letter
grades are derived from numeric averages of performance throughout the school
year, we use a Bar Chart to inspect the number of students receiving each of the
five grades.
Figure 3 – Grade Frequencies
The horizontal axis contains the ranges of grades, in this case with the lowest
being 51 to 60, proceeding in increments of 10 points, and ending in a
perfect score of 100. The vertical axis gives you the number of cases within the
selected ranges.
We can see that performance is pretty much as we might expect – A few in the
failing and top ranges, more in the high and low ranges, and most right in the
middle…where a “C” is the grade awarded. But that is far from the whole story.
Looking at this graph, you will certainly want to know whether there is any
difference between the academic performance of the boys vs. the girls.
The first step is one of inspecting the frequency of each category Boys and Girls,
with respect to the entire sample. To do this, we use a plot called a “Dot Plot,”
and with the boxes added, the plot is referred to as a “Box Plot.”
Figure 4 – Dot and Box Plots of Performance Averages
As you can see, the girls did better than the boys, both in the median of their
scores, (white line in the center of the shaded areas) and in the top grade
received by a girl vs. that of a boy. While none of the girls failed (score below 61),
two of the boys did. Removing the boxes, we see the three boys who are
“outliers,” those who represent extreme low values disconnected from the rest of
their group, as well as the three high-scoring girls who are also disconnected
from the rest of their group by their high performance.
Yet, while we have a good look at where the average scores fall, and the
distribution of scores in each group, we still don’t have an accurate count of
how many in each group received which letter grade. To get this precision, we
turn to a statistical tool referred to as a “Contingency Table.”
As you look at the table, what becomes evident is that, at every grade level, the
girls did better than the boys. There is something else that is evident. If you look
at the p-value shown at the bottom of the table, you will note that it is shown as p
= .0478. This tells you that there is slightly less than 5 chances in 100 that you
would find the boys equaling or surpassing the performance of the girls, if you
repeated this survey with other boys and girls in the 8th grade.
Figure 5 – Table Boy vs. Girl Grades Received
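Building such a contingency table from two categorical variables can be sketched as follows; the six cases below are invented for illustration:

```python
from collections import Counter

def contingency_table(rows, cols):
    """Cross-tabulate two equal-length lists of category labels."""
    counts = Counter(zip(rows, cols))
    return {r: {c: counts[(r, c)] for c in sorted(set(cols))}
            for r in sorted(set(rows))}

gender = ["Boy", "Girl", "Girl", "Boy", "Girl", "Boy"]
grade  = ["C",   "A",    "B",    "C",   "A",    "D"]
print(contingency_table(gender, grade))
```

Each cell simply counts how many cases carry that particular pair of labels; everything that follows (expected values, residuals, p-values) is computed from these counts.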
Finding relationships
To this point, we have been working largely with word frequencies to
interpret the information we have displayed. Now we turn to some
numerical methods for determining relationships between the underlying
variables which have led to classifying some of the variables into words.
The 8th grade is a time of great change in physical growth for adolescents.
Thirteen-year-old girls tend to be well into puberty, while many boys lag in
development, causing a far greater variation in boys’ heights than is found among
the girls.
Figure 6- Comparison of Heights for Boys and Girls
This Scatter Plot of height vs. grade averages for the entire sample is very
interesting. There is a modest trend toward those receiving higher grades being
shorter than those receiving lower grades. Boys are shown as Xs, and girls as
small open circles. The red line is the cutoff separating low and high
grades. Is this trend the same or different for boys and girls?
To answer this question we split the total by gender as shown in the figure below.
Figure 9 – Height vs. Grades by Gender
While we see this trend for boys persisting in the left plot, there appears to be
almost no relationship for the girls, as shown by the nearly level line in the girl’s
plot at right. One of the nice things about this kind of display is that it does not
require the viewer to try and interpret complicated numeric calculations – instead,
simply looking at the plot makes a number of things evident:
• There are only 7 girls as opposed to 18 boys who received grades
below “C” (< 71) in this sample.
• Looking at the boys’ heights, the shortest boys tended to get the best
grades, while the tallest boys were more evenly distributed in the grades they
received. Thus, one might posit that short boys may be more motivated
toward academic work than tall boys, since they have fewer distractions in their
attraction to the girls, and less likelihood of being engaged in time-consuming
sports activity.
• Conversely, there is almost no relationship between the heights of girls
and the grades they receive. While they may be involved in extra-curricular
activities, most parents will severely limit any dating activity by this age group.
Recall that there was another two-choice question in our example. It asked, “If you
have a choice of activity on the weekend, which would you prefer to do? ___ Play
Sports ___ Do something else.”
Recall that in our example, these students are living in a small town in Texas. In such
towns, high school football assumes a high degree of importance. This cultural
bias toward sports participation leads us to the following hypotheses:
• While the students in our sample are too young to participate in a high
school program, boys will have strong aspirations and interest in future
participation, leading to leisure-time sports activity.
• Since high school football is a male-only sport, girls will show less
interest in sports participation, although some, of course, will participate in
programs that are equally open to boys and girls.
• Regardless of gender, students with larger body mass will have a
greater inclination toward sports participation than their smaller counterparts.
• For a variety of reasons, students who either participate, or are
emotionally invested in organized athletic activity will tend towards lower
grade achievement than those not involved.
We begin this analysis with a contingency table showing the relationship between
academic performance and sports participation:
Figure 10 – Table Activity Choice vs. Grade Performance
When the choice of “Play Sports” vs. “Something Else” is overlaid on grades
and height, we see a clear relationship between those receiving poor grades and
those choosing to spend their leisure time playing sports rather than doing
“Something Else.” You will also note that this behavior is much more pronounced
among the boys (11) as compared with the same choices made by girls (3).
Figure 11 – Overlay of Activity Choice vs. Grades and Height
In reviewing all of the above, it is important to note that all of these findings are
based upon a set of results constructed by the author. The survey questions
described were never actually given to any group of students, and the results are
completely fictitious. There are, in fact, a number of studies which take an
opposing view, finding that student athletes tend to be among the high
performers within their educational settings. One peer-reviewed study is
available here, involving many more variables than the sample provided here.
Statistical Programs
For anyone wanting to do serious analysis using the methods described above, a
statistical analysis program is required. Many readers will be reluctant either to
spend the money or to climb the steep learning curve required to master many
professional-level programs. The author uses a professional version of
DataDesk, a uniquely powerful, yet easy-to-use program. A relatively low-cost
($75.00) Excel add-in for the same program is offered by the publisher, making
available all of the analyses described above.
Do-It-Yourself Data Mining – Part III
Using Block Tags to Analyze Text
Introduction
In Part II of this series of articles, you learned how text and numeric data can be
used to extract meaningful and useful information from large collections of textual
material. In this section, we look at how the textual content can be reorganized
so that it can be extracted as demonstrated in the previous article.
There are two kinds of tagging which will be considered, each form serving to
answer different purposes.
Word Tags
Word Tags are words contained within the document which have particular
importance, either because of the frequency with which they occur or because of
their association with other words within a sentence or paragraph. By capitalizing
(George BUSH), compounding (New-York), and adding prefixes (DOD-
RUMSFELD) or suffixes (BUSH-POTUS43) to words identified as important,
word elements can be combined and linked to others having some element in
common. A full discussion of Word Tags was provided in Part I of this series of
articles.
Block Tags
Block Tags are words added by the user to categorize sections of text (by
sentence or paragraph) such that Contingency Tables can be constructed
showing the dependence or independence of one category with respect to
another. Before you scratch your head trying to figure out what I am saying, here
is an illustration of the process I am describing:
By the same time in 2006, with the war in Iraq going badly,
much of the nation’s attention was directed at resolving the
war, and away from the issues brought in 2005.
Here are some of the questions which flow from this brief description:
Was there a significant difference in the emphasis that the President gave to
various subjects which appeared in both speeches when compared against the
two years in question?
Since the speeches are used as a means of presenting agenda, and convincing
the audience of the value of the President’s views, what elements comprised the
form in which his views were presented?
We will closely inspect these and a number of other questions which Block Tags
answer.
In preparing this example, the text of both State of the Union speeches was
divided into sentences, and then, several categories were assigned to each
sentence:
• YEAR
• DOMESTIC/FOREIGN AFFAIRS
• RATIONAL/EMOTIONAL (A “Selector” Variable)
Here is what they look like in generating the sample contingency tables:
All of the tables you have viewed previously were simple 2 x 2 tables. This simply
refers to the number of columns and rows within each table. In the example
below, the two speeches are in columns, with each sentence classified as
referring to either FOREIGN AFFAIRS or DOMESTIC matters.
The question answered by this table is: Was there a difference between the two
speeches with respect to a change in emphasis regarding FOREIGN AFFAIRS
or DOMESTIC matters?
Since the number of sentences in the ’06 speech increases by 27 over that of
’05 (287 vs. 260), and the number of FOREIGN AFFAIRS sentences likewise
increases by 27 (116 vs. 89), it may at first appear that this does not really
represent a huge change.
To get a preliminary idea of the magnitude of the change, let us first substitute
the percentage of column totals to see whether this difference looks important.
All of the expected values are above or below their respective counts by a little
over 8 points. Thus we turn to the next measure of differences – the
Standardized Residuals. This number (the difference between the observed and
expected count, scaled by the square root of the expected count) tells you how
far each cell's actual count departs from its expected value.
Looking at the highlighted number, we see that during the ’06 speech
FOREIGN AFFAIRS showed the greatest increase in emphasis. The question
that remains to be answered is: How likely is it that the increase in FOREIGN
AFFAIRS statements was related to the time that it occurred (2006), or was it
merely due to chance?
To answer this question, we turn to two other statistics: The Chi-Square value,
and the Probability (p = ) of occurrence:
Let’s see how this operates when using the contingency table with which we
have been working:
The number “p = 0.1355” is translated as: if there were really no difference
between the FOREIGN AFFAIRS statements made in 2005 and 2006, a
difference at least this large would still occur by chance about 13.5% of the
time.
For most researchers this number is far too high. Most researchers require a p-
value less than .05 (5%) or .01 (1%), and so we would retain the null hypothesis,
since the .1355 number is too large to permit rejecting it.
Without going into a technical explanation of the Chi-Square value, the general
rule is that the higher this number, the lower the p-value will be. The “df” refers to
“Degrees of Freedom.” It is derived from the number of cells in the table. As df
increases, a larger Chi-Square value is required to obtain the same p-value.
Neither of these statistics is necessary for your interpretation of these tables.
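For readers who want to see the machinery, the computation behind these numbers can be reproduced in a few lines. The counts are the FOREIGN AFFAIRS and DOMESTIC sentence totals for 2005 and 2006 given above (89 and 116 foreign-affairs sentences out of 260 and 287 total), and the closed-form p-value used here applies only to the df = 1 case of a 2 x 2 table:

```python
import math

# Rows: FOREIGN AFFAIRS, DOMESTIC; columns: 2005, 2006
observed = [[89, 116],
            [171, 171]]

row_tot = [sum(row) for row in observed]
col_tot = [sum(col) for col in zip(*observed)]
n = sum(row_tot)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_tot[i] * col_tot[j] / n   # expected count for this cell
        chi2 += (o - e) ** 2 / e          # squared standardized residual

# For df = 1 the chi-square tail probability reduces to the
# complementary error function
p = math.erfc(math.sqrt(chi2 / 2))
print(round(chi2, 2), round(p, 4))   # roughly 2.23 and 0.135
```

The expected values differ from the counts by about 8.4 in every cell, and the resulting p-value reproduces the 0.1355 reported in the table (no continuity correction is applied here).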
Selector Variables
A selector variable is used to filter all of the counts within a contingency table,
reducing the totals in each cell to only those cases which meet the criteria for the
selector. In the example below, we add the EMOTIONAL Selector to the
contingency table we have been using.
Inspecting the two tables reveals some important differences between them.
• Comparing the Total Counts between the two tables, EMOTIONAL
statements account for 27.2% of all statements made by the President in both
years. EMOTIONAL sentences occurred 3.5 times more frequently in ’06, when
compared to ’05.
• DOMESTIC sentences decreased from 342 to 97 when the selector was
applied, and the ratio between the two years shifted from exactly 1:1 to
EMOTIONAL statements dominating ’06 by a ratio of 3.6:1.
• While total statements regarding FOREIGN AFFAIRS increased
by 30% between the two years, those having an EMOTIONAL rating
increased by 81%.
• Most importantly, the probability for the Null Hypothesis being TRUE
has dropped from .135 to .065.
• Taking all of the data together, the interpretation which follows
concludes that the President significantly increased the use of emotional
appeals between the ’05 and ’06 speeches.
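The mechanics of a selector variable can be sketched as a simple filter applied before cross-tabulation; the three records below are invented for illustration:

```python
def apply_selector(records, field, value):
    """Keep only the cases that meet the selector criterion."""
    return [r for r in records if r[field] == value]

records = [
    {"YEAR": "2005", "SUBJECT": "DOMESTIC", "TYPE": "EMOTIONAL"},
    {"YEAR": "2006", "SUBJECT": "FOREIGN",  "TYPE": "EMOTIONAL"},
    {"YEAR": "2006", "SUBJECT": "FOREIGN",  "TYPE": "FACTUAL"},
]
emotional = apply_selector(records, "TYPE", "EMOTIONAL")
print(len(emotional))   # 2
```

The contingency table is then rebuilt from the filtered records only, which is why every cell total shrinks when a selector is active.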
Preparing Block Tags
Selecting Variables
As previously mentioned, the number of variables and the variable headings
should be determined in advance. In the example which follows, there are three
variables, with headings YEAR, TYPE, and SUBJECT.
Category Selection
The general rule in determining the number of categories you are going to use is
to give each variable the smallest number of categories that still allows for the
information you seek to discover. Ideally, you will have a 2 x 2 table to work with.
Differences between expected and actual values will then be at their greatest,
giving the opportunity for the highest Chi-Square values and the lowest p values
obtainable. Of course, in real situations this is often not possible. In fact, until
you are actually in the process of coding variables, you will not know the names
of all the categories.
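To make the 2 x 2 intuition concrete, here is a minimal Chi-Square sketch in plain Python, with invented counts. It shows the relationship claimed above: when observed counts match the expected values the statistic is zero, and the further the cells diverge from expectation, the larger the statistic (and the lower the p value) becomes.

```python
def chi_square_2x2(table):
    # table is [[a, b], [c, d]]: two rows, two columns of observed counts.
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    grand = sum(row)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of row and column.
            expected = row[i] * col[j] / grand
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# Observed equals expected in every cell: Chi-Square is 0.
even = chi_square_2x2([[25, 25], [25, 25]])
# Strongly skewed cells: large observed-vs-expected gap, large Chi-Square.
skewed = chi_square_2x2([[40, 10], [10, 40]])
```

A statistics package will also convert the Chi-Square value to a p value; only the counting step is sketched here.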
TYPE
• EMOTIONAL = A conclusion based upon emotion, not facts.
(“Americans are a compassionate people,” or the “Axis of Evil.”)
• FACTUAL = A verifiable statement (“There were 12 marines killed in
the helicopter crash.”)
• F-CONCLUS = An inference or conclusion, based upon factual data
(“Another 20,000 troops are required to end the violence in Baghdad.”)
• PROMISE = A commitment made by the speaker. (“I will sign a bill
which…”)
• REQUEST = A request made by the speaker (“I ask that Congress
continue the Tax Reduction…”)
• REQUIRE = A demand made (“Congress must pass legislation to…”)
• RHETORICAL = A statement which communicates no information
SUBJECT
• CONGRESS = Statement regarding Congress
• ENERGY = Oil, nuclear, alternative energy sources
• INTERNATIONAL = All foreign affairs matters, except war
• LAW = Law enforcement and the courts
• MONEY = All things related to the economy, employment, Social
Security, taxes, budget, health insurance
• SECURITY = Homeland Security issues
• WAR
• YOUTH
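A practical safeguard when hand-coding is to keep the category scheme above as a small codebook and validate each tag against it, so that a typo never becomes a phantom category in the contingency tables. The sketch below is one way to do this; the category names are taken from the lists above, and the validation function is an illustrative addition, not part of the original program.

```python
# The coding scheme above, expressed as a codebook of allowed categories.
CODEBOOK = {
    "TYPE": {"EMOTIONAL", "FACTUAL", "F-CONCLUS", "PROMISE",
             "REQUEST", "REQUIRE", "RHETORICAL"},
    "SUBJECT": {"CONGRESS", "ENERGY", "INTERNATIONAL", "LAW",
                "MONEY", "SECURITY", "WAR", "YOUTH"},
}

def valid_code(variable, category):
    # Reject misspelled tags before they pollute the contingency tables.
    return category in CODEBOOK.get(variable, set())

ok = valid_code("TYPE", "EMOTIONAL")
bad = valid_code("SUBJECT", "EMOTION")  # misspelled tag is rejected
```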
Much of the text that you will be organizing and manipulating will come from web
pages, PDF files, or numerical tables. All raw data should be converted to Word
documents, so that the tools contained within Word can be utilized.
In our example we will be working with the State of the Union Addresses
delivered in 2005 and 2006. To assemble the two completed documents, you will
need to copy and paste from the text, changing pages and eliminating
extraneous material. Since we are going to be working with two different
speeches, you should code one year at a time, and then combine the two
finished documents into one composite data set. In a moment, you will
see why.
Shown below is a fragment of the 2005 Address, showing the first four
paragraphs.
1. Use Global Replace to parse all paragraphs into their component sentences.
2. The following Global Replace dialog readies the text formatting of the two
speeches for Block Tag entry.
3. This results in the following text, with “2005” later replaced with “SOU-05.” A
separate Replace dialog is used for the 2006 speech.
4. Using the Table Menu, the Text is converted to the following Table:
5. Empty cells can be rapidly categorized using the appropriate tag as each
sentence is evaluated. Remember that there are two speeches to be coded.
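The parsing and tagging in Steps 1 through 3 are done here with Word's Global Replace, but the same idea can be sketched in a few lines of Python. This is an illustrative equivalent, not the original workflow: the splitting rule (sentence-ending punctuation followed by whitespace) is a simplification, and real speech text needs hand-checking for abbreviations and quotations.

```python
import re

def parse_speech(text, tag):
    # Break a paragraph into sentences, one per row, each prefixed with
    # its year tag (e.g. "SOU-05"), ready for Block Tag columns.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [(tag, s) for s in sentences if s]

fragment = "Good evening. The state of our union is strong! Thank you."
tagged = parse_speech(fragment, "SOU-05")
# tagged now holds one (year, sentence) row per sentence.
```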
While the above appears to involve a substantial amount of work in preparing the
tag tables, Steps 1 through 4 take only a few minutes. After completion of Step 5,
simply copy the categorical columns to an Excel sheet, and then import the
variables to the statistics application you have chosen. You are now ready to
assess the results of the relationships existing among the variables.
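The hand-off from the coded Word table to Excel or a statistics application is just a matter of getting the categorical columns into a common import format. The sketch below writes invented (YEAR, TYPE, SUBJECT) rows to CSV, which virtually every spreadsheet and statistics package can import; the rows themselves are made up for illustration.

```python
import csv
import io

# Invented coded rows, one per sentence, matching the column headings
# used in this paper.
coded = [
    ("SOU-05", "FACTUAL", "WAR"),
    ("SOU-06", "EMOTIONAL", "MONEY"),
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["YEAR", "TYPE", "SUBJECT"])  # header row
writer.writerows(coded)
csv_text = buffer.getvalue()
# csv_text can be saved to a .csv file and imported anywhere.
```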
Keep these considerations in mind as we explore the differences and
commonalities in the two speeches.
Emotionality – Rationality
An initial area of interest is the degree to which the style of the two speeches
stays consistent or changes. The ’05 speech reflected the President’s confidence
in his ability to bring the Iraq war, if not to an end, at least to movement in his
promised direction. He advanced a number of favorite domestic priorities which
he was confident would be accepted and achieved. By 2006 much had changed.
Domestically, his plan for Social Security privatization, which he had so glowingly
announced, was dead in the water. While an elected government was developing
on schedule in Iraq, little progress, and in fact regression, was evident with respect
to sectarian and insurgent violence. American casualties continued to spike, and
the public was showing serious resistance to the confident assertions of the
Administration.
This table displays all categories of statement types. If we restrict the categories
to “EMOTIONAL,” “RATIONAL” (F-CONCLUS + FACTUAL), and “OTHER,” the
major shift to emotionalism in the 2006 speech becomes evident.
The following points demonstrate the major shift from RATIONAL to EMOTIONAL
statements:
• EMOTIONAL statements changed from 15.8% of all of the ’05
statements to 37.6% of all ’06 Statements.
• Of all EMOTIONAL statements made in the combined speeches,
72.5% of them were made in the ’06 Speech.
When we inspect the Year, Subject, and Type categories, there is a great deal we
can discover just by trying various combinations in contingency tables. The
combination of the two tables below, comparing the proportions of EMOTIONAL
statements between those having a subject of DOMESTIC matters and those
concerning FOREIGN AFFAIRS, reveals what may have been an unexpected
trend.
Here are some of the inferences supported by a side-by-side inspection of the
two tables.
• There was a significant shift between 2005 and 2006, with the
emotional component being substantially higher in 2006 than was the case in
the earlier year.
• That there was an unexpected trend is demonstrated by the fact that
DOMESTIC statements having an emotional component outnumbered those
for FOREIGN AFFAIRS.
Sliding Granularity
Inspecting the above, we see a shift toward emotionality in the tone of the
’06 speech. While the p value of .066 is approaching an acceptable level of
significance (.05), we don’t yet know which of the subject categories accounts for
this change. Thus, we want sufficient detail to tell us which category or categories
account for this shift to emotionality.
The category “FOREIGN AFFAIRS” is quite easy to split into its respective
elements, since it is composed of only two – “WAR” and
“INTERNATIONAL.” “WAR” refers to military actions being taken in either Iraq or
Afghanistan. “INTERNATIONAL” refers to all other references to relations with
foreign countries.
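Sliding the granularity amounts to re-running the same tabulation restricted to one subject element at a time, which can be sketched in a few lines. The rows below are invented for illustration; the subject names follow the split just described.

```python
# Invented coded sentences: (YEAR, TYPE, SUBJECT) tuples.
rows = [
    ("05", "EMOTIONAL", "WAR"),
    ("05", "FACTUAL",   "INTERNATIONAL"),
    ("06", "EMOTIONAL", "WAR"),
    ("06", "EMOTIONAL", "INTERNATIONAL"),
]

def tabulate(data, subject):
    # Count YEAR x TYPE, restricted to a single SUBJECT element.
    table = {}
    for year, typ, subj in data:
        if subj == subject:
            table[(year, typ)] = table.get((year, typ), 0) + 1
    return table

# One finer-grained table per element of the split category.
war_table = tabulate(rows, "WAR")
intl_table = tabulate(rows, "INTERNATIONAL")
```

Each restricted table can then be tested separately to see which element carries the shift.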
DOMESTIC, by contrast, comprises many elements, and leaving them combined
would defeat the purpose of determining which among them was most affected by this
shift. As shown in the table below, only two (MONEY and VALUES) exceed 10%
of the combined statements for both years.
Thus we assemble contingency tables for each of the four variables, looking for the
p values of each. First, the Main Table, showing the shift toward emotional
statements. Note the p value, which clearly establishes the shift toward
EMOTIONAL in the ’06 speech.
Next we compare the change for ’06 between DOMESTIC and FOREIGN
AFFAIRS.
Interpreting the above becomes a bit complicated, but if you follow the process, it
should become clear how to correctly read the results:
• A shift in tone toward the EMOTIONAL was determined in the
comparison of ’05/’06.
• Therefore, in the table above, we search for the EMOTIONAL cell
containing the largest positive residual. This cell, as highlighted, is
the EMOTIONAL/DOMESTIC cell.
• Since the p value of .0245 < .05 (the maximum acceptable value), we
are assured that this change meets our criterion for statistical
significance.
Comparing the two tables, it is immediately evident that MONEY is the element of
DOMESTIC policy which saw the greatest increase in EMOTIONAL statements.
We have three ways of confirming the relative weight of the two elements.
• The total number of MONEY statements (148) exceeded the VALUES
statements (68) by a full 80 sentences. Clearly, MONEY statements received
more attention than did VALUES.
• The MONEY/EMOTIONAL cell has a positive residual 2.3 times
greater than that of VALUES, suggesting that the power of EMOTIONAL
statements was far greater for MONEY than for VALUES.
• Finally, the Chi-Square value for MONEY (21.54) was far greater than
that of VALUES (11.54), with the corresponding p value for MONEY lower
than that for VALUES (p ≤ .0001 vs. p = .0031).