by
Colin Naturman (Ph.D.)
Course Outline:
Session 1:
Session 2:
Normalization:
Session 3:
Navigability
Arbitrary Data Types
Keyed Relationships
Aggregation and Composition Relationships
Inheritance Relationships
Indexes on Tables
Session 4:
Querying Data:
Selecting
Joining (Cross, Inner, Left Outer, Right Outer, Full), Joins and
Normalization
Summarization of Data:
Session 5:
Range
Deviations and Variance (Deviations, Absolute Deviations, Mean
Absolute Deviation, Square Deviation, Variance, Standard
Deviation)
Session 6:
Line Charts
Bar Graphs (Horizontal, Vertical)
Histograms
Pie Charts
Scatter Plots (Regression, Correlation)
Session 7:
Quantiles:
Spreadsheets
Typically data about the same sort of item is split across many
spreadsheets, e.g. one spreadsheet for each week's data. There is
no simple way of working with the data across these separate
spreadsheets.
Relational Databases
All the fields in a single column have the same data type.
A table can store vast amounts of data. One does not work with
the data directly in the table; instead, reporting and analysis tools
are used for viewing and manipulating the data.
Databases cater for missing data values via the concept of a null
value in a field. A null is not the same thing as a zero or spaces
or even an empty string of text. Nulls are great for dealing with
missing data but require careful consideration when calculations
need to be done or data must be matched. When defining a
column one can specify whether it must always have a true
value or whether it may allow nulls, i.e. missing values.
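The behaviour of nulls can be seen with any relational database; here is a minimal sketch using Python's built-in sqlite3 module (the table and its values are hypothetical):

```python
import sqlite3

# A hypothetical client table where the budget column allows nulls.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE client (name TEXT NOT NULL, budget REAL)")
conn.executemany("INSERT INTO client VALUES (?, ?)",
                 [("Earthlink", 100.0), ("Alamo", None), ("Acme", 0.0)])

# Nulls are skipped by aggregate calculations: AVG sees only 100.0 and 0.0.
avg = conn.execute("SELECT AVG(budget) FROM client").fetchone()[0]

# A null never matches with '=', not even another null; IS NULL must be used.
null_eq = conn.execute("SELECT COUNT(*) FROM client WHERE budget = NULL").fetchone()[0]
null_is = conn.execute("SELECT COUNT(*) FROM client WHERE budget IS NULL").fetchone()[0]

print(avg, null_eq, null_is)  # 50.0 0 1
```

Note that the average is 50.0, not 33.3: the null is left out of the calculation entirely, which is exactly the "careful consideration" the text warns about.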
Glossary:
Term Definition
Alternate Key A key for a database table other than the one
designated as the primary key.
Column A set of values represented as a vertical stack. In a
database table each column has a unique name and
the values are all of the same data type. There is one
value in the column for each row.
Candidate Key A column or set of columns that qualifies as a key for a
table; one of the candidates from which the primary key
is chosen.
Compound Key A key made up of more than one column.
Field A space for storing a single value within a database
table. A field is identified by row and column.
Increment A fixed amount by which an identity field increases
with each row.
Identity An integer column in a database table whose fields
contain automatically generated numbers starting
with a seed and increasing by an increment with each
row. Typically used as a surrogate key.
Key A column or set of columns in a database table
whose values in a row uniquely identify the row.
Natural Key A key for a table that has a natural business relevant
meaning.
Null A special “value” used in a relational database to
represent the absence of a true value. A null is not
the same as a zero or spaces or even an empty text
string.
OLAP Online Analytical Processing – doing statistical
analysis and reporting using a database system
OLTP Online Transaction Processing – storing and
maintaining data on the fly using a database system
Primary Key A key for a table that has been designated as the one
that will be actively used for uniquely identifying rows
in the table.
Record A list of values recorded for a particular item of
interest. In a database table each row represents a
single record.
Row A set of values represented as a horizontal list. In a
database table, each row represents a single record
of information. The row has a field for each column of
the table.
Seed The starting value of an identity field.
Surrogate Key An artificially added key to identify rows of data
which do not have any natural key.
Table A named set of data values in a database arranged
within columns and rows. A table has a fixed set of
columns and the fields within a particular column are
all of the same type.
Type (Data Type) A specified type of data that can be stored within a
spreadsheet cell or database field, e.g. int (integers),
varchar (text strings), money (monetary amounts).
In a database table all the fields in a column have the
same type.
First Assignment
Create database tables to store the data that you currently only store
in spreadsheets. This will include:
1. Normalization
Normal Forms
A mnemonic for the above normal forms is “The key, the whole key,
and nothing but the key, so help me Codd.”
Example 1a:
Client Keyword
Earthlink internet DSL broadband
Alamo car travel
Example 1b:
Client Keyword
Earthlink internet, DSL, broadband
Alamo car, travel
A bit better but really “cheating”: each row now has a single Keyword
field, but that field simply strings together values that should be
stored separately, one value per field.
Example 1c:
Client Keyword
Earthlink internet
Earthlink DSL
Earthlink broadband
Alamo car
Alamo travel
Here Client and Keyword together form a compound primary key. This
table is in 1NF but it is not in 2NF. Client contact is information about
Client not about the combination of Client and Keyword.
Example 2b:
Client Keyword
Earthlink internet
Earthlink DSL
Earthlink broadband
Alamo car
Alamo travel
Example 3a:
Here the data is clearly 2NF as there is a single column primary key.
But email provides info about Campaign Manager not about the
primary key which is Client, so the data is not in 3NF.
Example 3b:
Again splitting into two tables removes redundancy. The data is now in
3NF.
Example 4a:
Keyword Client Ad ID
Fast Earthlink 1
Fast Alamo 2
Travel Alamo 3
Internet Earthlink 1
Here the data is in 3NF. But the Client can be uniquely determined
given only the Ad ID. Yet the Ad ID is not a candidate key, so we do
not have BCNF. (A situation that is very rare.)
Example 4b:
Keyword Ad ID
Fast 1
Fast 2
Travel 3
Internet 1
Ad ID Client
2 Alamo
3 Alamo
1 Earthlink
Once again splitting into tables removes redundancy. The data is now
in BCNF.
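The fact that no information is lost by the split in Example 4b can be checked mechanically: joining the two tables back together on Ad ID reproduces exactly the rows of the original table. A small sketch in Python, with the rows transcribed from the tables above:

```python
# Rows transcribed from the two tables of Example 4b.
keyword_ad = [("Fast", 1), ("Fast", 2), ("Travel", 3), ("Internet", 1)]
ad_client = {2: "Alamo", 3: "Alamo", 1: "Earthlink"}

# Join the split tables back together on Ad ID.
rejoined = sorted((keyword, ad_client[ad_id], ad_id)
                  for keyword, ad_id in keyword_ad)

# Rows of the original table from Example 4a.
original = sorted([("Fast", "Earthlink", 1), ("Fast", "Alamo", 2),
                   ("Travel", "Alamo", 3), ("Internet", "Earthlink", 1)])

print(rejoined == original)  # True: the split is lossless
```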
Example 5a:
Here as in the above example we have data where all columns are
needed for a key. The data is thus clearly in BCNF. But we have two
independent many-to-many relationships within the table – a
relationship between Managers and Language and a relationship
between Managers and Skills. Thus the data is not in 4NF.
Example 5b:
Manager Language
Wayne English
Duncan English
Duncan Afrikaans
Manager Skill
Wayne Marketing
Wayne Client Management
Duncan Marketing
Duncan Client Management
Yet again we remove redundancy by splitting into two tables. The data
is now in 4NF.
Example 6a:
Here the data is in 4NF. Unlike in the previous example we cannot split
the table into two without losing information.
Client Keyword
Earthlink Fast
Earthlink Cheap
Alamo Fast
Alamo Cheap
But suppose for the sake of example, there is a business rule in place
that if a search engine supports a keyword and a certain client is
advertised on a search engine and the client is described by that
keyword, then the client will be advertised on that engine with that
keyword. In that case the original table embodies two semantically
related many-to-many relationships and is thus not in 5NF.
Example 6b:
Client Keyword
Earthlink Fast
Earthlink Cheap
Alamo Fast
Alamo Cheap
Assuming the rule in 6a applies, we have now been able to split the
data into three tables without losing information. The data is in 5NF.
Example 6c:
The data in 6b is in 5NF. But suppose for the sake of example, there is
a rule that Earthlink gets all keywords on all search engines; then the
data would not be in DKNF as there is now a constraint on the data
that is causing redundancy. This redundancy can be removed by
removing rows that can be constructed from the constraint:
Client Keyword
Alamo Fast
Alamo Cheap
You do not need to remember the definitions of the normal forms, you
merely have to apply common sense when looking at how the data is
stored.
We are now ready to begin the design process. The process consists of
the following steps:
[Diagram: Group “contains many” Keyword (multiplicities 0..* and 1..*).
Group has attributes GroupID and Name; Keyword has attribute
DartSearchKeywordID : varchar.]
Recall that our aim is to have our data stored in a database optimized
for statistical analysis. To get to this OLAP database we first need to
understand the logical model of our data. The logical model will be
used to design a normalized database which will then be transformed
into our final database design.
Navigability
[Diagram: Client “markets” Product (1..* to 1..*); Client “liaises via”
Contact Person (1..*).]
While physical databases like SQL Server are limited to a few standard
data types, a logical data model can have arbitrary data types for
attributes – in fact any entity type can be treated as a data type for an
attribute.
[Diagram: Client (Client ID, Products : Product Set) “markets” Product
(1..* to 1..*); the Products attribute uses an entity type as its data
type.]
Keyed Relationships
[Diagram: Client (Client ID) “advertises via” Ad (Ad ID, Client ID),
1 Client to 1..* Ads; the relationship is keyed on Client ID.]

[Diagram: Keyword List “is a list of” Keyword (1..* to 1..*); Keyword
“is composed of” Word (1..* to 1..*).]
Inheritance Relationships
[Diagram: an inheritance relationship involving the Conversion entity.]
6. Link tables are given foreign keys for the two entities which they
relate and typically this forms a compound key for the link table.
8. Attributes with arbitrary data types that are actually entity types
instead become relationships to those entities in the physical model,
implemented using foreign keys and link tables as above.
13. We ensure that each table in the physical design has been
assigned a primary key using surrogate keys if need be.
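Step 6 above can be sketched concretely. The following uses Python's built-in sqlite3 module with hypothetical table and column names: a link table relating two entities, whose two foreign keys together form a compound primary key.

```python
import sqlite3

# Hypothetical schema: a link table for a many-to-many relationship
# between Client and Keyword, keyed on the pair of foreign keys.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Client  (ClientID  INTEGER PRIMARY KEY, ClientName  VARCHAR(50) NOT NULL);
CREATE TABLE Keyword (KeywordID INTEGER PRIMARY KEY, KeywordText VARCHAR(50) NOT NULL);
CREATE TABLE ClientKeyword (
    ClientID  INTEGER NOT NULL REFERENCES Client(ClientID),
    KeywordID INTEGER NOT NULL REFERENCES Keyword(KeywordID),
    PRIMARY KEY (ClientID, KeywordID)   -- compound key: the pair must be unique
);
""")
conn.execute("INSERT INTO Client VALUES (1, 'Earthlink')")
conn.execute("INSERT INTO Keyword VALUES (1, 'internet')")
conn.execute("INSERT INTO ClientKeyword VALUES (1, 1)")

# Inserting the same pair again violates the compound primary key.
try:
    conn.execute("INSERT INTO ClientKeyword VALUES (1, 1)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

print(duplicate_allowed)  # False
```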
3. Indexes on Tables
Complete the logical data model for your search marketing data.
For reference data around search marketing – produce a physical
database design in SQL Server. Don’t worry too much about
statistical data – that will go instead into our final OLAP database
which we will look at next session.
Where We Are:
We are on the road to getting our data into a database optimized for
statistical analysis. So far, we have looked at
1. Querying Data
The process of getting data out of tables (to view it, to do calculations
on it or to produce a report or a graph) consists of the following
processes:
Selecting
Joining
We join two sets of rows at a time: we start with two tables, and then
join the resulting set of rows with the rows of the next table.
Suppose there are two sets of rows A and B. There are several join
operations that can be applied to them:
Right Outer Join – This is the mirror image operation of the left
outer join. It produces a new set of rows made up of every row
of B combined with every row of A that matches on the chosen
columns as well as every row of B that has no matching row in A
combined with null fields for the columns that would have come
from A had there been a matching row. Thus, ignoring the
ordering of columns, the right outer join of A with B is the same
as the left outer join of B with A.
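The join operations can be sketched in plain Python; the rows below are made up for illustration, and the last function shows the mirror-image relationship between the right and left outer joins described above.

```python
# Made-up rows for two sets A and B sharing the "client" column.
A = [{"client": "Earthlink", "keyword": "internet"},
     {"client": "Alamo", "keyword": "travel"}]
B = [{"client": "Earthlink", "contact": "Pat"},
     {"client": "Acme", "contact": "Sam"}]

def inner_join(left, right, col):
    # Every row of left combined with every matching row of right.
    return [{**l, **r} for l in left for r in right if l[col] == r[col]]

def left_outer_join(left, right, col):
    rows = []
    for l in left:
        matches = [r for r in right if r[col] == l[col]]
        if matches:
            rows.extend({**l, **r} for r in matches)
        else:
            # No match: keep the left row, with null (None) fields for
            # the columns that would have come from the right side.
            nulls = {k: None for r in right for k in r if k != col}
            rows.append({**l, **nulls})
    return rows

def right_outer_join(left, right, col):
    # The mirror image of the left outer join, as described in the text.
    return left_outer_join(right, left, col)

print(inner_join(A, B, "client"))
print(left_outer_join(A, B, "client"))
```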
The first thing a statistician asks is, “what data varies and what data
remains constant?” The two key concepts in this regard are:
We have already seen this phenomenon with candidate keys and non-
key attributes – the non-key attributes have a functional dependency
on the candidate key attributes.
                Time-dependent    Time-independent
Varying                           Characteristic
Non-varying     Static            Constant
Ordinal scale – the variable has order but not necessarily distance or
relative size. These are precisely the ranked variables and this scale is
also called the ranked scale.
Interval scale – the variable has order and a metric function that
determines distances between values but does not necessarily have a
meaningful relative size. Dates and times are typical examples of
interval scale variables.
Method of Determination
Named variables are simply assigned a name. These are the same as
the nominal / categorical scale variables.
Counted variables are assigned a value by counting occurrences, e.g.
clicks.
3. Summarization of Data
Cumulative Data
For ranked data we typically look at the extreme values of the data –
the maximum and the minimum values. Frequencies are also
typically used for ranked variables.
Duties on online sales are typically not additive. Typically, these are
calculated by multiplying by factors. If we have several such factors
their cumulative amount is calculated by taking their product not their
sum. Such variables are called multiplicative.
For nominal variables we use the mode which is the most frequent
value.
geometric mean of a = \sqrt[n]{a_1 a_2 \cdots a_n}
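Multiplicative accumulation and the geometric mean can be sketched as follows; the duty factors are hypothetical:

```python
import math

# Three hypothetical duties applied in succession, each expressed as a
# multiplicative factor.
factors = [1.10, 1.05, 1.20]

# Multiplicative variables accumulate by taking their product, not their sum.
cumulative = math.prod(factors)

# Their central tendency is the geometric mean: the n-th root of the product.
geometric_mean = cumulative ** (1 / len(factors))

print(cumulative, geometric_mean)
```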
Star Schemas
Our resulting database will not be in more than 3rd normal form. When
querying data we will not have join chains consisting of more than two
tables – indeed we will only have join chains consisting of a dimension
table and a fact table.
The design of our resulting database looks like star shapes with fact
tables in the centre of the stars and dimension tables surrounding
them connected by foreign key relationships forming spokes. Such a
database design is referred to as a star schema.
[Diagram: star schema. The central fact table FactPerformance (FactID:
int, ID: int; foreign keys SKClientID, SKCampaignID, SKSearchEngineID,
SKKeyWordID, SKKeyWordNameID, SKKeyWordGroupID: int; measures
Impressions: int, Clicks: int, MediaCosts: money, AveragaPosition:
decimal(18), LogDtm: smalldatetime) is linked by foreign key
relationships (0..* to 1) to the dimension tables DMClient (ClientID:
int, ClientName: varchar(50), ClientFriendlyName: varchar(50)),
DMCampaign (CampaignNameID: int, CampaignName: varchar(50)),
DMSearchEngine (SearchEngineID: int, SearchEngineName: varchar(20)),
DMKeyWordID (KeyWordIDID: int, KeyWordID: varchar(10)), DMKeyWordName
(KeyWordNameID: int, KeyWordName: nvarchar(100)) and DMKeyWordGroup
(KeyWordGroupID: int, KeyWordGroupName: varchar(50)).]
Snowflake Schemas
Even when we do not have hierarchies that are more complex than a
dimensional hierarchy, we may want to consider keeping contextual
entities normalized in order to help maintain the consistency of the
data if the dimensions are prone to change.
Now that we have our OLAP database design we can populate it with
statistical data.
Although our original OLTP database may contain facts for individual
items or events, in our OLAP database we store facts summarized for
the smallest level of partition possible – partitions determined by
single fact table compound key values made up of single dimension
table primary key values. This smallest partition which corresponds to
single rows in the fact table is referred to as the granularity of the
fact table.
In our case the information obtained from DART Search already has
the data summarized to the granularity that we will be working with
and so we do not need to do any summarization when loading the data
into the OLAP database.
When querying data for partitions that are coarser than the granularity
we summarize the data further by taking sums (for additive data),
products (for multiplicative data) or averages or other central
tendencies for data that does not accumulate. This is known as rolling
up the data. Typically reporting tools automatically roll up the data for
us.
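Rolling up additive data can be sketched in a few lines of Python; the fact rows below are made up for illustration:

```python
from collections import defaultdict

# Hypothetical fact rows at the granularity (client, search engine):
facts = [
    ("Earthlink", "Google", 120),   # (client, search engine, clicks)
    ("Earthlink", "Yahoo",   80),
    ("Alamo",     "Google",  40),
]

# Roll up to the coarser partition "per client" by summing the additive
# fact (clicks) across search engines.
clicks_per_client = defaultdict(int)
for client, engine, clicks in facts:
    clicks_per_client[client] += clicks

print(dict(clicks_per_client))  # {'Earthlink': 200, 'Alamo': 40}
```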
Range
Typically we use the range for variables that are at least of the interval
scale of measure so that differences produce meaningful values. (We
can use the range for ranked variables but this is usually not useful:
for a full population the minimum value is 1 and the maximum is
simply the number of values, so the range is just 1 less than the
number of values. If we are looking instead at a
sample of a ranked population then things are a bit better – the range
would give the number of rank positions between the minimum and
maximum rank positions of members of the sample.)
Deviations
The range is based only on the two extreme values of the data and so
it says nothing about how the rest of the data is distributed. For this
we introduce the concept of a deviation. The deviation of a value x_i of a
variable x is how much it differs from the mean value \bar{x}:

deviation = x_i - \bar{x}

(Some books take the difference the other way around, i.e. \bar{x} - x_i; it
doesn't matter as long as you pick one way and stick to it
consistently.)
At first it might seem a good idea to take the mean of all the
deviations, but a simple proof shows that this will always come out as
zero! One way of looking at it is that some deviations are negative
and some are positive and when averaged out the negatives cancel the
positives. To avoid this we consider absolute deviations which are
simply the absolute values of the deviations i.e. the sizes of the
deviations regardless of whether they are negative or positive obtained
by making negative values positive and leaving positive values alone:
absolute deviation = |x_i - \bar{x}|

(The symbol | · | denotes the absolute value.)

mean absolute deviation = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|

square deviation = (x_i - \bar{x})^2

variance = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
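The formulas above can be transcribed directly into Python; the eight data values are made up:

```python
# Made-up data values.
x = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(x)
mean = sum(x) / n                                  # the mean, written x-bar above

deviations = [xi - mean for xi in x]               # x_i - x-bar for each value
mean_abs_deviation = sum(abs(d) for d in deviations) / n
variance = sum(d ** 2 for d in deviations) / n     # mean of the square deviations
std_deviation = variance ** 0.5                    # square root of the variance

print(sum(deviations))   # 0.0: the deviations always sum to zero
print(variance)          # 4.0
```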
Standard Deviation
The symbol \sigma (the Greek letter sigma) is typically used to denote the
standard deviation, and the variance, being equal to the square of the
standard deviation, is typically denoted by \sigma^2 (reminding us that it is
in square units) instead of having a separate symbol of its own.
The sample variance is obtained by scaling the result of applying the
variance formula by \frac{n}{n-1}:

sample variance = \frac{n}{n-1} \times (\text{result of applying the variance formula}) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
As with the population variance, this value is in the square of the units
of x and so when we need a value in the same units as x we take the
square root to obtain a quantity called the sample standard
deviation which is usually denoted by s :
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
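Python's statistics module implements both conventions, which makes the n versus n − 1 distinction easy to check; the data values are made up:

```python
import statistics

# Made-up data values.
x = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(x)

pop_var = statistics.pvariance(x)    # population variance: divides by n
samp_var = statistics.variance(x)    # sample variance: divides by n - 1
s = statistics.stdev(x)              # sample standard deviation

# The sample variance is the population variance scaled up by n / (n - 1).
print(abs(samp_var - pop_var * n / (n - 1)) < 1e-12)  # True
```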
Interquartile Range
We start off by taking the median. The median partitions our set of
fact values into two halves – values less than the median and values
greater than the median. (We leave the median itself out of either
half.)
We can then take the median of the lower half of values. The number
we obtain is called the lower quartile. Not more than a quarter of the
values are below it and not more than three quarters are above it.
Similarly we can take the median of the upper half of the values. The
number we get this time is called the upper quartile. Not more than
three quarters of the values are below it and not more than a quarter
are above it.
We thus have three values, called quartiles, dividing the set of values
into quarters: the lower quartile (also called the first quartile) the
median (also called the second quartile) and the upper quartile (also
called the third quartile).
The interquartile range is the distance between the upper and lower
quartiles:

interquartile range = Q_3 - Q_1

Compare this with the ordinary range which, writing Q_0 for the minimum
value and Q_4 for the maximum, is given by:

range = Q_4 - Q_0

Semi-Interquartile Range:

The semi-interquartile range is the average of the distances from the
median to the two surrounding quartiles, \frac{(Q_3 - Q_2) + (Q_2 - Q_1)}{2}.
Simplifying out, the median Q_2 cancels and we are left with:

semi-interquartile range = \frac{Q_3 - Q_1}{2}
In other words the semi-interquartile range is simply half the
interquartile range, hence its name.
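The median-splitting method described above can be sketched directly; the seven data values are made up:

```python
import statistics

def quartiles(data):
    # Split the ordered data at the median, leaving the median itself
    # out of either half when the count is odd, then take the median
    # of each half, as described in the text.
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return statistics.median(lower), statistics.median(xs), statistics.median(upper)

q1, q2, q3 = quartiles([1, 3, 5, 7, 9, 11, 13])   # made-up values

print(q1, q2, q3)      # 3 7 11
print(q3 - q1)         # interquartile range: 8
print((q3 - q1) / 2)   # semi-interquartile range: 4.0
```

Note that other conventions for computing quartiles exist (for example the rank-interpolation rule used for general quantiles later in these notes); they can give slightly different values on small datasets.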
Line Charts
The most straightforward graph used in stats is the line chart (also
called function graph or line plot). This is used to display the
dependency of a quantitative dependent variable on a quantitative
independent variable, typically time. It is used when we know that
there is indeed a functional dependency between the variables. The
graph helps us determine the exact nature of the dependency; in
particular it helps us see if there is a simple formula relating the
variables.
Bar Graphs
Bar graphs show the relative sizes of fact values for different distinct
dimension values by means of bars on an XY plane. There is one bar
per dimension value and the height of the bar represents the fact
value. The bars all have the same width but the width (and hence
area) of the bar does not represent data. Typically the bars are vertical
with their bases on the X axis which represents the dimension values
(vertical bar graph). Alternatively one can also draw the bars
horizontally with their bases on the Y axis representing the dimension
values (horizontal bar graph).
Histograms
In a histogram the area of each bar represents the data value,
typically a frequency. Sometimes the bars all have the same width, in
which case their area is proportional to their height and the histogram
is effectively a special type of bar graph. However one can have
histograms with bars of different widths, in which case one must look
at the area of the bar, not merely its height.
Pie Charts
Scatter Plots
We use the X axis for one fact and the Y for the other and plot a point
for every combination of values for the two facts. If there is indeed a
dependency between the two values the points will form a line or
curve whose formula we can then determine from the plot.
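The strength of such a dependency is commonly measured by the Pearson correlation coefficient; a sketch with made-up paired values that lie exactly on a rising line:

```python
# Made-up paired fact values; here ys is exactly 2 * xs, so the points
# of the scatter plot lie on a straight rising line.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Covariance and standard deviations of the two facts.
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
std_x = (sum((x - mean_x) ** 2 for x in xs) / n) ** 0.5
std_y = (sum((y - mean_y) ** 2 for y in ys) / n) ** 0.5

# Pearson correlation: +1 for a perfect rising line, -1 for a falling one.
correlation = cov / (std_x * std_y)
print(round(correlation, 9))  # 1.0
```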
1. Advanced Graphing
For normal vertical graphs the X-axis variable used in each individual
graph on the plane is referred to as a category and the additional
variable is referred to as a series. (For horizontal graphs such as a
horizontal bar graph the category is the Y-axis variable instead).
This technique is often used with bar graphs in which case for each
value of the category there is a group of bars. Within each group there
is a bar for each series value.
Stacked Bar Graphs
One can also divide the bars up so that each successive division
includes those below it – the lowest division representing data for the
finest partition according to the additional dimension and each
successively higher division showing a coarser partition. For example,
the smallest division might show impressions that led to conversions,
the next division would show impressions that led to click throughs
(which includes those that led to conversions) and the whole bar might
show all impressions.
One can show different nested doughnuts for different category values.
One can also show segments or sectors of pie charts and doughnut
charts detached for emphasis.
Boxes and Whiskers
2. Draw a box bounded at the bottom by the lower quartile and at the
top by the upper quartile. The box can be any width. The height will
be the interquartile range.
The position of the median line in the box indicates how the data is
skewed i.e. if it is distributed more or less evenly about the median or
concentrated more on one side. If the lower quartile is further from the
median than the upper quartile, the data is said to be negatively
skewed, if the upper quartile is further from the median than the
lower quartile, the data is said to be positively skewed.
In cases where data is sampled for different time intervals one can plot
successive box plots on the same axis and link the medians or means
with line segments to indicate trend.
To compile a stem and leaf plot numbers in the dataset are split into
units (called the leaves) and either tens, hundreds or thousands etc
(called stems) depending on the typical size of the numbers. The
numbers are arranged in order. The stems are listed once in a column
on the left. For each number, its leaf (units portion) is listed as an
entry in the row headed on the left by the stem of the number. For
example if we divide our numbers into units and tens, the number 562
would consist of an entry of 2 (representing 2 units) in the row headed
on the left by 56 (representing 56 tens).
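Compiling such a plot can be sketched in a few lines; the dataset is made up, but includes 562 so that the worked example above appears in the output:

```python
from collections import defaultdict

# Made-up dataset, including 562 from the worked example.
data = sorted([562, 47, 44, 53, 41, 562, 58])

plot = defaultdict(list)
for number in data:
    stem, leaf = divmod(number, 10)   # tens on the left, units as the leaf
    plot[stem].append(leaf)

for stem in sorted(plot):
    print(stem, "|", *plot[stem])
# 4 | 1 4 7
# 5 | 3 8
# 56 | 2 2
```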
Recall how ranked data is divided into two halves by the median and
that these portions can be further divided in two by quartiles. If we
want an even finer view of how the data is distributed we can repeat
the process of dividing in two a third time to produce the quantities
called octiles which are thus numbers dividing the data into eighths.
The special names for 𝑛-quantiles for various values of 𝑛 are given in
the table below:
n Name of n-quantile
2 median
3 tertile
4 quartile
5 quintile
6 sextile
7 septile
8 octile
9 nonile
10 decile
20 duo-decile
100 percentile
Arrange the data in order. Then the 𝑘th 𝑛-quantile is the number
whose rank (position in the ordered list) is (𝑘/𝑛) · (𝑁 + 1) where 𝑁 is the
number of values in the list. Now this is fine if the latter is a whole
number, we just pick the number in the list at that position. If however
this calculation produces a fraction, we pick the two whole numbers on
either side of the fraction, pick out the values in the list at those
positions and then calculate the value lying between these values at a
distance between them that is in proportion to the distance that the
number (𝑘/𝑛) · (𝑁 + 1) lies between the two whole numbers on either side
of it. (This is known as taking a linear interpolation.) The process is
best understood with an example:
Suppose we have 10 values:
Then the second tertile is the value of rank (2/3) · (10 + 1) = 7⅓. Now this
is a fraction lying between 7 and 8. So we pick out the 7th and 8th values
in the list: 56 and 77. Now 7⅓ lies one third of the way from 7 to 8 and
so our desired value is the number that lies a third of the way from 56
to 77, that is 56 + (1/3) · (77 − 56) = 63.
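The rank-and-interpolate rule can be sketched as a small function. The ten values below are made up so that the 7th and 8th are 56 and 77, matching the worked example:

```python
def n_quantile(values, k, n):
    # The k-th n-quantile via rank (k/n)·(N + 1), with linear
    # interpolation when the rank is fractional.
    xs = sorted(values)
    rank = k / n * (len(xs) + 1)   # 1-based rank in the ordered list
    low = int(rank)                # the whole number at or just below the rank
    frac = rank - low
    if frac == 0:
        return xs[low - 1]
    # Interpolate between the values on either side of the fractional rank.
    return xs[low - 1] + frac * (xs[low] - xs[low - 1])

# Made-up values whose 7th and 8th entries are 56 and 77.
values = [10, 12, 20, 31, 40, 55, 56, 77, 80, 90]

second_tertile = n_quantile(values, 2, 3)
print(round(second_tertile, 9))  # 63.0, as in the worked example
```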
Glossary:
Term Definition
Max – Min Plot A graph showing maximum and minimum values for
samples using max and min bars.
Negatively Skewed Having a larger distance between lower quartile and
median than between upper quartile and median.
Noniles The quantiles that divide the data into ninths.