
Statistics for Search Marketing

by
Colin Naturman (Ph.D.)

Course Outline:

Session 1:

Differences Between Databases and Spreadsheets:

 Spreadsheets (Rows, Columns, Cells, Records)


 Relational Databases (Tables, Rows, Columns, Fields, Keys –
Primary, Candidate, Natural, Surrogate, Compound, OLTP vs
OLAP)

Session 2:

Normalization:

 What Normalization is About (Normalization - Insert, Update, Delete vs Denormalization – Query)
 Normal Forms (1NF, 2NF, 3NF, BCNF, 4NF, 5NF, DKNF)

Designing our Database:

 Steps in Designing a Database


 The Logical Data Model – Entities and Relationships

Session 3:

Refining the Logical Data Model:

 Navigability
 Arbitrary Data Types
 Keyed Relationships
 Aggregation and Composition Relationships
 Inheritance Relationships

Transforming a Logical Model to a Physical Design:

Indexes on Tables
Session 4:

Querying Data:

 Selecting
 Joining (Cross, Inner, Left Outer, Right Outer, Full), Joins and Normalization

Applying a Statistical Point of View to Data:

 Variables and Constants


 Functional Dependencies Between Variables
 Dependency on Time and “Space” (Time-Dependent vs
Independent, Static vs Non-Static)
 Qualitative vs Quantitative Data (Nominal vs Ranked, Discrete vs
Continuous)
 Scale of Measurement – Comparative Data (Categorical, Ordinal,
Interval, Ratio)
 Method of Determination (Named, Counted, Measured)

Summarization of Data:

 Population and Samples


 Cumulative Data (Frequencies, Extreme Values - Maximum,
Minimum, Additive vs Multiplicative - Sum, Product)
 Central Tendencies of Data (Mode, Median, Arithmetic Mean,
Geometric Mean)

Session 5:

Online Analytical Processing:

 Facts and Dimensions (Facts/Measures, Dimensions, Contexts)


 Dimensional Hierarchies (Dimension Levels)
 Star Schemas
 Snowflake Schemas
 Granularity and Summarization

Measuring Data Dispersion:

 Range
 Deviations and Variance (Deviations, Absolute Deviations, Mean
Absolute Deviation, Square Deviation, Variance, Standard
Deviation)
Session 6:

Measuring Data Dispersion contd:

 Sample Variance vs Population Variance (Sample Variance, Sample Standard Deviation, Estimators)
 Interquartile Range and Semi-Interquartile Range (Quartiles,
Interquartile Range, Semi-Interquartile Range)

Graphs for Displaying and Analyzing Data:

 Line Charts
 Bar Graphs (Horizontal, Vertical)
 Histograms
 Pie Charts
 Scatter Plots (Regression, Correlation)

Session 7:

Graphs for Displaying and Analyzing Data contd – Advanced Graphing:

 Categories and Series


 Stacked Bar Graphs (Non-Cumulative, Cumulative)
 Variations on Pie Charts (Doughnut Charts, Nested, Detached)
 Boxes and Whiskers (Box Plot and Five Number Summary,
Outliers – Extreme vs Mild, Negative vs Positive Skewing, Error
Bars, Max-Min Plot)
 Stem and Leaf Plots

Quantiles:

 N-Quantiles (Percentiles etc.)


 Fractional Rank and Linear Interpolation
Session 1 Summary

1. Differences Between Databases and Spreadsheets

Spreadsheets

 A spreadsheet has arbitrarily many columns labeled with an alphabetic scheme and arbitrarily many rows labeled with numbers. Within this system data is stored in an arbitrary manner.

 Cells can be formatted to contain only a certain fixed data type, but the choice of type need not be the same for all cells in a column. Typically the first row of cells is used for text names for the columns.

 Data is worked with directly in the sheet and can be moved around ad hoc.

 Any arbitrary selection of cells may be used to represent the different values associated with a single item of interest.

 Spreadsheets are limited in size. Only “document” sized datasets can be worked with.

 Typically data about the same sort of item is split across many spreadsheets, e.g. one spreadsheet for each week's data. There is no simple way of working with the data across these separate spreadsheets.

Relational Databases

 In a relational database like SQL Server, data is stored in a set of tables.

 Each table has a carefully thought out pre-defined scheme with a fixed number of columns, each given a unique name. The name of a column is not part of the data in the table but is used to identify the column.

 Each row represents a single record of information for an item of interest. Such a record has a fixed structure – there is precisely one field in the record for each column of the table.

 All the fields in a single column have the same data type.

 A table can store vast amounts of data. One does not work with
the data directly in the table, instead reporting and analysis tools
are used for viewing and manipulating the data.

 Although rows are naturally numbered by the order in which they physically reside in the database, this physical order is not normally used to identify rows. Instead the fields in a single column or in a set of columns are used as a key for uniquely identifying rows. If several columns are used we have a compound key.

 A table may have several possible keys which can be used; in this case we refer to them as candidate keys. Typically one candidate key is chosen to be the primary key that will always be used. Once a primary key has been chosen the other candidate keys are referred to as alternate keys.

 Sometimes a table has an obvious set of business relevant columns that can be used as a primary key – this is called a natural key. In other cases there are no keys at all or the candidate keys are unwieldy. In this case it is best to make an artificial column known as a surrogate key, usually a generated number or code chosen to uniquely identify the rows.

 The typical way of creating a surrogate key is to make an identity column. An identity is an integer that starts with a specified seed value for the first row and then increases by a specified increment for each subsequent row (see the sketch after this list).

 Databases cater for missing data values via the concept of a null
value in a field. A null is not the same thing as a zero or spaces
or even an empty string of text. Nulls are great for dealing with
missing data but require careful consideration when calculations
need to be done or data must be matched. When defining a
column one can specify whether it must always have a true
value or allow nulls i.e. missing values.
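
As a concrete illustration of the points above, here is a minimal sketch of a table definition in SQL Server. The table name, column names and sizes are hypothetical, chosen only to show an identity surrogate key, a primary key and nullability.

-- Hypothetical keyword table: KeywordID is a surrogate key generated
-- as an identity with seed 1 and increment 1.
CREATE TABLE Keyword
(
    KeywordID   int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    ClientName  varchar(100) NOT NULL,                   -- must always have a value
    KeywordText varchar(200) NOT NULL,
    Notes       varchar(500) NULL                        -- nulls allowed: value may be missing
);

Here (ClientName, KeywordText) could also serve as a natural compound candidate key; designating the identity column as the primary key simply makes rows easier to reference.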

Two Important Ways of Optimizing Database Table Structure

Database Type: OLAP (Online Analytical Processing) Database
Usage: Optimized for doing statistical analysis for management decision support. This is the kind of database we will mainly work with to apply statistical techniques to Search Marketing.

Database Type: OLTP (Online Transaction Processing) Database
Usage: Optimized for inserting, deleting and updating records on the fly – this is the sort of database Google would use as they capture clicks and conversions on the fly. It is also the type of database structure that we will use for storing basic reference data around clients and campaigns. Data in an OLAP database is always derived by rearranging and aggregating data from an OLTP database.

The two types of optimization are generally in conflict – one needs different database structures for different purposes.

Why Do We Need This?

 Spreadsheets do not have the capacity to do calculations on very large datasets. Database tables have the capacity for the large amounts of data we will be analyzing.

 Spreadsheets do not provide a way to relate different sets of data. With database tables we can use key values to relate data; more on this next session.
 To do statistical calculations using analysis tools we need data listed in columns with clearly and correctly defined data types; it is no good to have text and numeric values in the same column. The records must also be uniquely identifiable by a primary key in order to allow tools to walk through records while doing calculations.

 Databases allow indexing on columns to facilitate selecting and filtering data for analysis. More on indexes later.

 We need to understand that breaking up data into tables has to be optimized in different ways for different purposes. Reference data on campaigns and clients should not contain unnecessary repetition of data that can lead to inconsistencies. But data for doing statistical analysis needs repetition to avoid tools continually having to look up the same values in related tables while processing data in one table – OLTP structure vs OLAP structure.

Glossary:

Term Definition

Alternate Key A key for a database table other than the one
designated as the primary key.
Column A set of values represented as a vertical stack. In a
database table each column has a unique name and
the values are all of the same data type. There is one
value in the column for each row.
Candidate Key One of several keys for a table.
Compound Key A key made up of more than one column
Field A space for storing a single value within a database
table. A field is identified by row and column.
Increment A fixed amount by which an identity field increases
with each row.
Identity An integer column in a database table whose fields
contain automatically generated numbers starting
with a seed and increasing by an increment with each
row. Typically used as a surrogate key.
Key A column or set of columns in a database table
whose values in a row uniquely identify the row.
Natural Key A key for a table that has a natural business relevant
meaning.
Null A special “value” used in a relational database to
represent the absence of a true value. A null is not
the same as a zero or spacing or even an empty text
string.
OLAP Online Analytical Processing – doing statistical
analysis and reporting using a database system
OLTP Online Transaction Processing – storing and
maintaining data on the fly using a database system
Primary Key A key for a table that has been designated as the one
that will be actively used for uniquely identifying rows
in the table.
Record A list of values recorded for a particular item of
interest. In a database table each row represents a
single record.
Row A set of values represented as a horizontal list. In a
database table, each row represents a single record
of information. The row has a field for each column of
the table.
Seed The starting value of an identity field.
Surrogate Key An artificially added key to identify rows of data
which do not have any natural key.
Table A named set of data values in a database arranged
within columns and rows. A table has a fixed set of
columns and the fields within a particular column are
all of the same type.
Type (Data Type) A specified type of data that can be stored within a
spreadsheet cell or database field, e.g. int (integers),
varchar (text strings), money (monetary amounts).
In a database table all the fields in a column have the
same type.

Typical Data Types That We Will Use

Data Type: varchar (variable length character data)
Usage: Strings of text such as names, descriptions, keywords. When making a varchar column in a table one must specify the maximum number of characters.

Data Type: int (integer)
Usage: Counts (e.g. impressions, clicks), ranking (e.g. position), identities (unique numbering of rows of data).

Data Type: money (money amount)
Usage: Monetary values correctly rounded (e.g. average cost per click, total cost).

Data Type: decimal (fixed precision decimal)
Usage: Fractional amounts where calculations must be correctly rounded to a fixed number of decimals, typically ratios and percentages (e.g. conversion rate). When making a decimal column in a table one must specify the total number of digits and the number of decimal places.

Data Type: datetime (date and time)
Usage: Dates of days, times of day or a specific time on a date.
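
The following is an illustrative sketch (not a prescribed design) of how these types might be combined in a SQL Server table for daily keyword statistics; the table and column names are hypothetical.

-- Hypothetical daily statistics table using the types above.
CREATE TABLE KeywordDailyStats
(
    StatID         int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    KeywordText    varchar(200) NOT NULL,
    StatDate       datetime NOT NULL,        -- the day the figures apply to
    Impressions    int NOT NULL,
    Clicks         int NOT NULL,
    TotalCost      money NOT NULL,
    ConversionRate decimal(5,4) NULL         -- e.g. 0.0375 = 3.75%, null if unknown
);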

First Assignment

Create database tables to store the data that you currently only store
in spreadsheets. This will include:

1. Defining the columns – the header rows you currently use in spreadsheets are the things to look at.
2. Assigning types to columns – look at what type of data is in the column: is it a name, a whole number, a monetary amount, a number involving fractions, or a date or time?
3. Importing the data from the spreadsheets into the tables. Don't try to type it in – get a techie to show you how. (An easy way is to use Access as a gateway between SQL Server and Excel.)
4. Define a primary key for the table. You may have to use a
surrogate key.
Session 2 Summary

1. Normalization

What Normalization Is About

 Normalization refers to the process of removing redundancy and repetition from data. The reverse process is called denormalization.

 Normalization minimizes the possibility of inconsistencies in data. Thus reference data is typically stored in a normalized structure.

 Normalization minimizes space taken up by data.

 Normalization increases the speed of performing inserts (adding new rows), updates (changing data in specified rows) and deletes (removing rows), while maintaining data consistency. Thus databases intended for maintaining data on the fly (OLTP) are normalized.

 But, normalization increases the time it takes to query related data that is spread over several tables. Thus databases intended for statistical analysis and reporting (OLAP) should have limited normalization.

So for our reference data around campaigns we will look at normalizing data, but for the data mart of search data reports that we will be using for statistical analysis we will be denormalizing data.

Normal Forms

The level of normalization is described by the standard normal forms that a dataset can be in.
1. First Normal Form (1NF): The data is arranged in tables with
well-defined columns and there is a primary key specified for
uniquely identifying rows.

2. Second Normal Form (2NF): The data is in first normal form and in addition, if the primary key is a compound key, every non-key column has data that applies to the whole key not merely part of the key.

3. Third Normal Form (3NF): The data is in second normal form and in addition, for every column that is not part of the primary key, the values in the column are not uniquely determined by a set of columns that are not part of the primary key.

a. A stronger version called Boyce-Codd Normal Form (BCNF): The data is in third normal form and for every column, its values are only uniquely determined by the candidate keys for the table, never by a set of columns that is not a candidate key.

A mnemonic for the above normal forms is “The key, the whole key,
and nothing but the key, so help me Codd.”

4. Fourth Normal Form (4NF): The data is in Boyce-Codd normal form and in addition tables do not embody more than one independent many-to-many relationship.

5. Fifth Normal Form (5NF): The data is in fourth normal form and in addition tables cannot be split into narrower tables by applying known constraints on the data.

6. Domain Key Normal Form (DKNF): The data is in fifth normal form and in addition there are no constraints on the data other than specification of primary keys and allowed values for columns.

OLTP databases typically store data in fourth normal form or higher, but in an OLAP database we work only with tables in second and third normal form. The transition from third to fourth normal form typically results in a big performance penalty when querying data across related tables.
Examples:

Example 1a:

Client Keyword
Earthlink internet DSL broadband
Alamo car travel

Not in 1NF because not tabular – contains repeating groups i.e. more than one field per column. This won't even fit in a database system like SQL Server.

Example 1b:

Client Keyword
Earthlink internet, DSL, broadband
Alamo car, travel

A bit better but really “cheating”, there is one field per column but it’s
simply stringing together values that should be in separate columns.

Example 1c:

Client Keyword
Earthlink internet
Earthlink DSL
Earthlink broadband
Alamo car
Alamo travel

We now have a proper table. Notice that the Client and Keyword combination uniquely identifies the rows. We can designate this compound key as our primary key and we will then be in 1NF.
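
In SQL Server terms (a hypothetical sketch using made-up table and column sizes), designating the compound key from Example 1c as the primary key might look like this:

-- The combination of client and keyword uniquely identifies each row (1NF).
CREATE TABLE ClientKeyword
(
    Client  varchar(100) NOT NULL,
    Keyword varchar(200) NOT NULL,
    CONSTRAINT PK_ClientKeyword PRIMARY KEY (Client, Keyword)  -- compound primary key
);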
Example 2a:

Client Keyword Client Contact


Earthlink internet John
Earthlink DSL John
Earthlink broadband John
Alamo car Jane
Alamo travel Jane

Here Client and Keyword together form a compound primary key. This
table is in 1NF but it is not in 2NF. Client contact is information about
Client not about the combination of Client and Keyword.

Example 2b:

Client Client Contact


Earthlink John
Alamo Jane

Client Keyword
Earthlink internet
Earthlink DSL
Earthlink broadband
Alamo car
Alamo travel

We have removed redundancy by having two tables. The data is now in 2NF.
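
A hypothetical sketch of how the split in Example 2b could be implemented, extending the ClientKeyword sketch from Example 1c. The foreign key constraint (covered formally in Session 3) ties the keyword table back to the new client table; all names are illustrative.

-- Client details move to their own table.
CREATE TABLE Client
(
    Client        varchar(100) NOT NULL PRIMARY KEY,
    ClientContact varchar(100) NULL
);

-- Each keyword row now references the client it belongs to.
ALTER TABLE ClientKeyword
    ADD CONSTRAINT FK_ClientKeyword_Client
    FOREIGN KEY (Client) REFERENCES Client (Client);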

Example 3a:

Client Campaign Manager Email


Earthlink Wayne wayne@acceleration.biz
Alamo Wayne wayne@acceleration.biz

Here the data is clearly in 2NF as there is a single-column primary key. But Email provides information about Campaign Manager, not about the primary key, which is Client, so the data is not in 3NF.
Example 3b:

Client Campaign Manager


Earthlink Wayne
Alamo Wayne

Campaign Manager Email


Wayne wayne@acceleration.biz

Again splitting into two tables removes redundancy. The data is now in
3NF.

Example 4a:

Keyword Client Ad ID
Fast Earthlink 1
Fast Alamo 2
Travel Alamo 3
Internet Earthlink 1

Here the data is in 3NF. But the Client can be uniquely determined
given only the Ad ID. Yet the Ad ID is not a candidate key, so we do
not have BCNF. (A situation that is very rare.)

Example 4b:

Keyword Ad ID
Fast 1
Fast 2
Travel 3
Internet 1

Ad ID Client
2 Alamo
3 Alamo
1 Earthlink

Once again splitting into tables removes redundancy. The data is now
in BCNF.
Example 5a:

Manager Language Skill


Wayne English Marketing
Wayne English Client Management
Duncan English Marketing
Duncan Afrikaans Marketing
Duncan English Client Management
Duncan Afrikaans Client Management

Here as in the above example we have data where all columns are
needed for a key. The data is thus clearly in BCNF. But we have two
independent many-to-many relationships within the table – a
relationship between Managers and Language and a relationship
between Managers and Skills. Thus the data is not in 4NF.

Example 5b:

Manager Language
Wayne English
Duncan English
Duncan Afrikaans

Manager Skill
Wayne Marketing
Wayne Client Management
Duncan Marketing
Duncan Client Management

Yet again we remove redundancy by splitting into two tables. The data
is now in 4NF.
Example 6a:

Client Search Engine Keyword


Earthlink Google Fast
Earthlink Google Cheap
Alamo Google Fast
Alamo Google Cheap
Earthlink Yahoo Fast

Here the data is in 4NF. Unlike in the previous example we cannot split
the table into two without losing information.

Client Search Engine


Earthlink Google
Alamo Google
Earthlink Yahoo

Client Keyword
Earthlink Fast
Earthlink Cheap
Alamo Fast
Alamo Cheap

Information lost! We do not have enough information to recreate the earlier three-column table. Although we can make a three-column table from the two tables as in Example 5b, that table wouldn't provide the same relationship between the concepts that our original three-column table provides – it's a different relationship with more rows!

But suppose for the sake of example, there is a business rule in place
that if a search engine supports a keyword and a certain client is
advertised on a search engine and the client is described by that
keyword, then the client will be advertised on that engine with that
keyword. In that case the original table embodies two semantically
related many-to-many relationships and is thus not in 5NF.
Example 6b:

Client Search Engine


Earthlink Google
Alamo Google
Earthlink Yahoo

Client Keyword
Earthlink Fast
Earthlink Cheap
Alamo Fast
Alamo Cheap

Search Engine Keyword


Google Fast
Google Cheap
Yahoo Fast

Assuming the rule in 6a applies, we have now been able to split the
data into three tables without losing information. The data is in 5NF.

Example 6c:

The data in 6b is in 5NF. But suppose for the sake of example, there is
a rule that Earthlink gets all keywords on all search engines; then the
data would not be in DKNF as there is now a constraint on the data
that is causing redundancy. This redundancy can be removed by
removing rows that can be constructed from the constraint:

Client Search Engine


Alamo Google

Client Keyword
Alamo Fast
Alamo Cheap

Search Engine Keyword


Google Fast
Google Cheap
Yahoo Fast

Given the above rule, the data is now in DKNF.


Why Do We Need This? What Do I Need To Remember?

We need to understand that different database schemas have different levels of redundancy. Redundancy can be decreased by splitting into smaller tables or removing superfluous information.

Decreasing redundancy is good for building OLTP databases that store reference data and on the fly events, but for OLAP we would prefer to increase redundancy to speed up querying of related data.

You do not need to remember the definitions of the normal forms; you merely have to apply common sense when looking at how the data is stored.

2. Designing Our Database

Steps in Designing a Database – Where Are We Going?

We are now ready to begin the design process. The process consists of
the following steps:

1. Determine a Logical Data Model.
2. Use the logical model to create a physical OLTP database schema.
3. From the OLTP schema, determine an OLAP schema.

The Logical Data Model – Entities and Relationships

A logical data model is not a database on a computer but a schematic defining what entities we are interested in and what relationships exist between these entities.

Entities are simply concepts that we are interested in recording data about. Relationships explain how two entities are related to each other.

Entities have attributes which are data values describing instances of an entity. A relationship may also have attributes directly associated with it. Attributes correspond to columns in a physical database.
[Diagram: a logical data model in which a Group entity (attributes GroupID, Name) “contains many” Keywords, with cardinalities 0..* and 1..*, and a link table named Group Keyword carrying the attribute DartSearchKeywordID : varchar.]

 Entities are represented by rectangles called logical tables.

 Each logical table contains a compartment at the top with the name of the entity.

 Attributes are listed in a compartment below the name.

 Each attribute has a name.

 Attributes can also be given data types. The type of an attribute is indicated after the name, separated from the name by a colon, e.g.

DartSearchKeywordID : varchar

 Relationships between two entities are indicated by a solid line joining the entities.

 Each end of the line is labeled with cardinality information:

1      One instance of the entity
1..*   One or more instances
0..*   Zero or more instances
0..1   Zero or one instance
n      Exactly n instances, where n represents some fixed number, e.g. 4
0..n   Zero to exactly n instances, e.g. 0..4
1..n   One to exactly n instances, e.g. 1..6

 A relationship line is usually labeled with a clause describing the relationship, with a solid triangular arrowhead showing the direction the description applies, e.g. in the diagram above the relationship description would be read as “zero or more Groups contain one or more Keywords”.

 If a relationship has attributes it is also represented by a logical table called a link table.

 A link table is shown connected to the relationship it represents by a dotted line.

 A link table is typically named by combining the names of the two related entities, in the example above “Group Keyword”.

Glossary:

Term Definition

Attribute A property of an entity or relationship. Corresponds to a column in a physical database.
Cardinality A description of how many entity instances are
involved in a relationship.
Data mart An OLAP database containing historical data for a predetermined decision support goal.
Delete Remove a record from a database.
Denormalization Combining tables and repeating information to avoid
cross referencing and lookups when querying data.
Denormalization increases redundancy but improves
the speed we can query and process related data.
Entity Any concept that we are interested in recording data
about (e.g. Client, Keyword, Campaign).
Insert Add a record to a database.
Link Table A table that describes a relationship. (Used both for
logical tables as well as tables in a physical
database.)
Logical Data A specification of what data we are interested in and
Model relationships within the data regardless of how we
choose to represent that data in a physical database
system such as SQL Server.
Logical Table A named set of attributes representing an entity or
relationship in a Logical Data Model.
Normal Form One of the standard sets of criteria describing the
level of normalization of a dataset.
Normalization Removing redundancy from a dataset by splitting
tables and considering constraints. Normalization
decreases the space data takes up, helps enforce
consistency and increases the speed with which
related data can be changed on the fly.
Physical A database implemented on a computer using
Database database software such as SQL Server.
Query Extract selected data from a database.
Relationship A specification of how two entities are related to each
other.
Repeating A set of values for a single attribute/column
Group contained in one record. Such data is not even in first
normal form and cannot be stored as such in a
relational database.
Schema The definitions of table structures and relationships
between tables in a relational database.
Update Change the data in a record in a database.
Session 3 Summary

1. Refining the Logical Data Model

Recall that our aim is to have our data stored in a database optimized
for statistical analysis. To get to this OLAP database we first need to
understand the logical model of our data. The logical model will be
used to design a normalized database which will then be transformed
into our final database design.

So far, we have looked at analyzing the data in terms of entities and


relationships to begin the logical model. Before we can turn this logical
model into a database design, we need to refine it further.

Navigability

Some relationships between entities involve an inherent notion of


navigability i.e. one or both entities in the relationship “knows” about
the other entity. For example, a Client inherently knows what Products
it advertises on the internet, but the concept of Product does not
typically have any intrinsic knowledge of which clients provide that
Product.

Navigability of a relationship is depicted by an open arrowhead on the


relationship line pointing from the entity that has the knowledge to the
entity that is known. If both entities in the relationship know each
other, then both ends of the line have an arrowhead and the
relationship is called bi-directional.

The navigability arrowheads need not point in the same direction as


the solid arrowhead indicating the direction in which a relationship’s
name is read.
[Diagram: a logical data model in which Client “markets” Product (1..* to 1..*) and Client “liaises via” Contact Person (1..*).]

Navigability needs to be considered as it has an impact on the design


of the physical database.

Arbitrary Data Types

While physical databases like SQL Server are limited to a few standard
data types, a logical data model can have arbitrary data types for
attributes – in fact any entity type can be treated as a data type for an
attribute.

If a relationship is navigable from one entity to another it is considered


equivalent to the first entity having an attribute whose type is the
second entity or a collection of the second entity (depending on
cardinality).

Considering arbitrary data types helps you uncover relationships


between entities without being constrained to thinking in terms of
what your physical database (like SQL Server) can provide.
[Diagram: the Client entity with attributes Client ID and Products : Product Set, related to Product by a “markets” relationship (1..* to 1..*).]

Keyed Relationships

Sometimes a logical table representing an entity can have an attribute


or set of attributes which forms a key value for a logical table
representing another entity. This implies a relationship to the other
entity with instances of the related entity being identified by these
foreign key values. This is another way in which one entity can
“know” about another. Such keyed relationships are indicated by
showing the key attributes in a box on the relationship line.

[Diagram: a keyed “advertises via” relationship between Client (Client ID) and Ad (Ad ID, Client ID), with Client ID shown in a box on the relationship line; cardinality 1..* to 1.]

Aggregation and Composition Relationships

Sometimes an entity represents a collection of instances of another


entity. (E.g. a Keyword List is a collection of Keywords.) In this case
there is an implicit to-many relationship called an aggregation
relationship. This can be shown in the logical data model by drawing
the relationship with a diamond on the end of the line at the logical
table representing the collection.

A special type of aggregation relationship called composition occurs


when the collection is structurally composed of its members, for
example, the keyword phrase “fast internet DSL” is structurally
composed of the single words “fast”, “internet” and “DSL” and is not
merely a collection of these words. Composition can be indicated by a
solid diamond.

[Diagram: Keyword List “is a list of” Keyword (aggregation, 1..* to 1..*), and Keyword “is composed of” Word (composition, 1..* to 1..*).]

Inheritance Relationships

Sometimes one entity is a specialized form of another more general entity, e.g. an Online Sale is a specialized form of Conversion. This is known as an inheritance relationship and is indicated in a logical data model by an arrow from the logical table representing the specialized entity to the table representing the general entity, with a closed, hollow arrowhead.
[Diagram: an inheritance arrow from Online Sale to Conversion.]

If an entity has several specializations, one often shows the arrows combined into a tree with a common arrowhead.

[Diagram: Online Sale, Registration and Reservation shown as specializations of Conversion, with their inheritance arrows combined into a tree.]

2. Transforming a Logical Model to a Physical Design

Once we have completed the analysis of our entities and relationships, we can produce a physical database design – a design of a database that can be implemented in a system like SQL Server. This is achieved in the following manner:

1. Each logical table in the logical model becomes a table in the


physical model. In particular every entity will have a physical table
and every relationship that had a link table will have a physical
table.
2. Wherever there is a many-to-many relationship, make a link table
in the physical database representing the relationship even if there
was no link table in the logical model. The many-to-many
relationship will be treated as two many-to-1 relationships from the
Link table to the related entities.

3. Attributes with ordinary data types become columns in the tables.

4. Whenever there is a navigable to-1 relationship in the logical model


from an entity A to an entity B, this is turned into a keyed
relationship in the physical design with A having a column (or
columns) which form a foreign key referencing the primary key of
B. This is done even if we didn’t represent the relationship as a
keyed relationship in the logical model.

5. When we have a navigable to-0,1 relationship we implement it as


above and allow the foreign key to have a null value to represent
the zero case.

6. Link tables are given foreign keys for the two entities which they
relate and typically this forms a compound key for the link table.

7. When many-to-many relationships allow zero entities to be related


to an entity A, this is typically implemented by having an absence
of any entries in the link table with a foreign key for A. If however
we want to have at least one entry for each entity involved we can
use null values as above.

8. Attributes with arbitrary data types that are actually entity types become instead relationships to those entities in the physical model and are implemented using foreign keys and link tables as above.

9. Attributes with data types that are actually collections of values become to-many relationships in the physical design and are implemented with foreign keys and/or link tables as above.

10. Whenever we use a foreign key we may consider telling the database system about this by including a foreign key constraint, i.e. we tell the database to ensure that foreign key values really do match existing primary key values. This is a means of ensuring data integrity (see the sketch after this list).

11. Aggregation and composition relationships are treated the same


as ordinary relationships.
12. If an entity A is a specialization of entity B we treat this as a 0,1-
to-1 relationship navigable from A to B. Each instance of entity A
will be represented by a row in the table for A containing the
specialized A-specific data and a row in the table for B containing
the general data.

13. We ensure that each table in the physical design has been
assigned a primary key using surrogate keys if need be.

14. Finally we define indexes on tables to improve the speed of searching for data. These are discussed below.
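
As an illustration of steps 2, 6 and 10, here is a hypothetical sketch of a link table for a many-to-many relationship between campaigns and keywords, assuming Campaign and Keyword tables with int primary keys CampaignID and KeywordID already exist (all names are made up):

CREATE TABLE CampaignKeyword
(
    CampaignID int NOT NULL,
    KeywordID  int NOT NULL,
    -- The two foreign keys together form a compound primary key (step 6).
    CONSTRAINT PK_CampaignKeyword PRIMARY KEY (CampaignID, KeywordID),
    -- Foreign key constraints ensure the referenced rows really exist (step 10).
    CONSTRAINT FK_CampaignKeyword_Campaign FOREIGN KEY (CampaignID) REFERENCES Campaign (CampaignID),
    CONSTRAINT FK_CampaignKeyword_Keyword  FOREIGN KEY (KeywordID)  REFERENCES Keyword (KeywordID)
);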

3. Indexes on Tables

An index is a structure associated with a table in a database that helps the database look up rows quickly without the need to scan through all the rows in the table. Think of it as being similar to an index at the back of a book. If you want to look up a topic in a book, it's much easier to look up the topic in the index to find the relevant pages than to try to scan through the whole book!

There are two main types of index in SQL Server:

Clustered indexes – the rows of the table are physically sorted on the index key (in SQL Server this is usually the primary key), so rows can be found directly by key value. A table can have only one clustered index.

Non-clustered indexes – separate structures that map index key values to the locations of the corresponding rows, so rows can be looked up on those columns without scanning the table or changing its physical order.

Another type of index that is sometimes used is a full text search index that optimizes searching for words or phrases within text (varchar, nvarchar) columns.
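
A hypothetical sketch of index definitions on the statistics table sketched in Session 1 (names assumed): the primary key provides a clustered index by default, and a non-clustered index speeds up filtering by keyword and date.

-- In SQL Server a PRIMARY KEY constraint creates a clustered index by default,
-- so KeywordDailyStats is already physically ordered by StatID.
-- A non-clustered index to speed up lookups by keyword and date:
CREATE NONCLUSTERED INDEX IX_KeywordDailyStats_Keyword_Date
    ON KeywordDailyStats (KeywordText, StatDate);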

While indexing provides optimization for selecting rows from tables, it does not address the complexity of combining rows from related tables – for this we need to look at transforming the database from a normalized structure to a denormalized OLAP structure – more next session.
Homework

 Complete the logical data model for your search marketing data.
 For reference data around search marketing – produce a physical
database design in SQL Server. Don’t worry too much about
statistical data – that will go instead into our final OLAP database
which we will look at next session.

Glossary:

Term Definition

Aggregation Relationship A relationship between an entity representing a collection and the entity making up the members of the collection.
Bi-directional A relationship in which each entity inherently knows about the other.
Clustered Index An index in which the table's rows are physically sorted on the index key (usually the primary key).
Composition A special case of an aggregation relationship where
Relationship the collection is in fact a structural composition of its
members.
Data Integrity Ensuring that data in a database is logically
consistent in particular that foreign keys do not
reference non-existent primary key values.
Foreign Key Attributes of an entity which form a key or set of
keys for another entity resulting in a keyed
relationship.
Foreign Key A rule set up in a database system telling it to ensure
Constraint that foreign key values correspond to real primary
key values to ensure data consistency.
Full Text Search An index that facilitates the look up of text strings in
Index text fields in the database.
General If an entity is a specialization of another, the latter is said to be more general – it has fewer distinguishing attributes.
Index A structure facilitating look ups of rows in a table.
Inheritance The relationship between a specialized entity and the
Relationship general entity of which it is a specialization.
Keyed A relationship resulting from one entity having
Relationship attributes which form a key or set of keys for another
entity – a foreign key.
Navigability The directionality of a relationship indicating whether
one entity in the relationship inherently knows about
the other. Indicated by arrowheads on a relationship
line in a logical data model.
Non-clustered Index A separate index structure that maps key values to the locations of rows, leaving the table's physical order unchanged.
Specialized An entity is a specialization of another if it has all the attributes of the latter as well as additional attributes, e.g. an online sale is a specialized conversion.
Session 4 Summary

Where We Are:

We are on the road to getting our data into a database optimized for
statistical analysis. So far, we have looked at

1. The basics of storing data in a database.


2. The concepts of normalization and denormalization – reference
data should be normalized; statistical data should be
denormalized
3. The process of analyzing the data and producing an initial
normalized database design.

The normalized structure is perfect for our reference data. For statistical data we will need to optimize the structure by denormalizing. But before we can do this we need to understand two things:

a) How we query data in a relational database – you can’t do stats


if you don’t know how to get to the data!

b) How we analyze data in Statistics – in particular how we typically


query data in order to perform statistical investigations.

1. Querying Data

The process of getting data out of tables (to view it, to do calculations
on it or to produce a report or a graph) consists of the following
processes:

1. Combining rows from several tables to form new rows.


2. Choosing the combined rows based on conditions.
3. Choosing the columns we are interested in and possibly making
new derived columns.
The process of combining rows from several tables is referred to as
joining and the process of choosing rows and columns is called
selecting.

Selecting

Selecting rows is fairly straightforward – we filter by applying conditions to the fields in the row, e.g. what individual values or what range of values they can have. More complex conditions are also possible. In addition, we can take unions of several selections, i.e. we simply combine sets of rows from several selections into a single set. (We can only do this if the rows all have the same columns.)

Selecting columns typically amounts to nothing more than picking which columns we want. We can also make new columns based on calculations done on the values of other columns or sets of rows.
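
A minimal sketch of selecting in SQL, using the hypothetical KeywordDailyStats table from earlier (column names assumed):

-- Choose rows by condition, pick columns, and derive a new column.
SELECT KeywordText,
       Clicks,
       Impressions,
       CAST(Clicks AS decimal(10,4)) / Impressions AS ClickThroughRate  -- derived column
FROM   KeywordDailyStats
WHERE  Impressions > 0
  AND  StatDate >= '20080101';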

Joining

Joining rows from several tables is more complex and involves concepts from the branch of mathematics known as the theory of relations. (Analysis of data in terms of entities and relationships also drew heavily on concepts from this branch of mathematics!) In particular we use the concept of a join operation from the sub-branch of the theory of relations known as relational algebra.

We join two sets of rows at a time, starting with two tables and then joining the result with the rows of the next table.

Suppose there are two sets of rows A and B. There are several join operations that can be applied to them (SQL sketches follow this list):

 Cross Join – This is the simplest join operation. It produces a new set of rows by combining every row in A with every row in B. (This is called the Cartesian Product of A and B.) We typically do this when wanting to form sets of all possible combinations of things, e.g. we can use it to generate all keyword phrases made up of words drawn from given sets.

 Inner Join – This operation is performed with respect to a set of columns chosen from A and matching columns chosen from B. It produces a new set of rows made up of every row of A combined with every row of B that matches on the chosen columns. If a row of A does not have any matching rows in B, there will be no row in the inner join derived from A. (Notice that the inner join is in fact a symmetric operation – the inner join of A with B will be the same as the inner join of B with A if we ignore the order of columns.)

 Left Outer Join – This is also performed with respect to matching on sets of chosen columns. It produces a new set of rows made up of every row of A combined with every row of B that matches on the chosen columns, as well as every row of A that has no matching row in B combined with null fields for the columns that would have come from B had there been a matching row. Thus, there will always be at least one row in the left outer join for each row of A. This operation is not symmetric; the left outer join of A with B is not necessarily the same as the left outer join of B with A even if we ignore the order of columns. If there are rows in either A or B that don't have matches the joins will be different.

 Right Outer Join – This is the mirror image operation of the left
outer join. It produces a new set of rows made up of every row
of B combined with every row of A that matches on the chosen
columns as well as every row of B that has no matching row in A
combined with null fields for the columns that would have come
from A had there been a matching row. Thus, ignoring the
ordering of columns, the right outer join of A with B is the same
as the left outer join of B with A.

 Full Outer Join – This is once again performed with respect to matching on sets of chosen columns. It produces a new set of rows made up of every row of A combined with every row of B that matches on the chosen columns, as well as every row of A that has no matching row in B combined with null fields for the columns that would have come from B had there been a matching row, and also every row of B that has no matching row in A combined with null fields for the columns that would have come from A had there been a matching row. Thus, there will always be at least one row in the full outer join for each row of A and for each row in B. Like the inner join this operation is symmetric.
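
The following hypothetical T-SQL sketches show the join operations, assuming two small tables A and B that share a matching column called KeywordID:

-- Cross join: every row of A with every row of B (Cartesian product).
SELECT * FROM A CROSS JOIN B;

-- Inner join: only combinations that match on KeywordID.
SELECT * FROM A INNER JOIN B ON A.KeywordID = B.KeywordID;

-- Left outer join: every row of A, with nulls for B's columns when there is no match.
SELECT * FROM A LEFT OUTER JOIN B ON A.KeywordID = B.KeywordID;

-- Right and full outer joins work the same way in the other direction or in both directions.
SELECT * FROM A RIGHT OUTER JOIN B ON A.KeywordID = B.KeywordID;
SELECT * FROM A FULL OUTER JOIN B ON A.KeywordID = B.KeywordID;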
Joins and Normalization
Joins are typically performed with matching on key values. Thus, when
working with a normalized OLTP database structure, joins typically
become very long and complicated – we typically get long chains: A
joined to B, B joined to C, C joined to D, D joined to E etc. which
impacts on performance. The more normalized the data the more
complex the required join and the longer the join chains. The
transition from 3rd to 4th normal form typically has the biggest penalty
on the complexity of joins. The whole purpose of using a denormalized
OLAP database for statistical analysis instead of an OLTP database is to
minimize and simplify the joining that is required and eliminate long
join chains.

What do I need to remember?


It’s not important to remember the jargon but it is important to
understand the principles involved. Incorrect joins are an area that
causes many silly mistakes when working with data, one must be
aware of the need to sometimes have outer joins that fill in null
values instead of inner joins that ignore rows that do not match. One
must be aware that null values introduced by outer joins cannot be
using in calculations. Typically, when there are nulls one uses a
derived column that replaces the null with something more useful.
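
A common way to derive such a column in SQL Server is COALESCE (or ISNULL), which substitutes a replacement value; a hypothetical sketch, reusing the A and B tables assumed above:

-- Replace nulls introduced by an outer join with zero before calculating.
SELECT A.KeywordText,
       COALESCE(B.Clicks, 0) AS Clicks   -- 0 when there was no matching row in B
FROM   A
LEFT OUTER JOIN B ON A.KeywordID = B.KeywordID;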

2. Applying a Statistical Point of View to Data

We have seen that to arrive at our final database design we need to denormalize it in order to simplify and minimize joining. But the manner in which we denormalize it depends on what sort of queries we will be doing.

Up till now we have been looking at general Data Analysis concepts


that are not specifically related to Statistics. Since we are going to be
doing statistical analysis on our data, we need to now begin applying
statistical concepts to it in order to understand the sort of queries we
will be doing so that we can get to our final structure and begin using
the data.
Variables and Constants

While general data analysts are interested in how data is divided up


into entities and relationships, a statistician is interested in
summarizing data and examining cumulative effects and general
trends.

The first thing a statistician asks is, “what data varies and what data
remains constant?” The two key concepts in this regard are:

Variables – attributes of entities that vary.

Constants – attributes that have a fixed unchanging value.

Most attributes are in fact variables, e.g. Impressions, Clicks, Campaign Name; all these have values that vary: different impressions and clicks for different keywords on different days, different names for different campaigns. It is the variables that are of most interest.

The constants are less interesting: the number of days in a week


(always 7), cents in a rand (always 100). Typically, we should not
store these in our physical database whether in normalized or
denormalized form – they enter into calculations but do not need to be
looked up from the database.

One way of looking at the distinction between variables and constants


is to say they are opposites. But if “variable” is thought of as being
synonymous with “attribute”, a constant can be thought of as a special
case of a variable – a variable that retains a fixed value.

Statisticians classify variables in several different ways which we will


examine in the following sections:

Functional Dependencies Between Variables

When the value of one variable y is uniquely determined by the values of a set of variables x_1, ..., x_n we have a functional dependency of y on x_1, ..., x_n.

x_1, ..., x_n are called the independent variables and y is called the dependent variable.

Note that this does not necessarily mean that there is a simple formula that allows us to calculate y from x_1, ..., x_n!

We have already seen this phenomenon with candidate keys and non-
key attributes – the non-key attributes have a functional dependency
on the candidate key attributes.

Dependency on Time and “Space”

One of the most important independent variables that statisticians


consider is time. One can classify other variables according to whether
they have a functional dependency on time or not. We thus speak of a
variable being time-dependent or time-independent.

Another important independent variable is the identity of an entity


indicated by its primary key value. The primary key of an entity varies
across the space of instances of the entity. A variable that is
independent of the primary key of an entity (in other words it has the
same value for all instances of the entity) is called static. A variable
that varies with the primary key is called non-static.

We can classify variables according to the various combinations of variation over time and “space”:

                 Time-dependent        Time-independent
Static           Non-Characteristic    Constant
Non-static       Characteristic        Identifying


 Characteristic variables are non-static and time dependent e.g.
number of impressions for various keywords.

 Non-Characteristic variables are static and time dependent


e.g. rand exchange rate for various online sales.

 Identifying variables are non-static and time independent e.g.


the keyword id for a keyword. Typically, primary key attributes
are identifying.

 Constants which we discussed above can be thought of as


“variables” that are static and time independent.

Qualitative vs Quantitative Data

We can also classify variables based on whether they represent


qualitative as opposed to quantitative data.

 Qualitative variables describe qualities.

 Quantitative variables measure quantities.

Qualitative variables can be further classified:

 Nominal variables have values which are only “names” having


no numeric content, e.g. keyword name

 Ranked variables have numeric values which indicate only order


but not quantity, e.g. position

Quantitative variables can be further classified:

 Discrete variables take integer values.

 Continuous variables take arbitrary real values.


There are other ways of classifying variables in terms of qualitative
and quantitative information: We can group nominal, ranked and
discrete variables as distinct valued, meaning that they all take
values from a countable set as opposed to the continuous variables
that take arbitrary real values. We can also group ranked, discrete and
continuous variables as numeric as they have values which are
numbers as opposed to the nominal variables whose values are simply
names.

Scale of Measurement – Comparative Data

Another way in which statisticians classify variables is in terms of their scale of measurement, which is a measure of how much comparative information a variable has.

Categorical scale - the variable has merely a value but not


necessarily order, distance or relative size. These are exactly the same
as the nominal variables and this scale is also called the nominal scale.

Ordinal scale – the variable has order but not necessarily distance or
relative size. These are precisely the ranked variables and this scale is
also called the ranked scale.

Interval scale – the variable has order and a metric function that
determines distances between values but does not necessarily have a
meaningful relative size. Dates and times are typical examples of
interval scale variables.

Ratio scale – the variable has order, a metric for determining


distances between values, and a norm function for determining
relative sizes of values, e.g. media cost, click through rate.

Method of Determination

Yet another way that statisticians classify variables is according to the method of determining the value of the variable:

Named variables are simply assigned a name. These are the same as
the nominal / categorical scale variables.
Counted variables are assigned a value by counting occurrences, e.g.
clicks.

Measured variables are assigned a value by some measurement or


calculation process more complex than simply naming or counting,
e.g. media costs.

3. Summarization of Data

Statistics is primarily concerned with the summarization of many values of entity attributes.

Population and Samples

The total set of relevant entities whose data we wish to summarize is


referred to as the population. Typically, it is impractical or impossible
to work with the whole population. We instead work with a sample of
the population.

Ideally, we want to use a representative sample i.e. one from which


we can extrapolate the summarized information for the entire
population. When the extrapolated summarized information deviates
from the true values for the population we say that the sample is
biased.

Cumulative Data

One method of summarizing data is to look at cumulative values where


applicable.

For nominal data there is no way to obtain a single cumulative value.


Instead we look at the frequencies of the various possible values.

For ranked data we typically look at the extreme values of the data –
the maximum and the minimum values. Frequencies are also
typically used for ranked variables.

For quantitative data the typical means of deriving a cumulative or aggregate value is to add up values. However, one cannot simply add just any quantitative values; the addition has to make sense. Variables which accumulate via addition of values are called additive; an example is media costs to date.

Duties on online sales are typically not additive. Typically, these are
calculated by multiplying by factors. If we have several such factors
their cumulative amount is calculated by taking their product not their
sum. Such variables are called multiplicative.
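
These summaries map directly onto SQL aggregate functions; a hypothetical sketch against the KeywordDailyStats table from earlier:

-- Frequencies, extreme values and sums per keyword.
SELECT KeywordText,
       COUNT(*)        AS Days,          -- frequency of rows
       MIN(Clicks)     AS MinClicks,     -- extreme values
       MAX(Clicks)     AS MaxClicks,
       SUM(TotalCost)  AS CostToDate     -- additive variable accumulated by summing
FROM   KeywordDailyStats
GROUP BY KeywordText;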

Central Tendencies of Data

Another very important method of summarizing data in statistics is to


look at the central tendency of the data i.e. the single value that in
some manner represents the middle value of the data.

For nominal variables we use the mode which is the most frequent
value.

For ranked variables the central tendency is given by the median, which is a value that has the property that not more than half the values lie below it and not more than half lie above it. Thus, for a sample with an odd number of values written in order, the median is the value in the middle of the list. For an even number of values, we can take any value between the middle two values; typically we use the arithmetic mean of these values (see below).

For quantitative variables we typically use the arithmetic mean (sometimes simply called the mean or average) as the central tendency. The arithmetic mean is a single number which when added to itself as many times as the number of values in the sample, produces the same result as adding up all the values. To obtain the arithmetic mean we thus add all the values and divide by the number of values:

If the quantitative variable is denoted by x, its arithmetic mean is denoted by \bar{x} and is given by

    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

where n is the number of values in the sample. (The symbol ∑ is a capital Greek letter sigma and denotes a sum of values.)
The arithmetic mean is meaningful for variables which must be totaled
in calculations, in particular for additive variables. It has no meaning
for ranked variables like position!

The arithmetic mean is one of the most important concepts in statistics


and is used to derive many other useful summarizations.

For variables which are multiplied in calculations instead of being added (such as duty factors applied to an online sale) the relevant central tendency is not the arithmetic mean but the geometric mean. This is a single value which when multiplied by itself as many times as the number of values in the sample, produces the same result as multiplying all the values together. It is thus obtained by multiplying all the numbers together and then taking the nth root, where n is the number of values:

    geometric mean of a = \sqrt[n]{\prod_{i=1}^{n} a_i} = \sqrt[n]{a_1 \times a_2 \times \cdots \times a_n}

(The symbol ∏ is a capital Greek letter pi and denotes a product of values.)
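
In SQL Server the arithmetic mean is available directly as AVG; a geometric mean can be computed via logarithms (a standard trick, shown here as a hypothetical sketch using the KeywordDailyStats table and a made-up DutyFactors table):

-- Arithmetic mean of clicks per keyword (CAST avoids integer truncation).
SELECT KeywordText, AVG(CAST(Clicks AS decimal(18,4))) AS MeanClicks
FROM   KeywordDailyStats
GROUP BY KeywordText;

-- Geometric mean of a set of positive factors: exp of the mean of the logs.
SELECT EXP(AVG(LOG(Factor))) AS GeometricMean
FROM   DutyFactors;   -- hypothetical table of positive multiplicative factors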

Glossary:

Term Definition

Additive A variable is additive if its cumulative value is


obtained by adding.
Arithmetic Mean A single number which when added to itself as many
/ Average times as the number of values in the sample,
produces the same result as adding up all the values.
Biased A sample is biased if the summarized information
extrapolated from it differs from the true values for
the population.
Categorical Scale Having merely values not necessarily order, distance or relative size – identical to nominal.
Cartesian Product The set of rows produced by a cross join, consisting of every possible combination of rows.
Central Tendency A function that provides a notion of a middle value for a set of values.
Characteristic Non-static and time dependent.
Constant A value that never changes.
Continuous Taking arbitrary real values.
Countable A set is countable if its members can be listed in a
sequence whether finite or infinite.
Counted Variable A variable whose value is assigned by counting.
Cross Join The join operation that combines every row from one
set with every row from another to produce a
Cartesian Product.
Dependent Variable A variable whose value is uniquely determined by other variables in a functional dependency.
Discrete Taking integer values.
Distinct Valued Taking values from a countable set – nominal, ranked or discrete.
Extreme Value A maximum or a minimum value.
Frequency How often a value occurs.
Full Outer Join The join operation that combines rows from two sets
of rows where they match on a specified set of
columns, filling in nulls for the columns of one set
when there is no match for a row in the other set.
Functional When the value of one variable is uniquely
Dependency determined by the values of several others.
Geometric Mean A single value which when multiplied by itself as
many times as the number of values in the sample,
produces the same result as multiplying all the values
together
Independent Variable A variable in a set of variables on which another variable is functionally dependent.
Identifying Non-static and time independent.
Inner Join The join operation that combines rows from two sets
of rows where they match on a specified set of
columns.
Interval Scale Having order and a metric determining distance
between values but not necessarily a relative size.
Join The set of rows that results from joining rows from several tables or sets of rows.
Join Operation An operation performed on sets of rows to produce a
join.
Joining Combining rows from several tables or sets of rows.
Left Outer Join The join operation that combines rows from two sets
of rows where they match on a specified set of
columns, filling in nulls for the columns of the second
set when there is no match for a row in the first set.
Maximum   The greatest value of a ranked variable.
Mean Typically means the arithmetic mean. In general, a
value which when a cumulative calculation is applied
to it as many times as there are values in a sample,
produces the same result as the cumulative
calculation being applied to the actual values in the
sample.
Measured Variable   A measured variable has a value that is measured by some means involving more than a simple naming or counting.
Median A value for which not more than half the values for a
ranked variable lie below and not more than half lie
above.
Metric In mathematics a metric is a function that calculates
distances between values.
Minimum The lowest value of a ranked variable.
Mode The most frequent value of a variable.
Multiplicative A variable is multiplicative if its cumulative value is
obtained by multiplying.
Named Variable A variable whose values are merely assigned names
– the same as nominal / categorical scale.
Nominal Having values that are merely names.
Non-Characteristic   Static but time dependent.
Non-Static The value of a non-static variable varies over the
space of entity instances under consideration.
Norm In mathematics a norm is a function that calculates
the relative size of values.
Numeric Having a value that is a number – ranked, discrete or
continuous.
Ordinal Scale Having order but not necessarily distance or relative
size – precisely the ranked variables.
Outer Join A join operation that fills in null values when there
are no matching rows: Left Outer Join, Right Outer
Join, Full Outer Join.
Qualitative A qualitative variable describes qualities.
Quantitative A quantitative variable measures quantities.
Population The total set of relevant entities in a study.
Product The result obtained by multiplying values.
Ranked Having a numeric value indicating only order not
quantity.
Ratio Scale Having order, a metric and a norm for determining
relative sizes of values.
Relational Algebra   The sub-branch of the Theory of Relations dealing with operations that can be performed on relations.
Representative Sample   A sample whose summarized values can be used to extrapolate those of the entire population.
Right Outer Join   The join operation that combines rows from two sets of rows where they match on a specified set of columns, filling in nulls for the columns of the first set when there is no match for a row in the second set.
Sample   A subset of a population.
Scale of Measurement   The degree of comparative information in a variable.
Selecting Choosing rows and columns when performing a
query.
Space In mathematics a set of instances of a certain
concept is often called a space.
Static A static variable has the same values for all instances
of an entity under consideration.
Sum The result obtained by adding values.
Symmetric Operation   An operation on two things that does not depend on their order.
Theory of Relations   The branch of mathematics dealing with relations (relationships) between entities.
Time Dependent   The value of a time dependent variable varies with time.
Time Independent   The value of a time independent variable does not vary with time.
Union The combination of several sets of rows (all with the
same columns) into a single set of rows.
Variable A value of an attribute that varies.
Session 5 Summary

1. Online Analytical Processing

We have reached the stage where we can explain the structure of an


OLAP database – a denormalized database optimized for statistical
analysis.

Facts and Dimensions

In the last session we looked at the various ways in which statisticians


classify variables. Combining the practical considerations of OLAP and
data warehousing with statistics we get another method of
classifying variables:

Facts – variables whose values are used as measures of performance,


cost or benefit etc. which we wish to analyze statistically. e.g.
impressions, conversion rate, media costs, position.

Dimensions – variables that are used to partition sets of fact values


for summarization, e.g. keyword name, campaign, site, day.

Facts are typically numeric variables (ranked, discrete or continuous);
indeed, facts are typically additive quantitative variables. (Facts are
also called measures, which is perhaps a more intuitive name, but they
should not be confused with measured variables – counted variables may
also be facts.)

Dimensions are typically qualitative variables (nominal or ranked) and


they uniquely characterize entities called contextual entities or
contexts representing a way of partitioning the data. Facts will either
be attributes of contextual entities or attributes of entities related
either directly or via a sequence of relationships to a contextual entity.

We typically summarize fact values over a partition of the data


determined by a dimension value or set of values.
Dimensional Hierarchies

Contextual entities typically have 1-to-many relationships to other
contextual entities, thus forming a hierarchical structure called a
dimensional hierarchy. A contextual entity within a dimensional
hierarchy is referred to as a dimension level of the hierarchy.

Star Schemas

For OLAP, instead of creating a physical database with tables that


correspond directly to entities and relationships, we instead make
tables that correspond to facts and dimensions.

To get to our OLAP database, we start with an OLTP database design


derived directly from our logical data model of entities and
relationships. We carry out the following steps:

1. Determine which columns store variables that we want to use as


our facts. (Typically look for money, decimal or integer columns
which implement numeric variables.)

2. For each fact identified, determine the dimensions by which we wish
to partition the fact data. These will be columns either in the
same table or in other tables which can be combined with the
fact columns via a series of joins.

3. Determine the dimensional hierarchies.

4. For each dimensional hierarchy, create a dimension table


representing it, with an identity column as a surrogate key and
columns for the attributes of the dimension levels in the
hierarchy, in particular for the dimension values.

5. For each set of facts that will be partitioned by the same


dimensions, define a fact table with a compound primary key
made up of foreign keys referencing the primary keys of the
dimension tables for the dimensional hierarchies that will be
used to partition the fact values.

Our resulting database will not be in more than 3rd normal form. When
querying data we will not have join chains consisting of more than two
tables – indeed we will only have join chains consisting of a dimension
table and a fact table.

The design of our resulting database looks like star shapes with fact
tables in the centre of the stars and dimension tables surrounding
them connected by foreign key relationships forming spokes. Such a
database design is referred to as a star schema.

[Class diagram: the FactPerformance fact table (columns Impressions, Clicks, MediaCosts, AveragePosition, LogDtm) with foreign keys referencing the dimension tables DMClient, DMCampaign, DMSearchEngine, DMKeyWordID, DMKeyWordName and DMKeyWordGroup.]

Figure 1 Star Schema Database


When data is placed in the star schema database, fact tables are
typically long and thin while dimension tables are short and fat in
terms of numbers of rows and columns.

As dimensions uniquely characterize contextual entities it would be


nice if they were in fact time-independent and thus natural keys for
the contextual entities. However we face the reality that we might
need to change the dimension value on occasion e.g. a search engine
or campaign may change its name. Using identity columns as surrogate keys
for the dimension tables, instead of using the dimensions themselves as
the primary key, makes this possible. It also helps keep the fact table
size down despite the denormalization of the data.
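To make the shape of a star schema concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. It is a simplified, two-dimension version of the kind of schema shown in Figure 1; the table and column names are illustrative only, and the real DDL would depend on the actual database platform.

import sqlite3

conn = sqlite3.connect(":memory:")

# One dimension table per dimensional hierarchy, each with an identity
# surrogate key, and a fact table whose compound primary key is made up
# of foreign keys referencing the dimension tables. (A real schema would
# also include a date/time dimension in the key.)
conn.executescript("""
CREATE TABLE DMSearchEngine (
    SearchEngineID   INTEGER PRIMARY KEY,  -- surrogate key
    SearchEngineName TEXT NOT NULL
);
CREATE TABLE DMCampaign (
    CampaignID   INTEGER PRIMARY KEY,      -- surrogate key
    CampaignName TEXT NOT NULL
);
CREATE TABLE FactPerformance (
    SKSearchEngineID INTEGER NOT NULL REFERENCES DMSearchEngine(SearchEngineID),
    SKCampaignID     INTEGER NOT NULL REFERENCES DMCampaign(CampaignID),
    Impressions      INTEGER NOT NULL,
    Clicks           INTEGER NOT NULL,
    MediaCosts       REAL    NOT NULL,
    PRIMARY KEY (SKSearchEngineID, SKCampaignID)
);
""")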

Snowflake Schemas

Sometimes contextual entities are arranged in more complex hierarchy


structures than a dimensional hierarchy with definite dimension levels
– we may instead have a situation where a contextual entity has a 1-
to-many or many-to-many relationship with itself. In such a case we
keep our dimension data normalized. Instead of putting all the values
from different dimension levels in the same dimension table we
instead have simple dimension tables each representing a single
contextual entity and we represent 1-to-many and many-to-many
relationships between contextual entities as with a normalized OLTP
database using foreign keys and link tables.

Even when we do not have hierarchies that are more complex than a
dimensional hierarchy, we may want to consider keeping contextual
entities normalized in order to help maintain the consistency of the
data if the dimensions are prone to change.

A database design consisting of fact tables with foreign key


relationships to normalized tables representing contextual entities is
called a snowflake schema as it resembles a snowflake with the
fact tables in the centres of the snowflakes and the relationships to
contextual entity tables forming the arms of the snowflake.

Granularity and Summarization

Now that we have our OLAP database design we can populate it with
statistical data.
Although our original OLTP database may contain facts for individual
items or events, in our OLAP database we store facts summarized for
the smallest level of partition possible – partitions determined by
single fact table compound key values made up of single dimension
table primary key values. This smallest partition, which corresponds to
single rows in the fact table, is referred to as the granularity of the
fact table.

In our case the information obtained from DART Search already has
the data summarized to the granularity that we will be working with
and so we do not need to do any summarization when loading the data
into the OLAP database.

When querying data for partitions that are coarser than the granularity
we summarize the data further by taking sums (for additive data),
products (for multiplicative data) or averages or other central
tendencies for data that does not accumulate. This is known as rolling
up the data. Typically reporting tools automatically roll up the data for
us.
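As a small illustration, here is a minimal Python sketch of rolling additive facts up to a coarser partition; the fact rows are hypothetical, and real reporting tools would do this with a grouped query.

from collections import defaultdict

# Hypothetical fact rows at the stored granularity:
# (search engine, campaign, impressions, clicks)
fact_rows = [
    ("EngineA", "Summer", 1000, 40),
    ("EngineA", "Winter",  800, 25),
    ("EngineB", "Summer", 1200, 30),
]

# Roll the additive facts up to a coarser partition: per search engine.
totals = defaultdict(lambda: [0, 0])
for engine, campaign, impressions, clicks in fact_rows:
    totals[engine][0] += impressions
    totals[engine][1] += clicks

for engine, (impressions, clicks) in totals.items():
    print(engine, impressions, clicks)
# EngineA 1800 65
# EngineB 1200 30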

2. Measuring Data Dispersion

We have already discussed cumulative amounts and central tendencies


which are the first statistical quantities that statisticians use to analyze
facts. The next step is to look at the dispersion of the fact values, i.e.
how much they are spread out over possible values.

For nominal data there is no simple way to measure dispersion; we
typically apply the concept of dispersion to numeric facts.

Range

The simplest measure of dispersion is the range. This is simply the


difference between the maximum and minimum values of the data.

range = max X − min X

where X is the set of values of our fact.

Typically we use the range for variables that are at least of the interval
scale of measure so that differences produce meaningful values. (We
can use the range for ranked variables but this is usually not useful:
for a full population the minimum value is 1 and the maximum is simply
the number of values, so the range is just 1 less than the number of
values. If we are looking instead at a
sample of a ranked population then things are a bit better – the range
would give the number of rank positions between the minimum and
maximum rank positions of members of the sample.)
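For a ratio-scale fact such as media cost, the range is trivial to compute; a minimal Python sketch with hypothetical values:

media_costs = [12.50, 7.20, 30.00, 18.75]     # hypothetical media costs
cost_range = max(media_costs) - min(media_costs)
print(cost_range)   # 22.8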

Deviations and Variance

Deviations

The range is based only on the two extreme values of the data and so
it says nothing about how the rest of the data is distributed. For this
we introduce the concept of a deviation. A deviation of a value xᵢ of a
variable x is how much it differs from the mean value x̄:

deviation = xᵢ − x̄

(Some books take the difference the other way around, i.e. x̄ − xᵢ; it
doesn't matter as long as you pick one way and stick to it
consistently.)

Absolute Deviations and Mean Absolute Deviation

Whereas the range is a single number, we have a deviation for each


value of the variable. Ideally we would like to summarize these into a
single number.

At first it might seem a good idea to take the mean of all the
deviations, but a simple proof shows that this will always come out as
zero! One way of looking at it is that some deviations are negative and
some are positive, and when averaged out the negatives cancel the
positives. To avoid this we consider absolute deviations, which are
simply the absolute values of the deviations, i.e. the sizes of the
deviations regardless of whether they are negative or positive, obtained
by making negative values positive and leaving positive values alone:

absolute deviation = |xᵢ − x̄|

(The symbol | | denotes the absolute value.)

We can take the mean value of these absolute deviations to obtain a


single number providing a measure of dispersion – the mean
absolute deviation.

mean absolute deviation = (1/n) ∑ᵢ₌₁ⁿ |xᵢ − x̄|

Square Deviations and Variance

Although the mean absolute deviation does indeed provide a single


number measuring dispersion and is based on all the data, it is not
very commonly used. The reason for this is that it is difficult to analyze
mathematically as the operation of taking absolute values does not
simplify in algebraic manipulations.

To solve the problem we look at another way of making deviations
non-negative so that they can be averaged: we take the square
deviations, which are just the deviations squared.

square deviation = (xᵢ − x̄)²

As before we can take the mean value of these to obtain a single


number – the mean square deviation also known as the variance.

variance = (1/n) ∑ᵢ₌₁ⁿ (xᵢ − x̄)²

The variance is the preferred measure of dispersion as its formula can


be easily manipulated algebraically.

Standard Deviation

There is one disadvantage to the variance, namely that it is not in the
same units as the variable x; it is in the square of the units of x. When we
need a measure of dispersion that is in the same units as x we take
the square root of the variance – this is known as the root mean
square deviation or more commonly, the standard deviation.
standard deviation = √( (1/n) ∑ᵢ₌₁ⁿ (xᵢ − x̄)² )

The symbol σ (the Greek letter sigma) is typically used to denote the
standard deviation, and the variance, being equal to the square of the
standard deviation, is typically denoted by σ² (reminding us that it is
in square units) instead of having a separate symbol of its own.
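Putting the pieces together, here is a minimal Python sketch that computes these dispersion measures for a small set of hypothetical click counts:

clicks = [40, 25, 30, 45, 60]               # hypothetical fact values
n = len(clicks)
mean = sum(clicks) / n                      # arithmetic mean, x-bar

deviations = [x - mean for x in clicks]
mean_absolute_deviation = sum(abs(d) for d in deviations) / n
variance = sum(d ** 2 for d in deviations) / n   # mean square deviation
standard_deviation = variance ** 0.5             # root mean square deviation

print(mean, mean_absolute_deviation, variance, standard_deviation)
# 40.0 10.0 150.0 12.247...

(The standard library equivalents are statistics.pvariance and statistics.pstdev.)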

Glossary:

Term Definition

Absolute Deviation The absolute value of a deviation.

Absolute Value The size of a number regardless of sign, obtained by


making a negative value positive and leaving a
positive value the same.
Contextual Entity (Context)   An entity representing a method of partitioning data.
Data Warehouse A denormalized database that stores historical data on
which statistical analysis can be performed.
Deviation The difference between a particular value of a variable
and the mean value of the variable.
Dimension A variable whose values are used to partition fact
values for summarization.
Dimension Level A contextual entity that is a member of a dimensional
hierarchy
Dimension Table A table in a database with an identity surrogate key as
primary key and non-key columns consisting of the
attributes of the dimension levels in a dimensional
hierarchy.
Dimensional Hierarchy   A sequence of dimension levels with a 1-to-many relationship between successive members of the sequence.
Dispersion   A term used to describe the fact that a variable’s values are spread out over different possible values. There are several formal measures of how much data is dispersed, e.g. range, standard deviation.
Fact (Measure) A variable that provides a measurement of
performance, cost or benefit etc. that we wish to
analyze statistically.
Fact Table A table in a database whose primary key is made up of
foreign key references to tables containing dimensions
and whose non-key columns store facts.
Granularity The smallest partition of data determined by a single
fact table key. Rows in the fact table store data already
summarized to this level.
Mean Absolute Deviation   The mean of the absolute deviations used as a measure of dispersion.
Mean Square Deviation (Variance)   The mean of the square deviations used as a measure of dispersion. Also called the variance.
Range The difference between the maximum and minimum
values of a variable providing a measure of dispersion.
Roll Up To summarize fact amounts for a particular partition of
the data.
Root Mean Square Deviation (Standard Deviation)   The square root of the mean square deviation (variance). Used for the sake of having a measure of dispersion in the same units as the variable.
Snowflake Schema   A database design consisting of fact tables whose primary keys are compound keys made up of foreign keys referencing the primary keys of tables representing normalized contextual entities.
Square Deviation The square of a deviation – used in order to have non-
negative values that are more useful for algebraic
manipulation than absolute deviations.
Standard Deviation The root mean square deviation.

Star Schema A database design consisting of fact tables whose


primary keys are compound keys made up of foreign
keys referencing the primary keys of dimension tables.
Variance The mean square deviation.
Session 6 Summary

1. Sample Variance vs Population Variance

The definition of variance that we looked at in the previous session is


typically used as a measure of dispersion of an entire population and is
therefore also called a population variance and its corresponding
standard deviation is called a population standard deviation.

When we are working with data that is a sample of values drawn


randomly from a population (with the possibility of drawing the same
value more than once) then applying the formula for the variance to
the data does not give a good estimate for the population variance. If
the population variance is σ², then the average result of applying the
variance formula to samples of size n of randomly drawn values is not
σ² but ((n − 1)/n)·σ².

This leads us to define the sample variance for a sample of size n


as:

sample variance = ( n / (n − 1) ) × (result of applying the variance formula)

                = (1/(n − 1)) ∑ᵢ₌₁ⁿ (xᵢ − x̄)²

where the mean x̄ is taken over the sample.

The sample variance is used as an estimator for the population


variance, i.e. its average value over several samples is used as an
estimate for the population variance.

As with the population variance, this value is in the square of the units
of x and so when we need a value in the same units as x we take the
square root to obtain a quantity called the sample standard
deviation which is usually denoted by s :
s = √( (1/(n − 1)) ∑ᵢ₌₁ⁿ (xᵢ − x̄)² )

The sample variance is therefore denoted by s², reminding us that it is
in square units.
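The standard library makes the distinction directly; a minimal sketch with hypothetical click counts:

import statistics

sample = [40, 25, 30, 45, 60]     # hypothetical sample of click counts

population_variance = statistics.pvariance(sample)  # divides by n
sample_variance     = statistics.variance(sample)   # divides by n - 1
sample_std_dev      = statistics.stdev(sample)      # square root of the sample variance

print(population_variance, sample_variance, sample_std_dev)
# population variance 150, sample variance 187.5, sample std dev ≈ 13.69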

2. Interquartile Range and Semi-Interquartile Range

Interquartile Range

Variances and standard deviations involve squaring deviations and


thus only make sense for variables of the ratio scale of measure which
allows values to be multiplied meaningfully. For more general
situations we need another method of measuring dispersion.

We start off by taking the median. The median partitions our set of
fact values into two halves – values less than the median and values
greater than the median. (We leave the median itself out of either
half.)

We can then take the median of the lower half of values. The number
we obtain is called the lower quartile. Not more than a quarter of the
values are below it and not more than three quarters are above it.

Similarly we can take the median of the upper half of the values. The
number we get this time is called the upper quartile. Not more than
three quarters of the values are below it and not more than a quarter
are above it.

We thus have three values, called quartiles, dividing the set of values
into quarters: the lower quartile (also called the first quartile), the
median (also called the second quartile) and the upper quartile (also
called the third quartile).

The first quartile is denoted by Q₁, the second by Q₂, and the third by
Q₃. (Some books also denote the minimum by Q₀ and the maximum by Q₄.)

Then to obtain a measure of dispersion based on quartiles we take the


difference between the upper quartile and the lower quartile. This is
known as the interquartile range:
interquartile range = Q₃ − Q₁

Compare this with the ordinary range which was given by:

range = Q₄ − Q₀

As the upper and lower quartiles are determined by looking at how the
entire set of values is distributed, the interquartile range provides a
richer measure of dispersion than the ordinary range.

As with the range, the concept of interquartile range can be applied to
any ranked variable but is most meaningful when applied to variables of
the interval scale of measurement, which have meaningful differences
between values. (In contrast, for an arbitrary ranked variable the
interquartile range will merely indicate the number of rank positions
lying between the upper and lower quartiles, which is simply determined
by the total number of values.)

Semi-Interquartile Range:

The interquartile range is a bit different to the variance, standard


deviation and mean absolute deviation. The latter are all measures of
dispersion providing a notion of average distance of values from the
central tendency of the data (in fact from the arithmetic mean). The
interquartile range on the other hand is not an average distance from
a central tendency.

To produce a measure of dispersion using quartiles that is similarly


based on the idea of an average distance from a central tendency we
start off by using the median (second quartile) as the appropriate
notion of central tendency. We then take the distance from the lower
quartile to the median, Q₂ − Q₁, and the distance from the upper quartile
to the median, Q₃ − Q₂, and then take the mean of these two values. The
resulting number is known as the semi-interquartile range.

Simplifying out, the median Q₂ cancels out and we are left with

semi-interquartile range = (Q₃ − Q₁) / 2
In other words the semi-interquartile range is simply half the
interquartile range, hence its name.
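A minimal Python sketch of the quartile, interquartile range and semi-interquartile range calculations, following the median-of-halves method described above (the data values are hypothetical):

import statistics

def quartiles(data):
    # Split the ordered data at the median, leaving the median itself
    # out of either half when the number of values is odd, then take
    # the medians of the two halves.
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower_half = values[:half]
    upper_half = values[half + 1:] if n % 2 else values[half:]
    return (statistics.median(lower_half),
            statistics.median(values),
            statistics.median(upper_half))

data = [5, 7, 9, 12, 14, 21, 25, 30]
q1, q2, q3 = quartiles(data)
interquartile_range = q3 - q1
semi_interquartile_range = interquartile_range / 2
print(q1, q2, q3, interquartile_range, semi_interquartile_range)
# 8.0 13.0 23.0 15.0 7.5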

3. Graphs for Displaying and Analyzing Data

Besides calculations of cumulative values, central tendencies and


measures of dispersion, statisticians also analyze and summarize data
visually by means of graphs. We initially cover the basic graph types
and their use. Next session we look at more advanced graphing.

Line Charts

The most straightforward graph used in stats is the line chart (also
called function graph or line plot). This is used to display the
dependency of a quantitative dependent variable on a quantitative
independent variable, typically time. It is used when we know that
there is indeed a functional dependency between the variables. The
graph helps us determine the exact nature of the dependency; in
particular it helps us see if there is a simple formula relating the
variables.

The graph consists of standard XY axes with the graph consisting of


points whose X coordinate is the independent variable and whose Y
coordinate is the dependent variable.
For discrete variables the function plot is made up of disconnected
points. For continuous variables the graph is in theory a line or curve.
In practice, if we don’t know a formula relating the variables up front,
we will only be able to plot separate points, from which we can then
fill in the full connected graph by joining the points with line
segments or curves. We can also join the points on a graph of discrete
variables using line segments, but in this case the line segments merely
help show the trend of the graph; they do not provide interpolated
values between the plotted coordinates, since for a discrete variable
there are no in-between values!

Bar Graphs

Bar graphs show the relative sizes of fact values for different distinct
dimension values by means of bars on an XY plane. There is one bar
per dimension value and the height of the bar represents the fact
value. The bars all have the same width but the width (and hence
area) of the bar does not represent data. Typically the bars are vertical
with their bases on the X axis which represents the dimension values
(vertical bar graph). Alternatively one can also draw the bars
horizontally with their bases on the Y axis representing the dimension
values (horizontal bar graph).

Histograms

Histograms are used to depict frequencies of values or equivalently


the values of counted fact variables against dimension values. As with
bar graphs, bars are used. The frequencies / counted values are
indicated by the areas of the bars. The base of each bar represents an
interval of dimension values and so its width is significant.

Sometimes the bars all have the same width in which case their area is
proportional to their height and in this case we have a special type of
bar graph. However one can have histograms with bars of different
widths in which case one must look at the area of the bar not merely
its height.

Whereas a bar graph has separately drawn bars, in a histogram the


bars usually touch each other and may even be shown without
separate borders.

Pie Charts

A pie chart consists of a circle divided into sectors, typically shown in


different colours. Each sector represents a portion of a whole amount,
the entire circle representing the whole. The size of a sector (i.e. its
angle, or equivalently arc length or area) represents the size of the
portion.

Typically we have a sector for each possible value of a dimension


variable that takes a few distinct values and represent a related fact
value by the size of the sector.
Very often if our dimension variable has a few significant values and
other less significant values, the latter are grouped into a single sector
called “other”. (Some statisticians regard this as a bad practice as it
does not give an indication of how many values are lumped together
as “other” and is sometimes just done out of laziness.)

A common criticism of pie charts is that it is difficult for people to


accurately judge the relative sizes of the sectors compared to a bar
graph as people do not perceive differences in area as easily as they
do differences in length.

Scatter Plots

A scatter plot is similar to a line chart in that it shows points in the XY


plane. But whereas a line chart plots a dependent variable against an
independent variable when we know of the dependency up front, a
scatter plot is used to plot any two quantitative fact values against
each other in order to discover a dependency between them that we
do not already know about or understand.

We use the X axis for one fact and the Y for the other and plot a point
for every combination of values for the two facts. If there is indeed a
dependency between the two values the points will form a line or
curve whose formula we can then determine from the plot.

Usually we do not have a precise dependency but an approximate one


and instead of forming a clean line or curve the points form a cloud
that approximates a line or curve. We can then find a line or curve
(called a regression) that best fits the cloud and use it to make
approximate predictions. If the approximate dependency is linear (i.e.
described by a straight line as opposed to a rounded curve) we speak
of there being a correlation between the two fact variables.
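A minimal sketch of a scatter plot with a fitted regression line, assuming matplotlib and numpy are available (the click and conversion figures are hypothetical):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical fact values: clicks vs conversions per keyword.
clicks      = np.array([10, 25, 40, 55, 70, 90])
conversions = np.array([1, 2, 4, 5, 8, 9])

plt.scatter(clicks, conversions)

# Fit a straight line (a linear regression) through the cloud of points.
slope, intercept = np.polyfit(clicks, conversions, 1)
xs = np.linspace(clicks.min(), clicks.max(), 100)
plt.plot(xs, slope * xs + intercept)

plt.xlabel("Clicks")
plt.ylabel("Conversions")
plt.show()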

Glossary:

Term Definition

Bar Graph A graph showing relative sizes of fact values for


different dimension values using the height of bars.
Correlation An approximate (or exact) linear dependency between
two fact values.
Estimator A statistical value calculated on a random sample
whose average value over all similarly chosen
samples is equal to a statistical value describing the
whole population and which thus acts as estimate for
the latter.
First Quartile The lower quartile.

Histogram A graph showing frequencies or counted values using


the areas of bars.
Horizontal Bar Graph A bar graph with horizontal bars – the Y axis denotes
the dimension values.
Interquartile Range The difference between the upper quartile and the
lower quartile used as a measure of dispersion.
Line Chart (Function Graph, Line Plot)   A graph consisting of points plotted on an XY plane showing the dependency of a quantitative dependent variable on a quantitative independent variable.
Lower Quartile A number separating the first quarter of ordered data
from the second quarter.
Pie Chart A graph consisting of a circle divided into sectors each
representing a part of a whole amount represented by
the entire circle.
Population Standard Deviation   The standard deviation of a complete population under study.
Population Variance The variance of a complete population under study.

Quartile A number separating quarters of a sample or


population arranged in order.
Regression A line or curve that best fits the points in a scatter plot
that can be used to make approximate predictions.
Refers also to the process of determining such a line
or curve.
Sample Standard Deviation   The square root of a sample variance, which will be in the same units as the variable.
Sample Variance An estimator for population variance calculated by
dividing the sum of the sample’s square deviations by
one less than the amount of values in the sample.
Scatter Plot   A graph consisting of points plotted on an XY plane used to investigate a possible dependency between two quantitative fact variables.
Second Quartile The median.

Semi-Interquartile Range   Half the interquartile range, used as a measure of dispersion which, like the variance, is a notion of average distance from a central tendency.
Third Quartile The upper quartile.

Upper Quartile A number separating the fourth quarter of ordered data


from the third quarter.
Session 7 Summary

1. Advanced Graphing

Categories and Series

In the simple graphs that we looked at last session, we typically


displayed the relationship between only two variables.

We may however wish to compare how the relationship we are


graphing varies with respect to another variable. To do this we
superimpose multiple graphs on the same plane, one graph for each
value of the additional variable. Typically each graph has its own
colour.

For normal vertical graphs the X-axis variable used in each individual
graph on the plane is referred to as a category and the additional
variable is referred to as a series. (For horizontal graphs such as a
horizontal bar graph the category is the Y-axis variable instead).

This technique is often used with bar graphs in which case for each
value of the category there is a group of bars. Within each group there
is a bar for each series value.
Stacked Bar Graphs

Another way to display an additional variable in a bar graph is to


divide the bars up into segments according to the values of the
additional variable – typically a nominal dimension variable. Each
segment represents the fact value being graphed for a particular value
of the additional dimension. This is known as a stacked bar graph.

One can also divide the bars up so that each successive division
includes those below it – the lowest division representing data for the
finest partition according to the additional dimension and each
successively higher division showing a coarser partition. For example,
the smallest division might show impressions that led to conversions,
the next division would show impressions that led to click throughs
(which includes those that led to conversions) and the whole bar might
show all impressions.
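A minimal sketch of a (non-cumulative) stacked bar graph, assuming matplotlib is available; the campaign and impression figures are hypothetical:

import matplotlib.pyplot as plt

# Hypothetical impressions per campaign, split by a nominal dimension
# (search engine) to form the stacked segments.
campaigns = ["Summer", "Winter", "Spring"]
engine_a  = [1000, 800, 1200]
engine_b  = [600, 900, 400]

plt.bar(campaigns, engine_a, label="EngineA")
plt.bar(campaigns, engine_b, bottom=engine_a, label="EngineB")
plt.ylabel("Impressions")
plt.legend()
plt.show()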

Variations on Pie Charts

One can improve on pie charts by removing a smaller circle from


inside the chart to produce a doughnut shape. This is known as a
doughnut chart. The fact values are now represented equivalently by
inner arc length as well as by sector area, angle and outer arc length
making it easier to compare values visually (in theory).

One can show different nested doughnuts for different category values.

One can also show segments or sectors of pie charts and doughnut
charts detached for emphasis.
Boxes and Whiskers

A box plot is a graph that depicts the values


Q₀, Q₁, Q₂, Q₃, Q₄ (the minimum, lower quartile, median, upper quartile
and maximum – sometimes called the five number summary) as well
as other statistical measures of a dataset.

A box plot may be drawn vertically or horizontally on an XY plane.


To draw a (vertical) box plot:

1. Determine the five number summary (the quartiles together with the
minimum and maximum).

2. Draw a box bounded at the bottom by the lower quartile and at the
top by the upper quartile. The box can be any width. The height will
be the interquartile range.

3. The median (second quartile) is marked by a line dividing the box.

4. Values which are more than 3 semi-interquartile ranges (1.5


interquartile ranges) greater than the upper quartile or more than 3
semi-interquartile ranges less than the lower quartile are called
outliers. The highest and lowest values in the dataset that are not
outliers are indicated by short horizontal lines connected to the box
by vertical lines. These lines are called whiskers.

5. Values more than 3 interquartile ranges greater than the upper


quartile or more than 3 interquartile ranges less than the lower
quartile are called extreme outliers. Outliers which are not
extreme outliers are called mild outliers. The highest and lowest
values in the dataset that are not extreme outliers can also be
shown using whiskers connected to the previously drawn whiskers.

6. The outliers themselves (including maximum and minimum) are


indicated on the boxplot by marks vertically aligned with the centre
of the box. Typically a different mark is used for extreme outliers
and mild outliers.

7. One can also indicate the mean value if appropriate by a mark


centered vertically with the box.

The position of the median line in the box indicates how the data is
skewed i.e. if it is distributed more or less evenly about the median or
concentrated more on one side. If the lower quartile is further from the
median than the upper quartile, the data is said to be negatively
skewed, if the upper quartile is further from the median than the
lower quartile, the data is said to be positively skewed.
In cases where data is sampled for different time intervals one can plot
successive box plots on the same axis and link the medians or means
with line segments to indicate trend.

Whiskers can also be used on discrete line charts connected to the


plotted points in the case where the points represent central
tendencies of sampled data. In this case the whiskers are called error
bars and are used to indicate the possible deviation (“error”) of the
real population value from that derived from the sample. For fact
values of the ratio scale of measure one typically places the whiskers
one sample standard deviation from the plotted point representing the
mean.
Another approach uses whiskers to mark the maximum and minimum
values, and in this case the whiskers are called max and min bars and
the graph using them is called a max-min plot.
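A minimal sketch of a box plot as described in the steps above, assuming matplotlib is available; the media-cost values are hypothetical. (Note that matplotlib’s boxplot marks all outliers with a single symbol rather than distinguishing mild from extreme outliers.)

import matplotlib.pyplot as plt

# Hypothetical media costs, including one value that will appear as an outlier.
media_costs = [12, 14, 15, 16, 18, 19, 21, 22, 24, 60]

# whis=1.5 treats anything more than 1.5 interquartile ranges
# (3 semi-interquartile ranges) beyond the quartiles as an outlier;
# showmeans adds a mark for the arithmetic mean.
plt.boxplot(media_costs, whis=1.5, showmeans=True)
plt.ylabel("Media cost")
plt.show()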

Stem and Leaf Plots

A stem and leaf plot is a means of listing all values of a quantitative


dataset numerically but in a manner that visually resembles a
horizontal histogram.

To compile a stem and leaf plot numbers in the dataset are split into
units (called the leaves) and either tens, hundreds or thousands etc
(called stems) depending on the typical size of the numbers. The
numbers are arranged in order. The stems are listed once in a column
on the left. For each number, its leaf (units portion) is listed as an
entry in the row headed on the left by the stem of the number. For
example if we divide our numbers into units and tens, the number 562
would consist of an entry of 2 (representing 2 units) in the row headed
on the left by 56 (representing 56 tens).

The resulting representation provides a graphical view of how values


are distributed in the dataset. The number of entries in a row shows
how many numbers in the dataset fall within the partition of the data
determined by the stem value. (Note that one can have more than one
digit for “units” for example if the stem represents hundreds not tens.)
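A minimal Python sketch that builds a stem and leaf plot in the manner described above (the data values are hypothetical; 562 splits into stem 56 and leaf 2 exactly as in the example):

from collections import defaultdict

def stem_and_leaf(values):
    # Split each number into a stem (tens and above) and a leaf (units),
    # then list the leaves in order against each stem.
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)
    for stem in sorted(rows):
        print(stem, "|", " ".join(str(leaf) for leaf in rows[stem]))

stem_and_leaf([562, 548, 555, 561, 570, 569])
# 54 | 8
# 55 | 5
# 56 | 1 2 9
# 57 | 0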
2. Quantiles

Recall how ranked data is divided into two halves by the median and
that these portions can be further divided in two by quartiles. If we
want an even finer view of how the data is distributed we can repeat
the process of dividing in two a third time to produce the quantities
called octiles which are thus numbers dividing the data into eighths.

We are of course not limited to dividing in two. Sometimes statisticians


will use other small prime numbers besides two – dividing the data in
thirds produces the tertiles, dividing into fifths produces the
quintiles, dividing into sevenths produces the septiles.

Since we use a decimal number system it is also convenient to divide


data into tenths producing the deciles which cut off the data at 10%
intervals, or into hundredths producing the percentiles which cut the
data off at 1% intervals.

As with halving we can repeat these divisions several times. Dividing


into thirds to produce tertiles and then dividing each third again into
thirds produces the noniles which divide the data into ninths.

We can combine divisions by different amounts, e.g. first dividing the


data in half by the median and then dividing the lower and upper
halves of the data in thirds results in the data being divided into sixths
by numbers called sextiles. Halving and then dividing each half into
tenths results in the data being divided into twentieths by the duo-
deciles which cut the data off at 5% intervals.
The general concept that we have here is that of a quantile. For a
natural number 𝑛 ≥ 2 the 𝑛-quantiles are numbers that partition the
data into 𝑛 equally sized portions. There will be 𝑛 − 1 𝑛-quantiles. For
each natural number 𝑘 < 𝑛, the 𝑘th 𝑛-quantile is a number such that at
most 𝑘/𝑛 of the data lies below it and at most (𝑛 − 𝑘)/𝑛 lies above it.

The special names for 𝑛-quantiles for various values of 𝑛 are given in
the table below:

n Name of n-quantile
2 median
3 tertile
4 quartile
5 quintile
6 sextile
7 septile
8 octile
9 nonile
10 decile
20 duo-decile
100 percentile

Recall that when we first looked at medians we mentioned that the


median isn’t unique when we have an even number of values; we can
pick any value between the middle two. We mentioned that typically
we take the mean of the middle two – i.e. the number lying exactly
half way between the middle two. Similarly other quantiles may not be
unique depending on the size of the dataset and we can extend the
method of picking medians to picking quantiles in general in the
following manner:

Arrange the data in order. Then the 𝑘th 𝑛-quantile is the number
whose rank (position in the ordered list) is (𝑘/𝑛) · (𝑁 + 1), where 𝑁 is the
number of values in the list. Now this is fine if the latter is a whole
number, we just pick the number in the list at that position. If however
this calculation produces a fraction, we pick the two whole numbers on
either side of the fraction, pick out the values in the list at those
positions and then calculate the value lying between these values at a
distance between them that is in proportion to the distance that the
number (𝑘/𝑛) · (𝑁 + 1) lies between the two whole numbers on either side
of it. (This is known as taking a linear interpolation.) The process is
best understood with an example:
Suppose we have 10 values:

2, 7, 13, 14, 16, 21, 56, 77, 83, 90

Then the second tertile is the value of rank (2/3) · (10 + 1) = 7⅓. Now this is
a fraction lying between 7 and 8. So we pick out the 7th and 8th values
in the list: 56 and 77. Now 7⅓ lies one third of the way from 7 to 8 and
so our desired value is the number that lies a third of the way from 56
to 77, that is 56 + (1/3) · (77 − 56) = 63.
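The same calculation as a minimal Python sketch; it assumes the fractional rank falls strictly inside the ordered list, which holds for reasonable choices of k and n:

def n_quantile(values, k, n):
    # kth n-quantile via the fractional rank (k/n)·(N + 1) and linear
    # interpolation between the values at the neighbouring whole ranks.
    ordered = sorted(values)
    N = len(ordered)
    rank = (k / n) * (N + 1)
    lower = int(rank)            # whole-number rank just below (ranks are 1-based)
    fraction = rank - lower
    if fraction == 0:
        return ordered[lower - 1]
    below = ordered[lower - 1]
    above = ordered[lower]
    return below + fraction * (above - below)

data = [2, 7, 13, 14, 16, 21, 56, 77, 83, 90]
print(n_quantile(data, 2, 3))   # second tertile: approximately 63 (floating-point rounding)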

Why do we need this, what do I need to remember? Basically taking


quantiles provides a way to understand how the data is distributed – is
it bunched up in some places but spread evenly at others? The finer
we partition the data the clearer the picture we have of how the data
is distributed. You don’t need to memorize the names you only have to
understand the general concept. The more data there is the more need
there is to partition it finer in order to understand its distribution.

Glossary:

Term Definition

Box Plot A graph that displays the five number summary


(quartiles and extreme values) for a dataset using a
box bounded by the upper and lower quartiles and
divided by the median, and indicates the highest and
lowest non-outliers by whiskers.
Category The X-axis variable in a vertical graph that shows how
one variable varies with another. (The Y-axis variable
in the case of a horizontal graph.)
Deciles The quantiles that divide the data into tenths (10%
intervals).
Doughnut Chart A variation on a pie chart using a doughnut shape
instead of a circle.
Duo-Deciles The quantiles that divide the data into twentieths (5%
intervals).
Error Bars Whiskers denoting one sample standard deviation
away from a plotted point representing a mean value.
Used as a measure of how much the value might
deviate from the true population value.
Extreme Outliers Values in a dataset that are more than 3 interquartile
ranges greater than the upper quartile or less than the
lower quartile.
Leaf The units portion of a number in a stem and leaf plot.
(Here units simply means the portion less than the
stem and may have more than a single digit.)
Linear Interpolation A method of filling in values between two values of a
dependent continuous variable that is a function of an
independent continuous variable by placing the filled in
values in proportion to the position of their
corresponding independent variable values.
Max and Min Bars Whiskers denoting maximum and minimum values.

Max – Min Plot A graph showing maximum and minimum values for
samples using max and min bars.
Negatively Skewed Having a larger distance between lower quartile and
median than between upper quartile and median.
Noniles   The quantiles that divide the data into ninths.

Octiles The quantiles that divide the data into eighths.

Outliers Values in a dataset that are more than 3 semi-


interquartile ranges (1.5 interquartile ranges) greater
than the upper quartile or less than the lower quartile.
Percentiles   The quantiles that divide the data into hundredths (1% intervals).
Positively Skewed Having a larger distance between upper quartile and
median than between lower quartile and median.
Quantiles A set of numbers that partition data into equally sized
portions. The 𝑛-quantiles partition the data into 𝑛
portions.
Quintiles The quantiles that divide the data into fifths.

Rank A position of a value in an ordered list. We also use


fractional ranks to denote numbers linearly
interpolated between values in the ordered list.
Septiles The quantiles that divide the data up into sevenths.

Sextiles The quantiles that divide the data into sixths.

Series An additional variable shown on a graph by repeating


the graph for each value of the additional variable.
Skewed Concentrated on one side of the median as opposed
to being distributed evenly around it.
Stacked Bar Chart A bar chart showing an additional dimension by
dividing the bars into segments.
Stem A portion of a number consisting of its tens, hundreds
or thousands etc used in a stem and leaf plot.
Stem and Leaf Plot A listing of quantitative data with numbers broken up
into a stem portion (tens, hundreds or thousands etc)
and a leaf portion consisting of units. The stem values
are listed once on the left and each number in the data
is displayed as its leaf listed in the row corresponding
to the stem.
Whisker A line with a marked end used to indicate values in
graphs of sample of data e.g. lowest and highest non-
outlier in a box plot.
