Professional Documents
Culture Documents
1.2 ATTRIBUTES
A. A numeric attribute is one that has a real-valued or integer-valued domain.
Numeric attributes that take on a finite or countably infinite set of values are called discrete,
whereas those that can take on any real value are called continuous.
• Interval-scaled: For these kinds of attributes only differences (addition or subtraction) make sense. For
example, attribute temperature measured in ◦C or ◦F is interval-scaled.
• Ratio-scaled: Here one can compute both differences as well as ratios between values. For example, for
attribute Age.
B. A categorical attribute is one that has a set-valued domain composed of a set of symbols.
• Nominal: The attribute values in the domain are unordered.
• Ordinal: The attribute values are ordered.
1.3.2 Mean and Total Variance
The mean of the data matrix D is the vector obtained as the average of all the points.
The total variance of the data matrix D is the average squared distance of each point from the mean.
The centered data matrix is obtained by subtracting the mean from all the points.
1.4.2 Multivariate Random Variable
A d-dimensional multivariate random variable X = (X1,X2,...,Xd)T, also called a vector random
variable, is defined as a function that assigns a vector of real numbers to each outcome in the sample
space
1.4.3 Random Sample and Statistics
The probability mass or density function of a random variable X may follow some known form, or
as is often the case in data analysis, it may be unknown.
In statistics, the word population is used to refer to the set or universe of all entities under study.
Usually we are interested in certain characteristics or parameters of the entire population (e.g., the mean
age of all computer science students in the United States)
Statistic
We can estimate a parameter of the population by defining an appropriate sample statistic, which is
defined as a function of the sample.
Data mining comprises the core algorithms that enable one to gain fundamental insights and
knowledge from massive data. It is an interdisciplinary field merging concepts from allied areas such as
database systems, statistics, machine learning, and pattern recognition.
Exploratory data analysis aims to explore the numeric and categorical attributes of the data
individually or jointly to extract key characteristics of the data sample via statistics that give information
about the centrality, dispersion, and so on.
Clustering is the task of partitioning the points into natural groups called clusters, such that points
within a group are very similar, whereas points across clusters are as dissimilar as possible.
The classification task is to predict the label or class for a given unlabeled point.
CHAPTER 2
Univariate analysis focuses on a single attribute at a time; thus the data matrix D can be thought
of as an n×1 matrix, or simply a column vector.
2.1.1 Measures of Central Tendency
These measures given an indication about the concentration of the probability mass, the “middle”
values, and so on.
The mean, also called the expected value, of a random variable X is the arithmetic average of the
values of X. It provides one-number summary of the location or central tendency for the distribution of X.
The median m is the “middle-most” value.
The mode of a random variable X is the value at which the probability mass function or the
probability density function attains its maximum value.
The measures of dispersion give an indication about the spread or variation in the values of a
random variable.
The value range or simply range of a random variable X is the difference between the maximum
and minimum values of X.
Quartiles are special values of the quantile function.
The variance of a random variable X provides a measure of how much the values of X deviate
from the mean or expected value of X.
The standard deviation, σ, is defined as the positive square root of the variance, σ2.