Slide 2 - Basic Tasks

Lecture
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
Topic
Intro, schedule, and logistics
Applications of visual analytics, basic tasks, data types
Introduction to D3, basic vis techniques for non-spatial data
Visual perception and cognition
Visual design and aesthetics
Data types, notion of similarity and distance
Data preparation and reduction
Introduction to R, statistics foundations
Data mining techniques: clusters, text, patterns, classifiers
Data mining techniques: clusters, text, patterns, classifiers
Computer graphics and volume rendering
Techniques to visualize spatial (3D) data
Scientific and medical visualization
Scientific and medical visualization
Midterm #1
High-dimensional data, dimensionality reduction
Big data: data reduction, summarization
Correlation and causal modeling
Principles of interaction
Visual analytics and the visual sense making process
Evaluation and user studies
Visualization of time-varying and time-series data
Visualization of streaming data
Visualization of graph data
Visualization of text data
Midterm #2
Data journalism
Final project presentations
Projects
Project #1 out
Project #1 due
Project #2 out
Project #2 due
Project #3 out
Project #3 due
Final project proposal due
Final Project preliminary report due
Final Project slides and final report due
Numeric
Categorical
Text
Time series
Graphs and networks
Hierarchies
Numeric variables
measure a quantity as a number

like: how many' or 'how much'
can be continuous (grey curve)
or discrete (red steps)
Categorical variables
describe a quality or characteristic

like: 'what type' or 'which category
can be ordinal = ordered, ranked (distances need not be equal)
clothing size, academic grades, levels of agreement
or nominal = not organized into a logical sequence
gender, business type, eye color, brand
But not everything is expressed in numbers
images
video
text
web logs
Need to do feature analysis to turn these abstract things into

numbers
then apply your analysis as usual

but keep the reference to the original data so you can return to
the native domain where the analysis problem originated
Characteristics
often large scale

time series
Feature Analysis
Fourier transform (FT, FFT)

Wavelet transform (WT, FWT)
Motif discovery
Fourier transform
Motif discovery
Characteristics
histograms
array of pixels
Feature Analysis
histograms
values
gradients
FFT, FWT
Scale Invariant Feature
Transform (SIFT)
Bag of Features (BoF)
visual words
SIFT
Characteristics
essentially a time series of images
Feature Analysis
many of the above techniques apply albeit extension is non-trivial
1. Obtain the set of bags of features

(i) Select a large set of images
(ii) Extract the SIFT feature points of all the images in the set and obtain
the SIFT descriptor for each feature point extracted from each image
(iii) Cluster the set of feature descriptors for the amount of bags we
defined and train the bags with clustered feature descriptors
(iv) Obtain the visual vocabulary
2. Obtain the BoF descriptor for a given image/video frame

(v) Extract SIFT feature points of the given image
(vi) Obtain SIFT descriptor for each feature point
(vii) Match the feature descriptors with the vocabulary we created in the
first step
More information
(viii) Build the histogram
Weblogs
typically represented as text strings in a pre-specified format

this makes it easy to convert them into multidimensional
representation of categorical and numeric attributes
Network traffic
characteristics of the network packets are used to analyze

intrusions or other interesting activity
a variety of features may be extracted from these packets
the number of bytes transferred
the network protocol used
IP ports used
Characteristics
often raw and unstructured
Feature analysis
first step is to remove stop words and stem the data

perform named-entity recognition to gain atomic elements
identify names, locations, actions, numeric quantities, relations
understand the structure of the sentence and complex events
example:
Jim bought 300 shares of Acme Corp. in 2006.
[Jim]Person bought [300 shares] Quantity of [Acme Corp.]Organiz. in
[2006]Time
distinguish between
application of grammar rules (old style, need experienced linguists)
statistical models (Google etc., need big data to build)
Example: Stakeholders of a library system
Example: Stakeholders of a library system

Questions you might have
how large is each stakeholder group?

tree with quantities
what fraction is each group with respect to the entire group of
stakeholders?
partition of unity
how is information disseminated among the stakeholders?
information flow
how close (or distant) are the individual stakeholders in terms of
some metric?
force directed layout
More scalable tree, and natural with some randomness
http://animateddata.co.uk/lab/d3-tree/
A standard tree, but one that is scalable to large hierarchies
http://mbostock.github.io/d3/talk/20111018/tree.html
A tree that is scalable and has partial partition of unity
http://mbostock.github.io/d3/talk/20111018/partition.html
More space efficient since its radial, has partial partition of

unity
http://mbostock.github.io/d3/talk/20111116/forcecollapsible.html
No hierarchy information, just quantities
http://bl.ocks.org/mbostock/4063269
Quantities and containment, but not partition of unity
http://mbostock.github.io/d3/talk/20111116/packhierarchy.html
Quantities, containment, and full partition of unity
http://mbostock.github.io/d3/talk/20111018/treemap.html
Relationships among group fractions, not necessarily a tree
Relationships of individual group members, also in terms of

quantitative measures such as information flow
http://mbostock.github.io/d3/talk/20111116/bundle.html
Relationships within organization members expressed as

distance and proximity
http://mbostock.github.io/d3/talk/20111116/forcecollapsible.html
Shows the closest point on the plane for a given set of

points and a new point via interaction
which you may need to convert into
using these methods
Goal
divide the ranges of the numeric attribute into ranges

examples:
age: [0, 10], [11, 20], [21, 30], 0, 1, 2, 3,
salary: [40,000, 80,000], . [1,040,000, 1,080,000] 1, , 26,
what is lost here?
distribution information within group
data might be distributed unevenly (less for age, more for salary)
what can we do?
store a value that characterizes the distribution
use log scaling (when the attribute shows an exponential
distribution)
use equi-depth range store the same number of records per bin
any disadvantages?
If a categorical attribute has different values, then

different binary attributes are created
This can have disadvantages
ordinal data are not spaced by actual distance, whatever the

metric
nominal data have no order and spacing at all
there are techniques based on correlation (stay tuned)
Use Latent Semantic Analysis (LSA)
also known as Latent Semantic Indexing (LSI)

turns text into a high-dimensional vector which can be compared
LSA
Count Matrix
Word/document cluster
Maps the frequency of words in a corpus to size
https://www.jasondavies.com/wordcloud/
Time series to discrete sequence

data
window based averaging

eliminates noise in the
data
value-based discretization
collapse series into a
smaller number of intervals
usually done after
averaging
Time series to numeric data
FFT, DFT
also works for discrete series
and spatial data

Slide 2 - Basic Tasks

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Slide 2 - Basic Tasks

Uploaded by

Copyright:

Available Formats

Lecture

Final project proposal due

Final Project preliminary report due

Final Project slides and final report due

measure a quantity as a number

describe a quality or characteristic

But not everything is expressed in numbers

Need to do feature analysis to turn these abstract things into

then apply your analysis as usual

often large scale

Fourier transform (FT, FFT)

essentially a time series of images

many of the above techniques apply albeit extension is non-trivial

1. Obtain the set of bags of features

2. Obtain the BoF descriptor for a given image/video frame

typically represented as text strings in a pre-specified format

characteristics of the network packets are used to analyze

often raw and unstructured

first step is to remove stop words and stem the data

Example: Stakeholders of a library system

Example: Stakeholders of a library system

how large is each stakeholder group?

More scalable tree, and natural with some randomness

A standard tree, but one that is scalable to large hierarchies

A tree that is scalable and has partial partition of unity

More space efficient since its radial, has partial partition of

No hierarchy information, just quantities

Quantities and containment, but not partition of unity

Quantities, containment, and full partition of unity

Relationships among group fractions, not necessarily a tree

Relationships of individual group members, also in terms of

Relationships within organization members expressed as

Shows the closest point on the plane for a given set of

which you may need to convert into

using these methods

divide the ranges of the numeric attribute into ranges

If a categorical attribute has different values, then

ordinal data are not spaced by actual distance, whatever the

Use Latent Semantic Analysis (LSA)

also known as Latent Semantic Indexing (LSI)

Maps the frequency of words in a corpus to size

Time series to discrete sequence

window based averaging

Time series to numeric data

You might also like