
ASSIGNMENT

ON

DATA MINING
TOPIC: MINING COMPLEX TYPES OF DATA

BY
P.ISWARYA
II MSC (CS)
CONTENTS

• INTRODUCTION
• MULTIDIMENSIONAL ANALYSIS AND
DESCRIPTIVE MINING OF COMPLEX
DATA OBJECTS.
• SPATIAL DATA MINING
• MULTIMEDIA DATA MINING
• CONCLUSION
• REFERENCES
1. Introduction
An increasingly important task in data mining is to
mine complex types of data. This includes the multidimensional and
descriptive analysis of complex objects, spatial data mining,
multimedia data mining, time-series data mining, text database
mining, and mining the World Wide Web. This assignment discusses
the first three types.

2. Multidimensional and descriptive mining of
complex data objects
To introduce data mining and multidimensional data
analysis for complex objects, this section examines how to perform
generalization on complex structured objects and construct object
cubes for OLAP and mining in object databases.
The storage and access of complex structured data
have been studied in object relational and object oriented database
systems. These systems organize a large set of complex data
objects into classes, which are in turn organized into class/subclass
hierarchies. Each object in a class is associated with (1) an object
identifier, (2) a set of attributes that may contain sophisticated data
structures, set-or list-valued data, class composition hierarchies,
multimedia data, and so on and (3) a set of methods that specify
the computational routines or rules associated with the object class.

2.1 Generalization of structured data


An important feature of object relational and object
oriented databases is their capability of storing, accessing and
modeling complex structure-valued data, such as set-valued and
list-valued data and data with nested structures.
A set-valued attribute may be of homogeneous or
heterogeneous type. Typically, set-valued data can be generalized
by (1) generalization of each value in the set into its corresponding
higher-level concepts or (2) derivation of the general behavior of
the set, such as the number of elements in the set, the type or value
ranges in the set, or the weighted average for numerical data.
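The two approaches to set-valued generalization can be sketched in a few lines of Python. This is only an illustrative sketch: the hobby values and the concept hierarchy `HOBBY_HIERARCHY` are hypothetical and do not come from any particular database.

```python
from collections import Counter

# Hypothetical concept hierarchy: low-level value -> higher-level concept.
HOBBY_HIERARCHY = {
    "tennis": "sports",
    "hockey": "sports",
    "chess": "games",
    "nintendo": "games",
    "violin": "music",
}

def generalize_values(values, hierarchy):
    """Approach (1): map each element of the set to its higher-level concept."""
    return {hierarchy.get(v, "other") for v in values}

def summarize_set(values, hierarchy):
    """Approach (2): derive the general behavior of the set (size, concept counts)."""
    counts = Counter(hierarchy.get(v, "other") for v in values)
    return {"size": len(values), "concept_counts": dict(counts)}

hobbies = {"tennis", "hockey", "chess", "nintendo", "violin"}
print(generalize_values(hobbies, HOBBY_HIERARCHY))
print(summarize_set(hobbies, HOBBY_HIERARCHY))
```

A list-valued attribute could be handled the same way by returning a list instead of a set, so that the order of the elements is preserved.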
A list-valued or a sequence-valued attribute can be
generalized in a manner similar to that for set-valued attributes
except that the order of the elements in the sequence should be
observed in the generalization.
A complex structure-valued attribute may contain sets,
tuples, lists, trees, records and so on, and their combinations,
where one structure may be nested in another at any level. In
general, a structure-valued attribute can be generalized in several
ways, such as (1) generalizing each attribute in the structure while
maintaining the shape of the structure, (2) flattening the structure, (3)
summarizing the low-level structures by high-level concepts, or (4)
returning the type or an overview of the structure.
2.2 Aggregation and Approximation in spatial and
multimedia data generalization
Aggregation and approximation are important
techniques for this form of generalization. In a spatial merge, it is
necessary to not only merge the regions of similar types within the
same general class but also compute the total areas, average
density, or other aggregate functions while ignoring some scattered
regions with different types if they are unimportant to study. Other
spatial operators such as spatial-union, spatial-overlapping and
spatial intersection, which may require merging of scattered small
regions into large, clustered regions, can also use spatial
aggregation and approximation as data generalization operators.
A multimedia database may contain complex texts,
graphics, images, video fragments, maps, voice, music and other
forms of audio/video information. Multimedia data are typically
stored as sequences of bytes with variable lengths, and segments of
data are linked together or indexed in a multidimensional way for
easy reference.
Generalization on multimedia data can be performed by
recognition and extraction of the essential features and/or general
patterns of such data. There are many ways to extract such
information. For an image, the size, color, shape, texture,
orientation and relative positions and structures of the contained
objects or regions in the image can be extracted by aggregation
and/or approximation.

2.3 Generalization of object identifiers and
class/subclass hierarchies
First, the object identifier is generalized to the identifier
of the lowest subclass to which the object belongs. The identifier of this
subclass can then, in turn, be generalized to a higher-level
class/subclass identifier by climbing up the class/subclass
hierarchy. Similarly a class or subclass can be generalized to its
corresponding super classes by climbing up to its associated
class/subclass hierarchy.
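The climbing procedure can be illustrated with a small Python sketch. The class names, the `PARENT` map, and the object identifiers below are hypothetical, chosen only to show the two steps (identifier to lowest subclass, then subclass to superclass).

```python
# Hypothetical class/subclass hierarchy: child class -> parent class.
PARENT = {
    "graduate_student": "student",
    "undergraduate_student": "student",
    "student": "person",
    "professor": "person",
}

# Hypothetical mapping from an object identifier to its lowest subclass.
OBJECT_CLASS = {"oid_17": "graduate_student", "oid_42": "professor"}

def generalize_oid(oid, levels_up=0):
    """Generalize an object identifier, then climb the class hierarchy."""
    cls = OBJECT_CLASS[oid]          # step 1: lowest subclass of the object
    for _ in range(levels_up):       # step 2: climb toward the root class
        if cls not in PARENT:
            break                    # already at the top of the hierarchy
        cls = PARENT[cls]
    return cls

print(generalize_oid("oid_17"))      # graduate_student
print(generalize_oid("oid_17", 1))   # student
print(generalize_oid("oid_17", 2))   # person
```

This sketch assumes single inheritance; with multiple inheritance each class would map to a set of parents and the climb would follow a lattice rather than a tree.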
Inherited properties of objects can be generalized as well.
Some object-oriented database systems allow multiple inheritance,
where properties can be inherited from more than one
superclass when the class/subclass “hierarchy” is organized in the
shape of a lattice. The inherited properties of an object can be
derived by query processing in the object-oriented database.
2.4 Generalization of class composition hierarchies
Generalization on a class composition hierarchy can be
viewed as generalization on a set of nested structured data.
However, in most cases, the longer the sequence of references
traversed, the weaker the semantic linkage between the original object
and the referenced composite object. That is, in order to discover
interesting knowledge, generalization should be performed only on the
objects in the class composition hierarchy that are closely related
in semantics to the currently focused class, and not on those that have
only remote and rather weak semantic linkages.
2.5 Construction and mining of object cubes
The generalization of multidimensional attributes of a
complex object class can be performed by examining each
attribute, generalizing each attribute to simple-valued data, and
constructing a multidimensional data cube called an object cube.
Once an object cube is constructed, multidimensional analysis and
data mining can be performed on it in a manner similar to that for
relational data cubes.
2.6 Generalization-based mining of plan databases
by divide-and-conquer
Plan: a variable sequence of actions
E.g., Travel (flight) : <traveler, departure, arrival, d-time, a-time,
airline, price, seat>
Plan mining: extraction of important or significant generalized
(sequential) patterns from a plan base (a large collection of plans)
E.g., Discover travel patterns in an air flight database, or find
significant patterns from the sequences of actions in the repair of
automobiles.
Method: Attribute-oriented induction on sequence data
Divide & conquer: Mine characteristics for each subsequence
A Travel Database for Plan Mining
Example: Mining a travel plan base
Travel plans table

plan#  action#  departure  depart_time  arrival  arrival_time  airline  …
1      1        ALB        800          JFK      900           TWA      …
1      2        JFK        1000         ORD      1230          UA       …
1      3        ORD        1300         LAX      1600          UA       …
1      4        LAX        1710         SAN      1800          DAL      …
2      1        SPI        900          ORD      950           AA       …
.      .        .          .            .        .             .        .
.      .        .          .            .        .             .        .
.      .        .          .            .        .             .        .

Airport info table

airport_code city state region airport_size …


1 1 ALB 800 …
1 2 JFK 1000 …
1 3 ORD 1300 …
1 4 LAX 1710 …
2 1 SPI 900 …
. . . . .
. . . . .
. . . . .

Multidimensional Analysis
Strategy
• Generalize the plan base in different directions.
• Look for sequential patterns in the generalized plans.
• Derive high-level plans.
A multi-D model for the plan base

Multidimensional Generalization
Multi-D generalization of the plan base

Plan#  Loc_Seq                        Size_Seq   State_Seq
1      ALB - JFK - ORD - LAX - SAN    S-L-L-L-S  N-N-I-C-C
2      SPI - ORD - JFK - SYR          S-L-L-S    I-I-N-N
.      .                              .          .
.      .                              .          .
.      .                              .          .

Merging consecutive, identical actions in plans

Plan#  Size_Seq    State_Seq    Region_Seq   …
1      S - L+ - S  N+ - I - C+  E+ - M - P+  …
2      S - L+ - S  I+ - N+      M+ - E+      …
.      .           .            .
.      .           .            .
.      .           .            .
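The merge step shown in the table above can be sketched as follows. This is a minimal illustration, assuming that a run of two or more identical symbols is written as the symbol followed by "+" and that a single occurrence is left unchanged, which is how the example rows appear to be formed.

```python
from itertools import groupby

def merge_consecutive(seq):
    """Collapse runs of identical symbols; runs longer than one get a '+'."""
    merged = []
    for symbol, run in groupby(seq):
        length = sum(1 for _ in run)
        merged.append(symbol + "+" if length > 1 else symbol)
    return merged

size_seq = ["S", "L", "L", "L", "S"]     # plan 1, Size_Seq
state_seq = ["N", "N", "I", "C", "C"]    # plan 1, State_Seq

print(" - ".join(merge_consecutive(size_seq)))    # S - L+ - S
print(" - ".join(merge_consecutive(state_seq)))   # N+ - I - C+
```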

flight(x, y) ∧ airport_size(x, S) ∧ airport_size(y, L) ⇒ region(x) = region(y) [75%]
Generalization-Based Sequence Mining

• Generalize the plan base in a multidimensional way using
dimension tables
• Use # of distinct values (cardinality) at each level to
determine the right level of generalization (level-“planning”)
• Use operators merge “+”, option “[]” to further generalize
patterns.
• Retain patterns with significant support.
Generalized Sequence Patterns
• Airport Size-sequence survives the min threshold (after
applying merge operator):
S-L+-S [35%], L+-S [30%], S-L+ [24.5%], L+ [9%]
• After applying option operator:
[S]-L+-[S] [98.5%]
• Most of the time, people fly via large airports to get to their final
destination.
• Other plans: the remaining 1.5% follow other patterns, such as S-S and
L-S-L.
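The effect of the option operator on the supports listed above can be checked with a short Python sketch. Encoding [S]-L+-[S] as a regular expression is an assumption made here for illustration; the support values are the ones given in the text.

```python
import re

# Supports (in percent) of the merged airport size sequences from the text.
supports = {"S-L+-S": 35.0, "L+-S": 30.0, "S-L+": 24.5, "L+": 9.0}

# [S]-L+-[S]: an optional leading S, one L+ block, an optional trailing S.
OPTION_PATTERN = re.compile(r"^(S-)?L\+(-S)?$")

def option_support(pattern_supports, regex):
    """Sum the supports of all patterns covered by the optional pattern."""
    return sum(s for p, s in pattern_supports.items() if regex.match(p))

total = option_support(supports, OPTION_PATTERN)
print(f"[S]-L+-[S] support: {total}%")
```

The four merged patterns are all covered by the single optional pattern, which is why their supports add up to the 98.5% reported for [S]-L+-[S].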
3. Mining spatial databases
A spatial database stores a large amount of space-related
data, such as maps, preprocessed remote sensing or medical
imaging data, and VLSI chip layout data. Spatial data mining
refers to the extraction of knowledge, spatial relationships, or other
interesting patterns not explicitly stored in spatial databases. Such
mining demands an integration of data mining with spatial
database technologies.
3.1 Spatial data warehouse
A spatial data warehouse is an integrated, subject-oriented,
time-variant, and nonvolatile spatial data repository for data
analysis and decision-making processes.
Spatial data integration: a big issue
• Structure-specific formats (raster- vs. vector-based, OO vs.
relational models, different storage and indexing, etc.)
• Vendor-specific formats (ESRI, MapInfo, Intergraph, etc.)
Spatial data cube: multidimensional spatial database
• Both dimensions and measures may contain spatial
components.
Dimensions and Measures in Spatial Data Warehouse
There are three types of dimensions in a spatial data cube.
A nonspatial dimension contains only nonspatial data.
E.g. temperature: 25-30 degrees generalizes to “hot”.
A spatial-to-nonspatial dimension is a dimension whose primitive
level data are spatial but whose generalization, starting at a certain
level, becomes nonspatial.
E.g. region “B.C.” generalizes to the description “western provinces”.
A spatial-to-spatial dimension is a dimension whose primitive
level and all of its high-level generalized data are spatial.
E.g. region “Burnaby” generalizes to region “Lower Mainland”.
Measures:
There are two types of measures in a spatial data cube.
A numerical measure contains only numerical data. It may be
i) distributive (e.g. count, sum),
ii) algebraic (e.g. average), or
iii) holistic (e.g. median, rank).
A spatial measure contains a collection of pointers to spatial
objects.
E.g. pointers to all regions with temperatures of 25-30
degrees in July.
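The difference between the three kinds of numerical measure shows up when a cube is computed from partitions. The sketch below uses hypothetical temperature readings split into two partitions; it is meant only to illustrate the definitions, not to describe any particular cube implementation.

```python
from statistics import median

# Hypothetical temperature readings stored in two cube partitions.
part_a = [25.0, 27.0, 30.0]
part_b = [26.0, 29.0]

# Distributive: the overall result is the same function applied to sub-results.
total_count = len(part_a) + len(part_b)
total_sum = sum(part_a) + sum(part_b)

# Algebraic: derived from a fixed number of distributive results (sum, count).
average = total_sum / total_count

# Holistic: needs the full data; combining per-partition medians is not enough.
overall_median = median(part_a + part_b)
median_of_medians = median([median(part_a), median(part_b)])

print(total_count, total_sum, average)
print(overall_median, median_of_medians)
```

Here the median computed from the per-partition medians differs from the true median, which is why holistic measures are the most expensive to maintain in a data cube.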
Example: BC weather pattern analysis
Input
• A map with about 3,000 weather probes scattered in B.C.
• Daily data for temperature, precipitation, wind velocity, etc.
• Concept hierarchies for all attributes
Output
• A map that reveals patterns: merged (similar) regions
Goals
• Interactive analysis (drill-down, slice, dice, pivot, roll-up)
• Fast response time
• Minimizing storage space used
Challenge
• A merged region may contain hundreds of “primitive”
regions (polygons)
Star Schema of the BC Weather Warehouse
Spatial data warehouse
Dimensions
o Region_name
o Temperature
o Precipitation
Measurements
o Region_map
o Area
o Count

(Figure: star schema showing the fact table and dimension tables)

Spatial merge
• Precomputing all: too much storage space
• On-line merge: very expensive
Methods for Computation of Spatial Data Cube
On-line aggregation: collect and store pointers to spatial objects in
a spatial data cube.
• Expensive and slow; needs efficient aggregation
techniques.
Precompute and store all the possible combinations.
• Huge space overhead
Precompute and store rough approximations in a spatial data
cube (e.g. using grids).
• Accuracy trade-off
Selective computation: only materialize those combinations that will be
accessed frequently.
• A reasonable choice

3.2 Spatial Association Analysis


Spatial association rule: A ⇒ B [s%, c%]
A and B are sets of spatial or nonspatial predicates
• Topological relations: intersects, overlaps, disjoint, etc.
• Spatial orientations: left_of, west_of, under, etc.
• Distance information: close_to, within_distance, etc.
s% is the support and c% is the confidence of the rule
Examples
is_a(x, large_town) ∧ intersect(x, highway) ⇒ adjacent_to(x, water) [7%, 85%]
is_a(x, large_town) ∧ adjacent_to(x, georgia_strait) ⇒ close_to(x, u.s.a.) [1%, 78%]
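Support and confidence of such a rule can be computed once each spatial object is described by the set of predicates it satisfies. The objects and predicate names below are hypothetical; in practice the predicates would be evaluated by spatial operations on the database.

```python
# Hypothetical spatial objects, each described by the predicates it satisfies.
objects = [
    {"is_a_large_town", "intersects_highway", "adjacent_to_water"},
    {"is_a_large_town", "intersects_highway", "adjacent_to_water"},
    {"is_a_large_town", "intersects_highway"},
    {"is_a_large_town"},
    {"adjacent_to_water"},
]

def rule_stats(objects, antecedent, consequent):
    """Return (support, confidence) of antecedent => consequent as fractions."""
    n_a = sum(1 for o in objects if antecedent <= o)
    n_ab = sum(1 for o in objects if (antecedent | consequent) <= o)
    support = n_ab / len(objects)
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence

s, c = rule_stats(
    objects,
    antecedent={"is_a_large_town", "intersects_highway"},
    consequent={"adjacent_to_water"},
)
print(f"support = {s:.0%}, confidence = {c:.0%}")
```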
Progressive Refinement Mining of Spatial Association Rules
Hierarchy of spatial relationships:
• g_close_to: near_by, touch, intersect, contain, etc.
• First search for a rough relationship and then refine it
Two-step mining of spatial association:
Step 1: Rough spatial computation (as a filter)
Use a minimum bounding rectangle (MBR) or an R-tree for rough estimation.
Step 2: Detailed spatial algorithm (as refinement)
Apply it only to those objects that have passed the rough spatial
association test (no less than min_support)
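The filter-and-refine idea can be sketched in Python. This sketch makes several simplifying assumptions: objects are plain point sets with hypothetical names, the rough test uses MBR-to-MBR distance with a linear scan rather than an R-tree, and the detailed test is the exact minimum distance between the point sets.

```python
from math import hypot

def mbr(points):
    """Minimum bounding rectangle as (xmin, ymin, xmax, ymax)."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)

def mbr_distance(a, b):
    """Minimum distance between two rectangles (0 if they overlap)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    dx = max(bx1 - ax2, ax1 - bx2, 0.0)
    dy = max(by1 - ay2, ay1 - by2, 0.0)
    return hypot(dx, dy)

def exact_distance(p, q):
    """Exact minimum distance between two point sets (the costly step)."""
    return min(hypot(x1 - x2, y1 - y2) for x1, y1 in p for x2, y2 in q)

def close_pairs(objects, threshold):
    names = sorted(objects)
    boxes = {n: mbr(objects[n]) for n in names}
    # Step 1: rough filter on MBRs; it never discards a truly close pair.
    candidates = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
                  if mbr_distance(boxes[a], boxes[b]) <= threshold]
    # Step 2: detailed computation only on the surviving candidates.
    refined = [(a, b) for a, b in candidates
               if exact_distance(objects[a], objects[b]) <= threshold]
    return candidates, refined

objects = {
    "town": [(0, 0), (4, 0), (0, 4)],
    "park": [(5, 5), (6, 5), (5, 6)],
    "highway": [(4.5, 0.5), (6, 1)],
}
candidates, refined = close_pairs(objects, threshold=2.0)
print(candidates)
print(refined)
```

In this example the town and the park pass the rough MBR test but are rejected by the detailed test, which shows why the second step is still needed.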
3.3 Spatial clustering methods
Spatial data clustering identifies clusters, or densely populated
regions, according to some distance measurement in a large,
multidimensional data set.
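As a simple illustration of distance-based clustering, the sketch below groups points into connected components in which neighboring points lie within a distance `eps` of each other. This is a deliberately simple stand-in chosen for this example, not a specific algorithm named in the text, and the sample points are hypothetical.

```python
from math import dist

def threshold_clusters(points, eps):
    """Group point indices into components linked by distances <= eps."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            near = [j for j in unvisited if dist(points[i], points[j]) <= eps]
            for j in near:
                unvisited.remove(j)
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(sorted(cluster))
    return sorted(clusters)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(threshold_clusters(points, eps=1.5))
```

Each pass over the unvisited points costs linear time, so this sketch is quadratic overall; a real spatial system would use a spatial index to find neighbors.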
3.4 Spatial Classification and Spatial Trend Analysis
Spatial classification
• Analyze spatial objects to derive classification schemes,
such as decision trees, in relevance to certain spatial
properties (district, highway, river, etc.)
• Example: Classify regions in a province into rich vs.
poor according to the average family income
Spatial trend analysis
• Detect changes and trends along a spatial dimension
• Study the trend of nonspatial or spatial data changing
with space
• Example: Observe the trend of changes of the climate
or vegetation with increasing distance from an
ocean.
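One simple way to quantify a spatial trend is to fit a straight line to a nonspatial attribute as a function of distance. The least-squares sketch below uses hypothetical rainfall readings at increasing distances from the ocean; linear regression is an assumption made for this example, and other trend models are possible.

```python
def linear_trend(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

distance_km = [0, 50, 100, 150, 200]        # distance from the ocean
rainfall_mm = [1200, 1000, 800, 600, 400]   # hypothetical yearly rainfall

slope, intercept = linear_trend(distance_km, rainfall_mm)
print(f"rainfall changes by {slope:.1f} mm per km inland")
```

A negative slope indicates that the attribute decreases with increasing distance along the chosen spatial dimension.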
3.5 Mining raster databases
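Spatial database systems usually handle vector data
that consist of points, lines, polygons, and their compositions,
such as networks or partitions. Typical examples include maps,
design graphs, and 3-D representations of the arrangement of the
chains of protein molecules. A large amount of space-related data,
however, is in raster (image) form, such as satellite images and
remote sensing data, so it is also important to mine raster or
image databases.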

4. Mining multimedia databases


A multimedia database system stores and manages a
large collection of multimedia objects, such as audio data, image
data, video data, sequence data and hypertext data, which contain
text, text markups and linkages.
4.1 Similarity search in multimedia data
Description-based retrieval systems
• Build indices and perform object retrieval based on image
descriptions, such as keywords, captions, size, and time of
creation
• Labor-intensive if performed manually.
• Results are typically of poor quality if automated.
Content-based retrieval systems
• Support retrieval based on the image content, such as color
histogram, texture, shape, objects, and wavelet transforms.

Queries in Content-Based Retrieval Systems


Image sample-based queries
• Find all of the images that are similar to the given image
sample.
• Compare the feature vector (signature) extracted from the
sample with the feature vectors of images that have already
been extracted and indexed in the image database.
Image feature specification queries
• Specify or sketch image features like color, texture, or shape,
which are translated into a feature vector.
• Match the feature vector with the feature vectors of the
images in the database.
Approaches Based on Image Signature
Color histogram-based signature
1. The signature includes color histograms based on color
composition of an image regardless of its scale or orientation.
2. No information about shape, location, or texture.
3. Two images with similar color composition may contain very
different shapes or textures, and thus could be completely
unrelated in semantics.
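The limitation of a color histogram signature is easy to demonstrate. In the sketch below, images are tiny hypothetical grids of color names and the L1 distance between normalized histograms is used as the comparison; both choices are assumptions made for illustration.

```python
from collections import Counter

def color_histogram(image):
    """Normalized histogram of pixel colors; pixel positions are ignored."""
    pixels = [p for row in image for p in row]
    counts = Counter(pixels)
    return {color: n / len(pixels) for color, n in counts.items()}

def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms (0 means identical)."""
    colors = set(h1) | set(h2)
    return sum(abs(h1.get(c, 0.0) - h2.get(c, 0.0)) for c in colors)

# Two hypothetical images with the same colors arranged differently.
stripes = [["blue", "blue"], ["red", "red"]]
checker = [["blue", "red"], ["red", "blue"]]
meadow = [["green", "green"], ["green", "green"]]

d_same = histogram_distance(color_histogram(stripes), color_histogram(checker))
d_diff = histogram_distance(color_histogram(stripes), color_histogram(meadow))
print(d_same, d_diff)
```

The striped and checkered images have a distance of zero even though their layouts differ, which is exactly the weakness described in point 3.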
Multifeature composed signature
Define different distance functions for color, shape, location, and
texture, and subsequently combine them to derive the overall result.
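One possible way to combine the per-feature distances is a weighted average. The text does not say how the distances are combined, so the weighting scheme, the weights, and the distance values below are all assumptions made for this sketch.

```python
def combined_distance(feature_distances, weights):
    """Weighted average of per-feature distances (weights need not sum to 1)."""
    total_weight = sum(weights.values())
    return sum(weights[f] * feature_distances[f] for f in weights) / total_weight

# Hypothetical per-feature distances between a query image and a stored image.
distances = {"color": 0.2, "shape": 0.6, "location": 0.1, "texture": 0.5}
# Hypothetical weights: color counts twice as much as the other features.
weights = {"color": 2.0, "shape": 1.0, "location": 1.0, "texture": 1.0}

overall = combined_distance(distances, weights)
print(round(overall, 2))
```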
Wavelet-based signature
• Use the dominant wavelet coefficients of an image as
its signature.
• Wavelets capture shape, texture, and location
information in a single unified framework.
• Improves efficiency and reduces the need for providing
multiple search primitives.
• May fail to identify images containing similar objects
that are in different locations.
Wavelet-based signature with region-based granularity
• Define regions by clustering signatures of windows of
varying sizes within the image.
• The signature of a region is the centroid of the cluster.
• Similarity is defined in terms of the fraction of the area
of the two images covered by matching pairs of regions
from the two images.
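The area-based similarity measure can be sketched as follows. The sketch assumes that each region is given as a (signature, area) pair, that regions within an image do not overlap, and that two regions match when their signatures lie within a distance `eps`; the signatures and areas are hypothetical.

```python
from math import dist

def region_similarity(regions_a, regions_b, eps):
    """Fraction of the total area of both images covered by matching regions."""
    matched_a = {i for i, (sa, _) in enumerate(regions_a)
                 if any(dist(sa, sb) <= eps for sb, _ in regions_b)}
    matched_b = {j for j, (sb, _) in enumerate(regions_b)
                 if any(dist(sa, sb) <= eps for sa, _ in regions_a)}
    covered = (sum(regions_a[i][1] for i in matched_a)
               + sum(regions_b[j][1] for j in matched_b))
    total = sum(a for _, a in regions_a) + sum(a for _, a in regions_b)
    return covered / total

# Hypothetical regions: (signature vector, area in pixels).
image_a = [((0.10, 0.20), 60), ((0.90, 0.90), 40)]
image_b = [((0.12, 0.21), 50), ((0.50, 0.10), 50)]

similarity = region_similarity(image_a, image_b, eps=0.05)
print(round(similarity, 2))
```

Because similarity is computed per region, two images can be judged similar even when the matching objects appear at different locations.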
4.2 Multidimensional Analysis of Multimedia Data
Multimedia data cube
1. Design and construction similar to that of traditional data cubes
from relational data.
2. Contain additional dimensions and measures for multimedia
information, such as color, texture, and shape

The database does not store images but their descriptors


Feature descriptor: a set of vectors for each visual characteristic
Color vector: contains the color histogram.
MFC (Most Frequent Color) vector: five color centroids.
MFO (Most Frequent Orientation) vector: five edge orientation
centroids.
Layout descriptor: contains a color layout vector and an edge
layout vector.
Classifier module of Multimedia Miner and its output
4.3 Classification and prediction analysis in
multimedia data
Classification and predictive modeling have been used
for mining multimedia data, especially in scientific research, such
as astronomy, seismology, and geoscientific research. Decision
tree classification is an essential data mining method in reported
image data mining applications.
4.4 Mining Associations in Multimedia Data
Associations between image content and non-image
content features
“If at least 50% of the upper part of the picture is blue, then it is
likely to represent sky.”
Associations among image contents that are not related
to spatial relationships
“If a picture contains two blue squares, then it is likely to contain
one red circle as well.”
Associations among image contents related to spatial
relationships
“If a red triangle is between two yellow squares, then it is likely a
big oval-shaped object is underneath.”
5. Conclusion
Vast amounts of data in various complex forms (structured,
unstructured, hypertext, and multimedia) have been growing
explosively owing to the rapid progress of data collection tools,
advanced database system technologies, and World Wide Web
technologies.

6. References
“Data Mining: Concepts and Techniques” by Jiawei Han and
Micheline Kamber.
Websites:
http://www.slideshare.net/achmatim/08-mining-type-of-complex-data
http://www.spatial.cs.umn.edu/
http://www.warez-files.com/direct-download/2109784-Managing-and-Mining-Multimedia-Databases.html
