Computational Intelligence For Big Data Analytics - BDA 2013

Big Data Analytics:
Challenges and What Computational Intelligence Techniques May Offer
Ah-Hwee Tan
(http://www.ntu.edu.sg/home/asahtan)
School of Computer Engineering
Nanyang Technological University
Big Data Analytics Symposium
London, UK
13 September 2013
Outline
Big Data Analytics
Computational Intelligence Techniques
Web Data Analytics
Flexible Organizer for Competitive Intelligence (FOCI)
Web Information Fusion and Associative Discovery
Analytics for Active Living for Elderly
The Era of Big Data

Big data refers to a collection of data sets so large and complex that they exceed the competence of commonly used IT systems in terms of processing space and/or time.
Sources of Big Data
Traditionally, mostly produced in scientific fields such as astronomy, meteorology, genomics, physics, biology, and environmental research.
With the rapid development of IT and the consequent decrease in the cost of collecting and storing data, big data has been generated by almost every industry and sector as well as by governmental departments, including retail, finance, banking, security, audit, electric power, and healthcare.
Recently, big data over the Web (big Web data for short), which includes all the context data, such as user-generated content, browser/search log data, deep web data, etc.
Examples of Big Data

(Source: Wikipedia)
Walmart handles more than 1 million customer transactions every hour, which is imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.
Facebook handles 50 billion photos from its user base.

FICO Falcon Credit Card Fraud Detection System protects
2.1 billion active accounts world-wide.
Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work throughout various times of the day.
Examples of Big Data

(Source: Wikipedia)
NASA Center for Climate Simulation
(NCCS) stores 32 petabytes of
climate observations and simulations
on the Discover supercomputing
cluster.
Utah Data Center is a data center currently being constructed by the United States National Security Agency. When finished, the facility will handle yottabytes of information collected by NSA over the Internet.
Value     Symbol   Name
1000      kB       kilobyte
1000^2    MB       megabyte
1000^3    GB       gigabyte
1000^4    TB       terabyte
1000^5    PB       petabyte
1000^6    EB       exabyte
1000^7    ZB       zettabyte
1000^8    YB       yottabyte
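The table uses decimal prefixes (powers of 1000), whereas the Walmart figure quoted earlier (2.5 petabytes = 2560 terabytes) implies a factor of 1024. A minimal Python sketch of both conventions follows; the function name `convert` and the `PREFIX_EXPONENT` dictionary are illustrative names, not part of the original slides.

```python
# Exponents follow the table: kB = base**1, MB = base**2, ..., YB = base**8.
PREFIX_EXPONENT = {"kB": 1, "MB": 2, "GB": 3, "TB": 4,
                   "PB": 5, "EB": 6, "ZB": 7, "YB": 8}

def convert(value, src, dst, base=1000):
    """Convert between units; base=1000 is decimal, base=1024 is binary."""
    return value * base ** (PREFIX_EXPONENT[src] - PREFIX_EXPONENT[dst])

print(convert(2.5, "PB", "TB"))             # decimal convention
print(convert(2.5, "PB", "TB", base=1024))  # binary convention
```

Under the decimal convention 2.5 PB is 2500 TB; the 2560 TB figure only appears when each step is taken as 1024.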
Money of Big Data

(Source: Wikipedia)
"Big data" has increased the demand for information management specialists.
Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, and HP have spent more than $15 billion on software firms specializing in data management and analytics.
In 2010, this industry on its own was worth more than $100 billion and was growing at almost 10 percent a year: about twice as fast as the software business as a whole.
Market of Big Data

(Source: Wikipedia)
Developed economies make increasing use of data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide and there are between 1 billion and 2 billion people accessing the internet.
The world's effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000, 65 exabytes in 2007[14] and it is predicted that the amount of traffic flowing over the internet will reach 667 exabytes annually by 2013.[5]
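The capacity figures above can be turned into implied annual growth rates. The sketch below uses the standard compound annual growth rate formula, (end / start)^(1 / years) - 1; the function name `cagr` and the list `capacity_pb` are illustrative, and exabytes are converted to petabytes with the decimal factor of 1000.

```python
def cagr(start, end, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1.0 / years) - 1.0

# Capacity figures from the slide, all expressed in petabytes (1 EB = 1000 PB).
capacity_pb = [(1986, 281.0), (1993, 471.0), (2000, 2200.0), (2007, 65000.0)]

for (y0, v0), (y1, v1) in zip(capacity_pb, capacity_pb[1:]):
    print(f"{y0}-{y1}: {cagr(v0, v1, y1 - y0):.1%} per year")
```

On these figures the implied growth rate rises from single digits per year in 1986-1993 to well above 50% per year in 2000-2007.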
Big Data Market Segments

(Report by Transparency Market Research)
Segmentation of the big data market by components, by applications and by geography.
The different components included are software and services, hardware and storage.
The software and services segment dominates the components market, whereas the storage segment will be the fastest growing segment for the next 5 years owing to the perpetual growth in the data generated.
Big Data Market Segment by Applications

The application segment covers eight applications, namely financial services, manufacturing, healthcare, telecommunication, government, retail, media & entertainment, and others.
Financial services, healthcare and the government sector are the top three contributors to the big data market and together held more than 55% of the big data market in 2012.
The media and entertainment and the healthcare sectors will grow at a high CAGR of nearly 42% from 2012 to 2018. The growth in data in the form of video, images, and games is driving the media and entertainment segment.
Read more: http://www.digitaljournal.com/pr/1395146#ixzz2b0hvuxrQ
Challenges of Big Data

Volume
Size in the order of petabytes, exabytes, ...
Velocity
Time-sensitive data, data that grow exponentially or even at rates that overwhelm the well-known Moore's Law
Variety
From structured data to semi-structured and completely unstructured data of different types, such as text, image, audio, video, click streams, log files, ...
Deeper Issues of Big Data

(The additional 3Vs)
Validity
Is the data correct and accurate for the intended
usage?
Veracity
Are the results meaningful for the given problem
space?
Volatility
How long do you need to look/store this data?
Computational Intelligence
Neural Networks (IJCNN)

Brain-like mathematical models for pattern
recognition, memory, and association discovery
Examples: Perceptron, BP, SVM, SOM, ART,
Fuzzy Systems (IEEE-FUZZ)

Fuzzy operators for handling non-discrete reasoning
Examples: FNN, Fuzzy C-Means,
Evolutionary Computing (CEC)

Classes of heuristic algorithms that repeatedly search for good solutions by mimicking the process of natural evolution
Commonly used for optimization and
search problems
Examples: Genetic Algo, Memetic Algo,
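As an illustration of the evolutionary idea described above, here is a minimal genetic algorithm sketch in Python. The OneMax objective (count the 1-bits), the tournament size of 3, and all parameter values are assumptions chosen for brevity; they do not come from the slides.

```python
import random

def fitness(bits):
    """OneMax toy objective: number of 1-bits (an illustrative choice)."""
    return sum(bits)

def evolve(n_bits=20, pop_size=30, generations=40, p_mut=0.02, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = [max(fitness(ind) for ind in pop)]
    for _ in range(generations):
        best = max(pop, key=fitness)
        nxt = [best[:]]                       # elitism: keep the best unchanged
        while len(nxt) < pop_size:
            # tournament selection of two parents
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randint(1, n_bits - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # bit-flip mutation
            child = [b ^ 1 if rng.random() < p_mut else b for b in child]
            nxt.append(child)
        pop = nxt
        history.append(max(fitness(ind) for ind in pop))
    return max(pop, key=fitness), history

best, history = evolve()
print(fitness(best), history[0], history[-1])
```

Because the best individual is copied unchanged into each new generation, the best fitness recorded in `history` never decreases.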
Flagship Events of
World Congress on Computational Intelligence
(Australia 2012, Beijing 2014)
IEEE Symposium on Computational Intelligence
(Singapore 2013, Florida, USA 2014)
IEEE Symposium on Computational Intelligence
in Big Data (IEEE CIBD'2014)
Examples of Use of CI in Big Data
Data size and feature space adaptation

Uncertainty modeling in learning from big data
Distributed learning techniques in uncertain environment
Uncertainty in cloud computing
Distributed parallel computation
Feature selection/extraction in big data
Sample selection based on uncertainty
Incremental Learning
Manifold Learning on big data
Uncertainty techniques in big data classification/clustering
Imbalance learning on big data
Active learning on big data
Random weight networks on big data
Transfer learning on big data
Self-Organizing Neural Networks for Personalized Web Intelligence
Towards Personalized Web Intelligence

Ah-Hwee Tan, Hwee-Leng Ong,
Hong Pan, Jamie Ng, Qiu-Xiang Li
Knowledge and Information Systems 18 (2004) 297-306
Workflow for Web Data Analytics

Search
Getting the information
Organize (clustering/categorizing)
Putting things in perspective
Analyze (data mining)

Discover hidden knowledge
Share (knowledge management)

Saving for reference and sharing
Track
Constant monitoring
Approaches to Organizing/Analyzing
Clustering
Organizing information into groups based on similarity functions and thresholds
e.g. BullsEye, NorthernLight, Vivisimo
Categorization
Organizing information into a predefined set of
classes
e.g. Yahoo!, Autonomy Knowledge Server
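The clustering approach above (grouping by a similarity function and a threshold) can be sketched as a simple single-pass procedure. This is a generic illustration, not the method used by any of the systems named; the cosine similarity on word counts, the threshold of 0.3, and the sample documents are all assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def threshold_cluster(docs, threshold=0.3):
    """Assign each document to the most similar cluster, or open a new one."""
    centroids, labels = [], []
    for text in docs:
        vec = Counter(text.lower().split())
        sims = [cosine(vec, c) for c in centroids]
        if sims and max(sims) >= threshold:
            j = sims.index(max(sims))
            centroids[j] = centroids[j] + vec  # Counter addition sums term counts
        else:
            j = len(centroids)                 # no cluster is similar enough
            centroids.append(vec)
        labels.append(j)
    return labels

docs = ["neural network learning", "fuzzy neural network",
        "stock market news", "market news today"]
print(threshold_cluster(docs))  # → [0, 0, 1, 1]
```

No classes are defined in advance: the number of clusters is determined only by the threshold, which is the property the comparison on the next slide turns on.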
Which is better?
Clustering
Pros
Unsupervised/self-organizing, require no training
or predefinition of classes
Able to identify new themes
Cons
Users have no control
Ever changing cluster structure
Difficult to navigate and track
Categorization
Pros
Good control on classes
Every info assigned to one or more classes of interest
Cons
Requires learning (supervised) and/or definition of classification rules/knowledge
Every info has to be assigned to one or more classes
Good control but lacks flexibility to handle new information
User-configurable Clustering
(Tan & Pan, PAKDD 2002)
Information organization and content management
Online incremental clustering + user-defined structure (preferences)
Reduces to a clustering system if no user indication given
Allows personalization in a direct, intuitive, and interactive manner
Control + flexibility
ARAM for Personalized Information Management
[Figure: ARAM network. Field F2 holds the information clusters; two F1 fields receive the information vector and the preference vector.]
Flexible Organizer for Competitive Intelligence (FOCI)
A platform for gathering, organizing, tracking, analyzing, and sharing competitive information
Natural way of turning raw search results into personalized CI portfolios
Multilingual enabled with Multilingual Efficient Analyzer
Domain localization (Technology)
Patented and licensed to many companies
FOCI User Interface
FOCI Architecture
[Figure: FOCI architecture. Labels: Users, Intranet/Internet, Visualization Front End, Content Gathering, Content Analysis, Content Management, Content Publishing, CI Portfolio, Domain-Specific Knowledge.]
Personalized Content Management

Portfolio created through Search
Unsupervised clustering (ARAM Pattern Channel A)
Loop
Personalization by users (ARAM Pattern Channel B)
Reorganization of clusters (ARAM Pattern Channel A&B)
Saving of personalized portfolio

Tracking of new information
Personalization Functions
Marking/labeling (selected) clusters
Personal interpretation
Inserting Clusters
Indicate preference on groupings
Merging clusters
Indicate preferences on similarities
Splitting clusters
Indicate preferences on differences
...
Information Clustering
A portfolio created by a meta-search of 4 search engines with a query on Text Mining
A Personalized Portfolio
after <=19 personalization operations
(mainly labeling and creating clusters)
Organizing New Information
42 new documents from DirectHit, Netscape, and BusinessWire
[Figure panels: Without the Personalized Portfolio; Based on Personalized Portfolio]
Summary
A fusion neural network algorithm, called fusion ART, has been proposed for integrating clustering and categorization.
It has been applied to competitive intelligence on the web.
Compared with existing works, fusion ART has advantages in:
Personalization: fusion ART performs analysis and organization of data based on user preferences
Low time complexity: fusion ART performs real-time search and match of patterns, resulting in a linear time complexity
Incremental clustering manner: fusion ART may adapt to dynamic web multimedia data sets by incrementally clustering new patterns based on the learnt cluster structure without referring to the old data.
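The incremental behaviour described above can be illustrated with a single-channel fuzzy ART sketch. This is the textbook fuzzy ART procedure (complement coding, choice function, vigilance test, template learning), not the fusion ART model presented in these slides; the function name and the parameter values `rho`, `alpha`, and `beta` are assumptions for illustration.

```python
import numpy as np

def fuzzy_art(patterns, rho=0.75, alpha=0.01, beta=1.0):
    """Single-channel fuzzy ART clustering; returns (labels, weights)."""
    weights, labels = [], []
    for x in patterns:
        x = np.asarray(x, dtype=float)
        i = np.concatenate([x, 1.0 - x])           # complement coding
        # choice value of every committed cluster
        scores = [np.minimum(i, w).sum() / (alpha + w.sum()) for w in weights]
        chosen = None
        for j in np.argsort(scores)[::-1]:         # search by decreasing choice value
            match = np.minimum(i, weights[j]).sum() / i.sum()
            if match >= rho:                       # vigilance test passed
                chosen = int(j)
                break
        if chosen is None:                         # no resonance: commit a new cluster
            weights.append(i.copy())
            chosen = len(weights) - 1
        else:                                      # update the template only
            w = weights[chosen]
            weights[chosen] = beta * np.minimum(i, w) + (1.0 - beta) * w
        labels.append(chosen)
    return labels, weights

data = [[0.1, 0.1], [0.12, 0.1], [0.9, 0.9], [0.88, 0.92]]
labels, weights = fuzzy_art(data)
print(labels)  # → [0, 0, 1, 1]
```

Each pattern is seen once and only the cluster templates are kept, so new data can be clustered without revisiting earlier patterns.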
Heterogeneous Data Co-clustering for Social Media Data Theme Discovery and Mining
Lei Meng, Ah-Hwee Tan and Dong Xu
IEEE Transactions on Knowledge and Data Engineering, 2013
Introduction
The popularity of social websites has led to a great increase in web multimedia documents
Massive number: billions of images and articles online
Diversity: diverse content and booming emerging topics
Multi-modal descriptors: images, text, category, tags, comments
[Figure: example images with category "Birds" and keywords from surrounding text: wild, bird, beach, tree, vacation, animal, mar, sunny, playa, nayarit, arena, ave, water, vacaciones, hollyday, pelicano.]
Introduction
Clustering of web multimedia data is challenging
Scalability to big data
Difficulty in integrating multi-modal feature data
Ambiguity in deciding the number of categories
Rich but noisy meta-information: semantic gap of images, noisy tags
[Figure: two example images with their tags.
Birds: wild, bird, beach, tree, vacation, animal, mar, sunny, playa, nayarit, arena, ave, water, vacaciones, hollyday, pelicano.
Beach: ocean, blue, sea, summer, vacation, sun, man, beach, water, yellow, fun, sand, play, funny, adult, humor, lifestyle, sunny, resort.]
Problem Statement
We define the theme discovery of web multimedia data as a heterogeneous data co-clustering problem, which identifies the semantic categories of data patterns through the fusion and recognition of multiple types of features.
[Figure: "Apple" example. Labels: multiple descriptions (category, tag, user description, surrounding text); categories Fruits, Products, Movies.]
Proposed Approach
A self-organizing neural network approach to Heterogeneous Data Co-clustering
Based on Fusion Adaptive Resonance Theory (Fusion ART)
Fuses an arbitrary number of feature modalities
Adaptively tunes the weights for different feature modalities
Two different learning functions, one for primary data, such as images and articles, and one for meta-information, to handle short and noisy text
Incremental fast learning
Does not need the number of clusters to be given
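The idea of fusing several feature modalities with per-modality weights can be illustrated with the multi-channel choice function usually given for fusion ART: T_j = sum over channels k of gamma_k * |x_k ^ w_jk| / (alpha + |w_jk|), where ^ is the element-wise minimum and |.| is the L1 norm. The sketch below is drawn from the general fusion ART literature rather than from these slides; the function name, the sample vectors, and the fixed gamma values are assumptions, and the adaptive weight tuning and meta-information learning function of the proposed approach are not reproduced.

```python
import numpy as np

def choice_value(inputs, weights, gammas, alpha=0.01):
    """Weighted sum over channels of |x ^ w| / (alpha + |w|)."""
    total = 0.0
    for x, w, g in zip(inputs, weights, gammas):
        x = np.asarray(x, dtype=float)
        w = np.asarray(w, dtype=float)
        total += g * np.minimum(x, w).sum() / (alpha + w.sum())
    return total

# Hypothetical pattern and cluster template with two channels.
visual_x, visual_w = [0.9, 0.1], [0.1, 0.9]         # visual channel: poor match
text_x, text_w = [1.0, 0.0, 1.0], [1.0, 0.0, 1.0]   # text channel: exact match

text_heavy = choice_value([visual_x, text_x], [visual_w, text_w], (0.3, 0.7))
visual_heavy = choice_value([visual_x, text_x], [visual_w, text_w], (0.7, 0.3))
print(text_heavy > visual_heavy)  # → True
```

Shifting weight towards the channel that matches well raises the cluster's choice value, which is why the relative channel weights matter for the fused similarity.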
Experiments
NUS-WIDE data set
36784 images of 18 categories
Visual features: grid color moment, edge direction histogram, and wavelet texture
Textual features of surrounding text: 1142 words (7 words per image on average)
20 Newsgroups data set
12826 text documents of 10 categories
Textual features of document content: over 60k words (800 words per document on average)
Textual features of category: 3 labels per document on average
Experiments on NUS-WIDE Data Set

Evaluation of weight adaptation across channels for visual and textual features
Performance comparison with fixed weight values
GHF-ART with the adaptively tuned weight values γ_SA achieves the best performance in 5 classes and in overall performance, and achieves performance close to the best results obtained with fixed weight values

Tracking of the change in weight values of γ_SA
Textual features of surrounding text are assigned higher weights than visual features
The value of γ_SA stabilizes in [0.7, 0.8] with the increase of patterns
Big fluctuations may result from the generation of new clusters
Clustering performance comparison with existing algorithms in terms of weighted average precision, cluster entropy (H_cluster), class entropy (H_class), purity and Rand index (RI)
GHF-ART achieves the best performance in terms of all the evaluation measures
With supervisory information, GHF-ART(SS) consistently obtains better performance
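Two of the measures named above, purity and the Rand index, have short standard definitions that can be sketched directly. The code uses the common textbook forms; the exact definitions used in the reported experiments are not shown in the slides and may differ, and the other measures (weighted average precision, cluster entropy, class entropy) are not covered. The function names and the sample labels are illustrative.

```python
from collections import Counter
from itertools import combinations

def purity(clusters, classes):
    """Fraction of items that belong to the majority class of their cluster."""
    by_cluster = {}
    for c, k in zip(clusters, classes):
        by_cluster.setdefault(c, []).append(k)
    majority = sum(Counter(members).most_common(1)[0][1]
                   for members in by_cluster.values())
    return majority / len(clusters)

def rand_index(clusters, classes):
    """Fraction of item pairs on which the clustering and the classes agree."""
    pairs = list(combinations(range(len(clusters)), 2))
    agree = sum((clusters[a] == clusters[b]) == (classes[a] == classes[b])
                for a, b in pairs)
    return agree / len(pairs)

clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
print(purity(clusters, classes), rand_index(clusters, classes))
```

In this toy example one item of class "b" falls in the first cluster, so purity is 5/6 and the Rand index is 10/15.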

Time complexity analysis
GHF-ART and Fusion ART incur a very small increase in time cost
For 23284 images, GHF-ART completes the clustering process in 10 seconds
Experiments on 20 Newsgroups Data Set
Clustering performance comparison using document content and category information
Both GHF-ART and GHF-ART(SS) outperform other algorithms in all the evaluation measures
GHF-ART has a 5% gain over Fusion ART in terms of Average Precision, Purity and Rand Index.
Compared with other unsupervised algorithms, GHF-ART achieves around 80% in Average Precision, Purity and Rand Index while other algorithms typically obtain less than 75%
Summary
A heterogeneous data co-clustering algorithm, called GHF-ART, is proposed to discover the themes of web multimedia data via their rich but heterogeneous descriptors.
Compared with existing works, GHF-ART has advantages in:
Strong noise immunity: a learning function for meta-information is proposed to handle noise
Adaptive channel weighting: a well-defined weighting algorithm is proposed to identify the important feature modalities for a better fusion of multi-modal features in the overall similarity measure
Low time complexity: GHF-ART performs real-time search and match of patterns, resulting in a linear time complexity for big data
Incremental clustering manner: GHF-ART may adapt to dynamic web multimedia data sets by incrementally clustering new patterns based on the learnt cluster structure without referring to the old data.
Research Centre of Excellence in Active LIving for the elderLY (LILY)
Aging in Place:
Opportunities and Challenges
Ah-Hwee Tan
(http://www.ntu.edu.sg/home/asahtan)
School of Computer Engineering
Nanyang Technological University
JOINT UBC-NTU RESEARCH CENTRE
Aging in Place
"The ability to live in one's own home and community safely, independently, and comfortably, regardless of age, income, or ability level" - Center for Disease Control, Dec 2011
Motivation
Global aging population creates silver challenges
Most adults would prefer to age in place
78 percent of adults between the ages of 50 and 64
report that they would prefer to stay in their current
residence as they age
Growing elderly population will be living independently in own homes
Vital to transform future homes into intelligent human-centered environments for the elderly
Golden opportunities for innovating assistive technologies for aging in place
A Basic Scenario of Tender Care for Aging-in-place
[Figure: processing stages labelled Unobtrusive Sensing, Social Signal Processing, Context Aware Auto Tagging, and Social Cognitive Network]
Unobtrusive sensing device detects: the elder keeps walking around at an irregular pace.
Social signal processing indicates: the elder has been silent for an unusually long time.
Cognitive analysis result: "Your mother may be feeling anxious now"
"I need to call my mother now"
Silver Challenges
Vision
To enable the elderly to maintain an active, healthy and engaging lifestyle in their own homes, supported by an age-friendly intelligent environment providing all-round comprehensive tender care
Round-the-clock day-to-day health and wellness monitoring
Cognitive support and recommendation of products and services
Companionship and emotional support
Support for maintaining/stimulating social interaction
Design Consideration and Challenges

How to perform unobtrusive monitoring?
- Mobile sensing, activity tracking
How to provide all-around comprehensive care?
- Physical, cognitive, emotional, social, sustainability
How to maintain ubiquitous access and interaction?
- Cross platform, multimedia, multimodal
How to provide friendly, personal touch?
- Adaptive user modeling, mood detection
- Proactive, natural interaction
Approach and Methodology
To support active living of the elderly through an intelligent multi-agent environment with ubiquitous access, natural interface, and all-rounded comprehensive care
Key Technologies
Unobtrusive sensing and social signal processing
Activity pattern and user modeling
Information and service recommendation
Proactive stimulation and natural interaction
A Multi-Agent Collaborative
Care Environment
Isabel
(Personal Nurse)
Small talk
Recommendations
for healthcare
products and services
Alfred
(The Butler)
Small talk
User modeling
Social and travel
advisory
Frank
(Robot Dog)
Activity sensing
Pattern modeling
Why Multi-Agent?
Unobtrusive sensing and monitoring: agents of different characteristics and capabilities
Ubiquitous access to information and services: agents in different platforms and locations
Comprehensive tender care: agents with different domain knowledge and functions
Three's a party: more opportunities for cognitive stimulation and social interaction
Comprehensive Tender Care

Physical Support: activity tracking, safety and wellness monitoring
Cognitive Support: information and recommendation on (healthcare) products, services, skills and activities
Emotional Support: mood detection, affective support, small talk
Social Support: companionship and connection to family and friends (old and new) through SMS, email, Facebook, etc.
Unobtrusive Sensing and Ubiquitous Access to Services

Unobtrusive in-home real-time data collection and contextual social signal processing
- Essential to better understand and cater to the elderly's needs.
Sensing: bio sensing, motion sensors, wearable/mobile sensors for health monitoring and activity tracking
Cross Platform: large screen interactive display, mobile handheld devices, physical robots
Multimedia: text, audio, video
Adaptive User Modelling
Identity and profile
Interests and preferences
Behaviour model: time, space, activity
Knowledge and skills
Social network: family and friends
Methods for Model Building
Explicit: user specification
Implicit: user actions, choices, conversation
Cognitive Support:
Product/Service Recommendation
Domain knowledge:
Healthcare, Travel, Cooking
Delivery modes:
- Question & Answer
- Proactive recommendation
- Conversation
Personal Touch:
Personalized, context sensitive, small talk
Challenges in Big Living Analytics
Volume: huge amount of data through bio sensing, motion sensors, wearable/mobile sensors for health monitoring and activity tracking
Velocity: 24x7 real-time sensing, sense making, decision making, service recommendation
Variety: information integration and knowledge sharing from cross-platform, multimedia unstructured data - text, audio, video, gestures
Research Centre of Excellence in Active LIving for the elderLY (LILY)
Thank you!
JOINT UBC-NTU RESEARCH CENTRE
