
Big Data Analytics:
Challenges and What Computational Intelligence Techniques May Offer
Ah-Hwee Tan
(http://www.ntu.edu.sg/home/asahtan)
School of Computer Engineering
Nanyang Technological University
Big Data Analytics Symposium
London, UK
13 September 2013

Outline
Big Data Analytics
Computational Intelligence Techniques
Web Data Analytics

Flexible Organizer for Competitive Intelligence (FOCI)
Web Information Fusion and Associative Discovery

Analytics for Active Living for Elderly

The Era of Big Data


Big data refers to a collection of data sets so large and complex that they exceed the competence of commonly used IT systems in terms of processing space and/or time.

Sources of Big Data
Traditionally, big data was mostly produced in scientific fields such as astronomy, meteorology, genomics, physics, biology, and environmental research.
With the rapid development of IT and the consequent decrease in the cost of collecting and storing data, big data is now generated by almost every industry and sector as well as government departments, including retail, finance, banking, security, audit, electric power, and healthcare.
More recently, big data over the Web (big Web data for short) includes all the context data, such as user-generated content, browser/search log data, deep web data, etc.

Examples of Big Data


(Source: Wikipedia)
Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.

Facebook handles 50 billion photos from its user base.


FICO Falcon Credit Card Fraud Detection System protects
2.1 billion active accounts world-wide.
Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work throughout various times of the day.

Examples of Big Data


(Source: Wikipedia)
NASA Center for Climate Simulation
(NCCS) stores 32 petabytes of
climate observations and simulations
on the Discover supercomputing
cluster.
Utah Data Center is a data center currently being constructed by the United States National Security Agency. When finished, the facility will handle yottabytes of information collected by NSA over the Internet.

Value     Metric
1000      kB   kilobyte
1000^2    MB   megabyte
1000^3    GB   gigabyte
1000^4    TB   terabyte
1000^5    PB   petabyte
1000^6    EB   exabyte
1000^7    ZB   zettabyte
1000^8    YB   yottabyte
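
To make the scale of these units concrete, the decimal prefixes in the table can be turned into a small conversion helper. This is an illustrative sketch only: the names `UNITS`, `to_bytes`, and `convert` are hypothetical and do not come from the talk.

```python
# Decimal (SI) byte units from the table: each step is a factor of 1000.
UNITS = ["kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value, unit):
    """Convert a quantity in a decimal unit (e.g. 32 PB) to bytes."""
    exponent = UNITS.index(unit) + 1  # kB -> 1000**1, MB -> 1000**2, ...
    return value * 1000 ** exponent


def convert(value, src, dst):
    """Convert between two decimal units, e.g. petabytes to terabytes."""
    return to_bytes(value, src) / to_bytes(1, dst)
```

Under this decimal convention, `convert(32, "PB", "TB")` treats the 32 petabytes stored by NCCS as 32000 terabytes. The "2560 terabytes" quoted for 2.5 petabytes in the Walmart example instead follows the binary convention (1 PB = 1024 TB).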

Money of Big Data


(Source: Wikipedia)
"Big data" has increased the demand for information management specialists.
Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, and HP have spent more than $15 billion on software firms specializing in data management and analytics.
In 2010, this industry on its own was worth more than
$100 billion and was growing at almost 10 percent a
year: about twice as fast as the software business as
a whole.

Market of Big Data


(Source: Wikipedia)
Developed economies make increasing use of data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people access the internet.
The world's effective capacity to exchange information
through telecommunication networks was 281
petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes
in 2000, 65 exabytes in 2007[14] and it is predicted that
the amount of traffic flowing over the internet will reach
667 exabytes annually by 2013.[5]

Big Data Market Segments


(Report by Transparency Market Research)
Segmentation of the big data market by components, by applications, and by geography.
The different components included are software and services, hardware, and storage.
The software and services segment dominates the components market, whereas the storage segment will be the fastest growing segment for the next 5 years owing to the perpetual growth in the data generated.

Big Data Market Segment by Applications
The report covers eight applications in the application segment: financial services, manufacturing, healthcare, telecommunication, government, retail, media & entertainment, and others.
Financial services, healthcare, and the government sector are the top three contributors to the big data market and together held more than 55% of the market in 2012.
The media and entertainment and the healthcare sectors will grow at a high CAGR of nearly 42% from 2012 to 2018. The growth in data in the form of video, images, and games is driving the media and entertainment segment.
Read more: http://www.digitaljournal.com/pr/1395146#ixzz2b0hvuxrQ

Challenges of Big Data


Volume
Size in the order of petabytes,
exabytes,

Velocity
Time-sensitive data; data that grow exponentially, or even at rates that outpace the well-known Moore's Law

Variety
From structured data to semi-structured and completely unstructured data of different types, such as text, image, audio, video, click streams, log files,

Deeper Issues of Big Data


(The additional 3Vs)
Validity
Is the data correct and accurate for the intended
usage?

Veracity
Are the results meaningful for the given problem space?

Volatility
How long do you need to look/store this data?

Computational Intelligence

Neural Networks (IJCNN)


Brain-like mathematical models for pattern
recognition, memory, and association discovery
Examples: Perceptron, BP, SVM, SOM, ART,

Fuzzy Systems (IEEE-FUZZ)


Fuzzy operators for handling non-discrete reasoning
Examples: FNN, Fuzzy C-Means,

Computational Intelligence

Evolutionary Computing (CEC)


Classes of heuristic algorithms that repeatedly search for good solutions by mimicking the process of natural evolution
Commonly used for optimization and search problems
Examples: Genetic Algo, Memetic Algo,
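
The evolutionary search loop described on this slide can be sketched as a toy genetic algorithm. This is a minimal illustration, not a method from the talk: the OneMax objective (count the 1s in a bit string), the names `fitness` and `evolve`, and all parameter values are assumptions chosen for brevity.

```python
import random


def fitness(bits):
    """OneMax toy objective: count the 1s in the bit string."""
    return sum(bits)


def evolve(n_bits=20, pop_size=30, generations=60, mutation_rate=0.02, seed=0):
    """Return the best bit string found by a simple elitist genetic algorithm."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        best = max(pop, key=fitness)  # elitism: the best individual survives

        def pick():
            # Tournament selection: the fitter of two random individuals.
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b

        children = [best[:]]
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, n_bits)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with a small per-bit probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness)
```

Because the best individual is always copied into the next generation, the best fitness never decreases from one generation to the next. Selection, crossover, and mutation are the three operators that mimic natural evolution here.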

Flagship Events of Computational Intelligence
World Congress on Computational Intelligence (Australia 2012, Beijing 2014)
IEEE Symposium on Computational Intelligence (Singapore 2013, Florida, USA 2014)
IEEE Symposium on Computational Intelligence in Big Data (IEEE CIBD'2014)

Examples of Use of CI in Big Data

Data size and feature space adaptation


Uncertainty modeling in learning from big data
Distributed learning techniques in uncertain environment
Uncertainty in cloud computing
Distributed parallel computation
Feature selection/extraction in big data
Sample selection based on uncertainty
Incremental Learning
Manifold Learning on big data
Uncertainty techniques in big data classification/clustering
Imbalance learning on big data
Active learning on big data
Random weight networks on big data
Transfer learning on big data

Self-Organizing Neural Networks for Personalized Web Intelligence

Towards Personalized Web Intelligence
Ah-Hwee Tan, Hwee-Leng Ong,
Hong Pan, Jamie Ng, Qiu-Xiang Li
Knowledge and Information Systems 18 (2004) 297-306

Workflow for Web Data Analytics
Search
Getting the information

Organize (clustering/categorizing)
Putting things in perspective

Analyze (data mining)


Discover hidden knowledge

Share (knowledge management)


Saving for reference and sharing

Track
Constant monitoring

Approaches to Organizing/Analyzing
Clustering
Organizing information into groups based on similarity functions and thresholds
e.g. BullsEye, NorthernLight, Vivisimo

Categorization
Organizing information into a predefined set of classes
e.g. Yahoo!, Autonomy Knowledge Server

Which is better?

Clustering
Pros
Unsupervised/self-organizing; requires no training or predefinition of classes
Able to identify new themes

Cons
Users have no control
Ever-changing cluster structure
Difficult to navigate and track

Categorization
Pros
Good control over classes
Every piece of information assigned to one or more classes of interest

Cons
Requires learning (supervised) and/or definition of classification rules/knowledge
Every piece of information has to be assigned to one or more classes
Good control but lacks flexibility to handle new information

User-configurable Clustering
(Tan & Pan, PAKDD 2002)

Information organization and content management
Online incremental clustering + user-defined structure (preferences)
Reduces to a clustering system if no user indication is given
Allows personalization in a direct, intuitive, and interactive manner
Control + flexibility

ARAM for Personalized Information Management
[Figure: ARAM network. Field F2 holds the information clusters; the F1 input fields receive the information vector and the preference vector.]

Flexible Organizer for Competitive Intelligence (FOCI)
A platform for gathering, organizing, tracking, analyzing, and sharing competitive information
Natural way of turning raw search results into personalized CI portfolios
Multilingual enabled with Multilingual Efficient Analyzer
Domain localization (Technology)

Patented and licensed to many companies

FOCI User Interface

FOCI Architecture
[Figure: FOCI architecture. Components shown: Intranet/Internet, Content Gathering, Content Analysis, Content Management, Content Publishing, Domain-Specific Knowledge, CI Portfolio, Visualization Front End, Users.]

Personalized Content Management
Portfolio created through Search
Unsupervised clustering (ARAM Pattern Channel A)
Loop
- Personalization by users (ARAM Pattern Channel B)
- Reorganization of clusters (ARAM Pattern Channel A&B)

Saving of personalized portfolio
Tracking of new information

Personalization Functions
Marking/labeling (selected) clusters
Personal interpretation

Inserting Clusters
Indicate preference on groupings

Merging clusters
Indicate preferences on similarities

Splitting clusters
Indicate preferences on differences

...

Information Clustering
A portfolio created by a meta-search of 4 search engines with a query on Text Mining

A Personalized Portfolio
after <=19 personalization operations (mainly labeling and creating clusters)

Organizing New Information
42 new documents from DirectHit, Netscape, and BusinessWire
[Figures: organization of the new documents without the personalized portfolio, and based on the personalized portfolio]

Summary
A fusion neural network algorithm, called fusion ART, has been proposed for integrating clustering and categorization
Has been applied to competitive intelligence on the web
Compared with existing works, fusion ART has advantages in
Personalization: fusion ART performs analysis and organization of data based on user preferences
Low time complexity: fusion ART performs real-time search and match of patterns, resulting in a linear time complexity
Incremental clustering: fusion ART may adapt to a dynamic web multimedia data set by incrementally clustering new patterns based on the learnt cluster structure without referring to the old data.
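
The incremental, threshold-driven clustering summarized above can be illustrated with a single-channel Fuzzy ART, the standard building block of the ART family. This is a hedged sketch, not the author's fusion ART or ARAM implementation: it handles one input channel only, and the class name `FuzzyART`, the helper names, and the parameter defaults are assumptions made for illustration.

```python
def complement_code(x):
    """Append 1 - x_i for each feature so that the input norm is constant."""
    return list(x) + [1.0 - v for v in x]


def fuzzy_and(a, b):
    """Element-wise minimum of two vectors."""
    return [min(p, q) for p, q in zip(a, b)]


class FuzzyART:
    def __init__(self, vigilance=0.75, alpha=0.01, beta=1.0):
        self.rho = vigilance   # minimum match needed to join a cluster
        self.alpha = alpha     # choice parameter
        self.beta = beta       # learning rate (1.0 = fast learning)
        self.weights = []      # one weight vector per cluster

    def _choice(self, i, w):
        # Choice function T_j = |I ^ w_j| / (alpha + |w_j|).
        return sum(fuzzy_and(i, w)) / (self.alpha + sum(w))

    def learn(self, x):
        """Assign one pattern (features in [0, 1]) to a cluster; return its index."""
        i = complement_code(x)
        norm_i = sum(i)
        order = sorted(
            range(len(self.weights)),
            key=lambda j: self._choice(i, self.weights[j]),
            reverse=True,
        )
        for j in order:
            w = self.weights[j]
            overlap = fuzzy_and(i, w)
            if sum(overlap) / norm_i >= self.rho:  # vigilance (resonance) test
                self.weights[j] = [
                    self.beta * o + (1 - self.beta) * v for o, v in zip(overlap, w)
                ]
                return j
        self.weights.append(i)  # no cluster matched: create a new one
        return len(self.weights) - 1
```

Each call to `learn` compares the new pattern only against the stored cluster weights, so earlier patterns never need to be revisited, and a new cluster is created whenever no existing one passes the vigilance test. The number of clusters therefore does not have to be fixed in advance.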

Heterogeneous Data Co-clustering for Social Media Data Theme Discovery and Mining

Lei Meng, Ah-Hwee Tan and Dong Xu
IEEE Transactions on Knowledge and Data Engineering, 2013

Introduction
The popularity of social websites has led to a great increase in web multimedia documents
Massive number: billions of images and articles online
Diversity: diverse content and rapidly emerging topics
Multi-modal descriptors: images, text, category, tags, comments
[Figure: example image with category "Birds" and keywords from surrounding text: Wild, bird, beach, tree, vacation, animal, mar, sunny, playa, nayarit, arena, ave, water, vacaciones, hollyday, pelicano.]

Introduction
Clustering of web multimedia data is challenging
Scalability to big data
Difficulty in integrating multi-modal feature data
Ambiguity in deciding the number of categories
Rich but noisy meta-information: semantic gap of images, noisy tags
[Figure: two example images with their tags. Birds: Wild, bird, beach, tree, vacation, animal, mar, sunny, playa, nayarit, arena, ave, water, vacaciones, hollyday, pelicano. Beach: Ocean, blue, sea, summer, vacation, sun, man, beach, water, yellow, fun, sand, play, funny, adult, humor, lifestyle, sunny, resort.]

Problem Statement
We define the theme discovery of web multimedia data as a heterogeneous data co-clustering problem, which identifies the semantic categories of data patterns through the fusion and recognition of multiple types of features.
[Figure: multiple descriptions of the example "Apple" (category, tag, user description, surrounding text), with possible themes Fruits, Products, and Movies.]

Proposed Approach
A self-organizing neural network approach to Heterogeneous Data Co-clustering
Based on Fusion Adaptive Resonance Theory (Fusion ART)
Fuses an arbitrary number of feature modalities
Adaptively tunes the weights for different feature modalities
Two different learning functions: one for primary data, such as images and articles, and one for meta-information, to handle short and noisy text
Incremental fast learning
Does not need the number of clusters to be given

Experiments
NUS-WIDE data set
36784 images of 18 categories
Visual features: grid color moment, edge direction histogram, and wavelet texture
Textual features of surrounding text: 1142 words (7 words per image on average)

20 Newsgroups data set
12826 text documents of 10 categories
Textual features of document content: over 60k words (800 words per document on average)
Textual features of category: 3 labels per document on average

Experiments on NUS-WIDE Data Set


Evaluation of weight adaptation across channels for visual and textual features
Performance comparison with fixed weight values

GHF-ART with the adaptively tuned weight values _SA achieves the best performance in 5 classes and overall, and otherwise performs close to the best results obtained with fixed weight values

Experiments on NUS-WIDE Data Set


Tracking of the change in weight values of _SA

Textual features of surrounding text are assigned higher weights than visual features
The value of _SA stabilizes in [0.7, 0.8] as the number of patterns increases
Large fluctuations may result from the generation of new clusters

Experiments on NUS-WIDE Data Set

Clustering performance comparison with existing algorithms in terms of weighted average precision, cluster entropy (H_cluster), class entropy (H_class), purity and Rand index (RI)

GHF-ART achieves the best performance in terms of all the evaluation measures
With supervisory information, GHF-ART(SS) consistently obtains better performance

Experiments on NUS-WIDE Data Set


Time complexity analysis

GHF-ART and Fusion ART incur a very small increase in time cost
For 23284 images, GHF-ART completes the clustering process in 10 seconds

Experiments on 20 Newsgroups Data Set
Clustering performance comparison using document content and category information

Both GHF-ART and GHF-ART(SS) outperform other algorithms in all the evaluation measures
GHF-ART has a 5% gain over Fusion ART in terms of Average Precision, Purity and Rand Index.
Compared with other unsupervised algorithms, GHF-ART achieves around 80% in Average Precision, Purity and Rand Index, while other algorithms typically obtain less than 75%

Summary
A heterogeneous data co-clustering algorithm, called GHF-ART, is proposed to discover the themes of web multimedia data via their rich but heterogeneous descriptors.
Compared with existing works, GHF-ART has advantages in
Strong noise immunity: a learning function for meta-information is proposed to handle noise;
Adaptive channel weighting: a well-defined weighting algorithm is proposed to identify the important feature modalities for a better fusion of multi-modal features in the overall similarity measure;
Low time complexity: GHF-ART performs real-time search and match of patterns, resulting in a linear time complexity for big data;
Incremental clustering: GHF-ART may adapt to a dynamic web multimedia data set by incrementally clustering new patterns based on the learnt cluster structure without referring to the old data.

Research Centre of Excellence in Active LIving for the elderLY (LILY)

Aging in Place:
Opportunities and Challenges
Ah-Hwee Tan
(http://www.ntu.edu.sg/home/asahtan)
School of Computer Engineering
Nanyang Technological University

JOINT UBC-NTU RESEARCH CENTRE

Aging in Place
the ability to live in one's own home and community safely, independently, and comfortably, regardless of age, income, or ability level - Center for Disease Control, Dec 2011

Motivation
Global aging population creates silver challenges
Most adults would prefer to age in place
78 percent of adults between the ages of 50 and 64 report that they would prefer to stay in their current residence as they age

A growing elderly population will be living independently in their own homes
Vital to transform future homes into intelligent human-centered environments for the elderly
Golden opportunities for innovating assistive technologies for aging in place

A Basic Scenario of Tender Care for Aging-in-place
[Figure: processing pipeline: Unobtrusive Sensing, Social Signal Processing, Context Aware Auto Tagging, Social Cognitive Network]

Unobtrusive sensing device detects: the elder keeps walking around at an irregular pace.
Social signal processing indicates: the elder has been silent for an unusually long time.

Cognitive analysis result: "Your mother may be feeling anxious now"
"I need to call my mother now"

Silver Challenges

Vision
To enable the elderly to maintain an active, healthy and engaging lifestyle in their own homes, supported by an age-friendly intelligent environment providing all-round comprehensive tender care
Round-the-clock day-to-day health and wellness monitoring
Cognitive support and recommendation of products and services
Companionship and emotional support
Support for maintaining/stimulating social interaction

Design Consideration and Challenges
How to perform unobtrusive monitoring?
- Mobile sensing, activity tracking
How to provide all-around comprehensive care?
- Physical, cognitive, emotional, social, sustainability
How to maintain ubiquitous access and interaction?
- Cross platform, multimedia, multimodal
How to provide friendly, personal touch?
- Adaptive user modeling, mood detection
- Proactive, natural interaction

Approach and Methodology
To support active living of the elderly through an intelligent multi-agent environment with ubiquitous access, natural interface, and all-rounded comprehensive care
Key Technologies
Unobtrusive sensing and social signal processing
Activity pattern and user modeling
Information and service recommendation
Proactive stimulation and natural interaction

A Multi-Agent Collaborative Care Environment
Isabel (Personal Nurse)
Small talk
Recommendations for healthcare products and services

Alfred (The Butler)
Small talk
User modeling
Social and travel advisory

Frank (Robot Dog)
Activity sensing
Pattern modeling

Why Multi-Agent?
Unobtrusive sensing and monitoring: agents of different characteristics and capabilities
Ubiquitous access to information and services: agents in different platforms and locations
Comprehensive tender care: agents with different domain knowledge and functions
Three's a party: more opportunities for cognitive stimulation and social interaction

Comprehensive Tender Care
Physical Support: activity tracking, safety and wellness monitoring
Cognitive Support: information and recommendation on (healthcare) products, services, skills and activities
Emotional Support: mood detection, affective support, small talk
Social Support: companionship and connection to family and friends (old and new) through SMS, emails, Facebook, etc.

Unobtrusive Sensing and Ubiquitous Access to Services
Unobtrusive in-home real-time data collection and contextual social signal processing
- Essential to better understand and cater to the elderly's needs.
Sensing: bio sensing, motion sensors, wearable/mobile sensors for health monitoring and activity tracking
Cross Platform: large screen interactive display, mobile handheld devices, physical robots
Multimedia: text, audio, video

Adaptive User Modelling
Identity and profile
Interests and preferences
Behaviour model: time, space, activity
Knowledge and skills
Social network: family and friends
Methods for Model Building
Explicit: User specification
Implicit: User actions, choices, conversation

Cognitive Support:
Product/Service Recommendation
Domain knowledge:
Healthcare, Travel, Cooking

Delivery modes:
- Question & Answer
- Proactive recommendation
- Conversation

Personal Touch:
Personalized, context sensitive, small talk

Challenges in Big Living Analytics
Volume: huge amount of data through bio sensing, motion sensors, wearable/mobile sensors for health monitoring and activity tracking
Velocity: 24x7 real-time sensing, sense making, decision making, service recommendation
Variety: information integration and knowledge sharing from cross-platform, multimedia unstructured data - text, audio, video, gestures

Research Centre of Excellence in Active LIving for the elderLY (LILY)

Thank you!
JOINT UBC-NTU RESEARCH CENTRE
