
DATA MINING AND APPLICATIONS
Instructor: Lê Ngọc Thành, M.Sc.

LESSON 1
OVERVIEW

CONTENTS
1. Why do we need data mining?
2. What is data mining (DM)?
3. The knowledge discovery process (KDD)
4. Main data mining tasks
5. Data mining techniques
6. Challenges in data mining


THE NEED FOR DATA MINING
Commercial perspective
- Large volumes of data are collected and stored
  o Web data, e-commerce
  o Purchase receipts at supermarkets and shopping centers
  o Bank and credit card transactions
- Computers are more powerful and cheaper
- Competitive pressure is very strong
  o Companies must provide diverse, high-quality services (CRM: Customer Relationship Management)

THE NEED FOR DATA MINING
Scientific perspective
- Data is collected and stored at high speed (GB/h)
  o Remote sensors on satellites
  o Telescopes scanning the sky
  o Microarrays generating gene expression data
  o Scientific experiments generating terabytes of data
- Traditional techniques cannot cope with the raw data
- Data mining can help scientists
  o Classify and segment data
  o Formulate hypotheses

THE EMERGENCE OF DATA MINING
- Data mining emerged in a setting that is DATA RICH but KNOWLEDGE POOR
  o "We are drowning in data, but starving for knowledge!"
- Data mining is a solution that helps analyze mountains of data automatically and supports decision making.

THE NEED FOR DATA MINING
- Data contains a great deal of valuable information that benefits decision making
- Data cannot be analyzed by hand
  o People need weeks to discover useful information
  o Most data is never analyzed at all
- There is a deep gap between the ability to generate data and the ability to use it (Usama Fayyad)
- At 10^6 to 10^12 bytes, a data set can never be inspected in full or loaded entirely into computer memory

THE NEED FOR DATA MINING
[Figure: The data gap. Amount of data collected (terabytes) since 1995 compared with the amount of data analyzed. x-axis: 1995 to 1999; y-axis: 0 to 4,000,000.]

WHEN SHOULD DATA MINING BE USED?
- There is too much data
- The data is large (in dimensionality and in size)
  o Image data (size)
  o Gene data (number of dimensions)
- Little is known about the data

APPLICATION AREAS OF DATA MINING
Business information
- Market and sales analysis
- Investment analysis
- Loan approval
- Fraud detection
Manufacturing information
- Control and planning
- Network management
- Analysis of experimental results
Personal information
Scientific information
- Astronomy
- Biological databases
- Geoscience: earthquake detectors

Customer Relationship Management (CRM)
To build relationships with its customers, a company needs to:
1. Notice what its customers are doing
2. Remember what it and its customers have done over time
3. Learn from what it has remembered
4. Act on what it has learned to make customers more profitable

- All of this is based on transaction data
- Discovering and maintaining relationships is the key to success

WHAT IS DATA MINING?
Data mining is the nontrivial process of identifying hidden patterns in databases that are valid, novel, useful, and ultimately understandable (U. Fayyad, 1996).
- Nontrivial process: involves multiple processing steps
- Valid: the correctness of the pattern or model can be justified
- Novel: not known beforehand
- Useful: can be put to use
- Understandable: by humans and by machines

DATA MINING
What is a hidden pattern?
A hidden pattern is a relationship in the data, for example:
- People who buy trousers often also buy shirts
- People with a good credit rating tend to have fewer accidents
- Male, 37+, income 50K-75K -> spends about $25-$50 per catalog order

DATA MINING (continued)
What is not Data Mining?
- Looking up a phone number in a phone directory
- Searching for information about "Amazon" on a search engine
What is Data Mining?
- Finding that certain names are more common in particular areas of the US (O'Brien, O'Rurke, O'Reilly in the Boston area)
- Grouping similar documents returned by a search engine based on their content (e.g., Amazon rainforest, Amazon.com)


THE KNOWLEDGE DISCOVERY PROCESS
Data mining is an important step in the KDD (knowledge discovery in databases) process.
[Figure: the KDD process. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

THE KDD PROCESS
[Figure: detailed steps of the KDD process, in the order they appear]
- Data warehousing: data is organized by function
- Create/select the target database
- Select a sampling technique and sample data
- Fill in missing values
- Remove noise from the data
- Normalize values
- Transform values
- Create derived attributes
- Find important attributes and value ranges
- Select the data mining task
- Transform to a different representation
- Select the data mining method
- Extract knowledge
- Test knowledge
- Refine knowledge
- Generate queries and reports
- Advanced methods: aggregation and sequences
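Three of the preparation steps named in the KDD figure (filling in missing values, normalizing values) are easy to illustrate. The following is a minimal sketch, not the method used in the lecture: it assumes mean imputation and min-max scaling, which are only two of many possible choices, and the `income` column is hypothetical.

```python
from statistics import mean

def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    avg = mean(observed)
    return [avg if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw column with one missing value.
income = [30.0, None, 50.0, 70.0]
cleaned = fill_missing(income)            # [30.0, 50.0, 50.0, 70.0]
normalized = min_max_normalize(cleaned)   # [0.0, 0.5, 0.5, 1.0]
```

Mean imputation keeps the column average unchanged but shrinks its variance, so in practice the choice of imputation method depends on the mining task that follows.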

ARCHITECTURE OF A TYPICAL DATA MINING SYSTEM
[Figure: system layers, from top to bottom]
- Graphical user interface
- Pattern evaluation
- Data mining engine
- Knowledge-base
- Database or data warehouse server
- Data cleaning & data integration; Filtering
- Databases; Data Warehouse


MAIN DATA MINING TASKS
Predictive:
Use some variables to predict unknown or future values of other variables
- Classification
- Regression
- Change/deviation detection
Descriptive:
Find patterns that describe the data and that humans can understand
- Clustering
- Summarization
- Dependency modeling

MAIN DATA MINING TASKS
- Classification: discovers descriptions of a set of predefined classes and assigns each data item to one of those classes.
- Regression: maps a data item to a real-valued prediction variable.
- Change/deviation detection: discovers the most significant changes in the data.
- Clustering: identifies a finite set of groups or clusters that describe the data.
- Dependency modeling: finds a model that describes the most significant dependencies between variables.
- Summarization: finds a compact description of a subset of the data.

CLASSIFICATION EXAMPLE
Verizon Wireless:
- The largest provider of wireless devices and services in the US (www.verizonwireless.com)
- Customers: 65.7 million (end of 2007)
- Annual revenue: $43.9 billion
Problem:
- High customer churn rate: 2% per month (about 1,300,000 customers leaving per month)
- Replacement cost: hundreds of millions of dollars per year
- Average cost of acquiring each new customer: $320

CLASSIFICATION EXAMPLE
Conventional solution:
- Make offers and promotions to all customers before their contracts expire
- Far too expensive and wasteful
Data mining solution:
- Build a predictive model
- Use the model to identify the customers who are likely to leave
- Then:
  o Offer promotions (e.g., a new phone) to the customers most likely to leave
  o Develop new plans that meet customer needs
- Result: the churn rate fell below 1.5% per month
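The figures quoted for this example can be checked with a few lines of arithmetic. This is a rough sketch under simplifying assumptions that are not in the slides: the customer base stays constant at 65.7 million, every lost customer is replaced at the average cost of $320, and the improved churn rate is taken as exactly 1.5% (the slide says "below 1.5%").

```python
customers = 65_700_000        # subscribers at the end of 2007 (from the slide)
cost_per_new_customer = 320   # average acquisition cost in USD (from the slide)

def monthly_churn(rate):
    """Number of customers lost per month at the given monthly churn rate."""
    return round(customers * rate)

lost_before = monthly_churn(0.02)     # 1,314,000 (the slide rounds to 1,300,000)
lost_after = monthly_churn(0.015)     # 985,500
retained = lost_before - lost_after   # 328,500 fewer customers lost per month
# Assumption: each retained customer saves one replacement acquisition.
replacement_cost_avoided = retained * cost_per_new_customer  # 105,120,000 USD per month
```

Under these assumptions, half a percentage point of churn corresponds to roughly 330,000 customers per month, which is why even an imperfect predictive model can pay for itself.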

CLASSIFICATION EXAMPLE
[Figure: Training data (customer characteristics and cell phone usage behavior) -> Model/Pattern. Consumer i -> Model -> probability that the customer would terminate the contract.]
The model is used to infer the probability that a customer would leave.
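To make the idea of "a model that outputs the probability a customer would leave" concrete, here is a minimal sketch using logistic regression trained by gradient descent. The slides do not say which model was used, so logistic regression is an assumption chosen for illustration, and the two features (dropped calls per month, years as a customer) and all data values are hypothetical.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log loss."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            err = p - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

def churn_probability(w, x):
    """Estimated probability that a customer with features x leaves."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

# Hypothetical training data: (dropped calls per month, years as customer); label 1 = left.
train_x = [(8, 0.5), (7, 1.0), (6, 0.5), (1, 4.0), (0, 5.0), (2, 3.0)]
train_y = [1, 1, 1, 0, 0, 0]
weights = train_logistic(train_x, train_y)

p_risky = churn_probability(weights, (7, 0.5))  # many dropped calls, new customer
p_loyal = churn_probability(weights, (1, 4.5))  # few dropped calls, long-time customer
```

Customers can then be ranked by this probability so that retention offers go only to those at the top of the list.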

CLASSIFICATION: APPLICATION 1
Fraud detection:
- Goal: predict fraudulent cases in credit card transactions
- Approach:
  o Use credit card transactions and cardholder information as attributes: what the customer buys, when, and how often the card is used
  o Label past transactions as fraudulent or legitimate; this label becomes the class attribute
  o Build a model for the class of the transactions
  o Use the model to detect fraud in credit card transactions

CLASSIFICATION: APPLICATION 2
Advertising:
- Goal: reduce mailing costs by targeting the group of customers most likely to buy a new mobile phone product
- Approach:
  o Use the data for a similar product introduced earlier
  o Use the {buy, don't buy} decision as the class attribute
  o Collect personal, lifestyle, and relationship information about all customers
  o Use this information as input data to build a classification model

CLUSTERING: Illustration
Clustering based on Euclidean distance in 3-D space
- Intracluster distances are minimized
- Intercluster distances are maximized
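One common way to pursue both goals with Euclidean distance is k-means. The slide does not name a specific algorithm, so k-means is an assumption here, and the 3-D points and initial centroids below are hypothetical. A minimal sketch:

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 3-D points forming two well-separated groups.
points = [(0, 0, 0), (1, 0, 1), (0, 1, 1), (9, 9, 9), (10, 9, 10), (9, 10, 10)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```

With these inputs the first three points end up in one cluster and the last three in the other. Real data usually needs several random restarts, because the result of k-means depends on the initial centroids.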

CLUSTERING: APPLICATION 1
Customer segmentation:
- Goal: divide customers into distinct groups/clusters so that different marketing measures can be applied to each
- Approach:
  o Collect personal and lifestyle information about all customers
  o Identify clusters of similar customers
  o Check cluster quality by comparing the buying patterns of customers in the same cluster with those of customers in other clusters

CLUSTERING: APPLICATION 2
Document clustering:
- Goal: find groups of similar documents based on the important terms they contain
- Approach: determine how frequently each term occurs in each document, then build a similarity measure from the term frequencies and use it to cluster.
- Benefit: in information retrieval (IR), the clusters can be used to relate a new document to the documents already clustered
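A similarity measure built from term frequencies can be sketched as follows. Cosine similarity over raw term counts is an assumption made for illustration (the slide only asks for "a similarity measure based on term frequency"), and the three short documents are hypothetical.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (0 = no shared terms)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc_forest = term_frequencies("amazon rainforest river trees")
doc_jungle = term_frequencies("amazon river jungle trees")
doc_shop = term_frequencies("amazon online shop books")

sim_nature = cosine_similarity(doc_forest, doc_jungle)  # 0.75: three of four terms shared
sim_mixed = cosine_similarity(doc_forest, doc_shop)     # 0.25: only "amazon" shared
```

The two rainforest documents are much closer to each other than either is to the shopping document, so a distance-based clustering method would separate the two meanings of "Amazon".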

Clustering S&P 500 stock data
- Observe the daily movements of stock prices
- Data: stock-{UP/DOWN}
- Similarity measure: events that frequently occur together on the same day

Discovered Clusters (with Industry Group)
1. Technology1-DOWN: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
2. Technology2-DOWN: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
3. Financial-DOWN: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
4. Oil-UP: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

ASSOCIATION RULE MINING

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

[Figure: Venn diagram of "Customer buys beer" and "Customer buys diaper", overlapping in "Customer buys both"]

Itemset X = {x1, ..., xk}
Find relationships among attributes that frequently occur together, written as rule (support, confidence):
A -> C (50%, 66.7%)
C -> A (50%, 100%)
Example: buy diapers on Friday night -> then buy beer
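The two percentages attached to each rule (support, then confidence) can be recomputed from the four transactions in the table. A minimal sketch, using the standard definitions: support is the fraction of transactions containing all items of the rule, and confidence is that support divided by the support of the left-hand side. The function names are illustrative.

```python
# The four transactions from the slide.
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_a_c = (support({"A", "C"}), confidence({"A"}, {"C"}))  # support 0.5, confidence 2/3
rule_c_a = (support({"A", "C"}), confidence({"C"}, {"A"}))  # support 0.5, confidence 1.0
```

Both rules share the same support because they involve the same itemset {A, C}; only the confidence depends on the direction of the rule.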

Association rule mining: APPLICATION 1
Supermarket shelf management:
- Goal: identify the items that many customers buy together
- Approach: process sales data to find the relationships among items
- Classic rule: if a customer buys diapers and milk, the customer is likely to buy beer.

Association rule mining: APPLICATION 2
Inventory management:
- Goal: a consumer appliance repair company wants to anticipate the causes of repairs to its consumer products and equip its service vehicles with the necessary parts, so as to reduce the number of visits to customers' homes
- Approach: process the data on the tools and parts required in previous repairs to find co-occurrence patterns.

REGRESSION
Predict the value of a variable based on the values of other variables
Examples:
- Predict the sales volume of a new product based on advertising expenditure
- Predict wind speed as a function of temperature, humidity, air pressure, ...
- Predict stock market indices
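The first example (sales as a function of advertising spend) can be sketched with a one-variable least-squares line. The data below is hypothetical and was constructed to lie exactly on a line, so the fitted coefficients are easy to verify by hand; real data would contain noise and usually more than one predictor.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical data: sales = 10 + 2 * ad_spend exactly.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(ad_spend, sales)   # 2.0, 10.0
predicted_sales = intercept + slope * 5.0      # 20.0 for an ad spend of 5
```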

Deviation/Anomaly Detection
Identify significant deviations from normal behavior
Applications:
- Credit card fraud detection
- Network intrusion detection
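A very simple way to flag "significant deviation from normal behavior" is a z-score rule: mark any value that lies more than a chosen number of standard deviations from the mean. This is only an illustrative sketch, not how production fraud or intrusion detection works; the transaction amounts and the threshold of 2.0 are hypothetical.

```python
from statistics import mean, pstdev

def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score (distance from the mean in standard deviations) exceeds threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; one is far outside the usual range.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]
suspicious = find_anomalies(amounts)  # [500]
```

The extreme value itself inflates the mean and standard deviation, which is why a low threshold is needed on such a small sample; robust statistics such as the median are often preferred for this reason.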


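DATA MINING TECHNIQUES
- Data mining draws ideas from fields such as machine learning, statistics, pattern recognition, and database systems
- Traditional techniques may be unsuitable because of:
  o The large size of the data
  o The high dimensionality of the data
  o The heterogeneous nature of the data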

DATA MINING COMBINES MANY METHODS
[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, and Other Disciplines]


CHALLENGES IN DATA MINING
Source: http://www.cs.uvm.edu/~icdm/10Problems/index.shtml (ICDM, 2005-2006)
1. Developing a Unifying Theory of Data Mining
2. Scaling Up for High Dimensional Data and High Speed Data Streams
3. Mining Sequence Data and Time Series Data
4. Mining Complex Knowledge from Complex Data
5. Data Mining in a Network Setting
6. Distributed Data Mining and Mining Multi-agent Data
7. Data Mining for Biological and Environmental Problems
8. Data-Mining-Process Related Problems
9. Security, Privacy and Data Integrity
10. Dealing with Non-static, Unbalanced and Cost-sensitive Data

WHY STUDY DATA MINING?
The groups discuss this question and come up with their own answers.

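SUMMARY
- Data mining discovers useful, previously unknown patterns in large volumes of data
- The knowledge discovery process (KDD):
  o Data collection and preprocessing -> data mining -> pattern evaluation -> knowledge representation
- Mining can be applied to many kinds of data and information
- Kinds of patterns to mine:
  o Association rules, sequential patterns, classification, clustering, rare patterns, special patterns, deviations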

The development of data mining
- 1989: IJCAI Workshop on Knowledge Discovery in Databases
  o Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
- 1991-1994: Workshops on Knowledge Discovery in Databases
  o Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
- 1995-1998: International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
  o Journal of Data Mining and Knowledge Discovery (1997)
- ACM SIGKDD conferences since 1998, and SIGKDD Explorations
- Many other conferences on data mining
  o PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), ...
- ACM Transactions on KDD since 2007

Group exercise 1
Discussion time: 15 minutes
- Discuss a data mining scenario within the group; one representative presents on behalf of the group.
- Presentation time: at most 3 minutes
  o Present the scenario
  o Present the approach and its benefits

Scenario 1: Retail market (for example, sales revenue needs to increase)
Group:
Hints:
- What kind of data is collected? Which data mining task should be used?
- What information do we need to know about the customers?
- Do we need to know which items customers buy?
- Do customers need to be classified? ...

Scenario 2: Product advertising (for example, choosing the form and the target audience of the advertising in order to reduce costs and increase profit)
Group:
Hints:
- What data needs to be collected? Which data mining task should be used?
- Is it necessary to send product advertisements to all customers, or only to a selected group?
- Can the likelihood of customer response be estimated and compared with the cost of sending the advertisements?

GROUP ASSIGNMENT
- All groups will post the results of their group discussion on the course website (in the discussion forum section)
- Posting deadline: 23:00, 8/08/2011

TO DO
1. Post group exercise 1
   - All groups will post the results of their group discussion on the course website (in the discussion forum section)
   - Posting deadline: 23:00, 1/08/2011
2. Prepare lesson 2: The data preparation process
   - Read the group exercise for chapter 2: issues that arise when working with real-world data.
   - How to proceed:
     o Study the slides and look at the examples.
     o Consult the Internet and the reference materials

REFERENCES
- G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U.M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996
- http://vi.wikipedia.org/wiki/Khai_ph%C3%A1_d%E1%BB%AF_li%E1%BB%87u (Wikipedia, the free encyclopedia)
- J. Han, M. Kamber, Chapter 1, Data Mining: Concepts and Techniques
- P.-N. Tan, M. Steinbach, V. Kumar, Chapter 1, Introduction to Data Mining

EXERCISES
1. What is data mining? Give an illustrative example.
2. What kinds of data and information can be used in the KDD process?
3. Give a real-world example in which applying data mining led to business success (other than the examples already given in the lecture).

Hints:
- Consider the problem of increasing revenue in the retail market, or the problem of planning advertising and promotions
- What kind of data is collected? Which data mining task is used? Could it be replaced by database queries or simple statistical analysis?
