Clustering

City University of Hong Kong
Department of Computer Science
BSCS Final Year Project Report 2003-2004
(03CS064)
Data mining for slope risky evaluation by use of clustering algorithm
(Volume 1 of 1)
Student Name : Chan Tat Wing

Student No. : 96004020
Programme Code : BSCS2
Supervisor : Dr. Joseph Fong
1st Reader : Tao, Yufei
2nd Reader : Lee, C H

Final Report - Data mining for slope risky evaluation by use of clustering algorithm
Extended Abstract
This project applies data mining clustering algorithms to the raw data of a slope database in order to investigate the factors that affect slope risk. In the case study, I attempt to derive business rules for predicting the risk of a slope.
A clustering algorithm for data mining should meet two main standards: efficiency and scalability. In this project, I study and examine the clustering algorithms described in the reference book, and then select a suitable clustering algorithm for analyzing the slope database in order to find the critical factors that affect slope safety.
I use K-Means Clustering to investigate the relationship between slope risk and slope attributes such as the height, length and angle of the slope. K-Means Clustering is a non-hierarchical approach and one of the simplest and most basic methods of producing clusters. To use it, I first predefine the number of clusters, k; the algorithm then partitions the data iteratively until the desired clusters are formed and the cluster centers are calculated. I call this the supervised approach to getting clusters from the raw data. In addition, I further develop the CityMiner so that it can find the optimum clusters from the raw data, and I call this the Non-supervised approach. For the Non-supervised approach, we try different values of k from 1 to n, where n is the number of desired clusters we want to get. We then measure the average distance mean for each value of k and take as the optimum k the value at which the change in the average distance mean is less than 5%. As a result, we get the optimum clusters in an automated manner.
In the case study, we use both the supervised and the non-supervised approach to investigate the factors that affect slope risk. The factors investigated are those of slope geometry: slope height, slope length and slope angle. We believe that the level of slope risk is highly related to these geometry factors, so we use the CityMiner to perform Non-supervised clustering and obtain the optimal grouping from the prepared mining data. We then generate the clustering results for each geometry factor and define THREE rules from them. We conclude that when the height of the slope is below 9 meters, the slope risk score is directly proportional to the slope height, whereas when the height of the slope is over 9 meters, the slope risk score becomes more and more stable. In addition, when the angle of the slope is over 45 degrees, over 95% of the slopes are risky. However, we find that the slope length is not a dominant factor for slope risk.
In conclusion, not all the geometry factors affect the risk level of a slope. This project shows that slope height and slope angle are the dominant factors for slope risk, while slope length is not, no matter how long or short the slope is.
Acknowledgments
First of all, I would like to give special thanks to my supervisor, Dr. Joseph Fong, who not only spent much time on the project from beginning to end but also provided many valuable comments, innovative ideas and shared experience during the project development. He played a very important role in the success of this final year project. I will remember what I have learnt from him and from this project, and I hope I can contribute my knowledge to society in the near future.
In addition, I would like to take this chance to thank my workmate, Joe, and my friend, Lu, who shared their experience in data mining with me. I also thank all the mining analysts who have contributed their knowledge and experience on the Internet so that beginners can learn what data mining is.
Lastly, I have to thank my girlfriend, Woody So, who not only gave me the greatest support but also put up with my neglect during the development of this final year project.
Table of Contents
EXTENDED ABSTRACT
ACKNOWLEDGEMENT
1.1 PROJECT AIM
1.2 PROJECT OBJECTIVE
1.3 PROJECT WORK
2.1 DATA MINING HISTORY
2.2 CLUSTERING
2.3 GENERAL PROCEDURE TO PERFORM DATA MINING
2.4 DATA GATHERING
2.5 DATA CLEANING
2.6 DATA TRANSFORMATION
2.6.1 NORMALIZATION BY DECIMAL SCALING
2.6.2 DISCRETIZATION OF SLOPE RISKY FACTOR VALUE
2.7 CLUSTERING ALGORITHM
2.7.1 CONCEPT FOR K-MEANS CLUSTERING
3.1 INTRODUCTION
3.2 DATA GATHERING
3.3 DATA CLEANING
3.4 DATA TRANSFORMATION
3.5 SUPERVISED CLUSTERING WITH K-MEANS ALGORITHM
3.6 NON-SUPERVISED CLUSTERING WITH K-MEANS ALGORITHM
4.1 SUPERVISED APPROACH IMPLEMENTATION
4.1.1 IMPLEMENTATION OF SLOPE HEIGHT FACTOR (SUPERVISED)
4.1.2 IMPLEMENTATION OF SLOPE LENGTH FACTOR (SUPERVISED)
4.1.3 IMPLEMENTATION OF SLOPE ANGLE FACTOR (SUPERVISED)
4.2 NON-SUPERVISED APPROACH IMPLEMENTATION
4.2.1 IMPLEMENTATION OF SLOPE HEIGHT FACTOR (NON-SUPERVISED)
4.2.2 RESULT FOR THE NON-SUPERVISED CLUSTERING OF SLOPE HEIGHT FACTOR
4.2.3 IMPLEMENTATION OF SLOPE LENGTH FACTOR (NON-SUPERVISED)
4.2.4 RESULT FOR THE NON-SUPERVISED CLUSTERING OF SLOPE LENGTH FACTOR
4.2.5 IMPLEMENTATION OF SLOPE ANGLE FACTOR (NON-SUPERVISED)
4.2.6 RESULT FOR THE NON-SUPERVISED CLUSTERING OF SLOPE ANGLE FACTOR
5.1 TECHNICAL SUMMARY
5.2 TECHNICAL IMPLEMENTATION
5.2.1 OVERALL SYSTEM ARCHITECTURE OF CITYMINER
5.2.2 FUNCTIONAL DESCRIPTION OF CITYMINER
5.2.2.1 Data Preparation Menu
5.2.2.2 Data Mining Menu
5.2.2.2.1 The Supervised Approach
5.2.2.2.2 The Non-Supervised Approach
5.2.2.3 Data Result Analysis
6.1 PROJECT REVIEW
6.2 PROJECT ACHIEVEMENTS
6.3 PROJECT FUTURE EXTENSIONS
APPENDIX A: PROJECT REFERENCE
APPENDIX B: PROGRESS REPORT SUMMARY
APPENDIX C: INSTALLATION GUIDE FOR CITYMINER
Chapter 1:
General Information
1.1 Project Aim
In the past two decades, both the government and the citizens of Hong Kong have become more and more aware of the safety of slopes near buildings, roads and other facilities. This may be due to several serious landslide accidents that happened in the past. When such a landslide occurs, it can cause not only the loss of property but also the loss of human life. To minimize the chance of such tragedies, most government departments have developed a set of policies and standards to monitor and inspect slope conditions periodically (e.g. inspecting each slope every five years). For an engineering inspection, the engineer goes to the site for observation and measurement, and the collected slope data are then stored in a central database, from which the Engineering Inspection Report for the slope and statistical reports are later generated. In addition, various government departments such as the Lands Department (LD), Water Supplies Department (WSD), Drainage Services Department (DSD) and so on can use the central database to understand the safety or risk condition of each individual slope, or even to draw up a priority scheme for arranging and managing slope maintenance works.
From the engineering point of view, professional engineers perform slope safety or risk analysis based on their experience and some pre-defined engineering rules. However, the database may hide useful information or patterns for predicting the risk of a slope. In order to capture these hidden signals, this project uses data mining techniques to find rules related to slope risk.
1.2 Project Objective
This project applies data mining clustering algorithms to the raw data of the slope database in order to investigate the various factors affecting slope risk. In the implementation stage, I will attempt to find some business rules for predicting the risk of a slope.
1.3 Project Work
The project is divided into three development phases. The first phase is the Initial Study stage, in which I study and develop the clustering algorithms and summarize three or four critical factors to be applied in the analyst module. The second phase is the implementation stage, in which I develop an analyst module that implements the agreed clustering algorithms. The third phase is the presentation phase, in which I extract rules based on the results of the data mining algorithm.
Chapter 2:
Study Review
2.1 Data Mining History
In the past two decades, more and more companies around the world have changed their mode of business from traditional manual paperwork to computerized systems that let users perform their work more efficiently and effectively. Information sharing has also become more and more important to business decisions in marketing, accounting and so on. As a result of this global computerization, many different databases and inventories have been produced for different businesses. However, most company decision makers are not interested in individual data records or data items; they pay attention only to statistical reports or summaries such as the financial report or the sales trend report. Even these cannot fulfil all the needs of decision makers, who need to understand more advanced information, and so data mining technology has been used to investigate the hidden information in existing data inventories. Basically, data mining can be divided into two main models. One is “Predictive”, which can be sub-divided into four groups: Classification, Regression, Time series analysis and Prediction. The other is “Descriptive”, which can also be sub-divided into four groups: Clustering, Summarization, Association rules and Sequence discovery. In this project, I mainly examine data mining with the clustering technique in order to investigate rules for predicting slope risk.
2.2 Clustering
Clustering is a data mining technique in which grouping is accomplished by finding similarities between data according to characteristics found in the actual data. It divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other. When the categories are unspecified, it is referred to as unsupervised learning; when the categories are specified, it is referred to as supervised learning. Generally, this approach assigns records with a large number of attributes to a relatively small set of groups or clusters, so that the specified or unspecified grouping is generated. It is often one of the first steps in data mining analysis and is now used in many application domains, including biology, medicine and marketing.
2.3 General procedure to perform data mining
Stage     Procedure Name        Procedure Description
Stage 1   Data gathering        To gather the data elements from the data warehousing.
Stage 2   Data cleansing        To eliminate errors and/or undesired data items.
Stage 3   Data transformation   To obtain the interesting attributes of the data.
Stage 4   Data mining           To discover the hidden pattern of the data set.
Stage 5   Data presentation     To conclude the business rules from the mining result.
Stage 6   Results evaluation    To compare the data mining results generated from your approach with other results generated by other approaches.
2.4 Data Gathering
In this project, I obtained the slope database inventory, in Microsoft Access 2000 MDB file format, from one of the government departments as the data source for the mining algorithm. After the data preparation stage, all the quality data are put into the data warehouse so as to maximize the efficiency of query processing over cleansed and integrated data, and so that data can be retrieved and analyzed quickly and easily during the mining process.
2.5 Data Cleaning
From the quality point of view, data cleaning plays an important role in the data mining process, because quality decisions must be based on quality data: the higher the quality of the data, the more accurate the mining results. It is therefore important to fill in missing values, smooth noisy data, and identify and remove outliers from the data source in order to ensure the quality of the mining result. In this project, I have around 5000 records as sample data for clustering, and I remove all records with a NULL or ZERO attribute value.
2.6 Data Transformation
In this project, I select decimal scaling as the normalization method so that the attribute data fall within a small, specified range. Decimal scaling is simple and commonly used for normalization.
2.6.1 Normalization by Decimal Scaling
Decimal scaling uses the formula v' = v / 10^j.
It moves the decimal point of v by j positions, where j is the minimum number of positions such that the maximum absolute value of v' falls in [0..1]. For example, if v ranges between 0.56 and 999.99 and we set j equal to 3, then v' ranges between 0.00056 and 0.99999.
In this project, I apply this formula to the attributes in the data warehouse for use in the data mining process. As a result, all values of the slope risk score fall in the range from 0 to 1.
For example, for the Slope Risky Score factor, I set j equal to 3 and the formula becomes S' = S / 10^3.
Original Value of Slope Risky Score, S    Mining Value of Slope Risky Score, S'
999.999                                   0.999999
89.87279                                  0.08987279
9.7083                                    0.0097083
0.816                                     0.000816
Table 2.1 Changes after applying decimal scaling to the sample clustering data
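The decimal scaling rule can be sketched in Python. This is a minimal illustration rather than the CityMiner implementation: the function names are hypothetical, the sample values are three of the original scores from Table 2.1, and j is chosen as the smallest exponent for which every scaled absolute value is below 1.

```python
def decimal_scaling_exponent(values):
    """Smallest non-negative j such that max(|v|) / 10**j is below 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return j


def decimal_scale(values, j=None):
    """Apply v' = v / 10**j; j is derived from the data when not given."""
    if j is None:
        j = decimal_scaling_exponent(values)
    return [v / 10 ** j for v in values]


# Three original slope scores taken from Table 2.1.
scores = [999.999, 9.7083, 0.816]
j = decimal_scaling_exponent(scores)
scaled = decimal_scale(scores, j)
print(j, scaled)
```

For these sample scores the derived exponent is 3, which matches the value of j used in the text.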
2.6.2 Discretization of Slope Risky Factor Value
In this project, I have selected a simple and common technique, the “Binning method”, to discretize the slope risk factor values. The Binning method converts continuous data to discrete data by replacing a value from a continuous range with a bin identifier, where each bin represents a range of values. For example, age could be converted to bins such as 20 or under, 21-40, 41-65 and over 65. In this project, I convert the slope risk score into three ranges of values (high risk slope, mid risk slope and low risk slope) by using the Equal-width approach.
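As an illustration of equal-width binning, the Python sketch below divides the range between the smallest and largest value into bins of identical width and labels each value with its bin. The function name, the label strings and the sample scores are assumptions made for this example and are not taken from the slope database.

```python
def equal_width_bins(values, labels):
    """Assign each value to one of len(labels) bins of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / len(labels)
    binned = []
    for v in values:
        if width == 0:
            index = 0  # all values identical: everything falls in the first bin
        else:
            # The maximum value would land one past the last bin, so clamp it.
            index = min(int((v - low) / width), len(labels) - 1)
        binned.append(labels[index])
    return binned


# Hypothetical normalized scores, for illustration only.
scores = [0.0, 0.1, 0.4, 0.5, 0.8, 0.9]
labels = ["low risk", "mid risk", "high risk"]
print(equal_width_bins(scores, labels))
```

With three labels, each bin here covers one third of the range from 0.0 to 0.9.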
2.7 Clustering Algorithm
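In this project, I have chosen K-Means Clustering as the algorithm to investigate the relationship between slope risk and slope attributes such as the height, length and angle of the slope. To use K-Means Clustering, I first predefine the number of clusters, k; the algorithm then partitions the data iteratively until the desired clusters are formed and the cluster centers are calculated.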
2.7.1 Concept for K-Means Clustering
Basically, the K-Means algorithm can be divided into four steps for implementation:
Step 1 Define the number of clusters (k) and designate a cluster center for each cluster.
Step 2 Assign each data item to the cluster whose center is at the minimum distance from it.
Step 3 Recalculate the centroid’s position for each cluster every time a data item is added to the cluster.
Step 4 Repeat Step 2 and Step 3 until all the data items are grouped into the final required number of clusters (k) and the cluster centers no longer change.
The K-means algorithm running procedure

Input:
    D = {t1, t2, …, tn}    //Set of data items
    k                      //Number of desired clusters
Output:
    K                      //Set of clusters
K-Means algorithm:
    assign initial values for means m1, m2, …, mk;
    repeat
        assign each item ti to the cluster which has the closest mean;
        calculate new mean for each cluster;
    until convergence criteria is met;
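The procedure above can be sketched in Python as follows. This is a minimal sketch for two-dimensional points such as (attribute, score) pairs, not the CityMiner code: the function name k_means, the max_iterations parameter and the toy points are assumptions for illustration. It follows the batch form of the pseudocode, recalculating each mean after all items have been assigned, and it treats "no item changes cluster" as the convergence criterion.

```python
import math


def k_means(points, initial_centres, max_iterations=100):
    """Cluster points around k centres; returns (centres, assignments)."""
    centres = [tuple(c) for c in initial_centres]
    assignments = None
    for _ in range(max_iterations):
        # Assign each item to the cluster which has the closest mean.
        new_assignments = [
            min(range(len(centres)), key=lambda i: math.dist(p, centres[i]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # convergence: no item changed cluster
        assignments = new_assignments
        # Calculate the new mean for each cluster from its current members.
        for i in range(len(centres)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # an empty cluster keeps its previous centre
                centres[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centres, assignments


# Toy points, for illustration only.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centres, assignments = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centres, assignments)
```

The initial centres are passed in explicitly, which mirrors the way a data item is chosen as the starting center of each cluster in the worked examples of this report.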
Chapter 3:
Approach and Methodology
3.1 Introduction
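In this project, I use K-Means Clustering as the algorithm to investigate the relationship between slope risk and slope attributes such as the height, length and angle of the slope. With K-Means Clustering, the number of clusters, k, is predefined, and the algorithm partitions the data iteratively until the desired clusters are formed and the cluster centers are calculated. The detailed concept and approach for this algorithm are discussed in the following sections.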
3.2 Data Gathering
In this project, we obtained the slope inventory from one of the government departments; however, only several of its attributes contribute useful values to the mining process. I therefore run the following SQL statements in the Microsoft Access 2000 Queries Wizard to extract all the useful data into our data schema and eventually store it in the data warehouse, ready for the mining process:
SELECT SLP_NO_PK AS ID1, SLP_SCORE AS SLOPE_SCORE INTO TMP_TABLE_1
FROM SLOPE;

SELECT MM_SLP_NO_FK AS ID1, MAX(MM_EI_DATE_K) AS ID2 INTO TMP_TABLE_2A
FROM MM
GROUP BY MM_SLP_NO_FK;

SELECT MM_SLP_NO_FK AS ID1, MM_EI_DATE_K AS ID2,
Val(MM_CONSEQ_CAT) AS SLOPE_SAFETY_CATEGORY INTO TMP_TABLE_2
FROM MM, TMP_TABLE_2A
WHERE MM.MM_SLP_NO_FK = TMP_TABLE_2A.ID1
AND MM.MM_EI_DATE_K = TMP_TABLE_2A.ID2;

SELECT MMST_SLP_NO_FK AS ID1, MAX(MMST_EI_DATE_FK) AS ID2 INTO TMP_TABLE_3A
FROM MM_SLOPTECH
GROUP BY MMST_SLP_NO_FK;

SELECT MMST_SLP_NO_FK, MMST_EI_DATE_FK, MMST_HEIGHT,
MMST_LENGTH, MMST_ANGLE INTO TMP_TABLE_3B
FROM MM_SLOPTECH, TMP_TABLE_3A
WHERE MM_SLOPTECH.MMST_SLP_NO_FK = TMP_TABLE_3A.ID1
AND MM_SLOPTECH.MMST_EI_DATE_FK = TMP_TABLE_3A.ID2;

SELECT MMST_SLP_NO_FK AS ID1, MMST_EI_DATE_FK AS ID2,
AVG(MMST_HEIGHT) AS SLOPE_HEIGHT, AVG(MMST_LENGTH) AS SLOPE_LENGTH,
AVG(MMST_ANGLE) AS SLOPE_ANGLE INTO TMP_TABLE_3
FROM TMP_TABLE_3B
GROUP BY MMST_SLP_NO_FK, MMST_EI_DATE_FK;

SELECT TMP_TABLE_2.ID1, TMP_TABLE_2.SLOPE_SAFETY_CATEGORY,
TMP_TABLE_3.SLOPE_HEIGHT, TMP_TABLE_3.SLOPE_LENGTH,
TMP_TABLE_3.SLOPE_ANGLE INTO TMP_TABLE_4
FROM TMP_TABLE_2, TMP_TABLE_3
WHERE TMP_TABLE_2.ID1 = TMP_TABLE_3.ID1
AND TMP_TABLE_2.ID2 = TMP_TABLE_3.ID2;

SELECT TMP_TABLE_1.ID1, TMP_TABLE_4.SLOPE_SAFETY_CATEGORY,
Round(TMP_TABLE_1.SLOPE_SCORE,4) AS SLOPE_SCORE,
CInt((TMP_TABLE_4.SLOPE_HEIGHT) * 100) / 100 AS SLOPE_HEIGHT,
CInt((TMP_TABLE_4.SLOPE_LENGTH) * 100) / 100 AS SLOPE_LENGTH,
CInt((TMP_TABLE_4.SLOPE_ANGLE) * 100) / 100 AS SLOPE_ANGLE INTO MINING
FROM TMP_TABLE_1, TMP_TABLE_4
WHERE TMP_TABLE_1.ID1 = TMP_TABLE_4.ID1;
At the end of data gathering, we create a data schema named MINING with the following data structure, which contains all the values needed for the mining process. I store this schema and its data in a Microsoft Access 2000 database file named “ClusterDB.mdb”, which serves as the data warehouse.
No.  Field Name             Field Description        Type      Comment
1    ID1                    Identity                 CHAR(13)
2    SLOPE_SAFETY_CATEGORY  Slope Safety Category    INTEGER   “1” or “2” or “3”
3    SLOPE_SCORE            Slope Risky Score Index  DOUBLE    Range from 0 to 1
4    SLOPE_HEIGHT           Slope Height             DOUBLE    Range from 0 to 1
5    SLOPE_LENGTH           Slope Length             DOUBLE    Range from 0 to 1
6    SLOPE_ANGLE            Slope Angle              DOUBLE    Range from 0 to 1
Table 3.1 The table structure of the data mining schema.
3.3 Data Cleaning
In the previous chapter, we discussed the importance of data cleaning. I now run the following SQL statements in the Microsoft Access 2000 Queries Wizard to remove all the dirty data:
DELETE *
FROM MINING
WHERE SLOPE_SCORE = 0 OR SLOPE_SCORE IS NULL;
DELETE *
FROM MINING
WHERE SLOPE_SAFETY_CATEGORY NOT IN(1,2,3)
OR SLOPE_SAFETY_CATEGORY IS NULL;
DELETE *
FROM MINING
WHERE SLOPE_HEIGHT = 0 OR SLOPE_HEIGHT IS NULL;
DELETE *
FROM MINING
WHERE SLOPE_LENGTH = 0 OR SLOPE_LENGTH IS NULL;
DELETE *
FROM MINING
WHERE SLOPE_ANGLE = 0 OR SLOPE_ANGLE IS NULL;
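The same cleaning rules can be expressed outside Access. The sketch below is an illustrative Python equivalent, assuming each MINING row has been loaded as a dictionary keyed by the field names of the MINING schema; the function name is_clean and the sample rows are hypothetical.

```python
NUMERIC_FIELDS = ("SLOPE_SCORE", "SLOPE_HEIGHT", "SLOPE_LENGTH", "SLOPE_ANGLE")


def is_clean(record):
    """Mirror the DELETE rules: reject NULL or zero values and unknown categories."""
    if record.get("SLOPE_SAFETY_CATEGORY") not in (1, 2, 3):
        return False
    return all(record.get(field) not in (None, 0) for field in NUMERIC_FIELDS)


# Hypothetical rows, for illustration only.
rows = [
    {"ID1": "A", "SLOPE_SAFETY_CATEGORY": 1, "SLOPE_SCORE": 8.81,
     "SLOPE_HEIGHT": 15.0, "SLOPE_LENGTH": 80.0, "SLOPE_ANGLE": 55.0},
    {"ID1": "B", "SLOPE_SAFETY_CATEGORY": 2, "SLOPE_SCORE": 0,
     "SLOPE_HEIGHT": 5.0, "SLOPE_LENGTH": 30.0, "SLOPE_ANGLE": 30.0},
    {"ID1": "C", "SLOPE_SAFETY_CATEGORY": None, "SLOPE_SCORE": 4.7,
     "SLOPE_HEIGHT": 4.0, "SLOPE_LENGTH": 26.0, "SLOPE_ANGLE": 60.0},
    {"ID1": "D", "SLOPE_SAFETY_CATEGORY": 3, "SLOPE_SCORE": 0.8,
     "SLOPE_HEIGHT": 10.0, "SLOPE_LENGTH": 33.0, "SLOPE_ANGLE": None},
]
clean_rows = [row for row in rows if is_clean(row)]
print([row["ID1"] for row in clean_rows])
```

Only the first row passes all five rules; the others fail on a zero score, a missing category and a missing angle respectively.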
3.4 Data Transformation
In the previous chapter, we also proposed Decimal Scaling as the method to make the values used in data mining fall within a definite range such as 0 to 1. I now run the following SQL statements in the Microsoft Access 2000 Queries Wizard to implement this approach:
UPDATE MINING SET SLOPE_SCORE = SLOPE_SCORE/(10*10*10);
UPDATE MINING SET SLOPE_HEIGHT = SLOPE_HEIGHT/(10*10*10);
UPDATE MINING SET SLOPE_LENGTH = SLOPE_LENGTH/(10*10*10);
UPDATE MINING SET SLOPE_ANGLE = SLOPE_ANGLE/(10*10*10);
Finally, we have successfully extracted 5084 slope records into the data warehouse, in the schema MINING, as sample data for the mining process.
SELECT COUNT(*) FROM MINING WHERE SLOPE_SAFETY_CATEGORY = 1;
SLOPE_SAFETY_CATEGORY  Description          Quantity in DB  % in Data Set
1                      Most Danger Slope    422             422/5084*100% = 8.30%
2                      Medium Danger Slope  689             689/5084*100% = 13.55%
3                      Less Danger Slope    3973            3973/5084*100% = 78.15%
By using the Binning method with the equal-width approach, we sort the slope risk score values in ascending order and take data records 1 to 3973 as the first range of values and records 3974 to 3973 + 689 = 4662 as the second range of values. Finally, we take records 4663 to 5084 as the third range of values.
SLOPE_SCORE (FROM)  SLOPE_SCORE (TO)  Comments
0.0000047           0.001672          Low Risky Slope
0.001672            0.0047487         Mid Risky Slope
0.00475             0.082673          High Risky Slope
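The record ranges behind these score boundaries follow from the category counts (3973, 689 and 422 records). A small Python sketch of that split is given below, assuming the scores are sorted in ascending order and cut into consecutive groups of those sizes; the function names and the toy scores are hypothetical.

```python
def rank_ranges(counts):
    """1-based, inclusive (first, last) record positions for each group."""
    ranges, first = [], 1
    for size in counts:
        ranges.append((first, first + size - 1))
        first += size
    return ranges


def split_by_counts(scores, counts):
    """Sort ascending and cut into consecutive groups of the given sizes."""
    if sum(counts) != len(scores):
        raise ValueError("group sizes must add up to the number of scores")
    ordered = sorted(scores)
    groups, start = [], 0
    for size in counts:
        groups.append(ordered[start:start + size])
        start += size
    return groups


# Record positions for the low, mid and high risk groups of the 5084 records.
print(rank_ranges([3973, 689, 422]))

# Toy scores, for illustration only.
print(split_by_counts([0.5, 0.1, 0.3, 0.2, 0.9, 0.4], [3, 2, 1]))
```

The lowest and highest score in each returned group give the FROM and TO boundaries of the corresponding bin.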
3.5 Supervised Clustering with K-Means Algorithm
First of all, I investigate the relationship between the Slope Height and the Slope Safety Score. Only 20 sample records from the slope database are included here for reference.
Data Point Slope ID # Slope Height, H Slope Safety Score, C Assigned Cluster
1 3SE-D/C 123 0.015 0.00881 B
2 3SE-D/C 124 0.004 0.004757 A
3 3SE-D/C 125 0.0034 0.000197 A
4 3SE-D/CR 70 0.004 0.005944 A
5 3SE-D/CR 73 0.01 0.000804 A
6 3SE-D/CR 121 0.011 0.000437 A
7 3SE-D/CR 126 0.0028 0.000151 A
8 3SE-D/F 24 0.013 0.00341 A
9 3SE-D/F 29 0.011 0.004729 A
10 3SE-D/F 30 0.0068 0.0000866 A
11 3SW-A/C 24 0.005 0 A
12 3SW-A/C 117 0.015 0.005949 B
13 3SW-A/C 118 0.01 0.004076 A
14 3SW-A/F 22 0.0065 0 A
15 3SW-A/F 76 0.005 0 A
16 3SW-B/C 102 0.017 0 C
17 3SW-B/C 158 0.008 0.000583 A
18 3SW-B/C 161 0.027 0.002744 C
19 3SW-B/C 169 0.01 0.000883 A
20 3SW-B/C 171 0.011 0.000815 A
Table 3.2 Sample data for clustering slope height factor.
Diagram 3.1 Plot of Slope Safety Score, C against Slope Height, H for the slope items, showing Clusters A, B and C (the result of clustering when k is equal to 3)
I find from the graph that the data set could be divided into three clusters, and one of the data items is chosen as the center for each cluster.
Cluster Name  Slope Height, H  Slope Safety Score, C
Cluster A     0.0068           0.0000866
Cluster B     0.015            0.00881048
Cluster C     0.027            0.00274446
For Cluster A, I find sixteen data items in it, so the new cluster centre for it is calculated from:
Slope Height, H  Slope Safety Score, C
0.004 0.00475745
0.0034 0.00019686
0.004 0.0059444
0.01 0.00080365
0.011 0.00043737
0.0028 0.00015054
0.013 0.003410207
0.011 0.004728996
0.0068 0.00008661
0.005 0
0.01 0.0040755
0.0065 0
0.005 0
0.008 0.00058286
0.01 0.00088258
0.011 0.00081532
Sum is 0.1215 Sum is 0.026872343
Therefore, the sum of the slope height and the sum of the slope safety score are each divided by 16, so the new cluster centre for Cluster A is [0.1215/16, 0.026872343/16] = [0.00759, 0.00168].
For Cluster B, I find two data items in it, so the new cluster centre for it is calculated from:
Slope Height, H  Slope Safety Score, C
0.015 0.00881048
0.015 0.00594852
Sum is 0.030 Sum is 0.014759
so the new cluster centre for Cluster B is [0.03/2, 0.014759/2] = [0.015, 0.0073795].
For Cluster C, I find two data items in it, so the new cluster centre for it is calculated from:
Slope Height, H  Slope Safety Score, C
0.017 0
0.027 0.0027446
Sum is 0.044 Sum is 0.0027446
so the new cluster centre for Cluster C is [0.044/2, 0.0027446/2] = [0.022, 0.00137].
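The centre recalculation used in these worked examples is a coordinate-wise mean. The Python sketch below, with a hypothetical helper named centroid, recomputes the centres of Cluster B and Cluster C from the (Slope Height, Slope Safety Score) members listed above.

```python
def centroid(members):
    """Mean of each coordinate over the members of one cluster."""
    n = len(members)
    return tuple(sum(axis) / n for axis in zip(*members))


# (Slope Height, Slope Safety Score) members from the worked example.
cluster_b = [(0.015, 0.00881048), (0.015, 0.00594852)]
cluster_c = [(0.017, 0.0), (0.027, 0.0027446)]

print(centroid(cluster_b))
print(centroid(cluster_c))
```

The same helper applies to the slope length and slope angle examples, since only the first coordinate changes.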
Secondly, I investigate the relationship between the Slope Length and the Slope Safety Score. Only 20 sample records from the slope database are included here for reference.
Data Point Slope ID # Slope Length, L Slope Safety Score, C Assigned Cluster
1 3SE-D/C 123 0.08 0.00881 B
2 3SE-D/C 124 0.026 0.004757 B
3 3SE-D/C 125 0.039 0.000197 A
4 3SE-D/CR 70 0.057 0.005944 B
5 3SE-D/CR 73 0.033 0.000804 A
6 3SE-D/CR 121 0.06 0.000437 A
7 3SE-D/CR 126 0.03 0.000151 A
8 3SE-D/F 24 0.1 0.00341 A
9 3SE-D/F 29 0.013 0.004729 B
10 3SE-D/F 30 0.033 0.000086 A
11 3SW-A/C 24 0.03 0 A
12 3SW-A/C 117 0.105 0.005949 B
13 3SW-A/C 118 0.065 0.004076 B
14 3SW-A/F 22 0.02 0 A
15 3SW-A/F 76 0.035 0 A
16 3SW-B/C 102 0.083 0 X
17 3SW-B/C 158 0.105 0.000583 C
18 3SW-B/C 161 0.093 0.002744 A
19 3SW-B/C 169 0.105 0.000883 C
20 3SW-B/C 170 0.165 0.000815 C
Table 3.3 Sample data for clustering slope length factor.
Plot between Slope Length and Slope Safety Score in the slope items (x-axis: Slope Length, L; y-axis: Slope Safety Score, C; Clusters A, B and C marked)
I find from the graph that the data set could be divided into three clusters, and one of the data items is chosen as the center for each cluster.
Cluster Name  Slope Length, L  Slope Safety Score, C
Cluster A     0.039            0.000197
Cluster B     0.057            0.005944
Cluster C     0.165            0.000815
For Cluster A, I find ten data items in it, so the new cluster center for it is calculated from:
Slope Length, L  Slope Safety Score, C
0.039 0.000197
0.033 0.000804
0.06 0.000437
0.03 0.000151
0.1 0.00341
0.033 0.000086
0.03 0
0.02 0
0.035 0
0.093 0.002744
Sum is 0.473 Sum is 0.007829
Therefore, the sum of the slope length and the sum of the slope safety score are each divided by 10, so the new cluster center for Cluster A is [0.473/10, 0.007829/10] = [0.0473, 0.0007829].
For Cluster B, I find six data items in it, so the new cluster center for it is calculated from:
Slope Length, L  Slope Safety Score, C
0.08 0.00881
0.026 0.004757
0.057 0.005944
0.013 0.004729
0.105 0.005949
0.065 0.004076
Sum is 0.346 Sum is 0.034265
so the new cluster center for Cluster B is [0.346/6, 0.034265/6] = [0.057667, 0.005711].
For Cluster C, I find three data items in it, so the new cluster center for it is calculated from:
Slope Length, L  Slope Safety Score, C
0.105 0.000583
0.105 0.000883
0.165 0.000815
Sum is 0.375 Sum is 0.002281
so the new cluster center for Cluster C is [0.375/3, 0.002281/3] = [0.125, 0.00076].
Finally, I investigate the relationship between the Slope Angle and the Slope Safety Score. Only 20 sample records from the slope database are included here for reference.
Data Point Slope ID # Slope Angle, A Slope Safety Score, C Assigned Cluster
1 3SE-D/C 123 0.055 0.00881 B
2 3SE-D/C 124 0.06 0.004757 B
3 3SE-D/C 125 0.06 0.000197 C
4 3SE-D/CR 70 0.06 0.005944 B
5 3SE-D/CR 73 0.03 0.000804 A
6 3SE-D/CR 121 0.06 0.000437 C
7 3SE-D/CR 126 0.075 0.000151 C
8 3SE-D/F 24 0.04 0.00341 B
9 3SE-D/F 29 0.04 0.004729 B
10 3SE-D/F 30 0.04 0.0000866 A
11 3SW-A/C 24 0.03 0 A
12 3SW-A/C 117 0.04 0.005949 B
13 3SW-A/C 118 0.04 0.004076 B
14 3SW-A/F 22 0.03 0 A
15 3SW-A/F 76 0.04 0 A
16 3SW-B/C 102 0.06 0 C
17 3SW-B/C 158 0.055 0.000583 C
18 3SW-B/C 161 0.06 0.002744 B
19 3SW-B/C 169 0.06 0.000883 C
20 3SW-B/C 170 0.06 0.000815 C
Table 3.4 Sample data for clustering slope angle factor.
Plot between Slope Angle and Slope Safety Score in the slope items (x-axis: Slope Angle, A; y-axis: Slope Safety Score, C; Clusters A, B and C marked)
I find from the graph that the data set could be divided into three clusters, and one of the data items is chosen as the center for each cluster.
Cluster Name  Slope Angle, A  Slope Safety Score, C
Cluster A     0.03            0
Cluster B     0.055           0.00881
Cluster C     0.075           0.000151
For Cluster A, I find five data items in it, so the new cluster center for it is calculated from:
Slope Angle, A  Slope Safety Score, C
0.03 0.000804
0.04 0.000086
0.03 0
0.03 0
0.04 0
Sum is 0.17 Sum is 0.00089
Therefore, the sum of the slope angle and the sum of the slope safety score are each divided by 5, so the new cluster center for Cluster A is [0.17/5, 0.0008903/5] = [0.034, 0.0001781].
For Cluster B, I find eight data items in it, so the new cluster center for it is calculated from:
Slope Angle, A  Slope Safety Score, C
0.055 0.0088105
0.06 0.0047575
0.06 0.0059444
0.04 0.0034102
0.04 0.004729
0.04 0.0059485
0.04 0.0040755
0.06 0.0027445
Sum is 0.395 Sum is 0.04042
so the new cluster center for Cluster B is [0.395/8, 0.04042/8] = [0.049375, 0.0050525].
For Cluster C, I find seven data items in it, so the new cluster center for it is calculated from:
Slope Angle, A  Slope Safety Score, C
0.06 0.0001969
0.06 0.0004374
0.075 0.0001505
0.06 0
0.055 0.0005829
0.06 0.0008826
0.06 0.0008153
Sum is 0.43 Sum is 0.0030655
so the new cluster center for Cluster C is [0.43/7, 0.0030655/7] = [0.0614286, 0.0004379].
3.6 Non-supervised Clustering with K-Means Algorithm
Section 3.5 introduced the supervised approach to getting the clusters from the raw data, in
which the user predefines the number of clusters. In addition, I extend the development to find
the optimum clusters from the raw data automatically, and we call this the Non-supervised
approach. For the Non-supervised approach, we try different values of k from 1 to n, where n is
the maximum number of clusters we want to consider. We then measure the overall average
distance mean for each value of k, and take k as the optimum when the change in the average
distance mean is less than 5%. As a result, we get the optimum clusters in an automated manner.
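The 5% stopping rule can be sketched as follows. This is a minimal illustration under two assumptions: the change is measured relative to the value at the earlier k, and the earlier k is the one kept (as the case study later states). The function name `choose_optimum_k` and the sample values are hypothetical and are not project results.

```python
def choose_optimum_k(avg_distance_by_k, threshold=0.05):
    """Pick k using the 5% rule.

    avg_distance_by_k[i] is the overall average distance mean for k = i + 1.
    Returns the first k whose successor changes the mean by less than
    `threshold` (relative to the value at k); falls back to the largest k tried.
    """
    for i in range(len(avg_distance_by_k) - 1):
        current, following = avg_distance_by_k[i], avg_distance_by_k[i + 1]
        if current == 0:          # already a perfect fit, nothing left to gain
            return i + 1
        change = abs(current - following) / current
        if change < threshold:
            return i + 1          # keep the earlier k of the pair
    return len(avg_distance_by_k)


# Hypothetical means for k = 1..5 (illustrative values only).
example = [0.0100, 0.0070, 0.0068, 0.0067, 0.0066]
print(choose_optimum_k(example))  # 0.0070 -> 0.0068 is a change of about 2.9%, so k = 2
```

If no pair of consecutive values changes by less than the threshold, the sketch returns the largest k tried, which is one possible fallback rather than documented CityMiner behaviour.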
Chapter 4: Case Study
4.1 Supervised Approach Implementation
In this project, we implement the supervised approach by providing a user-defined number of
clusters (k) and selecting the initial center for each cluster. During the mining process, the
K-Means algorithm loops, reassigning data items and recomputing the centers, until all the
clusters become stable. The user also sets a maximum number of iterations, which bounds how
long the system runs the mining algorithm to minimize the distance measure.
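The loop described here can be sketched in Python as below. It is a simplified stand-in for the CityMiner implementation (which is written in Visual Basic 6.0 and is not shown in this report); the function name `k_means`, the Euclidean distance measure and the sample points are assumptions made for illustration.

```python
import math


def k_means(points, initial_centers, max_iterations):
    """Run K-Means from user-chosen centers; stop early once centers are stable.

    Returns the final centers and the assignment from the last assignment step.
    """
    centers = list(initial_centers)
    assignment = []
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest center (Euclidean).
        assignment = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c, old in enumerate(centers):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center unchanged
        if new_centers == centers:       # stable: no center moved
            break
        centers = new_centers
    return centers, assignment


# Hypothetical (slope angle, score) points and two user-selected initial centers.
points = [(0.03, 0.0), (0.04, 0.0), (0.03, 0.0008), (0.06, 0.0004), (0.075, 0.0002)]
centers, assignment = k_means(points, [(0.03, 0.0), (0.075, 0.0002)], max_iterations=4)
print(centers, assignment)
```

A small `max_iterations` can stop the loop before the centers settle, which is why the choice of this value is examined in the case study.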
4.1.1 Implementation of slope height factor (Supervised)
In this project, I expect to get three clusters from the raw mining data prepared previously, so
I select one data point as the initial center for each of the desired clusters. In addition, I set the
maximum number of iterations to 4 to investigate how this value affects the mining result.
k   Centre for Cluster     Total number of data items   Average distance mean
                           assigned to the Cluster      for the Cluster
1   [0.000769,0.000128]    14                           0.000594
2   [0.001616,0.001045]    4430                         0.005712
3   [0.014616,0.017226]    640                          0.017682
Table 4.1 Summary of the clustering result for the slope height.
Diagram 4.1 Plots for the result of k-means when optimum k is equal to 3
4.1.2 Implementation of slope length factor (Supervised)
k   Centre for Cluster     Total number of data items   Average distance mean
                           assigned to the Cluster      for the Cluster
1   [0.003677,0.017171]    11                           0.028344
2   [0.097647,0.000635]    2319                         0.034448
3   [0.005662,0.007477]    2754                         0.025752
Table 4.2 Summary of the clustering result for the slope length.
4.1.3 Implementation of slope angle factor (Supervised)
k   Centre for Cluster     Total number of data items   Average distance mean
                           assigned to the Cluster      for the Cluster
1   [0.000065,0.000605]    1                            0.003196
2   [0.027468,0.000807]    5072                         0.023564
3   [0.007468,0.000245]    11                           0.006082
Table 4.3 Summary of the clustering result for the slope angle.
4.2 Non-supervised Approach Implementation
To find the actual number of clusters in the raw data, it is better to run the algorithm with
different values of k, from small to large. As k increases, each data item falls into a cluster
whose centroid is nearer to it, so the average distance mean to the cluster centroids decreases
accordingly. For example, assume that we try values of k from 1 to 7 in the case study and, for
each k, calculate the average distance mean from all data points to their own cluster centroids.
We should find that the average distance becomes smaller and smaller until k reaches the
optimum level, after which the changes in the average distance become small or vanish.
Diagram 4.4 The changes of average distance mean for k increase from 1 to 7
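The quantity plotted against k can be illustrated with a short Python sketch. It assumes that the "average distance mean" is the mean Euclidean distance from each data point to its nearest centroid; the function name, the sample points and the centroids are hypothetical and chosen only to show the downward trend.

```python
import math


def average_distance_mean(points, centers):
    """Mean Euclidean distance from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points) / len(points)


# Hypothetical data with two obvious groups.
points = [(0.01, 0.0), (0.02, 0.0), (0.03, 0.0), (0.11, 0.0), (0.12, 0.0), (0.13, 0.0)]

# Hand-picked centroids for k = 1, 2 and 3 (each is the mean of its members).
centers_by_k = {
    1: [(0.07, 0.0)],
    2: [(0.02, 0.0), (0.12, 0.0)],
    3: [(0.02, 0.0), (0.115, 0.0), (0.13, 0.0)],
}
for k, centers in centers_by_k.items():
    print(k, round(average_distance_mean(points, centers), 6))
```

In this toy data the drop from k = 1 to k = 2 is large and the drop from k = 2 to k = 3 is much smaller, which is the flattening pattern the approach looks for.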
4.2.1 Implementation of slope height factor (Non-supervised)
For the slope height factor, I found the optimum k by trying different values of k from 1 to 7.
I assume that if the change in the average distance mean from one k to the next k is less than
5%, the previous k is selected as the optimum k for this clustering.
[Figure: Average Distance Mean VS k. x-axis: k (0 to 8); y-axis: Average Distance Mean (0 to 0.01).]
As a result, the optimum k for this clustering is 2. In this experiment, I selected 5084 data
records for analysis, and the operation took around 3.5 minutes to finish. The detailed result
is shown as follows:
The auto-mining process start at 21:10:43
The auto-mining process end at 21:14:07
The overall average distance mean when k is equal to 2 are: 0.007036
k   Centre for Cluster     Total number of data items   Average distance mean
                           assigned to the Cluster      for the Cluster
1   [0.001784,0.000905]    4444                         0.005525
2   [0.014784,0.017086]    640                          0.017529
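The reported overall average distance mean appears to be consistent with a size-weighted mean of the two per-cluster averages. The short Python check below shows that arithmetic; it is an observation about the published numbers, not a statement of how CityMiner computes the figure internally.

```python
# (number of items, average distance mean) for each cluster at k = 2,
# taken from the slope height result table.
clusters = [(4444, 0.005525), (640, 0.017529)]

total_items = sum(count for count, _ in clusters)
# Weight each cluster's mean by its size to recover the mean over all items.
overall = sum(count * mean for count, mean in clusters) / total_items

print(total_items)        # 5084
print(round(overall, 6))  # 0.007036
```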
4.2.2 Result for the non-supervised clustering of slope height factor
k   Centre for Cluster     Total number of data items   Average    Average Slope
                           assigned to the Cluster      Height     Risk Score
1   [0.001784,0.000905]    4444                         0.007157   0.000844
2   [0.014784,0.017086]    640                          0.024586   0.008578
According to the non-supervised learning result from the clustering of the slope height factor, I
find that the data set falls into two different clusters.
In this sample data, 87.41% of the data items are assigned to cluster 1, with the centre equal to
[0.001784,0.000905]. This cluster has the relatively low average height, so I classify this
cluster of sample data as short slopes.
In addition, 12.59% of the data items are assigned to cluster 2, with the centre equal to
[0.014784,0.017086]. This cluster has the relatively large average height, so I classify
this cluster of sample data as tall slopes.
Last but not least, comparing cluster 1 and cluster 2, I find that cluster 2 has a higher average
height than cluster 1 and also a higher average risk score. In conclusion, the slopes with
greater height also have a larger risk score. In general, the slope height is directly
proportional to the slope risk score.
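The cluster shares quoted for the slope height factor follow directly from the item counts in the result table. A minimal Python check (variable names are illustrative):

```python
cluster_sizes = {1: 4444, 2: 640}   # items per cluster, from the result table
total = sum(cluster_sizes.values())

# Share of all 5084 records that falls into each cluster, in percent.
shares = {k: round(100 * n / total, 2) for k, n in cluster_sizes.items()}
print(shares)  # {1: 87.41, 2: 12.59}
```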
4.2.3 Implementation of slope length factor (Non-supervised)
For the slope length factor, I found the optimum k by trying different values of k from 1 to 7.
I assume that if the change in the average distance mean from one k to the next k is less than
5%, the previous k is selected as the optimum k for this clustering.
[Figure: Average Distance Mean against k. x-axis: k (0 to 8); y-axis: 0 to 0.06.]
k   Centre for Cluster     Total number of data items   Average distance mean
                           assigned to the Cluster      for the Cluster
1   [0.006716,0.007423]    2818                         0.025219
2   [0.099892,0.000156]    2266                         0.034716
4.2.4 Result for the non-supervised clustering of slope length factor
k   Centre for Cluster     Total number of data items   Average    Average Slope
                           assigned to the Cluster      Length     Risk Score
1   [0.006716,0.007423]    2818                         0.0305     0.001137
2   [0.099892,0.000156]    2266                         0.100761   0.002665
According to the non-supervised learning result from the clustering of the slope length factor, I
find that the data set falls into two different clusters.
In this sample data, 55.43% of the data items are assigned to cluster 1, with the center equal to
[0.006716,0.007423]. This cluster has the relatively smaller average length, so I classify
this cluster of sample data as narrow slopes.
In addition, 44.57% of the data items are assigned to cluster 2, with the center equal to
[0.099892,0.000156]. This cluster has the relatively larger average length, so I classify
this cluster of sample data as wide slopes.
In conclusion, comparing cluster 1 and cluster 2, I find that cluster 2 has a larger average
length than cluster 1 and also a higher average risk score, so the slopes with greater length
also have a larger risk score. In general, the slope length is also directly proportional
to the slope risk score.
4.2.5 Implementation of slope angle factor (Non-supervised)
For the slope angle factor, I found the optimum k by trying different values of k from 1 to 7.
I assume that if the change in the average distance mean from one k to the next k is less than
5%, the previous k is selected as the optimum k for this clustering.
[Figure: Average Distance Mean against k. x-axis: k (0 to 8); y-axis: 0 to 0.035.]
k   Centre for Cluster     Total number of data items   Average distance mean
                           assigned to the Cluster      for the Cluster
1   [0.002178,0.000020]    5                            0.007274
2   [0.026411,0.001099]    5079                         0.024521
4.2.6 Result for the non-supervised clustering of slope angle factor
k   Center for Cluster     Total number of data items   Average    Average Slope
                           assigned to the Cluster      Angle      Risk Score
1   [0.002178,0.000020]    5                            0.0092     0.000618
2   [0.026411,0.001099]    5079                         0.050319   0.001819
According to the non-supervised learning result from the clustering of the slope angle factor, I
find that the data set falls into two different clusters. I want to categorize the sample data
into two classes of angle: gentle angle and steep angle.
In this sample data, 0.1% of the data items are assigned to cluster 1, with the center equal to
[0.002178,0.000020]. This cluster has the relatively less steep average angle, so I classify
this cluster of sample data as gentle-angle slopes.
Also, 99.9% of the data items are assigned to cluster 2, with the center equal to
[0.026411,0.001099]. This cluster has the relatively steeper average angle, so I
classify this cluster of sample data as steep-angle slopes.
Last but not least, comparing cluster 1 and cluster 2, I find that the steeper the
average slope angle, the higher the average slope risk score. So I would like to conclude that
the angle of the slope is directly proportional to the slope risk score. That means that in most
situations, over 99%, the slopes with the steeper angle (i.e. 50 degrees or above) are the more
risky ones, although nearly 1% of the slopes are exceptions.
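The comparisons behind the three conclusions can be lined up side by side. The Python sketch below uses the normalized averages from the three non-supervised result tables and checks only the direction of each relationship. With two clusters per factor it cannot confirm strict proportionality, so the ratios are printed for context only; the variable names are illustrative.

```python
# (average factor value, average slope risk score) for cluster 1 and cluster 2,
# copied from the three non-supervised result tables (normalized values).
results = {
    "height": [(0.007157, 0.000844), (0.024586, 0.008578)],
    "length": [(0.0305, 0.001137), (0.100761, 0.002665)],
    "angle":  [(0.0092, 0.000618), (0.050319, 0.001819)],
}

consistent = {}
for factor, rows in results.items():
    low, high = sorted(rows)                 # order the two clusters by factor value
    factor_ratio = high[0] / low[0]          # how much larger the factor average is
    score_ratio = high[1] / low[1]           # how much larger the score average is
    consistent[factor] = high[1] > low[1]
    print(f"{factor}: factor x{factor_ratio:.2f}, score x{score_ratio:.2f}, "
          f"higher factor -> higher score: {consistent[factor]}")
```

For all three factors the cluster with the larger average factor value also has the larger average score, which matches the direction of the conclusions drawn in this chapter.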
Chapter 5: Prototype
5.1 Technical Summary

In this project, I have developed an application named CityMiner to investigate the slope risk
factors by use of the K-Means Clustering algorithm. In the case study, I use CityMiner to
evaluate the patterns and derive some knowledge or rules from the mining result. CityMiner is
developed with Microsoft Visual Basic 6.0 (VB), in which the K-Means algorithm is
implemented, and it presents the result of the mining process in a rich graphical interface.
Because it is written in a third-generation language such as VB, it is easy to implement
additional algorithms such as decision trees or association rules in the future. Although
CityMiner runs only on Windows-based computers rather than being a cross-platform
application, it is easy to install in several simple steps and convenient to access.
Moreover, a Microsoft Access 2000 database is chosen in this project. Generally, MS Access is
not a powerful DBMS tool for data transaction operations. However, in the case study the MS
Access database is used only as the data source files and the data warehouse, so it is needed
only in the stages of data cleaning, data integration, data selection and data transformation.
During the data mining stage, all operations of the K-Means algorithm run in main memory
rather than reading from or writing to the database, so the performance of the database is not
an important issue in this case study.
Being developed with Microsoft Visual Basic 6.0, CityMiner has to be developed, installed and
run on a Windows operating system such as Windows 95/98, Windows ME, Windows NT,
Windows 2000 Professional or Windows 2000 Server. In this project, I use Windows 2000
Server as the development platform and require users to install CityMiner on
Windows 2000 Professional as the operating platform.
Finally, in the stages of data cleaning, data integration, data selection and data transformation,
we used the Query Wizard in Microsoft Access 2000 to execute all the SQL statements
for removing dirty data or outliers, selecting the quality data into the data warehouse, and
transforming the data for the mining process.
5.2 Technical Implementation
In this project, I have developed an application, CityMiner, to help us generate the optimum
clusters with the supervised or non-supervised mining approach. The following diagram shows
the overall system architecture of CityMiner and how CityMiner implements supervised and
non-supervised clustering in this project.
5.2.1 Overall System Architecture of CityMiner
Users can access CityMiner once it has been installed on their computer with Windows 2000
Professional as the operating system. The mining data have been prepared in the previous data
preparation stage and stored in the data warehouse. Through this system, the user can run
CityMiner in supervised mode by providing the desired number of clusters (k), which results in
a data set for each cluster. In addition, users can run CityMiner in non-supervised mode by
specifying the maximum number of iterations of the K-Means algorithm and the maximum
number of clusters (k), in order to get the optimum number of clusters (k) at the end of the
mining process.
[Diagram: System Architecture. Components: Data Warehouse; Supervised Mining; Un-supervised Mining; CityMiner User Interface; Computer Facilities (Windows 2000 Professional). Actors: End users; Mining Analysts. Flow labels: "Provide Slope Database source"; "Run supervised Mining by given (k)"; "Run un-supervised Mining to get the optimum (k)"; "Mining Results".]
Diagram 5.1 Overall System Architecture of CityMiner
5.2.2 Functional Description of CityMiner
CityMiner is used to perform the case studies in both the supervised and the non-supervised
approach by implementing the K-Means algorithm. CityMiner consists of three main functions:
the Data Preparation Menu, the Data Mining Menu and the Data Result Analysis Menu.
5.2.2.1 Data Preparation Menu
The Data Preparation Menu is mainly responsible for capturing the mining data from the data
warehouse, in which the data for cluster mining have been cleaned and normalized. In
addition, the user can filter out some data items by specifying simple conditional rules in
order to remove outliers.
Diagram 5.2 Screen Layout of Data Preparation Menu
Step 1  Click to select the data warehouse source location.
Step 2  Click [Load…] to retrieve all the available data attributes into the selection list.
Step 3  Select the desired table for clustering, e.g. MINING.
Step 4  Select the attributes for clustering, e.g. SLOPE_SCORE.
Step 5  Select the JOIN key for these selected attributes, e.g. ID1.
Step 6  Click [Join Together…] to get the data ready for the mining process.
5.2.2.2 Data Mining Menu
The Data Mining Menu is mainly responsible for running the mining algorithm (K-Means) on
the mining data prepared in the Data Preparation Menu. In this project, we can get the
clustering result by running CityMiner in supervised or non-supervised mode. For supervised
mode, the user needs to select the initial centroid for each of the user-defined clusters and is
also required to specify the maximum number of iterations for the K-Means algorithm during
the mining process. For non-supervised mode, instead of defining the number of clusters and
the initial centroids, the user only needs to provide a maximum number of clusters (k) and the
maximum number of iterations; the system then tries the values of k from 1 to the
user-specified maximum number of clusters (k). As a result, the system returns the optimum
value of k for the mining data set.
Diagram 5.3 Screen Layout of Data Mining Menu
5.2.2.2.1 The Supervised Approach
Step 1  Click on the visual grid to select the desired data items.
Step 2  Click [OK] to accept the selected data item.
Step 3  Repeat Step 1 to Step 2 until all the user-defined clusters have been selected (e.g. select three clusters).
Step 4  Set the maximum number of iterations (e.g. input 3 in the text box).
Step 5  Click [Run Now!] to start the mining process.
Step 6  Display the resultant graph when the mining process has finished.
Step 7  Click [Show Label(s)] to show the indicators of each cluster.
5.2.2.2.2 The Non-Supervised Approach
Step 1  Set the maximum number of k to try (e.g. input 6 in the text box, so that the system will try k from 1 to 6 and finally get the optimum k).
Step 2  Set the maximum number of iterations (e.g. input 3 in the text box).
Step 3  Click [Auto-Run] to start the mining process.
Step 4  Display the resultant graph when the mining process has finished.
Step 5  Click [Show Label(s)] to show the indicators of each cluster.
5.2.2.3 Data Result Analysis
The Data Result Analysis Menu is mainly responsible for showing the summary of the mining
process. In addition, the user can see the diagram of the average distance mean against the
number of clusters k. From this diagram, the user can better understand why the system chose
the resultant k as the optimum k for the clustering. Finally, the user can export the mining
summary and the resultant data for further analysis or for storage.
Diagram 5.4 Screen Layout of Data Result Analysis Menu
Step 1  Click [Export…] to export all the summary and the clustering result data to the user-specified directory path.
Step 2  Show the graph of the average distance mean against (k).
Step 3  List of the exported files:
        1. Mining Summary.log
        2. Mining Data.csv
        3. K Trend Summary.csv
        4. Cluster Summary.csv
Chapter 6: Conclusion
6.1 Project Review
In this project, I have carried out the case study and investigated the results from the
clustering. I find that the selection of the initial points for the clustering is extremely
important for the K-Means algorithm.
In addition, I have learnt much from the data mining process. I understand that quality data is
highly important for the data mining results. I have studied and practiced different skills for the
data preparation process, such as data cleaning and data transformation. I have also learnt the
clustering skill by implementing the most basic and common algorithm, K-Means. Last but not
least, I have learnt the analytical skills to analyze various clustering results and know how to
derive rules from the mining results.
6.2 Project Achievements
In this project, we performed the case study with both the supervised and the non-supervised
approach to investigate the factors which affect slope risk. The slope risk factors
investigated in the case study are the slope geometry factors: slope height, slope length and
slope angle. We believe that the level of slope risk is highly related to these geometry factors,
so we used CityMiner to perform the Non-supervised clustering approach and obtain the
optimal grouping from the prepared mining data. As a result, we generated the clustering
results for each geometry factor and defined THREE rules from them.
• Rule 1: the slope height is directly proportional to the slope risk score.
• Rule 2: the slope length is directly proportional to the slope risk score.
• Rule 3: when the angle of the slope is over 50 degrees, over 99% of the slopes are risky.
In conclusion, we know that not all geometry factors affect the risk level of a slope.
The case study of this project shows that slope height, slope length and slope angle are the
dominant factors for slope risk.
6.3 Project Future Extensions
Although I have concluded some rules in the project achievements section, I find that there are
still some things that could be improved or enhanced in a future study.
First of all, the quality of the case study results could be greatly improved by increasing the
data volume of the sample data or by applying more advanced methods in the stages of data cleaning, data
integration, data transformation and data reduction. The higher the quality of the data, the higher the
quality of the mining results.
Secondly, the CityMiner application could be enhanced or upgraded to a web-based solution so that
more mining analysts can provide useful comments and share experience through the Internet.
Last but not least, it is suggested to include additional algorithms in CityMiner for
investigating multi-dimensional effects. For example, we could study the relationship between
slope height, slope length and slope angle at the same time and learn how these factors together affect slope risk.
As a result, it may provide a more comprehensive solution for slope risk prediction.
Appendix A: Project Reference
[1] Margaret H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall.
[2] C.K.L. Wong, GEO Report No. 68, Geotechnical Engineering Office, Civil Engineering Department, The Government of the Hong Kong Special Administrative Region.
[3] Hui Siu Wo, Andrew, Data Mining on Advertising.
[4] http://bashful.bimcore.emory.edu:8080/Tutorial/MACD/Kmeans.htm
[5] http://www.kovcomp.com.uk/support/XL-Tut/demo-cluster2.html
[6] http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/Kmeans_Clust.htm
[7] http://www.resample.com/xlminer/help/kMClst/KMClust_intro.htm
[8] http://cne.gmu.edu/modules/dau/stat/clustgalgs/clust5_bdy.html
[9] http://www.predictivepatterns.com/docs/WebSiteDocs/Clustering/K-Means_Clustering_Overview.htm
[10] http://www.ir.iit.edu/~dagr/DataMiningCourse/Spring2001/BookNotes/3prep.pdf
Appendix B: Progress Report Summary
Student Name: Chan Tat Wing Student No.: 96004020

Project Title: Data mining for slope risky evaluation by use of clustering algorithm
Date of Review: 1st November, 2003
Summary of Progress since last review:
1. Finished the study of the reference book named “Data Mining: Introductory and
Advanced Topics”.
2. Viewed two demonstrations from the Supervisor that use a similar approach. One is at a
more academic level, and the other one is more commercial.
3. Copied one of the samples for study and reference purposes.
Recommendations:
1. Use K-Means Clustering as the algorithm to start the project and investigate the slope
risk factors.
2. Try to work out some useful paper work for further discussion.

Date of Review: 7th November, 2003
Have studied the following sample works from the Internet:
http://bashful.bimcore.emory.edu:8080/Tutorial/MACD/Kmeans.htm
http://www.kovcomp.com.uk/support/XL-Tut/demo-cluster2.html
http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/Kmeans_Clust.htm
http://www.resample.com/xlminer/help/kMClst/KMClust_intro.htm
http://cne.gmu.edu/modules/dau/stat/clustgalgs/clust5_bdy.html
http://www.predictivepatterns.com/docs/WebSiteDocs/Clustering/K-Means_Clustering_Overview.htm
Have done some work in the interim report v1.01 by applying K-Means Clustering in the
Slope Database analysis.
Recommendations:
1. Analyze the rationale for the number of clusters.
2. Look for critical factors in clustering, i.e. rules as a result.

Date of Review: 15th November, 2003
Have studied the past paper “Data Mining on Advertising (Hui Siu Wo, Andrew)” provided by
the Supervisor.
Have studied the reference engineering document “GEO Report No. 68”, which provides more
detailed information for estimating slope risk.
Have done some work in the interim report v1.02 by applying K-Means Clustering in the
Slope Database analysis.
Recommendations:
Start documenting the case study along with the algorithms in the Interim Report.

Date of Review: 31st January, 2004
Documenting the case study along with the algorithms in the Interim Report.
Developing the application and analyzing the sample slope database by using Visual Basic 6.0.
Recommendations:
1. Finish the clustering experiment comparing size (length, width, height) with the risk factor.
2. Analyze the performance of the K-Means algorithm with and without initial focus points.

Date of Review: 14th February, 2004
Continuing the clustering experiment comparing size (height, width and angle) with the risk factor.
Continuing to develop the application and analyze the sample slope database by using Visual Basic 6.0.
Recommendations:
1. Visualize the graph of clusters with both different colors and various shapes.
2. Work on the performance analysis of manually selecting the center point for clustering.

Date of Review: 6th March, 2004
1. Have produced the visual result of the clusters in different colors and different shapes.
2. Have finished the software module and found the optimum number of clusters (k) for each slope risk
factor.
3. Have generated the analysis result for different values of k for each slope risk factor.
Recommendations:
1. Define the risk factor value and description by levels.
2. Complete the analysis of size and angle.
3. In conclusion, combine all 3 factors together.

Date of Review: 13th March, 2004
1. Seeking a solution to define the risk factor value and description by levels (Binning Method
with the Equal Width approach).
2. Have completed the analysis of size and angle.
3. Have written the conclusion in the final report.
Recommendations:
Appendix C: Installation Guide for CityMiner
Step 1  Double click the Setup.exe icon to start the installation process.
Step 2  Click the [OK] button to follow the setup wizard instructions to complete the setup process.
Step 3  Click to continue.
Step 4  Click the [Change Directory] button to select the user desired directory for installing the application.
Step 5  Input the [Program Group] name. Then click the [Continue] button to continue the setup process.
Step 6  Click the [OK] button to complete the setup process.
Step 7  Access CityMiner by the following order, [Start] -> [Programs] -> [CityMiner] -> [CityMiner].
The End


EXTENDED ABSTRACT .......................................................................................................3

1.1 PROJECT AIM........................................................................................................... 11

1.2 PROJECT OBJECTIVE............................................................................................. 12

1.3 PROJECT WORK ...................................................................................................... 12

2.1 DATA MINING HISTORY........................................................................................ 14

2.2 CLUSTERING ............................................................................................................ 14

2.3 GENERAL PROCEDURE TO PERFORM DATA MINING................................... 15

2.4 DATA GATHERING .................................................................................................. 15

2.5 DATA CLEANING ..................................................................................................... 15

2.6 DATA TRANSFORMATION .................................................................................... 16

2.6.1 NORMALIZATION BY DECIMAL SCALING .............................................................. 16

2.6.2 DISCRETIZATION OF SLOPE RISKY FACTOR VALUE ............................................. 17

2.7 CLUSTERING ALGORITHM................................................................................... 17

2.7.1 CONCEPT FOR K-MEANS CLUSTERING ................................................................. 17

3.1 INTRODUCTION ....................................................................................................... 20

3.2 DATA GATHERING .................................................................................................. 20

3.3 DATA CLEANING ..................................................................................................... 22

3.4 DATA TRANSFORMATION .................................................................................... 23

3.5 SUPERVISED CLUSTERING WITH K-MEANS ALGORITHM .......................... 25

3.6 NON-SUPERVISED CLUSTERING WITH K-MEANS ALGORITHM ................ 34

4.1 SUPERVISED APPROACH IMPLEMENTATION ................................................. 36

4.1.1 IMPLEMENTATION OF SLOPE HEIGHT FACTOR (SUPERVISED) .............................. 36

4.1.2 IMPLEMENTATION OF SLOPE LENGTH FACTOR (SUPERVISED) ............................. 38

4.1.3 IMPLEMENTATION OF SLOPE ANGLE FACTOR (SUPERVISED) ............................... 40

4.2 NON-SUPERVISED APPROACH IMPLEMENTATION ....................................... 42

4.2.1 IMPLEMENTATION OF SLOPE HEIGHT FACTOR (NON-SUPERVISED)...................... 43

4.2.3 IMPLEMENTATION OF SLOPE LENGTH FACTOR (NON-SUPERVISED) ..................... 46

4.2.5 IMPLEMENTATION OF SLOPE ANGLE FACTOR (NON-SUPERVISED) ........................ 49

5.1 TECHNICAL SUMMARY......................................................................................... 53

5.2 TECHNICAL IMPLEMENTATION......................................................................... 54

5.2.1 OVERALL SYSTEM ARCHITECTURE OF CITYMINER............................................. 54

5.2.2 FUNCTIONAL DESCRIPTION OF CITYMINER ......................................................... 55

5.2.2.1 Data Preparation Menu ..................................................................................... 55

5.2.2.2 Data Mining Menu ............................................................................................ 58

5.2.2.2.1 The Supervised Approach ......................................................................... 59

5.2.2.2.2 The Non-Supervised Approach ................................................................. 60

5.2.2.3 Data Result Analysis .......................................................................................... 62

6.1 PROJECT REVIEW................................................................................................... 66

6.2 PROJECT ACHIEVEMENTS ................................................................................... 66

6.3 PROJECT FUTURE EXTENSIONS ......................................................................... 67

APPENDIX A: PROJECT REFERENCE ............................................................................ 69

APPENDIX B: PROGRESS REPORT SUMMARY ........................................................... 71

APPENDIX C: INSTALLATION GUIDE FOR CITYMINER........................................... 80

1.1 Project Aim

1.2 Project Objective

1.3 Project Work

2.1 Data Mining History

2.3 General procedure to perform data mining

Procedure Name Procedure Description

2.4 Data Gathering

2.5 Data Cleaning

2.6 Data Transformation

2.6.1 Normalization by Decimal Scaling

By using the formula: v' = v / 10^j

UPDATE MINING SET SLOPE_SCORE = SLOPE_SCORE/(10*10*10);

UPDATE MINING SET SLOPE_HEIGHT = SLOPE_HEIGHT/(10*10*10);

UPDATE MINING SET SLOPE_LENGTH = SLOPE_LENGTH/(10*10*10);

UPDATE MINING SET SLOPE_ANGLE = SLOPE_ANGLE/(10*10*10);
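The same normalization can be sketched in Python. The sketch assumes that the divisor in the statements is 1000 (j = 3), which is consistent with slope angles such as 30 to 75 degrees appearing as 0.03 to 0.075 in the sample data. The helper `smallest_j` shows the usual textbook way of choosing j and is included only for comparison; both function names and the sample angles are hypothetical.

```python
def decimal_scale(values, j):
    """Normalize by decimal scaling: v' = v / 10**j."""
    return [v / 10 ** j for v in values]


def smallest_j(values):
    """Smallest non-negative j such that every |v| / 10**j is below 1."""
    j = 0
    while any(abs(v) / 10 ** j >= 1 for v in values):
        j += 1
    return j


angles = [30, 40, 55, 60, 75]      # hypothetical slope angles in degrees
print(decimal_scale(angles, 3))    # same effect as dividing the column by 1000
print(smallest_j(angles))          # 2
```

Using one fixed j for every attribute, as the statements appear to do, keeps all columns on a common scale, whereas choosing j per attribute would spread each column over more of the range below 1.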