Surat An-Nahl (The Bee)

Ph. D.
Announced Discussion
Mon. , 13 October, 2014
Tanta University
Faculty of Science
Mathematics Department

A Thesis Submitted for the Degree of
Doctor of Philosophy in Science
in
(Mathematical Statistics)

Researcher / Soaad Abd El-Badie Attia El-Afify

S. Abd El-Badie
Ph. D. Student, Mathematics Department, Faculty of Science, Tanta University,
Egypt. She got her M.Sc., titled "A New Data Reduction Approach", in 2006 from the
Faculty of Science, Tanta University, Egypt. She got the Best Student
Presentation and Best Student Paper Awards. She worked as a Teaching
Assistant at the German University in Cairo (GUC). She is a member of:
ERS Group: http://www.cba.edu.kw/abo/rough-sets-working-group.html
IRSS: http://roughsets.home.pl/www/
Egyptian Mathematical Society: http://etms-eg.org
Email: savvymore@gmail.com Homepage: www.savvymore.mysite.com
Tel: 01000889394

Statistics & Data Mining
Supervisors

Prof. Dr. Mohamed Ezzat Abd El-Monsef
Prof. Dr. Abd El-Monem Mohamed Kozae
Dr. Wafaa Anwar Abd El-Latif Hassanein
Prof. Dr. El-Houssainy Abd El-Bar Rady

M. E. Abd El-Monsef and A. M. Kozae
Professors of Pure Mathematics, Mathematics Department, Faculty of Science,
Tanta University, Egypt. Between them, they have served as dean of the faculty
and as Treasurer of the Syndicate of Scientific Professions, Mid Delta, and one
holds the membership of the National Committee for Mathematics. Each is a
supervisor on many M.Sc. and Ph.D. theses in many universities in and out of
Egypt, and a member of many mathematical societies, including:
ERS Group: http://www.cba.edu.kw/abo/rough-sets-working-group.html
IRSS: http://roughsets.home.pl/www/
Egyptian Mathematical Society: http://etms-eg.org

W. A. Hassanein
Lecturer of Mathematical Statistics, Mathematics Department, Faculty of
Science, Tanta University, Egypt. She got her B.Sc., 2000, by excellence with
honors degree, and her M.Sc. degree, 2004. She got her Ph.D. degree, titled
"Uncertainty in Statistics", 2007, Faculty of Science, Tanta University,
Egypt. She is a member of:
ERS Group: http://www.cba.edu.kw/abo/rough-sets-working-group.html
Egyptian Mathematical Society: http://etms-eg.org

E. A. Rady
Professor of Statistics, I.S.S.R, Cairo University, Cairo, Egypt. His Ph.D. in
Statistics is from Oregon State University (OSU), USA. He was the Director of
the Statistical and Econometrical Consultation Center. He got many awards:
Sarhan Award of the Scientific Research Academy, Egypt; Lee Award, OSU, USA;
Member of the Honor Society of Phi Kappa Phi, USA; and Thabet El-Sherief Award,
ISSR, Cairo University.
Homepage: http://issr.cu.edu.eg

Homepages of the Tanta University supervisors:
http://tdb.tanta.edu.eg/staff_data/Staff%20Detailed%20Data-ar.aspx?MemberID=847
http://telc.tanta.edu.eg/hosting/pro6/pro6_index.html
http://at.yorku.ca/h/a/a/a/48.htm
Outlines
(Chapter 1) Statistics & Data Mining
1. What is Data Mining? & What can it deliver for us?
2. What is the Major Power Linking Statistics & DM?
3. What is the Statistical Data Mining (SDM) Process?
4. Statistical Data Mining (SDM) Process in Egypt
(Chapter 2 & 3) SDM Techniques Detailed Discussion
(Chapter 4) A New Classifier Using (Jaccard Distance Matrix & Gini Index) + JMP Java Program
(Chapter 5) New Predictor Using (Gamma Correlation Coefficient & Misclassification Error) with Cross Validation & Bootstrapping + GMP R Programming
(Chapter 6) Text Mining Application (CALAIS & Textalyser) (Malaysia Airlines Flight 370: Is There a Bermuda Triangle Connection)
Conclusion + Future work
Chapter 1

Statistics & Data Mining


Basic Concepts & Definitions
What is Data Mining?
Data Mining Process???

• Problem Definition
• Dataset Description (Columns & Rows)
• Database
• Data Warehousing
- Extraction, Transformation & Loading (ETL) Process
• Data Preparation
- Data Type
DM Process
Data Mining: "The Extraction of Hidden Predictive Information
from Large Databases"
Kurt Thearling (2010)

What is The Major Power Linking
Statistics & Data Mining?

4 Reply Trends
• Historical
• Theoretical
• Linking Power
• Significant Comparison

Historical Reply Trend
Statistics History (1532)
Data Mining History (1989)

SDM Softwares
Data Mining & Data Science Software 2014 Cloud
Theoretical Reply Trend
• Methodology similarity: this leads most statisticians to
consider DM one of the branches of Statistics. Hence, most
DM software was invented by statisticians.
• Statistics provides the theoretical basis of the DM
process: previous studies of the DM process focus on the
statistical perspective as a measure of DM validity.
• The DM process uses many statistical algorithms, such
as Cluster Analysis, Bayes Networks and Regression.
Linking Power Reply Trend
Significant Comparison Reply Trend
SDM Process Field Study in Egypt
Governmental Sectors
Private Sectors
Egyptian & Foreign DM Experts
New Thinking
Faced Problems
Effective Results
Exploring Discussions
• The use of Statistics and DM, especially in the
governmental sectors in Egypt, still needs improvement
so that it predicts the future rather than only
presenting the past.
• In addition, some of the private sectors carry out
their DM process using only statistical techniques
for their analysis of data. This motivates our future
work: a project addressing the current, real situation
of SDM analysis in Egypt.
Training Data
Test Data
New Classifier
New Predictor
Chapter 2

Statistical Data Mining
(SDM) Process 1st Target

Binning & Coding Conversions
Univariate Exploration Swift Summary
Bivariate Exploration Swift Summary
Multivariate Exploration Swift Summary
SDM Exploration Process Types
SDM Exploration Process Aims
SDM Exploration Process Flowchart
Brief Comparison
SDM Process 1st Target: Multivariate Exploration

Chapter 3

Statistical Data Mining
(SDM) Process 2nd Target

SDM Process 2nd Target Modeling Types Flowchart
Clustering Modeling Types
Regression Modeling Types
Association Rules Modeling Types
Classification Modeling Types
Classification Measures
•Gini Index
•Entropy
•Information Gain
•Misclassification Error
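The four classification measures named on this slide (Gini Index, Entropy, Information Gain, Misclassification Error) can be illustrated with a minimal Python sketch. It uses the standard textbook definitions, which are assumed here because the slides do not show the formulas. The function names and the toy "yes"/"no" labels are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Relative frequency of each class label in a node."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini_index(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def misclassification_error(labels):
    """Error made by always predicting the majority class."""
    return 1.0 - max(class_proportions(labels))


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical toy node: 5 "yes" and 5 "no", split into two child nodes.
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]

print(gini_index(parent))
print(entropy(parent))
print(misclassification_error(parent))
print(information_gain(parent, children))
```

All three impurity measures are largest for an evenly mixed node and zero for a pure node, which is why any of them can rank candidate classifiers.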
Chapter 4

A New SDM Classifier Using
Jaccard Mining Procedure (JMP)
Case Study: Rheumatic Fever Data

JMP Materials & Methods
Rheumatic Fever Data Description
JMP Algorithm Flowchart
JMP Jaccard Matrix
JMP Jaccard Classes
JMP Jaccard Classifiers
Rheumatic Fever Data Frequencies
Rheumatic Fever Data Gini Index
Rheumatic Fever Data Gini Index Averages
Best JMP Classifier
JMP Classifier Java Program
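The slides name a Jaccard distance matrix as the basis of the JMP classifier but do not show how it is built. As a rough illustration only, the sketch below computes the standard Jaccard distance, 1 - |A ∩ B| / |A ∪ B|, between every pair of attributes. This is not the thesis's Java program. The attribute names and record ids are hypothetical, and representing each attribute as the set of records where it is present is an assumption.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |A & B| / |A | B|."""
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty sets count as identical
    return 1.0 - len(a & b) / len(union)


def jaccard_matrix(columns):
    """Pairwise Jaccard distances between named attribute sets."""
    names = list(columns)
    return {
        row: {col: jaccard_distance(columns[row], columns[col]) for col in names}
        for row in names
    }


# Hypothetical data: each attribute maps to the ids of records where it is present.
attributes = {
    "fever": {1, 2, 3, 4},
    "joint_pain": {2, 3, 4, 5},
    "rash": {6},
}

matrix = jaccard_matrix(attributes)
for name, row in matrix.items():
    print(name, row)
```

A distance of 0 means two attributes occur in exactly the same records, and 1 means they never occur together.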
Chapter 5

A New SDM Technique Using
Gamma Mining Procedure (GMP)
Case Study:
Breast Cancer Tumor Diagnosis

GMP Materials & Methods
Breast Cancer Data Description
Breast Cancer Training Data
Breast Cancer Test Data
GMP Flowchart
Training Data Gamma Classes
Training Data Gamma Classifiers
Training Data Gamma Matrix

GMP Materials & Methods
Gamma Correlation Coefficient (G)
Misclassification Error (MError)

Naïve Bayes Procedure for Prediction


No. of Agreements (C): The number of pairs of cases that are
ranked in the same relative order on both variables.

No. of Inversions (D): The number of pairs of cases that are
ranked in opposite order on the two variables.
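The agreement and inversion counts defined above combine into the Goodman-Kruskal gamma coefficient, G = (C - D) / (C + D). The slides do not show this formula, so the sketch below assumes the standard definition, counts C and D over all pairs of cases, and ignores tied pairs. The function name and the sample rankings are hypothetical.

```python
from itertools import combinations


def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D) over all pairs of cases; ties are ignored."""
    concordant = 0  # C: pairs ordered the same way on both variables
    discordant = 0  # D: pairs ordered in opposite ways on the two variables
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical rankings of four cases on two variables.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
print(goodman_kruskal_gamma(x, y))
```

G ranges from -1 (every untied pair is an inversion) to +1 (every untied pair is an agreement). For the sample above, C = 5 and D = 1, so G = 4/6.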
Breast Cancer Training Data
Gamma Average (GAve.)
MError
Training Data Gamma Average Classes

Training Data Gamma Average Matrix

Training Data Gamma Averages Absorption
Best GMP Classifier

Training Data Gamma Average Absorption Classifiers

Training Data GMP Classifier MError Average
Naïve Bayes Predicting Process
Test Data Attributes Mining
Test Data Objects Mining
Test Data???
Classification vs Clustering

GMP Validation Process

Clustering Cross Validation


Two-step Cluster KNN Classification Model
Breast Cancer Two-step Cluster Analysis
Two-step Cluster Analysis
Using SPSS Modeler
Two-step Clusters Sizes, Ratio & Model Summary
Cross Validation

Breast Cancer Data Prediction


Validation Using
Bootstrapping & Cross Validation
Applying KNN Classification Model
Testing Data KNN Prediction Process
Testing Data Prediction Result
Breast Cancer Predicted Data Statistics
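The validation slides pair a KNN classification model with cross validation. The thesis ran this step in SPSS Modeler. The sketch below is only a generic, dependency-free Python illustration of k-fold cross validation around a majority-vote KNN classifier. The two-feature points, the labels and the choice of 5 folds with k = 3 are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, point, k=3):
    """Majority vote among the k training points closest to `point`."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], point))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cross_validated_error(xs, ys, folds=5, k=3):
    """Misclassification rate, with every fold held out exactly once."""
    errors = 0
    for fold in range(folds):
        test_idx = set(range(fold, len(xs), folds))  # every folds-th case
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        for i in test_idx:
            if knn_predict(train_x, train_y, xs[i], k) != ys[i]:
                errors += 1
    return errors / len(xs)


# Hypothetical, well-separated two-feature data (not the breast cancer data).
xs = [(0.1 * i, 0.0) for i in range(5)] + [(5.0 + 0.1 * i, 5.0) for i in range(5)]
ys = ["benign"] * 5 + ["malignant"] * 5

print(cross_validated_error(xs, ys, folds=5, k=3))
```

Holding each fold out once gives an error estimate from cases the model never saw during training, which is the role cross validation plays in the GMP validation process.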
GMP Program
Chapter 6

Text Mining
&
Data Mining
Structured vs Unstructured
World Data Percentage
Structured Numerical or Coded Information

Unstructured or Semi-structured Information


One Minute on the Internet is a Very Long Time

6,000,000 Facebook Views

200 Million Search Queries on Google

1.3 Million Video Views on YouTube


Text Mining Process
The combination of data and text mining is
referred to as "Duo-Mining". Duo-mining
gives companies the edge on consolidated
information for better decision making.
This combination has proven to be especially
useful to banking and credit card companies,
which are no longer limited to analyzing only
the structured data they collect from
transactions.
Duo-Mining Process
Benefits of Text Mining
• Text-mining can aid in systematically reviewing a large body of
literature
• Text-mining can help researchers keep up in their fields, reducing
the risk they've missed something relevant
• Text-mining aids in the discovery of patterns and trends in data,
associations among entities, predictive rules, etc.
• Text-mining has the ability to enrich unstructured text with
semantic tags and annotations (e.g., see FOAF - Friend of a Friend)
• Text-mining assists authors with tools to develop semantic
annotations
• Text-mining is a form of document and information management
• Text-mining enables the enrichment of your digital libraries
• Text-mining makes it easier for scientists to engage in intelligent
searching, linking and integration of text and databases
Disadvantages of Text Mining
• Data collection in text-mining requires managing a lot of
"free text"
• Often, the data is ill-organized and not described in any
way
• The data is both semi-structured and unstructured
• Natural language texts contain ambiguities and context
sensitivity, which require human intervention, e.g.,
automobile = car = vehicle = Toyota, or Apple (the
company) vs. apple (the fruit)
• There are lexical, syntactic, semantic and pragmatic
ambiguities, among other challenges
• Learning techniques for processing text typically need
annotated training examples
• Developing resources (ontologies, corpora) to improve
text mining research is not a simple matter
• There is a very high number of possible "dimensions", with all
possible word and phrase types in the language
• Unlike in data mining, the records (= docs) are
not structurally identical and not statistically independent
• There are complex and subtle relationships between concepts in
text
Text Mining Tools
Practical Application by CALAIS
(Malaysia Airlines Flight 370 Article)
CALAIS Documents Categorization
Malaysia Airlines Flight 370:
Is There a Bermuda Triangle Connection
Live Science, 12/3/2014, by Benjamin Radford
Malaysia Airlines Flight 370
CALAIS Relevance Percentages
Practical Application by Textalyser
Analysis (Malaysia Airlines Flight)
Malaysia Airlines Flight 370 Article
Textalyser General Analysis
Malaysia Airlines Flight 370 Article
Frequency and Top Words Analysis
Malaysia Airlines Flight 370 Article
Word Length Analysis
Malaysia Airlines Flight 370 Article
2, 3, 4 & 5 Word Phrases Frequency Analysis
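The three Textalyser analyses listed above (top-word frequency, word length, and 2- to 5-word phrase frequency) can be approximated with a short Python sketch. This is not Textalyser's own implementation. The tokenization rule and the sample sentence are hypothetical, and the sentence is not taken from the analyzed article.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def top_words(tokens, n=3):
    """The n most frequent words with their counts."""
    return Counter(tokens).most_common(n)


def word_length_counts(tokens):
    """How many words there are of each character length."""
    return Counter(len(token) for token in tokens)


def phrase_counts(tokens, size):
    """Frequency of every run of `size` consecutive words."""
    phrases = (
        " ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)
    )
    return Counter(phrases)


# Hypothetical sample sentence, used only to exercise the functions.
sample = "the flight vanished and the flight path is unknown"
tokens = tokenize(sample)

print(top_words(tokens, 2))
print(word_length_counts(tokens))
for size in (2, 3, 4, 5):
    print(size, phrase_counts(tokens, size).most_common(1))
```

Counting phrases of increasing length shows which word combinations recur in a document, the same idea behind the 2, 3, 4 and 5 word phrase tables.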
1st Paper: Chapter 1
What is the Major Power Linking
Statistics & Data Mining? November 2013
Statistics & Data Mining
2nd Paper: Chapter 4

A New SDM Classifier Using


Jaccard Mining Procedure (JMP)

Case Study: Rheumatic Fever Data


March 2014
3rd Paper: Chapter 5
A New SDM Technique Using
Gamma Mining Procedure (GMP)
Case Study
Breast Cancer Tumor Diagnosis
4th Paper: Chapter 6

Malaysia Airlines Flight 370


Article Analysis

To Appear
Future Work
• The significant conclusions above motivate our future
work: a project addressing the current, real situation of
SDM analysis in Egypt.

• The new JMP & GMP SDM techniques will be generalized
to suit any type of data.
Researcher: S. Abd El-Badie
Email: soaad.attia@guc.edu.eg
Homepage: www.savvymore.mysite.com
Tel: 01000889394
