Feature Selection For Unsupervised Learning
Abstract
After reviewing popular techniques used in supervised, unsupervised, and semi-supervised machine learning, we
focus on feature selection methods in these different contexts, especially the metrics used to assess the value of a
feature or set of features, whether the variables are binary, continuous, or categorical.
We go into deeper detail and review modern feature selection techniques for unsupervised learning, typically
relying on entropy-like criteria. While these criteria are usually model-dependent or scale-dependent, we introduce
a new model-free, data-driven methodology in this context, with an application to an interesting number theory
problem (a simulated data set) in which each feature has a known theoretical entropy.
We also briefly discuss high-precision computing, as it is relevant to this peculiar data set, as well as units of
information smaller than the bit.
IBM Community Day: Data Science / July 24, 2018 / © 2018 IBM Corporation
Content
1. Review of supervised and unsupervised learning
2. Feature selection
Supervised Learning
Based on training sets, cross-validation, goodness-of-fit. Popular techniques:
3. Supervised classification
Unsupervised Learning
No training set. Popular techniques:
4. Graph-based methods
6. Semi-supervised
Note: There is overlap between supervised and unsupervised learning. Example: neural nets (can be
supervised or unsupervised).
Unsupervised Clustering: Example
The Concept of Feature (1/2)
Two types of variables: the response (dependent variable) and the features (independent variables). For instance, in a linear regression:
Y = a1 X1 + a2 X2 + ….
Here, Y is the response and X1, X2, … are the features.
The Concept of Feature (2/2)
Variables (features) can be:
* Discrete / continuous
* Qualitative (gender, country, cluster label)
* Mixed (unstructured, text, email data)
* Binary (dummy variable to represent gender)
* Raw, binned, summarized, or compound variables
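As a sketch of the dummy-variable idea above, a qualitative feature can be expanded into one binary column per category; a minimal hand-rolled encoding (the function name `one_hot` is ours, for illustration):

```python
def one_hot(values):
    """Encode a categorical feature as one binary dummy column per category."""
    categories = sorted(set(values))          # fixed, sorted category order
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories
```

For example, `one_hot(['F', 'M', 'F'])` produces one dummy column for 'F' and one for 'M'. In practice, one category is often dropped to avoid collinearity with the intercept.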
Potential issues:
Feature Selection: Supervised Learning
Select features with best predictive power.
Algorithms:
* Forward / stepwise (add one feature at a time)
* Backward (start with all features, remove one at a time)
* Mixed
Notes:
* Unlike PCA, it leaves the features unchanged (easier to interpret)
* Combinatorial problem, local optimum OK, stopping rule needed
Feature Selection: Unsupervised Learning
Select features with highest entropy.
Drawbacks
* Model-dependent (not data-driven)
* Scale-dependent
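As a reference point, the entropy criterion can be computed empirically for a discrete feature; a minimal sketch (the function name `shannon_entropy` is ours):

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Empirical Shannon entropy (in bits) of a discrete feature."""
    counts = Counter(values)
    n = len(values)
    # H = -sum p * log2(p) over the observed category frequencies
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Note that for a continuous feature the result depends on how the values are binned, which is one way the scale-dependence listed above shows up.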
New Methodology for Unsupervised Learning (1/3)
Create an artificial response Y, as if dealing with a linear regression problem, setting all
regression coefficients to 1. Compute R-squared (or a similar metric) on subsets of
features, to identify the best subsets for a fixed number of features. Add one or two
features at a time. Proceed as in the supervised learning framework.
Benefits
* Scale-invariant
* Model-free, data-driven
* Simple, easy to understand
* Can handle categorical features, as dummy variables
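The procedure above can be sketched as forward selection against the artificial response, assuming numeric features in a NumPy array (function names are ours; this is an illustration, not the exact implementation from the talk):

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an OLS fit of y on the columns of X (intercept included)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def forward_select(X, k):
    """Greedily pick k features maximizing R-squared against the artificial response."""
    y = X.sum(axis=1)                 # artificial response: all coefficients set to 1
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda j: r_squared(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because R-squared is invariant under rescaling of the regressors, the criterion does not change when a candidate feature is multiplied by a constant, which is the scale-invariance benefit listed above.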
New Methodology for Unsupervised Learning (2/3)
Example
Simulated data sets with two binary features X1 and X2, with artificial response Y =
X1 + X2, and with known theoretical entropy for each feature. Features are correlated
and also exhibit auto-correlation, as in real data sets. The data sets have 47
observations.
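The actual data set comes from a spreadsheet (see the next slide); as a hypothetical stand-in, here is a simulation of two correlated binary features with 47 observations and the artificial response Y = X1 + X2 (all names and the correlation level are ours):

```python
import numpy as np

rng = np.random.default_rng(123)
n = 47

# Two correlated binary features: X2 copies X1 about 70% of the time.
x1 = rng.integers(0, 2, size=n)
x2 = np.where(rng.random(n) < 0.7, x1, 1 - x1)
y = x1 + x2                          # artificial response, coefficients set to 1

def binary_entropy(x):
    """Empirical Shannon entropy (in bits) of a binary feature."""
    p = x.mean()
    if p in (0.0, 1.0):
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

for name, x in [("X1", x1), ("X2", x2)]:
    print(name, binary_entropy(x), np.corrcoef(x, y)[0, 1])
```

Comparing each feature's entropy with its correlation to Y mirrors the comparison reported in the results below.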
Results
The computed Shannon entropy, the theoretical entropy, and the correlation-based
information metric introduced here are almost equivalent: the higher the correlation,
the higher the entropy. Divergence occurs only when both features have almost the
same entropy, so this is not an issue in practice. The best feature is the one maximizing
the selection criterion.
New Methodology for Unsupervised Learning (3/3)
Details about the Example
The data set in the spreadsheet consists of the first 47 digits of two numbers:
Interesting facts
Resources
The following web document (see URL below) contains:
Link: https://dsc.news/2L4ChPA
Thank you