
Vincent Granville, Ph.D.

Co-Founder, DSC
Feature Selection For Unsupervised Learning
Abstract

After reviewing popular techniques used in supervised, unsupervised, and semi-supervised machine learning, we
focus on feature selection methods in these different contexts, especially the metrics used to assess the value of a
feature or set of features, whether the variables are binary, continuous, or categorical.

We then go into deeper detail and review modern feature selection techniques for unsupervised learning, which
typically rely on entropy-like criteria. While these criteria are usually model-dependent or scale-dependent, we introduce
a new model-free, data-driven methodology in this context, with an application to an interesting number theory
problem (a simulated data set) in which each feature has a known theoretical entropy.

We also briefly discuss high-precision computing, as it is relevant to this particular data set, as well as units of
information smaller than the bit.

IBM Community Day: Data Science / July 24, 2018 / © 2018 IBM Corporation
Content
1. Review of supervised and unsupervised learning

2. Feature selection

* Supervised (goodness of fit)
* Unsupervised (entropy)

3. New approach (unsupervised)

Supervised Learning
Based on training sets, cross-validation, goodness-of-fit. Popular techniques:

1. Linear or logistic regression, predictive modeling

2. Neural nets, deep learning

3. Supervised classification

4. Regression and decision trees

Unsupervised Learning
No training set. Popular techniques:

1. Pattern recognition, association rules

2. Unsupervised clustering, taxonomy creation

3. Data reduction or compression

4. Graph-based methods

5. NLP, image processing

6. Semi-supervised

Note: There is overlap between supervised and unsupervised learning. Example: neural nets (can be
supervised or unsupervised).

Unsupervised Clustering: Example

Figure: star cluster classification based on age and metal content

The Concept of Feature (1/2)
Two types of variables:

1. Dependent variable (or response) – supervised learning

2. Independent variables (or predictors, or features) – usually cross-correlated

Y = a1 X1 + a2 X2 + …

Here,
* Y is the dependent variable,
* the X’s are the features,
* the a’s are the model parameters (to be estimated).
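As a minimal sketch of this setup (made-up data and coefficient values, scikit-learn assumed available):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Made-up data: two features X1, X2 and a response Y (true a1 = 1.5, a2 = -0.7)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))                                    # columns are the features X1, X2
    Y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.1, size=100)

    model = LinearRegression().fit(X, Y)
    print(model.coef_, model.intercept_)                             # estimated a1, a2 and intercept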

The Concept of Feature (2/2)
Variables (features) can be:

* Discrete / continuous
* Qualitative (gender, country, cluster label)
* Mixed (unstructured, text, email data)
* Binary (dummy variable to represent gender)
* Raw, binned, summarized, or compound variables

Potential issues:

* Duplicate or missing data
* Fuzzy data (when merging man-made databases, or due to typos)

Example: MIT = M.I.T. = Massachusetts Institute of Technology

Solution: use a lookup table, e.g. one matching MIT with its variations
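A minimal sketch of such a lookup table in Python (the variant spellings listed are illustrative):

    # Map known spelling variations to a canonical name (illustrative entries)
    lookup = {
        "MIT": "Massachusetts Institute of Technology",
        "M.I.T.": "Massachusetts Institute of Technology",
        "M.I.T": "Massachusetts Institute of Technology",
    }

    def normalize(name):
        # Fall back to the original string when no variation is listed
        return lookup.get(name.strip(), name)

    print(normalize("M.I.T"))   # Massachusetts Institute of Technology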

Feature Selection: Supervised Learning
Select features with best predictive power.

Algorithms:
* Step-wise (add one feature at a time)
* Backward (start with all features, removing one at a time)
* Mixed

Criteria for feature selection:
* R-squared (or a robust version of this metric)
* Goodness-of-fit metrics, confusion matrix (clustering)

Notes:
* Unlike PCA, it leaves the features unchanged (easier to interpret)
* Combinatorial problem; a local optimum is acceptable, and a stopping rule is needed
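A minimal sketch of the step-wise (forward) approach with R-squared as the criterion, on made-up data (scikit-learn assumed available; the stopping rule here is simply a maximum number of features):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def forward_stepwise(X, y, max_features):
        """At each step, add the feature that most improves R-squared."""
        selected, remaining = [], list(range(X.shape[1]))
        while remaining and len(selected) < max_features:
            r2, best = max(
                (LinearRegression().fit(X[:, selected + [j]], y)
                 .score(X[:, selected + [j]], y), j)
                for j in remaining
            )
            selected.append(best)
            remaining.remove(best)
        return selected

    # Made-up data: 5 features, only the first two drive the response
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)
    print(forward_stepwise(X, y, max_features=2))   # expected: [0, 1]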
Feature Selection: Unsupervised Learning
Select features with highest entropy.

Criteria for feature selection:
* Shannon entropy (categorical features)
* Joint entropy (measured on a set of features)
* Differential entropy (for continuous features)
* Akaike information (related to maximum likelihood)
* Kullback-Leibler divergence (closely related to Akaike information)

Drawbacks
* Model-dependent (not data-driven)
* Scale-dependent
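A minimal sketch of the entropy criterion for categorical features (Shannon entropy in bits, on made-up data):

    import numpy as np
    from collections import Counter

    def shannon_entropy(values):
        """Shannon entropy, in bits, of a categorical feature."""
        counts = np.array(list(Counter(values).values()), dtype=float)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log2(p)))

    # Made-up categorical features; rank them by entropy (highest first)
    features = {
        "country": ["US", "US", "FR", "DE", "US", "FR", "JP", "DE"],
        "gender":  ["M", "F", "F", "F", "M", "F", "F", "F"],
    }
    for name in sorted(features, key=lambda f: shannon_entropy(features[f]), reverse=True):
        print(name, round(shannon_entropy(features[name]), 3))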

New Methodology for Unsupervised Learning (1/3)
Create an artificial response Y, as if dealing with a linear regression problem, and set all
regression coefficients to 1. Compute R-squared (or a similar metric) on subsets of
features to identify the best subsets for a fixed number of features. Add one (or two)
features at a time. Proceed as in the supervised learning framework; a minimal sketch is given below.
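A minimal sketch of this scheme, with made-up data and scikit-learn assumed available (the artificial response is the plain sum of the features, i.e. all coefficients equal to 1):

    import numpy as np
    from itertools import combinations
    from sklearn.linear_model import LinearRegression

    def best_subsets(X, subset_size):
        """Score each feature subset by the R-squared of regressing the
        artificial response Y = X1 + X2 + ... on that subset."""
        y = X.sum(axis=1)                       # all regression coefficients set to 1
        scores = []
        for cols in map(list, combinations(range(X.shape[1]), subset_size)):
            r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
            scores.append((r2, cols))
        return sorted(scores, reverse=True)

    # Made-up data: 4 cross-correlated features
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))
    X[:, 1] += 0.5 * X[:, 0]                    # introduce some correlation
    print(best_subsets(X, subset_size=2)[:3])   # three best pairs of features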

Benefits
* Scale-invariant
* Model-free, data-driven
* Simple, easy to understand
* Can handle categorical features, as dummy variables

More advanced version:
* Test various sets of regression coefficients (Monte Carlo simulations)

New Methodology for Unsupervised Learning (2/3)
Example
A simulated data set with two binary features X1 and X2, an artificial response Y =
X1 + X2, and a known theoretical entropy for each feature. The features are correlated
and also exhibit auto-correlation, as in real data sets. The data set has 47
observations.

Instead of entropy, we use the correlation between the response and a feature to measure
the amount of information attached to that feature.
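A minimal sketch of this correlation-based criterion, using made-up binary sequences in place of the actual digit data:

    import numpy as np

    def binary_entropy(bits):
        """Empirical Shannon entropy, in bits, of a binary sequence."""
        p = np.mean(bits)
        if p in (0.0, 1.0):
            return 0.0
        return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

    # Made-up binary features standing in for the two digit sequences (47 observations)
    rng = np.random.default_rng(0)
    x1 = (rng.random(47) < 0.5).astype(int)      # higher entropy (proportion of 1s near 0.5)
    x2 = (rng.random(47) < 0.85).astype(int)     # lower entropy
    y = x1 + x2                                   # artificial response, coefficients = 1

    for name, x in [("X1", x1), ("X2", x2)]:
        print(name, "corr with Y:", round(float(np.corrcoef(x, y)[0, 1]), 3),
              "entropy:", round(binary_entropy(x), 3))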

Results
The computed Shannon entropy, the theoretical entropy, and the correlation-based
information metric introduced here are almost equivalent: the higher the correlation,
the higher the entropy. Divergence occurs only when both features have almost the
same entropy, so it is not an issue. The best feature is the one maximizing the selection
criterion.

New Methodology for Unsupervised Learning (3/3)
Details about the Example
The data set in the spreadsheet consists of the first 47 digits of two numbers:

* Feature #1: digits of log(3/2) in base b = 1.7
* Feature #2: digits of SQRT(1/2) in base b = 2
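A minimal sketch of how digits in a base b > 1 can be generated (a standard greedy expansion of a number x in [0, 1); this is an assumption about how the digits were produced, not necessarily the author's exact procedure):

    import math

    def digits_in_base(x, b, n):
        """First n digits of x (0 <= x < 1) in base b > 1, via the greedy expansion
        x -> b*x, digit = floor(b*x). For 1 < b <= 2 each digit is 0 or 1."""
        digits = []
        for _ in range(n):
            x *= b
            d = int(x)
            digits.append(d)
            x -= d
        return digits

    x1 = digits_in_base(math.log(1.5), 1.7, 47)    # Feature #1: digits of log(3/2) in base 1.7
    x2 = digits_in_base(math.sqrt(0.5), 2.0, 47)   # Feature #2: digits of SQRT(1/2) in base 2
    print(x1)
    print(x2)

Note that a double-precision float carries about 53 bits of mantissa, so this sketch runs out of precision after roughly 50 digits, consistent with the remark below that more than 47 observations requires high-precision computing.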

Interesting facts

* For more than 47 observations, HPC (high-precision computing) is needed due to machine precision
* A digit in base 1 < b < 2 carries less than one bit of information
* The theoretical entropy is proportional to the logarithm of the base b
* The methodology was also successfully tested on continuous features
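For illustration, if the entropy per digit is taken to be log2(b) bits (one reading of the two statements above), then a digit in base b = 1.7 carries about log2(1.7) ≈ 0.77 bits, a unit of information smaller than the bit, while a digit in base b = 2 carries exactly 1 bit.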

Resources
The following web document (see URL below) contains:

* References to various entropy criteria and feature selection techniques
* Details about the new methodology
* Access to the spreadsheet with the data and detailed computations
* A reference to the underlying number theory context (numeration systems)
* A reference to high-precision computing

Link: https://dsc.news/2L4ChPA

Thank you

Vincent Granville, Ph.D.

Co-Founder, DSC

vincentg@datasciencecentral.com
+1-925-759-7308
DataScienceCentral.com

