
Vincent Granville, Ph.D.

Co-Founder, DSC
Feature Selection For Unsupervised Learning
Abstract

After reviewing popular techniques used in supervised, unsupervised, and semi-supervised machine learning, we
focus on feature selection methods in these different contexts, especially the metrics used to assess the value of a
feature or set of features, whether the variables are binary, continuous, or categorical.

We then go into deeper detail and review modern feature selection techniques for unsupervised learning, which
typically rely on entropy-like criteria. While these criteria are usually model-dependent or scale-dependent, we introduce
a new model-free, data-driven methodology in this context, with an application to an interesting number theory
problem (a simulated data set) in which each feature has a known theoretical entropy.

We also briefly discuss high-precision computing, as it is relevant to this particular data set, as well as units of
information smaller than the bit.

IBM Community Day: Data Science / July 24, 2018 / © 2018 IBM Corporation
Content
1. Review of supervised and unsupervised learning

2. Feature selection

* Supervised (goodness of fit)
* Unsupervised (entropy)

3. New approach (unsupervised)

Supervised Learning
Based on training sets, cross-validation, goodness-of-fit. Popular techniques:

1. Linear or logistic regression, predictive modeling

2. Neural nets, deep learning

3. Supervised classification

4. Regression and decision trees

Unsupervised Learning
No training set. Popular techniques:

1. Pattern recognition, association rules

2. Unsupervised clustering, taxonomy creation

3. Data reduction or compression

4. Graph-based methods

5. NLP, image processing

6. Semi-supervised

Note: There is overlap between supervised and unsupervised learning. Example: neural nets (can be
supervised or unsupervised).

Unsupervised Clustering: Example

Figure: star cluster classification based on age and metal content

The Concept of Feature (1/2)
Two types of variables:

1. Dependent variable (or response) – supervised learning

2. Independent variables (or predictors, or features) – usually cross-correlated

Y = a1 X1 + a2 X2 + …

Here,
* Y is the dependent variable,
* the X’s are the features,
* the a’s are the model parameters (to be estimated).
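As a minimal sketch of this setup (made-up data and coefficient values, scikit-learn assumed available):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Made-up data: two features X1, X2 and a response Y (true a1 = 1.5, a2 = -0.7)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))                                    # columns are the features X1, X2
    Y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.1, size=100)

    model = LinearRegression().fit(X, Y)
    print(model.coef_, model.intercept_)                             # estimated a1, a2 and intercept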

The Concept of Feature (2/2)
Variables (features) can be:

* Discrete / continuous
* Qualitative (gender, country, cluster label)
* Mixed (unstructured, text, email data)
* Binary (dummy variable to represent gender)
* Raw, binned, summarized, or compound variables

Potential issues:

* Duplicate or missing data
* Fuzzy data (when merging man-made databases, or due to typos)

Example: MIT = M.I.T. = Massachusetts Institute of Technology

Solution: use a lookup table, e.g. one matching MIT with its variations
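A minimal sketch of such a lookup table in Python (the variant spellings listed are illustrative):

    # Map known spelling variations to a canonical name (illustrative entries)
    lookup = {
        "MIT": "Massachusetts Institute of Technology",
        "M.I.T.": "Massachusetts Institute of Technology",
        "M.I.T": "Massachusetts Institute of Technology",
    }

    def normalize(name):
        # Fall back to the original string when no variation is listed
        return lookup.get(name.strip(), name)

    print(normalize("M.I.T"))   # Massachusetts Institute of Technology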

Feature Selection: Supervised Learning
Select features with best predictive power.

Algorithms:
* Step-wise (add one feature at a time)
* Backward (start with all features, removing one at a time)
* Mixed

Criteria for feature selection:
* R-squared (or a robust version of this metric)
* Goodness-of-fit metrics, confusion matrix (clustering)

Notes:
* Unlike PCA, it leaves the features unchanged (easier to interpret)
* Combinatorial problem; a local optimum is acceptable, and a stopping rule is needed
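A minimal sketch of the step-wise (forward) approach with R-squared as the criterion, on made-up data (scikit-learn assumed available; the stopping rule here is simply a maximum number of features):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def forward_stepwise(X, y, max_features):
        """At each step, add the feature that most improves R-squared."""
        selected, remaining = [], list(range(X.shape[1]))
        while remaining and len(selected) < max_features:
            r2, best = max(
                (LinearRegression().fit(X[:, selected + [j]], y)
                 .score(X[:, selected + [j]], y), j)
                for j in remaining
            )
            selected.append(best)
            remaining.remove(best)
        return selected

    # Made-up data: 5 features, only the first two drive the response
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)
    print(forward_stepwise(X, y, max_features=2))   # expected: [0, 1]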
Feature Selection: Unsupervised Learning
Select features with highest entropy.

Criteria for feature selection:
* Shannon entropy (categorical features)
* Joint entropy (measured on a set of features)
* Differential entropy (for continuous features)
* Akaike information (related to maximum likelihood)
* Kullback-Leibler divergence (closely related to Akaike information)

Drawbacks
* Model-dependent (not data-driven)
* Scale-dependent
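A minimal sketch of the entropy criterion for categorical features (Shannon entropy in bits, on made-up data):

    import numpy as np
    from collections import Counter

    def shannon_entropy(values):
        """Shannon entropy, in bits, of a categorical feature."""
        counts = np.array(list(Counter(values).values()), dtype=float)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log2(p)))

    # Made-up categorical features; rank them by entropy (highest first)
    features = {
        "country": ["US", "US", "FR", "DE", "US", "FR", "JP", "DE"],
        "gender":  ["M", "F", "F", "F", "M", "F", "F", "F"],
    }
    for name in sorted(features, key=lambda f: shannon_entropy(features[f]), reverse=True):
        print(name, round(shannon_entropy(features[name]), 3))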

New Methodology for Unsupervised Learning (1/3)
Create an artificial response Y, as if dealing with a linear regression problem, and set all
regression coefficients to 1. Compute R-squared (or a similar metric) on subsets of
features to identify the best subsets for a fixed number of features. Add one (or two)
features at a time. Proceed as in the supervised learning framework; a minimal sketch is given below.
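A minimal sketch of this scheme, with made-up data and scikit-learn assumed available (the artificial response is the plain sum of the features, i.e. all coefficients equal to 1):

    import numpy as np
    from itertools import combinations
    from sklearn.linear_model import LinearRegression

    def best_subsets(X, subset_size):
        """Score each feature subset by the R-squared of regressing the
        artificial response Y = X1 + X2 + ... on that subset."""
        y = X.sum(axis=1)                       # all regression coefficients set to 1
        scores = []
        for cols in map(list, combinations(range(X.shape[1]), subset_size)):
            r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
            scores.append((r2, cols))
        return sorted(scores, reverse=True)

    # Made-up data: 4 cross-correlated features
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))
    X[:, 1] += 0.5 * X[:, 0]                    # introduce some correlation
    print(best_subsets(X, subset_size=2)[:3])   # three best pairs of features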

Benefits
* Scale-invariant
* Model-free, data-driven
* Simple, easy to understand
* Can handle categorical features, as dummy variables

More advanced version:
* Test various sets of regression coefficients (Monte Carlo simulations)

New Methodology for Unsupervised Learning (2/3)
Example
A simulated data set with two binary features X1 and X2, an artificial response Y =
X1 + X2, and a known theoretical entropy for each feature. The features are correlated
and also exhibit auto-correlation, as in real data sets. The data set has 47
observations.

Instead of entropy, we use the correlation between the response and a feature to measure
the amount of information attached to that feature.
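A minimal sketch of this correlation-based criterion, using made-up binary sequences in place of the actual digit data:

    import numpy as np

    def binary_entropy(bits):
        """Empirical Shannon entropy, in bits, of a binary sequence."""
        p = np.mean(bits)
        if p in (0.0, 1.0):
            return 0.0
        return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

    # Made-up binary features standing in for the two digit sequences (47 observations)
    rng = np.random.default_rng(0)
    x1 = (rng.random(47) < 0.5).astype(int)      # higher entropy (proportion of 1s near 0.5)
    x2 = (rng.random(47) < 0.85).astype(int)     # lower entropy
    y = x1 + x2                                   # artificial response, coefficients = 1

    for name, x in [("X1", x1), ("X2", x2)]:
        print(name, "corr with Y:", round(float(np.corrcoef(x, y)[0, 1]), 3),
              "entropy:", round(binary_entropy(x), 3))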

Results
The computed Shannon entropy, the theoretical entropy, and the correlation-based
information metric introduced here are almost equivalent: the higher the correlation,
the higher the entropy. Divergence occurs only when both features have almost the
same entropy, so it is not an issue. The best feature is the one maximizing the selection
criterion.

New Methodology for Unsupervised Learning (3/3)
Details about the Example
The data set in the spreadsheet consists of the first 47 digits of two numbers:

* Feature #1: digits of log(3/2) in base b = 1.7
* Feature #2: digits of SQRT(1/2) in base b = 2
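A minimal sketch of how digits in a base b > 1 can be generated (a standard greedy expansion of a number x in [0, 1); this is an assumption about how the digits were produced, not necessarily the author's exact procedure):

    import math

    def digits_in_base(x, b, n):
        """First n digits of x (0 <= x < 1) in base b > 1, via the greedy expansion
        x -> b*x, digit = floor(b*x). For 1 < b <= 2 each digit is 0 or 1."""
        digits = []
        for _ in range(n):
            x *= b
            d = int(x)
            digits.append(d)
            x -= d
        return digits

    x1 = digits_in_base(math.log(1.5), 1.7, 47)    # Feature #1: digits of log(3/2) in base 1.7
    x2 = digits_in_base(math.sqrt(0.5), 2.0, 47)   # Feature #2: digits of SQRT(1/2) in base 2
    print(x1)
    print(x2)

Note that a double-precision float carries about 53 bits of mantissa, so this sketch runs out of precision after roughly 50 digits, consistent with the remark below that more than 47 observations requires high-precision computing.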

Interesting facts

* For more than 47 observations, HPC (high-precision computing) is needed due to machine precision
* A digit in base 1 < b < 2 carries less than one bit of information
* The theoretical entropy is proportional to the logarithm of the base b
* The methodology was also successfully tested on continuous features
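For illustration, if the entropy per digit is taken to be log2(b) bits (one reading of the two statements above), then a digit in base b = 1.7 carries about log2(1.7) ≈ 0.77 bits, a unit of information smaller than the bit, while a digit in base b = 2 carries exactly 1 bit.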

Resources
The following web document (see URL below) contains:

* References to various entropy criteria and feature selection techniques
* Details about the new methodology
* Access to the spreadsheet with the data and detailed computations
* A reference to the underlying number theory context (numeration systems)
* A reference to high-precision computing

Link: https://dsc.news/2L4ChPA

Thank you

Vincent Granville, Ph.D.

Co-Founder, DSC

vincentg@datasciencecentral.com
+1-925-759-7308
DataScienceCentral.com

