
Feature Selection

by Rischan Mafrur

Techniques for Feature Selection


Outlier Removal
Data Normalization
t-Test
The Receiver Operating Characteristic Curve
Fisher's Discriminant Ratio

Outlier Removal

Learning by Example
Example 4.2.1 [page 107]: We have N = 100 random data points drawn from a one-dimensional Gaussian with mean 1 and variance 0.16. We then add five outlier points: [6.2, -6.4, 4.2, 15, 6.8].

How can we remove the outliers?

1. Generate the data set
2. Add the outlier values
3. Scramble the data
4. Find the outliers and their indices
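The four steps above can be sketched in Python. The Matlab code from the slides is not reproduced in the text, so this is only an illustration under stated assumptions: the helper name find_outliers, its times parameter, and the rule "flag points farther than times standard deviations from the mean" are assumptions, not necessarily the method used in the original example.

```python
import random
import statistics


def find_outliers(data, times=1.0):
    """Return (index, value) pairs lying more than `times` sample standard
    deviations away from the sample mean. Hypothetical helper, not the
    slides' own code."""
    mean = statistics.fmean(data)
    std = statistics.stdev(data)  # sample standard deviation
    return [(i, x) for i, x in enumerate(data) if abs(x - mean) > times * std]


random.seed(0)  # fixed seed so the run is repeatable

# 1. Generate N = 100 points from a Gaussian with mean 1 and variance 0.16
#    (standard deviation 0.4).
data = [random.gauss(1.0, 0.4) for _ in range(100)]

# 2. Add the five outliers from the example.
data += [6.2, -6.4, 4.2, 15, 6.8]

# 3. Scramble the data so the outliers are not all at the end.
random.shuffle(data)

# 4. Find the outliers and report their positions and values.
for index, value in find_outliers(data, times=1.0):
    print(index, value)
```

Because the outliers themselves inflate the mean and the standard deviation, the choice of times matters: a larger multiplier can miss the milder outliers such as 4.2.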

Cont..
Result:

Now we can identify the values and the positions of the outliers.

Data Normalization

3 Normalization Methods
By Standard Deviation
By Min-Max Value Range
By Softmax Normalization
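The three methods can be sketched in Python as follows. This is a minimal sketch, not the Matlab functions from the slides: the function names, the default range [-1, 1], the softmax parameter r, the use of the sample standard deviation, and the sample values are all assumptions made for illustration.

```python
import math
import statistics


def normalize_std(values):
    """Shift to zero mean and scale to unit (sample) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [(x - mean) / std for x in values]


def normalize_min_max(values, low=-1.0, high=1.0):
    """Map the smallest value to `low` and the largest to `high`, linearly."""
    vmin, vmax = min(values), max(values)
    scale = (high - low) / (vmax - vmin)
    return [low + (x - vmin) * scale for x in values]


def normalize_softmax(values, r=0.5):
    """Squash values into (0, 1) with a logistic function.

    y = (x - mean) / (r * std), output = 1 / (1 + exp(-y)).
    `r` is a user-chosen parameter (assumed here, as is its default).
    """
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [1.0 / (1.0 + math.exp(-(x - mean) / (r * std))) for x in values]


# Hypothetical sample values, not the data from the example.
sample = [2.0, 4.0, 6.0, 8.0, 30.0]
print(normalize_std(sample))
print(normalize_min_max(sample, -1.0, 1.0))
print(normalize_softmax(sample, r=0.5))
```

Min-max scaling is linear, so a single extreme value compresses all the others toward one end of the range; the softmax variant is roughly linear near the mean and saturates for values far from it.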

Example 4.3.1 [page:109]


data :
The problem: how can we normalize this data?

Matlab code solution

NormalizeStd Function

NormalizeMnMx Function

NormalizeSoftMax Function

Result
Original data
Normalized by min-max [-1,1]
Normalized by standard deviation
Normalized by softmax [0.5]

t-TEST

Learning by Example
Example 4.4.1 [page 112]: Assume the data are normally distributed. We have two Gaussian classes with means m1 = 8.75 and m2 = 9, and variance 4.

Generate the vectors x1 and x2, each containing N = 1000 samples. Assume we do not know the means or the variance and know only the vectors x1 and x2; we want to test whether the means of the two data sets are equal. We use two significance levels: 5% (confidence level 95%) and 0.1% (confidence level 99.9%).

Cont...
In the t-test we have two hypotheses:
H0: The mean values of the data in the two classes are equal.
H1: The mean values of the data in the two classes are not equal.
In this case, at the 5% significance level the result is h = 1, which implies that the hypothesis of equal means can be rejected. At the 0.1% significance level the result is h = 0, which implies that there is no evidence to reject the hypothesis of equal means.
With m1 = 8.75 and m2 = 9, we can conclude that the outcome depends on the chosen significance level: at 5% the means of the two classes are judged to be different, while at 0.1% the data do not provide enough evidence to reject equality. Note that h = 0 does not prove that the means are equal.
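The test can be sketched in Python with the standard library. This is an illustration, not the slides' Matlab code: it computes the pooled-variance two-sample t statistic and, as a simplifying assumption, compares it with a critical value from the standard normal distribution, which is a close approximation to the t distribution only for large samples such as N = 1000. The function names are hypothetical.

```python
import math
import random
import statistics


def two_sample_t(x1, x2):
    """Pooled-variance two-sample t statistic for the difference in means."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = statistics.fmean(x1), statistics.fmean(x2)
    v1, v2 = statistics.variance(x1), statistics.variance(x2)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))


def reject_equal_means(x1, x2, alpha):
    """Return True (h = 1) if H0 is rejected at significance level alpha.

    Uses a two-sided normal critical value: a large-sample approximation
    (assumption), not an exact t-distribution quantile.
    """
    critical = statistics.NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return abs(two_sample_t(x1, x2)) > critical


random.seed(0)  # fixed seed so the run is repeatable
# Variance 4 corresponds to standard deviation 2.
x1 = [random.gauss(8.75, 2.0) for _ in range(1000)]
x2 = [random.gauss(9.0, 2.0) for _ in range(1000)]

for alpha in (0.05, 0.001):
    print(alpha, reject_equal_means(x1, x2, alpha))
```

The decision for a particular pair of random vectors depends on the draw, so a given run need not reproduce h = 1 and h = 0 exactly as reported above.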

ROC
Receiver Operating Characteristic
ROC is a measure of the class-discrimination capability of a specific feature. It measures the overlap between the pdfs describing the data distribution of the feature in two classes [Theo 09, Section 5.5].

Learning by Example
Example 4.5.1 [page 113]: We have two classes of one-dimensional Gaussian data with means m1 = 2 and m2 = 0. We plot the histograms of the two classes using plotHist, then compute the corresponding AUC value and plot the ROC curve using the function ROC.

We can also try different mean values:
[m1,m2] = [0,0]
[m1,m2] = [2,2]
[m1,m2] = [5,5]
[m1,m2] = [2,0]
[m1,m2] = [5,0]
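The effect of the class means on the AUC can be sketched in Python. This is not the ROC function used in the slides: the helper auc below is a hypothetical name, it estimates the area under the ROC curve as the fraction of cross-class pairs that are ranked correctly (ties counted as half), and the unit variance and sample size of 200 are assumptions. With this convention 0.5 means complete overlap and 1.0 means complete separation; other sources may report the area relative to the diagonal instead.

```python
import random


def auc(pos, neg):
    """Estimate the area under the ROC curve for one feature.

    Equals the probability that a random sample from `pos` is larger than
    a random sample from `neg`, with ties counted as one half.
    """
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


random.seed(0)  # fixed seed so the run is repeatable
for m1, m2 in [(0, 0), (2, 2), (5, 5), (2, 0), (5, 0)]:
    # Unit variance and 200 samples per class are assumptions.
    x1 = [random.gauss(m1, 1.0) for _ in range(200)]
    x2 = [random.gauss(m2, 1.0) for _ in range(200)]
    print((m1, m2), round(auc(x1, x2), 3))
```

When the two means coincide the classes overlap completely and the estimate stays near 0.5; as the means move apart it approaches 1.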

ROC Curve

AUC value

Plots
Histograms (plotHist) for [m1,m2] = [0,0], [2,2], [5,5]
ROC curve for [m1,m2] = [0,0]
ROC curve for [m1,m2] = [2,2]
ROC curve for [m1,m2] = [5,5]
Histogram (plotHist) for [m1,m2] = [2,0]
Histogram (plotHist) for [m1,m2] = [5,0]
ROC curve for [m1,m2] = [2,0]
ROC curve for [m1,m2] = [5,0]

Fisher's Discriminant Ratio

FDR
The FDR is commonly used to quantify the discriminatory power of individual features between two classes.
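For a single feature, the FDR is the squared difference of the two class means divided by the sum of the two class variances. A minimal Python sketch follows; the function names, the use of the sample variance, and the feature values are assumptions for illustration and are not the data from the slides.

```python
import statistics


def fdr(class1, class2):
    """Fisher's discriminant ratio of one feature for two classes:
    (m1 - m2)^2 / (var1 + var2), using sample variances (assumption)."""
    m1, m2 = statistics.fmean(class1), statistics.fmean(class2)
    v1, v2 = statistics.variance(class1), statistics.variance(class2)
    return (m1 - m2) ** 2 / (v1 + v2)


def rank_features(features1, features2):
    """Rank features by FDR, highest first.

    Both arguments map a feature name to that feature's values in one class.
    """
    scores = {name: fdr(features1[name], features2[name]) for name in features1}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical feature values for two classes (not taken from any table).
class_a = {"mean": [10.1, 10.4, 9.8, 10.2], "std": [2.0, 2.5, 1.8, 2.2]}
class_b = {"mean": [12.0, 12.3, 11.8, 12.1], "std": [2.1, 2.6, 1.9, 2.4]}

for name, score in rank_features(class_a, class_b):
    print(name, round(score, 4))
```

A feature scores highly when the class means are far apart relative to the spread within each class, which is why the ranking can be used to pick the most informative feature.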

Learning by Example
Example 4.6.2 [page 115]: In this case the data are given in Table 4.3. There are two classes, cirrhotic liver and fatty liver, each described by four features (mean, standard deviation, skewness, and kurtosis). The problem is to decide which feature is the most informative. We can use the FDR to rank the features and select the most informative one.

Result
According to the result, the feature with the highest FDR value is the mean, with FDR = 13.8893, so the most informative feature is the mean.

Thank you :)
