We discuss a few ways of doing feature selection in ML.
- Unsupervised methods
- Check the linear correlation among the features (without the label)
- Remove features that are redundant, based on a chosen correlation threshold
- Re-project features onto a smaller set of components (PCA); a sketch of both unsupervised approaches follows below
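A minimal sketch of both unsupervised ideas, assuming a pandas DataFrame of numeric features; the synthetic data and the |correlation| threshold of 0.9 are illustrative choices:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])
X["f5"] = 0.95 * X["f0"] + rng.normal(scale=0.05, size=200)   # deliberately redundant feature

# 1) Drop one of every pair of features whose |correlation| exceeds the threshold.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_pruned = X.drop(columns=to_drop)

# 2) Alternatively, re-project the features with PCA, keeping 95% of the variance.
X_pca = PCA(n_components=0.95).fit_transform(X)
print("dropped:", to_drop, "| PCA output shape:", X_pca.shape)
```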
- Supervised methods
- Filter method
- Univariate feature selection – selecting features based on univariate statistical tests (one feature at a time)
- Check the linear correlation of a feature with the target
- ANOVA f-test (numeric features – categorical target)
- f-test tells if means between populations (groups) are significantly different
- Relationship to t-test
- t-test tells if a single variable is significant
- f-test tells if a group of variables is significant
- Calculation
- Total sum of squares (SST) = sum of squares within groups (SSW) + sum of squares between groups (SSB)
- f-stat (f value) = variation between group means / variation within the group
- f-stat = explained variance / unexplained variance
- variation between group means (explained) = SSB / (K-1)
- K = number of groups
- K-1 = degrees of freedom of the numerator
- variation within the groups (unexplained) = SSW / (N-K)
- N = number of samples
- N-K = degrees of freedom of the denominator
- the f-stat follows F(K-1, N-K), the f-distribution with K-1 and N-K degrees of freedom
- this random variable is the ratio of two independent chi-square random variables, each divided by its degrees of freedom (K-1 and N-K respectively)
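A minimal sketch of the ANOVA f-test used as a univariate filter, using scikit-learn's `f_classif` with `SelectKBest` on the iris data; keeping `k=2` features is an arbitrary choice:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)              # numeric features, categorical target
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("F statistics:", selector.scores_)        # larger F => group means differ more across classes
print("kept feature indices:", selector.get_support(indices=True))
```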
- Kendall Tau Rank Correlation Coefficient
- Monotonic relationship & small sample size
- Spearman Rank Correlation Coefficient
- Monotonic relationship
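A minimal sketch of both rank correlations, using `scipy.stats.kendalltau` and `spearmanr` on a small, made-up monotonic sample:

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=30)                        # a small sample
y = x ** 3 + rng.normal(scale=0.1, size=30)    # monotonic but non-linear relationship with the target

tau, tau_p = kendalltau(x, y)
rho, rho_p = spearmanr(x, y)
print(f"Kendall tau = {tau:.2f} (p = {tau_p:.3g})")
print(f"Spearman rho = {rho:.2f} (p = {rho_p:.3g})")
```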
- Chi-squared (categorical features – categorical target)
- a statistical test of independence used to determine whether two categorical variables are dependent. It is sometimes compared with the coefficient of determination, R²; however, the chi-square test only applies to categorical (nominal) data, while R² only applies to numeric data.
- calculate chi-square statistics between every feature variable and the target variable and observe the existence of a relationship between the variables and the target
- Chi-squared distribution = distribution of a sum of squares of independent standard normal random variables
- Chi-squared statistic = sum over all cells of (observed frequency - expected frequency)^2 / expected frequency
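A minimal sketch of the chi-squared filter with scikit-learn's `chi2`, which expects non-negative (e.g., one-hot or count) features; the synthetic `color` feature and the 10% label noise are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
color = rng.choice(["red", "green", "blue"], size=200)
noise = rng.random(200) < 0.1
target = ((color == "red") ^ noise).astype(int)     # target mostly determined by color

# One-hot encode the categorical feature; chi2 requires non-negative inputs.
X = pd.get_dummies(pd.Series(color)).astype(float).to_numpy()
chi2_stats, p_values = chi2(X, target)
print("chi2 per one-hot column:", chi2_stats.round(2))
print("p-values:", p_values.round(4))
```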
- Mutual information
- I(X,Y) = expected value of log(P(X,Y)/(P(X)*P(Y))) over (X,Y)
- I(X,Y) = I(Y, X) = H(X) – H(X|Y) = H(Y) – H(Y|X)
- X is a feature and Y is the target. You can select the best feature X with the most mutual information (aka information gain).
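A minimal sketch of scoring each feature by its mutual information with the target, using scikit-learn's `mutual_info_classif` on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

data = load_iris()
mi = mutual_info_classif(data.data, data.target, random_state=0)   # one I(X_i; Y) estimate per feature
for name, score in zip(data.feature_names, mi):
    print(f"{name}: {score:.3f}")
```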
- Wrapper method – train and evaluate the model on different subsets of features, iterating to find the best-performing subset
- Forward Feature Selection
- Start with the single most important feature and keep adding one feature at a time
- Recursive Feature Elimination (a type of backward elimination)
- Start with all features and greedily remove the least important one at a time; a sketch of both wrapper approaches follows below
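A minimal sketch of both wrapper approaches with scikit-learn (`SequentialFeatureSelector` for forward selection, `RFE` for recursive elimination); the logistic-regression estimator and `n_features_to_select=2` are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Forward selection: start from no features and greedily add the best one at each step.
forward = SequentialFeatureSelector(model, n_features_to_select=2, direction="forward").fit(X, y)

# Recursive feature elimination: start from all features and greedily drop the weakest.
rfe = RFE(model, n_features_to_select=2).fit(X, y)

print("forward selection kept:", forward.get_support(indices=True))
print("RFE kept:", rfe.get_support(indices=True))
```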
- Embedded method
- Feature importance
- Some models expose a built-in feature importance; you can simply select the most important features based on those scores.
- L1 (Lasso) regularization
- Encourages sparsity by driving the coefficients of unimportant features to zero (sketch below)
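A minimal sketch of embedded selection: a random forest's built-in importances and Lasso coefficients on the diabetes data; `alpha=1.0` is an illustrative value (a larger alpha zeroes out more coefficients):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# Built-in feature importance from a tree ensemble.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print("tree importances:", np.round(forest.feature_importances_, 3))

# L1 regularization shrinks the coefficients of unimportant features to exactly zero.
lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_)             # indices of features with non-zero coefficients
print("features kept by Lasso:", kept)
```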