Machine Learning: Methods of Feature Selection

We discuss a few common approaches to feature selection in machine learning.

  • Unsupervised methods
    • Check the linear correlation among the features (without the label)
    • Remove redundant features whose correlation exceeds a chosen threshold
    • Re-project the features onto a lower-dimensional space (PCA) – see the first sketch after this outline
  • Supervised methods
    • Filter method
      • Univariate feature selection – selecting based on univariate statistical tests (one feature at a time)
        • Check the linear correlation of a feature with the target
          • ANOVA f-test (numeric features – categorical target) – see the ANOVA f-test sketch after this outline
            • f-test tells if means between populations (groups) are significantly different
            • Relationship to t-test
              • t-test tells if a single variable is significant
              • f-test tells if a group of variables is significant
            • Calculation
              • Total sum of squares (SST) = sum of squares within groups + sum of squares between groups
              • f-stat (f value) = variation between group means / variation within the groups
              • f-stat = explained variance / unexplained variance
                • variation between group means (explained) = sum of squares between groups / (K-1)
                  • K = number of groups
                  • K-1 = degrees of freedom
                • variation within the groups (unexplained) = sum of squares within groups / (N-K)
                  • N = number of samples
                  • N-K = degrees of freedom
              • F(x; K-1, N-K) – f-distribution with degrees of freedom K-1 and N-K
                • this random variable is the ratio of two independent chi-squared random variables (each divided by its degrees of freedom), with K-1 and N-K degrees of freedom respectively
        • Kendall Tau Rank Correlation Coefficient – see the rank-correlation sketch after this outline
          • Monotonic relationship & small sample size
        • Spearman Rank Correlation Coefficient
          • Monotonic relationship
        • Chi-squared (categorical features – categorical target)
          • a statistical test of independence between two categorical variables. It plays a role similar to the coefficient of determination, R²; however, the chi-squared test applies only to categorical (nominal) data while R² applies only to numeric data.
          • calculate the chi-squared statistic between every feature and the target and keep the features that show a significant relationship with the target – see the chi-squared sketch after this outline
          • Chi-squared distribution = distribution of a sum of squares of independent standard normal random variables
          • Chi-squared statistic = sum over all cells of (observed frequency – expected frequency)² / expected frequency
        • Mutual information
          • I(X,Y) = expected value of log(P(X,Y)/(P(X)*P(Y))) over (X,Y)
          • I(X,Y) = I(Y, X) = H(X) – H(X|Y) = H(Y) – H(Y|X)
          • X is a feature and Y is the target; select the features X with the highest mutual information with Y (also called information gain) – see the mutual-information sketch after this outline
    • Wrapper method – train and evaluate the model on different subsets of features, iterating to find the subset that performs best (see the wrapper-method sketch after this outline)
      • Forward Feature Selection
        • Start with the single best feature and keep adding the feature that most improves performance, one at a time
      • Recursive Feature Elimination (a type of backward elimination)
        • Start with all features and remove one at a time in a greedy way
    • Embedded method – the selection happens as part of model training (see the embedded-method sketch after this outline)
      • Feature importance
        • Some models (e.g., tree ensembles) expose built-in feature importances; select the most important features based on those scores
      • L1 (Lasso) regularization
        • Encourages sparsity by shrinking the coefficients of unimportant features to zero
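
Below is a minimal sketch of the unsupervised approach: dropping one feature from each highly correlated pair, then re-projecting with PCA. The 0.9 threshold, the synthetic data, and the choice of 3 components are illustrative assumptions, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])
X["f5"] = X["f0"] * 0.95 + rng.normal(scale=0.1, size=200)  # deliberately redundant

# 1) Drop one feature from each pair whose absolute correlation exceeds a threshold
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)

# 2) Re-project the remaining features onto principal components
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_reduced)
print("dropped:", to_drop)
print("explained variance ratio:", pca.explained_variance_ratio_)
```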
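
A minimal sketch of the univariate ANOVA f-test filter, using scikit-learn's f_classif with SelectKBest; the Iris data and k=2 are stand-in choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score each numeric feature against the categorical target with the ANOVA f-test
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print("f-statistics per feature:", selector.scores_)
print("selected feature indices:", selector.get_support(indices=True))
```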
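
A minimal sketch of ranking features by Kendall Tau and Spearman rank correlation against a numeric target, using scipy.stats; the synthetic monotonic relationship is an illustrative assumption.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] ** 3 + 0.1 * rng.normal(size=100)  # monotonic in feature 0 only

# Rank each feature by its monotonic association with the target
for j in range(X.shape[1]):
    tau, tau_p = kendalltau(X[:, j], y)
    rho, rho_p = spearmanr(X[:, j], y)
    print(f"feature {j}: kendall tau={tau:.2f} (p={tau_p:.3g}), "
          f"spearman rho={rho:.2f} (p={rho_p:.3g})")
```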
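
A minimal sketch of the chi-squared filter with scikit-learn's chi2, which expects non-negative (e.g. count or one-hot encoded) features; the synthetic count data and k=2 are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                        # categorical target
X = rng.poisson(lam=2.0, size=(200, 4)).astype(float)   # non-negative count features
X[:, 0] += 3 * y                                        # make feature 0 depend on the target

selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print("chi-squared statistics per feature:", selector.scores_)
print("selected feature indices:", selector.get_support(indices=True))
```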
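
A minimal sketch of ranking features by mutual information with the target, using scikit-learn's mutual_info_classif; the Iris data and k=2 are stand-in choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Estimate I(X_j, Y) for each feature and keep the two most informative ones
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_new = selector.fit_transform(X, y)

print("mutual information per feature:", selector.scores_)
print("selected feature indices:", selector.get_support(indices=True))
```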
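
A minimal sketch of the wrapper methods: forward selection with SequentialFeatureSelector and backward elimination with RFE, both wrapped around a logistic regression; the model, data set, and number of features kept are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Forward feature selection: start empty and add one feature at a time
forward = SequentialFeatureSelector(model, n_features_to_select=2, direction="forward")
forward.fit(X, y)
print("forward selection kept:", forward.get_support(indices=True))

# Recursive feature elimination: start with all features and greedily remove one at a time
rfe = RFE(model, n_features_to_select=2, step=1)
rfe.fit(X, y)
print("RFE kept:", rfe.get_support(indices=True))
```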
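
A minimal sketch of the embedded methods: built-in feature importances from a random forest, and L1 (lasso-style) regularization via SelectFromModel on an L1-penalized logistic regression; the models and regularization strength are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Built-in feature importances from a tree ensemble
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("feature importances:", forest.feature_importances_)

# L1 regularization shrinks unimportant coefficients to exactly zero;
# SelectFromModel keeps the features with non-zero coefficients
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
sparse = SelectFromModel(l1_model).fit(X, y)
print("L1 selection kept:", sparse.get_support(indices=True))
```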
