We discuss a few ways of doing feature selection in ML.

- Unsupervised methods
- Check the linear correlation among the features (without the label)
- Remove redundant features by applying a correlation threshold
- Re-project features (PCA)
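
The correlation-threshold idea and PCA can be sketched as follows; a minimal illustration on synthetic data (the threshold value and data are invented for this example):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 * 0.95 + rng.normal(scale=0.1, size=n)  # nearly duplicates x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Pairwise absolute correlations among features (no label involved)
corr = np.abs(np.corrcoef(X, rowvar=False))

# Drop the later feature of any pair whose |correlation| exceeds the threshold
threshold = 0.9
upper = np.triu(corr, k=1)
to_drop = [j for j in range(X.shape[1]) if (upper[:, j] > threshold).any()]
X_reduced = np.delete(X, to_drop, axis=1)

# Alternative: re-project onto principal components instead of dropping features
X_pca = PCA(n_components=2).fit_transform(X)
```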

- Supervised methods
- Filter method
- Univariate feature selection – selecting based on univariate statistical tests (one feature at a time)
- Check the linear correlation of a feature with the target
- ANOVA f-test (numeric features – categorical target)
- f-test tells if means between populations (groups) are significantly different
- Relationship to t-test
- t-test tells if a single variable is significant
- f-test tells if a group of variables is significant

- Calculation
- Total sum square (SST) = sum square within groups + sum square between groups
- f-stat (f value) = variation between group means / variation within the group
- f-stat = explained variance / unexplained variance
- variation between group means (explained) = sum square between groups / (K-1)
- K = number of groups
- K-1 = degree of freedom

- variation within the group (unexplained) = sum square within groups / (N-K)
- N = number of samples
- N-K = degree of freedom

- F(x; K-1,N-K) – f-distribution with degree of freedom of K-1, N-K
- this random variable is the ratio of two independent chi-square random variables, each divided by its degrees of freedom (K-1 and N-K respectively)
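
The calculation above can be verified numerically; a sketch on made-up groups, comparing the hand-computed F statistic against `scipy.stats.f_oneway`:

```python
import numpy as np
from scipy import stats

# Three groups: e.g. a numeric feature split by a categorical target (invented data)
g1 = np.array([2.0, 3.0, 4.0, 3.5])
g2 = np.array([5.0, 6.0, 5.5, 6.5])
g3 = np.array([8.0, 7.5, 9.0, 8.5])
groups = [g1, g2, g3]

N = sum(len(g) for g in groups)   # total number of samples
K = len(groups)                   # number of groups
grand_mean = np.concatenate(groups).mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (K - 1)   # explained variance
ms_within = ss_within / (N - K)     # unexplained variance
f_stat = ms_between / ms_within

# scipy's one-way ANOVA should produce the same statistic
f_scipy, p_value = stats.f_oneway(g1, g2, g3)
```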

- Kendall Tau Rank Correlation Coefficient
- Monotonic relationship & small sample size

- Spearman Rank Correlation Coefficient
- Monotonic relationship
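
Both rank correlations are available in scipy; a small illustration on a monotonic but non-linear relationship (the data is invented):

```python
import numpy as np
from scipy import stats

# y grows monotonically with x, but not linearly; the ranks agree perfectly
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3

tau, tau_p = stats.kendalltau(x, y)   # Kendall tau: robust for small samples
rho, rho_p = stats.spearmanr(x, y)    # Spearman: Pearson correlation of the ranks
```

Pearson correlation of `x` and `y` would be below 1 here, but both rank coefficients equal 1 because the relationship is perfectly monotonic.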

- Chi-squared (categorical features – categorical target)
- a statistical test of independence between two categorical variables. It plays a role similar to the coefficient of determination, R²; however, the chi-square test applies only to categorical (nominal) data, while R² applies only to numeric data.
- calculate the chi-square statistic between each feature variable and the target variable and check whether a relationship exists between them
- Chi-squared distribution = distribution of a sum of squares of independent standard normal random variables
- Chi-squared statistic = sum over cells of (observed frequency – expected frequency)^2 / expected frequency
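
A sketch of the statistic on an invented 2x2 contingency table, comparing `scipy.stats.chi2_contingency` with the hand-computed sum:

```python
import numpy as np
from scipy import stats

# Contingency table: rows = feature categories, columns = target classes (invented counts)
observed = np.array([[30, 10],
                     [10, 30]])

# correction=False disables Yates' continuity correction so the manual formula matches
chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)

# Same statistic by hand: sum over cells of (O - E)^2 / E
chi2_manual = ((observed - expected) ** 2 / expected).sum()
```

A small p-value here leads to rejecting independence, i.e. the feature carries information about the target.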

- Mutual information
- I(X,Y) = expected value of log(P(X,Y)/(P(X)*P(Y))) over (X,Y)
- I(X,Y) = I(Y, X) = H(X) – H(X|Y) = H(Y) – H(Y|X)
- X is a feature and Y is the target. You can select the best feature X with the most mutual information (aka information gain).
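
A small numeric check of these identities on an invented joint distribution of a binary feature and a binary target:

```python
import numpy as np

# Joint distribution P(X, Y); rows index X, columns index Y (invented probabilities)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)   # marginal P(X)
p_y = p_xy.sum(axis=0)   # marginal P(Y)

# I(X; Y) = sum over (x, y) of P(x, y) * log(P(x, y) / (P(x) P(y)))
mi = sum(
    p_xy[i, j] * np.log(p_xy[i, j] / (p_x[i] * p_y[j]))
    for i in range(2) for j in range(2)
)

# Equivalent form: I(X; Y) = H(Y) - H(Y | X)
h_y = -sum(p * np.log(p) for p in p_y)
h_y_given_x = -sum(
    p_xy[i, j] * np.log(p_xy[i, j] / p_x[i])
    for i in range(2) for j in range(2)
)
```

Ranking features by `mi` against the target and keeping the top ones is the univariate selection step.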

- Wrapper method – train the model on different subsets of features and iterate, evaluating performance each time
- Forward Feature Selection
- Start with no features and keep adding the most useful one at a time

- Recursive Feature Elimination (a type of backward elimination)
- Start with all features and remove one at a time in a greedy way
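
Both wrapper strategies are available in scikit-learn; a sketch on synthetic data (the estimator and parameters are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# 5 features, only 3 informative
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3,
    n_redundant=0, n_repeated=0, shuffle=False, random_state=0,
)
model = LogisticRegression(max_iter=1000)

# Forward selection: start empty, greedily add the feature that helps most
forward = SequentialFeatureSelector(model, n_features_to_select=3, direction="forward")
forward.fit(X, y)

# Recursive feature elimination: start with all features, greedily drop the weakest
rfe = RFE(model, n_features_to_select=3, step=1)
rfe.fit(X, y)
```

Wrapper methods are more expensive than filter methods because the model is retrained at every step, but they account for feature interactions.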

- Embedded method
- Feature importance
- Some models have feature importance built in. You can simply select the most important features based on it.
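
For example, tree ensembles in scikit-learn expose `feature_importances_`; a sketch on synthetic data where only the first two features are informative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 6 features, only 2 informative (placed first because shuffle=False)
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=2,
    n_redundant=0, n_repeated=0, shuffle=False, random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Rank features by the model's built-in importance scores (they sum to 1)
ranking = np.argsort(forest.feature_importances_)[::-1]
```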

- L1 (Lasso) regularization
- Encourages sparsity by zeroing out unimportant features
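
A minimal illustration of that sparsity: with an L1 penalty, the coefficients of irrelevant features are driven to exactly zero (the data and `alpha` value are invented for this example):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
# Target depends only on the first two features; the other two are noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# L1 regularization: soft-thresholding zeroes out the noise features' weights
model = Lasso(alpha=0.1)
model.fit(X, y)
coef = model.coef_
```

Selecting the features with non-zero coefficients then acts as the embedded selection step.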
