# Machine Learning: Methods of Feature Selection

We discuss a few ways of doing feature selection in ML.

• Unsupervised methods
  • Check the linear correlation among the features (without the label)
  • Remove redundant features by setting a correlation threshold
  • Re-project the features (e.g., PCA)
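As a minimal sketch of the correlation-threshold idea (pure Python; the function and variable names are ours, not from any library), we can greedily keep a feature only if it is not too correlated with any feature already kept:

```python
import math

def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_redundant(features, threshold=0.9):
    """Greedily keep a feature only if its |correlation| with every
    already-kept feature is below the threshold."""
    kept = []
    for name, col in features.items():
        if all(abs(pearson(col, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy example: f2 is a linear copy of f1, so it is dropped as redundant.
features = {
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 10],   # perfectly correlated with f1
    "f3": [5, 3, 8, 1, 9],    # weakly related to f1
}
print(drop_redundant(features))  # -> ['f1', 'f3']
```

In practice you would compute the full correlation matrix (e.g., pandas `DataFrame.corr()`) rather than pairwise loops.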
• Supervised methods
  • Filter methods
    • Univariate feature selection – select features based on univariate statistical tests (one feature at a time)
      • Check the linear correlation of each feature with the target
      • ANOVA f-test (numeric features – categorical target)
        • The f-test tells whether the means of the populations (groups) are significantly different
        • Relationship to the t-test
          • A t-test tells whether a single variable is significant
          • An f-test tells whether a group of variables is significant
        • Calculation
          • Total sum of squares (SST) = sum of squares within groups + sum of squares between groups
          • f-stat (f-value) = variation between group means / variation within the groups
          • f-stat = explained variance / unexplained variance
          • variation between group means (explained) = sum of squares between groups / (K-1)
            • K = number of groups
            • K-1 = degrees of freedom
          • variation within the groups (unexplained) = sum of squares within groups / (N-K)
            • N = number of samples
            • N-K = degrees of freedom
          • F(x; K-1, N-K) – f-distribution with degrees of freedom K-1 and N-K
            • This random variable is the ratio of two independent chi-squared random variables with K-1 and N-K degrees of freedom, respectively
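The calculation above can be sketched directly in pure Python (names are ours; scikit-learn's `f_classif` implements the same test):

```python
def f_stat(groups):
    """One-way ANOVA F statistic: explained / unexplained variance.
    `groups` holds the numeric feature values, split by target class."""
    k = len(groups)                    # K = number of groups
    n = sum(len(g) for g in groups)    # N = total number of samples
    grand = sum(sum(g) for g in groups) / n
    # sum of squares between groups (explained)
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # sum of squares within groups (unexplained)
    ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))

# One numeric feature split by a 3-class categorical target:
a = [1.0, 2.0, 3.0]
b = [2.0, 3.0, 4.0]
c = [8.0, 9.0, 10.0]
print(f_stat([a, b, c]))  # large F: group means differ a lot
```

A large F (relative to the F(K-1, N-K) distribution) means the feature separates the target classes well.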
      • Kendall Tau rank correlation coefficient
        • Captures monotonic relationships; suited to small sample sizes
      • Spearman rank correlation coefficient
        • Captures monotonic relationships
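A minimal Kendall Tau sketch (tau-a, assuming no ties; `scipy.stats.kendalltau` and `scipy.stats.spearmanr` are the production versions):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs, no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs

# A monotonic but non-linear relationship still scores a perfect 1.0:
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]   # y = x**3
print(kendall_tau(x, y))  # -> 1.0
```

This is why rank correlations suit monotonic relationships: only the ordering matters, not linearity.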
      • Chi-squared test (categorical features – categorical target)
        • A statistical test of independence between two categorical variables. It plays a role similar to the coefficient of determination, R²; however, the chi-squared test applies only to categorical (nominal) data, while R² applies only to numeric data.
        • Compute the chi-squared statistic between every feature variable and the target variable to check whether a relationship exists between them
        • The chi-squared distribution is the distribution of a sum of squares of independent standard normal random variables
        • Chi-squared statistic = sum over cells of (observed frequency – expected frequency)² / expected frequency
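The statistic can be sketched from a contingency table in pure Python (names are ours; `scipy.stats.chi2_contingency` and scikit-learn's `chi2` are the library equivalents):

```python
def chi2_stat(table):
    """Chi-squared statistic: sum of (observed - expected)^2 / expected,
    with expected counts derived from the row/column marginals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Contingency table of a categorical feature (rows) vs. the target (columns):
table = [[10, 20],
         [20, 10]]
print(chi2_stat(table))  # large value suggests feature and target are dependent
```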
      • Mutual information
        • I(X, Y) = expected value of log(P(X, Y) / (P(X) P(Y))) over (X, Y)
        • I(X, Y) = I(Y, X) = H(X) – H(X|Y) = H(Y) – H(Y|X)
        • X is a feature and Y is the target; select the features X with the highest mutual information (a.k.a. information gain)
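The expectation above translates directly into a pure-Python estimate from joint counts (names are ours; scikit-learn's `mutual_info_classif` is the library version):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) = sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) p(y))),
    estimated from empirical frequencies, in nats."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# X fully determines Y here, so I(X; Y) equals the entropy of Y:
xs = ["a", "a", "b", "b"]
ys = [0, 0, 1, 1]
print(mutual_information(xs, ys))  # -> log(2) ≈ 0.693 nats
```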
  • Wrapper methods – run the model on a subset of features and iterate to evaluate the performance
    • Forward feature selection
    • Recursive feature elimination (a type of backward elimination)
      • Start with all features and remove one at a time in a greedy way
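The greedy elimination loop can be sketched generically (names are ours; the fixed-score `importance` below is a toy stand-in, whereas a real wrapper, such as scikit-learn's `RFE`, refits the model on each subset):

```python
def rfe(features, importance, n_keep):
    """Recursive feature elimination: repeatedly drop the least important
    feature until n_keep remain. `importance` maps the current feature
    subset to a {name: score} dict, e.g. by refitting a model and reading
    its coefficients."""
    remaining = list(features)
    while len(remaining) > n_keep:
        scores = importance(remaining)
        remaining.remove(min(remaining, key=scores.get))
    return remaining

# Toy importance: a fixed score per feature, purely for illustration.
fixed = {"f1": 0.9, "f2": 0.1, "f3": 0.5}
selected = rfe(fixed, lambda subset: {f: fixed[f] for f in subset}, n_keep=2)
print(selected)  # -> ['f1', 'f3']
```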
  • Embedded methods
    • Feature importance
      • Some models have feature importance built in; you can select the most important features based on it
    • L1 (Lasso) regularization
      • Encourages sparsity by zeroing out unimportant features
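Why L1 produces exact zeros can be illustrated with the soft-thresholding operator, the core update inside coordinate-descent Lasso solvers (the coefficients below are invented for illustration; scikit-learn's `Lasso` does the full fit):

```python
def soft_threshold(z, alpha):
    """The L1 proximal operator: shrinks z toward 0 and clips small
    values to exactly 0 - this is what makes Lasso weights sparse."""
    if z > alpha:
        return z - alpha
    if z < -alpha:
        return z + alpha
    return 0.0

# Hypothetical unregularized coefficients; L1 zeroes out the weak ones:
coefs = {"f1": 2.5, "f2": 0.3, "f3": -1.8, "f4": -0.1}
alpha = 0.5
shrunk = {f: soft_threshold(c, alpha) for f, c in coefs.items()}
print([f for f, c in shrunk.items() if c != 0.0])  # -> ['f1', 'f3']
```

Features whose coefficients survive the shrinkage are the ones the embedded method selects.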