## Problems of high dimension

- Many ML algorithms rely on distances to compute similarity between samples
- Curse of dimensionality – as the number of dimensions increases, the distances between any two samples in the dataset tend to become more alike (more concentrated), which makes them hard to differentiate. Every point ends up roughly equidistant from every other point, which makes it harder to group/cluster the samples
- Risk of overfitting – the model might fit irrelevant features just because of some noise in the training set, even though that relationship is not present in the validation/test set
- The irrelevant features can also add noise during the prediction phase

- Resource waste – more features mean more columns of data to store/process/maintain
- Performance – the extra features take more time to process, which matters for an online system
- Redundant features can cause problems with collinearity
- which is harmful to some models, like linear regression
- it’s hard for the model to be confident in the coefficients of two highly correlated features, which results in high p-values for both coefficients

- The feature space grows exponentially with each feature added, which requires a lot of data to train on
- The feature space becomes sparse

- The Hughes effect – as the number of dimensions increases, it’s harder to find the correct hypothesis (because the hypothesis space is larger), and more examples are needed to generalize
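The distance-concentration effect above is easy to see empirically. A rough numpy sketch (illustrative only; the sample sizes and dimensions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_distances(x):
    """Euclidean distances between all rows of x, via the squared-norm identity."""
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.sqrt(np.maximum(d2, 0.0))  # clamp tiny negative rounding errors

# As dimensionality grows, the gap between the nearest and farthest pair
# shrinks relative to the distances themselves (distance concentration).
contrasts = []
for d in [2, 100, 10_000]:
    points = rng.random((200, d))  # 200 uniform samples in [0, 1]^d
    dists = pairwise_distances(points)[np.triu_indices(200, k=1)]
    contrast = (dists.max() - dists.min()) / dists.min()
    contrasts.append(contrast)
    print(f"d={d:>6}: (max - min) / min distance = {contrast:.3f}")
```

The printed contrast drops sharply with dimension: in high dimensions, the nearest and farthest neighbors are almost the same distance away.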

### Algorithms

- Latent Semantic Indexing/Analysis (LSI/LSA)
- Unsupervised
- Singular Value Decomposition (SVD)
- Decomposes a (possibly non-square) matrix: A = UΣVᵀ
- Reconstruct the original matrix with fewer dimensions by keeping only the top singular values
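A minimal numpy sketch of truncated-SVD reconstruction, as used in LSA (the mock term–document matrix and rank are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Mock term-document matrix (50 terms x 20 documents): low-rank structure plus noise
A = rng.random((50, 3)) @ rng.random((3, 20)) + 0.01 * rng.random((50, 20))

# Full SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top-k singular values/vectors (the "latent topics" in LSA)
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The rank-k reconstruction is close to the original despite storing far fewer numbers
err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print(f"relative reconstruction error with k={k}: {err:.4f}")
```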

- Principal Component Analysis (PCA)
- Uses eigenvalue decomposition, which applies only to square matrices
- PCA applies it to the data’s covariance matrix, which is always square, so the data matrix itself need not be

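A bare-bones PCA via covariance eigendecomposition in numpy (the data shape and redundant-feature setup are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 samples x 5 features: the data matrix is not square, only its covariance is
X = rng.normal(size=(200, 5))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]  # make feature 1 mostly redundant

# 1. Center the data
Xc = X - X.mean(axis=0)
# 2. Eigendecompose the (square, symmetric) covariance matrix
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # sort descending by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 3. Project onto the top-k principal components
k = 2
Z = Xc @ eigvecs[:, :k]

explained = eigvals[:k].sum() / eigvals.sum()
print(f"variance explained by {k} components: {explained:.2%}")
```

Because feature 1 is nearly a multiple of feature 0, the top components capture most of the variance.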
- Independent Component Analysis (ICA – paper)
- Separates a multivariate signal into additive components that are maximally independent
- Used for separating super-imposed signals
- Difference with PCA
- PCA – looks for uncorrelated factors
- ICA – looks for independent factors (independence is stronger than uncorrelatedness: one factor carries no information about another)
- Does not result in orthogonal components the way PCA does

- Removes higher-order dependence, not just second-order (linear) correlation
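The uncorrelated-vs-independent distinction can be shown in a few lines of numpy (a constructed toy example, not from the post): two variables with zero correlation that are nevertheless fully dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100_000)
y = x ** 2  # fully determined by x, so clearly *dependent*

# Pearson correlation is ~0, so PCA's notion of "uncorrelated" is satisfied...
corr = np.corrcoef(x, y)[0, 1]
print(f"correlation(x, y) = {corr:.4f}")

# ...but a higher-order statistic exposes the dependence ICA cares about:
# for truly independent variables, E[x^2 * y] would equal E[x^2] * E[y].
lhs = np.mean(x ** 2 * y)
rhs = np.mean(x ** 2) * np.mean(y)
print(f"E[x^2 y] = {lhs:.3f}  vs  E[x^2] E[y] = {rhs:.3f}")
```

PCA would treat x and y as already "done"; ICA's objective is sensitive to exactly this kind of higher-order dependence.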

- Non-negative Matrix Factorization (NMF)
- All entries of the data matrix need to be non-negative
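A minimal NMF sketch using the Lee–Seung multiplicative updates (an assumption on my part; the post doesn’t specify an algorithm, and in practice a library implementation like scikit-learn’s NMF would be used):

```python
import numpy as np

rng = np.random.default_rng(0)
# Non-negative data matrix (e.g. word counts): 30 samples x 10 features
V = rng.random((30, 10))

k = 4                      # number of latent components (illustrative choice)
W = rng.random((30, k))    # initialize both factors non-negative
H = rng.random((k, 10))

# Multiplicative updates: ratios of non-negative terms keep W and H non-negative
eps = 1e-9  # avoid division by zero
for _ in range(500):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative approximation error: {err:.3f}")
```

Unlike SVD/PCA, the resulting factors contain no negative entries, which often makes them easier to interpret as additive parts.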

- Latent Dirichlet Allocation (LDA)

### Other dimension reduction techniques

- Quantization
- This is a technique rather than a specific algorithm
- Benefit
- Lower CPU/GPU/memory/power consumption
- Faster inference (lower latency)
- Makes it possible to run on mobile/embedded devices

- Disadvantage
- Might lose precision
- Values might be saturated/clipped if the quantization range is not set up correctly

- Quantize the input
- Ex: reduce a color image to a grayscale or even black-and-white image.
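A quick numpy sketch of that input quantization (using the common luminance weights; the tiny mock image is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
# A mock 4x4 RGB image with 8-bit channels
rgb = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

# Collapse 3 channels to 1 with standard luminance weights (a 3x size reduction)
gray = (0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]).astype(np.uint8)

# Quantize further to black and white: 1 bit per pixel instead of 24
bw = (gray >= 128).astype(np.uint8)
print(gray.shape, bw.shape)
```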

- Quantize model weights
- Use lower-precision weights and operations (e.g. float32 -> 8-bit integer)
- Integer operations are more widely supported on embedded devices
- from: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
- Type of model quantization
- Quantize the model post-training (supported in TensorFlow Lite)
- Very simple to apply

- Quantization-aware training (in TensorFlow)
- Need to add fake quantization nodes during training (see the two “quant” nodes below)

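A toy sketch of the affine (scale + zero point) quantization scheme behind these tools, including the saturation problem mentioned above (the ranges and sizes are illustrative assumptions, not TensorFlow Lite’s actual implementation):

```python
import numpy as np

def quantize(x, xmin, xmax, bits=8):
    """Affine quantization: map [xmin, xmax] onto the integer grid [0, 2^bits - 1]."""
    levels = 2 ** bits - 1
    scale = (xmax - xmin) / levels
    q = np.clip(np.round((x - xmin) / scale), 0, levels).astype(np.uint8)
    return q, scale

def dequantize(q, xmin, scale):
    return q.astype(np.float32) * scale + xmin

rng = np.random.default_rng(0)
w = rng.normal(0, 1, 1000).astype(np.float32)  # mock float32 weights

# Range chosen from the data: only small rounding error remains
q, scale = quantize(w, w.min(), w.max())
err_ok = np.abs(dequantize(q, w.min(), scale) - w).max()

# Range set too narrow: values outside [-0.5, 0.5] get saturated/clipped
q_bad, scale_bad = quantize(w, -0.5, 0.5)
err_bad = np.abs(dequantize(q_bad, -0.5, scale_bad) - w).max()
print(f"max error, correct range: {err_ok:.4f}; clipped range: {err_bad:.4f}")
```

With a well-chosen range the worst-case error is about half a quantization step; with a too-narrow range the tails of the weight distribution are clipped and the error explodes.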

- Pruning
- Cutting out nodes/edges in a neural network
- Reduces the number of parameters (smaller model to store)
- Reduces the number of edges/connections (fewer operations to process)
- Acts as a form of regularization
- Weight pruning based on magnitude during training
- Slowly ramp up pruning on a schedule until the desired sparsity is reached
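Magnitude pruning with a ramped schedule can be sketched in a few lines of numpy (the sparsity targets and layer shape are illustrative assumptions):

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity` fraction are zero."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))  # mock weight matrix of one layer

# Ramp sparsity up gradually, as a pruning schedule during training would
for target in [0.25, 0.5, 0.9]:
    w = prune_by_magnitude(w, target)
    print(f"target sparsity {target:.0%}: actual {(w == 0).mean():.2%}")
```

In a real training loop each pruning step would be followed by more training so the surviving weights can compensate, and the zeroed weights would be stored in a sparse format to realize the size savings.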

- Lottery ticket hypothesis
- An over-parameterized network contains a subnetwork that performs as well as the entire network
- Observation: pruning + fine-tuning can perform worse than retraining the pruned subnetwork from its original initialization weights
- There is a subnetwork that “won the lottery”

- Sparse tensors in TensorFlow

### The unavoidable

Sometimes you just have to use a large model to squeeze out every bit of performance. When the model is too large to fit on a single machine, you might need distributed training (Training large models – distributed training)
