This article presents some of the most common mistakes people make when applying statistics to real-world problems.
Mistakes vs Errors
First I would like to distinguish between statistical errors and mistakes.
- An error is a more technical term that describes the difference between an actual value and an expected value. It is a natural consequence of the process itself, whether the source is noise in the data, the model, or the procedure. Examples: type I and type II errors (see the sketch after this list).
- A mistake is usually the result of a human doing something wrong in the process. It can typically be avoided by being more careful.
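To see why errors are unavoidable rather than someone's fault, consider that even a perfectly executed hypothesis test rejects a true null hypothesis at roughly its significance level. Below is a minimal simulation sketch; the sample size, number of trials, and alpha are arbitrary choices for illustration, not values from any real study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05            # significance level (arbitrary choice)
n_trials, n = 10_000, 30

# Both samples come from the same distribution, so the null hypothesis is true
# in every trial; any rejection is a type I error.
false_positives = 0
for _ in range(n_trials):
    a = rng.normal(0, 1, n)
    b = rng.normal(0, 1, n)
    _, p_value = stats.ttest_ind(a, b)
    false_positives += p_value < alpha

# The observed type I error rate hovers around alpha even though the procedure
# was carried out correctly: an error, not a mistake.
print(f"Observed type I error rate: {false_positives / n_trials:.3f}")
```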
Data Collection/Processing
- You accidentally mixed up the labels of some data.
- As a mistake: you mapped “M” to male in the gender column and assumed everything else is female. In reality, there might be other categories (for example, transgender or non-binary), or maybe a lowercase “m” is also used for male (see the first sketch after this list).
- As an error: in a sentiment analysis setting, what one annotator labels as positive sentiment might not be consistent with another annotator's judgment.
- You did not read the documentation of the data set and misunderstood the meaning of certain features. This can easily happen when working with a new or unfamiliar data set.
- You accidentally included features that unintentionally “leaked” label information into the inputs. This is a big problem and can go undetected depending on how severe it is. For example, you are trying to predict a person's nationality from their behavior, but you accidentally included the person's name in the data set (the second sketch after this list shows a cheap check for this).
- Your data pipeline could have very subtle bugs, especially when you need to reconstruct your features from multiple data sources. Maintaining data consistency across multiple sources is tricky. For example, you might have a transaction log with a customer id, which you join against a customer table to get the customer address. Think about what happens when the customer address is updated and you rejoin the data: you might be changing some features without knowing it (the third sketch after this list shows a point-in-time join that avoids this).
- You did not look out for bias in the data set. For example, survivorship bias is when you pull data only from stocks that are on the market today to predict stock prices, ignoring the fact that companies that went bankrupt are not in the training data at all.
- You failed to collect complete data. For example, you train a model to classify images of cats into 10 different breeds. After the model is deployed, you find that the accuracy is much lower than in training because users submit many breeds outside of the 10 you trained on.
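For the label-mapping mistake above, simply inspecting the raw values before encoding them would catch unexpected categories. A minimal sketch, assuming a pandas DataFrame with a hypothetical gender column and toy values:

```python
import pandas as pd

df = pd.DataFrame({"gender": ["M", "F", "m", "F", "X", None, "M"]})  # toy data

# Look at every distinct value (including missing ones) before assuming
# "M" vs. everything else.
print(df["gender"].value_counts(dropna=False))

# Map explicitly and keep anything unexpected as "unknown" instead of
# silently lumping it into "female".
mapping = {"M": "male", "m": "male", "F": "female"}
df["gender_clean"] = df["gender"].map(mapping).fillna("unknown")
print(df["gender_clean"].value_counts())
```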
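For the leakage problem, one cheap sanity check is to score each feature on its own: a single column that predicts the label almost perfectly (like the person's name in the nationality example) deserves a closer look. A minimal sketch; the threshold and the one-feature decision tree are illustrative assumptions, not a standard recipe.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def leaky_feature_report(X: pd.DataFrame, y: pd.Series, threshold: float = 0.95):
    """Flag features whose solo cross-validated accuracy looks too good to be true."""
    for col in X.columns:
        values = X[col]
        if not pd.api.types.is_numeric_dtype(values):
            # Crude integer encoding for non-numeric columns, just for this check.
            values = pd.Series(pd.factorize(values)[0], index=values.index)
        score = cross_val_score(DecisionTreeClassifier(max_depth=3),
                                values.to_frame(), y, cv=5).mean()
        flag = "  <-- possible leakage" if score > threshold else ""
        print(f"{col:>20} solo accuracy = {score:.3f}{flag}")
```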
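For the join-consistency bug, a point-in-time (“as of”) join keeps each transaction paired with the customer record that was valid when the transaction happened, rather than whatever the latest record is. A minimal sketch with pandas merge_asof; the table layout and column names are hypothetical.

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1],
    "txn_time": pd.to_datetime(["2023-01-10", "2023-06-10"]),
})
# Each row records the address as of the time it became effective.
addresses = pd.DataFrame({
    "customer_id": [1, 1],
    "effective_from": pd.to_datetime(["2022-12-01", "2023-05-01"]),
    "address": ["old street", "new street"],
})

# merge_asof requires both frames to be sorted on the time keys.
joined = pd.merge_asof(
    transactions.sort_values("txn_time"),
    addresses.sort_values("effective_from"),
    left_on="txn_time",
    right_on="effective_from",
    by="customer_id",
    direction="backward",  # most recent address at or before the transaction
)
print(joined[["customer_id", "txn_time", "address"]])
```

Rejoining with only the latest address would silently rewrite the features of the older transaction; the as-of join keeps them stable.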
Modeling
- You did not split the data into training, validation, and test sets. The validation set is used for hyper-parameter tuning, while a held-out test set is critical for detecting overfitting/underfitting (see the first sketch after this list).
- You did not evaluate the model with the right metric. For example, on a class-imbalanced data set, you used accuracy instead of more robust tools such as the confusion matrix, F1-score, or ROC AUC (second sketch after this list).
- You jumped straight to a complex deep learning model without examining a simpler baseline model first. You might waste a lot of time going in the wrong direction if you don't start simple (third sketch after this list).
- Sometimes even a rule-based model can be better than any statistical model once you factor in all the costs.
- You jumped straight to “your favorite model” without exploring other possible models. Often, there are better approaches out there.
- You used the fanciest model for the sake of showing off the latest technology.
- You did not solve the underlying business problem for the user. This often happens when the metrics you chose to optimize/track are not a good proxy for the user's success.
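For the splitting point above, a common pattern is two calls to scikit-learn's train_test_split: one to carve out a held-out test set that is only touched at the very end, and one to split the remainder into training and validation. A minimal sketch; the 60/20/20 ratio and the synthetic data are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # synthetic stand-in data

# Hold out 20% as a test set, used exactly once for the final evaluation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Split the remaining 80% into 60% training / 20% validation (hyper-parameter tuning).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```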
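To see why accuracy misleads on an imbalanced data set, compare it with the other metrics for a classifier that never predicts the minority class. A minimal sketch with toy labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score

# 95% negatives, 5% positives; the "model" predicts the majority class for everyone.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)
y_score = np.full(len(y_true), 0.1)  # constant scores carry no ranking information

print("accuracy:", accuracy_score(y_true, y_pred))   # 0.95, looks great
print("F1:      ", f1_score(y_true, y_pred))         # 0.0, tells the real story
print("ROC AUC: ", roc_auc_score(y_true, y_score))   # 0.5, no better than chance
print(confusion_matrix(y_true, y_pred))              # every positive is missed
```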
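A cheap way to anchor expectations before reaching for deep learning is to score a trivial baseline first; anything fancier has to beat it. A minimal sketch using scikit-learn's DummyClassifier and a plain logistic regression on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)  # synthetic stand-in data

baselines = [
    ("most frequent class", DummyClassifier(strategy="most_frequent")),
    ("logistic regression", LogisticRegression(max_iter=1000)),
]

# Report cross-validated accuracy for each baseline.
for name, model in baselines:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:20s} cv accuracy = {score:.3f}")
```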
Deployment
- You did not check for training/serving skew. There is a good chance that the data you collected for the training environment does not have the same distribution as the data in production (see the sketch after this list). There can be a few reasons:
- Data drift over time: the training data was collected a while back and the label distribution has since shifted
- There are multiple data sources, and they might not be represented in the same proportions in production
- The model code that is used in training might not be the same as the model code used in production
- For example, your training code is in Python while the production code is in C++/Java
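A simple first check for this skew is to compare each feature's training distribution against what the model actually sees in production, for example with a two-sample Kolmogorov-Smirnov test. A minimal sketch; the feature names, toy data, and threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

def drift_report(train: pd.DataFrame, serving: pd.DataFrame, alpha: float = 0.01):
    """Compare training vs. serving distributions of numeric features with a KS test."""
    for col in train.columns:
        if not pd.api.types.is_numeric_dtype(train[col]):
            continue  # categorical features need a different test (e.g. chi-squared)
        stat, p_value = stats.ks_2samp(train[col].dropna(), serving[col].dropna())
        flag = "  <-- possible distribution shift" if p_value < alpha else ""
        print(f"{col:>10} KS={stat:.3f} p={p_value:.4f}{flag}")

# Toy example: the serving data has drifted upward on one feature.
rng = np.random.default_rng(0)
train = pd.DataFrame({"age": rng.normal(40, 10, 5000),
                      "income": rng.normal(50, 5, 5000)})
serving = pd.DataFrame({"age": rng.normal(48, 10, 5000),
                        "income": rng.normal(50, 5, 5000)})
drift_report(train, serving)
```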
Conclusion
I have only listed a small set of mistakes and errors here, but I would say the key is to always keep a skeptical mind. I know I have made many of these common mistakes myself, so I constantly look for ways to challenge my results.