Most common classification models in data science interviews

Sloth, Hanging, Tree, Sleepy, Illustrator, Animal

If you are preparing for data science interviews, you will likely encounter one of these models.

    • Nearest Centroid
    • K-Nearest Neighbors
  • Naive Bayes
  • Decision Tree, Random Forest, and Boosted Tree
  • Support Vector Machine
  • Neural Network

Nearest Centroid

The nearest centroid is one of the most simple model, and it works as the follow

  • Compute the mean (centroid) of all the sample for each label
  • Predict the class by the identifying the centroid that is closest of the query point in the feature space
Ex: there are 2 classes (Green and Red) in a 2-D feature space, the query point is classified as Green because it’s closer to the centroid of the green cluster


  • Most intuitive model
  • Fast to train and evaluate


  • Only works for very simple data where data is easily distinguishable

Chances of encountering in data science interviews: ★★

  • Not because it’s too simple, but because people often have forgotten about it.

K-Nearest Neighbors

The K-nearest neighbor requires no training. It just predict the classification of the query by gathering the k nearing point in the feature space, and decide the classification by voting among these points.

Ex: there are 2 classes (Green and Red) in a 2-D feature space, the query point is classified as Green in the 3-Nearest Neighbor model because the Green is the majority out of the 3 nearest neighbors.


  • Simple to understand
  • Takes zero time to train


  • Can be slow during prediction when the number of training example is large
  • Performance tends to weaken as the number of feature increases

Chance of encountering in data science interviews: ★★★

  • It can often be brought up as the very first solution when working out a test case.

Naive Bayes

The Naive Bayes classifier involves some probability theory from the Bayes Theorem.

We first start by stating that each feature might have some correlation with the label that we are trying to predict. This correlation is bi-directional, which means

  • Observing the feature can inform you something about the probability distribution of the label
  • Vice versa, observing the label can also inform you something about the probability of the feature

The Naive Bayes classifier makes a naive assumption that each feature influences the label independently and thus we can multiply conditional probability of the label on each feature to get the conditional probability of the label on all feature jointly.


We would like to classified a person as male (class 0) or female (class 1) based on the 2 features

  • Feature 1: whether the person wears a dress
  • Feature 2: whether the person wears a hat
Label (y) Feature 1 (wears a dress) Feature 2 (wears a hat)
0 False False
1 True True
0 False True
1 True True
1 False False

Without making any assumption, if we are given a person who wears a dress and also wears a hat, we can make prediction about the label of the person by comparing

  • A1) P(y=0 | Feature 1 is True AND Feature 2 is True)
  • A2) P(y=1 | Feature 1 is True AND Feature 2 is True)

If (A1) > (A2), we classify the person as male (y=0), else as female (y=1)

Instead of doing the above comparison, we first apply Bayes Theorem to express the probability in an invert form:

  • B1) P(Feature 1 is True AND Feature 2 is True | y=0) * P(y=0) / P(Feature 1 is True AND Feature 2 is True)
  • B2) P(Feature 1 is True AND Feature 2 is True | y=1) * P(y=1) / P(Feature 1 is True AND Feature 2 is True)

Notice the denominator of (B1) and (B2) are the same, so we can remove it without affecting the comparison

  • C1) P(Feature 1 is True AND Feature 2 is True | y=0) * P(y=0)
  • C2) P(Feature 1 is True AND Feature 2 is True | y=1) * P(y=1)

Next step is the key part, we apply an independence assumption between Feature 1 and Feature 2 (this is what is Naive about the Naive Bayes method).

  • D1) P(Feature 1 is True | y=0) * P(Feature 2 is True | y=0) * P(y=0)
  • D2) P(Feature 1 is True | y=1) * P(Feature 2 is True | y=1) * P(y=1)

At this point, you might be wondering about a few things:

  • Q: Why is this assumption valid? How do we know if it’s true?
  • A: Actually we don’t know, but we are hoping so. But depending on how much Feature 1 and Feature 2 are correlated, it will impact the performance of the model
  • Q: Can’t we just calculate (A1) and (A2) directly anyway? Why do we prefer (D1) and (D2)?
  • A: The reason is (D1) and (D2) are easier to compute. (A1) and (A2) are joint probability. Image if we have 20 binary features, we would need to estimate pow(2, 20) = 1 million probabilities (think about all possible combinate of Feature 1 to 20). On the other hand, for 20 features, there are only 20*2 = 40 probabilities to estimate (only need to estimate P(Feature1|y=0) and P(Feature 1|y=1) for Feature 1 to 20).

As a result, the Naive Bayes method is very efficient computationally and in the right circumstance, it can provide great predictive problem (such as traditional email spam filtering)


  • Efficient to compute
  • Still effective in many practical situations


  • Independence assumption is too strong leading to bad performance in many circumstance

Chances of encountering in data science interviews: ★★★★

  • It’s actually quite a popular question to ask because of it wide use in practice.
  • It also allow the interviewer to test your probability skills.

Decision Trees, Random Forests, and Boosted Trees

Decision Tree

Decision is represented as a set of node where you can traverse from the root node until you read a leaf node.

Training: A decision tree is trained iteratively to result in the topology above with each node representing a decision.

Prediction: You traverse the tree from the root node on the top by evaluating at each decision.


  • Easy to understand and explain
  • Fast to execute and evaluation
  • Generally robust to outliers


  • Prone to overfitting
  • Can get messy if there are many dimension

Random Forest

The Random Forest model takes the Decision Tree model to the next level by bagging many trees into a forest. Each tree is trained similar to the training of a Decision Tree but with randomness injected to the training process to make sure the trees are not too similar. The prediction process is basically an averaging of the result from all the trees.


  • Much more robust and Decision Tree
  • Reasonably high accuracy and usually used the first model when a data scientist faces a new problem


  • Training and prediction can be slowed down as we increase the number of trees
  • There can be still some correlation among the trees
  • It is harder to explain and understand with a large number of trees

Boosted Tree

The Boosted Tree model is similar to the Random Forest model because it also use many Decision Trees. However, instead of training them as independent trees, it trains each trees in sequence which each tree enhances (boosts) the misprediction from all the trees before it. There is also mechanism to reweight the sample differently in each iteration so that the subsequent tree can “focus” on the samples that cause misprediction.


  • Prediction accuracy is the highest among all classical model
  • Relatively quick to evaluation
  • Still relevant today in many applications


  • Training can be slow compared to Random Forest because it trains each tree sequentially

Chances of encountering in interview: ★★★★

  • Tree models are so widely used.
  • Although tree models are no longer the state of the art, they are quite popular in practice.

Support Vector Machine

The SVM model is the go to model before deep learning. The training process is to find a margin maximizing hyperplane to separate the 2 classes. In short, the SVM model can be summarized as


It’s an be solved by formulating as a quadratic programming problem. It also have a few extension such as kernels and soft margin. The detail is beyond the scope of this post, so I would just skip it for the time being.


  • Good prediction performance
  • Fast evaluation speed


  • Works well for binary classification but has to tweak a bit to work for multi-classification

Chance of encountering in data science interviews: ★★★

  • It lost some popularity because of the rise of deep learning model.
  • It is not quite suitable for large amount of data but it’s still used as a last stage that can process the result from a deep learning model.

Neural Network

Neural Network is also knowns as Artificial Neural Network. It’s inspired by the biological neuron of the brains. It has a graphic representation where each node can take in multiple weighted input (fan-in) and transmit the result to other nodes (fan-out). It’s typically organized as Input/Hidden/Output layers as below:

  • The number of nodes in the input layer represent the number of features.
  • A deep network means there are multiple hidden layers.
  • The output layer should be just 1 node for a binary classification problem.

Note that each arrow in the diagram actually represents a weight from the node in the previous layer to the next layer. Within each node, there is typically an non-linear activation function. In a classification problem, it typically uses a sigmoid node at the last layer, and the objective function is usually the cross entropy as defined below:

y_i is the label and p_i is the predicted probability of class 1

The common training process of a neural network is thru gradient descent with a trick called back-propagation. There are quite a few details that is involved in solving this optimization problem of finding the optimal weights in the arrows to achieve the minimal prediction error over all training example.

  • The cost function – this is the cross entropy loss function above
  • The learning rate – how big of a step size to take each iteration
  • The optimizer (ex: Stochastic Gradient Descent, Adam, Adadelta) – this control the adaptiveness/momentum of the learning
  • The batch size – how many training samples to use in each iterations
  • Epoch – how time times we want to iterate the entire training data set
  • The structure of the network – how many layers, how many nodes per layers.
  • Weight Initialization – bad initialization can cause a network to fail to converge. This was actually a main reason why Neural Network “did not” work for a long time in the past.
  • Choice regularization – L1, L2, drop-out layers, data augmentation.

Further, there are many variants of NN such as Convolutional Neural Network and Recurrent Neural Network. They can have much better performance compared to vanilla Feed-forward Neural Network described above.

The evaluation of Neural Network also tends to be more extensive because we still don’t have a good understanding and have to treat it like a black box often. This means we need to apply a lot more caution when validating the model.

There are just too many things to talk about NN but I would stop here in this post.


  • Achieve state of the art accuracy
  • Able to improve from large amount of data


  • Hard to interpret/explain
  • Resource intensive both in computing and data volume

Chances of encountering in data science interviews: ★★★??

  • If you are applying to Machine learning position, then you will probably encounter Neural Network.
  • If you are just applying for data analyst position, you might not encounter it at all.

Related Posts

Leave a Reply

Your email address will not be published. Required fields are marked *