ML Online Assessment

Preparing for an interview? This ML Online Assessment tests your ML skills.

This is a data science online assessment to help you get a sense of your level in basic data science interview questions.

Part 1: General ML Online Assessment

The first part of the Data Science Online Assessment is more about traditional data science.

1. In which situation are you more likely to overfit your model?

A) Too much data
B) Too little data

Show Answer

Answer is (B)

2. Which of the following is not an assumption of linear regression?

A) Linear relationship: there is a linear relationship between each independent variable and the dependent variable
B) Gaussian error: the residuals of the model are normally distributed
C) Homoscedasticity: the residuals have constant variance across the whole range of the independent variables
D) Independence: the residuals are independently distributed
E) Collinearity: the x variables have a linear relationship with each other

Show Answer

Answer is (E)

3. What is NoSQL?

A) A Query Language that does not use SQL
B) A kind of database that does not support SQL query
C) A product by MongoDB
D) A document datastore
E) Not Only SQL
F) Not SQL
G) A kind of distributed database

Show Answer

Answer is (E). NoSQL is a kind of database that attempts to improve performance, scalability, and flexibility by relaxing the restriction of storing data in relational tables.

4. Which statements are true about type 1 error?

A) Also known as False Positive
B) Also known as False Negative
C) Can be reduced to 0 by adjusting the decision threshold
D) There is a trade-off between type 1 error and recall

Show Answer

Answer is (A) and (C). (C) is true because, in the extreme, if you always predict everything as negative then you will not have any false positives at all. (D) is not counted as true here because a trade-off is stated between two desirable qualities, not between an error (a negative quality) and recall (a positive quality).
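To make the reasoning behind (C) concrete, here is a minimal sketch that counts false positives at different decision thresholds. The scores and labels are made up for illustration, and `count_false_positives` is a hypothetical helper, not part of any library.

```python
# Hypothetical model scores and true labels, invented for illustration only.
scores = [0.05, 0.20, 0.35, 0.60, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 1]  # 1 = positive, 0 = negative


def count_false_positives(scores, labels, threshold):
    """Count negatives that are predicted positive (score >= threshold)."""
    return sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)


for threshold in (0.1, 0.5, 1.01):
    fp = count_false_positives(scores, labels, threshold)
    print(f"threshold={threshold}: false positives={fp}")
```

With the threshold set above every possible score (1.01 here), nothing is predicted positive, so the false positive count is zero.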

5. The confusion matrix is symmetric.

A) True
B) False

Show Answer

Answer is (B)

6. Which of the following is a regularization technique?

A) LASSO
B) ANOVA
C) Gradient Descent
D) L2
E) Elastic Net
F) PCA

Show Answer

Answer is (A), (D), (E)

7. If two feature variables (x1 and x2) are correlated, then you cannot use them for linear regression because of collinearity.

A) True
B) False

Show Answer

Answer is (B). It’s only a problem when they are highly or perfectly correlated.

8. A medical test checks whether a patient has cancer. If the null hypothesis is “the patient does not have cancer”, then which is a Type 2 error?

A) The test predicts cancer while the patient does not have cancer
B) The test predicts no cancer while the patient has cancer

Show Answer

Answer is (B). It’s a false negative because the test fails to identify the positive case (cancer).

9. A correlation coefficient of 100% between x and y means that x has a causal relationship with y?

A) True
B) False

Show Answer

Answer is (B). No matter how correlated, it does not imply causality.

10. Which model has the fastest training time?

A) K-Nearest Neighbor
B) Logistic Regression
C) Support Vector Machine
D) Linear Regression
E) Naive Bayes

Show Answer

Answer is (A). There is no training at all for K-NN.

11. Why do you need to scale numeric features to the range [-1, 1]?

A) For speed of convergence by gradient descent
B) To balance L1 & L2 regularization among features
C) Achieve highest floating point precision
D) Tree models operate more efficiently at this range
E) A K-NN model would not behave well with Euclidean distance

Show Answer

Answer is (A), (B), (C), (E).

12. Which of the following is not a method of feature scaling?

A) Min-max scaling
B) Clipping
C) Z-score normalization
D) Winsorizing
E) Outlier dropping

Show Answer

Answer is (E).
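As a rough illustration of the scaling methods named in the options, the sketch below applies min-max scaling, z-score normalization, and clipping to a small made-up column. The helper names and the sample values are assumptions chosen for this example.

```python
import statistics

# Hypothetical feature column with one large outlier.
values = [2.0, 4.0, 6.0, 8.0, 100.0]


def min_max_scale(xs):
    """Map values linearly onto the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def z_score(xs):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]


def clip(xs, lo, hi):
    """Cap values at fixed bounds; no rows are removed."""
    return [min(max(x, lo), hi) for x in xs]


print(min_max_scale(values))
print(z_score(values))
print(clip(values, 0.0, 10.0))
```

All three keep every row and only change the values, whereas dropping outliers removes rows from the data.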

13. When would you NOT engineer a hashed feature for categorical input?

A) Unknown vocabulary
B) High cardinality of the categorical input
C) The feature is the gender of the person
D) The feature is the hospital_id of a medical sample
E) Cold-start problem when new data comes during serving

Show Answer

Answer is (C)
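For context, a hashed feature maps each category string to one of a fixed number of buckets. The sketch below is one possible way to do it, using SHA-256 from the standard library; the function name `hash_bucket`, the bucket count, and the sample IDs are assumptions for illustration.

```python
import hashlib


def hash_bucket(value: str, num_buckets: int) -> int:
    """Deterministically map a category string to a bucket index."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


# Hypothetical hospital IDs; the last one stands in for a value never seen in training.
for hospital_id in ("hospital_001", "hospital_002", "hospital_new"):
    print(hospital_id, "->", hash_bucket(hospital_id, num_buckets=10))
```

An unseen ID still lands in a valid bucket, which is why hashing helps with unknown vocabulary, high cardinality, and cold start. A feature with only a handful of known values gains nothing from it and risks collisions.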

14. Which of the following does NOT describe unsupervised learning?

A) the data is not labeled
B) PCA is a type of unsupervised learning method
C) predict the price of a house based on the area, number of bedrooms, etc.
D) finding customer segments from a population of customers

Show Answer

Answer is (C), which is supervised learning.

Part 2: Deep Learning Online Assessment

This part of the Data Science Online Assessment is about deep learning.

1. What is stochastic about the Stochastic Gradient Descent?

A) The data is random in each batch
B) The gradient takes a random direction each step
C) The magnitude of the descent is randomized
D) The dimension of the descent is randomized
E) The momentum of the descent is stochastic

Show Answer

Answer is (A) because the data is shuffled and then fed into each step.

2. Gradient descent searches for the global minimum

A) True
B) False
C) Depends

Show Answer

The answer is (C). It depends on whether the problem is convex. For neural networks in general, the optimization problem is non-convex, so gradient descent is unlikely to find the global minimum.
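The dependence on convexity can be seen on a small one-dimensional example. This is a minimal sketch: the function f(x) = x^4 - 3x^2 + x, the learning rate, and the starting points are all chosen here for illustration.

```python
def f(x):
    """A non-convex function with two separate minima."""
    return x**4 - 3 * x**2 + x


def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1


def gradient_descent(start, learning_rate=0.01, steps=2000):
    """Plain gradient descent from a given starting point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x


right = gradient_descent(2.0)   # settles in the shallower minimum
left = gradient_descent(-2.0)   # settles in the deeper minimum
print(f"start= 2.0 -> x={right:.4f}, f(x)={f(right):.4f}")
print(f"start=-2.0 -> x={left:.4f}, f(x)={f(left):.4f}")
```

Both runs converge, but to different points, and only one of them is the global minimum. Which one you reach depends on where you start.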

3. What is the loss function for a multi-class classification problem?

A) Mean square error
B) Soft-max error
C) Cross-entropy loss
D) Logistic loss
E) Mean absolute error

Show Answer

Answer is (C). (B) is a layer before the final cross-entropy loss layer. (D) is for binary classification.
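The relationship between softmax and cross-entropy can be sketched in a few lines of plain Python. The logits and the true class index below are made-up values, and the helper functions are written for this example rather than taken from a library.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_class])


logits = [2.0, 1.0, 0.1]   # hypothetical outputs of the last dense layer
probs = softmax(logits)
loss = cross_entropy(probs, true_class=0)
print(probs, loss)
```

Softmax only produces the probabilities; the loss that is minimized is the cross-entropy computed from them.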

4. You have an RGB image from the previous layer with dimension (3, 100, 100). Which of the following is the right output size (total number of values) after a 32-channel 2D convolutional layer with filter size (5×5) and same padding?

A) 320000
B) 250000
C) 2400
D) 960000

Show Answer

Answer is (A), which is 100*100*32.
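The arithmetic can be checked with a short helper. This sketch assumes the common convention that "same" padding gives an output height and width of ceil(input / stride); the function name is hypothetical.

```python
import math


def same_padding_output(height, width, filters, stride=1):
    """Output shape (H, W, C) of a 2D convolution with 'same' padding."""
    out_h = math.ceil(height / stride)
    out_w = math.ceil(width / stride)
    return out_h, out_w, filters


shape = same_padding_output(100, 100, filters=32)
num_values = shape[0] * shape[1] * shape[2]
print(shape, num_values)  # (100, 100, 32) 320000
```

Note that neither the 5×5 filter size nor the 3 input channels changes the output size: with same padding and stride 1 the spatial size is preserved, and the channel count is set by the number of filters.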

5. Why do we use non-linear activation functions (such as tanh or ReLU) in neural networks?

A) Non-linear optimization is faster for convergence
B) Most real world datasets are non-linear
C) Stacking linear layers can only model linear relationships in the data, which results in limited modeling capability
D) The error measure MSE (mean squared error) is a non-linear metric

Show Answer

Answer is (C). You might argue (B), but (C) is more appropriate.
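To see why (C) holds, the sketch below stacks two linear layers and compares the result with a single layer whose weight matrix is the product of the two. The weights and input are arbitrary values picked for illustration, and biases are left out for brevity.

```python
def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(w, x):
    """Apply weight matrix w to input vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(v):
    return [max(0.0, vi) for vi in v]


W1 = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]   # layer 1: 2 inputs -> 3 units
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]     # layer 2: 3 units -> 2 outputs
x = [0.5, -2.0]

two_layers = matvec(W2, matvec(W1, x))        # linear layer after linear layer
collapsed = matvec(matmul(W2, W1), x)         # one layer with weights W2 @ W1
with_relu = matvec(W2, relu(matvec(W1, x)))   # non-linearity in between

print(two_layers, collapsed, with_relu)
```

The first two results are identical, so the extra linear layer adds no modeling capability. Inserting ReLU between the layers gives a different result that no single weight matrix reproduces for all inputs.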

6. CNN question: To a 10×10 grayscale image, we apply a convolutional layer with 32 filters, kernel size 3×3, stride 1, and no zero padding (‘VALID’ padding), followed by a 2×2 pooling layer. What are the dimensions of the output?

A) 5x5x32
B) 4x4x32
C) 4×4
D) 8x8x32

Show Answer

Answer is (B). The output of the convolution layer is 8x8x32, then the output of the pooling layer is 4x4x32.
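The two steps follow the usual formula for a window sliding without padding: output = floor((input - window) / stride) + 1. In the sketch below, `valid_output` is a hypothetical helper, and the pooling stride of 2 is an assumption (a stride equal to the pool size is the Keras default for pooling layers).

```python
def valid_output(size, window, stride=1):
    """Output length along one axis for a sliding window with no padding."""
    return (size - window) // stride + 1


conv = valid_output(10, window=3, stride=1)      # 3x3 convolution, stride 1
pooled = valid_output(conv, window=2, stride=2)  # 2x2 pooling, stride 2 (assumed)
print(f"{conv}x{conv}x32 -> {pooled}x{pooled}x32")  # 8x8x32 -> 4x4x32
```

The channel dimension stays at 32 throughout because pooling acts on each channel separately.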

7. How does increasing the stride affect the output of a convolutional layer?

A) The output becomes smaller
B) The output becomes bigger
C) The output does not change

Show Answer

Answer is (A)

8. What does it mean when a dense layer of a neural network has no activation?

A) The output of the layer is null value
B) The network does not compile
C) The output of the layer is disabled
D) The output of the layer is linear
E) The layer can be only used for classification by default

Show Answer

Answer is (D)

9. How many trainable parameters are in a flatten layer?

A) Depends on the previous layer’s output dimension
B) Depends on the input layer’s dimension
C) Depends on the next layer’s input dimension
D) Defined by the user when creating this layer
E) No trainable parameters

Show Answer

Answer is (E)

10. In TensorFlow, for the layer “Conv2D(16, (3,3), activation='relu', input_shape=(28,28,1))”, why do you need the “1” in the last dimension of input_shape?

A) It does not matter, you can remove it since 28×28 is the same as 28×28×1
B) It is for the color dimension of an image
C) TensorFlow’s Conv2D performs convolution in 2D, but over a 3D volume
D) It does not compile if you leave out the “1”

Show Answer

Answer is (B) and (C). You can think of Conv2D as a stack of 2D convolutions that sweeps through all color channels at every convolution step.

11. Why are neural networks rarely optimized with batch gradient descent (BGD) compared to stochastic gradient descent (SGD)?

A) It takes too much memory because of the dataset size
B) It does not converge to the global optimum
C) It needs to be stochastic because data is stochastic
D) It is too slow to converge compared to SGD

Show Answer

Answer is (A). There is some truth to (B) because SGD can sometimes add randomness that helps it jump out of bad minima. (D) is sometimes true, depending on whether you count by the number of epochs or the number of steps.

12. Given a dropout rate of 0.3 on a layer with 10 neurons, what is the probability that only one of the neurons is dropped out?

A) 0.3
B) 0.7
C) 10*0.3*(0.7^9)
D) 10*0.7*(0.3^9)

Show Answer

Answer is (C). The probability that the first neuron is dropped while the other 9 are not dropped is 0.3*(0.7^9), and we have 10 neurons to choose from as the dropped neuron, so we multiply by 10.
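This is a binomial probability, assuming each neuron is dropped independently with the same rate. A small sketch, with a helper name chosen for this example:

```python
from math import comb


def prob_exactly_k_dropped(n, k, rate):
    """Binomial probability that exactly k of n neurons are dropped."""
    return comb(n, k) * rate**k * (1 - rate) ** (n - k)


p = prob_exactly_k_dropped(n=10, k=1, rate=0.3)
print(round(p, 6))  # about 0.121061
```

Here comb(10, 1) = 10 is the "10 neurons to choose from" factor in the explanation above.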

13. When should you use the Keras Functional API over the Sequential API?

A) When the model is simple to express as a sequence of layers
B) When you have multiple input layers
C) When you have multiple output layers
D) When your network topology requires shared layers

Show Answer

Answer is (B), (C), and (D)

14. What is transfer learning?

A) To reuse knowledge learned by a pretrained model on a related task
B) To compute gradients of the loss function by transferring parameters from another model
C) To adapt a regression model to a classification model
D) To create a smaller model from a large model but still retaining most capabilities

Show Answer

Answer is (A)

15. What are some properties of RNN (recurrent neural network)?

A) Size of the network increases with the sequence length
B) Only works for time series data
C) It can process sequential data of arbitrary length
D) It carries hidden state from one time step to the next

Show Answer

Answer is (C) and (D)

16. What is LSTM trying to address from vanilla RNN?

A) Allow RNN to process longer time series
B) Solve the exploding gradient problem in RNN
C) Solve the vanishing gradient problem in RNN
D) To allow bidirectional learning on sequential data

Show Answer

Answer is (B) and (C)

17. Why is the output size of an RNN layer fixed while the input size can be flexible?

A) Because the RNN sums up the hidden state from all time steps
B) Because the RNN averages the hidden state from all time steps
C) Because the RNN only returns the hidden state from the last step as output
D) Not possible, the output size of an RNN varies as the input size

Show Answer

Answer is (C). The input length is manifested as repeatedly feeding the hidden state from the previous time step to the next, but only the hidden state from the last time step is returned as the output of the RNN for the next layer (assuming it’s not a stacked RNN).
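A toy recurrent cell in plain Python shows the same behavior. This is a minimal sketch and not how Keras implements it: the weights are arbitrary fixed numbers, the input is a scalar per time step, and there is no bias or training.

```python
import math

HIDDEN_SIZE = 3

# Hypothetical fixed weights, chosen arbitrarily for illustration.
W_x = [0.5, -0.3, 0.8]          # input-to-hidden weights (scalar input)
W_h = [[0.1, 0.0, -0.2],        # hidden-to-hidden weights
       [0.3, 0.2, 0.0],
       [0.0, -0.1, 0.4]]


def rnn_last_hidden(sequence):
    """Run a vanilla RNN over the sequence and return the final hidden state."""
    h = [0.0] * HIDDEN_SIZE
    for x_t in sequence:
        h = [math.tanh(W_x[i] * x_t
                       + sum(W_h[i][j] * h[j] for j in range(HIDDEN_SIZE)))
             for i in range(HIDDEN_SIZE)]
    return h


short = rnn_last_hidden([1.0, 2.0])   # 2 time steps
long = rnn_last_hidden([0.5] * 50)    # 50 time steps
print(len(short), len(long))          # both are HIDDEN_SIZE
```

The loop runs once per time step and reuses the same weights, so the sequence length changes the number of iterations but not the size of the returned state.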

18. What is the output dimension of the following Keras model?

Sequential([Embedding(1000, 16, input_length=128), LSTM(32)])

A) (None, 128, 16)
B) (None, 16)
C) (None, 32)
D) (None, None, 32)

Show Answer

Answer is (C). The output from LSTM is the final hidden state of 32-dimension.

19. What are the ways to build a neural network model in Keras?

A) using the Keras Sequential API
B) using the Keras Functional API
C) using Model Subclassing
D) using Layer Subclassing

Show Answer

Answer is all of the above

20. What is the value of dy_dx below?

import tensorflow as tf

x = tf.constant([0, 1], dtype=tf.float32)

with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.math.exp(x)
    y = 2 * tf.reduce_sum(y)
dy_dx = tape.gradient(y, x)

A) tf.Tensor([0], shape=(), dtype=float32)
B) tf.Tensor([0.5, 2], shape=(2,), dtype=float32)
C) tf.Tensor([2, 5.4365635], shape=(2,), dtype=float32)
D) tf.Tensor([5.4365635], shape=(), dtype=float32)

Show Answer

Answer is (C). GradientTape is the automatic differentiation tool in TensorFlow. You can verify by taking the derivative of y with respect to x_i: dy/dx_i = 2 * exp(x_i). For the first dimension, x_i = 0, so dy/dx_i = 2 * exp(0) = 2. For the second dimension, x_i = 1, so dy/dx_i = 2 * exp(1) ≈ 5.4365635.
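The derivative can also be checked without TensorFlow. The sketch below compares the analytic gradient 2 * exp(x_i) with a central finite-difference estimate; the helper names and the step size are choices made for this example.

```python
import math


def y(x):
    """Same function as in the question: y = 2 * sum(exp(x_i))."""
    return 2 * sum(math.exp(xi) for xi in x)


def analytic_gradient(x):
    """dy/dx_i = 2 * exp(x_i)."""
    return [2 * math.exp(xi) for xi in x]


def numeric_gradient(f, x, eps=1e-6):
    """Central finite-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


x = [0.0, 1.0]
print(analytic_gradient(x))
print(numeric_gradient(y, x))
```

The two estimates should agree to several decimal places, matching the values in option (C).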

21. What is aleatoric uncertainty?

A) Data errors from sensor malfunction
B) Object occlusion in an image segmentation task
C) Using a model that underfits the problem

Show Answer

Answer is (A) because it’s the irreducible error.

