k-Nearest Neighbors (k-NN)
1) [True or False] The k-NN algorithm does more computation at test time than at train time.
A) TRUE
B) FALSE
2) In the image below, which would be the best value for k, assuming that the algorithm you are using is k-Nearest Neighbor?
A) 3
B) 10
C) 20
D) 50
3) Which of the following distance metrics cannot be used in k-NN?
A) Manhattan
B) Minkowski
C) Tanimoto
D) Jaccard
E) Mahalanobis
F) All can be used
4) Which of the following options is true about the k-NN algorithm?
A) It can be used for classification
B) It can be used for regression
C) It can be used in both classification and regression
5) Which of the following statements is true about the k-NN algorithm?
1. k-NN performs much better if all of the data have the same scale
2. k-NN works well with a small number of input variables (p), but struggles when the
number of inputs is very large
3. k-NN makes no assumptions about the functional form of the problem being
solved
A) 1 and 2
B) 1 and 3
C) Only 1
D) All of the above
6) Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?
A) K-NN
B) Linear Regression
C) Logistic Regression
7) Which of the following is true about Manhattan distance?
A) It can be used for continuous variables
B) It can be used for categorical variables
C) It can be used for categorical as well as continuous
D) None of these
8) Which of the following distance measures do we use in the case of categorical variables in k-NN?
1. Hamming Distance
2. Euclidean Distance
3. Manhattan Distance
A) 1
B) 2
C) 3
D) 1 and 2
E) 2 and 3
F) 1,2 and 3
9) What would be the Euclidean distance between the two data points A(1,3) and B(2,3)?
A) 1
B) 2
C) 4
D) 8
Solution: A
sqrt( (1-2)^2 + (3-3)^2) = sqrt(1^2 + 0^2) = 1
10) What would be the Manhattan distance between the two data points A(1,3) and B(2,3)?
A) 1
B) 2
C) 4
D) 8
Solution: A
|1-2| + |3-3| = 1 + 0 = 1 (the Manhattan distance is the sum of absolute differences; no square root is involved)
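A minimal sketch in plain Python verifying both distance calculations for A(1, 3) and B(2, 3) from the two questions above:

from math import sqrt

def euclidean(a, b):
    # square root of the sum of squared coordinate differences
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # sum of absolute coordinate differences; no square root involved
    return sum(abs(x - y) for x, y in zip(a, b))

A, B = (1, 3), (2, 3)
print(euclidean(A, B))  # 1.0
print(manhattan(A, B))  # 1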
11) Suppose you want to predict the class of the new data point x=1, y=1 using Euclidean distance in 3-NN. To which class does this data point belong?
A) + Class
B) – Class
C) Can’t say
D) None of these
12) In the previous question, suppose you now want to use 7-NN instead of 3-NN. To which class will the point x=1, y=1 belong?
A) + Class
B) – Class
C) Can’t say
13) Which of the following values of k in k-NN would minimize the leave-one-out cross-validation accuracy?
A) 3
B) 5
C) Both have same
D) None of these
14) What would be the leave-one-out cross-validation accuracy for k=5?
A) 2/14
B) 4/14
C) 6/14
D) 8/14
E) None of the above
15) Which of the following will be true about k in k-NN in terms of bias?
A) When you increase k, the bias will increase
B) When you decrease k, the bias will increase
C) Can’t say
D) None of these
16) Which of the following will be true about k in k-NN in terms of variance?
A) When you increase k, the variance will increase
B) When you decrease k, the variance will increase
C) Can’t say
D) None of these
17) You are given the following two distances (Euclidean distance and Manhattan distance), which we generally use in the k-NN algorithm. These distances are between two points A(x1, y1) and B(x2, y2). Your task is to label each distance by looking at the following two graphs. Which of the following options is true about the graphs below?
A) Left is Manhattan distance and right is Euclidean distance
B) Left is Euclidean distance and right is Manhattan distance
C) Neither left nor right is Manhattan distance
D) Neither left nor right is Euclidean distance
18) When you find noise in the data, which of the following options would you consider in k-NN?
A) I will increase the value of k
B) I will decrease the value of k
C) Noise cannot depend on the value of k
D) None of these
19) In k-NN, it is very likely to overfit due to the curse of dimensionality. Which of the following options would you consider to handle such a problem?
1. Dimensionality Reduction
2. Feature selection
A) 1
B) 2
C) 1 and 2
D) None of these
20) Two statements are given below. Which of the following is true about both statements?
1. k-NN is a memory-based approach, meaning the classifier immediately adapts as we collect new training data.
2. The computational complexity for classifying new samples grows linearly with the number of samples in the training dataset in the worst-case scenario.
A) 1
B) 2
C) 1 and 2
D) None of these
21) Suppose you are given the following images (1 left, 2 middle, 3 right). Your task is to compare the value of k in k-NN used in each image, where k1 is for the 1st, k2 for the 2nd, and k3 for the 3rd figure.
A) k1 > k2> k3
B) k1<k2
C) k1 = k2 = k3
D) None of these
22) Which value of k in the following graph would give the least leave-one-out cross-validation accuracy?
A) 1
B) 2
C) 3
D) 5
23) A company has built a k-NN classifier that gets 100% accuracy on training data. When they deployed this model on the client side, it was found that the model is not at all accurate. Which of the following might have gone wrong?
Note: The model was deployed successfully and no technical issues were found on the client side, apart from the model performance.
A) It is probably an overfitted model
B) It is probably an underfitted model
C) Can’t say
D) None of these
24) You are given the following two statements. Which of these options is/are true in the case of k-NN?
1. In the case of a very large value of k, we may include points from other classes in the neighborhood.
2. In the case of a too-small value of k, the algorithm is very sensitive to noise.
A) 1
B) 2
C) 1 and 2
D) None of these
25) Which of the following statements is true for k-NN classifiers?
A) The classification accuracy is better with larger values of k
B) The decision boundary is smoother with smaller values of k
C) The decision boundary is linear
D) k-NN does not require an explicit training step
26) True-False: It is possible to construct a 2-NN classifier by using the 1-NN
classifier?
A) TRUE
B) FALSE
27) In k-NN what will happen when you increase/decrease the value of k?
A) The boundary becomes smoother with increasing value of K
B) The boundary becomes smoother with decreasing value of K
C) Smoothness of the boundary doesn't depend on the value of K
D) None of these
28) The following two statements are given for the k-NN algorithm; which of the statement(s) is/are true?
1. We can choose optimal value of k with the help of cross validation
2. Euclidean distance treats each feature as equally important
A) 1
B) 2
C) 1 and 2
D) None of these
29) What would be the time taken by 1-NN if there are N (very large) observations, each with D features, in the test data?
A) N*D
B) N*D*2
C) (N*D)/2
D) None of these
30) What would be the relation between the times taken by 1-NN, 2-NN, and 3-NN?
A) 1-NN >2-NN >3-NN
B) 1-NN < 2-NN < 3-NN
C) 1-NN ~ 2-NN ~ 3-NN
D) None of these
Logistic Regression
1) True-False: Is Logistic regression a supervised machine learning algorithm?
A) TRUE
B) FALSE
2) True-False: Is Logistic regression mainly used for Regression?
A) TRUE
B) FALSE
3) True-False: Is it possible to design a logistic regression algorithm using a Neural
Network Algorithm?
A) TRUE
B) FALSE
4) True-False: Is it possible to apply a logistic regression algorithm on a 3-class
Classification problem?
A) TRUE
B) FALSE
5) Which of the following methods do we use to best fit the data in Logistic
Regression?
A) Least Square Error
B) Maximum Likelihood
C) Jaccard distance
D) Both A and B
6) Which of the following evaluation metrics cannot be applied to the output of logistic regression when comparing with the target?
A) AUC-ROC
B) Accuracy
C) Logloss
D) Mean-Squared-Error
Solution: D
7) One of the very good methods to analyze the performance of Logistic Regression
is AIC, which is similar to R-Squared in Linear Regression. Which of the following is
true about AIC?
A) We prefer a model with minimum AIC value
B) We prefer a model with maximum AIC value
C) Both but depend on the situation
D) None of these
8) [True-False] Standardisation of features is required before training a Logistic
Regression.
A) TRUE
B) FALSE
9) Which of the following algorithms do we use for Variable Selection?
A) LASSO
B) Ridge
C) Both
D) None of these
Context: 10-11
Consider the following model for logistic regression: P(y=1|x, w) = g(w0 + w1*x), where g(z) is the logistic function.
In the above equation, P(y=1|x, w), viewed as a function of x, is what we can change by varying the parameters w.
10) What would be the range of p in such a case?
A) (0, inf)
B) (-inf, 0 )
C) (0, 1)
D) (-inf, inf)
11) In the above question, which function would make p lie between (0, 1)?
A) logistic function
B) Log likelihood function
C) Mixture of both
D) None of them
Context: 12-13
Suppose you train a logistic regression classifier and your hypothesis function is H(x) = g(−6 + x2).
12) Which of the following figure will represent the decision boundary as given by
above classifier?
A)
B)
C)
D)
Solution: B
Option B is the right answer. Our boundary is represented by y = g(−6 + x2), which matches options A and B. Option B is correct because substituting x2 = 6 gives y = g(0) = 0.5, so that point lies on the boundary; for values of x2 less than 6 the argument of g is negative, so the output falls in the y = 0 region.
13) If you replace the coefficient of x1 with x2, what would be the output figure?
A)
B)
C)
D)
Solution: D
Same explanation as in previous question.
14) Suppose you have been given a fair coin and you want to find the odds of getting heads. Which of the following options is true for such a case?
A) odds will be 0
B) odds will be 0.5
C) odds will be 1
D) None of these
Solution: C
Odds are defined as the ratio of the probability of success to the probability of failure. For a fair coin, the probability of success is 1/2 and the probability of failure is 1/2, so the odds are (1/2)/(1/2) = 1.
15) The logit function (written l(x)) is the log-odds function, l(x) = log(x / (1 − x)). What could be the range of the logit function over the domain x = [0, 1]?
A) (– ∞ , ∞)
B) (0,1)
C) (0, ∞)
D) (- ∞, 0)
16) Which of the following options is true?
A) Linear Regression's errors have to be normally distributed, but in the case of Logistic Regression this is not required
B) Logistic Regression's errors have to be normally distributed, but in the case of Linear Regression this is not required
C) Both Linear Regression's and Logistic Regression's errors have to be normally distributed
D) Neither Linear Regression's nor Logistic Regression's errors have to be normally distributed
17) Which of the following is true regarding the logistic function for any value “x”?
Note:
Logistic(x): the logistic function of any number "x"
Logit(x): the logit function of any number "x"
Logit_inv(x): the inverse logit function of any number "x"
A) Logistic(x) = Logit(x)
B) Logistic(x) = Logit_inv(x)
C) Logit_inv(x) = Logit(x)
D) None of these
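A small numeric check of the relationship behind this question (a sketch assuming NumPy is available): the logistic function squashes any real x into (0, 1), the logit maps (0, 1) back onto (−∞, ∞), and the two are inverses of each other, so Logistic(x) = Logit_inv(x):

import numpy as np

def logistic(x):
    # squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    # log-odds: maps (0, 1) onto (-inf, inf)
    return np.log(p / (1.0 - p))

x = np.linspace(-5, 5, 11)
print(np.allclose(logit(logistic(x)), x))  # True: logit inverts logistic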
18) How will the bias change on using high (infinite) regularisation?
Suppose you are given two scatter plots, "a" and "b", for two classes (blue for the positive and red for the negative class). In scatter plot "a", you correctly classified all data points using logistic regression (the black line is the decision boundary).
A) Bias will be high
B) Bias will be low
C) Can’t say
D) None of these
Solution: A
The model will become very simple, so the bias will be very high.
19) Suppose you applied a Logistic Regression model on given data and got training accuracy X and testing accuracy Y. Now you want to add a few new features to the same data. Select the option(s) which is/are correct in such a case.
Note: Consider that the remaining parameters are the same.
A) Training accuracy increases
B) Training accuracy increases or remains the same
C) Testing accuracy decreases
D) Testing accuracy increases or remains the same
Solution: A and D
Adding more features to the model will increase the training accuracy, because the model has more information with which to fit the training data. Testing accuracy increases if the added feature is found to be significant.
20) Choose which of the following options is true regarding One-Vs-All method in
Logistic Regression.
A) We need to fit n models in n-class classification problem
B) We need to fit n-1 models to classify into n classes
C) We need to fit only 1 model to classify into n classes
D) None of these
21) Below are two different logistic models with different values for β0 and β1. Which of the following statement(s) is/are true about the β0 and β1 values of the two logistic models (Green, Black)?
Note: consider Y = β0 + β1*X. Here, β0 is intercept and β1 is coefficient.
A) β1 for Green is greater than Black
B) β1 for Green is lower than Black
C) β1 for both models is same
D) Can’t Say
Solution: B
For the black curve, β0 = 0 and β1 = 1; for the green curve, β0 = 0 and β1 = −1.
Context 22-24
Below are three scatter plots (A, B, C, left to right) with hand-drawn decision boundaries for logistic regression.
22) Which of the above figures shows a decision boundary that is overfitting the training data?
A) A
B) B
C) C
D) None of these
Solution: C
Since the decision boundary in figure C is not smooth, it is most likely overfitting the training data.
23) What do you conclude after seeing this visualization?
1. The training error in the first plot is the maximum compared to the second and third plots.
2. The best model for this regression problem is the last (third) plot because it has minimum training error (zero).
3. The second model is more robust than the first and third because it will perform best on unseen data.
4. The third model is overfitting compared to the first and second.
5. All will perform the same because we have not seen the testing data.
A) 1 and 3
B) 1 and 3
C) 1, 3 and 4
D) 5
Solution: C
The trend in the graphs looks like a quadratic trend over the independent variable X. A higher-degree polynomial (right graph) might have very high accuracy on the training population but is expected to fail badly on the test dataset. In the left graph, the training error is at its maximum because the model underfits the training data.
24) Suppose, above decision boundaries were generated for the different value of
regularization. Which of the above decision boundary shows the maximum
regularization?
A) A
B) B
C) C
D) All have equal regularization
Solution: A
More regularization means a larger penalty, which yields a less complex decision boundary, as shown in figure A.
25) The below figure shows AUC-ROC curves for three logistic regression models. Different colors show curves for different hyperparameter values. Which of the following will give the best result?
A) Yellow
B) Pink
C) Black
D) All are same
Solution: A
The best classifier is the one with the largest area under the curve, and the yellow curve has the largest area.
26) Suppose you are using a Logistic Regression model on a huge dataset. One problem you may face with such huge data is that Logistic Regression will take a very long time to train. What would you do if you want to train logistic regression on the same data in less time while getting comparatively similar (though not necessarily identical) accuracy?
A) Decrease the learning rate and decrease the number of iterations
B) Decrease the learning rate and increase the number of iterations
C) Increase the learning rate and increase the number of iterations
D) Increase the learning rate and decrease the number of iterations
Solution: D
If you decrease the number of iterations while training, it will certainly take less time, but it will not reach the same accuracy. To get similar (though not exact) accuracy, you need to increase the learning rate.
27) Which of the following images shows the cost function for y = 1?
The following is the loss function in logistic regression (loss function on the Y-axis and log probability on the X-axis) for a two-class classification problem.
Note: Y is the target class
A) A
B) B
C) Both
D) None of these
Solution: A
A is the correct answer, as the loss decreases as the log probability increases.
28) Suppose the following graph shows a cost function for logistic regression. How many local minima are present in the graph?
A) 1
B) 2
C) 3
D) 4
Solution: C
There are three local minima present in the graph
29) Imagine you are given the below graph of logistic regression, which shows the relationship between the cost function and the number of iterations for 3 different learning-rate values (different colors show different curves at different learning rates).
Suppose you saved the graph for future reference but forgot to save the values of the different learning rates. Now you want to find the relation between the learning-rate values of these curves. Which of the following will be the true relation?
Note:
1. The learning rate for blue is l1
2. The learning rate for red is l2
3. The learning rate for green is l3
A) l1>l2>l3
B) l1 = l2 = l3
C) l1 < l2 < l3
D) None of these
Solution: C
With a low learning rate, the cost function decreases slowly; with a large learning rate, the cost function decreases quickly.
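A toy illustration of this effect, using gradient descent on the simple 1-d cost J(w) = w^2 rather than the actual logistic loss from the figure; the learning-rate values here are hypothetical stand-ins for l1 < l2 < l3:

def cost_after(iterations, lr, w=10.0):
    # gradient descent on J(w) = w**2, whose gradient is 2*w
    for _ in range(iterations):
        w -= lr * 2 * w
    return w ** 2

for lr in (0.01, 0.1, 0.4):  # illustrative values only
    print(lr, cost_after(20, lr))
# The larger the learning rate (within a stable range), the faster the cost falls.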
30) Can a Logistic Regression classifier do a perfect classification on the below
data?
Note: You can use only X1 and X2 variables where X1 and X2 can take only two
binary values(0,1).
A) TRUE
B) FALSE
C) Can’t say
D) None of these
Solution: B
No. Logistic regression only forms a linear decision surface, but the examples in the figure are not linearly separable.
Linear Regression
1) True-False: Linear Regression is a supervised machine learning algorithm.
A) TRUE
B) FALSE
2) True-False: Linear Regression is mainly used for Regression.
A) TRUE
B) FALSE
3) True-False: It is possible to design a Linear regression algorithm using a neural
network?
A) TRUE
B) FALSE
4) Which of the following methods do we use to find the best fit line for data in Linear
Regression?
A) Least Square Error
B) Maximum Likelihood
C) Logarithmic Loss
D) Both A and B
5) Which of the following evaluation metrics can be used to evaluate a model while
modeling a continuous output variable?
A) AUC-ROC
B) Accuracy
C) Logloss
D) Mean-Squared-Error
6) True-False: Lasso Regularization can be used for variable selection in Linear
Regression.
A) TRUE
B) FALSE
Solution: (A)
7) Which of the following is true about Residuals ?
A) Lower is better
B) Higher is better
C) A or B depend on the situation
D) None of these
8) Suppose we have N independent variables (X1, X2, ..., Xn) and the dependent variable is Y. Now imagine you are applying linear regression by fitting the best-fit line using least square error on this data.
You find that the correlation coefficient of one of the variables (say X1) with Y is −0.95.
Which of the following is true for X1?
A) Relation between the X1 and Y is weak
B) Relation between the X1 and Y is strong
C) Relation between the X1 and Y is neutral
D) Correlation can’t judge the relationship
9) You are given two variables V1 and V2 that follow the two characteristics below. Which of the following options is correct for the Pearson correlation between V1 and V2?
1. If V1 increases then V2 also increases
2. If V1 decreases then V2's behavior is unknown
A) Pearson correlation will be close to 1
B) Pearson correlation will be close to -1
C) Pearson correlation will be close to 0
D) None of these
10) Suppose Pearson correlation between V1 and V2 is zero. In such case, is it right
to conclude that V1 and V2 do not have any relation between them?
A) TRUE
B) FALSE
11) Which of the following offsets do we use in linear regression's least square line fit? Suppose the horizontal axis is the independent variable and the vertical axis is the dependent variable.
A) Vertical offset
B) Perpendicular offset
C) Both, depending on the situation
D) None of above
12) True-False: Overfitting is more likely when you have a huge amount of data to train on.
A) TRUE
B) FALSE
13) We can also compute the coefficient of linear regression with the help of an
analytical method called “Normal Equation”. Which of the following is/are true about
Normal Equation?
1. We don’t have to choose the learning rate
2. It becomes slow when number of features is very large
3. There is no need to iterate
A) 1 and 2
B) 1 and 3
C) 2 and 3
D) 1,2 and 3
14) Which of the following statements is true about the sum of residuals of A and B?
Below graphs show two fitted regression lines (A & B) on randomly generated data.
Now, I want to find the sum of residuals in both cases A and B.
Note:
1. Scale is same in both graphs for both axis.
2. X axis is independent variable and Y-axis is dependent variable.
A) A has higher sum of residuals than B
B) A has lower sum of residual than B
C) Both have same sum of residuals
D) None of these
Question Context 15-17:
Suppose you have fitted a complex regression model on a dataset. Now you are using Ridge regression with penalty x.
15) Choose the option that best describes the bias.
A) In case of very large x; bias is low
B) In case of very large x; bias is high
C) We can’t say about bias
D) None of these
16) What will happen when you apply a very large penalty?
A) Some of the coefficients will become exactly zero
B) Some of the coefficients will approach zero but not become exactly zero
C) Both A and B, depending on the situation
D) None of these
17) What will happen when you apply a very large penalty in the case of Lasso?
A) Some of the coefficients will become zero
B) Some of the coefficients will approach zero but not become exactly zero
C) Both A and B, depending on the situation
D) None of these
18) Which of the following statement is true about outliers in Linear regression?
A) Linear regression is sensitive to outliers
B) Linear regression is not sensitive to outliers
C) Can’t say
D) None of these
19) Suppose you plotted a scatter plot between the residuals and the predicted values in linear regression and found that there is a relationship between them. Which of the following conclusions do you make about this situation?
A) Since there is a relationship, our model is not good
B) Since there is a relationship, our model is good
C) Can’t say
D) None of these
Question Context 20-22:
Suppose you have a dataset D1 and you design a linear regression model with a degree-3 polynomial, and you find that the training and testing error is 0; in other words, it perfectly fits the data.
20) What will happen when you fit degree 4 polynomial in linear regression?
A) There are high chances that the degree-4 polynomial will overfit the data
B) There are high chances that the degree-4 polynomial will underfit the data
C) Can’t say
D) None of these
21) What will happen when you fit a degree-2 polynomial in linear regression?
A) There are high chances that the degree-2 polynomial will overfit the data
B) There are high chances that the degree-2 polynomial will underfit the data
C) Can’t say
D) None of these
22) In terms of bias and variance, which of the following is true when you fit a degree-2 polynomial?
A) Bias will be high, variance will be high
B) Bias will be low, variance will be high
C) Bias will be high, variance will be low
D) Bias will be low, variance will be low
Question Context 23:
Which of the following is true about below graphs(A,B, C left to right) between the
cost function and Number of iterations?
23) Suppose l1, l2 and l3 are the three learning rates for A,B,C respectively. Which
of the following is true about l1,l2 and l3?
A) l2 < l1 < l3
B) l1 > l2 > l3
C) l1 = l2 = l3
D) None of these
Question Context 24-25:
We have been given a dataset with n records in which we have an input attribute x and an output attribute y. Suppose we use a linear regression method to model this data. To test our linear regressor, we split the data into a training set and a test set randomly.
24) Now we increase the training set size gradually. As the training set size
increases, what do you expect will happen with the mean training error?
A) Increase
B) Decrease
C) Remain constant
D) Can’t Say
25) What do you expect will happen with bias and variance as you increase the size
of training data?
A) Bias increases and Variance increases
B) Bias decreases and Variance increases
C) Bias decreases and Variance decreases
D) Bias increases and Variance decreases
E) Can’t Say
Question Context 26:
Consider the following data where one input(X) and one output(Y) is given.
26) What would be the root mean square training error for this data if you run a Linear Regression model of the form Y = A0 + A1*X?
A) Less than 0
B) Greater than zero
C) Equal to 0
D) None of these
Question Context 27-28:
Suppose you have been given the following scenario for training and validation error
for Linear Regression.
Scenario  Learning Rate  Number of Iterations  Training Error  Validation Error
1         0.1            1000                  100             110
2         0.2            600                   90              105
3         0.3            400                   110             110
4         0.4            300                   120             130
5         0.4            250                   130             150
27) Which of the following scenarios would give you the right hyperparameters?
A) 1
B) 2
C) 3
D) 4
28) Suppose you got the tuned hyperparameters from the previous question. Now imagine you add a variable to the feature space such that this added feature is important. Which of the following would you observe in such a case?
A) Training Error will decrease and Validation error will increase
B) Training Error will increase and Validation error will increase
C) Training Error will increase and Validation error will decrease
D) Training Error will decrease and Validation error will decrease
E) None of the above
Question Context 29-30:
Suppose you find yourself in a situation where your linear regression model is underfitting the data.
29) In such a situation, which of the following options would you consider?
1. Add more variables
2. Start introducing polynomial degree variables
3. Remove some variables
A) 1 and 2
B) 2 and 3
C) 1 and 3
D) 1, 2 and 3
30) The situation is the same as in the previous question (underfitting). Which of the following regularization algorithms would you prefer?
A) L1
B) L2
C) Any
D) None of these
Support Vector Machines
Question Context: 1 – 2
Suppose you are using a Linear SVM classifier for a 2-class classification problem. You have been given the following data, in which some points are circled red to indicate that they are support vectors.
1) If you remove any one of the red circled points from the data, will the decision boundary change?
A) Yes
B) No
2) [True or False] If you remove the non-red circled points from the data, the decision boundary will change.
A) True
B) False
3) What do you mean by generalization error in terms of the SVM?
A) How far the hyperplane is from the support vectors
B) How accurately the SVM can predict outcomes for unseen data
C) The threshold amount of error in an SVM
4) When the C parameter is set to infinity, which of the following holds true?
A) The optimal hyperplane if exists, will be the one that completely separates
the data
B) The soft-margin classifier will separate the data
C) None of the above
5) What do you mean by a hard margin?
A) The SVM allows very low error in classification
B) The SVM allows high amount of error in classification
C) None of the above
6) The minimum time complexity for training an SVM is O(n^2). According to this fact, what sizes of datasets are not best suited for SVMs?
A) Large datasets
B) Small datasets
C) Medium sized datasets
D) Size does not matter
7) The effectiveness of an SVM depends upon:
A) Selection of Kernel
B) Kernel Parameters
C) Soft Margin Parameter C
D) All of the above
8) Support vectors are the data points that lie closest to the decision surface.
A) TRUE
B) FALSE
9) SVMs are less effective when:
A) The data is linearly separable
B) The data is clean and ready to use
C) The data is noisy and contains overlapping points
10) Suppose you are using RBF kernel in SVM with high Gamma value. What does
this signify?
A) The model would consider even far away points from hyperplane for modeling
B) The model would consider only the points close to the hyperplane for
modeling
C) The model would not be affected by distance of points from hyperplane for
modeling
D) None of the above
11) The cost parameter in the SVM means:
A) The number of cross-validations to be made
B) The kernel to be used
C) The tradeoff between misclassification and simplicity of the model
D) None of the above
12) Suppose you are building an SVM model on data X. The data X can be error-prone, which means you should not trust any specific data point too much. Now suppose you want to build an SVM model with a quadratic kernel (polynomial of degree 2) that uses the slack variable C as one of its hyperparameters. Based on that, answer the following question.
What would happen when you use a very large value of C (C -> infinity)?
Note: Even for a small C, the model was classifying all data points correctly.
A) We can still classify data correctly for given setting of hyper parameter C
B) We can not classify data correctly for given setting of hyper parameter C
C) Can’t Say
D) None of these
13) What would happen when you use very small C (C~0)?
A) Misclassification would happen
B) Data will be correctly classified
C) Can’t say
D) None of these
14) If I am using all features of my dataset and I achieve 100% accuracy on my
training set, but ~70% on validation set, what should I look out for?
A) Underfitting
B) Nothing, the model is perfect
C) Overfitting
15) Which of the following are real world applications of the SVM?
A) Text and Hypertext Categorization
B) Image Classification
C) Clustering of News Articles
D) All of the above
Question Context: 16 – 18
Suppose you have trained an SVM with a linear decision boundary, and after training you correctly infer that your SVM model is underfitting.
16) Which of the following options would you be more likely to consider when iterating on the SVM next time?
A) You want to increase your data points
B) You want to decrease your data points
C) You will try to calculate more variables
D) You will try to reduce the features
17) Suppose you gave the correct answer in the previous question. What do you think is actually happening?
1. We are lowering the bias
2. We are lowering the variance
3. We are increasing the bias
4. We are increasing the variance
A) 1 and 2
B) 2 and 3
C) 1 and 4
D) 2 and 4
18) In the above question, suppose you want to change one of the SVM's hyperparameters so that the effect is the same as in the previous question, i.e. the model will not underfit. Which of the following would you do?
A) We will increase the parameter C
B) We will decrease the parameter C
C) Changing C doesn't have an effect
D) None of these
19) We usually use feature normalization before using the Gaussian kernel in SVM. What is true about feature normalization?
1. We do feature normalization so that the new feature will dominate the others
2. Sometimes, feature normalization is not feasible in the case of categorical variables
3. Feature normalization always helps when we use the Gaussian kernel in SVM
A) 1
B) 1 and 2
C) 1 and 3
D) 2 and 3
Question Context: 20-22
Suppose you are dealing with a 4-class classification problem and you want to train an SVM model on the data, for which you are using the one-vs-all method. Now answer the questions below.
20) How many times do we need to train our SVM model in such a case?
A) 1
B) 2
C) 3
D) 4
21) Suppose you have the same distribution of classes in the data. Now, say that training the SVM once in the one-vs-all setting takes 10 seconds. How many seconds would it take to train the one-vs-all method end to end?
A) 20
B) 40
C) 60
D) 80
22) Suppose your problem has changed and the data now has only 2 classes. How many times would we need to train the SVM in such a case?
A) 1
B) 2
C) 3
D) 4
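A minimal sketch (assuming scikit-learn and NumPy) of why one-vs-all needs one binary model per class: with 4 classes we train 4 SVMs, so at 10 seconds each the end-to-end time is 40 seconds, while a 2-class problem needs only a single binary SVM. The toy data here is purely illustrative:

import numpy as np
from sklearn.svm import SVC

def one_vs_all_fit(X, y, classes):
    # Train one binary SVM per class: class c vs. the rest.
    models = {}
    for c in classes:
        clf = SVC(kernel="linear")
        clf.fit(X, (y == c).astype(int))
        models[c] = clf
    return models

np.random.seed(0)
X = np.random.randn(40, 3)
y = np.random.randint(0, 4, size=40)            # a 4-class toy problem
models = one_vs_all_fit(X, y, classes=range(4))
print(len(models))                              # 4 trainings in total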
Question context: 23 – 24
Suppose you are using an SVM with a polynomial kernel of degree 2. You have applied this to your data and found that it perfectly fits the data, i.e. training and testing accuracy is 100%.
23) Now suppose you increase the complexity (the degree of the polynomial kernel). What do you think will happen?
A) Increasing the complexity will overfit the data
B) Increasing the complexity will underfit the data
C) Nothing will happen since your model was already 100% accurate
D) None of these
24) In the previous question, after increasing the complexity you found that the training accuracy was still 100%. What do you think is the reason behind that?
1. Since the data is fixed and we are fitting more polynomial terms (parameters), the algorithm starts memorizing everything in the data
2. Since the data is fixed, the SVM doesn't need to search a big hypothesis space
A) 1
B) 2
C) 1 and 2
D) None of these
25) What is/are true about kernels in SVM?
1. Kernel functions map low-dimensional data to a high-dimensional space
2. A kernel is a similarity function
A) 1
B) 2
C) 1 and 2
D) None of these
Dimensionality Reduction techniques
1) Imagine you have 1000 input features and 1 target feature in a machine learning problem. You have to select the 100 most important features based on the relationship between the input features and the target feature.
Do you think this is an example of dimensionality reduction?
A. Yes
B. No
2) [ True or False ] It is not necessary to have a target variable for applying
dimensionality reduction algorithms.
A. TRUE
B. FALSE
3) I have 4 variables in the dataset such as – A, B, C & D. I have performed the
following actions:
Step 1: Using the above variables, I have created two more variables, namely E = A
+ 3 * B and F = B + 5 * C + D.
Step 2: Then using only the variables E and F I have built a Random Forest model.
Could the steps performed above represent a dimensionality reduction method?
A. True
B. False
4) Which of the following techniques would perform better for reducing dimensions of
a data set?
A. Removing columns which have too many missing values
B. Removing columns which have high variance in data
C. Removing columns with dissimilar data trends
D. None of these
5) [ True or False ] Dimensionality reduction algorithms are one of the possible ways
to reduce the computation time required to build a model.
A. TRUE
B. FALSE
6) Which of the following algorithms cannot be used for reducing the dimensionality
of data?
A. t-SNE
B. PCA
C. LDA
D. None of these
7) [ True or False ] PCA can be used for projecting and visualizing data in lower
dimensions.
A. TRUE
B. FALSE
8) The most popularly used dimensionality reduction algorithm is Principal
Component Analysis (PCA). Which of the following is/are true about PCA?
1. PCA is an unsupervised method
2. It searches for the directions that data have the largest variance
3. Maximum number of principal components <= number of features
4. All principal components are orthogonal to each other
A. 1 and 2
B. 1 and 3
C. 2 and 3
D. 1, 2 and 3
E. 1,2 and 4
F. All of the above
9) Suppose we are using dimensionality reduction as pre-processing technique, i.e,
instead of using all the features, we reduce the data to k dimensions with PCA. And
then use these PCA projections as our features. Which of the following statement is
correct?
A. Higher ‘k’ means more regularization
B. Higher ‘k’ means less regularization
C. Can’t Say
10) In which of the following scenarios is t-SNE better to use than PCA for
dimensionality reduction while working on a local machine with minimal
computational power?
A. Dataset with 1 Million entries and 300 features
B. Dataset with 100000 entries and 310 features
C. Dataset with 10,000 entries and 8 features
D. Dataset with 10,000 entries and 200 features
11) Which of the following statement is true for a t-SNE cost function?
A. It is asymmetric in nature.
B. It is symmetric in nature.
C. It is same as the cost function for SNE.
12) Imagine you are dealing with text data. To represent the words you are using word embeddings (Word2vec), ending up with 1000 dimensions. Now you want to reduce the dimensionality of this high-dimensional data such that similar words remain close in nearest-neighbor space. In such a case, which of the following algorithms are you most likely to choose?
A. t-SNE
B. PCA
C. LDA
D. None of these
t-SNE stands for t-Distributed Stochastic Neighbor Embedding, which considers the nearest neighbours when reducing the data.
13) [True or False] t-SNE learns non-parametric mapping.
A. TRUE
B. FALSE
14) Which of the following statement is correct for t-SNE and PCA?
A. t-SNE is linear whereas PCA is non-linear
B. t-SNE and PCA both are linear
C. t-SNE and PCA both are nonlinear
D. t-SNE is nonlinear whereas PCA is linear
15) In t-SNE algorithm, which of the following hyper parameters can be tuned?
A. Number of dimensions
B. Smooth measure of effective number of neighbours
C. Maximum number of iterations
D. All of the above
16) Which of the following statements is true about t-SNE in comparison to PCA?
A. When the data is huge (in size), t-SNE may fail to produce better results.
B. t-SNE always produces better results regardless of the size of the data.
C. PCA always performs better than t-SNE for smaller-sized data.
D. None of these
17) Xi and Xj are two distinct points in the higher-dimensional representation, whereas Yi and Yj are the representations of Xi and Xj in a lower dimension.
1. The similarity of datapoint Xi to datapoint Xj is the conditional probability p(j|i).
2. The similarity of datapoint Yi to datapoint Yj is the conditional probability q(j|i).
Which of the following must be true for a perfect representation of Xi and Xj in the lower-dimensional space?
A. p(j|i) = 0 and q(j|i) = 1
B. p(j|i) < q(j|i)
C. p(j|i) = q(j|i)
D. p(j|i) > q(j|i)
18) Which of the following is true about LDA?
A. LDA aims to maximize the between-class distance and minimize the within-class distance
B. LDA aims to minimize both the between-class and within-class distances
C. LDA aims to minimize the between-class distance and maximize the within-class distance
D. LDA aims to maximize both the between-class and within-class distances
19) In which of the following cases will LDA fail?
A. If the discriminatory information is not in the mean but in the variance of the
data
B. If the discriminatory information is in the mean but not in the variance of the data
C. If the discriminatory information is in the mean and variance of the data
D. None of these
20) Which of the following comparison(s) are true about PCA and LDA?
1. Both LDA and PCA are linear transformation techniques
2. LDA is supervised whereas PCA is unsupervised
3. PCA maximizes the variance of the data, whereas LDA maximizes the separation between different classes
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. Only 3
E. 1, 2 and 3
21) What will happen when eigenvalues are roughly equal?
A. PCA will perform outstandingly
B. PCA will perform badly
C. Can’t Say
D. None of the above
22) PCA works better if there is:
1. A linear structure in the data
2. If the data lies on a curved surface and not on a flat surface
3. If variables are scaled in the same unit
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. 1 ,2 and 3
23) What happens when you get features in lower dimensions using PCA?
1. The features will still have interpretability
2. The features will lose interpretability
3. The features must carry all information present in data
4. The features may not carry all information present in data
A. 1 and 3
B. 1 and 4
C. 2 and 3
D. 2 and 4
24) Imagine you are given the following scatterplot between height and weight. Select the angle that will capture the maximum variability along a single axis.
A. ~ 0 degree
B. ~ 45 degree
C. ~ 60 degree
D. ~ 90 degree
Option B captures the largest possible variance in the data.
25) Which of the following option(s) is / are true?
1. You need to initialize parameters in PCA
2. You don’t need to initialize parameters in PCA
3. PCA can be trapped into local minima problem
4. PCA can’t be trapped into local minima problem
A. 1 and 3
B. 1 and 4
C. 2 and 3
D. 2 and 4
Question Context 26
The below snapshot shows the scatter plot of two features (X1 and X2) with the
class information (Red, Blue). You can also see the direction of PCA and LDA.
26) Which of the following method would result into better class prediction?
A. Building a classification algorithm with PCA (A principal component in direction of
PCA)
B. Building a classification algorithm with LDA
C. Can’t say
D. None of these
27) Which of the following options are correct when you are applying PCA on an image dataset?
1. It can be used to effectively detect deformable objects.
2. It is invariant to affine transforms.
3. It can be used for lossy image compression.
4. It is not invariant to shadows.
A. 1 and 2
B. 2 and 3
C. 3 and 4
D. 1 and 4
28) Under which condition do SVD and PCA produce the same projection result?
A. When data has zero median
B. When data has zero mean
C. Both are always same
D. None of these
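A quick sketch of this answer (assuming NumPy): after subtracting the mean (centering), the top right singular vector from SVD coincides, up to sign, with the top covariance eigenvector that PCA uses:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) + 5.0        # data with a non-zero mean
Xc = X - X.mean(axis=0)                    # center the data: zero mean

# PCA direction: top eigenvector of the covariance of the centered data
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
pc_pca = eigvecs[:, -1]

# SVD of the centered matrix: top right singular vector
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc_svd = Vt[0]

print(np.isclose(abs(pc_pca @ pc_svd), 1.0))  # True: same direction up to sign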
Question Context 29
Consider 3 data points in the 2-d space: (-1, -1), (0,0), (1,1).
29) What will be the first principal component for this data?
1. (√2/2, √2/2)
2. (1/√3, 1/√3)
3. (−√2/2, √2/2)
4. (−1/√3, −1/√3)
A. 1 and 2
B. 3 and 4
C. 1 and 3
D. 2 and 4
30) If we project the original data points onto the 1-d subspace spanned by the principal component [√2/2, √2/2]^T, what are their coordinates in the 1-d subspace?
A. (−√2), (0), (√2)
B. (√2), (0), (√2)
C. (√2), (0), (−√2)
D. (−√2), (0), (−√2)
31) For the projected data, you just obtained the projections (−√2, 0, √2). If we now represent them in the original 2-d space and consider them as reconstructions of the original data points, what is the reconstruction error?
A. 0%
B. 10%
C. 30%
D. 40%
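A compact NumPy sketch checking questions 29-31: because the three points lie exactly on the line y = x, the first component is [√2/2, √2/2] (up to sign), the 1-d projections are (−√2, 0, √2), and the reconstruction error is 0%:

import numpy as np

X = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])
Xc = X - X.mean(axis=0)            # already zero-mean for these points

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc = Vt[0]                         # first principal component
print(pc)                          # ~[0.7071, 0.7071], i.e. [sqrt(2)/2, sqrt(2)/2] up to sign

proj = Xc @ pc                     # 1-d coordinates: ~[-1.414, 0, 1.414]
recon = np.outer(proj, pc)         # map back into the original 2-d space
print(np.abs(recon - Xc).max())    # ~0.0 -> zero reconstruction error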
32) In LDA, the idea is to find the line that best separates the two classes. In the given image, which of the following is a good projection?
A. LD1
B. LD2
C. Both
D. None of these
Question Context 33
PCA is a good technique to try, because it is simple to understand and is commonly used to reduce the dimensionality of the data. Obtain the eigenvalues λ1 ≥ λ2 ≥ ... ≥ λN and plot f(M), the fraction of variance captured by the first M principal components, f(M) = (λ1 + ... + λM) / (λ1 + ... + λN), to see how f(M) increases with M and takes its maximum value 1 at M = D. We have two graphs given below:
33) Which of the above graphs shows better performance of PCA, where M is the number of principal components used and D is the total number of features?
A. Left
B. Right
C. Any of A and B
D. None of these
34) Which of the following option is true?
A. LDA explicitly attempts to model the difference between the classes of data.
PCA on the other hand does not take into account any difference in class.
B. Both attempt to model the difference between the classes of data.
C. PCA explicitly attempts to model the difference between the classes of data. LDA
on the other hand does not take into account any difference in class.
D. Both don’t attempt to model the difference between the classes of data.
35) Which of the following can be the first 2 principal components after applying
PCA?
1. (0.5, 0.5, 0.5, 0.5) and (0.71, 0.71, 0, 0)
2. (0.5, 0.5, 0.5, 0.5) and (0, 0, -0.71, -0.71)
3. (0.5, 0.5, 0.5, 0.5) and (0.5, 0.5, -0.5, -0.5)
4. (0.5, 0.5, 0.5, 0.5) and (-0.5, -0.5, 0.5, 0.5)
A. 1 and 2
B. 1 and 3
C. 2 and 4
D. 3 and 4
36) Which of the following gives the difference(s) between the logistic regression and
LDA?
1. If the classes are well separated, the parameter estimates for logistic
regression can be unstable.
2. If the sample size is small and the distribution of features is normal for each class, then linear discriminant analysis is more stable than logistic regression.
A. 1
B. 2
C. 1 and 2
D. None of these
37) Which of the following offsets do we consider in PCA?
A. Vertical offset
B. Perpendicular offset
C. Both
D. None of these
38) Imagine you are dealing with a 10-class classification problem and you want to know at most how many discriminant vectors LDA can produce. What is the correct answer?
A. 20
B. 9
C. 21
D. 11
E. 10
Question Context 39
The given dataset consists of images of the "Hoover Tower" and some other towers. Now you want to use PCA (Eigenface) and the nearest-neighbour method to build a classifier that predicts whether a new image depicts the "Hoover Tower" or not. The figure shows a sample of your input training images.
39) In order to get reasonable performance from the “Eigenface” algorithm, what pre-
processing steps will be required on these images?
1. Align the towers in the same position in the image.
2. Scale or crop all images to the same size.
A. 1
B. 2
C. 1 and 2
D. None of these
40) What is the optimum number of principal components in the below figure?
A. 7
B. 30
C. 40
D. Can’t Say
Ensemble Learning
1) Which of the following is/are true about bagging trees?
1. In bagging trees, individual trees are independent of each other
2. Bagging is the method for improving the performance by aggregating the
results of weak learners
A) 1
B) 2
C) 1 and 2
D) None of these
2) Which of the following is/are true about boosting trees?
1. In boosting trees, individual weak learners are independent of each other
2. It is the method for improving the performance by aggregating the results of
weak learners
A) 1
B) 2
C) 1 and 2
D) None of these
3) Which of the following is/are true about Random Forest and Gradient Boosting ensemble methods?
1. Both methods can be used for classification tasks
2. Random Forest is used for classification whereas Gradient Boosting is used for regression tasks
3. Random Forest is used for regression whereas Gradient Boosting is used for classification tasks
4. Both methods can be used for regression tasks
4. Both methods can be used for regression task
A) 1
B) 2
C) 3
D) 4
E) 1 and 4
4) In Random Forest you can generate hundreds of trees (say T1, T2, ..., Tn) and then aggregate the results of these trees. Which of the following is true about an individual tree (Tk) in Random Forest?
1. An individual tree is built on a subset of the features
2. An individual tree is built on all the features
3. An individual tree is built on a subset of observations
4. An individual tree is built on the full set of observations
A) 1 and 3
B) 1 and 4
C) 2 and 3
D) 2 and 4
5) Which of the following is true about the "max_depth" hyperparameter in Gradient Boosting?
1. Lower is better in the case of the same validation accuracy
2. Higher is better in the case of the same validation accuracy
3. Increasing the value of max_depth may overfit the data
4. Increasing the value of max_depth may underfit the data
A) 1 and 3
B) 1 and 4
C) 2 and 3
D) 2 and 4
6) Which of the following algorithms doesn't use learning rate as one of its hyperparameters?
1. Gradient Boosting
2. Extra Trees
3. AdaBoost
4. Random Forest
A) 1 and 3
B) 1 and 4
C) 2 and 3
D) 2 and 4
7) Suppose you are given the following graph, which shows the ROC curves for two different classification algorithms: Random Forest (red) and Logistic Regression (blue). Which algorithm would you take into consideration for your final model building on the basis of performance?
A) Random Forest
B) Logistic Regression
C) Both of the above
D) None of these
8) Suppose you want to apply the AdaBoost algorithm on data D which has T observations. You initially set half the data for training and half for testing. Now you want to increase the number of training data points T1, T2, ..., Tn where T1 < T2 < ... < Tn-1 < Tn. Which of the following is true about training and testing error in this case?
A) The difference between training error and test error increases as number of
observations increases
B) The difference between training error and test error decreases as number of
observations increases
C) The difference between training error and test error will not change
D) None of These
9) In random forest or gradient boosting algorithms, features can be of any type. For
example, it can be a continuous feature or a categorical feature. Which of the
following option is true when you consider these types of features?
A) Only Random forest algorithm handles real valued attributes by discretizing them
B) Only Gradient boosting algorithm handles real valued attributes by discretizing
them
C) Both algorithms can handle real valued attributes by discretizing them
D) None of these
10) Which of the following algorithms is not an example of an ensemble learning algorithm?
A) Random Forest
B) Adaboost
C) Extra Trees
D) Gradient Boosting
E) Decision Trees
11) Suppose you are using a bagging-based algorithm, say Random Forest, in model building. Which of the following can be true?
1. The number of trees should be as large as possible
2. You will have interpretability after using Random Forest
A) 1
B) 2
C) 1 and 2
D) None of these
Context 12-15
Consider the following figure for answering the next few questions. In the figure, X1 and X2 are the two features and the data points are represented by dots (−1 is the negative class and +1 is the positive class). You first split the data based on feature X1 (say the splitting point is x11), which is shown in the figure using a vertical line. Every value less than x11 will be predicted as the positive class and every value greater than x11 as the negative class.
12) How many data points are misclassified in above image?
A) 1
B) 2
C) 3
D) 4
13) Which of the following splitting points on feature X1 will classify the data correctly?
A) Greater than x11
B) Less than x11
C) Equal to x11
D) None of above
Solution: D
14) If you consider only feature X2 for splitting, can you now perfectly separate the positive class from the negative class with any single split on X2?
A) Yes
B) No
15) Now consider one split on each feature (one on X1 and one on X2). You can split each feature at any point. Would you be able to classify all data points correctly?
A) TRUE
B) FALSE
Context 16-17
Suppose you are working on a binary classification problem with 3 input features, and you chose to apply a bagging algorithm (X) on this data. You chose max_features = 2 and n_estimators = 3. Now assume that each estimator has 70% accuracy.
Note: Algorithm X aggregates the results of the individual estimators by majority voting.
16) What will be the maximum accuracy you can get?
A) 70%
B) 80%
C) 90%
D) 100%
Actual  M1  M2  M3  Output
1       1   0   1   1
1       1   0   1   1
1       1   0   1   1
1       0   1   1   1
1       0   1   1   1
1       0   1   1   1
1       1   1   1   1
1       1   1   0   1
1       1   1   0   1
1       1   1   0   1
17) What will be the minimum accuracy you can get?
A) Always greater than 70%
B) Always greater than and equal to 70%
C) It can be less than 70%
D) None of these
Actual  M1  M2  M3  Output
1       1   0   0   0
1       1   1   1   1
1       1   0   0   0
1       0   1   0   0
1       0   1   1   1
1       0   0   1   0
1       1   1   1   1
1       1   1   1   1
1       1   1   1   1
1       1   1   1   1
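A short NumPy sketch reproducing both tables: each individual estimator is 70% accurate, yet majority voting reaches 100% in the best arrangement of errors and drops to 60% in the worst:

import numpy as np

def majority_vote_accuracy(actual, preds):
    # preds: one row per sample, one column per estimator
    votes = (preds.sum(axis=1) > preds.shape[1] / 2).astype(int)
    return (votes == actual).mean()

actual = np.ones(10, dtype=int)

best = np.array([[1, 0, 1]] * 3 + [[0, 1, 1]] * 3 +
                [[1, 1, 1]] + [[1, 1, 0]] * 3)
worst = np.array([[1, 0, 0], [1, 1, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1],
                  [0, 0, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1]])

print(majority_vote_accuracy(actual, best))   # 1.0 -> 100%
print(majority_vote_accuracy(actual, worst))  # 0.6 -> below 70%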
18) Suppose you are building a random forest model, which splits a node on the attribute that has the highest information gain. In the below image, select the attribute which has the highest information gain.
A) Outlook
B) Humidity
C) Windy
D) Temperature
19) Which of the following is true about Gradient Boosting trees?
1. In each stage, a new regression tree is introduced to compensate for the shortcomings of the existing model
2. We can use the gradient descent method to minimize the loss function
A) 1
B) 2
C) 1 and 2
D) None of these
20) True-False: Bagging is suitable for high-variance, low-bias models.
A) TRUE
B) FALSE
21) Which of the following is true when you choose the fraction of observations for building the base learners in a tree-based algorithm?
A) Decreasing the fraction of samples used to build the base learners will result in a decrease in variance
B) Decreasing the fraction of samples used to build the base learners will result in an increase in variance
C) Increasing the fraction of samples used to build the base learners will result in a decrease in variance
D) Increasing the fraction of samples used to build the base learners will result in an increase in variance
Context 22-23
Suppose you are building a Gradient Boosting model on data which has millions of observations and thousands of features. Before building the model you want to compare different parameter settings in terms of training time.
22) Consider the hyperparameter "number of trees" and arrange the options in terms of the time taken by each setting to build the Gradient Boosting model.
Note: the remaining hyperparameters are the same
1. Number of trees = 100
2. Number of trees = 500
3. Number of trees = 1000
A) 1~2~3
B) 1<2<3
C) 1>2>3
D) None of these
23) Now consider the learning-rate hyperparameter and arrange the options in terms of the time taken by each setting to build the Gradient Boosting model.
Note: the remaining hyperparameters are the same
1. learning rate = 1
2. learning rate = 2
3. learning rate = 3
A) 1~2~3
B) 1<2<3
C) 1>2>3
D) None of these
24) In gradient boosting, it is important to use the learning rate to get optimal output. Which of the following is true about choosing the learning rate?
A) Learning rate should be as high as possible
B) Learning Rate should be as low as possible
C) Learning Rate should be low but it should not be very low
D) Learning rate should be high but it should not be very high
25) [True or False] Cross validation can be used to select the number of iterations in
boosting; this procedure may help reduce overfitting.
A) TRUE
B) FALSE
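A sketch of this procedure (assuming scikit-learn), using cross-validation to pick the number of boosting iterations; the dataset and grid values here are purely illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200, 400]},  # boosting iterations
    cv=5,
)
search.fit(X, y)
print(search.best_params_)  # the iteration count with the best CV score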
26) When you use a boosting algorithm you always consider weak learners. Which of the following is the main reason for having weak learners?
1. To prevent overfitting
2. To prevent under fitting
A) 1
B) 2
C) 1 and 2
D) None of these
27) To apply bagging to regression trees, which of the following is/are true in such a case?
1. We build N regression trees with N bootstrap samples
2. We take the average of the N regression trees
3. Each tree has high variance with low bias
A) 1 and 2
B) 2 and 3
C) 1 and 3
D) 1,2 and 3
28) How do we select the best hyperparameters in tree-based models?
A) Measure performance over training data
B) Measure performance over validation data
C) Both of these
D) None of these
29) In which of the following scenarios is gain ratio preferred over Information Gain?
A) When a categorical variable has a very large number of categories
B) When a categorical variable has a very small number of categories
C) The number of categories is not the reason
D) None of these
30) Suppose you are given the following training and validation error scenarios for Gradient Boosting. Which of the following hyperparameter settings would you choose in such a case?
Scenario  Depth  Training Error  Validation Error
1         2      100             110
2         4      90              105
3         6      50              100
4         8      45              105
5         10     30              150
A) 1
B) 2
C) 3
D) 4
Q1) The data scientists at "BigMart Inc" have collected 2013 sales data for 1559 products across 10 stores in different cities. Certain attributes of each product and store have also been defined. The aim is to build a predictive model and find the sales of each product at a particular store during a defined period.
Which learning problem does this belong to?
A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
D. None
Q2) Before building our model, we first look at our data and make predictions
manually. Suppose we have only one feature as an independent variable
(Outlet_Location_Type) along with a continuous dependent variable
(Item_Outlet_Sales).
Outlet_Location_Type Item_Outlet_Sales
Tier 1 3735.14
Tier 3 443.42
Tier 1 2097.27
Tier 3 732.38
Tier 3 994.71
We see that we can possibly differentiate in Sales based on location (tier 1 or tier 3).
We can write simple if-else statements to make predictions.
Which of the following models could be used to generate predictions (may not be
most accurate)?
A. if “Outlet_Location” is “Tier 1”: then “Outlet_Sales” is 2000, else
“Outlet_Sales” is 1000
B. if “Outlet_Location” is “Tier 1”: then “Outlet_Sales” is 1000, else
“Outlet_Sales” is 2000
C. if “Outlet_Location” is “Tier 3”: then “Outlet_Sales” is 500, else “Outlet_Sales”
is 5000
D. Any of the above
Q3) The below created if-else statement is called a decision stump:
Our model: if “Outlet_Location” is “Tier 1”: then “Outlet_Sales” is 2000, else
“Outlet_Sales” is 1000
Now let us evaluate the model we created above on following data:
Evaluation Data:
Outlet_Location_Type Item_Outlet_Sales
Tier 1 3735.1380
Tier 3 443.4228
Tier 1 2097.2700
Tier 3 732.3800
Tier 3 994.7052
We will calculate RMSE to evaluate this model.
The root-mean-square error (RMSE) is a measure of the differences between the values predicted by a model or an estimator and the values actually observed.
The formula is:
rmse = sqrt(sum(square(predicted_values - actual_values)) / number_of_observations)
What would be the RMSE value for this model?
A. ~23
B. ~824
C. ~680318
D. ~2152
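The calculation can be checked directly in Python on the evaluation data above; note that ~680318 (option C) is the mean squared error before the square root is taken:

from math import sqrt

actual = [3735.1380, 443.4228, 2097.2700, 732.3800, 994.7052]
tiers = ["Tier 1", "Tier 3", "Tier 1", "Tier 3", "Tier 3"]

# Our decision stump: Tier 1 -> 2000, otherwise -> 1000
predicted = [2000 if t == "Tier 1" else 1000 for t in tiers]

mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
print(mse, sqrt(mse))  # ~680318 and ~824.8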
Q4) For the same data, let us evaluate our models. The root-mean-square error
(RMSE) is a measure of the differences between values predicted by a model or an
estimator and the values actually observed.
Outlet_Location_Type Item_Outlet_Sales
Tier 1 3735.1380
Tier 3 443.4228
Tier 1 2097.2700
Tier 3 732.3800
Tier 3 994.7052
The formula is:
rmse = sqrt(sum(square(predicted_values - actual_values)) / num_samples)
Which of the following will be the best model with respect to RMSE scoring?
A. if “Outlet_Location_Type” is “Tier 1”: then “Outlet_Sales” is 2000, else
“Outlet_Sales” is 1000
B. if “Outlet_Location_Type” is “Tier 1”: then “Outlet_Sales” is 1000, else
“Outlet_Sales” is 2000
C. if “Outlet_Location_Type” is “Tier 3”: then “Outlet_Sales” is 500, else
“Outlet_Sales” is 5000
D. if “Outlet_Location_Type” is “Tier 3”: then “Outlet_Sales” is 2000, else
“Outlet_Sales” is 200
Q5) Now let’s take multiple features into account.
Outlet_Location_Type  Outlet_Type        Item_Outlet_Sales
Tier 1                Supermarket Type1  3735.1380
Tier 3                Supermarket Type2  443.4228
Tier 1                Supermarket Type1  2097.2700
Tier 3                Grocery Store      732.3800
Tier 3                Supermarket Type1  994.7052
If we have multiple if-else ladders, which model is best with respect to RMSE?
A] if “Outlet_Location_Type” is 'Tier 1':
return 2500
else:
if “Outlet_Type” is 'Supermarket Type1':
return 1000
elif “Outlet_Type” is 'Supermarket Type2':
return 400
else:
return 700
B] if "Outlet_Location_Type" is 'Tier 3':
return 2500
else:
if "Outlet_Type" is 'Supermarket Type1':
return 1000
elif "Outlet_Type" is 'Supermarket Type2':
return 400
else:
return 700
C ] if "Outlet_Location_Type" is 'Tier 3':
return 3000
else:
if "Outlet_Type" is 'Supermarket Type1':
return 1000
else:
return 500
D ] if "Outlet_Location_Type" is 'Tier 1':
return 3000
else:
if "Outlet_Type" is 'Supermarket Type1':
return 1000
else:
return 450
Solution: D
A. RMSE value: 581.50
B. RMSE value: 1913.36
C. RMSE value: 2208.36
D. RMSE value: 535.75
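These RMSE values can be reproduced with a short script; a sketch for model D
(the ladder logic is transcribed directly from the option above, and repeating it for
A-C gives the other three values):
import math
data = [("Tier 1", "Supermarket Type1", 3735.1380),
        ("Tier 3", "Supermarket Type2", 443.4228),
        ("Tier 1", "Supermarket Type1", 2097.2700),
        ("Tier 3", "Grocery Store", 732.3800),
        ("Tier 3", "Supermarket Type1", 994.7052)]
def model_d(loc, outlet):
    if loc == "Tier 1":
        return 3000
    return 1000 if outlet == "Supermarket Type1" else 450
preds = [model_d(loc, out) for loc, out, _ in data]
mse = sum((p - y) ** 2 for p, (_, _, y) in zip(preds, data)) / len(data)
print(round(math.sqrt(mse), 2))  # 535.75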
Q6) Till now, we have just created predictions using some intuition-based rules, so
our predictions may not be optimal. What could be done to optimize the approach of
finding better predictions from the given data?
A. Put predictions which are the sum of all the actual values of samples present.
For example, in “Tier 1”, we have two values 3735.1380 and 2097.2700, so
we will take ~5832 as our prediction
B. Put predictions which are the difference of all the actual values of samples
present. For example, in “Tier 1”, we have two values 3735.1380 and
2097.2700, so we will take ~1638 as our prediction
C. Put predictions which are mean of all the actual values of samples
present. For example, in “Tier 1”, we have two values 3735.1380 and
2097.2700, so we will take ~2916 as our prediction
Q7) We could improve our model by selecting the feature which gives a better
prediction when we use it for splitting (splitting is the process of dividing a node into
two or more sub-nodes).
Outlet_Location_Type Item_Fat_Content Item_Outlet_Sales
Tier 1 Low Fat 3735.1380
Tier 3 Regular 443.4228
Tier 1 Low Fat 2097.2700
Tier 3 Regular 732.3800
Tier 3 Low Fat 994.7052
In this example, we want to find which feature would be better for splitting root node
(entire population or sample and this further gets divided into two or more
homogeneous sets).
Assume splitting method is “Reduction in Variance” i.e. we split using a variable,
which results in overall lower variance.
What is the resulting variance if we split using Outlet_Location_Type?
A. ~298676
B. ~298676
C. ~3182902
D. ~2222733
E. None of these
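A short sketch of the computation: take the population variance of Item_Outlet_Sales
within each child node and weight it by node size, the usual “Reduction in Variance”
bookkeeping:
sales = {"Tier 1": [3735.1380, 2097.2700],
         "Tier 3": [443.4228, 732.3800, 994.7052]}
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
n_total = sum(len(v) for v in sales.values())
weighted = sum(len(v) / n_total * variance(v) for v in sales.values())
print(round(weighted))  # ~298676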
Q8) Next, we want to find which feature would be better for splitting root node (where
root node represents entire population). For this, we will set “Reduction in Variance”
as our splitting method.
Outlet_Location_Type Item_Fat_Content Item_Outlet_Sales
Tier 1 Low Fat 3735.1380
Tier 3 Regular 443.4228
Tier 1 Low Fat 2097.2700
Tier 3 Regular 732.3800
Tier 3 Low Fat 994.7052
The split with lower variance is selected as the criterion to split the population.
Between Outlet_Location_Type and Item_Fat_Content, which is the better feature to
split on?
A. Outlet_Location_Type
B. Item_Fat_Content
C. will not split on both
Q9) Look at the below image: The red dots represent original data input, while the
green line is the resultant model.
How do you propose to make this model better while working with decision tree?
A. Let it be. The model is general enough
B. Set the number of nodes in the tree beforehand so that it does not overdo its
task
C. Build a decision tree model, use cross validation method to tune tree
parameters
D. Both B and C
E. All A, B and C
F. None of these
Q10) Which methodology does Decision Tree (ID3) take to decide on first split?
A. Greedy approach
B. Look-ahead approach
C. Brute force approach
D. None of these
Q11) There are 24 predictors in a dataset. You build 2 models on the dataset:
1. Bagged decision trees and
2. Random forest
Let the number of predictors used at a single split in the bagged decision tree be A
and in the Random Forest be B.
Which of the following statement is correct?
A. A >= B
B. A < B
C. A >> B
D. Cannot be said since different iterations use different numbers of predictors
Q12) Why do we prefer information gain over accuracy when splitting?
A. Decision Tree is prone to overfit and accuracy doesn’t help to generalize
B. Information gain is more stable as compared to accuracy
C. Information gain chooses more impactful features closer to root
D. All of these
Q13) Random forests (while solving a regression problem) have higher variance in
their predicted results compared to Boosted Trees (assumption: both the Random
Forest and the Boosted Trees are fully optimized).
A. True
B. False
C. Cannot be determined
Q14) Assume everything else remains same, which of the following is the right
statement about the predictions from decision tree in comparison with predictions
from Random Forest?
A. Lower Variance, Lower Bias
B. Lower Variance, Higher Bias
C. Higher Variance, Higher Bias
D. Lower Bias, Higher Variance
Q15) Which of the following tree based algorithm uses some parallel (full or partial)
implementation?
A. Random Forest
B. Gradient Boosted Trees
C. XGBOOST
D. Both A and C
E. A, B and C
Q16) Which of the following could not be the result of partitioning a two-dimensional
feature space by natural recursive binary splits?
A. 1 only
B. 2 only
C. 1 and 2
D. None
Q17) Which of the following is not possible in a boosting algorithm?
A. Increase in training error.
B. Decrease in training error
C. Increase in testing error
D. Decrease in testing error
E. Any of the above
Q18) Which of the following is a decision boundary of Decision Tree?
A. B
B. A
C. D
D. C
E. Can’t Say
Q19) Let’s say we have m numbers of estimators (trees) in a boosted tree. Now, how
many intermediate trees will work on modified version (OR weighted) of data set?
A. 1
B. m-1
C. m
D. Can’t say
E. None of the above
Q20) Boosted decision trees perform better than Logistic Regression on anomaly
detection problems (Imbalanced Class problems).
A. True, because they give more weight for lesser weighted class in
successive rounds
B. False, because boosted trees are based on Decision Tree, which will try to
overfit the data
Q21) Provided n < N and m < M. A Bagged Decision Tree with a dataset of N rows
and M columns uses____rows and ____ columns for training an individual
intermediate tree.
A. N, M
B. N, M
C. n, M
D. n, m
Q22) Given 1000 observations, Minimum observation required to split a node equals
to 200 and minimum leaf size equals to 300 then what could be the maximum depth
of a decision tree?
A. 1
B. 2
C. 3
D. 4
E. 5
With a minimum of 200 observations required to split a node and a minimum leaf
size of 300, the tree is complete after only 2 splits. Therefore the depth is 2.
Q23) Consider a classification tree for whether a person watches ‘Game of Thrones’
based on features like age, gender, qualification and salary. Is it possible to have
following leaf node?
A. Yes
B. No
C. Can’t say
Q24) Generally, in terms of prediction performance which of the following
arrangements are correct:
A. Bagging>Boosting>Random Forest>Single Tree
B. Boosting>Random Forest>Single Tree>Bagging
C. Boosting>Random Forest>Bagging>Single Tree
D. Boosting >Bagging>Random Forest>Single Tree
Q25) In which of the following application(s), a tree based algorithm can be applied
successfully?
A. Recognizing moving hand gestures in real time
B. Predicting next move in a chess game
C. Predicting sales values of a company based on their past sales
D. A and B
E. A, B, and C
Q26) When using Random Forest for feature selection, suppose you permute values
of two features – A and B. Permutation is such that you change the indices of
individual values so that they do not remain associated with the same target as
before.
For example:
You notice that permuting values does not affect the score of the model built on A,
whereas the score decreases for the model trained on B. Which of the following
features would you select solely based on the above finding?
A. (A)
B. (B)
Q27) Boosting is said to be a good classifier because:
A. It creates all ensemble members in parallel, so their diversity can be boosted.
B. It attempts to minimize the margin distribution
C. It attempts to maximize the margins on the training data
D. None of these
Q28) Which splitting algorithm is better with categorical variable having high
cardinality?
A. Information Gain
B. Gain Ratio
C. Change in Variance
D. None of these
Q29) There are “A” features in a dataset and a Random Forest model is built over it.
It is given that there exists only one significant feature of the outcome – “Feature1”.
What would be the % of total splits that will not consider the “Feature1” as one of the
features involved in that split (It is given that m is the number of maximum features
for random forest)?
Note: Considering random forest select features space for every node split.
A. (A-m)/A
B. (m-A)/m
C. m/A
D. Cannot be determined
Q30) Suppose we have missing values in our data. Which of the following method(s)
can help us to deal with missing values while building a decision tree?
A. Let it be. Decision Trees are not affected by missing values
B. Fill dummy value in place of missing, such as -1
C. Impute missing value with mean/median
D. All of these
Q31) To reduce underfitting of a Random Forest model, which of the following
methods can be used?
A. Increase minimum sample leaf value
B. increase depth of trees
C. Increase the value of minimum samples to split
D. None of these
Q32) While creating a Decision Tree, can we reuse a feature to split a node?
A. Yes
B. No
Q33) Which of the following is a mandatory data pre-processing step(s) for
XGBOOST?
1. Impute Missing Values
2. Remove Outliers
3. Convert data to numeric array / sparse matrix
4. Input variable must have normal distribution
5. Select the sample of records for each tree/ estimators
A. 1 and 2
B. 1, 2 and 3
C. 3, 4 and 5
D. 3
E. 5
F. All
Q34) Decision Trees are not affected by multicollinearity in features:
A. TRUE
B. FALSE
Q35) For parameter tuning in a boosting algorithm, which of the following search
strategies may give the best tuned model:
A. Random Search.
B. Grid Search.
C. A or B
D. Can’t say
Q36) Imagine a two variable predictor space having 10 data points. A decision tree is
built over it with 5 leaf nodes. The number of distinct regions that will be formed in
predictors space?
A. 25
B. 10
C. 2
D. 5
Q37) In Random Forest, which of the following is randomly selected?
A. Number of decision trees
B. features to be taken into account when building a tree
C. samples to be given to train individual tree in a forest
D. B and C
E. A, B and C
Q38) Which of the following are the disadvantage of Decision Tree algorithm?
A. Decision tree is not easy to interpret
B. Decision tree is not a very stable algorithm
C. Decision Tree will over fit the data easily if it perfectly memorizes it
D. Both B and C
Q39) While tuning the parameters “Number of estimators” and “Shrinkage
Parameter”/“Learning Rate” of a boosting algorithm, which of the following
relationships should be kept in mind?
A. Number of estimators is directly proportional to shrinkage parameter
B. Number of estimators is inversely proportional to shrinkage parameter
C. Both have polynomial relationship
Q40) Let’s say we have m number of estimators (trees) in a XGBOOST model. Now,
how many trees will work on bootstrapped data set?
A. 1
B. m-1
C. m
D. Can’t say
E. None of the above
Q41) Which of the following statements is correct about XGBOOST parameters:
1. Learning rate can go up to 10
2. Sub Sampling / Row Sampling percentage should lie between 0 to 1
3. Number of trees / estimators can be 1
4. Max depth can not be greater than 10
A. 1
B. 1 and 3
C. 1, 3 and 4
D. 2 and 3
E. 2
F. 4
Q42) What can be the maximum depth of decision tree (where k is the number of
features and N is the number of samples)? Our constraint is that we are considering
a binary decision tree with no duplicate rows in sample (Splitting criterion is not
fixed).
A. N
B. N – k – 1
C. N – 1
D. k – 1
Q43) Boosting is a general approach that can be applied to many statistical learning
methods for regression or classification.
A. True
B. False
Q44) Predictions of individual trees of bagged decision trees have lower correlation
in comparison to individual trees of random forest.
A. TRUE
B. FALSE
Q45) Below is a list of parameters of Decision Tree. In which of the following cases
higher is better?
A. Number of samples used for split
B. Depth of tree
C. Samples for leaf
D. Can’t Say
1. How do we perform Bayesian classification when some features are missing?
(A) We assume the missing values to be the mean of all values.
(B) We ignore the missing features.
(C) We integrate the posterior probabilities over the missing features.
(D) Drop the features completely.
2. Which of the following statement is False in the case of the KNN Algorithm?
(A) For a very large value of K, points from other classes may be included in the
neighborhood.
(B) For the very small value of K, the algorithm is very sensitive to noise.
(C) KNN is used only for classification problem statements.
(D) KNN is a lazy learner.
3. Which of the following statement is TRUE?
(A) Outliers should be identified and removed always from a dataset.
(B) Outliers can never be present in the testing dataset.
(C) An outlier is a data point that is significantly close to other data points.
(D) The nature of our business problem determines how outliers are used.
4. The following data is used to fit a least squares regression line Y = a1*X (X:
independent variable, Y: dependent variable). The approximate value of a1 is:
(A) 27.876 (B) 32.650 (C) 40.541 (D) 28.956
X   1    20    30    40
Y   1    400   800   1300
Explanation: Hint: Use the ordinary least square method.
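For a no-intercept least squares fit Y = a1*X, the closed form is a1 = Σxy / Σx². A
quick check:
X = [1, 20, 30, 40]
Y = [1, 400, 800, 1300]
a1 = sum(x * y for x, y in zip(X, Y)) / sum(x * x for x in X)
print(round(a1, 3))  # 28.956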
5. The robotic arm will be able to paint every corner in the automotive parts while
minimizing the quantity of paint wasted in the process. Which learning technique is
used in this problem?
(A) Supervised Learning.
(B) Unsupervised Learning.
(C) Reinforcement Learning.
(D) Both (A) and (B).
6. Which one of the following statements is TRUE for a Decision Tree?
(A) Decision tree is only suitable for the classification problem statement.
(B) In a decision tree, the entropy of a node decreases as we go down a
decision tree.
(C) In a decision tree, entropy determines purity.
(D) Decision tree can only be used for only numeric valued and continuous
attributes.
7. How do you choose the right node while constructing a decision tree?
(A) An attribute having high entropy
(B) An attribute having high entropy and information gain
(C) An attribute having the lowest information gain.
(D) An attribute having the highest information gain.
8. What kind of distance metric(s) are suitable for categorical variables to find the
closest neighbors?
(A) Euclidean distance.
(B) Manhattan distance.
(C) Minkowski distance.
(D) Hamming distance.
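Hamming distance simply counts the positions at which two categorical vectors
disagree; a tiny illustration (the values here are made up):
a = ["red", "small", "round"]
b = ["red", "large", "round"]
print(sum(x != y for x, y in zip(a, b)))  # 1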
9. In the Naive Bayes algorithm, suppose that prior for class w1 is greater than class
w2, would the decision boundary shift towards the region R1(region for deciding w1)
or towards region R2(region for deciding w2)?
(A) towards region R1.
(B) towards region R2.
(C) No shift in decision boundary.
(D) It depends on the exact value of priors.
10. Which of the following statements is FALSE about Ridge and Lasso Regression?
(A) These are types of regularization methods to solve the overfitting problem.
(B) Lasso Regression is a type of regularization method.
(C) Ridge regression shrinks the coefficient to a lower value.
(D) Ridge regression lowers some coefficients to a zero value.
11. Which of the following is FALSE about Correlation and Covariance?
(A) A zero correlation does not necessarily imply independence between variables.
(B) Correlation and covariance values are the same.
(C) The covariance and correlation are always the same sign.
(D) Correlation is the standardized version of Covariance.
12. In Regression modeling we develop a mathematical equation that
describes how, (Predictor-Independent variable, Response-Dependent variable)
(A) one predictor and one or more response variables are related.
(B) several predictors and several response variables response are related.
(C) one response and one or more predictors are related.
(D) All of these are correct.
13. True or False: In a naive Bayes algorithm, when an attribute value in the testing
record has no example in the training set, then the entire posterior probability will be
zero.
(A) True (B) False (C) Can’t be determined (D) None of these.
14. Which of the following is NOT True about Ensemble Techniques?
(A) Bagging decreases the variance of the classifier.
(B) Boosting helps to decrease the bias of the classifier.
(C) Bagging combines the predictions from different models and then finally gives
the results.
(D) Bagging and Boosting are the only available ensemble techniques.
15. Which of the following statement is TRUE about the Bayes classifier?
(A) Bayes classifier works on the Bayes theorem of probability.
(B) Bayes classifier is an unsupervised learning algorithm.
(C) Bayes classifier is also known as maximum apriori classifier.
(D) It assumes the independence between the independent variables or features.
16. Which of the following SGD variants is based on both momentum and adaptive
learning?
(A) RMSprop.
(B) Adagrad.
(C) Adam.
(D) Nesterov.
17. Which of the following activation function output is zero centered?
(A) Hyperbolic Tangent.
(B) Sigmoid.
(C) Softmax.
(D) Rectified Linear unit(ReLU).
18. Which of the following is FALSE about Radial Basis Function Neural Network?
(A) It resembles Recurrent Neural Networks(RNNs) which have feedback loops.
(B) It uses radial basis function as activation function.
(C) While outputting, it considers the distance of a point with respect to the center.
(D) The output given by the Radial basis function is always an absolute value.
19. In which of the following situations, you should NOT prefer Keras over
TensorFlow?
(A) When you want to quickly build a prototype using neural networks.
(B) When you want to implement simple neural networks in your initial learning
phase.
(C) When you are doing critical and intensive research in any field.
(D) When you want to create simple tutorials for your students and friends.
20. Which of the following is FALSE about Deep Learning and Machine Learning
algorithms?
(A) Deep Learning algorithms work efficiently on a high amount of data.
(B) Feature Extraction needs to be done manually in both ML and DL
algorithms.
(C) Deep Learning algorithms are best suited for unstructured data.
(D) Deep Learning algorithms require high computational power.
21. Which of the following is FALSE for neural networks?
(A) Artificial neurons are similar in operation to biological neurons.
(B) Training time for a neural network depends on network size.
(C) Neural networks can be simulated on conventional computers.
(D) The basic unit of neural networks are neurons.
22. Which of the following logic function cannot be implemented by a perceptron
having 2 inputs?
(A) AND. (B) OR. (C) NOR. (D) XOR.
23. Inappropriate selection of learning rate value in gradient descent gives rise to:
(A) Local Minima.
(B) Oscillations.
(C) Slow convergence.
(D) All of the above.
Answer: Option-D
24. What will be the output of the following code?
import numpy as np
n_array = np.array([1, 0, 2, 0, 3, 0, 0, 5, 6, 7, 5, 0, 8])
res = np.where(n_array == 0)[0]
print(res.sum())
(A) 25 (B) 26 (C) 6 (D) None of these
25. What will be the output of the following code?
import numpy as np
p = [[1, 0], [0, 1]]
q = [[1, 2], [3, 4]]
result1 = np.cross(p, q)
result2 = np.cross(q, p)
print((result1==result2).shape[0])
(A) 0 (B) 1 (C) 2 (D) Code is not executable.
26. What will be the output of the following code?
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(2))
print(s.size)
(A) 0 (B) 1 (C) 2 (D)Answer not fixed due to randomness.
27. What will be the output of the following code?
import numpy as np
student_id = np.array([1023, 5202, 6230, 1671, 1682, 5241, 4532])
i = np.argsort(student_id)
print(i[5])
(A) 2 (B) 3 (C) 4 (D) 5
28. What will be the output of the following code?
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(4))
print(s.ndim)
(A) 1 (B) 2 (C) 0 (D) 3
29. What will be the output of the following code?
import numpy as np
my_array = np.arange(6).reshape(2,3)
result = np.trace(my_array)
print(result)
(A) 2 (B) 4 (C) 6 (D) 8
30. What will be the output of the following code?
import numpy as np
from numpy import linalg
a = np.array([[1, 0], [1, 2]])
print(type(np.linalg.det(a)))
(A) INT (B) FLOAT (C) STR (D) BOOL.
Q1. Which of the following algorithm is not an example of an ensemble method?
A. Extra Tree Regressor
B. Random Forest
C. Gradient Boosting
D. Decision Tree
Q2. What is true about an ensembled classifier?
1. Classifiers that are more “sure” can vote with more conviction
2. Classifiers can be more “sure” about a particular part of the space
3. Most of the times, it performs better than a single classifier
A. 1 and 2
B. 1 and 3
C. 2 and 3
D. All of the above
Q3. Which of the following option is / are correct regarding benefits of ensemble
model?
1. Better performance
2. Generalized models
3. Better interpretability
A. 1 and 3
B. 2 and 3
C. 1 and 2
D. 1, 2 and 3
Q4) Which of the following can be true for selecting base learners for an ensemble?
1. Different learners can come from same algorithm with different hyper parameters
2. Different learners can come from different algorithms
3. Different learners can come from different training spaces
A. 1
B. 2
C. 1 and 3
D. 1, 2 and 3
Q5. True or False: Ensemble learning can only be applied to supervised learning
methods.
A. True
B. False
Q6. True or False: Ensembles will yield bad results when there is significant diversity
among the models.
Note: All individual models have meaningful and good predictions.
A. True
B. False
Q7. Which of the following is / are true about weak learners used in ensemble
model?
1. They have low variance and they don’t usually overfit
2. They have high bias, so they can not solve hard learning problems
3. They have high variance and they don’t usually overfit
A. 1 and 2
B. 1 and 3
C. 2 and 3
D. None of these
Q8. True or False: An ensemble of classifiers may or may not be more accurate than
any of its individual models.
A. True
B. False
Q9. If you use an ensemble of different base models, is it necessary to tune the
hyperparameters of all base models to improve the ensemble performance?
A. Yes
B. No
C. can’t say
Q10. Generally, an ensemble method works better, if the individual base models
have ____________?
Note: Suppose each individual base models have accuracy greater than 50%.
A. Less correlation among predictions
B. High correlation among predictions
C. Correlation does not have any impact on ensemble output
D. None of the above
Context – Question 11
In an election, N candidates are competing against each other and people are voting
for either of the candidates. Voters don’t communicate with each other while casting
their votes.
Q.11 Which of the following ensemble method works similar to above-discussed
election procedure?
Hint: Persons are like base models of ensemble method.
A. Bagging
B. Boosting
C. A Or B
D. None of these
Q12. Suppose you are given ‘n’ predictions on test data by ‘n’ different models (M1,
M2, …. Mn) respectively. Which of the following method(s) can be used to combine
the predictions of these models?
Note: We are working on a regression problem
1. Median
2. Product
3. Average
4. Weighted sum
5. Minimum and Maximum
6. Generalized mean rule
A. 1, 3 and 4
B. 1,3 and 6
C. 1,3, 4 and 6
D. All of above
Context: Question 13 -14
Suppose, you are working on a binary classification problem. And there are 3 models
each with 70% accuracy.
Q13. If you want to ensemble these models using the majority voting method, what
will be the maximum accuracy you can get?
A. 100%
B. 78.38 %
C. 44%
D. 70%
Refer below table for models M1, M2 and M3.
Actual output M1 M2 M3 Output
1 1 0 1 1
1 1 0 1 1
1 1 0 1 1
1 0 1 1 1
1 0 1 1 1
1 0 1 1 1
1 1 1 1 1
1 1 1 0 1
1 1 1 0 1
1 1 1 0 1
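A sketch that scores the table above with majority voting (each tuple is actual, M1,
M2, M3, taken row by row from the table):
rows = [(1,1,0,1), (1,1,0,1), (1,1,0,1), (1,0,1,1), (1,0,1,1),
        (1,0,1,1), (1,1,1,1), (1,1,1,0), (1,1,1,0), (1,1,1,0)]
correct = sum(actual == (1 if m1 + m2 + m3 >= 2 else 0)
              for actual, m1, m2, m3 in rows)
print(correct / len(rows))  # 1.0, i.e. 100% in this best-case arrangement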
Q14. If you want to ensemble these models using majority voting, what will be the
minimum accuracy you can get?
A. Always greater than 70%
B. Always greater than and equal to 70%
C. It can be less than 70%
D. None of these
Refer below table for models M1, M2 and M3.
Actual Output M1 M2 M3 Output
1 1 0 0 0
1 1 1 1 1
1 1 0 0 0
1 0 1 0 0
1 0 1 1 1
1 0 0 1 0
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
Q15. How can we assign the weights to output of different models in an ensemble?
1. Use an algorithm to return the optimal weights
2. Choose the weights using cross validation
3. Give high weights to more accurate models
A. 1 and 2
B. 1 and 3
C. 2 and 3
D. All of above
Q16. Which of the following is true about averaging ensemble?
A. It can only be used in classification problem
B. It can only be used in regression problem
C. It can be used in both classification as well as regression
D. None of these
Context Question 17
Suppose you are given predictions on 4 test observations.
predictions = [0.2, 0.5, 0.33, 0.8]
Which of the following will be the ranked average output for these predictions?
Hint: You are using min-max scaling on the ranks
A. [ 0., 0.66666667, 0.33333333, 1. ]
B. [ 0.1210, 0.66666667, 0.95,0.33333333 ]
C. [ 0.1210, 0.66666667, 0.33333333, 0.95 ]
D. None of above
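A sketch of the rank-averaging step for one model: rank the predictions, then
min-max scale the ranks (numpy's double argsort is a common idiom for ranks,
assuming no ties):
import numpy as np
preds = np.array([0.2, 0.5, 0.33, 0.8])
ranks = preds.argsort().argsort().astype(float)  # [0., 2., 1., 3.]
scaled = (ranks - ranks.min()) / (ranks.max() - ranks.min())
print(scaled)  # [0. 0.66666667 0.33333333 1.]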
Q18.
In the above snapshot, lines A and B are the predictions of 2 models (M1 and M2
respectively). Now you want to apply an ensemble which aggregates the results of
these two models using weighted averaging. Which of the following lines is most
likely the output of this ensemble if you give weights 0.7 and 0.3 to models M1 and
M2 respectively?
A) A
B) B
C) C
D) D
E) E
Q19. Which of the following is true about weighted majority votes?
1. We want to give higher weights to better performing models
2. Inferior models can overrule the best model if collective weighted votes for inferior
models is higher than best model
3. Voting is special case of weighted voting
A. 1 and 3
B. 2 and 3
C. 1 and 2
D. 1, 2 and 3
E. None of above
Context – Question 20-21
Suppose in a classification problem, you have the following probabilities from three
models, M1, M2, M3, for five observations of a test data set.
M1 M2 M3 Output
.70 .80 .75
.50 .64 .80
.30 .20 .35
.49 .51 .50
.60 .80 .60
Q20. Which of the following will be the predicted category for these observations if
you apply probability threshold greater than or equals to 0.5 for category “1” or less
than 0.5 for category “0”?
Note: You are applying the averaging method to ensemble given predictions by three
models.
A.
M1 M2 M3 Output
.70 .80 .75 1
.50 .64 .80 1
.30 .20 .35 0
.49 .51 .50 0
.60 .80 .60 1
B.
M1 M2 M3 Output
.70 .80 .75 1
.50 .64 .80 1
.30 .20 .35 0
.49 .51 .50 1
.60 .80 .60 1
C.
M1 M2 M3 Output
.70 .80 .75 1
.50 .64 .80 1
.30 .20 .35 1
.49 .51 .50 0
.60 .80 .60 0
D. None of these
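A sketch of the averaging method applied to the table above (the option tables
simply fill in the Output column):
import numpy as np
probs = np.array([[.70, .80, .75],
                  [.50, .64, .80],
                  [.30, .20, .35],
                  [.49, .51, .50],
                  [.60, .80, .60]])
avg = probs.mean(axis=1)
print((avg >= 0.5).astype(int))  # [1 1 0 1 1]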
Q21: Which of the following will be the predicted category for these observations if
you apply probability threshold greater than or equals to 0.5 for category “1” or less
than 0.5 for category “0”?
A.
M1 M2 M3 Output
.70 .80 .75 1
.50 .64 .80 1
.30 .20 .35 0
.49 .51 .50 0
.60 .80 .60 1
B.
M1 M2 M3 Output
.70 .80 .75 1
.50 .64 .80 1
.30 .20 .35 0
.49 .51 .50 1
.60 .80 .60 1
C.
M1 M2 M3 Output
.70 .80 .75 1
.50 .64 .80 1
.30 .20 .35 1
.49 .51 .50 0
.60 .80 .60 0
D. None of these
Context: Question 22-23
Suppose in a binary classification problem, you are given the following predictions of
three models (M1, M2, M3) for five observations of a test data set.
M1 M2 M3 Output
1 1 0
0 1 0
0 1 1
1 0 1
1 1 1
Q22: Which of the following will be the output ensemble model if we are using
majority voting method?
A.
M1 M2 M3 Output
1 1 0 0
0 1 0 1
0 1 1 0
1 0 1 0
1 1 1 1
B.
M1 M2 M3 Output
1 1 0 1
0 1 0 0
0 1 1 1
1 0 1 1
1 1 1 1
C.
M1 M2 M3 Output
1 1 0 1
0 1 0 0
0 1 1 1
1 0 1 0
1 1 1 1
D. None of these
Q23. When using the weighted voting method, which of the following will be the
output of an ensemble model?
Hint: Count the vote of M1,M2,M3 as 2.5 times, 6.5 times and 3.5 times respectively.
A.
M1 M2 M3 Output
1 1 0 0
0 1 0 1
0 1 1 0
1 0 1 0
1 1 1 1
B.
M1 M2 M3 Output
1 1 0 1
0 1 0 0
0 1 1 1
1 0 1 1
1 1 1 1
C.
M1 M2 M3 Output
1 1 0 1
0 1 0 1
0 1 1 1
1 0 1 0
1 1 1 1
D. None of these
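Weighted majority voting can be checked the same way: sum the weights of the
models voting “1” and compare against half the total weight (weights 2.5, 6.5 and
3.5, as per the hint):
weights = [2.5, 6.5, 3.5]
rows = [(1, 1, 0), (0, 1, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
half = sum(weights) / 2  # 6.25
for votes in rows:
    weight_for_one = sum(w for w, v in zip(weights, votes) if v == 1)
    print(1 if weight_for_one > half else 0)  # prints 1, 1, 1, 0, 1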
Q24. Which of the following are correct statement(s) about stacking?
1. A machine learning model is trained on predictions of multiple machine
learning models
2. A Logistic regression will definitely work better in the second stage as
compared to other classification methods
3. First stage models are trained on full / partial feature space of training data
A.1 and 2
B. 2 and 3
C. 1 and 3
D. All of above
Q25. Which of the following are advantages of stacking?
1. More robust model
2. Better prediction
3. Lower time of execution
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. All of the above
Q26: Which of the following figure represents stacking?
A.
B.
C. None of these
Solution: (A)
Q27. Which of the following can be one of the steps in stacking?
1. Divide the training data into k folds
2. Train k models on each k-1 folds and get the out of fold predictions for remaining
one fold
3. Divide the test data set in “k” folds and get individual fold predictions by different
algorithms
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. All of above
Q28. Which of the following is the difference between stacking and blending?
A. Stacking has less stable CV compared to Blending
B. In Blending, you create out of fold prediction
C. Stacking is simpler than Blending
D. None of these
Q29. Suppose you are using stacking with n different machine learning algorithms
with k folds on data.
Which of the following is true about one level (m base models + 1 stacker) stacking?
Note:
 Here, we are working on binary classification problem
 All base models are trained on all features
 You are using k folds for base models
A. You will have only k features after the first stage
B. You will have only m features after the first stage
C. You will have k+m features after the first stage
D. You will have k*n features after the first stage
E. None of the above
Q30. Which of the following is true about bagging?
1. Bagging can be parallel
2. The aim of bagging is to reduce bias not variance
3. Bagging helps in reducing overfitting
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. All of these
Q31. True or False: In boosting, individual base learners can be parallel.
A. True
B. False
Q32. Below are the two ensemble models:
1. E1(M1, M2, M3) and
2. E2(M4, M5, M6)
Above, Mx is the individual base models.
Which of the following are you more likely to choose if the following conditions for
E1 and E2 are given?
E1: Individual model accuracies are high but the models are of the same type, or in
other terms less diverse
E2: Individual model accuracies are high but they are of different types, or in other
terms highly diverse in nature
A. E1
B. E2
C. Any of E1 and E2
D. None of these
Q33. Suppose, you have 2000 different models with their predictions and want to
ensemble predictions of best x models. Now, which of the following can be a
possible method to select the best x models for an ensemble?
A. Step wise forward selection
B. Step wise backward elimination
C. Both
D. None of above
Q34. Suppose, you want to apply a stepwise forward selection method for choosing
the best models for an ensemble model. Which of the following is the correct order of
the steps?
Note: You have more than 1000 models predictions
1. Add model predictions to the ensemble one by one (in other words, take the
average), keeping those that improve the metric on the validation set.
2. Start with an empty ensemble
3. Return the ensemble from the nested set of ensembles that has maximum
performance on the validation set
A. 1-2-3
B. 1-3-4
C. 2-1-3
D. None of above
Q35. True or False: Dropout is a computationally expensive technique w.r.t. bagging.
A. True
B. False
Q36.Dropout in a neural network can be considered as an ensemble technique,
where multiple sub-networks are trained together by “dropping” out certain
connections between neurons.
Suppose, we have a single hidden layer neural network as shown below.
How many possible combinations of subnetworks can be used for classification?
A. 1
B. 9
C. 12
D. 16
E. None of the above
Q37. How is the model capacity affected by the dropout rate (where model capacity
means the ability of a neural network to approximate complex functions)?
A. Model capacity increases with an increase in dropout rate
B. Model capacity decreases with an increase in dropout rate
C. Model capacity is not affected by an increase in dropout rate
D. None of these
Q38. Which of the following parameters can be tuned for finding good ensemble
model in bagging based algorithms?
1. Max number of samples
2. Max features
3. Bootstrapping of samples
4. Bootstrapping of features
A. 1 and 3
B. 2 and 4
C. 1,2 and 3
D. 1,3 and 4
E. All of above
Q39. In machine learning, an algorithm (or learning algorithm) is said to be unstable
if a small change in the training data causes a large change in the learned classifier.
True or False: Bagging of unstable classifiers is a good idea.
A. True
B. False
Q40. Suppose there are 25 base classifiers, each with an error rate of e = 0.35.
Suppose you are using averaging as the ensemble technique. What will be the
probability that the ensemble of the above 25 classifiers makes a wrong prediction?
Note: All classifiers are independent of each other
A. 0.05
B. 0.06
C. 0.07
D. 0.09
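With independent classifiers, the (majority-vote) ensemble errs when 13 or more of
the 25 are wrong, a binomial tail sum:
from math import comb
e = 0.35
p_wrong = sum(comb(25, i) * e**i * (1 - e)**(25 - i) for i in range(13, 26))
print(round(p_wrong, 2))  # 0.06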
Recommendation systems
Q1. Movie Recommendation systems are an example of:
1. Classification
2. Clustering
3. Reinforcement Learning
4. Regression
Options:
A. 2 Only
B. 1 and 2
C. 1 and 3
D. 2 and 3
Q2. Sentiment Analysis is an example of:
1. Regression
2. Classification
3. Clustering
4. Reinforcement Learning
Options:
A. 1 Only
B. 1 and 2
C. 1 and 3
D. 1, 2 and 3
E. 1, 2 and 4
F. 1, 2, 3 and 4
Q3. Can decision trees be used for performing clustering?
A. True
B. False
Q4. Which of the following is the most appropriate strategy for data cleaning before
performing clustering analysis, given less than desirable number of data points:
1. Capping and flooring of variables
2. Removal of outliers
Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of the above
Q5. What is the minimum no. of variables/ features required to perform clustering?
A. 0
B. 1
C. 2
D. 3
Q6. For two runs of K-Means clustering, is it expected to get the same clustering results?
A. Yes
B. No
Q7. Is it possible that the assignment of observations to clusters does not change
between successive iterations in K-Means?
A. Yes
B. No
C. Can’t say
D. None of these
Q8. Which of the following can act as possible termination conditions in K-Means?
1. For a fixed number of iterations.
2. Assignment of observations to clusters does not change between iterations.
Except for cases with a bad local minimum.
3. Centroids do not change between successive iterations.
4. Terminate when RSS falls below a threshold.
Options:
A. 1, 3 and 4
B. 1, 2 and 3
C. 1, 2 and 4
D. All of the above
Q9. Which of the following clustering algorithms suffers from the problem of
convergence at local optima?
1. K- Means clustering algorithm
2. Agglomerative clustering algorithm
3. Expectation-Maximization clustering algorithm
4. Diverse clustering algorithm
Options:
A. 1 only
B. 2 and 3
C. 2 and 4
D. 1 and 3
Q10. Which of the following algorithm is most sensitive to outliers?
A. K-means clustering algorithm
B. K-medians clustering algorithm
C. K-modes clustering algorithm
D. K-medoids clustering algorithm
Q11. After performing K-Means Clustering analysis on a dataset, you observed the
following dendrogram. Which of the following conclusion can be drawn from the
dendrogram?
A. There were 28 data points in clustering analysis
B. The best no. of clusters for the analyzed data points is 4
C. The proximity function used is Average-link clustering
D. The above dendrogram interpretation is not possible for K-Means clustering
analysis
Q12. How can Clustering (Unsupervised Learning) be used to improve the accuracy
of Linear Regression model (Supervised Learning):
1. Creating different models for different cluster groups.
2. Creating an input feature for cluster ids as an ordinal variable.
3. Creating an input feature for cluster centroids as a continuous variable.
4. Creating an input feature for cluster size as a continuous variable.
Options:
A. 1 only
B. 1 and 2
C. 1 and 4
D. All of the above
Q13. What could be the possible reason(s) for producing two different dendrograms
using agglomerative clustering algorithm for the same dataset?
A. Proximity function used
B. No. of data points used
C. No. of variables used
D. B and C only
E. All of the above
Q14. In the figure below, if you draw a horizontal line on y-axis for y=2. What will be
the number of clusters formed?
A. 1
B. 2
C. 3
D. 4
Q15. What is the most appropriate no. of clusters for the data points represented by
the following dendrogram:
A. 2
B. 4
C. 6
D. 8
Q16. In which of the following cases will K-Means clustering fail to give good results?
1. Data points with outliers
2. Data points with different densities
3. Data points with round shapes
4. Data points with non-convex shapes
Options:
A. 1 and 2
B. 2 and 3
C. 2 and 4
D. 1, 2 and 4
Q17. Which of the following metrics do we have for finding dissimilarity between two
clusters in hierarchical clustering?
1. Single-link
2. Complete-link
3. Average-link
Options:
A. 1 and 2
B. 1 and 3
C. 2 and 3
D. 1, 2 and 3
Q18. Which of the following are true?
1. Clustering analysis is negatively affected by multicollinearity of features
2. Clustering analysis is negatively affected by heteroscedasticity
Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of them
Q19. Given, six points with the following attributes:
Which of the following clustering representations and dendrogram depicts the use of
MIN or Single link proximity function in hierarchical clustering:
A. (Ans)
B.
C.
D.
Q20 Given, six points with the following attributes:
Which of the following clustering representations and dendrogram depicts the use of
MAX or Complete link proximity function in hierarchical clustering:
A.
B. (Ans)
C.
D.
Q21 Given, six points with the following attributes:
Which of the following clustering representations and dendrogram depicts the use of
Group average proximity function in hierarchical clustering:
A.
B.
C. (Ans)
D.
Q22. Given, six points with the following attributes:
Which of the following clustering representations and dendrogram depicts the use of
Ward’s method proximity function in hierarchical clustering:
A.
B.
C.
D.
Q23. What should be the best choice of no. of clusters based on the following
results:
A. 1
B. 2
C. 3
D. 4
Q24. Which of the following is/are valid iterative strategy for treating missing values
before clustering analysis?
A. Imputation with mean
B. Nearest Neighbor assignment
C. Imputation with Expectation Maximization algorithm
D. All of the above
Q25. The K-Means algorithm has some limitations. One of its limitations is that it
makes hard assignments of points to clusters (a point either completely belongs to a
cluster or does not belong at all).
Note: Soft assignment can be considered as the probability of being assigned to
each cluster: say K = 3 and for some point xn, p1 = 0.7, p2 = 0.2, p3 = 0.1
Which of the following algorithm(s) allows soft assignments?
1. Gaussian mixture models
2. Fuzzy K-means
Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of these
Q26. Assume, you want to cluster 7 observations into 3 clusters using K-Means
clustering algorithm. After first iteration clusters, C1, C2, C3 has following
observations:
C1: {(2,2), (4,4), (6,6)}
C2: {(0,4), (4,0)}
C3: {(5,5), (9,9)}
What will be the cluster centroids if you want to proceed for second iteration?
A. C1: (4,4), C2: (2,2), C3: (7,7)
B. C1: (6,6), C2: (4,4), C3: (9,9)
C. C1: (2,2), C2: (0,0), C3: (5,5)
D. None of these
Q27. Assume, you want to cluster 7 observations into 3 clusters using K-Means
clustering algorithm. After first iteration clusters, C1, C2, C3 has following
observations:
C1: {(2,2), (4,4), (6,6)}
C2: {(0,4), (4,0)}
C3: {(5,5), (9,9)}
What will be the Manhattan distance for observation (9, 9) from cluster centroid C1.
In second iteration.
A. 10
B. 5*sqrt(2)
C. 13*sqrt(2)
D. None of these
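A sketch of the second-iteration bookkeeping: recompute each centroid as the mean
of its members, then take the Manhattan distance from (9, 9) to centroid C1:
clusters = {"C1": [(2, 2), (4, 4), (6, 6)],
            "C2": [(0, 4), (4, 0)],
            "C3": [(5, 5), (9, 9)]}
centroids = {name: (sum(x for x, _ in pts) / len(pts),
                    sum(y for _, y in pts) / len(pts))
             for name, pts in clusters.items()}
print(centroids)  # C1: (4.0, 4.0), C2: (2.0, 2.0), C3: (7.0, 7.0)
cx, cy = centroids["C1"]
print(abs(9 - cx) + abs(9 - cy))  # 10.0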
Q28. If two variables V1 and V2 are used for clustering, which of the following are
true for K-Means clustering with k = 3?
1. If V1 and V2 have a correlation of 1, the cluster centroids will be in a straight
line
2. If V1 and V2 have a correlation of 0, the cluster centroids will be in a straight
line
Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of the above
Q29. Feature scaling is an important step before applying the K-Means algorithm.
What is the reason behind this?
A. In distance calculation it will give the same weight to all features
B. You always get the same clusters. If you use or don’t use feature scaling
C. In Manhattan distance it is an important step but in Euclidian it is not
D. None of these
Q30. Which of the following methods is used for finding the optimal number of
clusters in the K-Means algorithm?
A. Elbow method
B. Manhattan method
C. Euclidean method
D. All of the above
E. None of these
Q31. What is true about K-Mean Clustering?
1. K-means is extremely sensitive to cluster center initializations
2. Bad initialization can lead to Poor convergence speed
3. Bad initialization can lead to bad overall clustering
Options:
A. 1 and 3
B. 1 and 2
C. 2 and 3
D. 1, 2 and 3
Q32. Which of the following can be applied to get good results for K-means algorithm
corresponding to global minima?
1. Try to run algorithm for different centroid initialization
2. Adjust number of iterations
3. Find out the optimal number of clusters
Options:
A. 2 and 3
B. 1 and 3
C. 1 and 2
D. All of above
Q33. What should be the best choice for number of clusters based on the following
results:
A. 5
B. 6
C. 14
D. Greater than 14
Q34. What should be the best choice for number of clusters based on the following
results:
A. 2
B. 4
C. 6
D. 8
Q35. Which of the following sequences is correct for a K-Means algorithm using
Forgy method of initialization?
1. Specify the number of clusters
2. Assign cluster centroids randomly
3. Assign each data point to the nearest cluster centroid
4. Re-assign each point to nearest cluster centroids
5. Re-compute cluster centroids
Options:
A. 1, 2, 3, 5, 4
B. 1, 3, 2, 4, 5
C. 2, 1, 3, 4, 5
D. None of these
Q36. If you are using Multinomial mixture models with the expectation-maximization
algorithm for clustering a set of data points into two clusters, which of the
assumptions are important:
A. All the data points follow two Gaussian distribution
B. All the data points follow n Gaussian distribution (n >2)
C. All the data points follow two multinomial distribution
D. All the data points follow n multinomial distribution (n >2)
Q37. Which of the following is/are not true about Centroid based K-Means clustering
algorithm and Distribution based expectation-maximization clustering algorithm:
1. Both start with random initializations
2. Both are iterative algorithms
3. Both have strong assumptions that the data points must fulfill
4. Both are sensitive to outliers
5. Expectation maximization algorithm is a special case of K-Means
6. Both require prior knowledge of the no. of desired clusters
7. The results produced by both are non-reproducible.
Options:
A. 1 only
B. 5 only
C. 1 and 3
D. 6 and 7
Q38. Which of the following is/are not true about DBSCAN clustering algorithm:
1. For data points to be in a cluster, they must be in a distance threshold to a
core point
2. It has strong assumptions for the distribution of data points in dataspace
3. It has a substantially high time complexity, of order O(n^3)
4. It does not require prior knowledge of the no. of desired clusters
5. It is robust to outliers
Options:
A. 1 only
B. 2 only
C. 4 only
D. 2 and 3
Q39. Which of the following are the lower and upper bounds of the F-Score?
A. [0,1]
B. (0,1)
C. [-1,1]
Q40. Following are the results observed for clustering 6000 data points into 3
clusters: A, B and C:
What is the F1-Score with respect to cluster B?
A. 3
B. 4
C. 5
D. 6
Deep Learning
1) The difference between deep learning and machine learning algorithms is that
there is no need for feature engineering in machine learning algorithms, whereas it is
recommended to do feature engineering first and then apply deep learning.
A) TRUE
B) FALSE
2) Which of the following is a representation learning algorithm?
A) Neural network
B) Random Forest
C) k-Nearest neighbor
D) None of the above
3) Which of the following option is correct for the below-mentioned techniques?
1. AdaGrad uses first order differentiation
2. L-BFGS uses second order differentiation
3. AdaGrad uses second order differentiation
4. L-BFGS uses first order differentiation
A) 1 and 2
B) 3 and 4
C) 1 and 4
D) 2 and 3
4) Increase in size of a convolutional kernel would necessarily increase the
performance of a convolutional neural network.
A) TRUE
B) FALSE
Question Context
Suppose we have a deep neural network model which was trained on a vehicle
detection problem. The dataset consisted of images of cars and trucks and the aim
was to detect the name of the vehicle (the number of classes of vehicles is 10).
Now you want to use this model on a different dataset which has images of only Ford
Mustangs (aka car) and the task is to locate the car in an image.
5) Which of the following categories would be suitable for this type of problem?
A) Fine tune only the last couple of layers and change the last layer
(classification layer) to regression layer
B) Freeze all the layers except the last, re-train the last layer
C) Re-train the model for the new dataset
D) None of these
6) Suppose you have 5 convolutional kernels of size 7 x 7 with zero padding and
stride 1 in the first layer of a convolutional neural network. You pass an input of
dimension 224 x 224 x 3 through this layer. What are the dimensions of the data
which the next layer will receive?
A) 217 x 217 x 3
B) 217 x 217 x 8
C) 218 x 218 x 5
D) 220 x 220 x 7
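The standard output-size formula for a convolutional layer is (W - F + 2P)/S + 1,
with output depth equal to the number of kernels (“zero padding” here meaning
P = 0):
W, F, P, S, n_kernels = 224, 7, 0, 1, 5
out = (W - F + 2 * P) // S + 1
print(out, "x", out, "x", n_kernels)  # 218 x 218 x 5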
7) Suppose we have a neural network with ReLU activation function. Let’s say, we
replace ReLu activations by linear activations.
Would this new neural network be able to approximate an XNOR function?
Note: The neural network was able to approximate XNOR function with activation
function ReLu.
A) Yes
B) No
8) Suppose we have a 5-layer neural network which takes 3 hours to train on a GPU
with 4GB VRAM. At test time, it takes 2 seconds for single data point.
Now we change the architecture such that we add dropout after 2nd and 4th layer
with rates 0.2 and 0.3 respectively.
What would be the testing time for this new architecture?
A) Less than 2 secs
B) Exactly 2 secs
C) Greater than 2 secs
D) Can’t Say
9) Which of the following options can be used to reduce overfitting in deep learning
models?
1. Add more data
2. Use data augmentation
3. Use architecture that generalizes well
4. Add regularization
5. Reduce architectural complexity
A) 1, 2, 3
B) 1, 4, 5
C) 1, 3, 4, 5
D) All of these
10) Perplexity is a commonly used evaluation metric when applying deep learning to
NLP tasks. Which of the following statements is correct?
A) Higher the perplexity the better
B) Lower the perplexity the better
11) Suppose an input to Max-Pooling layer is given above. The pooling size of
neurons in the layer is (3, 3).
What would be the output of this Pooling layer?
A) 3
B) 5
C) 5.5
D) 7
12) Suppose there is a neural network with the below configuration.
If we remove the ReLU layers, we can still use this neural network to model
non-linear functions.
A) TRUE
B) FALSE
13) Deep learning can be applied to which of the following NLP tasks?
A) Machine translation
B) Sentiment analysis
C) Question Answering system
D) All of the above
14) Scenario 1: You are given data of the map of Arcadia city, with aerial
photographs of the city and its outskirts. The task is to segment the areas into
industrial land, farmland and natural landmarks like river, mountains, etc.
Scenario 2: You are given data of the map of Arcadia city, with detailed roads and
distances between landmarks. This is represented as a graph structure. The task is
to find out the nearest distance between two landmarks.
Deep learning can be applied to Scenario 1 but not Scenario 2.
A) TRUE
B) FALSE
15) Which of the following is a data augmentation technique used in image
recognition tasks?
1. Horizontal flipping
2. Random cropping
3. Random scaling
4. Color jittering
5. Random translation
6. Random shearing
A) 1, 2, 4
B) 2, 3, 4, 5, 6
C) 1, 3, 5, 6
D) All of these
16) Given an n-character word, we want to predict which character would be the
n+1th character in the sequence. For example, our input is “predictio” (which is a 9
character word) and we have to predict what would be the 10th character.
Which neural network architecture would be suitable to complete this task?
A) Fully-Connected Neural Network
B) Convolutional Neural Network
C) Recurrent Neural Network (best for sequential data)
D) Restricted Boltzmann Machine
17) What is generally the sequence followed when building a neural network
architecture for semantic segmentation for image?
A) Convolutional network on input and deconvolutional network on output
B) Deconvolutional network on input and convolutional network on output
18) Sigmoid was the most commonly used activation function in neural networks,
until an issue was identified. The issue is that when the inputs are too large in the
positive or negative direction, the gradient of the activation function gets squashed
towards zero. This is called saturation of the neuron.
That is why the ReLU function was proposed, which keeps the gradient constant
in the positive direction.
A ReLU unit in neural network never gets saturated.
A) TRUE
B) FALSE
19) What is the relationship between dropout rate and regularization?
Note: we have defined the dropout rate as the probability of keeping a neuron active.
A) Higher the dropout rate, higher is the regularization
B) Higher the dropout rate, lower is the regularization
20) What is the technical difference between vanilla backpropagation algorithm and
backpropagation through time (BPTT) algorithm?
A) Unlike backprop, in BPTT we sum up gradients for corresponding weight
for each time step
B) Unlike backprop, in BPTT we subtract gradients for corresponding weight for each
time step
21) Exploding gradient problem is an issue in training deep networks where the
gradient gets so large that the loss goes to an infinitely high value and then
explodes.
What is the probable approach when dealing with “Exploding Gradient” problem in
RNNs?
A) Use modified architectures like LSTM and GRUs
B) Gradient clipping
C) Dropout
D) None of these
22) There are many types of gradient descent algorithms. Two of the most notable
ones are l-BFGS and SGD. l-BFGS is a second order gradient descent technique
whereas SGD is a first order gradient descent technique.
In which of the following scenarios would you prefer l-BFGS over SGD?
1. Data is sparse
2. Number of parameters of neural network are small
A) Both 1 and 2
B) Only 1
C) Only 2
D) None of these
23) Which of the following is not a direct prediction technique for NLP tasks?
A) Recurrent Neural Network
B) Skip-gram model
C) PCA
D) Convolutional neural network
24) Which of the following would be the best for a non-continuous objective during
optimization in deep neural net?
A) L-BFGS
B) SGD
C) AdaGrad
D) Subgradient method
25) Which of the following is correct?
1. Dropout randomly masks the input weights to a neuron
2. Dropconnect randomly masks both input and output weights to a neuron
A) 1 is True and 2 is False
B) 1 is False and 2 is True
C) Both 1 and 2 are True
D) Both 1 and 2 are False
26) While training a neural network for image recognition task, we plot the graph of
training error and validation error for debugging.
What is the best place in the graph for early stopping?
A) A
B) B
C) C
D) D
27) Research is going on to solve image inpainting problems using computer vision
with deep learning. For this, which loss function would be appropriate for computing
the pixel-wise region to be inpainted?
Image inpainting is one of those problems which requires human expertise for
solving it. It is particularly useful to repair damaged photos or videos. Below is an
example of input and output of an image inpainting example.
A) Euclidean loss
B) Negative-log Likelihood loss
C) Any of the above
28) Backpropagation works by first calculating the gradient of ___ and then
propagating it backwards.
A) Sum of squared error with respect to inputs
B) Sum of squared error with respect to weights
C) Sum of squared error with respect to outputs
D) None of the above
29) Mini-batch sizes when defining a neural network are preferred to be multiples of
2, such as 256 or 512. What is the reason behind this?
A) Gradient descent optimizes best when you use an even number
B) Parallelization of neural network is best when the memory is used optimally
C) Losses are erratic when you don’t use an even number
D) None of these
30) Xavier initialization is most commonly used to initialize the weights of a neural
network. Below is given the formula for initialization.
1. If weights at the start are small, then signals reaching the end will be too tiny.
2. If weights at the start are too large, signals reaching the end will be too large.
3. Weights from Xavier’s init are drawn from the Gaussian distribution.
4. Xavier’s init helps reduce the vanishing gradient problem.
Xavier’s init is used to help the input signals reach deep into the network. Which of
the following statements are true?
A) 1, 2, 4
B) 2, 3, 4
C) 1, 3, 4
D) 1, 2, 3
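Since the formula itself is not reproduced above, here is a hedged sketch of one
commonly used form (Glorot and Bengio propose variance 2/(n_in + n_out); the
simpler 1/n_in variant is shown):
import numpy as np
def xavier_init(n_in, n_out):
    # zero-mean Gaussian with variance 1/n_in (one common variant)
    return np.random.randn(n_in, n_out) * np.sqrt(1.0 / n_in)
W = xavier_init(256, 128)
print(round(W.std(), 3))  # close to sqrt(1/256) = 0.0625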
31) As the length of sentence increases, it becomes harder for a neural translation
machine to perform as sentence meaning is represented by a fixed dimensional
vector. To solve this, which of the following could we do?
A) Use recursive units instead of recurrent
B) Use attention mechanism
C) Use character level translation
D) None of these
32) A recurrent neural network can be unfolded into a fully-connected neural
network with infinite length.
A) TRUE
B) FALSE
33) Which of the following is a bottleneck for deep learning algorithm?
A) Data related to the problem
B) CPU to GPU communication
C) GPU memory
D) All of the above
34) Dropout is a regularization technique used especially in the context of deep
learning. It works as following, in one iteration we first randomly choose neurons in
the layers and masks them. Then this network is trained and optimized in the same
iteration. In the next iteration, another set of randomly chosen neurons are selected
and masked and the training continues.
Dropout technique is not an advantageous technique for which of the following
layers?
A) Affine layer
B) Convolutional layer
C) RNN layer
D) None of these
35) Suppose your task is to predict the next few notes of song when you are given
the preceding segment of the song.
For example: The input given to you is an image depicting the music symbols as
given below,
Your required output is an image of succeeding symbols.
Which architecture of neural network would be better suited to solve the problem?
A) End-to-End fully connected neural network
B) Convolutional neural network followed by recurrent units
C) Neural Turing Machine
D) None of these
36) When deriving a memory cell in memory networks, we choose to read values as
vector values instead of scalars. Which type of addressing would this entail?
A) Content-based addressing
B) Location-based addressing
37) It is generally recommended to replace pooling layers in generator part of
convolutional generative adversarial nets with ________ ?
A) Affine layer
B) Strided convolutional layer
C) Fractional strided convolutional layer
D) ReLU layer
Question Context 38-40
GRU is a special type of Recurrent Neural Networks proposed to overcome the
difficulties of classical RNNs. This is the paper in which they were proposed: “On the
Properties of Neural Machine Translation: Encoder–Decoder Approaches”.
38) Which of the following statements is true with respect to GRU?
1. Units with short-term dependencies have a very active reset gate.
2. Units with long-term dependencies have a very active update gate.
A) Only 1
B) Only 2
C) None of them
D) Both 1 and 2
39) If the reset gate in a GRU unit evaluates close to 0, which of the following would
occur?
A) Previous hidden state would be ignored
B) Previous hidden state would not be ignored
40) If the update gate in a GRU unit evaluates close to 1, which of the following
would occur?
A) Forgets the information for future time steps
B) Copies the information through many time steps
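A hedged sketch of a single GRU step, following the gating equations of Cho et al.; the weight names are illustrative, not taken from this document. It shows why the two answers above hold: a reset gate r near 0 drops h_prev from the candidate state, and an update gate z near 1 copies h_prev forward.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Wr, Wh, Uz, Ur, Uh):
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # r ~ 0 ignores h_prev
    return z * h_prev + (1.0 - z) * h_tilde        # z ~ 1 copies h_prev

d = 3
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(d, d)) for _ in range(6)]
h1 = gru_step(rng.normal(size=d), np.zeros(d), *Ws)
print(h1.shape)  # (3,)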
Natural Language Processing
Q1 Which of the following techniques can be used for the purpose of keyword
normalization, the process of converting a keyword into its base form?
1. Lemmatization
2. Levenshtein
3. Stemming
4. Soundex
A) 1 and 2
B) 2 and 4
C) 1 and 3
D) 1, 2 and 3
2) N-grams are defined as contiguous sequences of N words taken together. How many bi-
grams can be generated from the given sentence:
“Analytics Vidhya is a great source to learn data science”
A) 7
B) 8
C) 9
D) 10
E) 11
Bigrams: Analytics Vidhya, Vidhya is, is a, a great, great source, source to, to
learn, learn data, data science
3) How many trigram phrases can be generated from the following sentence, after
performing the following text cleaning steps:
• Stopword Removal
• Replacing punctuations by a single space
“#Analytics-vidhya is a great source to learn @data_science.”
A) 3
B) 4
C) 5
D) 6
E) 7
After performing stopword removal and punctuation replacement the text
becomes: “Analytics vidhya great source learn data science”
Trigrams – Analytics vidhya great, vidhya great source, great source learn,
source learn data, learn data science
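Both counts above follow from the rule that a sentence of W words yields W - n + 1 word-level n-grams. A small sketch verifying them:

def ngrams(text, n):
    # Word-level n-grams: a sentence of W words yields W - n + 1 of them.
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "Analytics Vidhya is a great source to learn data science"
print(len(ngrams(sentence, 2)))  # 9 bigrams (10 words)

cleaned = "Analytics vidhya great source learn data science"
print(len(ngrams(cleaned, 3)))   # 5 trigrams (7 words)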
4) Which of the following regular expressions can be used to identify date(s) present
in the text object:
“The next meetup on data science will be held on 2017-09-21, previously it
happened on 31/03, 2016”
A) \d{4}-\d{2}-\d{2}
B) (19|20)\d{2}-(0[1-9]|1[0-2])-[0-2][1-9]
C) (19|20)\d{2}-(0[1-9]|1[0-2])-([0-2][1-9]|3[0-1])
D) None of the above
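A quick sketch checking option C, the most constrained pattern, against the text:

import re

text = ("The next meetup on data science will be held on 2017-09-21, "
        "previously it happened on 31/03, 2016")

# Option C constrains the year (19xx/20xx), month (01-12) and day ranges.
pattern = r"(19|20)\d{2}-(0[1-9]|1[0-2])-([0-2][1-9]|3[0-1])"
print(re.search(pattern, text).group())  # 2017-09-21; "31/03, 2016" does not match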
Question Context 5-6:
You have collected a dataset of about 10,000 rows of tweet text and no other
information. You want to create a tweet classification model that categorizes each of
the tweets into three buckets – positive, negative and neutral.
5) Which of the following models can perform tweet classification with regards to
context mentioned above?
A) Naive Bayes
B) SVM
C) None of the above
6) You have created a document-term matrix of the data, treating every tweet as one
document. Which of the following is correct with regard to the document-term matrix?
1. Removal of stopwords from the data will affect the dimensionality of data
2. Normalization of words in the data will reduce the dimensionality of data
3. Converting all the words in lowercase will not affect the dimensionality of the
data
A) Only 1
B) Only 2
C) Only 3
D) 1 and 2
E) 2 and 3
F) 1, 2 and 3
7) Which of the following features can be used for accuracy improvement of a
classification model?
A) Frequency count of terms
B) Vector Notation of sentence
C) Part of Speech Tag
D) Dependency Grammar
E) All of these
8) What percentage of the total statements are correct with regards to Topic
Modeling?
1. It is a supervised learning technique
2. LDA (Linear Discriminant Analysis) can be used to perform topic modeling
3. Selection of number of topics in a model does not depend on the size of data
4. Number of topic terms are directly proportional to size of the data
A) 0
B) 25
C) 50
D) 75
E) 100
9) In the Latent Dirichlet Allocation model for text classification purposes, what do
the alpha and beta hyperparameters represent?
A) Alpha: number of topics within documents, beta: number of terms within topics (False)
B) Alpha: density of terms generated within topics, beta: density of topics generated
within terms (False)
C) Alpha: number of topics within documents, beta: number of terms within topics (False)
D) Alpha: density of topics generated within documents, beta: density of terms
generated within topics (True)
10) Solve the following according to the sentence “I am planning to visit New Delhi to
attend Analytics Vidhya Delhi Hackathon”.
A = (# of words with Noun as the part of speech tag)
B = (# of words with Verb as the part of speech tag)
C = (# of words with frequency count greater than one)
What are the correct values of A, B, and C?
A) 5, 5, 2
B) 5, 5, 0
C) 7, 5, 1
D) 7, 4, 2
Nouns: I, New, Delhi, Analytics, Vidhya, Delhi, Hackathon (7)
Verbs: am, planning, visit, attend (4)
Words with frequency counts > 1: to, Delhi (2)
11) In a corpus of N documents, one document is randomly picked. The document
contains a total of T terms and the term “data” appears K times.
What is the correct value for the product of TF (term frequency) and IDF (inverse
document frequency), if the term “data” appears in approximately one-third of the
total documents?
A) KT * Log(3)
B) K * Log(3) / T
C) T * Log(3) / K
D) Log(3) / KT
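The answer follows from the usual definitions TF = K / T and IDF = log(N / df); with df ≈ N/3, the product is (K / T) * log(3), i.e. option B. A sketch with illustrative numbers (the values of K, T and N below are assumptions):

import math

def tf_idf(K, T, N):
    tf = K / T                    # term count over document length
    idf = math.log(N / (N / 3))   # term appears in ~N/3 documents
    return tf * idf               # = (K / T) * log(3), option B

print(tf_idf(K=4, T=100, N=300))  # 0.04 * log(3) ≈ 0.0439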
Question Context 12 to 14:
Refer to the following document-term matrix.
12) Which pair of documents contains the same number of terms, where that number
is not the least number of terms in any document of the corpus?
A) d1 and d4
B) d6 and d7
C) d2 and d4
D) d5 and d6
Documents d2 and d4 both contain 4 terms, and 4 is not the least number of
terms (which is 3) in the corpus.
13) Which are the most common and the rarest term of the corpus?
A) t4, t6
B) t3, t5
C) t5, t1
D) t5, t6
t5 is the most common term, appearing in 5 out of 7 documents; t6 is the rarest
term, appearing only in d3 and d4.
14) What is the term frequency of the term that is used the maximum number of times
in its document?
A) t6 – 2/5
B) t3 – 3/6
C) t4 – 2/6
D) t1 – 2/6
t3 is used the maximum number of times (3) in its document, so its term frequency is 3/6.
15) Which of the following techniques is not a part of flexible text matching?
A) Soundex
B) Metaphone
C) Edit Distance
D) Keyword Hashing
16) True or False: The Word2Vec model is a machine learning model used to create
vector notations of text objects. Word2vec contains multiple deep neural networks.
A) TRUE
B) FALSE
17) Which of the following statement is(are) true for Word2Vec model?
A) The architecture of word2vec consists of only two layers – continuous bag of
words and skip-gram model
B) Continuous bag of word (CBOW) is a Recurrent Neural Network model
C) Both CBOW and Skip-gram are shallow neural network models
D) All of the above
18) With respect to this context-free dependency graph, how many sub-trees exist
in the sentence?
A) 3
B) 4
C) 5
D) 6
19) What is the right order of components for a text classification model?
1. Text cleaning
2. Text annotation
3. Gradient descent
4. Model tuning
5. Text to predictors
A) 12345
B) 13425
C) 12534
D) 13452
20) Polysemy is defined as the coexistence of multiple meanings for a word or
phrase in a text object. Which of the following models is likely the best choice to
correct this problem?
A) Random Forest Classifier
B) Convolutional Neural Networks
C) Gradient Boosting
D) All of these
21) Which of the following models can be used for the purpose of document
similarity?
A) Training a word2vec model on the corpus that learns the context present in the
document
B) Training a bag of words model that learns occurrence of words in the document
C) Creating a document-term matrix and using cosine similarity for each document
D) All of the above
22) What are the possible features of a text corpus?
1. Count of word in a document
2. Boolean feature – presence of word in a document
3. Vector notation of word
4. Part of Speech Tag
5. Basic Dependency Grammar
6. Entire document as a feature
A) 1
B) 12
C) 123
D) 1234
E) 12345
23) While creating a machine learning model on text data, you created a document
term matrix of the input data of 100K documents. Which of the following remedies
can be used to reduce the dimensions of data –
1. Latent Dirichlet Allocation
2. Latent Semantic Indexing
3. Keyword Normalization
A) only 1
B) 2, 3
C) 1, 3
D) 1, 2, 3
24) Google Search’s feature – “Did you mean”, is a mixture of different techniques.
Which of the following techniques are likely to be ingredients?
1. Collaborative Filtering model to detect similar user behaviors (queries)
2. Model that checks for Levenshtein distance among the dictionary terms
3. Translation of sentences into multiple languages
A) 1
B) 2
C) 1, 2
D) 1, 2, 3
25) While working with text data obtained from news sentences, which are structured
in nature, which of the grammar-based text parsing techniques can be used for noun
phrase detection, verb phrase detection, subject detection and object detection?
A) Part of speech tagging
B) Dependency Parsing and Constituency Parsing
C) Skip Gram and N-Gram extraction
D) Continuous Bag of Words
26) Social media platforms are among the most intuitive sources of text data. You are
given a corpus of complete social media data of tweets. How can you create a model
that suggests hashtags?
A) Perform Topic Models to obtain most significant words of the corpus
B) Train a Bag of Ngrams model to capture top n-grams – words and their
combinations
C) Train a word2vector model to learn repeating contexts in the sentences
D) All of these
27) While working on context extraction from text data, you encountered two
different sentences: “The tank is full of soldiers.” and “The tank is full of nitrogen.”
Which of the following measures can be used to resolve the word sense
ambiguity in these sentences?
A) Compare the dictionary definition of the ambiguous word with the terms
contained in its neighborhood
B) Co-reference resolution, in which one resolves the meaning of the ambiguous word
with the proper noun present in the previous sentence
C) Use dependency parsing of the sentence to understand the meanings
28) Collaborative Filtering and Content Based Models are the two popular
recommendation engines. What role does NLP play in building such algorithms?
A) Feature Extraction from text
B) Measuring Feature Similarity
C) Engineering Features for vector space learning model
D) All of these
29) Retrieval-based models and generative models are the two popular techniques
used for building chatbots. Which of the following is an example of a retrieval model
and a generative model, respectively?
A) Dictionary based learning and Word 2 vector model
B) Rule-based learning and Sequence to Sequence model
C) Word 2 vector and Sentence to Vector model
D) Recurrent neural network and convolutional neural network
30) What is the major difference between CRF (Conditional Random Field) and
HMM (Hidden Markov Model)?
A) CRF is Generative whereas HMM is Discriminative model
B) CRF is Discriminative whereas HMM is Generative model
C) Both CRF and HMM are Generative model
D) Both CRF and HMM are Discriminative model
1. The number of False Positives(FP) and False Negatives(FN) for both systems
respectively are:
(a) System A: FP = 20,FN = 25 ; System B: FP = 15, FN = 30
(b) System A: FP = 15,FN = 30 ; System B: FP = 20, FN = 25
(c) System A: FP = 15,FN = 25 ; System B: FP = 20, FN = 30
(d) System A: FP = 30,FN = 20 ; System B: FP = 15, FN = 25
2. The Sensitivity and Specificity for System-A respectively are:
(a) Sensitivity = 0.75, Specificity = 0.80
(b) Sensitivity = 0.70, Specificity = 0.85
(c) Sensitivity = 0.75, Specificity = 0.85
(d) Sensitivity = 0.70, Specificity = 0.80
3. Which system should we use to rule out the presence of COVID-19?
(a) System-A
(b) System-B
(c) Anyone can be preferred
(d) Can’t be determined
4. If N is the number of rows/instances in the training dataset, then what is the time
complexity of the K-nearest neighbors algorithm in Big-O notation?
(a) O(1)
(b) O( N )
(c) O( log N )
(d) O( N^2 )
5. A company manager wants to predict the time before a break-down of its
production machines. As a Machine Learning student, you are asked to solve the
problem. How will you formulate it?
(a) as a classification problem statement
(b) as a regression problem statement
(c) as a clustering problem statement
(d) as an association rule-based problem statement
6. Which of the following statements are correct about the Regression line?
(a) The Regression line always goes through the mean of the data.
(b) The sum of the deviation of the values from their regression line is always zero.
(c) The sum of the squared deviation of the values from their regression line is
always minimum.
(d) If regression lines coincide with each other, then there is no correlation.
Answer: [ a, b, c ]
7. Which of the following options are incorrect about the Mahalanobis distance?
(a) It transforms the columns into correlated variables.
(b) It changes the values of the features so that the standard deviation becomes
zero.
(c) It calculates the mean and variance with the help of new columns.
(d) It includes only variances in its formula while calculating the distance
Answer: [a, b, c, d ]
Explanation: Mahalanobis distance takes Covariance into account while
calculating distances.
8. Choose the correct options for Random Variables X1 and X2:
(a) If Cov(X1, X2)=0, then the random variables X1 and X2 are independent
(b) If random variables X1 and X2 are independent, then Cov(X1, X2)=0
(c) if Cov(X1, X2)=0 and X1 and X2 are normally distributed, then X1 and X2 are
independent.
(d) If Cov(X1, X2)=0, then Corr(X1, X2)=0
Answer: [ b, c, d ]
Explanation: Independence implies zero covariance, but zero covariance does not
necessarily imply independence.
9. Which of the following statements are TRUE?
(a) Supervised learning does not require target attributes while unsupervised
learning requires it.
(b) In a supermarket, categorization of the items to be placed in aisles and on the
shelves can be an application of unsupervised learning.
(c) Sentiment analysis can be posed as a classification task, not as a clustering task.
(d) Decision trees can also be used to do clustering tasks.
Answer: [ b, d ]
10. The algorithm which can only be used when the training data are linearly
separable is:
(a) Linear hard-margin SVM
(b) Linear Logistic Regression
(c) Linear soft-margin SVM
(d) The centroid method
11. Which of the following statements are correct about the Backpropagation
Algorithm?
(a) It is also known as the Generalized delta rule
(b) In Backpropagation, error in output is propagated backward only to determine
weight updates
(c) Backpropagation learning is based on gradient descent along the surface of the
defined loss function.
(d) It is an algorithm for unsupervised learning of artificial neural networks
Answer: [ a, b, c ]
12. How many of the following statements are incorrect about the K-Means Clustering
algorithm?
(a) In presence of possible outliers in the data, one should not go for ‘complete link’
distance measures during the clustering tasks
(b) Two different runs of k-means clustering algorithms always result in the same
clusters
(c) It is always better to assign 10 to 20 iterations as a stopping criterion for k-means
clustering
(d) In k-means clustering, the number of centroids changes during the algorithm run
(e) It tries to maximize the within-class variance for a given number of clusters.
(f) It converges to the global optimum if and only if the initial means(initialization) is
chosen as some of the samples themselves.
(g) It requires the dimension of the feature space to be no bigger than the number of
samples.
Answer: [ {b, c, d, e, f, g} – 6 ]
13. Which of the following statements are correct about the characteristics of
Hierarchical clustering?
(a) It is a Merging approach
(b) Measuring distance between two clusters
(c) Divisive hierarchical clustering works in a bottom-up approach
(d) It is a semi-unsupervised clustering algorithm
Answer: [ a, b ]
14. Which of the following statements are correct about Bayesian Classification?
(a) Decision boundary in Bayesian classification depends on evidence
(b) Decision boundary in Bayesian classification depends on priors
(c) Bayes classification is a supervised machine learning algorithm
Answer: [ b, c ]
15. How many of the following statements are incorrect about neural networks?
(a) An activation function must be monotonic in neural networks
(b) The logistic function is a monotonically increasing function
(c) A non-differentiable function can not be used as an activation function
(d) They can only be trained with stochastic gradient descent
(e) Optimize a convex objective function.
(f) Can use a mix of different activation functions
Answer: [ {a, c, d, e} – 4 ]
Explanation: Neural networks can use a mix of different activation functions
like sigmoid, tanh, and RELU functions.
16. The capacity of a neural network model, i.e. the ability of the network to model a
complex function, _____________ with the increase in dropout rate.
(a) Increases
(b) Decreases
(c) Remain same
(d) First decreases and then increases
Answer: [ b ]
17. How many of the following options are TRUE about Support Vector Machines
(SVMs)?
(a) Support vectors only have non-zero Lagrangian multipliers in the formulation of
SVMs.
(b) SVMs linear discriminant function focuses on a dot product between the test point
and the support vectors.
(c) In soft margin SVM, we give freedom to model for some misclassifications.
(d) Support vectors are the data points that are closest to the decision boundary.
(e) The only training points necessary to compute f(x) in an SVM are support
vectors.
Answer: [ {a, b, c, d, e} – 5 ]
18. The linear discriminant function(classifier) with the maximum margin in SVMs is
the best since it is robust to outliers and has a strong generalization ability.
Answer: [ True ]
19. For the given Dendrogram, if you draw a horizontal line on the y-axis for y=0.50.
What will be the number of clusters formed?
(a) 1 (b) 3 (c) 4 (d) 7
Answer: [ c ]
20. How do you handle missing values or corrupted data in a dataset for categorical
variables?
(a) Drop missing rows or columns
(b) Replace missing value with the most frequent value
(c) Develop a model to predict those missing values
(d) All of the above
Q1. A neural network model is said to be inspired by the human brain.
The neural network consists of many neurons, each neuron takes an input,
processes it and gives an output. Here’s a diagrammatic representation of a real
neuron.
Which of the following statement(s) correctly represents a real neuron?
A. A neuron has a single input and a single output only
B. A neuron has multiple inputs but a single output only
C. A neuron has a single input but multiple outputs
D. A neuron has multiple inputs and multiple outputs
E. All of the above statements are valid
Q2. Below is a mathematical representation of a neuron.
The different components of the neuron are denoted as:
• x1, x2,…, xN: the inputs to the neuron. These can either be the actual
observations from the input layer or an intermediate value from one of the hidden
layers.
• w1, w2,…, wN: the weight of each input.
• bi: the bias unit, a constant value added to the input of the activation function
corresponding to each weight. It works similar to an intercept term.
• a: the activation of the neuron, which can be represented as
• y: the output of the neuron
Considering the above notations, will a line equation (y = mx + c) fall into the
category of a neuron?
A. Yes
B. No
Q3. Let us assume we implement an AND function using a single neuron. Below is a
tabular representation of an AND function:
X1 X2 X1 AND X2
0 0 0
0 1 0
1 0 0
1 1 1
The activation function of our neuron is denoted as:
What would be the weights and bias?
(Hint: For which values of w1, w2 and b does our neuron implement an AND
function?)
A. Bias = -1.5, w1 = 1, w2 = 1
B. Bias = 1.5, w1 = 2, w2 = 2
C. Bias = 1, w1 = 1.5, w2 = 1.5
D. None of these
Q4. A network is created when multiple neurons are stacked together. Let us take an
example of a neural network simulating an XNOR function.
You can see that the last neuron takes input from two neurons before it. The
activation function for all the neurons is given by:
Suppose X1 is 0 and X2 is 1, what will be the output for the above neural network?
A. 0
B. 1
Q5. In a neural network, knowing the weight and bias of each neuron is the most
important step. If you can somehow get the correct value of weight and bias for each
neuron, you can approximate any function. What would be the best way to approach
this?
A. Assign random values and pray to God they are correct
B. Search every possible combination of weights and biases till you get the best
value
C. Iteratively check, after assigning a value, how far you are from the best
values, and slightly change the assigned values to make them better
D. None of these
Q6. What are the steps for using a gradient descent algorithm?
1. Calculate error between the actual value and the predicted value
2. Reiterate until you find the best weights of network
3. Pass an input through the network and get values from output layer
4. Initialize random weight and bias
5. Go to each neuron which contributes to the error and change its respective
values to reduce the error
A. 1, 2, 3, 4, 5
B. 5, 4, 3, 2, 1
C. 3, 2, 1, 5, 4
D. 4, 3, 1, 5, 2
Q7. Suppose you have inputs as x, y, and z with values -2, 5, and -4 respectively.
You have a neuron ‘q’ and neuron ‘f’ with functions:
q = x + y
f = q * z
Graphical representation of the functions is as follows:
What is the gradient of F with respect to x, y, and z?
(HINT: To calculate gradient, you must find (df/dx), (df/dy) and (df/dz))
A. (-3,4,4)
B. (4,4,3)
C. (-4,-4,3)
D. (3,-4,-4)
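Working this through the chain rule: f = q * z gives df/dq = z and df/dz = q, and since q = x + y, df/dq passes straight through to both x and y. A sketch:

# Forward pass
x, y, z = -2, 5, -4
q = x + y   # q = 3
f = q * z   # f = -12

# Backward pass (chain rule)
df_dq = z           # -4
df_dz = q           #  3
df_dx = df_dq * 1   # dq/dx = 1 -> -4
df_dy = df_dq * 1   # dq/dy = 1 -> -4
print((df_dx, df_dy, df_dz))  # (-4, -4, 3), option C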
Q8. Now let’s revise the previous slides. We have learned that:
• A neural network is a (crude) mathematical representation of a brain, which
consists of smaller components called neurons.
• Each neuron has an input, a processing function, and an output.
• These neurons are stacked together to form a network, which can be used to
approximate any function.
• To get the best possible neural network, we can use techniques like gradient
descent to update our neural network model.
Given above is a description of a neural network. When does a neural network
model become a deep learning model?
A. When you add more hidden layers and increase depth of neural network
B. When there is higher dimensionality of data
C. When the problem is an image recognition problem
D. None of these
Q9. A neural network can be considered as multiple simple equations stacked
together. Suppose we want to replicate the function for the below mentioned
decision boundary.
Using two simple inputs h1 and h2
What will be the final equation?
A. (h1 AND NOT h2) OR (NOT h1 AND h2)
B. (h1 OR NOT h2) AND (NOT h1 OR h2)
C. (h1 AND h2) OR (h1 OR h2)
D. None of these
Q10. “Convolutional Neural Networks can perform various types of transformations
(rotations or scaling) on an input.” Is the statement correct, True or False?
A. True
B. False
Solution: (B)
Q11. Which of the following techniques perform similar operations as dropout in a
neural network?
A. Bagging
B. Boosting
C. Stacking
D. None of these
Q12. Which of the following gives non-linearity to a neural network?
A. Stochastic Gradient Descent
B. Rectified Linear Unit
C. Convolution function
D. None of the above
Q13. In training a neural network, you notice that the loss does not decrease in the
first few epochs.
The reasons for this could be:
1. The learning rate is low
2. Regularization parameter is high
3. Stuck at local minima
What according to you are the probable reasons?
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. Any of these
Q14. Which of the following is true about model capacity (where model capacity
means the ability of neural network to approximate complex functions) ?
A. As number of hidden layers increase, model capacity increases
B. As dropout ratio increases, model capacity increases
C. As learning rate increases, model capacity increases
D. None of these
Q15. If you increase the number of hidden layers in a Multi Layer Perceptron, the
classification error of test data always decreases. True or False?
A. True
B. False
Q16. You are building a neural network where it gets input from the previous layer as
well as from itself.
Which of the following architecture has feedback connections?
A. Recurrent Neural network
B. Convolutional Neural Network
C. Restricted Boltzmann Machine
D. None of these
Q17. What is the sequence of the following tasks in a perceptron?
1. Initialize weights of perceptron randomly
2. Go to the next batch of dataset
3. If the prediction does not match the output, change the weights
4. For a sample input, compute an output
A. 1, 2, 3, 4
B. 4, 3, 2, 1
C. 3, 1, 2, 4
D. 1, 4, 3, 2
Q18. Suppose that you have to minimize the cost function by changing the
parameters. Which of the following technique could be used for this?
A. Exhaustive Search
B. Random Search
C. Bayesian Optimization
D. Any of these
Q19. First Order Gradient Descent would not work correctly (i.e. may get stuck) in
which of the following graphs?
A.
B. (Ans)
C.
Q20. The below graph shows the accuracy of a trained 3-layer convolutional neural
network vs the number of parameters (i.e. number of feature kernels).
The trend suggests that as you increase the width of a neural network, the accuracy
increases till a certain threshold value, and then starts decreasing.
What could be the possible reason for this decrease?
A. Even if the number of kernels increases, only a few of them are used for prediction
B. As the number of kernels increases, the predictive power of the neural network
decreases
C. As the number of kernels increases, they start to correlate with each other,
which in turn leads to overfitting
D. None of these
Q21. Suppose we have a one-hidden-layer neural network as shown above. The
hidden layer in this network works as a dimensionality reducer. Now, instead of using
this hidden layer, we replace it with a dimensionality reduction technique such as
PCA.
Would the network that uses a dimensionality reduction technique always give the same
output as the network with the hidden layer?
A. Yes
B. No
Q22. Can a neural network model the function (y=1/x)?
A. Yes
B. No
Q23. In which neural net architecture does weight sharing occur?
A. Convolutional neural Network
B. Recurrent Neural Network
C. Fully Connected Neural Network
D. Both A and B
Q24. Batch Normalization is helpful because
A. It normalizes (changes) all the input before sending it to the next layer
B. It returns back the normalized mean and standard deviation of weights
C. It is a very efficient backpropagation technique
D. None of these
Q25. Instead of trying to achieve absolute zero error, we set a metric called Bayes
error, which is the error we hope to achieve. What could be the reason for using
Bayes error?
A. Input variables may not contain complete information about the output variable
B. System (that creates input-output mapping) may be stochastic
C. Limited training data
D. All the above
Q26. The number of neurons in the output layer should match the number of classes
(Where the number of classes is greater than 2) in a supervised learning task. True
or False?
A. True
B. False
Q27. In a neural network, which of the following techniques is used to deal with
overfitting?
A. Dropout
B. Regularization
C. Batch Normalization
D. All of these
Q28. Y = ax^2 + bx + c (polynomial equation of degree 2)
Can this equation be represented by a neural network of single hidden layer with
linear threshold?
A. Yes
B. No
Q29. What is a dead unit in a neural network?
A. A unit which is not updated during training by any of its neighbours
B. A unit which does not respond completely to any of the training patterns
C. The unit which produces the biggest sum-squared error
D. None of these
Q30. Which of the following statement is the best description of early stopping?
A. Train the network until a local minimum in the error function is reached
B. Simulate the network on a test dataset after every epoch of training. Stop
training when the generalization error starts to increase
C. Add a momentum term to the weight update in the Generalized Delta Rule, so
that training converges more quickly
D. A faster version of backpropagation, such as the ‘Quickprop’ algorithm
Q31. What if we use a learning rate that’s too large?
A. Network will converge
B. Network will not converge
C. Can’t Say
Q32. The network shown in Figure 1 is trained to recognize the characters H and T
as shown below:
What would be the output of the network?
A.
B.
C.
D. Could be A or B depending on the weights of neural network
Q33. Suppose a convolutional neural network is trained on the ImageNet dataset (an
object recognition dataset). This trained model is then given a completely white image
as an input. The output probabilities for this input would be equal for all classes. True or
False?
A. True
B. False
Q34. When a pooling layer is added in a convolutional neural network, translation
invariance is preserved. True or False?
A. True
B. False
Q35. Which gradient technique is more advantageous when the data is too big to
handle in RAM simultaneously?
A. Full Batch Gradient Descent
B. Stochastic Gradient Descent
Q36. The graph represents the gradient flow, per epoch of training, of a four-hidden-layer
neural network trained using the sigmoid activation function. The neural network
suffers from the vanishing gradient problem.
Which of the following statements is true?
A. Hidden layer 1 corresponds to D, Hidden layer 2 corresponds to C, Hidden
layer 3 corresponds to B and Hidden layer 4 corresponds to A
B. Hidden layer 1 corresponds to A, Hidden layer 2 corresponds to B, Hidden layer 3
corresponds to C and Hidden layer 4 corresponds to D
Q37. For a classification task, instead of random weight initializations in a neural
network, we set all the weights to zero. Which of the following statements is true?
A. There will not be any problem and the neural network will train properly
B. The neural network will train but all the neurons will end up recognizing the
same thing
C. The neural network will not train as there is no net gradient change
D. None of these
Q38. There is a plateau at the start. This is happening because the neural network
gets stuck at local minima before going on to global minima.
To avoid this, which of the following strategy should work?
A. Increase the number of parameters, as the network would not get stuck at local
minima
B. Decrease the learning rate by 10 times at the start and then use momentum
C. Jitter the learning rate, i.e. change the learning rate for a few epochs
D. None of these
Q39. For an image recognition problem (recognizing a cat in a photo), which
architecture of neural network would be better suited to solve the problem?
A. Multi Layer Perceptron
B. Convolutional Neural Network
C. Recurrent Neural network
D. Perceptron
Q40. Suppose while training, you encounter this issue: the error suddenly increases
after a couple of iterations.
You determine that there must be a problem with the data. You plot the data and find
that the original data is somewhat skewed, which may be causing the
problem.
What will you do to deal with this challenge?
A. Normalize
B. Apply PCA and then Normalize
C. Take Log Transform of the data
D. None of these
Q41. Which of the following is a decision boundary of Neural Network?
A) B
B) A
C) D
D) C
E) All of these
Q42. In the graph below, we observe that the error has many “ups and downs”.
Should we be worried?
A. Yes, because this means there is a problem with the learning rate of neural
network.
B. No, as long as there is a cumulative decrease in both training and validation
error, we don’t need to worry.
Q43. What are the factors to select the depth of neural network?
1. Type of neural network (eg. MLP, CNN etc)
2. Input data
3. Computation power, i.e. Hardware capabilities and software capabilities
4. Learning Rate
5. The output function to map
A. 1, 2, 4, 5
B. 2, 3, 4, 5
C. 1, 3, 4, 5
D. All of these
Q44. Consider the scenario. The problem you are trying to solve has a small amount
of data. Fortunately, you have a pre-trained neural network that was trained on a
similar problem. Which of the following methodologies would you choose to make
use of this pre-trained network?
A. Re-train the model for the new dataset
B. Assess on every layer how the model performs and only select a few of them
C. Fine tune the last couple of layers only
D. Freeze all the layers except the last, re-train the last layer
Q45. Increase in size of a convolutional kernel would necessarily increase the
performance of a convolutional network.
A. True
B. False
1) Which of the following statement is true in following case?
A) Feature F1 is an example of nominal variable.
B) Feature F1 is an example of ordinal variable.
C) It doesn’t belong to any of the above category.
D) Both of these
2) Which of the following is an example of a deterministic algorithm?
A) PCA
B) K-Means
C) None of the above
3) [True or False] A Pearson correlation between two variables is zero, but their
values can still be related to each other.
A) TRUE
B) FALSE
4) Which of the following statement(s) is / are true for Gradient Descent (GD) and
Stochastic Gradient Descent (SGD)?
1. In GD and SGD, you update a set of parameters in an iterative manner to
minimize the error function.
2. In SGD, you have to run through all the samples in your training set for a
single update of a parameter in each iteration.
3. In GD, you either use the entire data or a subset of training data to update a
parameter in each iteration.
A) Only 1
B) Only 2
C) Only 3
D) 1 and 2
5) Which of the following hyperparameter(s), when increased, may cause a random
forest to overfit the data?
1. Number of Trees
2. Depth of Tree
3. Learning Rate
A) Only 1
B) Only 2
C) Only 3
D) 1 and 2
6) Imagine, you are working with “Analytics Vidhya” and you want to develop a
machine learning algorithm which predicts the number of views on the articles.
Your analysis is based on features like author name, number of articles written by
the same author on Analytics Vidhya in the past, and a few other features. Which of the
following evaluation metrics would you choose in that case?
1. Mean Square Error
2. Accuracy
3. F1 Score
A) Only 1
B) Only 2
C) Only 3
D) 1 and 3
7) Given below are three images (1,2,3). Which of the following option is correct for
these images?
A) 1 is tanh, 2 is ReLU and 3 is SIGMOID activation functions.
B) 1 is SIGMOID, 2 is ReLU and 3 is tanh activation functions.
C) 1 is ReLU, 2 is tanh and 3 is SIGMOID activation functions.
D) 1 is tanh, 2 is SIGMOID and 3 is ReLU activation functions.
8) Below are the 8 actual values of target variable in the train file.
[0,0,0,1,1,1,1,1]
What is the entropy of the target variable?
A) -(5/8 log(5/8) + 3/8 log(3/8))
B) 5/8 log(5/8) + 3/8 log(3/8)
C) 3/8 log(5/8) + 5/8 log(3/8)
D) 5/8 log(3/8) – 3/8 log(5/8)
9) Let’s say, you are working with categorical feature(s) and you have not looked at
the distribution of the categorical variable in the test data.
You want to apply one-hot encoding (OHE) on the categorical feature(s). What
challenges may you face if you have applied OHE on a categorical variable of the train
dataset?
A) All categories of categorical variable are not present in the test dataset.
B) Frequency distribution of categories is different in train as compared to the test
dataset.
C) Train and Test always have same distribution.
D) Both A and B
10) The skip-gram model is one of the best models used in the Word2vec algorithm for
word embeddings. Which one of the following depicts the skip-gram model?
A) A
B) B
C) Both A and B
D) None of these
11) Let’s say, you are using activation function X in hidden layers of neural network.
At a particular neuron for any given input, you get the output as “-0.0001”. Which of
the following activation function could X represent?
A) ReLU
B) tanh
C) SIGMOID
D) None of these
12) [True or False] LogLoss evaluation metric can have negative values.
A) TRUE
B) FALSE
13) Which of the following statements is/are true about “Type-1” and “Type-2”
errors?
1. Type1 is known as false positive and Type2 is known as false negative.
2. Type1 is known as false negative and Type2 is known as false positive.
3. Type1 error occurs when we reject a null hypothesis when it is actually true.
A) Only 1
B) Only 2
C) Only 3
D) 1 and 3
14) Which of the following is/are one of the important step(s) to pre-process the text
in NLP based projects?
1. Stemming
2. Stop word removal
3. Object Standardization
A) 1 and 2
B) 1 and 3
C) 2 and 3
D) 1,2 and 3
15) Suppose you want to project high dimensional data into lower dimensions. The
two most famous dimensionality reduction algorithms used here are PCA and t-SNE.
Let’s say you have applied both algorithms respectively on data “X” and you got the
datasets “X_projected_PCA” , “X_projected_tSNE”.
Which of the following statements is true for “X_projected_PCA” &
“X_projected_tSNE” ?
A) X_projected_PCA will have interpretation in the nearest neighbour space.
B) X_projected_tSNE will have interpretation in the nearest neighbour space.
C) Both will have interpretation in the nearest neighbour space.
D) None of them will have interpretation in the nearest neighbour space.
Context: 16-17
Given below are three scatter plots for two features (Image 1, 2 & 3 from left to right).
16) In the above images, which of the following is/are examples of multi-collinear
features?
A) Features in Image 1
B) Features in Image 2
C) Features in Image 3
D) Features in Image 1 & 2
17) In the previous question, suppose you have identified multi-collinear features. Which
of the following action(s) would you perform next?
1. Remove both collinear variables.
2. Instead of removing both variables, we can remove only one variable.
3. Removing correlated variables might lead to loss of information. In order to
retain those variables, we can use penalized regression models like ridge or
lasso regression.
A) Only 1
B) Only 2
C) Only 3
D) Either 2 or 3
18) Adding a non-important feature to a linear regression model may result in:
1. Increase in R-square
2. Decrease in R-square
A) Only 1 is correct
B) Only 2 is correct
C) Either 1 or 2
D) None of these
19) Suppose you are given three variables X, Y and Z. The Pearson correlation
coefficients for (X, Y), (Y, Z) and (X, Z) are C1, C2 & C3 respectively.
Now, you add 2 to all values of X (i.e. the new values become X+2), subtract 2
from all values of Y (i.e. the new values are Y-2), and Z remains the same. The new
coefficients for (X,Y), (Y,Z) and (X,Z) are given by D1, D2 & D3 respectively. How do
the values of D1, D2 & D3 relate to C1, C2 & C3?
A) D1= C1, D2 < C2, D3 > C3
B) D1 = C1, D2 > C2, D3 > C3
C) D1 = C1, D2 > C2, D3 < C3
D) D1 = C1, D2 < C2, D3 < C3
E) D1 = C1, D2 = C2, D3 = C3
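Pearson correlation is computed from deviations about the mean, so adding or subtracting a constant leaves it unchanged: D1 = C1, D2 = C2, D3 = C3 (option E). A quick numerical check on synthetic data (the data here is an assumption for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
Y = 0.5 * X + rng.normal(size=100)
Z = rng.normal(size=100)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

# Shifting by a constant changes the mean, not the deviations from it.
print(np.isclose(corr(X + 2, Y - 2), corr(X, Y)))  # True
print(np.isclose(corr(Y - 2, Z), corr(Y, Z)))      # True
print(np.isclose(corr(X + 2, Z), corr(X, Z)))      # True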
20) Imagine you are solving a classification problem with a highly imbalanced class
distribution. The majority class is observed 99% of the time in the training data.
Your model has 99% accuracy after taking the predictions on test data. Which of the
following is true in such a case?
1. Accuracy metric is not a good idea for imbalanced class problems.
2. Accuracy metric is a good idea for imbalanced class problems.
3. Precision and recall metrics are good for imbalanced class problems.
4. Precision and recall metrics aren’t good for imbalanced class problems.
A) 1 and 3
B) 1 and 4
C) 2 and 3
D) 2 and 4
21) In ensemble learning, you aggregate the predictions of weak learners, so that
an ensemble of these models gives a better prediction than the predictions of the
individual models.
Which of the following statements is / are true for weak learners used in ensemble
model?
1. They don’t usually overfit.
2. They have high bias, so they cannot solve complex learning problems
3. They usually overfit.
A) 1 and 2
B) 1 and 3
C) 2 and 3
D) Only 1
22) Which of the following options is/are true for K-fold cross-validation?
1. Increase in K will result in higher time required to cross validate the result.
2. Higher values of K will result in higher confidence on the cross-validation
result as compared to lower value of K.
3. If K=N, then it is called Leave one out cross validation, where N is the number
of observations.
A) 1 and 2
B) 2 and 3
C) 1 and 3
D) 1,2 and 3
Question Context 23-24
Cross-validation is an important step in machine learning for hyperparameter tuning.
Let’s say you are tuning the hyperparameter “max_depth” of a GBM by selecting it from
10 different depth values (values greater than 2) for a tree-based model using 5-fold
cross-validation.
The time taken by the algorithm to train on 4 folds (for a model with max_depth 2) is 10
seconds, and prediction on the remaining fold takes 2 seconds.
Note: Ignore hardware dependencies from the equation.
23) Which of the following option is true for overall execution time for 5-fold cross
validation with 10 different values of “max_depth”?
A) Less than 100 seconds
B) 100 – 300 seconds
C) 300 – 600 seconds
D) More than or equal to 600 seconds
24) In the previous question, suppose you train the same algorithm to tune 2 hyperparameters,
say “max_depth” and “learning_rate”.
You want to select the right value of “max_depth” (from the given 10 depth values)
and the learning rate (from the given 5 different learning rates). In such a case, which of the
following will represent the overall time?
A) 1000-1500 seconds
B) 1500-3000 seconds
C) More than or equal to 3000 seconds
D) None of these
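The arithmetic behind both answers, under the question's assumption that every run costs at least as much as the max_depth = 2 run (deeper trees only take longer, hence the "more than or equal to" options):

# One cross-validation fold: train on 4 folds + predict on 1 fold.
fold_time = 10 + 2  # seconds, measured at max_depth = 2
folds = 5

# Q23: 10 max_depth values
print(10 * folds * fold_time)      # 600 -> "more than or equal to 600"

# Q24: 10 max_depth values x 5 learning rates = 50 combinations
print(10 * 5 * folds * fold_time)  # 3000 -> "more than or equal to 3000"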
25) Given below is a scenario for training error TE and Validation error VE for a
machine learning algorithm M1. You want to choose a hyperparameter (H) based on
TE and VE.
H TE VE
1 105 90
2 200 85
3 250 96
4 105 85
5 300 100
Which value of H will you choose based on the above table?
A) 1
B) 2
C) 3
D) 4
26) What would you do in PCA to get the same projection as SVD?
A) Transform data to zero mean
B) Transform data to zero median
C) Not possible
D) None of these
Question Context 27-28
Assume there is a black box algorithm, which takes training data with multiple
observations (t1, t2, t3,…….. tn) and a new observation (q1). The black box outputs
the nearest neighbor of q1 (say ti) and its corresponding class label ci.
You can also think of this black box algorithm as being the same as 1-NN (1-nearest
neighbor).
27) It is possible to construct a k-NN classification algorithm based on this black box
alone.
Note: Where n (number of training observations) is very large compared to k.
A) TRUE
B) FALSE
28) Instead of using the 1-NN black box, we want to use a j-NN (j>1) algorithm as the black
box. Which of the following options is correct for finding k-NN using j-NN?
1. J must be a proper factor of k
2. J > k
3. Not possible
A) 1
B) 2
C) 3
29) Suppose you are given 7 Scatter plots 1-7 (left to right) and you want to compare
Pearson correlation coefficients between variables of each scatterplot.
Which of the following is in the right order?
1. 1<2<3<4
2. 1>2>3 > 4
3. 7<6<5<4
4. 7>6>5>4
A) 1 and 3
B) 2 and 3
C) 1 and 4
D) 2 and 4
30) You can evaluate the performance of a binary class classification problem using
different metrics such as accuracy, log-loss, F-Score. Let’s say, you are using the
log-loss function as evaluation metric.
Which of the following option is / are true for interpretation of log-loss as an
evaluation metric?
1. If a classifier is confident about an incorrect classification, then log-loss will
penalise it heavily.
2. For a particular observation, the classifier assigns a very small probability for
the correct class then the corresponding contribution to the log-loss will be
very large.
3. Lower the log-loss, the better is the model.
A) 1 and 3
B) 2 and 3
C) 1 and 2
D) 1,2 and 3
Context Question 31-32
Below are five samples given in the dataset.
Note: Visual distance between the points in the image represents the actual
distance.
31) Which of the following is the leave-one-out cross-validation accuracy for 3-NN (3-
nearest neighbor)?
A) 0
B) 0.4
C) 0.8
D) 1
32) Which of the following values of K will have the least leave-one-out cross-validation
accuracy?
A) 1NN
B) 3NN
C) 4NN
D) All have same leave one out error
33) Suppose you are given the below data and you want to apply a logistic
regression model for classifying it in two given classes.
You are using logistic regression with L1 regularization.
Where C is the regularization parameter and w1 & w2 are the coefficients of x1 and
x2.
Which of the following option is correct when you increase the value of C from zero
to a very large value?
A) First w2 becomes zero and then w1 becomes zero
B) First w1 becomes zero and then w2 becomes zero
C) Both becomes zero at the same time
D) Both cannot be zero even after very large value of C
34) Suppose we have a dataset which can be trained with 100% accuracy with help
of a decision tree of depth 6. Now consider the points below and choose the option
based on these points.
Note: All other hyper parameters are same and other factors are not affected.
1. Depth 4 will have high bias and low variance
2. Depth 4 will have low bias and low variance
A) Only 1
B) Only 2
C) Both 1 and 2
D) None of the above
35) Which of the following options can be used to get global minima in k-Means
Algorithm?
1. Try to run algorithm for different centroid initialization
2. Adjust number of iterations
3. Find out the optimal number of clusters
A) 2 and 3
B) 1 and 3
C) 1 and 2
D) All of the above
36) Imagine you are working on a project which is a binary classification problem.
You trained a model on training dataset and get the below confusion matrix on
validation dataset.
Based on the above confusion matrix, choose which option(s) below will give you
correct predictions?
1. Accuracy is ~0.91
2. Misclassification rate is ~ 0.91
3. False positive rate is ~0.95
4. True positive rate is ~0.95
A) 1 and 3
B) 2 and 4
C) 1 and 4
D) 2 and 3
37) For which of the following hyperparameters, higher value is better for decision
tree algorithm?
1. Number of samples used for split
2. Depth of tree
3. Samples for leaf
A)1 and 2
B) 2 and 3
C) 1 and 3
D) 1, 2 and 3
E) Can’t say
Context 38-39
Imagine you have a 28 * 28 image and you run a 3 * 3 convolution on it
with an input depth of 3 and an output depth of 8.
Note: Stride is 1 and you are using same padding.
38) What is the dimension of the output feature map when you are using the given
parameters?
A) 28 width, 28 height and 8 depth
B) 13 width, 13 height and 8 depth
C) 28 width, 13 height and 8 depth
D) 13 width, 28 height and 8 depth
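The standard output-size formula is out = (W - K + 2P) / S + 1; “same” padding with a 3x3 kernel and stride 1 implies P = 1, so the spatial size stays 28 x 28 and the depth equals the 8 output filters (option A). A sketch:

def conv_out(size, kernel, stride=1, padding=0):
    # Spatial output size of a convolution: (W - K + 2P) / S + 1.
    return (size - kernel + 2 * padding) // stride + 1

# 'Same' padding for a 3x3 kernel at stride 1 means padding = 1.
print(conv_out(28, 3, stride=1, padding=1))  # 28 -> 28 x 28 x 8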
39) What are the dimensions of the output feature map when you are using the following
parameters?
A) 28 width, 28 height and 8 depth
B) 13 width, 13 height and 8 depth
C) 28 width, 13 height and 8 depth
D) 13 width, 28 height and 8 depth
40) Suppose we were plotting the visualization for different values of C (penalty
parameter) in the SVM algorithm. Due to some reason, we forgot to tag the C values
with the visualizations. In that case, which of the following options best explains the C
values for the images below (1, 2, 3 left to right, so the C values are C1 for image 1, C2 for
image 2 and C3 for image 3) in the case of an rbf kernel?
A) C1 = C2 = C3
B) C1 > C2 > C3
C) C1 < C2 < C3
D) None of these
1) Is the data linearly separable?
A) Yes
B) No
2) Which of the following are universal approximators?
A) Kernel SVM
B) Neural Networks
C) Boosted Decision Trees
D) All of the above
3) In which of the following applications can we use deep learning to solve the
problem?
A) Protein structure prediction
B) Prediction of chemical reactions
C) Detection of exotic particles
D) All of these
4) Which of the following statements is true when you use 1×1 convolutions in a
CNN?
A) It can help in dimensionality reduction
B) It can be used for feature pooling
C) It suffers less overfitting due to small kernel size
D) All of the above
5) Question Context:
Statement 1: It is possible to train a network well by initializing all the weights as 0
Statement 2: It is possible to train a network well by initializing biases as 0
Which of the statements given above is true?
A) Statement 1 is true while Statement 2 is false
B) Statement 2 is true while statement 1 is false
C) Both statements are true
D) Both statements are false
6) The number of nodes in the input layer is 10 and in the hidden layer is 5. The
maximum number of connections from the input layer to the hidden layer is
A) 50
B) Less than 50
C) More than 50
D) It is an arbitrary value
7) The input image has been converted into a matrix of size 28 X 28 and a
kernel/filter of size 7 X 7 with a stride of 1. What will be the size of the convoluted
matrix?
A) 22 X 22
B) 21 X 21
C) 28 X 28
D) 7 X 7
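With no padding ("valid" convolution), the same output-size formula gives (28 - 7) / 1 + 1 = 22, i.e. option A:

# Valid convolution: output = (W - K) / S + 1.
print((28 - 7) // 1 + 1)  # 22 -> a 22 x 22 convolved matrix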
8) In a simple MLP model with 8 neurons in the input layer, 5 neurons in the hidden
layer and 1 neuron in the output layer, what is the size of the weight matrices
between the hidden-output layer and the input-hidden layer?
A) [1 X 5] , [5 X 8]
B) [8 X 5] , [ 1 X 5]
C) [8 X 5] , [5 X 1]
D) [5 x 1] , [8 X 5]
11) Which of the following functions can be used as an activation function in the
output layer if we wish to predict the probabilities of n classes (p1, p2, ..., pn) such that
the sum of p over all n classes equals 1?
A) Softmax
B) ReLu
C) Sigmoid
D) Tanh
12) Assume a simple MLP model with 3 neurons and inputs = 1, 2, 3. The weights to
the input neurons are 4, 5 and 6 respectively. Assume the activation function is a
linear constant value of 3. What will be the output?
A) 32
B) 643
C) 96
D) 48
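Reading "linear constant value of 3" as the activation f(u) = 3u (an interpretation this sketch assumes), the output is 3 * (1*4 + 2*5 + 3*6) = 96, option C:

inputs = [1, 2, 3]
weights = [4, 5, 6]
pre_activation = sum(i * w for i, w in zip(inputs, weights))  # 32
output = 3 * pre_activation  # linear activation f(u) = 3u (assumed reading)
print(output)  # 96, option C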
13) Which of the following activation functions can't be used at the output layer to classify an
image?
A) sigmoid
B) Tanh
C) ReLU
D) If(x>5,1,0)
E) None of the above
14) [True | False] In a neural network, every parameter can have its own
learning rate.
A) TRUE
B) FALSE
15) Can dropout be applied at the visible (input) layer of a neural network model?
A) TRUE
B) FALSE
16) I am working with a fully connected architecture having one hidden layer with 3
neurons and one output neuron to solve a binary classification challenge. Below is
the structure of input and output:
Input dataset: [ [1,0,1,0] , [1,0,1,1] , [0,1,0,1] ]
Output: [ [1] , [1] , [0] ]
To train the model, I have initialized all weights for the hidden and output layers with 1.
Will the model be able to learn the pattern in the data?
A) Yes
B) No
17) Which of the following neural network training challenge can be solved using
batch normalization?
A) Overfitting
B) Restrict activations to become too high or low
C) Training is too slow
D) Both B and C
E) All of the above
18) Which of the following would have a constant input in each epoch of training a
Deep Learning model?
A) Weight between input and hidden layer
B) Weight between hidden and output layer
C) Biases of all hidden layer neurons
D) Activation function of output layer
E) None of the above
19) True/False: Changing Sigmoid activation to ReLu will help to get over the
vanishing gradient issue?
A) TRUE
B) FALSE
20) In a CNN, does having max pooling always decrease the parameters?
A) TRUE
B) FALSE
21) [True or False] Backpropagation cannot be applied when using pooling layers
A) TRUE
B) FALSE
22) What value would be in place of the question mark?
Here we see a convolution function being applied to the input.
A) 3
B) 4
C) 5
D) 6
23) For a binary classification problem, which of the following architecture would you
choose?
A) 1
B) 2
C) Any one of these
D) None of these
24) Suppose there is an issue while training a neural network. The training
loss/validation loss remains constant. What could be the possible reason?
A) Architecture is not defined correctly
B) Data given to the model is noisy
C) Both of these
25) The red curve above denotes training accuracy with respect to each epoch in a deep
learning algorithm. Both the green and blue curves denote validation accuracy.
Which of these indicates overfitting?
A) Green Curve
B) Blue Curve
26) Which of the following statements is true regarding dropout?
1: Dropout gives a way to approximately combine many different architectures
2: Dropout demands high learning rates
3: Dropout can help prevent overfitting
A) Both 1 and 2
B) Both 1 and 3
C) Both 2 and 3
D) All 1, 2 and 3
27) Gated Recurrent units can help prevent vanishing gradient problem in RNN.
A) True
B) False
29) [True or False] Sentiment analysis using Deep Learning is a many-to-one
prediction task
A) TRUE
B) FALSE
30) What steps can we take to prevent overfitting in a Neural Network?
A) Data Augmentation
B) Weight Sharing
C) Early Stopping
D) Dropout
E) All of the above

More Related Content

DOCX
digital-signal-processing-objective.docx
PPTX
CAT -2010 Unsolved Paper
PDF
FY-ALL-BRANCH-SEM1-BSC-311305.pdf
PDF
Ee693 questionshomework
PPT
Ch 1: Introduction and Math Concepts
PPTX
k-NN Algorithm.pptx
PDF
Mcqs first year physics notes
PPT
03raster 1
digital-signal-processing-objective.docx
CAT -2010 Unsolved Paper
FY-ALL-BRANCH-SEM1-BSC-311305.pdf
Ee693 questionshomework
Ch 1: Introduction and Math Concepts
k-NN Algorithm.pptx
Mcqs first year physics notes
03raster 1

Similar to pml.pdf (20)

PDF
GATE-ec-question-Paper-2018.pdf-82.pdf
PDF
Math paper class 12 maths paper class 12
PPTX
data science and machine learning midterm
PDF
ISI MSQE Entrance Question Paper (2011)
PDF
Aerospace Engineering (AE) - Gate Previous Question Paper 2011 Download
PDF
ISI MSQE Entrance Question Paper (2012)
DOC
NET_Solved ans
PDF
4th Semester CS / IS (2013-June) Question Papers
PDF
Design And Analysis of Algorithms MCQ-set2.pdf
DOCX
Maths questiion bank for engineering students
PDF
Ee693 sept2014midsem
DOCX
1 SUR 330 Introduction to Least Squares Adjustment Fina.docx
PDF
ISI MSQE Entrance Question Paper (2004)
PDF
D) Neither left nor right is a Euclidean distance
18) When you find noise in the data, which of the following options would you consider in k-NN?
A) I will increase the value of k
B) I will decrease the value of k
C) Noise does not depend on the value of k
D) None of these
19) In k-NN it is very likely to overfit due to the curse of dimensionality. Which of the following options would you consider to handle such a problem?
1. Dimensionality reduction
2. Feature selection
A) 1
B) 2
C) 1 and 2
D) None of these
20) Two statements are given below. Which of them is/are true?
1. k-NN is a memory-based approach: the classifier immediately adapts as we collect new training data.
2. The computational complexity for classifying new samples grows linearly with the number of samples in the training dataset in the worst-case scenario.
A) 1
B) 2
C) 1 and 2
D) None of these
21) Suppose you have been given the following images (1 left, 2 middle and 3 right). Your task is to find the value of k used in k-NN in each image, where k1 is for the 1st, k2 for the 2nd and k3 for the 3rd figure.
A) k1 > k2 > k3
B) k1 < k2
C) k1 = k2 = k3
D) None of these
22) Which value of k in the following graph would give the least leave-one-out cross-validation accuracy?
A) 1
B) 2
C) 3
D) 5
23) A company has built a k-NN classifier that gets 100% accuracy on training data. When they deployed this model on the client side, it was found to be not at all accurate. What might have gone wrong?
Note: The model was deployed successfully and no technical issues were found on the client side, apart from the model performance.
A) It is probably an overfitted model
B) It is probably an underfitted model
C) Can't say
D) None of these
24) You are given the following two statements. Which of these is/are true in the case of k-NN?
1. In case of a very large value of k, we may include points from other classes in the neighborhood.
2. In case of a too-small value of k, the algorithm is very sensitive to noise.
A) 1
B) 2
C) 1 and 2
D) None of these
25) Which of the following statements is true for k-NN classifiers?
A) The classification accuracy is better with larger values of k
B) The decision boundary is smoother with smaller values of k
C) The decision boundary is linear
D) k-NN does not require an explicit training step
26) True-False: It is possible to construct a 2-NN classifier by using the 1-NN classifier.
A) TRUE
B) FALSE
27) In k-NN, what will happen when you increase/decrease the value of k?
A) The boundary becomes smoother with increasing value of k
B) The boundary becomes smoother with decreasing value of k
C) The smoothness of the boundary does not depend on the value of k
D) None of these
28) Two statements are given below for the k-NN algorithm. Which of the statement(s) is/are true?
1. We can choose the optimal value of k with the help of cross-validation
2. Euclidean distance treats each feature as equally important
A) 1
B) 2
C) 1 and 2
D) None of these
29) What would be the time taken by 1-NN if there are N (very large) observations in the test data?
A) N*D
B) N*D*2
C) (N*D)/2
D) None of these
30) What would be the relation between the times taken by 1-NN, 2-NN and 3-NN?
A) 1-NN > 2-NN > 3-NN
B) 1-NN < 2-NN < 3-NN
C) 1-NN ~ 2-NN ~ 3-NN
D) None of these
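The distance arithmetic and the test-time cost discussed in the k-NN questions are easy to verify directly. Below is a minimal sketch — my own illustration, not code from the quiz — of the Euclidean, Manhattan and Hamming metrics plus a brute-force prediction; all function and variable names are invented for the example.

```python
import numpy as np
from collections import Counter

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))  # A(1,3), B(2,3) -> 1.0

def manhattan(a, b):
    return np.sum(np.abs(a - b))          # A(1,3), B(2,3) -> 1.0

def hamming(a, b):
    return np.sum(a != b)                 # suited to categorical features

def knn_predict(X_train, y_train, x, k=3, dist=euclidean):
    # Brute-force k-NN: O(N*D) distance work per query, which is why
    # almost all of the computation happens at test time, not train time.
    d = [dist(xi, x) for xi in X_train]
    nearest = np.argsort(d)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X = np.array([[1, 3], [2, 3], [3, 1]])
y = np.array(["+", "+", "-"])
print(euclidean(np.array([1, 3]), np.array([2, 3])))  # 1.0
print(knn_predict(X, y, np.array([1, 1])))
```

Note that k appears only in the vote over the already-sorted distances, which is why 1-NN, 2-NN and 3-NN take roughly the same time.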
Logistic Regression
1) True-False: Is logistic regression a supervised machine learning algorithm?
A) TRUE
B) FALSE
2) True-False: Is logistic regression mainly used for regression?
A) TRUE
B) FALSE
3) True-False: Is it possible to design a logistic regression algorithm using a neural network algorithm?
A) TRUE
B) FALSE
4) True-False: Is it possible to apply a logistic regression algorithm to a 3-class classification problem?
A) TRUE
B) FALSE
5) Which of the following methods do we use to best fit the data in logistic regression?
A) Least Square Error
B) Maximum Likelihood
C) Jaccard distance
D) Both A and B
6) Which of the following evaluation metrics can not be applied to a logistic regression output when comparing with the target?
A) AUC-ROC
B) Accuracy
C) Logloss
D) Mean-Squared-Error
Solution: D
7) One of the very good methods to analyze the performance of logistic regression is AIC, which is similar to R-squared in linear regression. Which of the following is true about AIC?
A) We prefer a model with minimum AIC value
B) We prefer a model with maximum AIC value
C) Both, depending on the situation
D) None of these
8) [True-False] Standardisation of features is required before training a logistic regression.
A) TRUE
B) FALSE
9) Which of the following algorithms do we use for variable selection?
A) LASSO
B) Ridge
C) Both
D) None of these
Context: 10-11
Consider the following model for logistic regression: P(y = 1 | x, w) = g(w0 + w1*x), where g(z) is the logistic function. In the above equation, P(y = 1 | x, w), viewed as a function of x, is what we obtain by changing the parameters w.
10) What would be the range of p in such a case?
A) (0, inf)
B) (-inf, 0)
C) (0, 1)
D) (-inf, inf)
11) In the above question, which function do you think keeps p between (0, 1)?
A) Logistic function
B) Log likelihood function
C) Mixture of both
D) None of them
Context: 12-13
Suppose you train a logistic regression classifier and your hypothesis function is H(x) = g(−6 + x2). (The original shows the hypothesis as a figure; the form here is taken from the solution below.)
12) Which of the following figures will represent the decision boundary given by the above classifier?
A)
B)
C)
D)
Solution: B
Option B is the right answer. Our line is represented by y = g(−6 + x2), which is shown in options A and B. Option B is correct because when you put x2 = 6 into the equation you get y = g(0), which means y = 0.5 lies on the boundary; as x2 increases beyond 6 the argument becomes positive, so the output falls in the region y = 1.
13) If you replace the coefficient of x1 with x2, what would be the output figure?
A)
B)
C)
D)
Solution: D
Same explanation as in the previous question.
14) Suppose you have been given a fair coin and you want to find the odds of getting heads. Which of the following options is true for such a case?
A) Odds will be 0
B) Odds will be 0.5
C) Odds will be 1
D) None of these
Solution: C
Odds are defined as the ratio of the probability of success to the probability of failure. For a fair coin the probability of success is 1/2 and the probability of failure is 1/2, so the odds are 1.
15) The logit function (given as l(x)) is the log of the odds function. What could be the range of the logit function on the domain x = [0, 1]?
A) (–∞, ∞)
B) (0, 1)
C) (0, ∞)
D) (–∞, 0)
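The relationships asked about in Q10-15 — the logistic function maps any real number into (0, 1), odds = p/(1-p), and logit(p) = log(odds) maps (0, 1) back onto (-inf, inf) — can be checked numerically. This is a small sketch of my own, not code from the source:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))   # output always in (0, 1), cf. Q10-11

def odds(p):
    return p / (1.0 - p)                # a fair coin (p = 0.5) has odds 1, cf. Q14

def logit(p):
    return math.log(odds(p))            # the log-odds, cf. Q15

print(logistic(0))                      # 0.5
print(odds(0.5))                        # 1.0
print(logit(0.001), logit(0.999))       # large negative / large positive
```

Pushing p toward 0 or 1 drives logit(p) toward -inf or +inf, which is why the logit's range on [0, 1] is the whole real line.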
16) Which of the following options is true?
A) Linear regression error values have to be normally distributed, but in the case of logistic regression this is not the case
B) Logistic regression error values have to be normally distributed, but in the case of linear regression this is not the case
C) Both linear regression and logistic regression error values have to be normally distributed
D) Neither linear regression nor logistic regression error values have to be normally distributed
17) Which of the following is true regarding the logistic function for any value "x"?
Note:
Logistic(x): the logistic function of any number "x"
Logit(x): the logit function of any number "x"
Logit_inv(x): the inverse logit function of any number "x"
A) Logistic(x) = Logit(x)
B) Logistic(x) = Logit_inv(x)
C) Logit_inv(x) = Logit(x)
D) None of these
18) How will the bias change on using high (infinite) regularisation?
Suppose you are given the two scatter plots "a" and "b" for two classes (blue for the positive and red for the negative class). In scatter plot "a", you correctly classified all data points using logistic regression (the black line is the decision boundary).
A) Bias will be high
B) Bias will be low
C) Can't say
D) None of these
Solution: A
The model will become very simple, so the bias will be very high.
19) Suppose you applied a logistic regression model to some data and got training accuracy X and testing accuracy Y. Now you want to add a few new features to the same data. Select the option(s) which is/are correct in such a case.
Note: Consider that the remaining parameters are the same.
A) Training accuracy increases
B) Training accuracy increases or remains the same
C) Testing accuracy decreases
D) Testing accuracy increases or remains the same
Solution: A and D
Adding more features to the model will increase the training accuracy because the model has more information with which to fit the data. Testing accuracy increases if the added feature turns out to be significant.
20) Choose which of the following options is true regarding the One-Vs-All method in logistic regression.
A) We need to fit n models in an n-class classification problem
B) We need to fit n-1 models to classify into n classes
C) We need to fit only 1 model to classify into n classes
D) None of these
21) Below are two different logistic models with different values for β0 and β1. Which of the following statement(s) is/are true about the β0 and β1 values of the two logistic models (Green, Black)?
Note: consider Y = β0 + β1*X, where β0 is the intercept and β1 is the coefficient.
A) β1 for Green is greater than Black
B) β1 for Green is lower than Black
C) β1 is the same for both models
D) Can't say
Solution: B
β0 = 0, β1 = 1 for the black curve and β0 = 0, β1 = −1 for the green curve.
Context 22-24
Below are three scatter plots (A, B, C, left to right) with hand-drawn decision boundaries for logistic regression.
22) Which of the above figures shows a decision boundary that is overfitting the training data?
A) A
B) B
C) C
D) None of these
Solution: C
Since in figure C the decision boundary is not smooth, it is over-fitting the data.
23) What do you conclude after seeing this visualization?
1. The training error in the first plot is maximum compared to the second and third plots.
2. The best model for this regression problem is the last (third) plot because it has minimum training error (zero).
3. The second model is more robust than the first and third because it will perform best on unseen data.
4. The third model is overfitting more compared to the first and second.
5. All will perform the same because we have not seen the testing data.
A) 1 and 3
B) 1 and 3
C) 1, 3 and 4
D) 5
Solution: C
The trend in the graphs looks like a quadratic trend over the independent variable X. A higher-degree polynomial (right graph) might have very high accuracy on the training population but is expected to fail badly on the test dataset. The left graph has maximum training error because it underfits the training data.
24) Suppose the above decision boundaries were generated for different values of regularization. Which of the above decision boundaries shows the maximum regularization?
A) A
B) B
C) C
D) All have equal regularization
Solution: A
More regularization means a higher penalty, which means a less complex decision boundary; that is what the first figure, A, shows.
25) The figure below shows AUC-ROC curves for three logistic regression models. Different colors show curves for different hyperparameter values. Which of the following AUC-ROC curves will give the best result?
A) Yellow
B) Pink
C) Black
D) All are the same
Solution: A
The best classifier has the largest area under the curve, and the yellow line has the largest area under the curve.
26) Suppose you are using a logistic regression model on a huge dataset. One of the problems you may face with such huge data is that logistic regression will take a very long time to train. What would you do if you want to train logistic regression on the same data in less time while getting comparatively similar (though perhaps not identical) accuracy?
A) Decrease the learning rate and decrease the number of iterations
B) Decrease the learning rate and increase the number of iterations
C) Increase the learning rate and increase the number of iterations
D) Increase the learning rate and decrease the number of iterations
Solution: D
If you decrease the number of iterations while training, it will certainly take less time, but to still reach a similar (though not exact) accuracy you need to increase the learning rate.
27) Which of the following images shows the cost function for y = 1?
The following is the loss function in logistic regression (Y-axis: loss function, X-axis: log probability) for a two-class classification problem.
Note: Y is the target class.
A) A
B) B
C) Both
D) None of these
Solution: A
A is the true answer, as the loss function decreases as the log probability increases.
28) Suppose the following graph is a cost function for logistic regression.
Now, how many local minima are present in the graph?
A) 1
B) 2
C) 3
D) 4
Solution: C
There are three local minima present in the graph.
29) Imagine you are given the below graph of logistic regression, which shows the relationship between the cost function and the number of iterations for 3 different learning rate values (different colors show curves at different learning rates).
Suppose you saved the graph for future reference but forgot to save the values of the learning rates. Now you want to find the relation between the learning rates of these curves. Which of the following will be the true relation?
Note:
1. The learning rate for blue is l1
2. The learning rate for red is l2
3. The learning rate for green is l3
A) l1 > l2 > l3
B) l1 = l2 = l3
C) l1 < l2 < l3
D) None of these
Solution: C
With a low learning rate the cost function decreases slowly, whereas with a large learning rate the cost function decreases very fast.
30) Can a logistic regression classifier do a perfect classification on the below data?
Note: You can use only the X1 and X2 variables, where X1 and X2 can take only two binary values (0, 1).
A) TRUE
B) FALSE
C) Can't say
D) None of these
Solution: B
No; logistic regression only forms a linear decision surface, and the examples in the figure are not linearly separable.
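Q26 and Q29 both turn on how the learning rate affects the descent of the cost curve. The hedged sketch below (my own code, not from the source) runs gradient descent on the logistic-regression log loss for a few learning rates on a tiny made-up dataset; with this data, larger rates reach a lower cost in the same number of steps, though a rate that is too large can diverge:

```python
import numpy as np

def train(X, y, lr, n_iter):
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)      # gradient of the log loss
    p = np.clip(1.0 / (1.0 + np.exp(-X @ w)), 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))  # final cost

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, -0.3]])
y = np.array([1, 0, 1, 0])
for lr in (0.01, 0.1, 1.0):
    print(lr, train(X, y, lr, 200))  # cost after 200 steps falls as lr grows here
```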
Linear Regression:-
1) True-False: Linear regression is a supervised machine learning algorithm.
A) TRUE
B) FALSE
2) True-False: Linear regression is mainly used for regression.
A) TRUE
B) FALSE
3) True-False: It is possible to design a linear regression algorithm using a neural network.
A) TRUE
B) FALSE
4) Which of the following methods do we use to find the best fit line for the data in linear regression?
A) Least Square Error
B) Maximum Likelihood
C) Logarithmic Loss
D) Both A and B
5) Which of the following evaluation metrics can be used to evaluate a model while modeling a continuous output variable?
A) AUC-ROC
B) Accuracy
C) Logloss
D) Mean-Squared-Error
6) True-False: Lasso regularization can be used for variable selection in linear regression.
A) TRUE
B) FALSE
Solution: A
7) Which of the following is true about residuals?
A) Lower is better
B) Higher is better
C) A or B, depending on the situation
D) None of these
8) Suppose we have N independent variables (X1, X2, ..., Xn) and the dependent variable is Y. Now imagine you are applying linear regression by fitting the best fit line using least square error on this data. You found that the correlation coefficient of one of the variables (say X1) with Y is -0.95. Which of the following is true for X1?
A) The relation between X1 and Y is weak
B) The relation between X1 and Y is strong
C) The relation between X1 and Y is neutral
D) Correlation can't judge the relationship
9) You are given two variables, V1 and V2, that follow the two characteristics below. Which of the following options is correct for the Pearson correlation between V1 and V2?
1. If V1 increases then V2 also increases
2. If V1 decreases then the behavior of V2 is unknown
A) Pearson correlation will be close to 1
B) Pearson correlation will be close to -1
C) Pearson correlation will be close to 0
D) None of these
10) Suppose the Pearson correlation between V1 and V2 is zero. In such a case, is it right to conclude that V1 and V2 do not have any relation between them?
A) TRUE
B) FALSE
11) Which of the following offsets do we use in linear regression's least square line fit? Suppose the horizontal axis is the independent variable and the vertical axis is the dependent variable.
A) Vertical offset
B) Perpendicular offset
C) Both, depending on the situation
D) None of the above
12) True-False: Overfitting is more likely when you have a huge amount of data to train on.
A) TRUE
B) FALSE
13) We can also compute the coefficients of linear regression with the help of an analytical method called the "Normal Equation". Which of the following is/are true about the Normal Equation?
1. We don't have to choose the learning rate
2. It becomes slow when the number of features is very large
3. There is no need to iterate
A) 1 and 2
B) 1 and 3
C) 2 and 3
D) 1, 2 and 3
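For Q13, the Normal Equation computes the least-squares coefficients in closed form, θ = (XᵀX)⁻¹Xᵀy, with no learning rate and no iteration; inverting XᵀX costs roughly O(p³), which is why it slows down when the number of features p is very large. A minimal sketch of my own (the toy data is invented for illustration):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # column of 1s = intercept
y = np.array([2.0, 4.1, 5.9])

theta = np.linalg.inv(X.T @ X) @ X.T @ y  # (X^T X)^(-1) X^T y
print(theta)  # [intercept, slope] = [0.10, 1.95] for this toy data
```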
14) Which of the following statements is true about the sums of residuals of A and B?
The below graphs show two fitted regression lines (A & B) on randomly generated data. I want to find the sum of residuals in both cases, A and B.
Note:
1. The scale is the same in both graphs for both axes.
2. The X-axis is the independent variable and the Y-axis is the dependent variable.
A) A has a higher sum of residuals than B
B) A has a lower sum of residuals than B
C) Both have the same sum of residuals
D) None of these
Question Context 15-17:
Suppose you have fitted a complex regression model on a dataset. Now you are using Ridge regression with penalty x.
15) Choose the option which best describes the bias.
A) In case of very large x, bias is low
B) In case of very large x, bias is high
C) We can't say anything about the bias
D) None of these
16) What will happen when you apply a very large penalty?
A) Some of the coefficients will become exactly zero
B) Some of the coefficients will approach zero but not become exactly zero
C) Both A and B, depending on the situation
D) None of these
17) What will happen when you apply a very large penalty in the case of Lasso?
A) Some of the coefficients will become zero
B) Some of the coefficients will approach zero but not become exactly zero
C) Both A and B, depending on the situation
D) None of these
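The contrast in Q16-17 is easy to see empirically: with a very large penalty, Ridge shrinks coefficients toward zero but not exactly to zero, while Lasso drives the weaker coefficients to exactly zero, which is why Lasso doubles as a variable-selection method. A hedged sketch of my own, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.randn(100)  # only 2 informative features

print(Ridge(alpha=1e6).fit(X, y).coef_)  # tiny but nonzero coefficients
print(Lasso(alpha=1.0).fit(X, y).coef_)  # weak coefficients become exactly 0.0
```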
18) Which of the following statements is true about outliers in linear regression?
A) Linear regression is sensitive to outliers
B) Linear regression is not sensitive to outliers
C) Can't say
D) None of these
19) Suppose you plotted a scatter plot between the residuals and the predicted values in linear regression and found that there is a relationship between them. What conclusion do you draw from this situation?
A) Since there is a relationship, our model is not good
B) Since there is a relationship, our model is good
C) Can't say
D) None of these
Question Context 20-22:
Suppose you have a dataset D1, you design a linear regression model with a degree-3 polynomial, and you find that the training and testing error is "0", or in other terms, it perfectly fits the data.
20) What will happen when you fit a degree-4 polynomial in linear regression?
A) There is a high chance that the degree-4 polynomial will overfit the data
B) There is a high chance that the degree-4 polynomial will underfit the data
C) Can't say
D) None of these
21) What will happen when you fit a degree-2 polynomial in linear regression?
A) There is a high chance that the degree-2 polynomial will overfit the data
B) There is a high chance that the degree-2 polynomial will underfit the data
C) Can't say
D) None of these
22) In terms of bias and variance, which of the following is true when you fit a degree-2 polynomial?
A) Bias will be high, variance will be high
B) Bias will be low, variance will be high
C) Bias will be high, variance will be low
D) Bias will be low, variance will be low
Question Context 23:
Consider the graphs below (A, B, C, left to right) between the cost function and the number of iterations.
23) Suppose l1, l2 and l3 are the three learning rates for A, B and C respectively. Which of the following is true about l1, l2 and l3?
A) l2 < l1 < l3
B) l1 > l2 > l3
C) l1 = l2 = l3
D) None of these
Question Context 24-25:
We have been given a dataset with n records in which the input attribute is x and the output attribute is y. Suppose we use a linear regression method to model this data. To test our linear regressor, we split the data into a training set and a test set randomly.
24) Now we increase the training set size gradually. As the training set size increases, what do you expect will happen to the mean training error?
A) Increase
B) Decrease
C) Remain constant
D) Can't say
25) What do you expect will happen to the bias and variance as you increase the size of the training data?
A) Bias increases and variance increases
B) Bias decreases and variance increases
C) Bias decreases and variance decreases
D) Bias increases and variance decreases
E) Can't say
Question Context 26:
Consider the following data, where one input (X) and one output (Y) are given.
26) What would be the root mean square training error for this data if you run a linear regression model of the form (Y = A0 + A1X)?
A) Less than 0
B) Greater than zero
C) Equal to 0
D) None of these
Question Context 27-28:
Suppose you have been given the following scenarios of training and validation error for linear regression.

Scenario  Learning Rate  Number of iterations  Training Error  Validation Error
1         0.1            1000                  100             110
2         0.2            600                   90              105
3         0.3            400                   110             110
4         0.4            300                   120             130
5         0.4            250                   130             150

27) Which of the following scenarios would give you the right hyperparameters?
A) 1
B) 2
C) 3
D) 4
28) Suppose you got the tuned hyperparameters from the previous question. Now imagine you add a variable to the feature space such that this added feature is important. What would you observe in such a case?
A) Training error will decrease and validation error will increase
B) Training error will increase and validation error will increase
C) Training error will increase and validation error will decrease
D) Training error will decrease and validation error will decrease
E) None of the above
Question Context 29-30:
Suppose you are in a situation where you find that your linear regression model is underfitting the data.
29) In such a situation, which of the following options would you consider?
1. Add more variables
2. Start introducing polynomial-degree variables
3. Remove some variables
A) 1 and 2
B) 2 and 3
C) 1 and 3
D) 1, 2 and 3
30) The situation is the same as in the previous question (underfitting). Which of the following regularization algorithms would you prefer?
A) L1
B) L2
C) Any
D) None of these
Support Vector Machines
Question Context: 1-2
Suppose you are using a linear SVM classifier on a 2-class classification problem. You have been given the following data, in which some points are circled red to indicate that they are support vectors.
1) If you remove any one of the red points from the data, will the decision boundary change?
A) Yes
B) No
2) [True or False] If you remove the non-red circled points from the data, the decision boundary will change.
A) True
B) False
3) What is meant by generalization error in terms of the SVM?
A) How far the hyperplane is from the support vectors
B) How accurately the SVM can predict outcomes for unseen data
C) The threshold amount of error in an SVM
4) When the C parameter is set to infinite, which of the following holds true?
A) The optimal hyperplane, if it exists, will be the one that completely separates the data
B) The soft-margin classifier will separate the data
C) None of the above
5) What is meant by a hard margin?
A) The SVM allows very low error in classification
B) The SVM allows a high amount of error in classification
C) None of the above
6) The minimum time complexity for training an SVM is O(n^2). According to this fact, what sizes of datasets are not best suited for SVMs?
A) Large datasets
B) Small datasets
C) Medium-sized datasets
D) Size does not matter
7) The effectiveness of an SVM depends upon:
A) Selection of kernel
B) Kernel parameters
C) Soft margin parameter C
D) All of the above
8) Support vectors are the data points that lie closest to the decision surface.
A) TRUE
B) FALSE
9) SVMs are less effective when:
A) The data is linearly separable
B) The data is clean and ready to use
C) The data is noisy and contains overlapping points
10) Suppose you are using an RBF kernel in an SVM with a high gamma value. What does this signify?
A) The model would consider even points far away from the hyperplane for modeling
B) The model would consider only the points close to the hyperplane for modeling
C) The model would not be affected by the distance of points from the hyperplane
D) None of the above
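Q4, Q10 and the later slack-variable questions all concern the same two knobs: C trades off misclassification against model simplicity, and the RBF gamma controls how local the kernel's influence is. The following hedged sketch (my own, assuming scikit-learn; the dataset is a toy one) shows how extreme settings change the fit:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
for C, gamma in [(0.01, 1.0), (1e6, 1.0), (1.0, 100.0)]:
    clf = SVC(C=C, gamma=gamma, kernel="rbf").fit(X, y)
    # very large C -> hard-margin behaviour: tries to classify every point correctly
    # very large gamma -> only points near the boundary matter; wiggly, overfit-prone fit
    print(C, gamma, clf.score(X, y))
```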
11) The cost parameter in the SVM means:
A) The number of cross-validations to be made
B) The kernel to be used
C) The trade-off between misclassification and simplicity of the model
D) None of the above
12) Suppose you are building an SVM model on data X. The data X can be error prone, which means that you should not trust any specific data point too much. Now suppose you want to build an SVM model with a quadratic kernel function of polynomial degree 2 that uses the slack variable C as one of its hyperparameters. Based on that, answer the following question.
What would happen when you use a very large value of C (C -> infinity)?
Note: even for a small C, the model was classifying all data points correctly.
A) We can still classify the data correctly for this setting of the hyperparameter C
B) We can not classify the data correctly for this setting of the hyperparameter C
C) Can't say
D) None of these
13) What would happen when you use a very small C (C ~ 0)?
A) Misclassification would happen
B) The data will be correctly classified
C) Can't say
D) None of these
14) If I am using all the features of my dataset and I achieve 100% accuracy on my training set but ~70% on the validation set, what should I look out for?
A) Underfitting
B) Nothing, the model is perfect
C) Overfitting
15) Which of the following are real-world applications of the SVM?
A) Text and hypertext categorization
B) Image classification
C) Clustering of news articles
D) All of the above
Question Context: 16-18
Suppose you have trained an SVM with a linear decision boundary, and after training you correctly infer that your SVM model is underfitting.
16) Which of the following options would you be more likely to consider when iterating on the SVM?
A) You want to increase your data points
B) You want to decrease your data points
C) You will try to calculate more variables
D) You will try to reduce the features
17) Suppose you gave the correct answer in the previous question. What do you think is actually happening?
1. We are lowering the bias
2. We are lowering the variance
3. We are increasing the bias
4. We are increasing the variance
A) 1 and 2
B) 2 and 3
C) 1 and 4
D) 2 and 4
18) In the above question, suppose you want to change one of the SVM's hyperparameters so that the effect is the same as in the previous questions, i.e. the model will not underfit. What would you do?
A) We will increase the parameter C
B) We will decrease the parameter C
C) Changing C has no effect
D) None of these
19) We usually use feature normalization before using the Gaussian kernel in SVM. What is true about feature normalization?
1. We do feature normalization so that no feature dominates the others
2. Sometimes feature normalization is not feasible in the case of categorical variables
3. Feature normalization always helps when we use a Gaussian kernel in SVM
A) 1
B) 1 and 2
C) 1 and 3
D) 2 and 3
Question Context: 20-22
Suppose you are dealing with a 4-class classification problem and you want to train an SVM model on the data, for which you are using the One-vs-All method. Now answer the questions below.
20) How many times do we need to train our SVM model in such a case?
A) 1
B) 2
C) 3
D) 4
21) Suppose you have the same distribution of classes in the data. Now, say that training one model in the one-vs-all setting takes 10 seconds. How many seconds would it take to train the one-vs-all method end to end?
A) 20
B) 40
C) 60
D) 80
22) Suppose your problem has now changed and the data has only 2 classes. How many times do you think we would need to train the SVM in such a case?
A) 1
B) 2
C) 3
D) 4
Question Context: 23-24
Suppose you are using an SVM with a polynomial kernel of degree 2. You have applied this to data and found that it perfectly fits the data, meaning training and testing accuracy are both 100%.
23) Now suppose you increase the complexity (the degree of the polynomial kernel). What do you think will happen?
A) Increasing the complexity will overfit the data
B) Increasing the complexity will underfit the data
C) Nothing will happen, since your model was already 100% accurate
D) None of these
24) In the previous question, after increasing the complexity, you found that the training accuracy was still 100%. In your view, what is the reason behind that?
1. Since the data is fixed and we are fitting more polynomial terms or parameters, the algorithm starts memorizing everything in the data
2. Since the data is fixed, the SVM doesn't need to search in a big hypothesis space
A) 1
B) 2
C) 1 and 2
D) None of these
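For the one-vs-all scheme of Q20-22: an n-class problem needs n binary models, so if one model takes ~10 s, a 4-class run takes ~4 × 10 = 40 s end to end, while a 2-class problem needs just a single model. A sketch of my own (scikit-learn assumed; data and names invented for illustration):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(80, 3)
y = rng.randint(0, 4, 80)  # 4 classes -> 4 binary one-vs-rest models

models = {c: LinearSVC(max_iter=10000).fit(X, (y == c).astype(int))
          for c in np.unique(y)}

def predict(x):
    # pick the class whose one-vs-rest model gives the largest decision score
    return max(models, key=lambda c: models[c].decision_function(x.reshape(1, -1))[0])

print(predict(X[0]))
```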
25) What is/are true about kernels in SVM?
1. A kernel function maps low-dimensional data to a high-dimensional space
2. It is a similarity function
A) 1
B) 2
C) 1 and 2
D) None of these
Dimensionality Reduction techniques
1) Imagine you have 1000 input features and 1 target feature in a machine learning problem. You have to select the 100 most important features based on the relationship between the input features and the target feature. Do you think this is an example of dimensionality reduction?
A. Yes
B. No
2) [True or False] It is not necessary to have a target variable for applying dimensionality reduction algorithms.
A. TRUE
B. FALSE
3) I have 4 variables in the dataset: A, B, C & D. I have performed the following actions:
Step 1: Using the above variables, I have created two more variables, namely E = A + 3*B and F = B + 5*C + D.
Step 2: Then, using only the variables E and F, I have built a Random Forest model.
Could the steps performed above represent a dimensionality reduction method?
A. True
B. False
4) Which of the following techniques would perform better for reducing the dimensions of a data set?
A. Removing columns which have too many missing values
B. Removing columns which have high variance in the data
C. Removing columns with dissimilar data trends
D. None of these
5) [True or False] Dimensionality reduction algorithms are one of the possible ways to reduce the computation time required to build a model.
A. TRUE
B. FALSE
6) Which of the following algorithms cannot be used for reducing the dimensionality of data?
A. t-SNE
B. PCA
C. LDA
D. None of these
7) [True or False] PCA can be used for projecting and visualizing data in lower dimensions.
A. TRUE
B. FALSE
8) The most popularly used dimensionality reduction algorithm is Principal Component Analysis (PCA). Which of the following is/are true about PCA?
1. PCA is an unsupervised method
2. It searches for the directions in which the data has the largest variance
3. The maximum number of principal components is <= the number of features
4. All principal components are orthogonal to each other
A. 1 and 2
B. 1 and 3
C. 2 and 3
D. 1, 2 and 3
E. 1, 2 and 4
F. All of the above
9) Suppose we are using dimensionality reduction as a pre-processing technique, i.e., instead of using all the features, we reduce the data to k dimensions with PCA and then use these PCA projections as our features. Which of the following statements is correct?
A. Higher 'k' means more regularization
B. Higher 'k' means less regularization
C. Can't say
10) In which of the following scenarios is t-SNE better to use than PCA for dimensionality reduction while working on a local machine with minimal computational power?
A. Dataset with 1 million entries and 300 features
B. Dataset with 100,000 entries and 310 features
C. Dataset with 10,000 entries and 8 features
D. Dataset with 10,000 entries and 200 features
11) Which of the following statements is true for the t-SNE cost function?
A. It is asymmetric in nature.
B. It is symmetric in nature.
C. It is the same as the cost function for SNE.
Question 12
Imagine you are dealing with text data. To represent the words you are using word embeddings (Word2vec). In a word embedding you will end up with 1000 dimensions. Now you want to reduce the dimensionality of this high-dimensional data such that similar words have similar meanings in nearest-neighbor space. In such a case, which of the following algorithms are you most likely to choose?
A. t-SNE
B. PCA
C. LDA
D. None of these
t-SNE stands for t-Distributed Stochastic Neighbor Embedding, which considers the nearest neighbours when reducing the data.
13) [True or False] t-SNE learns a non-parametric mapping.
A. TRUE
B. FALSE
14) Which of the following statements is correct for t-SNE and PCA?
A. t-SNE is linear whereas PCA is non-linear
B. t-SNE and PCA are both linear
C. t-SNE and PCA are both non-linear
D. t-SNE is non-linear whereas PCA is linear
15) In the t-SNE algorithm, which of the following hyperparameters can be tuned?
A. Number of dimensions
B. Smooth measure of the effective number of neighbours
C. Maximum number of iterations
D. All of the above
16) Which of the following statements is true about t-SNE in comparison to PCA?
A. When the data is huge (in size), t-SNE may fail to produce better results.
B. t-SNE always produces better results regardless of the size of the data.
C. PCA always performs better than t-SNE for smaller-sized data.
D. None of these
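Q14-16 contrast the two methods in practice: PCA is a fast linear projection, while t-SNE is a slower, non-linear, neighbour-preserving embedding whose main knobs are the output dimension, the perplexity (a smooth measure of the effective number of neighbours) and the number of iterations. A usage sketch of my own, assuming scikit-learn is available and using random placeholder data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.RandomState(0).randn(200, 50)          # placeholder data
X_pca = PCA(n_components=2).fit_transform(X)         # fast, linear
X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X)  # slow, non-linear
print(X_pca.shape, X_tsne.shape)                     # (200, 2) (200, 2)
```

The running-time gap is why t-SNE on a machine with minimal computational power is only reasonable for the smaller datasets in Q10.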
17) Xi and Xj are two distinct points in the higher-dimensional representation, whereas Yi & Yj are the representations of Xi and Xj in a lower dimension.
1. The similarity of datapoint Xi to datapoint Xj is the conditional probability p(j|i).
2. The similarity of datapoint Yi to datapoint Yj is the conditional probability q(j|i).
Which of the following must be true for a perfect representation of xi and xj in the lower-dimensional space?
A. p(j|i) = 0 and q(j|i) = 1
B. p(j|i) < q(j|i)
C. p(j|i) = q(j|i)
D. p(j|i) > q(j|i)
18) Which of the following is true about LDA?
A. LDA aims to maximize the between-class distance and minimize the within-class distance
B. LDA aims to minimize both the between-class and within-class distances
C. LDA aims to minimize the between-class distance and maximize the within-class distance
D. LDA aims to maximize both the between-class and within-class distances
19) In which of the following cases will LDA fail?
A. If the discriminatory information is not in the mean but in the variance of the data
B. If the discriminatory information is in the mean but not in the variance of the data
C. If the discriminatory information is in the mean and the variance of the data
D. None of these
20) Which of the following comparison(s) are true about PCA and LDA?
1. Both LDA and PCA are linear transformation techniques
2. LDA is supervised whereas PCA is unsupervised
3. PCA maximizes the variance of the data, whereas LDA maximizes the separation between different classes
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. Only 3
E. 1, 2 and 3
21) What will happen when the eigenvalues are roughly equal?
A. PCA will perform outstandingly
B. PCA will perform badly
C. Can't say
D. None of the above
22) PCA works better if there is:
1. A linear structure in the data
2. Data lying on a curved surface rather than on a flat surface
3. Variables scaled in the same unit
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. 1, 2 and 3
23) What happens when you obtain the features in a lower dimension using PCA?
1. The features will still have interpretability
2. The features will lose interpretability
3. The features must carry all the information present in the data
4. The features may not carry all the information present in the data
A. 1 and 3
B. 1 and 4
C. 2 and 3
D. 2 and 4
24) Imagine you are given the following scatterplot between height and weight. Select the angle which will capture the maximum variability along a single axis.
A. ~ 0 degrees
B. ~ 45 degrees
C. ~ 60 degrees
D. ~ 90 degrees
Solution: B
Option B has the largest possible variance in the data.
25) Which of the following option(s) is/are true?
1. You need to initialize parameters in PCA
2. You don't need to initialize parameters in PCA
3. PCA can be trapped in local minima problems
4. PCA can't be trapped in local minima problems
A. 1 and 3
B. 1 and 4
C. 2 and 3
D. 2 and 4
Question Context 26
The below snapshot shows the scatter plot of two features (X1 and X2) with the class information (Red, Blue). You can also see the directions of PCA and LDA.
26) Which of the following methods would result in better class prediction?
A. Building a classification algorithm with PCA (a principal component in the direction of PCA)
B. Building a classification algorithm with LDA
C. Can't say
D. None of these
27) Which of the following options are correct when you are applying PCA on an image dataset?
1. It can be used to effectively detect deformable objects.
2. It is invariant to affine transforms.
3. It can be used for lossy image compression.
4. It is not invariant to shadows.
A. 1 and 2
B. 2 and 3
C. 3 and 4
D. 1 and 4
28) Under which condition do SVD and PCA produce the same projection result?
A. When the data has zero median
B. When the data has zero mean
C. They are always the same
D. None of these
Question Context 29-31
Consider 3 data points in 2-d space: (-1, -1), (0, 0), (1, 1).
29) What will be the first principal component for this data?
1. [√2/2, √2/2]
2. (1/√3, 1/√3)
3. [-√2/2, √2/2]
4. (-1/√3, -1/√3)
A. 1 and 2
B. 3 and 4
C. 1 and 3
D. 2 and 4
30) If we project the original data points onto the 1-d subspace spanned by the principal component [√2/2, √2/2]^T, what are their coordinates in the 1-d subspace?
A. (−√2), (0), (√2)
B. (√2), (0), (√2)
C. (√2), (0), (−√2)
D. (−√2), (0), (−√2)
31) For the projections you just obtained ((−√2), (0), (√2)): if we represent them in the original 2-d space and consider them as the reconstruction of the original data points, what is the reconstruction error?
A. 0%
B. 10%
C. 30%
D. 40%
32) In LDA, the idea is to find the line that best separates the two classes. In the given image, which of the following is a good projection?
A. LD1
B. LD2
C. Both
D. None of these
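The numbers in Q29-31 can be verified in a few lines. In this sketch of my own: for the points (-1,-1), (0,0), (1,1) the first principal component is [√2/2, √2/2] (up to sign), the 1-d projections are -√2, 0, √2, and reconstructing from them recovers the original points exactly, i.e. 0% reconstruction error:

```python
import numpy as np

X = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])  # already zero-mean
cov = X.T @ X / len(X)
vals, vecs = np.linalg.eigh(cov)
pc1 = vecs[:, np.argmax(vals)]     # ~ [0.7071, 0.7071] (up to sign)
proj = X @ pc1                     # ~ [-1.4142, 0.0, 1.4142]
recon = np.outer(proj, pc1)        # identical to X here
print(pc1, proj, np.abs(recon - X).max())  # max reconstruction error ~ 0.0
```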
Question Context 33
PCA is a good technique to try because it is simple to understand and is commonly used to reduce the dimensionality of the data. Obtain the eigenvalues λ1 ≥ λ2 ≥ ... ≥ λN and plot them to see how f(M) increases with M and takes its maximum value of 1 at M = D. We have the two graphs given below:
33) Which of the above graphs shows better performance of PCA, where M is the number of first principal components and D is the total number of features?
A. Left
B. Right
C. Any of A and B
D. None of these
34) Which of the following options is true?
A. LDA explicitly attempts to model the difference between the classes of data; PCA, on the other hand, does not take into account any difference in class.
B. Both attempt to model the difference between the classes of data.
C. PCA explicitly attempts to model the difference between the classes of data; LDA, on the other hand, does not take into account any difference in class.
D. Neither attempts to model the difference between the classes of data.
35) Which of the following can be the first 2 principal components after applying PCA?
1. (0.5, 0.5, 0.5, 0.5) and (0.71, 0.71, 0, 0)
2. (0.5, 0.5, 0.5, 0.5) and (0, 0, -0.71, -0.71)
3. (0.5, 0.5, 0.5, 0.5) and (0.5, 0.5, -0.5, -0.5)
4. (0.5, 0.5, 0.5, 0.5) and (-0.5, -0.5, 0.5, 0.5)
A. 1 and 2
B. 1 and 3
C. 2 and 4
D. 3 and 4
36) Which of the following gives the difference(s) between logistic regression and LDA?
1. If the classes are well separated, the parameter estimates for logistic regression can be unstable.
2. If the sample size is small and the distribution of features is normal for each class, linear discriminant analysis is more stable than logistic regression.
A. 1
B. 2
C. 1 and 2
D. None of these
37) Which of the following offsets do we consider in PCA?
A. Vertical offset
B. Perpendicular offset
C. Both
D. None of these
38) Imagine you are dealing with a 10-class classification problem and you want to know at most how many discriminant vectors can be produced by LDA. What is the correct answer?
A. 20
B. 9
C. 21
D. 11
E. 10
Question Context 39
The given dataset consists of images of the "Hoover Tower" and some other towers. Now you want to use PCA (Eigenface) and the nearest neighbour method to build a classifier that predicts whether a new image depicts the "Hoover Tower" or not. The figure gives a sample of your input training images.
39) In order to get reasonable performance from the "Eigenface" algorithm, what pre-processing steps will be required on these images?
1. Align the towers in the same position in the image.
2. Scale or crop all images to the same size.
A. 1
B. 2
C. 1 and 2
D. None of these
40) What is the optimum number of principal components in the below figure?
A. 7
B. 30
C. 40
D. Can't say
Ensemble Learning
1) Which of the following is/are true about bagging trees?
1. In bagging trees, individual trees are independent of each other
2. Bagging is a method for improving performance by aggregating the results of weak learners
A) 1
B) 2
C) 1 and 2
D) None of these
2) Which of the following is/are true about boosting trees?
1. In boosting trees, individual weak learners are independent of each other
2. It is a method for improving performance by aggregating the results of weak learners
A) 1
B) 2
C) 1 and 2
D) None of these
3) Which of the following is/are true about the Random Forest and Gradient Boosting ensemble methods?
1. Both methods can be used for classification tasks
2. Random Forest is used for classification whereas Gradient Boosting is used for regression tasks
3. Random Forest is used for regression whereas Gradient Boosting is used for classification tasks
4. Both methods can be used for regression tasks
A) 1
B) 2
C) 3
D) 4
E) 1 and 4
4) In a Random Forest you can generate hundreds of trees (say T1, T2, ..., Tn) and then aggregate the results of these trees. Which of the following is true about an individual tree (Tk) in a Random Forest?
1. An individual tree is built on a subset of the features
2. An individual tree is built on all the features
3. An individual tree is built on a subset of the observations
4. An individual tree is built on the full set of observations
A) 1 and 3
B) 1 and 4
C) 2 and 3
D) 2 and 4
5) Which of the following is true about the "max_depth" hyperparameter in Gradient Boosting?
1. Lower is better, in case of the same validation accuracy
2. Higher is better, in case of the same validation accuracy
3. Increasing the value of max_depth may overfit the data
4. Increasing the value of max_depth may underfit the data
A) 1 and 3
B) 1 and 4
C) 2 and 3
D) 2 and 4
6) Which of the following algorithms doesn't use learning rate as one of its hyperparameters?
1. Gradient Boosting
2. Extra Trees
3. AdaBoost
4. Random Forest
A) 1 and 3
B) 1 and 4
C) 2 and 3
D) 2 and 4
7) Which of the following algorithms would you take into consideration in your final model building on the basis of performance?
Suppose you have been given the following graph, which shows the ROC curves for two different classification algorithms: Random Forest (red) and Logistic Regression (blue).
A) Random Forest
B) Logistic Regression
C) Both of the above
D) None of these
8) Which of the following is true about training and testing error in such a case?
Suppose you want to apply the AdaBoost algorithm on data D which has T observations. You set half the data for training and half for testing initially. Now you want to increase the number of data points for training to T1, T2, ..., Tn, where T1 < T2 < ... < Tn-1 < Tn.
A) The difference between training error and test error increases as the number of observations increases
B) The difference between training error and test error decreases as the number of observations increases
C) The difference between training error and test error will not change
D) None of these
9) In Random Forest or Gradient Boosting algorithms, features can be of any type; for example, a feature can be continuous or categorical. Which of the following options is true when you consider these types of features?
A) Only the Random Forest algorithm handles real-valued attributes by discretizing them
B) Only the Gradient Boosting algorithm handles real-valued attributes by discretizing them
C) Both algorithms can handle real-valued attributes by discretizing them
D) None of these
10) Which of the following algorithms is not an example of an ensemble learning algorithm?
A) Random Forest
B) AdaBoost
C) Extra Trees
D) Gradient Boosting
E) Decision Trees
11) Suppose you are using a bagging-based algorithm, say a Random Forest, in model building. Which of the following can be true?
1. The number of trees should be as large as possible
2. You will have interpretability after using a Random Forest
A) 1
B) 2
C) 1 and 2
D) None of these
Context 12-15
Consider the following figure for answering the next few questions. In the figure, X1 and X2 are the two features and the data points are represented by dots (-1 is the negative class and +1 is the positive class). You first split the data based on feature X1 (say the splitting point is x11), which is shown in the figure using a vertical line. Every value less than x11 will be predicted as the positive class and every value greater than x11 will be predicted as the negative class.
12) How many data points are misclassified in the above image?
A) 1
B) 2
C) 3
D) 4
13) Which of the following splitting points on feature x1 will classify the data correctly?
A) Greater than x11
B) Less than x11
C) Equal to x11
D) None of the above
Solution: D
14) If you consider only feature X2 for splitting, can you now perfectly separate the positive class from the negative class with any single split on X2?
A) Yes
B) No
15) Now consider one split on each of the two features (one on X1 and one on X2). You can split each feature at any point. Would you be able to classify all data points correctly?
A) TRUE
B) FALSE
Context 16-17
Suppose you are working on a binary classification problem with 3 input features, and you chose to apply a bagging algorithm (X) on this data. You chose max_features = 2 and n_estimators = 3. Now assume that each estimator has 70% accuracy.
Note: Algorithm X aggregates the results of the individual estimators by maximum voting.
16) What will be the maximum accuracy you can get?
A) 70%
B) 80%
C) 90%
D) 100%

Actual  M1  M2  M3  Output
1       1   0   1   1
1       1   0   1   1
1       1   0   1   1
1       0   1   1   1
1       0   1   1   1
1       0   1   1   1
1       1   1   1   1
1       1   1   0   1
1       1   1   0   1
1       1   1   0   1

17) What will be the minimum accuracy you can get?
A) Always greater than 70%
B) Always greater than or equal to 70%
C) It can be less than 70%
D) None of these

Actual  M1  M2  M3  Output
1       1   0   0   0
1       1   1   1   1
1       1   0   0   0
1       0   1   0   0
1       0   1   1   1
1       0   0   1   0
1       1   1   1   1
1       1   1   1   1
1       1   1   1   1
1       1   1   1   1
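The two voting tables above can be checked with a few lines of code. In this sketch of my own, three 70%-accurate estimators combined by majority vote reach 100% when their errors never overlap (the Q16 table) and drop to 60% when two models err together (the Q17 table):

```python
import numpy as np

def vote_accuracy(m1, m2, m3, actual):
    votes = (np.array(m1) + np.array(m2) + np.array(m3)) >= 2  # max voting
    return np.mean(votes.astype(int) == np.array(actual))

actual = [1] * 10
best = vote_accuracy([1,1,1,0,0,0,1,1,1,1], [0,0,0,1,1,1,1,1,1,1],
                     [1,1,1,1,1,1,1,0,0,0], actual)   # columns of the Q16 table
worst = vote_accuracy([1,1,1,0,0,0,1,1,1,1], [0,1,0,1,1,0,1,1,1,1],
                      [0,1,0,0,1,1,1,1,1,1], actual)  # columns of the Q17 table
print(best, worst)  # 1.0 and 0.6
```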
18) Suppose you are building a Random Forest model, which splits a node on the attribute that has the highest information gain. In the below image, select the attribute which has the highest information gain.
A) Outlook
B) Humidity
C) Windy
D) Temperature
19) Which of the following is true about Gradient Boosting trees?
1. In each stage, a new regression tree is introduced to compensate for the shortcomings of the existing model
2. We can use the gradient descent method to minimize the loss function
A) 1
B) 2
C) 1 and 2
D) None of these
20) True-False: Bagging is suitable for high-variance, low-bias models.
A) TRUE
B) FALSE
21) Which of the following is true when you choose the fraction of observations used for building the base learners in a tree-based algorithm?
A) Decreasing the fraction of samples used to build the base learners will result in a decrease in variance
B) Decreasing the fraction of samples used to build the base learners will result in an increase in variance
C) Increasing the fraction of samples used to build the base learners will result in a decrease in variance
D) Increasing the fraction of samples used to build the base learners will result in an increase in variance
Context 22-23
Suppose you are building a Gradient Boosting model on data which has millions of observations and thousands of features. Before building the model, you want to consider how the different parameter settings affect training time.
22) Consider the hyperparameter "number of trees" and arrange the options in terms of the time taken by each setting to build the Gradient Boosting model.
Note: the remaining hyperparameters are the same.
1. Number of trees = 100
2. Number of trees = 500
3. Number of trees = 1000
A) 1~2~3
B) 1<2<3
C) 1>2>3
D) None of these
23) Now consider the learning rate hyperparameter and arrange the options in terms of the time taken by each setting to build the Gradient Boosting model.
Note: the remaining hyperparameters are the same.
1. Learning rate = 1
2. Learning rate = 2
3. Learning rate = 3
A) 1~2~3
B) 1<2<3
C) 1>2>3
D) None of these
24) In Gradient Boosting it is important to use the learning rate to get optimum output. Which of the following is true about choosing the learning rate?
A) The learning rate should be as high as possible
B) The learning rate should be as low as possible
C) The learning rate should be low, but not very low
D) The learning rate should be high, but not very high
25) [True or False] Cross-validation can be used to select the number of iterations in boosting; this procedure may help reduce overfitting.
A) TRUE
B) FALSE
26) When you use a boosting algorithm you always consider weak learners. Which of the following is the main reason for using weak learners?
1. To prevent overfitting
2. To prevent underfitting
A) 1
B) 2
C) 1 and 2
D) None of these
27) To apply bagging to regression trees, which of the following is/are true in such a case?
1. We build the N regression trees with N bootstrap samples
2. We take the average of the N regression trees
3. Each tree has high variance and low bias
A) 1 and 2
B) 2 and 3
C) 1 and 3
D) 1, 2 and 3
28) How do you select the best hyperparameters in tree-based models?
A) Measure performance over the training data
B) Measure performance over the validation data
C) Both of these
D) None of these
29) In which of the following scenarios is gain ratio preferred over information gain?
A) When a categorical variable has a very large number of categories
B) When a categorical variable has a very small number of categories
C) The number of categories is not the reason
D) None of these
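For Q22-24: training time grows roughly linearly with the number of trees, while the learning rate only rescales each tree's contribution, so changing it alone leaves the training time roughly unchanged. A hedged sketch of my own, assuming scikit-learn is available:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)
for n_trees in (100, 500, 1000):       # training time: 100 < 500 < 1000 (Q22)
    model = GradientBoostingRegressor(n_estimators=n_trees,
                                      learning_rate=0.1,  # low but not very low (Q24)
                                      max_depth=3)
    model.fit(X, y)  # varying only learning_rate would barely change this time (Q23)
```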
30) Suppose you have been given the following scenarios of training and validation error for Gradient Boosting. Which of the following hyperparameters would you choose in such a case?

Scenario  Depth  Training Error  Validation Error
1         2      100             110
2         4      90              105
3         6      50              100
4         8      45              105
5         10     30              150

A) 1
B) 2
C) 3
D) 4
Decision Trees
Q1) The data scientists at "BigMart Inc" have collected 2013 sales data for 1559 products across 10 stores in different cities. Certain attributes of each product and store have also been defined. The aim is to build a predictive model and find out the sales of each product at a particular store during a defined period. Which learning problem does this belong to?
A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
D. None
Q2) Before building our model, we first look at our data and make predictions manually. Suppose we have only one feature as an independent variable (Outlet_Location_Type) along with a continuous dependent variable (Item_Outlet_Sales).

Outlet_Location_Type  Item_Outlet_Sales
Tier 1                3735.14
Tier 3                443.42
Tier 1                2097.27
Tier 3                732.38
Tier 3                994.71

We see that we can possibly differentiate Sales based on location (Tier 1 or Tier 3). We can write simple if-else statements to make predictions. Which of the following models could be used to generate predictions (though it may not be the most accurate)?
A. if "Outlet_Location" is "Tier 1": then "Outlet_Sales" is 2000, else "Outlet_Sales" is 1000
B. if "Outlet_Location" is "Tier 1": then "Outlet_Sales" is 1000, else "Outlet_Sales" is 2000
C. if "Outlet_Location" is "Tier 3": then "Outlet_Sales" is 500, else "Outlet_Sales" is 5000
D. Any of the above
Q3) The if-else statement created below is called a decision stump:
Our model: if "Outlet_Location" is "Tier 1": then "Outlet_Sales" is 2000, else "Outlet_Sales" is 1000
Now let us evaluate the model we created above on the following data:
Evaluation data:

Outlet_Location_Type  Item_Outlet_Sales
Tier 1                3735.1380
Tier 3                443.4228
Tier 1                2097.2700
Tier 3                732.3800
Tier 3                994.7052

We will calculate RMSE to evaluate this model. The root-mean-square error (RMSE) is a measure of the differences between the values predicted by a model or an estimator and the values actually observed. The formula is:
rmse = sqrt(sum(square(predicted_values - actual_values)) / number_of_observations)
What would be the RMSE value for this model?
A. ~23
B. ~824
C. ~680318
D. ~2152
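The RMSE in Q3 can be computed directly from the evaluation data above. In this small check of my own, the decision stump predicts 2000 for Tier 1 and 1000 otherwise:

```python
import math

actual = [3735.1380, 443.4228, 2097.2700, 732.3800, 994.7052]
tiers  = ["Tier 1", "Tier 3", "Tier 1", "Tier 3", "Tier 3"]
pred   = [2000 if t == "Tier 1" else 1000 for t in tiers]

rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))
print(round(rmse, 1))  # ~824.8, i.e. option B; the mean squared error is ~680318
```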
  • 55. D. ~2152 Q4) For the same data, let us evaluate our models. The root-mean-square error (RMSE) is a measure of the differences between values predicted by a model or an estimator and the values actually observed. Outlet_Location_Type Item_Outlet_Sales Tier 1 3735.1380 Tier 3 443.4228 Tier 1 2097.2700 Tier 3 732.3800 Tier 3 994.7052 The formula is : rmse = (sqrt(sum(square(predicted_values - actual_values)) / num_samples)) Which of the following will be the best model with respect to RMSE scoring? A. if “Outlet_Location_Type” is “Tier 1”: then “Outlet_Sales” is 2000, else “Outlet_Sales” is 1000 B. if “Outlet_Location_Type” is “Tier 1”: then “Outlet_Sales” is 1000, else “Outlet_Sales” is 2000 C. if “Outlet_Location_Type” is “Tier 3”: then “Outlet_Sales” is 500, else “Outlet_Sales” is 5000 D. if “Outlet_Location_Type” is “Tier 3”: then “Outlet_Sales” is 2000, else “Outlet_Sales” is 200 Q5) Now let’s take multiple features into account. Outlet_Location_Type Outlet_Type Item_Outlet_Sales Tier 1 Supermarket Type1 3735.1380 Tier 3 Supermarket Type2 443.4228 Tier 1 Supermarket Type1 2097.2700 Tier 3 Grocery Store 732.3800
  • 56. Tier 3                Supermarket Type1  994.7052

If we have multiple if-else ladders, which model is best with respect to RMSE?
A]
if “Outlet_Location_Type” is 'Tier 1': return 2500
else:
    if “Outlet_Type” is 'Supermarket Type1': return 1000
    elif “Outlet_Type” is 'Supermarket Type2': return 400
    else: return 700
B]
if "Outlet_Location_Type" is 'Tier 3': return 2500
else:
    if "Outlet_Type" is 'Supermarket Type1': return 1000
    elif "Outlet_Type" is 'Supermarket Type2': return 400
    else: return 700
C]
if "Outlet_Location_Type" is 'Tier 3': return 3000
else:
    if "Outlet_Type" is 'Supermarket Type1': return 1000
    else: return 500
D]
if "Outlet_Location_Type" is 'Tier 1': return 3000
else:
  • 57. if "Outlet_Type" is 'Supermarket Type1': return 1000 else: return 450 Solution: D A. RMSE value: 581.50 B. RMSE value: 1913.36 C. RMSE value: 2208.36 D. RMSE value: 535.75 Q6) Till now, we have just created predictions using some intuition based rules. Hence our predictions may not be optimal.What could be done to optimize the approach of finding better predictions from the given data? A. Put predictions which are the sum of all the actual values of samples present. For example, in “Tier 1”, we have two values 3735.1380 and 2097.2700, so we will take ~5832 as our prediction B. Put predictions which are the difference of all the actual values of samples present. For example, in “Tier 1”, we have two values 3735.1380 and 2097.2700, so we will take ~1638 as our prediction C. Put predictions which are mean of all the actual values of samples present. For example, in “Tier 1”, we have two values 3735.1380 and 2097.2700, so we will take ~2916 as our prediction Q7) We could improve our model by selecting the feature which gives a better prediction when we use it for splitting (It is a process of dividing a node into two or more sub-nodes). Outlet_Location_Type Item_Fat_Content Item_Outlet_Sales Tier 1 Low Fat 3735.1380 Tier 3 Regular 443.4228 Tier 1 Low Fat 2097.2700 Tier 3 Regular 732.3800 Tier 3 Low Fat 994.7052 In this example, we want to find which feature would be better for splitting root node (entire population or sample and this further gets divided into two or more homogeneous sets).
  • 58. Assume the splitting method is “Reduction in Variance”, i.e. we split using the variable which results in the lowest overall variance. What is the resulting variance if we split using Outlet_Location_Type? A. ~298676 B. ~298676 C. ~3182902 D. ~2222733 E. None of these
Q8) Next, we want to find which feature would be better for splitting the root node (where the root node represents the entire population). For this, we will again use “Reduction in Variance” as our splitting method.

Outlet_Location_Type  Item_Fat_Content  Item_Outlet_Sales
Tier 1                Low Fat           3735.1380
Tier 3                Regular           443.4228
Tier 1                Low Fat           2097.2700
Tier 3                Regular           732.3800
Tier 3                Low Fat           994.7052

The split with the lower variance is selected as the criterion to split the population. Between Outlet_Location_Type and Item_Fat_Content, which is the better feature to split on? A. Outlet_Location_Type B. Item_Fat_Content C. Will not split on either
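The reduction-in-variance numbers in Q7 and Q8 can be checked with a short script (the helper name split_variance is made up for illustration):

import numpy as np

rows = [("Tier 1", "Low Fat", 3735.1380), ("Tier 3", "Regular", 443.4228),
        ("Tier 1", "Low Fat", 2097.2700), ("Tier 3", "Regular", 732.3800),
        ("Tier 3", "Low Fat", 994.7052)]

def split_variance(feature_index):
    # Weighted average of the (population) variances of the child nodes
    groups = {}
    for row in rows:
        groups.setdefault(row[feature_index], []).append(row[2])
    n = len(rows)
    return sum(len(v) / n * np.var(v) for v in groups.values())

print(round(split_variance(0)))  # Outlet_Location_Type -> ~298676
print(round(split_variance(1)))  # Item_Fat_Content     -> ~768898 (higher, so location wins)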
  • 59. Q9) Look at the below image. The red dots represent the original data input, while the green line is the resultant model. How do you propose to make this model better while working with decision trees?
A. Let it be. The model is general enough
B. Set the number of nodes in the tree beforehand so that it does not overdo its task
C. Build a decision tree model and use a cross-validation method to tune tree parameters
D. Both B and C
E. All A, B and C
F. None of these
Q10) Which methodology does a Decision Tree (ID3) take to decide on the first split? A. Greedy approach B. Look-ahead approach C. Brute force approach D. None of these
Q11) There are 24 predictors in a dataset. You build 2 models on the dataset: 1. Bagged decision trees and 2. Random forest. Let the number of predictors used at a single split in the bagged decision tree be A and in the Random Forest be B. Which of the following statements is correct? A. A >= B B. A < B
  • 60. C. A >> B D. Cannot be said since different iterations use different numbers of predictors Q12) Why do we prefer information gain over accuracy when splitting? A. Decision Tree is prone to overfit and accuracy doesn’t help to generalize B. Information gain is more stable as compared to accuracy C. Information gain chooses more impactful features closer to root D. All of these Q13) Random forests (While solving a regression problem) have the higher variance of predicted result in comparison to Boosted Trees (Assumption: both Random Forest and Boosted Tree are fully optimized). A. True B. False C. Cannot be determined Q14) Assume everything else remains same, which of the following is the right statement about the predictions from decision tree in comparison with predictions from Random Forest? A. Lower Variance, Lower Bias B. Lower Variance, Higher Bias C. Higher Variance, Higher Bias D. Lower Bias, Higher Variance Q15) Which of the following tree based algorithm uses some parallel (full or partial) implementation? A. Random Forest B. Gradient Boosted Trees C. XGBOOST D. Both A and C E. A, B and C Q16) Which of the following could not be result of two-dimensional feature space from natural recursive binary split?
  • 61. A. 1 only B. 2 only C. 1 and 2 D. None Q17) Which of the following is not possible in a boosting algorithm? A. Increase in training error. B. Decrease in training error C. Increase in testing error D. Decrease in testing error E. Any of the above Q18) Which of the following is a decision boundary of Decision Tree? A. B
  • 62. B. A C. D D. C E. Can’t Say
Q19) Let’s say we have m estimators (trees) in a boosted tree. Now, how many intermediate trees will work on a modified (re-weighted) version of the data set? A. 1 B. m-1 C. m D. Can’t say E. None of the above
Q20) Boosted decision trees perform better than Logistic Regression on anomaly detection problems (imbalanced class problems). A. True, because they give more weight to the lesser-weighted class in successive rounds B. False, because boosted trees are based on Decision Trees, which will try to overfit the data
Q21) Provided n < N and m < M, a Bagged Decision Tree with a dataset of N rows and M columns uses ____ rows and ____ columns for training an individual intermediate tree. A. N, M B. N, m C. n, M D. n, m
Q22) Given 1000 observations, minimum observations required to split a node equal to 200, and minimum leaf size equal to 300, what could be the maximum depth of a decision tree? A. 1 B. 2 C. 3 D. 4
  • 63. E. 5
Solution: With a minimum of 200 observations required to split a node and a minimum leaf size of 300, the tree can only be grown for 2 splits, so the maximum depth is 2.
Q23) Consider a classification tree for whether a person watches ‘Game of Thrones’ based on features like age, gender, qualification and salary. Is it possible to have the following leaf node?
  • 64. A. Yes B. No C. Can’t say Q24) Generally, in terms of prediction performance which of the following arrangements are correct: A. Bagging>Boosting>Random Forest>Single Tree B. Boosting>Random Forest>Single Tree>Bagging C. Boosting>Random Forest>Bagging>Single Tree D. Boosting >Bagging>Random Forest>Single Tree Q25) In which of the following application(s), a tree based algorithm can be applied successfully? A. Recognizing moving hand gestures in real time B. Predicting next move in a chess game C. Predicting sales values of a company based on their past sales D. A and B E. A, B, and C Q26) When using Random Forest for feature selection, suppose you permute values of two features – A and B. Permutation is such that you change the indices of individual values so that they do not remain associated with the same target as before. For example:
  • 65. You notice that permuting values does not affect the score of the model built on A, whereas the score decreases for the model trained on B. Which of the two features would you select solely based on the above finding? A. (A) B. (B)
Q27) Boosting is said to be a good classifier because: A. It creates all ensemble members in parallel, so their diversity can be boosted B. It attempts to minimize the margin distribution C. It attempts to maximize the margins on the training data D. None of these
Q28) Which splitting criterion is better with a categorical variable having high cardinality? A. Information Gain B. Gain Ratio C. Change in Variance D. None of these
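Q26 describes the permutation-importance test. Below is a small sketch of that procedure on synthetic data (the dataset and names are invented for illustration), using scikit-learn:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))           # column 0 = feature A, column 1 = feature B
y = 3 * X[:, 1] + rng.normal(size=500)  # only feature B drives the target

model = RandomForestRegressor(random_state=0).fit(X, y)
baseline = model.score(X, y)

for j, name in enumerate(["A", "B"]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the feature-target link
    drop = baseline - model.score(X_perm, y)
    print(name, round(drop, 3))  # the score drops only for B, so B is the useful feature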
  • 66. Q29) There are “A” features in a dataset and a Random Forest model is built over it. It is given that only one feature – “Feature1” – is significant for the outcome. What would be the percentage of total splits that do not consider “Feature1” as one of the features involved in that split (given that m is the maximum number of features for the random forest)? Note: Random forest selects a random subset of the feature space for every node split. A. (A-m)/A B. (m-A)/m C. m/A D. Cannot be determined
Q30) Suppose we have missing values in our data. Which of the following method(s) can help us deal with missing values while building a decision tree? A. Let it be. Decision Trees are not affected by missing values B. Fill a dummy value in place of the missing one, such as -1 C. Impute the missing value with the mean/median D. All of these
Q31) To reduce underfitting of a Random Forest model, which of the following methods can be used? A. Increase the minimum sample leaf value B. Increase the depth of trees C. Increase the value of minimum samples to split D. None of these
Q32) While creating a Decision Tree, can we reuse a feature to split a node? A. Yes B. No
Q33) Which of the following is a mandatory data pre-processing step(s) for XGBOOST? 1. Impute Missing Values
  • 67. 2. Remove Outliers 3. Convert data to numeric array / sparse matrix 4. Input variable must have normal distribution 5. Select the sample of records for each tree/ estimators A. 1 and 2 B. 1, 2 and 3 C. 3, 4 and 5 D. 3 E. 5 F. All Q34) Decision Trees are not affected by multicollinearity in features: A. TRUE B. FALSE Q35) For parameter tuning in a boosting algorithm, which of the following search strategies may give best tuned model: A. Random Search. B. Grid Search. C. A or B D. Can’t say Q36) Imagine a two variable predictor space having 10 data points. A decision tree is built over it with 5 leaf nodes. The number of distinct regions that will be formed in predictors space? A. 25 B. 10 C. 2 D. 5 Q37) In Random Forest, which of the following is randomly selected? A. Number of decision trees
  • 68. B. features to be taken into account when building a tree C. samples given to train each individual tree in the forest D. B and C E. A, B and C
Q38) Which of the following are disadvantages of the Decision Tree algorithm? A. A decision tree is not easy to interpret B. A decision tree is not a very stable algorithm C. A Decision Tree will overfit the data easily if it perfectly memorizes it D. Both B and C
Q39) While tuning the parameters “Number of estimators” and “Shrinkage Parameter”/“Learning Rate” for a boosting algorithm, which of the following relationships should be kept in mind? A. Number of estimators is directly proportional to the shrinkage parameter B. Number of estimators is inversely proportional to the shrinkage parameter C. Both have a polynomial relationship
Q40) Let’s say we have m estimators (trees) in an XGBOOST model. Now, how many trees will work on a bootstrapped data set? A. 1 B. m-1 C. m D. Can’t say E. None of the above
Q41) Which of the following statements is correct about XGBOOST parameters: 1. Learning rate can go up to 10 2. Sub-sampling / row-sampling percentage should lie between 0 and 1 3. Number of trees / estimators can be 1 4. Max depth cannot be greater than 10 A. 1 B. 1 and 3
  • 69. C. 1, 3 and 4 D. 2 and 3 E. 2 F. 4
Q42) What can be the maximum depth of a decision tree (where k is the number of features and N is the number of samples)? Our constraint is that we are considering a binary decision tree with no duplicate rows in the sample (the splitting criterion is not fixed). A. N B. N – k – 1 C. N – 1 D. k – 1
Q43) Boosting is a general approach that can be applied to many statistical learning methods for regression or classification. A. True B. False
Q44) Predictions of individual trees of bagged decision trees have lower correlation in comparison to individual trees of a random forest. A. TRUE B. FALSE
Q45) Below is a list of parameters of a Decision Tree. In which of the following cases is higher better? A. Number of samples used for a split B. Depth of tree C. Samples for leaf D. Can’t Say
1. How do we perform Bayesian classification when some features are missing? (A) We assume the missing values to be the mean of all values.
  • 70. (B) We ignore the missing features. (C) We integrate the posterior probabilities over the missing features. (D) Drop the features completely.
2. Which of the following statements is False in the case of the KNN Algorithm? (A) For a very large value of K, points from other classes may be included in the neighborhood. (B) For a very small value of K, the algorithm is very sensitive to noise. (C) KNN is used only for classification problem statements. (D) KNN is a lazy learner.
3. Which of the following statements is TRUE? (A) Outliers should always be identified and removed from a dataset. (B) Outliers can never be present in the testing dataset. (C) An outlier is a data point that is significantly close to other data points. (D) The nature of our business problem determines how outliers are used.
4. The following data is used to apply a linear regression algorithm with least squares regression line Y = a1*X. Then, the approximate value of a1 is given by: (X - independent variable, Y - dependent variable) (A) 27.876 (B) 32.650 (C) 40.541 (D) 28.956

X  1  20  30  40
Y  1  400 800 1300

Hint: Use the ordinary least squares method.
5. A robotic arm should be able to paint every corner of automotive parts while minimizing the quantity of paint wasted in the process. Which learning technique is used in this problem?
  • 71. (A) Supervised Learning. (B) Unsupervised Learning. (C) Reinforcement Learning. (D) Both (A) and (B). 6. Which one of the following statements is TRUE for a Decision Tree? (A) Decision tree is only suitable for the classification problem statement. (B) In a decision tree, the entropy of a node decreases as we go down a decision tree. (C) In a decision tree, entropy determines purity. (D) Decision tree can only be used for only numeric valued and continuous attributes. 7. How do you choose the right node while constructing a decision tree? (A) An attribute having high entropy (B) An attribute having high entropy and information gain (C) An attribute having the lowest information gain. (D) An attribute having the highest information gain. 8. What kind of distance metric(s) are suitable for categorical variables to find the closest neighbors? (A) Euclidean distance. (B) Manhattan distance. (C) Minkowski distance. (D) Hamming distance.
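For Q8, Hamming distance simply counts mismatching positions between two categorical vectors, e.g.:

def hamming(a, b):
    # Number of positions at which the categorical values differ
    return sum(x != y for x, y in zip(a, b))

print(hamming(["red", "small", "round"], ["red", "large", "round"]))  # 1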
  • 72. 9. In the Naive Bayes algorithm, suppose that prior for class w1 is greater than class w2, would the decision boundary shift towards the region R1(region for deciding w1) or towards region R2(region for deciding w2)? (A) towards region R1. (B) towards region R2. (C) No shift in decision boundary. (D) It depends on the exact value of priors. 10. Which of the following statements is FALSE about Ridge and Lasso Regression? (A) These are types of regularization methods to solve the overfitting problem. (B) Lasso Regression is a type of regularization method. (C) Ridge regression shrinks the coefficient to a lower value. (D) Ridge regression lowers some coefficients to a zero value. 11. Which of the following is FALSE about Correlation and Covariance? (A) A zero correlation does not necessarily imply independence between variables. (B) Correlation and covariance values are the same. (C) The covariance and correlation are always the same sign. (D) Correlation is the standardized version of Covariance. 12. In Regression modeling we develop a mathematical equation that describes how, (Predictor-Independent variable, Response-Dependent variable) (A) one predictor and one or more response variables are related. (B) several predictors and several response variables response are related. (C) one response and one or more predictors are related. (D) All of these are correct.
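Q11's option (D) can be verified numerically: correlation is covariance divided by the product of the two standard deviations, so the two always share a sign but generally not a value. A quick numpy check (toy numbers):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.5, 5.5, 9.0])

cov = np.cov(x, y)[0, 1]                              # sample covariance
corr = cov / (np.std(x, ddof=1) * np.std(y, ddof=1))  # standardized covariance
print(cov, corr, np.corrcoef(x, y)[0, 1])             # corr matches np.corrcoef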
  • 73. 13. True or False: In a naive Bayes algorithm, when an attribute value in the testing record has no example in the training set, then the entire posterior probability will be zero. (A) True (B) False (C) Can’t be determined (D) None of these.
14. Which of the following is NOT True about Ensemble Techniques? (A) Bagging decreases the variance of the classifier. (B) Boosting helps to decrease the bias of the classifier. (C) Bagging combines the predictions from different models and then finally gives the results. (D) Bagging and Boosting are the only available ensemble techniques.
15. Which of the following statements is TRUE about the Bayes classifier? (A) Bayes classifier works on the Bayes theorem of probability. (B) Bayes classifier is an unsupervised learning algorithm. (C) Bayes classifier is also known as maximum apriori classifier. (D) It assumes the independence between the independent variables or features.
16. Which of the following SGD variants is based on both momentum and adaptive learning? (A) RMSprop. (B) Adagrad. (C) Adam. (D) Nesterov.
17. Which of the following activation functions has a zero-centered output? (A) Hyperbolic Tangent. (B) Sigmoid.
  • 74. (C) Softmax. (D) Rectified Linear unit(ReLU). 18. Which of the following is FALSE about Radial Basis Function Neural Network? (A) It resembles Recurrent Neural Networks(RNNs) which have feedback loops. (B) It uses radial basis function as activation function. (C) While outputting, it considers the distance of a point with respect to the center. (D) The output given by the Radial basis function is always an absolute value. 19. In which of the following situations, you should NOT prefer Keras over TensorFlow? (A) When you want to quickly build a prototype using neural networks. (B) When you want to implement simple neural networks in your initial learning phase. (C) When you are doing critical and intensive research in any field. (D) When you want to create simple tutorials for your students and friends. 20. Which of the following is FALSE about Deep Learning and Machine Learning algorithms? (A) Deep Learning algorithms work efficiently on a high amount of data. (B) Feature Extraction needs to be done manually in both ML and DL algorithms. (C) Deep Learning algorithms are best suited for unstructured data. (D) Deep Learning algorithms require high computational power. 21. Which of the following is FALSE for neural networks? (A) Artificial neurons are similar in operation to biological neurons. (B) Training time for a neural network depends on network size.
  • 75. (C) Neural networks can be simulated on conventional computers. (D) The basic unit of neural networks is the neuron.
22. Which of the following logic functions cannot be implemented by a perceptron having 2 inputs? (A) AND. (B) OR. (C) NOR. (D) XOR.
23. Inappropriate selection of the learning rate value in gradient descent gives rise to: (A) Local Minima. (B) Oscillations. (C) Slow convergence. (D) All of the above. Answer: Option-D
24. What will be the output of the following code?
import numpy as np
n_array = np.array([1, 0, 2, 0, 3, 0, 0, 5, 6, 7, 5, 0, 8])
res = np.where(n_array == 0)[0]
print(res.sum())
(A) 25 (B) 26 (C) 6 (D) None of these
25. What will be the output of the following code?
import numpy as np
p = [[1, 0], [0, 1]]
q = [[1, 2], [3, 4]]
result1 = np.cross(p, q)
result2 = np.cross(q, p)
print((result1 == result2).shape[0])
(A) 0 (B) 1 (C) 2 (D) Code is not executable.
  • 76. 26. What will be the output of the following code?
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(2))
print(s.size)
(A) 0 (B) 1 (C) 2 (D) Answer not fixed due to randomness.
27. What will be the output of the following code?
import numpy as np
student_id = np.array([1023, 5202, 6230, 1671, 1682, 5241, 4532])
i = np.argsort(student_id)
print(i[5])
(A) 2 (B) 3 (C) 4 (D) 5
28. What will be the output of the following code?
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(4))
print(s.ndim)
(A) 1 (B) 2 (C) 0 (D) 3
29. What will be the output of the following code?
import numpy as np
my_array = np.arange(6).reshape(2,3)
result = np.trace(my_array)
print(result)
(A) 2 (B) 4 (C) 6 (D) 8
  • 77. 30. What will be the output of the following code?
import numpy as np
from numpy import linalg
a = np.array([[1, 0], [1, 2]])
print(type(np.linalg.det(a)))
(A) INT (B) FLOAT (C) STR (D) BOOL.
Q1. Which of the following algorithms is not an example of an ensemble method? A. Extra Tree Regressor B. Random Forest C. Gradient Boosting D. Decision Tree
Q2. What is true about an ensembled classifier? 1. Classifiers that are more “sure” can vote with more conviction 2. Classifiers can be more “sure” about a particular part of the space 3. Most of the time, it performs better than a single classifier A. 1 and 2 B. 1 and 3 C. 2 and 3 D. All of the above
Q3. Which of the following option(s) is/are correct regarding the benefits of an ensemble model? 1. Better performance 2. Generalized models 3. Better interpretability A. 1 and 3 B. 2 and 3 C. 1 and 2 D. 1, 2 and 3
Q4) Which of the following can be true for selecting base learners for an ensemble? 1. Different learners can come from the same algorithm with different hyperparameters 2. Different learners can come from different algorithms 3. Different learners can come from different training spaces A. 1 B. 2
  • 78. C. 1 and 3 D. 1, 2 and 3 Q5. True or False: Ensemble learning can only be applied to supervised learning methods. A. True B. False Q6. True or False: Ensembles will yield bad results when there is significant diversity among the models. Note: All individual models have meaningful and good predictions. A. True B. False Q7. Which of the following is / are true about weak learners used in ensemble model? 1. They have low variance and they don’t usually overfit 2. They have high bias, so they can not solve hard learning problems 3. They have high variance and they don’t usually overfit A. 1 and 2 B. 1 and 3 C. 2 and 3 D. None of these Q8.True or False: Ensemble of classifiers may or may not be more accurate than any of its individual model. A. True B. False Q9. If you use an ensemble of different base models, is it necessary to tune the hyper parameters of all base models to improve the ensemble performance? A. Yes B. No C. can’t say
  • 79. Q10. Generally, an ensemble method works better, if the individual base models have ____________? Note: Suppose each individual base models have accuracy greater than 50%. A. Less correlation among predictions B. High correlation among predictions C. Correlation does not have any impact on ensemble output D. None of the above Context – Question 11 In an election, N candidates are competing against each other and people are voting for either of the candidates. Voters don’t communicate with each other while casting their votes. Q.11 Which of the following ensemble method works similar to above-discussed election procedure? Hint: Persons are like base models of ensemble method. A. Bagging B. Boosting C. A Or B D. None of these Q12. Suppose you are given ‘n’ predictions on test data by ‘n’ different models (M1, M2, …. Mn) respectively. Which of the following method(s) can be used to combine the predictions of these models? Note: We are working on a regression problem 1. Median 2. Product 3. Average 4. Weighted sum 5. Minimum and Maximum 6. Generalized mean rule A. 1, 3 and 4 B. 1,3 and 6 C. 1,3, 4 and 6 D. All of above
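A brief sketch (toy numbers, purely illustrative) of some of the combination rules from Q12 for a regression ensemble:

import numpy as np

# Predictions from n different models for the same test points
preds = np.array([[2.0, 3.1, 2.7],   # model M1
                  [2.4, 2.9, 3.0],   # model M2
                  [1.9, 3.4, 2.6]])  # model M3

print(np.mean(preds, axis=0))    # simple average
print(np.median(preds, axis=0))  # median
weights = np.array([0.5, 0.3, 0.2])
print(weights @ preds)           # weighted sum (weights add up to 1)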
  • 80. Context: Questions 13-14. Suppose you are working on a binary classification problem, and there are 3 models, each with 70% accuracy.
Q13. If you want to ensemble these models using the majority voting method, what will be the maximum accuracy you can get? A. 100% B. 78.38% C. 44% D. 70%
Refer to the below table for models M1, M2 and M3.

Actual output  M1  M2  M3  Output
1              1   0   1   1
1              1   0   1   1
1              1   0   1   1
1              0   1   1   1
1              0   1   1   1
1              0   1   1   1
1              1   1   1   1
1              1   1   0   1
1              1   1   0   1
1              1   1   0   1

Q14. If you want to ensemble these models using majority voting, what will be the minimum accuracy you can get? A. Always greater than 70% B. Always greater than or equal to 70% C. It can be less than 70% D. None of these
Refer to the below table for models M1, M2 and M3.

Actual output  M1  M2  M3  Output
1              1   0   0   0
1              1   1   1   1
1              1   0   0   0
1              0   1   0   0
  • 81. 1              0   1   1   1
1              0   0   1   0
1              1   1   1   1
1              1   1   1   1
1              1   1   1   1
1              1   1   1   1

Q15. How can we assign the weights to the outputs of different models in an ensemble? 1. Use an algorithm to return the optimal weights 2. Choose the weights using cross validation 3. Give high weights to more accurate models A. 1 and 2 B. 1 and 3 C. 2 and 3 D. All of the above
Q16. Which of the following is true about an averaging ensemble? A. It can only be used in classification problems B. It can only be used in regression problems C. It can be used in both classification and regression D. None of these
Context: Question 17. Suppose you are given predictions on 4 test observations: predictions = [0.2, 0.5, 0.33, 0.8]. Which of the following will be the ranked average output for these predictions? Hint: You are using min-max scaling. A. [ 0., 0.66666667, 0.33333333, 1. ] B. [ 0.1210, 0.66666667, 0.95, 0.33333333 ] C. [ 0.1210, 0.66666667, 0.33333333, 0.95 ] D. None of the above
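The rank averaging in Q17 ranks the predictions and then min-max scales the ranks to [0, 1]; a quick check:

import numpy as np

preds = np.array([0.2, 0.5, 0.33, 0.8])
ranks = preds.argsort().argsort().astype(float)  # 0-based ranks: [0, 2, 1, 3]
scaled = (ranks - ranks.min()) / (ranks.max() - ranks.min())
print(scaled)  # [0.         0.66666667 0.33333333 1.        ] -> option A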
  • 82. Q18. In the above snapshot, lines A and B are the predictions of 2 models (M1 and M2 respectively). Now, you want to apply an ensemble which aggregates the results of these two models using weighted averaging. Which of the following lines is more likely to be the output of this ensemble if you give weights 0.7 and 0.3 to models M1 and M2 respectively? A) A B) B C) C D) D E) E
Q19. Which of the following is true about weighted majority votes? 1. We want to give higher weights to better performing models 2. Inferior models can overrule the best model if the collective weighted votes for the inferior models are higher than for the best model 3. Voting is a special case of weighted voting A. 1 and 3 B. 2 and 3 C. 1 and 2 D. 1, 2 and 3 E. None of the above
Context – Questions 20-21. Suppose in a classification problem, you have the following probabilities from three models M1, M2, M3 for five observations of the test data set.

M1   M2   M3   Output
.70  .80  .75
  • 83. .50  .64  .80
.30  .20  .35
.49  .51  .50
.60  .80  .60

Q20. Which of the following will be the predicted category for these observations if you apply a probability threshold greater than or equal to 0.5 for category “1”, or less than 0.5 for category “0”? Note: You are applying the averaging method to ensemble the predictions given by the three models.
A.
M1   M2   M3   Output
.70  .80  .75  1
.50  .64  .80  1
.30  .20  .35  0
.49  .51  .50  0
.60  .80  .60  1
B.
M1   M2   M3   Output
.70  .80  .75  1
.50  .64  .80  1
.30  .20  .35  0
.49  .51  .50  1
.60  .80  .60  1
C.
M1   M2   M3   Output
.70  .80  .75  1
.50  .64  .80  1
.30  .20  .35  1
.49  .51  .50  0
.60  .80  .60  0
D. None of these
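Q20's averaging rule can be checked in a few lines (illustrative code):

import numpy as np

probs = np.array([[.70, .80, .75],
                  [.50, .64, .80],
                  [.30, .20, .35],
                  [.49, .51, .50],
                  [.60, .80, .60]])

avg = probs.mean(axis=1)           # average the three models' probabilities
labels = (avg >= 0.5).astype(int)  # threshold at 0.5
print(avg.round(2), labels)        # [0.75 0.65 0.28 0.5 0.67] -> [1 1 0 1 1], option B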
  • 84. Q21: Which of the following will be the predicted category for these observations if you apply a probability threshold greater than or equal to 0.5 for category “1”, or less than 0.5 for category “0”?
A.
M1   M2   M3   Output
.70  .80  .75  1
.50  .64  .80  1
.30  .20  .35  0
.49  .51  .50  0
.60  .80  .60  1
B.
M1   M2   M3   Output
.70  .80  .75  1
.50  .64  .80  1
.30  .20  .35  0
.49  .51  .50  1
.60  .80  .60  1
C.
M1   M2   M3   Output
.70  .80  .75  1
.50  .64  .80  1
.30  .20  .35  1
.49  .51  .50  0
.60  .80  .60  0
D. None of these
Context: Questions 22-23. Suppose in a binary classification problem, you are given the following predictions of three models (M1, M2, M3) for five observations of the test data set.

M1  M2  M3  Output
1   1   0
0   1   0
0   1   1
1   0   1
1   1   1
  • 85. Q22: Which of the following will be the output of the ensemble model if we are using the majority voting method?
A.
M1  M2  M3  Output
1   1   0   0
0   1   0   1
0   1   1   0
1   0   1   0
1   1   1   1
B.
M1  M2  M3  Output
1   1   0   1
0   1   0   0
0   1   1   1
1   0   1   1
1   1   1   1
C.
M1  M2  M3  Output
1   1   0   1
0   1   0   0
0   1   1   1
1   0   1   0
1   1   1   1
D. None of these
Q23. When using the weighted voting method, which of the following will be the output of the ensemble model? Hint: Count the votes of M1, M2, M3 as 2.5 times, 6.5 times and 3.5 times respectively.
A.
M1  M2  M3  Output
1   1   0   0
0   1   0   1
0   1   1   0
1   0   1   0
1   1   1   1
  • 86. B.
M1  M2  M3  Output
1   1   0   1
0   1   0   0
0   1   1   1
1   0   1   1
1   1   1   1
C.
M1  M2  M3  Output
1   1   0   1
0   1   0   1
0   1   1   1
1   0   1   0
1   1   1   1
D. None of these
Q24. Which of the following are correct statement(s) about stacking? 1. A machine learning model is trained on the predictions of multiple machine learning models 2. A Logistic Regression will definitely work better in the second stage as compared to other classification methods 3. First-stage models are trained on the full / partial feature space of the training data A. 1 and 2 B. 2 and 3 C. 1 and 3 D. All of the above
Q25. Which of the following are advantages of stacking? 1. More robust model 2. Better prediction 3. Lower time of execution A. 1 and 2
  • 87. B. 2 and 3 C. 1 and 3 D. All of the above
Q26: Which of the following figures represents stacking? A. B. C. None of these
Solution: (A)
Q27. Which of the following can be one of the steps in stacking? 1. Divide the training data into k folds 2. Train k models, each on k-1 folds, and get the out-of-fold predictions for the remaining fold 3. Divide the test data set into “k” folds and get individual fold predictions from different algorithms A. 1 and 2 B. 2 and 3
  • 88. C. 1 and 3 D. All of above Q28. Which of the following is the difference between stacking and blending? A. Stacking has less stable CV compared to Blending B. In Blending, you create out of fold prediction C. Stacking is simpler than Blending D. None of these Q29. Suppose you are using stacking with n different machine learning algorithms with k folds on data. Which of the following is true about one level (m base models + 1 stacker) stacking? Note:  Here, we are working on binary classification problem  All base models are trained on all features  You are using k folds for base models A. You will have only k features after the first stage B. You will have only m features after the first stage C. You will have k+m features after the first stage D. You will have k*n features after the first stage E. None of the above Q30. Which of the following is true about bagging? 1. Bagging can be parallel 2. The aim of bagging is to reduce bias not variance 3. Bagging helps in reducing overfitting A. 1 and 2 B. 2 and 3 C. 1 and 3 D. All of these Q31.True or False: In boosting, individual base learners can be parallel. A. True B. False
  • 89. Q32. Below are two ensemble models: 1. E1(M1, M2, M3) and 2. E2(M4, M5, M6), where Mx are the individual base models. Which one are you more likely to choose if the following conditions for E1 and E2 are given? E1: Individual model accuracies are high but the models are of the same type, in other words less diverse. E2: Individual model accuracies are high but they are of different types, in other words highly diverse in nature. A. E1 B. E2 C. Any of E1 and E2 D. None of these
Q33. Suppose you have 2000 different models with their predictions and want to ensemble the predictions of the best x models. Which of the following can be a possible method to select the best x models for an ensemble? A. Step-wise forward selection B. Step-wise backward elimination C. Both D. None of the above
Q34. Suppose you want to apply a stepwise forward selection method for choosing the best models for an ensemble model. Which of the following is the correct order of the steps? Note: You have more than 1000 model predictions. 1. Add the model predictions (in other words, take the average) one by one to the ensemble, keeping those which improve the metrics on the validation set. 2. Start with an empty ensemble 3. Return the ensemble from the nested set of ensembles that has maximum performance on the validation set A. 1-2-3 B. 1-3-4 C. 2-1-3 D. None of the above
Q35. True or False: Dropout is a computationally expensive technique w.r.t. bagging. A. True B. False
  • 90. Q36. Dropout in a neural network can be considered as an ensemble technique, where multiple sub-networks are trained together by “dropping out” certain connections between neurons. Suppose we have a single hidden layer neural network as shown below. How many possible combinations of subnetworks can be used for classification? A. 1 B. 9 C. 12 D. 16 E. None of the above
Q37. How is the model capacity affected by the dropout rate (where model capacity means the ability of a neural network to approximate complex functions)? A. Model capacity increases with an increase in dropout rate B. Model capacity decreases with an increase in dropout rate C. Model capacity is not affected by an increase in dropout rate D. None of these
  • 91. Q38. Which of the following parameters can be tuned to find a good ensemble model in bagging-based algorithms? 1. Max number of samples 2. Max features 3. Bootstrapping of samples 4. Bootstrapping of features A. 1 and 3 B. 2 and 4 C. 1, 2 and 3 D. 1, 3 and 4 E. All of the above
Q39. In machine learning, an algorithm (or learning algorithm) is said to be unstable if a small change in the training data causes a large change in the learned classifiers. True or False: Bagging of unstable classifiers is a good idea. A. True B. False
Q40. Suppose there are 25 base classifiers, each with an error rate of e = 0.35, and you are using averaging as the ensemble technique. What will be the probability that the ensemble of the above 25 classifiers makes a wrong prediction? Note: All classifiers are independent of each other. A. 0.05 B. 0.06 C. 0.07 D. 0.09
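Q40 is a binomial tail computation: assuming the ensemble errs when a majority (13 or more) of the 25 independent classifiers err, a quick check gives:

from math import comb

e, n = 0.35, 25
# P(13 or more of the 25 independent classifiers are wrong)
p_wrong = sum(comb(n, k) * e**k * (1 - e)**(n - k) for k in range(13, n + 1))
print(round(p_wrong, 2))  # ~0.06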
  • 92. Recommendation systems Q1. Movie Recommendation systems are an example of: 1. Classification 2. Clustering 3. Reinforcement Learning 4. Regression Options: A. 2 Only B. 1 and 2 C. 1 and 3 D. 2 and 3 Q2. Sentiment Analysis is an example of: 1. Regression 2. Classification 3. Clustering 4. Reinforcement Learning Options: A. 1 Only B. 1 and 2 C. 1 and 3 D. 1, 2 and 3 E. 1, 2 and 4 F. 1, 2, 3 and 4 Q3. Can decision trees be used for performing clustering? A. True B. False
  • 93. Q4. Which of the following is the most appropriate strategy for data cleaning before performing clustering analysis, given a less than desirable number of data points: 1. Capping and flooring of variables 2. Removal of outliers Options: A. 1 only B. 2 only C. 1 and 2 D. None of the above
Q5. What is the minimum no. of variables/features required to perform clustering? A. 0 B. 1 C. 2 D. 3
Q6. For two runs of K-Means clustering, is it expected to get the same clustering results? A. Yes B. No
Q7. Is it possible that the assignment of observations to clusters does not change between successive iterations in K-Means? A. Yes B. No C. Can’t say D. None of these
Q8. Which of the following can act as possible termination conditions in K-Means? 1. A fixed number of iterations. 2. Assignment of observations to clusters does not change between iterations, except for cases with a bad local minimum. 3. Centroids do not change between successive iterations. 4. Terminate when RSS falls below a threshold.
  • 94. Options: A. 1, 3 and 4 B. 1, 2 and 3 C. 1, 2 and 4 D. All of the above Q9. Which of the following clustering algorithms suffers from the problem of convergence at local optima? 1. K- Means clustering algorithm 2. Agglomerative clustering algorithm 3. Expectation-Maximization clustering algorithm 4. Diverse clustering algorithm Options: A. 1 only B. 2 and 3 C. 2 and 4 D. 1 and 3 Q10. Which of the following algorithm is most sensitive to outliers? A. K-means clustering algorithm B. K-medians clustering algorithm C. K-modes clustering algorithm D. K-medoids clustering algorithm Q11. After performing K-Means Clustering analysis on a dataset, you observed the following dendrogram. Which of the following conclusion can be drawn from the dendrogram?
  • 95. A. There were 28 data points in clustering analysis B. The best no. of clusters for the analyzed data points is 4 C. The proximity function used is Average-link clustering D. The above dendrogram interpretation is not possible for K-Means clustering analysis Q12. How can Clustering (Unsupervised Learning) be used to improve the accuracy of Linear Regression model (Supervised Learning): 1. Creating different models for different cluster groups. 2. Creating an input feature for cluster ids as an ordinal variable. 3. Creating an input feature for cluster centroids as a continuous variable. 4. Creating an input feature for cluster size as a continuous variable. Options: A. 1 only B. 1 and 2 C. 1 and 4 D. All of the above
  • 96. Q13. What could be the possible reason(s) for producing two different dendrograms using an agglomerative clustering algorithm on the same dataset? A. Proximity function used B. No. of data points used C. No. of variables used D. B and C only E. All of the above
Q14. In the figure below, if you draw a horizontal line on the y-axis at y=2, what will be the number of clusters formed? A. 1 B. 2 C. 3 D. 4
Q15. What is the most appropriate no. of clusters for the data points represented by the following dendrogram:
  • 97. A. 2 B. 4 C. 6 D. 8 Q16. In which of the following cases will K-Means clustering fail to give good results? 1. Data points with outliers 2. Data points with different densities 3. Data points with round shapes 4. Data points with non-convex shapes Options: A. 1 and 2 B. 2 and 3 C. 2 and 4 D. 1, 2 and 4
  • 98. Q17. Which of the following metrics, do we have for finding dissimilarity between two clusters in hierarchical clustering? 1. Single-link 2. Complete-link 3. Average-link Options: A. 1 and 2 B. 1 and 3 C. 2 and 3 D. 1, 2 and 3 Q18. Which of the following are true? 1. Clustering analysis is negatively affected by multicollinearity of features 2. Clustering analysis is negatively affected by heteroscedasticity Options: A. 1 only B. 2 only C. 1 and 2 D. None of them Q19. Given, six points with the following attributes:
  • 99. Which of the following clustering representations and dendrograms depicts the use of the MIN or single-link proximity function in hierarchical clustering: A. B. C. D. Solution: (A)
  • 100. Q20. Given six points with the following attributes: Which of the following clustering representations and dendrograms depicts the use of the MAX or complete-link proximity function in hierarchical clustering: A. B. C. D. Solution: (B)
  • 101. Q21. Given six points with the following attributes: Which of the following clustering representations and dendrograms depicts the use of the group-average proximity function in hierarchical clustering: A. B. C. D. Solution: (C)
  • 102. Q22. Given six points with the following attributes: Which of the following clustering representations and dendrograms depicts the use of Ward’s method proximity function in hierarchical clustering: A. B. C. D.
  • 103. Q23. What should be the best choice of no. of clusters based on the following results: A. 1 B. 2 C. 3 D. 4
Q24. Which of the following is/are valid iterative strategies for treating missing values before clustering analysis? A. Imputation with mean B. Nearest neighbor assignment C. Imputation with Expectation Maximization algorithm D. All of the above
Q25. The K-Means algorithm has some limitations. One of its limitations is that it makes hard assignments of points to clusters (a point either completely belongs to a cluster or does not belong at all). Note: A soft assignment can be considered as the probability of being assigned to each cluster (say K = 3 and for some point xn, p1 = 0.7, p2 = 0.2, p3 = 0.1). Which of the following algorithm(s) allows soft assignments? 1. Gaussian mixture models 2. Fuzzy K-means Options: A. 1 only
  • 104. B. 2 only C. 1 and 2 D. None of these
Q26. Assume you want to cluster 7 observations into 3 clusters using the K-Means clustering algorithm. After the first iteration, clusters C1, C2, C3 have the following observations: C1: {(2,2), (4,4), (6,6)} C2: {(0,4), (4,0)} C3: {(5,5), (9,9)} What will be the cluster centroids if you want to proceed to the second iteration? A. C1: (4,4), C2: (2,2), C3: (7,7) B. C1: (6,6), C2: (4,4), C3: (9,9) C. C1: (2,2), C2: (0,0), C3: (5,5) D. None of these
Q27. For the same clusters as in Q26, what will be the Manhattan distance of observation (9, 9) from cluster centroid C1 in the second iteration? A. 10 B. 5*sqrt(2) C. 13*sqrt(2) D. None of these
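The centroid update and Manhattan distance in Q26 and Q27 can be checked directly (illustrative code):

import numpy as np

clusters = {"C1": [(2, 2), (4, 4), (6, 6)],
            "C2": [(0, 4), (4, 0)],
            "C3": [(5, 5), (9, 9)]}

# New centroid of each cluster = mean of its member points
centroids = {name: np.mean(pts, axis=0) for name, pts in clusters.items()}
print(centroids)  # C1: (4,4), C2: (2,2), C3: (7,7) -> option A in Q26

# Manhattan (L1) distance of (9,9) from the new C1 centroid
print(np.abs(np.array([9, 9]) - centroids["C1"]).sum())  # 10 -> option A in Q27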
  • 105. Q28. If two variables V1 and V2 are used for clustering, which of the following are true for K-means clustering with k = 3? 1. If V1 and V2 have a correlation of 1, the cluster centroids will be in a straight line 2. If V1 and V2 have a correlation of 0, the cluster centroids will be in a straight line Options: A. 1 only B. 2 only C. 1 and 2 D. None of the above
Q29. Feature scaling is an important step before applying the K-Means algorithm. What is the reason behind this? A. In the distance calculation it will give the same weight to all features B. You always get the same clusters whether or not you use feature scaling C. In Manhattan distance it is an important step but in Euclidean it is not D. None of these
Q30. Which of the following methods is used for finding the optimal number of clusters in the K-Means algorithm? A. Elbow method B. Manhattan method C. Euclidean method D. All of the above E. None of these
Q31. What is true about K-Means Clustering? 1. K-means is extremely sensitive to cluster center initialization 2. Bad initialization can lead to poor convergence speed 3. Bad initialization can lead to bad overall clustering Options: A. 1 and 3 B. 1 and 2 C. 2 and 3 D. 1, 2 and 3
  • 106. Q32. Which of the following can be applied to get good results for K-means algorithm corresponding to global minima? 1. Try to run algorithm for different centroid initialization 2. Adjust number of iterations 3. Find out the optimal number of clusters Options: A. 2 and 3 B. 1 and 3 C. 1 and 2 D. All of above Q33. What should be the best choice for number of clusters based on the following results: A. 5 B. 6 C. 14 D. Greater than 14
  • 107. Q34. What should be the best choice for number of clusters based on the following results: A. 2 B. 4 C. 6 D. 8 Q35. Which of the following sequences is correct for a K-Means algorithm using Forgy method of initialization? 1. Specify the number of clusters 2. Assign cluster centroids randomly 3. Assign each data point to the nearest cluster centroid 4. Re-assign each point to nearest cluster centroids 5. Re-compute cluster centroids Options: A. 1, 2, 3, 5, 4 B. 1, 3, 2, 4, 5 C. 2, 1, 3, 4, 5 D. None of these Q36. If you are using Multinomial mixture models with the expectation-maximization algorithm for clustering a set of data points into two clusters, which of the assumptions are important: A. All the data points follow two Gaussian distribution B. All the data points follow n Gaussian distribution (n >2) C. All the data points follow two multinomial distribution D. All the data points follow n multinomial distribution (n >2)
  • 108. Q37. Which of the following is/are not true about Centroid based K-Means clustering algorithm and Distribution based expectation-maximization clustering algorithm: 1. Both starts with random initializations 2. Both are iterative algorithms 3. Both have strong assumptions that the data points must fulfill 4. Both are sensitive to outliers 5. Expectation maximization algorithm is a special case of K-Means 6. Both requires prior knowledge of the no. of desired clusters 7. The results produced by both are non-reproducible. Options: A. 1 only B. 5 only C. 1 and 3 D. 6 and 7 Q38. Which of the following is/are not true about DBSCAN clustering algorithm: 1. For data points to be in a cluster, they must be in a distance threshold to a core point 2. It has strong assumptions for the distribution of data points in dataspace 3. It has substantially high time complexity of order O(n3) 4. It does not require prior knowledge of the no. of desired clusters 5. It is robust to outliers Options: A. 1 only B. 2 only C. 4 only D. 2 and 3 Q39. Which of the following are the high and low bounds for the existence of F- Score? A. [0,1] B. (0,1) C. [-1,1]
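As background for Q39 and Q40: the F1-score is the harmonic mean of precision and recall, so it is bounded by [0, 1]. A tiny sketch:

def f1_score(precision, recall):
    # Harmonic mean of precision and recall; both lie in [0, 1], so F1 does too
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.5, 0.75), 3))  # 0.6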
  • 109. Q40. Following are the results observed for clustering 6000 data points into 3 clusters: A, B and C: What is the F1-Score with respect to cluster B? A. 3 B. 4 C. 5 D. 6 Deep Learning 1) The difference between deep learning and machine learning algorithms is that there is no need of feature engineering in machine learning algorithms, whereas, it is recommended to do feature engineering first and then apply deep learning. A) TRUE B) FALSE 2) Which of the following is a representation learning algorithm? A) Neural network B) Random Forest C) k-Nearest neighbor D) None of the above 3) Which of the following option is correct for the below-mentioned techniques? 1. AdaGrad uses first order differentiation 2. L-BFGS uses second order differentiation 3. AdaGrad uses second order differentiation 4. L-BFGS uses first order differentiation
  • 110. A) 1 and 2 B) 3 and 4 C) 1 and 4 D) 2 and 3 4) Increase in size of a convolutional kernel would necessarily increase the performance of a convolutional neural network. A) TRUE B) FALSE Question Context Suppose we have a deep neural network model which was trained on a vehicle detection problem. The dataset consisted of images on cars and trucks and the aim was to detect name of the vehicle (the number of classes of vehicles are 10). Now you want to use this model on different dataset which has images of only Ford Mustangs (aka car) and the task is to locate the car in an image. 5) Which of the following categories would be suitable for this type of problem? A) Fine tune only the last couple of layers and change the last layer (classification layer) to regression layer B) Freeze all the layers except the last, re-train the last layer C) Re-train the model for the new dataset D) None of these 6) Suppose you have 5 convolutional kernel of size 7 x 7 with zero padding and stride 1 in the first layer of a convolutional neural network. You pass an input of dimension 224 x 224 x 3 through this layer. What are the dimensions of the data which the next layer will receive? A) 217 x 217 x 3 B) 217 x 217 x 8 C) 218 x 218 x 5 D) 220 x 220 x 7
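Q6 uses the standard convolution output-size formula, out = (in + 2*padding - kernel) / stride + 1, with the number of output channels equal to the number of kernels. A quick check:

def conv_output_size(in_size, kernel, stride=1, padding=0):
    # Standard formula for the spatial output size of a convolution
    return (in_size + 2 * padding - kernel) // stride + 1

side = conv_output_size(224, kernel=7, stride=1, padding=0)
print(side, side, 5)  # 218 218 5 -> a 218 x 218 x 5 output (5 kernels)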
  • 111. 7) Suppose we have a neural network with ReLU activation function. Let’s say, we replace ReLu activations by linear activations. Would this new neural network be able to approximate an XNOR function? Note: The neural network was able to approximate XNOR function with activation function ReLu. A) Yes B) No 8) Suppose we have a 5-layer neural network which takes 3 hours to train on a GPU with 4GB VRAM. At test time, it takes 2 seconds for single data point. Now we change the architecture such that we add dropout after 2nd and 4th layer with rates 0.2 and 0.3 respectively. What would be the testing time for this new architecture? A) Less than 2 secs B) Exactly 2 secs C) Greater than 2 secs D) Can’t Say 9) Which of the following options can be used to reduce overfitting in deep learning models? 1. Add more data 2. Use data augmentation 3. Use architecture that generalizes well 4. Add regularization 5. Reduce architectural complexity A) 1, 2, 3 B) 1, 4, 5 C) 1, 3, 4, 5 D) All of these
  • 112. 10) Perplexity is a commonly used evaluation technique when applying deep learning for NLP tasks. Which of the following statements is correct? A) The higher the perplexity the better B) The lower the perplexity the better
11) Suppose an input to a Max-Pooling layer is given above. The pooling size of neurons in the layer is (3, 3). What would be the output of this pooling layer? A) 3 B) 5 C) 5.5 D) 7
12) Suppose there is a neural network with the below configuration. If we remove the ReLU layers, we can still use this neural network to model non-linear functions. A) TRUE B) FALSE
  • 113. 13) Deep learning can be applied to which of the following NLP tasks? A) Machine translation B) Sentiment analysis C) Question answering systems D) All of the above
14) Scenario 1: You are given data of the map of Arcadia city, with aerial photographs of the city and its outskirts. The task is to segment the areas into industrial land, farmland and natural landmarks like rivers, mountains, etc. Scenario 2: You are given data of the map of Arcadia city, with detailed roads and distances between landmarks, represented as a graph structure. The task is to find out the nearest distance between two landmarks. Deep learning can be applied to Scenario 1 but not Scenario 2. A) TRUE B) FALSE
15) Which of the following is a data augmentation technique used in image recognition tasks? 1. Horizontal flipping 2. Random cropping 3. Random scaling 4. Color jittering 5. Random translation 6. Random shearing A) 1, 2, 4 B) 2, 3, 4, 5, 6 C) 1, 3, 5, 6 D) All of these
16) Given an n-character word, we want to predict which character would be the (n+1)th character in the sequence. For example, our input is “predictio” (a 9-character word) and we have to predict what the 10th character would be. Which neural network architecture would be suitable for this task? A) Fully-Connected Neural Network
  • 114. B) Convolutional Neural Network C) Recurrent Neural Network (best for sequential data) D) Restricted Boltzmann Machine
17) What is generally the sequence followed when building a neural network architecture for semantic segmentation of an image? A) Convolutional network on the input and deconvolutional network on the output B) Deconvolutional network on the input and convolutional network on the output
18) Sigmoid was the most commonly used activation function in neural networks, until an issue was identified. The issue is that when the inputs are too large in the positive or negative direction, the activation saturates and the gradients coming out of the activation function get squashed. This is called saturation of the neuron. That is why the ReLU function was proposed, which keeps the gradients the same as before in the positive direction. A ReLU unit in a neural network never gets saturated. A) TRUE B) FALSE
19) What is the relationship between dropout rate and regularization? Note: we have defined the dropout rate as the probability of keeping a neuron active. A) The higher the dropout rate, the higher the regularization B) The higher the dropout rate, the lower the regularization
  • 115. 20) What is the technical difference between the vanilla backpropagation algorithm and the backpropagation through time (BPTT) algorithm? A) Unlike backprop, in BPTT we sum up the gradients for the corresponding weight for each time step B) Unlike backprop, in BPTT we subtract the gradients for the corresponding weight for each time step
21) The exploding gradient problem is an issue in training deep networks where the gradient gets so large that the loss goes to an infinitely high value and then explodes. What is the probable approach when dealing with the “Exploding Gradient” problem in RNNs? A) Use modified architectures like LSTMs and GRUs B) Gradient clipping C) Dropout D) None of these
22) There are many types of gradient descent algorithms. Two of the most notable ones are l-BFGS and SGD. l-BFGS is a second order gradient descent technique whereas SGD is a first order gradient descent technique. In which of the following scenarios would you prefer l-BFGS over SGD? 1. Data is sparse 2. The number of parameters of the neural network is small A) Both 1 and 2 B) Only 1 C) Only 2 D) None of these
23) Which of the following is not a direct prediction technique for NLP tasks? A) Recurrent Neural Network B) Skip-gram model C) PCA D) Convolutional neural network
  • 116. 24) Which of the following would be the best for a non-continuous objective during optimization in deep neural net? A) L-BFGS B) SGD C) AdaGrad D) Subgradient method 25) Which of the following is correct? 1. Dropout randomly masks the input weights to a neuron 2. Dropconnect randomly masks both input and output weights to a neuron A) 1 is True and 2 is False B) 1 is False and 2 is True C) Both 1 and 2 are True D) Both 1 and 2 are False 26) While training a neural network for image recognition task, we plot the graph of training error and validation error for debugging. What is the best place in the graph for early stopping? A) A B) B C) C D) D
  • 117. 27) Research is going on to solve image inpainting problems using computer vision with deep learning. For this, which loss function would be appropriate for computing the pixel-wise region to be inpainted? Image inpainting is one of those problems which requires human expertise for solving it. It is particularly useful to repair damaged photos or videos. Below is an example of input and output of an image inpainting example. A) Euclidean loss B) Negative-log Likelihood loss C) Any of the above 28) Backpropagation works by first calculating the gradient of ___ and then propagating it backwards. A) Sum of squared error with respect to inputs B) Sum of squared error with respect to weights C) Sum of squared error with respect to outputs D) None of the above 29) Mini-Batch sizes when defining a neural network are preferred to be multiple of 2’s such as 256 or 512. What is the reason behind it? A) Gradient descent optimizes best when you use an even number B) Parallelization of neural network is best when the memory is used optimally C) Losses are erratic when you don’t use an even number D) None of these
  • 118. 30) Xavier initialization is most commonly used to initialize the weights of a neural network. Below is given the formula for initialization. 1. If weights at the start are small, then signals reaching the end will be too tiny. 2. If weights at the start are too large, signals reaching the end will be too large. 3. Weights from Xavier’s init are drawn from the Gaussian distribution. 4. Xavier’s init helps reduce the vanishing gradient problem. Xavier’s init is used to help the input signals reach deep into the network. Which of the following statements are true? A) 1, 2, 4 B) 2, 3, 4 C) 1, 3, 4 D) 1, 2, 3
31) As the length of a sentence increases, it becomes harder for a neural translation machine to perform, as the sentence meaning is represented by a fixed-dimensional vector. To solve this, which of the following could we do? A) Use recursive units instead of recurrent B) Use an attention mechanism C) Use character-level translation D) None of these
32) A recurrent neural network can be unfolded into a fully-connected neural network with infinite length. A) TRUE B) FALSE
33) Which of the following is a bottleneck for deep learning algorithms? A) Data related to the problem B) CPU to GPU communication C) GPU memory D) All of the above
  • 119. 34) Dropout is a regularization technique used especially in the context of deep learning. It works as following, in one iteration we first randomly choose neurons in the layers and masks them. Then this network is trained and optimized in the same iteration. In the next iteration, another set of randomly chosen neurons are selected and masked and the training continues. Dropout technique is not an advantageous technique for which of the following layers? A) Affine layer B) Convolutional layer C) RNN layer D) None of these 35) Suppose your task is to predict the next few notes of song when you are given the preceding segment of the song. For example: The input given to you is an image depicting the music symbols as given below, Your required output is an image of succeeding symbols. Which architecture of neural network would be better suited to solve the problem? A) End-to-End fully connected neural network B) Convolutional neural network followed by recurrent units C) Neural Turing Machine D) None of these
  • 120. 36) When deriving a memory cell in memory networks, we choose to read values as vector values instead of scalars. Which type of addressing would this entail? A) Content-based addressing B) Location-based addressing
37) It is generally recommended to replace pooling layers in the generator part of convolutional generative adversarial nets with ________? A) Affine layer B) Strided convolutional layer C) Fractional strided convolutional layer D) ReLU layer
Question Context 38-40: GRU is a special type of Recurrent Neural Network proposed to overcome the difficulties of classical RNNs, in the paper “On the Properties of Neural Machine Translation: Encoder–Decoder Approaches”.
38) Which of the following statements is true with respect to a GRU? 1. Units with short-term dependencies have a very active reset gate. 2. Units with long-term dependencies have a very active update gate A) Only 1 B) Only 2 C) None of them D) Both 1 and 2
39) If the calculation of the reset gate in a GRU unit is close to 0, which of the following would occur? A) The previous hidden state would be ignored B) The previous hidden state would not be ignored
40) If the calculation of the update gate in a GRU unit is close to 1, which of the following would occur? A) The unit forgets the information for future time steps B) The unit copies the information through many time steps
Natural Language Processing

Q1 Which of the following techniques can be used for the purpose of keyword normalization, the process of converting a keyword into its base form?
1. Lemmatization
2. Levenshtein
3. Stemming
4. Soundex
A) 1 and 2
B) 2 and 4
C) 1 and 3
D) 1, 2 and 3

2) N-grams are defined as the combination of N keywords together. How many bi-grams can be generated from the given sentence: "Analytics Vidhya is a great source to learn data science"?
A) 7
B) 8
C) 9
D) 10
E) 11
Bigrams: Analytics Vidhya, Vidhya is, is a, a great, great source, source to, to learn, learn data, data science

3) How many trigram phrases can be generated from the following sentence, after performing the following text cleaning steps:
- Stopword removal
- Replacing punctuation by a single space
"#Analytics-vidhya is a great source to learn @data_science."
A) 3
B) 4
C) 5
D) 6
E) 7
After performing stopword removal and punctuation replacement the text becomes: "Analytics vidhya great source learn data science"
Trigrams: Analytics vidhya great, vidhya great source, great source learn, source learn data, learn data science
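The bigram and trigram counts above are easy to verify mechanically. A small sketch; the tokenization is a plain whitespace split, which is an assumption on my part:

```python
def ngrams(sentence, n):
    """Return the list of n-grams from a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Q2: 10 tokens -> 9 bigrams
print(len(ngrams("Analytics Vidhya is a great source to learn data science", 2)))

# Q3: after stopword removal and punctuation cleaning, 7 tokens -> 5 trigrams
print(len(ngrams("Analytics vidhya great source learn data science", 3)))
```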
4) Which of the following regular expressions can be used to identify date(s) present in the text object:
"The next meetup on data science will be held on 2017-09-21, previously it happened on 31/03, 2016"
A) \d{4}-\d{2}-\d{2}
B) (19|20)\d{2}-(0[1-9]|1[0-2])-[0-2][1-9]
C) (19|20)\d{2}-(0[1-9]|1[0-2])-([0-2][1-9]|3[0-1])
D) None of the above

Question Context 5-6:
You have collected data of about 10,000 rows of tweet text and no other information. You want to create a tweet classification model that categorizes each of the tweets in three buckets – positive, negative and neutral.

5) Which of the following models can perform tweet classification with regard to the context mentioned above?
A) Naive Bayes
B) SVM
C) None of the above

6) You have created a document-term matrix of the data, treating every tweet as one document. Which of the following is correct with regard to the document-term matrix?
1. Removal of stopwords from the data will affect the dimensionality of the data
2. Normalization of words in the data will reduce the dimensionality of the data
3. Converting all the words to lowercase will not affect the dimensionality of the data
A) Only 1
B) Only 2
C) Only 3
D) 1 and 2
E) 2 and 3
F) 1, 2 and 3

7) Which of the following features can be used for accuracy improvement of a classification model?
A) Frequency count of terms
B) Vector notation of sentence
C) Part of Speech Tag
D) Dependency Grammar
E) All of these
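The backslashes in Q4's regex options were evidently stripped during export; restoring them (an assumption on my part), option C can be checked against the text with Python's re module:

```python
import re

text = ("The next meetup on data science will be held on 2017-09-21, "
        "previously it happened on 31/03, 2016")

# Option C with the (presumably stripped) backslashes restored:
pattern = r"(19|20)\d{2}-(0[1-9]|1[0-2])-([0-2][1-9]|3[0-1])"
match = re.search(pattern, text)
print(match.group(0) if match else "no match")  # prints: 2017-09-21
```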
8) What percentage of the below statements are correct with regard to topic modeling?
1. It is a supervised learning technique
2. LDA (Linear Discriminant Analysis) can be used to perform topic modeling
3. Selection of the number of topics in a model does not depend on the size of the data
4. The number of topic terms is directly proportional to the size of the data
A) 0
B) 25
C) 50
D) 75
E) 100

9) In the Latent Dirichlet Allocation model for text classification purposes, what do the alpha and beta hyperparameters represent?
A) Alpha: number of topics within documents, beta: number of terms within topics (False)
B) Alpha: density of terms generated within topics, beta: density of topics generated within terms (False)
C) Alpha: number of topics within documents, beta: number of terms within topics (False)
D) Alpha: density of topics generated within documents, beta: density of terms generated within topics (True)

10) Solve the equation according to the sentence "I am planning to visit New Delhi to attend Analytics Vidhya Delhi Hackathon".
A = (# of words with Noun as the part of speech tag)
B = (# of words with Verb as the part of speech tag)
C = (# of words with frequency count greater than one)
What are the correct values of A, B, and C?
A) 5, 5, 2
B) 5, 5, 0
C) 7, 5, 1
D) 7, 4, 2
Nouns: I, New, Delhi, Analytics, Vidhya, Delhi, Hackathon (7)
Verbs: am, planning, visit, attend (4)
Words with frequency counts > 1: to, Delhi (2)
11) In a corpus of N documents, one document is randomly picked. The document contains a total of T terms and the term "data" appears K times. What is the correct value for the product of TF (term frequency) and IDF (inverse document frequency), if the term "data" appears in approximately one-third of the total documents?
A) KT * Log(3)
B) K * Log(3) / T
C) T * Log(3) / K
D) Log(3) / KT

Question Context 12 to 14: Refer to the following document-term matrix.

12) Which pair of documents contains the same number of terms, where that number is not the least number of terms in any document of the entire corpus?
A) d1 and d4
B) d6 and d7
C) d2 and d4
D) d5 and d6
Documents d2 and d4 each contain 4 terms, which is not the least number of terms (3) in the corpus.

13) Which are the most common and the rarest terms of the corpus?
A) t4, t6
B) t3, t5
C) t5, t1
D) t5, t6
t5 is the most common term, appearing in 5 out of 7 documents; t6 is the rarest term, appearing only in d3 and d4.

14) What is the term frequency of the term which is used the maximum number of times in its document?
A) t6 – 2/5
B) t3 – 3/6
C) t4 – 2/6
D) t1 – 2/6
t3 is used the maximum number of times in the entire corpus (3 times), so the tf for t3 is 3/6.

15) Which of the following techniques is not a part of flexible text matching?
A) Soundex
B) Metaphone
C) Edit Distance
D) Keyword Hashing

16) True or False: The Word2Vec model is a machine learning model used to create vector notations of text objects. Word2vec contains multiple deep neural networks.
A) TRUE
B) FALSE

17) Which of the following statements is (are) true for the Word2Vec model?
A) The architecture of word2vec consists of only two layers – continuous bag of words and skip-gram model
B) Continuous bag of words (CBOW) is a Recurrent Neural Network model
C) Both CBOW and Skip-gram are shallow neural network models
D) All of the above

18) With respect to the given context-free dependency graph, how many sub-trees exist in the sentence?
A) 3
B) 4
C) 5
D) 6

19) What is the right order of components for a text classification model?
1. Text cleaning
2. Text annotation
3. Gradient descent
4. Model tuning
5. Text to predictors
A) 12345
B) 13425
C) 12534
D) 13452

20) Polysemy is defined as the coexistence of multiple meanings for a word or phrase in a text object. Which of the following models is likely the best choice to correct this problem?
A) Random Forest Classifier
B) Convolutional Neural Networks
C) Gradient Boosting
D) All of these

21) Which of the following models can be used for the purpose of document similarity?
A) Training a word2vec model on the corpus that learns the context present in the documents
B) Training a bag-of-words model that learns the occurrence of words in the documents
C) Creating a document-term matrix and using cosine similarity for each document
D) All of the above

22) What are the possible features of a text corpus?
1. Count of a word in a document
2. Boolean feature – presence of a word in a document
3. Vector notation of a word
4. Part of Speech Tag
5. Basic Dependency Grammar
6. Entire document as a feature
A) 1
B) 12
C) 123
D) 1234
E) 12345

23) While creating a machine learning model on text data, you created a document-term matrix of the input data of 100K documents. Which of the following remedies can be used to reduce the dimensions of the data?
1. Latent Dirichlet Allocation
2. Latent Semantic Indexing
3. Keyword Normalization
A) Only 1
B) 2, 3
C) 1, 3
D) 1, 2, 3

24) Google Search's feature – "Did you mean" – is a mixture of different techniques. Which of the following techniques are likely to be ingredients?
1. Collaborative Filtering model to detect similar user behaviors (queries)
2. Model that checks for Levenshtein distance among the dictionary terms
3. Translation of sentences into multiple languages
A) 1
B) 2
C) 1, 2
D) 1, 2, 3

25) While working with text data obtained from news sentences, which are structured in nature, which grammar-based text parsing techniques can be used for noun phrase detection, verb phrase detection, subject detection and object detection?
A) Part of speech tagging
B) Dependency Parsing and Constituency Parsing
C) Skip Gram and N-Gram extraction
D) Continuous Bag of Words

26) Social media platforms are the most intuitive form of text data. You are given a corpus of complete social media data of tweets. How can you create a model that suggests hashtags?
A) Perform topic modeling to obtain the most significant words of the corpus
B) Train a bag-of-ngrams model to capture the top n-grams – words and their combinations
C) Train a word2vec model to learn repeating contexts in the sentences
D) All of these

27) While working with context extraction from text data, you encountered two different sentences: "The tank is full of soldiers." "The tank is full of nitrogen." Which of the following measures can be used for word sense disambiguation in the sentences?
A) Compare the dictionary definition of the ambiguous word with the terms contained in its neighborhood
B) Co-reference resolution, in which one resolves the meaning of the ambiguous word
with the proper noun present in the previous sentence
C) Use dependency parsing of the sentence to understand the meanings

28) Collaborative Filtering and Content Based Models are the two popular recommendation engines. What role does NLP play in building such algorithms?
A) Feature extraction from text
B) Measuring feature similarity
C) Engineering features for a vector space learning model
D) All of these

29) Retrieval-based models and generative models are the two popular techniques used for building chatbots. Which of the following is an example of a retrieval model and a generative model, respectively?
A) Dictionary-based learning and Word2vec model
B) Rule-based learning and Sequence to Sequence model
C) Word2vec and Sentence to Vector model
D) Recurrent neural network and convolutional neural network

30) What is the major difference between CRF (Conditional Random Field) and HMM (Hidden Markov Model)?
A) CRF is a Generative model whereas HMM is a Discriminative model
B) CRF is a Discriminative model whereas HMM is a Generative model
C) Both CRF and HMM are Generative models
D) Both CRF and HMM are Discriminative models

1. The number of False Positives (FP) and False Negatives (FN) for both systems respectively are:
(a) System A: FP = 20, FN = 25; System B: FP = 15, FN = 30
(b) System A: FP = 15, FN = 30; System B: FP = 20, FN = 25
(c) System A: FP = 15, FN = 25; System B: FP = 20, FN = 30
(d) System A: FP = 30, FN = 20; System B: FP = 15, FN = 25

2. The Sensitivity and Specificity for System-A respectively are:
(a) Sensitivity = 0.75, Specificity = 0.80
(b) Sensitivity = 0.70, Specificity = 0.85
(c) Sensitivity = 0.75, Specificity = 0.85
(d) Sensitivity = 0.70, Specificity = 0.80
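The confusion matrices for Systems A and B appear to be in an image that did not survive the export, so the actual counts can't be reproduced here. As a hedged sketch, the quantities asked about follow directly from TP/TN/FP/FN counts; the numbers below are placeholders, not the quiz's data:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity (recall, TPR) and specificity (TNR) from raw counts."""
    sensitivity = tp / (tp + fn)   # fraction of actual positives caught
    specificity = tn / (tn + fp)   # fraction of actual negatives cleared
    return sensitivity, specificity

# Placeholder counts only -- the quiz's actual matrices are not available here.
print(sensitivity_specificity(tp=90, fn=10, tn=60, fp=40))  # (0.9, 0.6)
```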
3. Which system should we use to rule out the presence of COVID-19?
(a) System-A
(b) System-B
(c) Either can be preferred
(d) Can't be determined

4. If N is the number of rows/instances in the training dataset, then what is the time complexity of the K-nearest neighbors algorithm run in Big-O notation?
(a) O(1)
(b) O(N)
(c) O(log N)
(d) O(N^2)

5. A company manager wants to predict the time before a break-down of its production machines. As a Machine Learning student, you are asked to solve the problem. How will you formulate it?
(a) as a classification problem statement
(b) as a regression problem statement
(c) as a clustering problem statement
(d) as an association rule-based problem statement

6. Which of the following statements are correct about the Regression line?
(a) The Regression line always goes through the mean of the data.
(b) The sum of the deviations of the values from their regression line is always zero.
(c) The sum of the squared deviations of the values from their regression line is always minimum.
(d) If regression lines coincide with each other, then there is no correlation.
Answer: [ a, b, c ]

7. Which of the following options are incorrect about the Mahalanobis distance?
(a) It transforms the columns into correlated variables.
(b) It changes the values of the features so that the standard deviation becomes zero.
(c) It calculates the mean and variance with the help of new columns.
(d) It includes only variances in its formula while calculating the distance.
Answer: [ a, b, c, d ]
Explanation: Mahalanobis distance takes covariance into account while calculating distances.

8. Choose the correct options for Random Variables X1 and X2:
(a) If Cov(X1, X2) = 0, then the random variables X1 and X2 are independent
(b) If random variables X1 and X2 are independent, then Cov(X1, X2) = 0
(c) If Cov(X1, X2) = 0 and X1 and X2 are normally distributed, then X1 and X2 are independent.
(d) If Cov(X1, X2) = 0, then Corr(X1, X2) = 0
Answer: [ b, c, d ]
Explanation: Independence implies zero covariance, but zero covariance does not necessarily imply independence.

9. Which of the following statements are TRUE?
(a) Supervised learning does not require target attributes while unsupervised learning requires them.
(b) In a supermarket, categorization of the items to be placed in aisles and on the shelves can be an application of unsupervised learning.
(c) Sentiment analysis can be posed as a classification task, not as a clustering task.
(d) Decision trees can also be used to do clustering tasks.
Answer: [ b, d ]

10. The algorithm which can only be used when the training data are linearly separable is:
(a) Linear hard-margin SVM
(b) Linear Logistic Regression
(c) Linear soft-margin SVM
(d) The centroid method
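To ground the explanation that Mahalanobis distance takes covariance into account, here is a minimal numpy sketch (scipy also offers scipy.spatial.distance.mahalanobis); the toy data is illustrative:

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Distance between x and y under covariance matrix cov.

    Unlike Euclidean distance, the inverse covariance re-weights and
    de-correlates the feature axes before measuring the distance.
    """
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2)) @ np.array([[1.0, 0.8], [0.0, 0.6]])
cov = np.cov(data, rowvar=False)
print(mahalanobis(data[0], data[1], cov))
```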
11. Which of the following statements are correct about the Backpropagation Algorithm?
(a) It is also known as the Generalized delta rule
(b) In Backpropagation, the error in the output is propagated backward only to determine weight updates
(c) Backpropagation learning is based on gradient descent along the surface of the defined loss function.
(d) It is an algorithm for unsupervised learning of artificial neural networks
Answer: [ a, b, c ]

12. How many of the following statements are incorrect about the K-Means Clustering algorithm?
(a) In presence of possible outliers in the data, one should not go for 'complete link' distance measures during the clustering tasks
(b) Two different runs of the k-means clustering algorithm always result in the same clusters
(c) It is always better to assign 10 to 20 iterations as a stopping criterion for k-means clustering
(d) In k-means clustering, the number of centroids changes during the algorithm run
(e) It tries to maximize the within-class variance for a given number of clusters.
(f) It converges to the global optimum if and only if the initial means (initialization) are chosen as some of the samples themselves.
(g) It requires the dimension of the feature space to be no bigger than the number of samples.
Answer: [ {b, c, d, e, f, g} – 6 ]

13. Which of the following statements are correct about the characteristics of Hierarchical clustering?
(a) It is a merging approach
(b) Measuring the distance between two clusters
(c) Divisive hierarchical clustering works in a bottom-up approach
(d) It is a semi-unsupervised clustering algorithm
Answer: [ a, b ]
14. Which of the following statements are correct about Bayesian Classification?
(a) The decision boundary in Bayesian classification depends on evidence
(b) The decision boundary in Bayesian classification depends on priors
(c) Bayes classification is a supervised machine learning algorithm
Answer: [ b, c ]

15. How many of the following statements are incorrect about neural networks?
(a) An activation function must be monotonic in neural networks
(b) The logistic function is a monotonically increasing function
(c) A non-differentiable function cannot be used as an activation function
(d) They can only be trained with stochastic gradient descent
(e) Optimize a convex objective function.
(f) Can use a mix of different activation functions
Answer: [ {a, c, d, e} – 4 ]
Explanation: Neural networks can use a mix of different activation functions like sigmoid, tanh, and ReLU functions.

16. The capacity of a neural network model, i.e. the ability of the network to model a complex function, _____________ with the increase in dropout rate.
(a) Increases
(b) Decreases
(c) Remains the same
(d) First decreases and then increases
Answer: [ b ]

17. How many of the following options are TRUE about Support Vector Machines (SVMs)?
(a) Support vectors only have non-zero Lagrangian multipliers in the formulation of SVMs.
(b) The SVM linear discriminant function focuses on a dot product between the test point and the support vectors.
(c) In soft margin SVM, we give the model freedom for some misclassifications.
(d) Support vectors are the data points that are farthest from the decision boundary.
(e) The only training points necessary to compute f(x) in an SVM are the support vectors.
Answer: [ {a, b, c, d, e} – 5 ]

18. The linear discriminant function (classifier) with the maximum margin in SVMs is the best since it is robust to outliers and has a strong generalization ability.
Answer: [ True ]

19. For the given dendrogram, if you draw a horizontal line on the y-axis at y = 0.50, what will be the number of clusters formed?
(a) 1
(b) 3
(c) 4
(d) 7
Answer: [ c ]

20. How do you handle missing values or corrupted data in a dataset for categorical variables?
(a) Drop missing rows or columns
(b) Replace missing values with the most frequent value
(c) Develop a model to predict those missing values
(d) All of the above
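For Q20, one common concrete realization of option (b), replacing a missing categorical value with the most frequent one, is sklearn's SimpleImputer; a minimal sketch with toy data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A toy categorical column with one missing entry (None).
X = np.array([["red"], ["blue"], ["red"], [None], ["red"]], dtype=object)

imputer = SimpleImputer(missing_values=None, strategy="most_frequent")
print(imputer.fit_transform(X).ravel())
# ['red' 'blue' 'red' 'red' 'red'] -- the missing entry becomes 'red'
```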
Q1. A neural network model is said to be inspired by the human brain. The neural network consists of many neurons; each neuron takes an input, processes it and gives an output. Here's a diagrammatic representation of a real neuron. Which of the following statement(s) correctly represents a real neuron?
A. A neuron has a single input and a single output only
B. A neuron has multiple inputs but a single output only
C. A neuron has a single input but multiple outputs
D. A neuron has multiple inputs and multiple outputs
E. All of the above statements are valid

Q2. Below is a mathematical representation of a neuron. The different components of the neuron are denoted as:
- x1, x2, …, xN: These are inputs to the neuron. These can either be the actual observations from the input layer or an intermediate value from one of the hidden layers.
- w1, w2, …, wN: The weight of each input.
- bi: Termed the bias unit. These are constant values added to the input of the activation function corresponding to each weight. They work similarly to an intercept term.
- a: Termed the activation of the neuron.
- y: The output of the neuron.
Considering the above notations, will a line equation (y = mx + c) fall into the category of a neuron?
A. Yes
B. No

Q3. Let us assume we implement an AND function using a single neuron. Below is a tabular representation of an AND function:

X1 | X2 | X1 AND X2
0  | 0  | 0
0  | 1  | 0
1  | 0  | 0
1  | 1  | 1

The activation function of our neuron is denoted as:
What would be the weights and bias? (Hint: for which values of w1, w2 and b does our neuron implement an AND function?)
A. Bias = -1.5, w1 = 1, w2 = 1
B. Bias = 1.5, w1 = 2, w2 = 2
C. Bias = 1, w1 = 1.5, w2 = 1.5
D. None of these

Q4. A network is created when multiple neurons are stacked together. Let us take the example of a neural network simulating an XNOR function. You can see that the last neuron takes input from the two neurons before it. The activation function for all the neurons is given by:
Suppose X1 is 0 and X2 is 1. What will be the output of the above neural network?
A. 0
B. 1

Q5. In a neural network, knowing the weight and bias of each neuron is the most important step. If you can somehow get the correct values of weight and bias for each neuron, you can approximate any function. What would be the best way to approach this?
A. Assign random values and pray to God they are correct
B. Search every possible combination of weights and biases till you get the best values
C. Iteratively check, after assigning a value, how far you are from the best values, and slightly change the assigned values to make them better
D. None of these

Q6. What are the steps for using a gradient descent algorithm?
1. Calculate the error between the actual value and the predicted value
2. Reiterate until you find the best weights of the network
3. Pass an input through the network and get values from the output layer
4. Initialize random weights and biases
5. Go to each neuron which contributes to the error and change its respective values to reduce the error
A. 1, 2, 3, 4, 5
B. 5, 4, 3, 2, 1
C. 3, 2, 1, 5, 4
D. 4, 3, 1, 5, 2

Q7. Suppose you have inputs x, y, and z with values -2, 5, and -4 respectively. You have a neuron 'q' and a neuron 'f' with functions:
q = x + y
f = q * z
The graphical representation of the functions is as follows:
What is the gradient of F with respect to x, y, and z?
(HINT: to calculate the gradient, you must find (df/dx), (df/dy) and (df/dz))
A. (-3,4,4)
B. (4,4,3)
C. (-4,-4,3)
D. (3,-4,-4)

Q8. Now let's revise the previous slides. We have learned that:
- A neural network is a (crude) mathematical representation of a brain, which consists of smaller components called neurons.
- Each neuron has an input, a processing function, and an output.
- These neurons are stacked together to form a network, which can be used to approximate any function.
- To get the best possible neural network, we can use techniques like gradient descent to update our neural network model.
Given above is a description of a neural network. When does a neural network model become a deep learning model?
A. When you add more hidden layers and increase the depth of the neural network
B. When there is higher dimensionality of data
C. When the problem is an image recognition problem
D. None of these

Q9. A neural network can be considered as multiple simple equations stacked together. Suppose we want to replicate the function for the below-mentioned decision boundary, using two simple inputs h1 and h2.
What will be the final equation?
A. (h1 AND NOT h2) OR (NOT h1 AND h2)
B. (h1 OR NOT h2) AND (NOT h1 OR h2)
C. (h1 AND h2) OR (h1 OR h2)
D. None of these

Q10. "Convolutional Neural Networks can perform various types of transformation (rotations or scaling) on an input". Is the statement correct: True or False?
A. True
B. False
Solution: (B)

Q11. Which of the following techniques performs similar operations as dropout in a neural network?
A. Bagging
B. Boosting
C. Stacking
D. None of these

Q12. Which of the following gives non-linearity to a neural network?
A. Stochastic Gradient Descent
B. Rectified Linear Unit
C. Convolution function
D. None of the above

Q13. In training a neural network, you notice that the loss does not decrease in the first few epochs.
The reasons for this could be:
1. The learning rate is low
2. The regularization parameter is high
3. Stuck at a local minimum
What according to you are the probable reasons?
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. Any of these

Q14. Which of the following is true about model capacity (where model capacity means the ability of a neural network to approximate complex functions)?
A. As the number of hidden layers increases, model capacity increases
B. As the dropout ratio increases, model capacity increases
C. As the learning rate increases, model capacity increases
D. None of these

Q15. If you increase the number of hidden layers in a Multi Layer Perceptron, the classification error of test data always decreases. True or False?
A. True
B. False

Q16. You are building a neural network where it gets input from the previous layer as well as from itself. Which of the following architectures has feedback connections?
A. Recurrent Neural Network
B. Convolutional Neural Network
C. Restricted Boltzmann Machine
D. None of these

Q17. What is the sequence of the following tasks in a perceptron?
1. Initialize the weights of the perceptron randomly
2. Go to the next batch of the dataset
3. If the prediction does not match the output, change the weights
4. For a sample input, compute an output
A. 1, 2, 3, 4
B. 4, 3, 2, 1
C. 3, 1, 2, 4
D. 1, 4, 3, 2

Q18. Suppose that you have to minimize the cost function by changing the parameters. Which of the following techniques could be used for this?
A. Exhaustive Search
B. Random Search
C. Bayesian Optimization
D. Any of these

Q19. First-order gradient descent would not work correctly (i.e. may get stuck) in which of the following graphs?
A.
B. (Ans)
C.

Q20. The below graph shows the accuracy of a trained 3-layer convolutional neural network vs. the number of parameters (i.e. number of feature kernels). The trend suggests that as you increase the width of a neural network, the accuracy increases till a certain threshold value, and then starts decreasing. What could be the possible reason for this decrease?
A. Even if the number of kernels increases, only a few of them are used for prediction
B. As the number of kernels increases, the predictive power of the neural network decreases
C. As the number of kernels increases, they start to correlate with each other, which in turn helps overfitting
D. None of these
Q21. Suppose we have a one-hidden-layer neural network as shown above. The hidden layer in this network works as a dimensionality reductor. Now, instead of using this hidden layer, we replace it with a dimensionality reduction technique such as PCA. Would the network that uses a dimensionality reduction technique always give the same output as the network with a hidden layer?
A. Yes
B. No

Q22. Can a neural network model the function (y = 1/x)?
A. Yes
B. No

Q23. In which neural net architecture does weight sharing occur?
A. Convolutional Neural Network
B. Recurrent Neural Network
C. Fully Connected Neural Network
D. Both A and B

Q24. Batch Normalization is helpful because
A. It normalizes (changes) all the input before sending it to the next layer
B. It returns the normalized mean and standard deviation of weights
C. It is a very efficient backpropagation technique
D. None of these
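A hedged numpy sketch of what batch normalization in Q24 actually computes during training: per-feature normalization over the batch, followed by a learned scale and shift (gamma/beta are the conventional names, not the quiz's):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift.

    x: (batch_size, n_features) activations from the previous layer.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta               # learned rescaling

x = np.random.default_rng(0).normal(5.0, 3.0, size=(8, 4))
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))
```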
Q25. Instead of trying to achieve absolute zero error, we set a metric called Bayes error, which is the error we hope to achieve. What could be the reason for using Bayes error?
A. Input variables may not contain complete information about the output variable
B. The system (that creates the input-output mapping) may be stochastic
C. Limited training data
D. All of the above

Q26. The number of neurons in the output layer should match the number of classes (where the number of classes is greater than 2) in a supervised learning task. True or False?
A. True
B. False

Q27. In a neural network, which of the following techniques is used to deal with overfitting?
A. Dropout
B. Regularization
C. Batch Normalization
D. All of these

Q28. Y = ax^2 + bx + c (polynomial equation of degree 2)
Can this equation be represented by a neural network with a single hidden layer with linear threshold?
A. Yes
B. No

Q29. What is a dead unit in a neural network?
A. A unit which doesn't update during training, by any of its neighbours
B. A unit which does not respond completely to any of the training patterns
C. The unit which produces the biggest sum-squared error
D. None of these
Q30. Which of the following statements is the best description of early stopping?
A. Train the network until a local minimum in the error function is reached
B. Simulate the network on a test dataset after every epoch of training. Stop training when the generalization error starts to increase
C. Add a momentum term to the weight update in the Generalized Delta Rule, so that training converges more quickly
D. A faster version of backpropagation, such as the 'Quickprop' algorithm

Q31. What if we use a learning rate that's too large?
A. Network will converge
B. Network will not converge
C. Can't say

Q32. The network shown in Figure 1 is trained to recognize the characters H and T as shown below. What would be the output of the network?
A.
B.
C.
D. Could be A or B depending on the weights of the neural network

Q33. Suppose a convolutional neural network is trained on the ImageNet dataset (an object recognition dataset). This trained model is then given a completely white image as an input. The output probabilities for this input would be equal for all classes. True or False?
A. True
B. False

Q34. When a pooling layer is added in a convolutional neural network, translation invariance is preserved. True or False?
A. True
B. False

Q35. Which gradient technique is more advantageous when the data is too big to handle in RAM simultaneously?
A. Full Batch Gradient Descent
B. Stochastic Gradient Descent

Q36. The graph represents the gradient flow of a four-hidden-layer neural network trained using the sigmoid activation function, per epoch of training. The neural network suffers from the vanishing gradient problem. Which of the following statements is true?
A. Hidden layer 1 corresponds to D, hidden layer 2 corresponds to C, hidden layer 3 corresponds to B and hidden layer 4 corresponds to A
B. Hidden layer 1 corresponds to A, hidden layer 2 corresponds to B, hidden layer 3 corresponds to C and hidden layer 4 corresponds to D

Q37. For a classification task, instead of random weight initializations in a neural network, we set all the weights to zero. Which of the following statements is true?
A. There will not be any problem and the neural network will train properly
B. The neural network will train but all the neurons will end up recognizing the same thing
C. The neural network will not train as there is no net gradient change
D. None of these

Q38. There is a plateau at the start. This is happening because the neural network gets stuck at a local minimum before going on to the global minimum. To avoid this, which of the following strategies should work?
A. Increase the number of parameters, as the network would not get stuck at a local minimum
B. Decrease the learning rate by 10 times at the start and then use momentum
C. Jitter the learning rate, i.e. change the learning rate for a few epochs
D. None of these

Q39. For an image recognition problem (recognizing a cat in a photo), which architecture of neural network would be better suited to solve the problem?
A. Multi Layer Perceptron
B. Convolutional Neural Network
C. Recurrent Neural Network
D. Perceptron
Q40. Suppose that while training you encounter this issue: the error suddenly increases after a couple of iterations. You determine that there must be a problem with the data. You plot the data and find the insight that the original data is somewhat skewed and that may be causing the problem. What will you do to deal with this challenge?
A. Normalize
B. Apply PCA and then Normalize
C. Take the Log Transform of the data
D. None of these

Q41. Which of the following is a decision boundary of a Neural Network?
A) B
B) A
C) D
D) C
E) All of these

Q42. In the graph below, we observe that the error has many "ups and downs". Should we be worried?
A. Yes, because this means there is a problem with the learning rate of the neural network.
B. No, as long as there is a cumulative decrease in both training and validation error, we don't need to worry.

Q43. What are the factors for selecting the depth of a neural network?
1. Type of neural network (e.g. MLP, CNN etc.)
2. Input data
3. Computation power, i.e. hardware and software capabilities
4. Learning rate
5. The output function to map
A. 1, 2, 4, 5
B. 2, 3, 4, 5
C. 1, 3, 4, 5
D. All of these
Q44. Consider the scenario: the problem you are trying to solve has a small amount of data. Fortunately, you have a pre-trained neural network that was trained on a similar problem. Which of the following methodologies would you choose to make use of this pre-trained network?
A. Re-train the model for the new dataset
B. Assess on every layer how the model performs and only select a few of them
C. Fine-tune the last couple of layers only
D. Freeze all the layers except the last, re-train the last layer

Q45. An increase in the size of a convolutional kernel would necessarily increase the performance of a convolutional network.
A. True
B. False

1) Which of the following statements is true in the following case?
A) Feature F1 is an example of a nominal variable.
B) Feature F1 is an example of an ordinal variable.
C) It doesn't belong to any of the above categories.
D) Both of these

2) Which of the following is an example of a deterministic algorithm?
A) PCA
B) K-Means
C) None of the above

3) [True or False] A Pearson correlation between two variables is zero, but their values can still be related to each other.
A) TRUE
B) FALSE

4) Which of the following statement(s) is/are true for Gradient Descent (GD) and Stochastic Gradient Descent (SGD)?
1. In GD and SGD, you update a set of parameters in an iterative manner to minimize the error function.
2. In SGD, you have to run through all the samples in your training set for a single update of a parameter in each iteration.
3. In GD, you either use the entire data or a subset of the training data to update a parameter in each iteration.
A) Only 1
B) Only 2
C) Only 3
D) 1 and 2

5) Which of the following hyperparameter(s), when increased, may cause random forest to overfit the data?
1. Number of Trees
2. Depth of Tree
3. Learning Rate
A) Only 1
B) Only 2
C) Only 3
D) 1 and 2

6) Imagine you are working with "Analytics Vidhya" and you want to develop a machine learning algorithm which predicts the number of views on articles. Your analysis is based on features like author name, the number of articles written by the same author on Analytics Vidhya in the past, and a few other features. Which of the following evaluation metrics would you choose in that case?
1. Mean Square Error
2. Accuracy
3. F1 Score
A) Only 1
B) Only 2
C) Only 3
D) 1 and 3

7) Given below are three images (1, 2, 3). Which of the following options is correct for these images?
(three activation function plots)
A) 1 is tanh, 2 is ReLU and 3 is SIGMOID activation functions.
B) 1 is SIGMOID, 2 is ReLU and 3 is tanh activation functions.
C) 1 is ReLU, 2 is tanh and 3 is SIGMOID activation functions.
D) 1 is tanh, 2 is SIGMOID and 3 is ReLU activation functions.

8) Below are the 8 actual values of the target variable in the train file:
[0, 0, 0, 1, 1, 1, 1, 1]
What is the entropy of the target variable?
A) -(5/8 log(5/8) + 3/8 log(3/8))
B) 5/8 log(5/8) + 3/8 log(3/8)
C) 3/8 log(5/8) + 5/8 log(3/8)
D) 5/8 log(3/8) – 3/8 log(5/8)
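The entropy in Q8 can be checked numerically; a small sketch using log base 2, a common convention the question leaves unspecified:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

y = [0, 0, 0, 1, 1, 1, 1, 1]
print(entropy(y))  # -(5/8*log2(5/8) + 3/8*log2(3/8)) ~= 0.954
```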
9) Let's say you are working with categorical feature(s) and you have not looked at the distribution of the categorical variable in the test data. You want to apply one-hot encoding (OHE) on the categorical feature(s). What challenges may you face if you have applied OHE on a categorical variable of the train dataset?
A) All categories of the categorical variable are not present in the test dataset.
B) The frequency distribution of categories is different in the train dataset as compared to the test dataset.
C) Train and test always have the same distribution.
D) Both A and B

10) The skip-gram model is one of the best models used in the Word2vec algorithm for word embeddings. Which one of the following models depicts the skip-gram model?
A) A
B) B
C) Both A and B
D) None of these

11) Let's say you are using activation function X in the hidden layers of a neural network. At a particular neuron for any given input, you get the output as "-0.0001". Which of the following activation functions could X represent?
A) ReLU
B) tanh
C) SIGMOID
D) None of these
12) [True or False] The LogLoss evaluation metric can have negative values.
A) TRUE
B) FALSE

13) Which of the following statements is/are true about "Type-1" and "Type-2" errors?
1. Type 1 is known as false positive and Type 2 is known as false negative.
2. Type 1 is known as false negative and Type 2 is known as false positive.
3. Type 1 error occurs when we reject a null hypothesis when it is actually true.
A) Only 1
B) Only 2
C) Only 3
D) 1 and 3

14) Which of the following is/are one of the important step(s) to pre-process the text in NLP based projects?
1. Stemming
2. Stop word removal
3. Object Standardization
A) 1 and 2
B) 1 and 3
C) 2 and 3
D) 1, 2 and 3

15) Suppose you want to project high-dimensional data into lower dimensions. The two most famous dimensionality reduction algorithms used here are PCA and t-SNE. Let's say you have applied both algorithms respectively on data "X" and you got the datasets "X_projected_PCA" and "X_projected_tSNE". Which of the following statements is true for "X_projected_PCA" and "X_projected_tSNE"?
A) X_projected_PCA will have interpretation in the nearest neighbour space.
B) X_projected_tSNE will have interpretation in the nearest neighbour space.
C) Both will have interpretation in the nearest neighbour space.
D) None of them will have interpretation in the nearest neighbour space.
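For Q15, a hedged sklearn sketch of producing both projections on stand-in data; t-SNE explicitly tries to preserve local neighbourhoods, whereas PCA preserves directions of maximum variance, which is the distinction the question is probing:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.default_rng(0).normal(size=(200, 10))  # stand-in data

X_projected_PCA = PCA(n_components=2).fit_transform(X)
X_projected_tSNE = TSNE(n_components=2, perplexity=30,
                        init="pca", random_state=0).fit_transform(X)

print(X_projected_PCA.shape, X_projected_tSNE.shape)  # (200, 2) (200, 2)
```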
Context: 16-17
Given below are three scatter plots for two features (Images 1, 2 & 3, from left to right).

16) In the above images, which of the following is/are examples of multi-collinear features?
A) Features in Image 1
B) Features in Image 2
C) Features in Image 3
D) Features in Images 1 & 2

17) In the previous question, suppose you have identified multi-collinear features. Which of the following action(s) would you perform next?
1. Remove both collinear variables.
2. Instead of removing both variables, we can remove only one variable.
3. Removing correlated variables might lead to loss of information. In order to retain those variables, we can use penalized regression models like ridge or lasso regression.
A) Only 1
B) Only 2
C) Only 3
D) Either 2 or 3

18) Adding a non-important feature to a linear regression model may result in:
1. Increase in R-square
2. Decrease in R-square
A) Only 1 is correct
B) Only 2 is correct
C) Either 1 or 2
D) None of these

19) Suppose you are given three variables X, Y and Z. The Pearson correlation coefficients for (X, Y), (Y, Z) and (X, Z) are C1, C2 & C3 respectively. Now, you have added 2 to all values of X (i.e. new values become X+2), subtracted 2 from all values of Y (i.e. new values are Y-2), and Z remains the same. The new coefficients for (X,Y), (Y,Z) and (X,Z) are given by D1, D2 & D3 respectively. How do the values of D1, D2 & D3 relate to C1, C2 & C3?
A) D1 = C1, D2 < C2, D3 > C3
B) D1 = C1, D2 > C2, D3 > C3
C) D1 = C1, D2 > C2, D3 < C3
D) D1 = C1, D2 < C2, D3 < C3
E) D1 = C1, D2 = C2, D3 = C3

20) Imagine you are solving a classification problem with a highly imbalanced class. The majority class is observed 99% of the time in the training data. Your model has 99% accuracy after taking the predictions on the test data. Which of the following is true in such a case?
1. The accuracy metric is not a good idea for imbalanced class problems.
2. The accuracy metric is a good idea for imbalanced class problems.
3. Precision and recall metrics are good for imbalanced class problems.
4. Precision and recall metrics aren't good for imbalanced class problems.
A) 1 and 3
B) 1 and 4
C) 2 and 3
D) 2 and 4

21) In ensemble learning, you aggregate the predictions of weak learners, so that an ensemble of these models gives a better prediction than the prediction of the individual models. Which of the following statements is/are true for the weak learners used in an ensemble model?
1. They don't usually overfit.
2. They have high bias, so they cannot solve complex learning problems
3. They usually overfit.
A) 1 and 2
B) 1 and 3
C) 2 and 3
D) Only 1

22) Which of the following options is/are true for K-fold cross-validation?
1. An increase in K will result in a higher time required to cross-validate the result.
2. Higher values of K will result in higher confidence in the cross-validation result as compared to a lower value of K.
3. If K = N, then it is called leave-one-out cross-validation, where N is the number of observations.
A) 1 and 2
B) 2 and 3
C) 1 and 3
D) 1, 2 and 3

Question Context 23-24
Cross-validation is an important step in machine learning for hyperparameter tuning. Let's say you are tuning the hyperparameter "max_depth" for GBM by selecting it from 10 different depth values (values greater than 2) for a tree-based model using 5-fold cross-validation. The time taken by the algorithm to train on 4 folds (on a model with max_depth 2) is 10 seconds, and prediction on the remaining fold takes 2 seconds.
Note: ignore hardware dependencies from the equation.

23) Which of the following options is true for the overall execution time of 5-fold cross-validation with 10 different values of "max_depth"?
A) Less than 100 seconds
B) 100 – 300 seconds
C) 300 – 600 seconds
D) More than or equal to 600 seconds
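The arithmetic behind Q23 (and Q24, which follows) is worth spelling out. A sketch, assuming the training-plus-prediction time is the same for every fold and every depth value; since deeper trees would in practice take longer than the max_depth-2 timing given, these totals are lower bounds:

```python
train_s, predict_s = 10, 2          # per cross-validation round, as given
folds, depth_values = 5, 10

per_setting = folds * (train_s + predict_s)   # 5 * 12 = 60 s per depth value
q23_total = depth_values * per_setting        # 10 * 60 = 600 s
q24_total = q23_total * 5                     # adding 5 learning rates: 3000 s
print(q23_total, q24_total)                   # 600 3000
```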
24) In the previous question, suppose you train the same algorithm to tune 2 hyperparameters, say "max_depth" and "learning_rate". You want to select the right value of "max_depth" (from the given 10 depth values) and "learning_rate" (from the given 5 different learning rates). In such a case, which of the following will represent the overall time?
A) 1000 – 1500 seconds
B) 1500 – 3000 seconds
C) More than or equal to 3000 seconds
D) None of these

25) Given below is a scenario for training error TE and validation error VE for a machine learning algorithm M1. You want to choose a hyperparameter (H) based on TE and VE.

H | TE  | VE
1 | 105 | 90
2 | 200 | 85
3 | 250 | 96
4 | 105 | 85
5 | 300 | 100

Which value of H will you choose based on the above table?
A) 1
B) 2
C) 3
D) 4

26) What would you do in PCA to get the same projection as SVD?
A) Transform the data to zero mean
B) Transform the data to zero median
C) Not possible
D) None of these

Question Context 27-28
Assume there is a black box algorithm which takes training data with multiple observations (t1, t2, t3, …, tn) and a new observation (q1). The black box outputs the nearest neighbor of q1 (say ti) and its corresponding class label ci. You can also think of this black box algorithm as the same as 1-NN (1-nearest neighbor).

27) It is possible to construct a k-NN classification algorithm based on this black box alone.
Note: n (the number of training observations) is very large compared to k.
A) TRUE
B) FALSE

28) Instead of using the 1-NN black box, we want to use a j-NN (j > 1) algorithm as the black box. Which of the following options is correct for finding k-NN using j-NN?
1. j must be a proper factor of k
2. j > k
3. Not possible
A) 1
B) 2
C) 3

29) Suppose you are given 7 scatter plots 1-7 (left to right) and you want to compare the Pearson correlation coefficients between the variables of each scatterplot. Which of the following is in the right order?
1. 1<2<3<4
2. 1>2>3>4
3. 7<6<5<4
4. 7>6>5>4
A) 1 and 3
B) 2 and 3
C) 1 and 4
D) 2 and 4
30) You can evaluate the performance of a binary classification problem using different metrics such as accuracy, log-loss, and F-Score. Let's say you are using the log-loss function as the evaluation metric. Which of the following option(s) is/are true for the interpretation of log-loss as an evaluation metric?
1. If a classifier is confident about an incorrect classification, then log-loss will penalise it heavily.
2. If, for a particular observation, the classifier assigns a very small probability to the correct class, then the corresponding contribution to the log-loss will be very large.
3. The lower the log-loss, the better the model.
A) 1 and 3
B) 2 and 3
C) 1 and 2
D) 1, 2 and 3

Context Questions 31-32
Below are five samples given in the dataset.
Note: the visual distance between the points in the image represents the actual distance.

31) Which of the following is the leave-one-out cross-validation accuracy for 3-NN (3-nearest neighbor)?
A) 0
B) 0.4
C) 0.8
D) 1
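Questions 31-32 refer to a small plotted dataset that is not reproduced here. As a generic sketch, leave-one-out cross-validation accuracy for a k-NN classifier can be computed with sklearn; the dataset below is a stand-in, not the quiz's points:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))               # stand-in points
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # stand-in labels

# One accuracy score per held-out sample; the mean is the LOOCV accuracy.
for k in (1, 3, 4):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=LeaveOneOut())
    print(k, scores.mean())
```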
32) Which of the following values of K will have the least leave-one-out cross-validation accuracy?
A) 1NN
B) 3NN
C) 4NN
D) All have the same leave-one-out error

33) Suppose you are given the below data and you want to apply a logistic regression model to classify it into the two given classes. You are using logistic regression with L1 regularization, where C is the regularization parameter and w1 & w2 are the coefficients of x1 and x2. Which of the following options is correct when you increase the value of C from zero to a very large value?
A) First w2 becomes zero and then w1 becomes zero
B) First w1 becomes zero and then w2 becomes zero
C) Both become zero at the same time
D) Both cannot be zero even after a very large value of C

34) Suppose we have a dataset which can be trained with 100% accuracy with the help of a decision tree of depth 6. Now consider the points below and choose the option based on these points.
Note: all other hyperparameters are the same and other factors are not affected.
1. Depth 4 will have high bias and low variance
2. Depth 4 will have low bias and low variance
A) Only 1
B) Only 2
C) Both 1 and 2
D) None of the above

35) Which of the following options can be used to get a global minimum in the k-Means algorithm?
1. Try to run the algorithm for different centroid initializations
2. Adjust the number of iterations
3. Find out the optimal number of clusters
A) 2 and 3
B) 1 and 3
C) 1 and 2
D) All of the above

36) Imagine you are working on a project which is a binary classification problem. You trained a model on the training dataset and got the below confusion matrix on the validation dataset. Based on the above confusion matrix, choose which option(s) below will give you correct predictions.
1. Accuracy is ~0.91
2. Misclassification rate is ~0.91
3. False positive rate is ~0.95
4. True positive rate is ~0.95
A) 1 and 3
B) 2 and 4
C) 1 and 4
D) 2 and 3

37) For which of the following hyperparameters is a higher value better for the decision tree algorithm?
1. Number of samples used for split
2. Depth of tree
3. Samples for leaf
A) 1 and 2
B) 2 and 3
C) 1 and 3
D) 1, 2 and 3
E) Can't say

Context 38-39
Imagine you have a 28 * 28 image and you run a 3 * 3 convolution on it with an input depth of 3 and an output depth of 8.
Note: stride is 1 and you are using same padding.

38) What is the dimension of the output feature map when you are using the given parameters?
A) 28 width, 28 height and 8 depth
B) 13 width, 13 height and 8 depth
C) 28 width, 13 height and 8 depth
D) 13 width, 28 height and 8 depth

39) What is the dimension of the output feature map when you are using the following parameters?
A) 28 width, 28 height and 8 depth
B) 13 width, 13 height and 8 depth
C) 28 width, 13 height and 8 depth
D) 13 width, 28 height and 8 depth
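The output spatial size in Q38 follows from the standard convolution arithmetic out = (W - K + 2P) / S + 1; with 'same' padding and stride 1 the spatial size is preserved, and the depth equals the number of filters (8 here). A sketch:

```python
def conv_output_size(w, k, p, s):
    """Spatial output size of a convolution: (W - K + 2P) / S + 1."""
    return (w - k + 2 * p) // s + 1

# Q38: 28x28 input, 3x3 kernel, stride 1, 'same' padding (p = 1 for k = 3)
print(conv_output_size(28, 3, 1, 1))  # 28 -> output is 28 x 28 x 8

# For comparison, 'valid' padding (p = 0) would shrink the map:
print(conv_output_size(28, 3, 0, 1))  # 26
```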
40) Suppose we were plotting visualizations for different values of C (the penalty parameter) in the SVM algorithm. Due to some reason, we forgot to tag the C values with the visualizations. In that case, which of the following options best explains the C values for the images below (1, 2, 3 left to right, so the C values are C1 for image 1, C2 for image 2 and C3 for image 3) in the case of an RBF kernel?
A) C1 = C2 = C3
B) C1 > C2 > C3
C) C1 < C2 < C3
D) None of these

1) Is the data linearly separable?
A) Yes
B) No

2) Which of the following are universal approximators?
A) Kernel SVM
B) Neural Networks
C) Boosted Decision Trees
D) All of the above

3) In which of the following applications can we use deep learning to solve the problem?
A) Protein structure prediction
B) Prediction of chemical reactions
C) Detection of exotic particles
D) All of these

4) Which of the following statements is true when you use 1×1 convolutions in a CNN?
A) It can help in dimensionality reduction
B) It can be used for feature pooling
C) It suffers less overfitting due to small kernel size
D) All of the above

5) Question Context:
Statement 1: It is possible to train a network well by initializing all the weights as 0
Statement 2: It is possible to train a network well by initializing the biases as 0
Which of the statements given above is true?
A) Statement 1 is true while Statement 2 is false
B) Statement 2 is true while Statement 1 is false
C) Both statements are true
D) Both statements are false

6) The number of nodes in the input layer is 10 and in the hidden layer is 5. The maximum number of connections from the input layer to the hidden layer is
A) 50
B) Less than 50
C) More than 50
D) It is an arbitrary value

7) The input image has been converted into a matrix of size 28 X 28 and a kernel/filter of size 7 X 7 with a stride of 1 is applied. What will be the size of the convoluted matrix?
A) 22 X 22
B) 21 X 21
C) 28 X 28
D) 7 X 7

8) In a simple MLP model with 8 neurons in the input layer, 5 neurons in the hidden layer and 1 neuron in the output layer, what is the size of the weight matrices between the hidden and output layer and between the input and hidden layer?
A) [1 X 5], [5 X 8]
B) [8 X 5], [1 X 5]
C) [8 X 5], [5 X 1]
D) [5 X 1], [8 X 5]

11) Which of the following functions can be used as an activation function in the output layer if we wish to predict the probabilities of n classes (p1, p2, …, pk) such that the sum of p over all n classes equals 1?
A) Softmax
B) ReLU
C) Sigmoid
D) Tanh

12) Assume a simple MLP model with 3 neurons and inputs = 1, 2, 3. The weights to the input neurons are 4, 5 and 6 respectively. Assume the activation function is a linear constant value of 3. What will be the output?
A) 32
B) 643
C) 96
D) 48

13) Which of the following activation functions can't be used at the output layer to classify an image?
A) sigmoid
B) Tanh
C) ReLU
D) If(x>5, 1, 0)
E) None of the above

14) [True | False] In a neural network, every parameter can have its own learning rate.
A) TRUE
B) FALSE

15) Can dropout be applied at the visible layer of a neural network model?
A) TRUE
B) FALSE
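Q12's arithmetic, written out; I am interpreting "a linear constant value of 3" in the usual way for this question, as the activation f(u) = 3u:

```python
inputs = [1, 2, 3]
weights = [4, 5, 6]

weighted_sum = sum(x * w for x, w in zip(inputs, weights))  # 1*4 + 2*5 + 3*6 = 32
output = 3 * weighted_sum                                   # linear activation f(u) = 3u
print(output)  # 96
```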
16) I am working with a fully connected architecture having one hidden layer with 3 neurons and one output neuron to solve a binary classification challenge. Below is the structure of the input and output:
Input dataset: [[1,0,1,0], [1,0,1,1], [0,1,0,1]]
Output: [[1], [1], [0]]
To train the model, I have initialized all the weights for the hidden and output layers with 1. Do you think the model will be able to learn the pattern in the data?
A) Yes
B) No

17) Which of the following neural network training challenges can be solved using batch normalization?
A) Overfitting
B) Restricting activations from becoming too high or low
C) Training is too slow
D) Both B and C
E) All of the above

18) Which of the following would have a constant input in each epoch of training a deep learning model?
A) Weights between the input and hidden layer
B) Weights between the hidden and output layer
C) Biases of all hidden layer neurons
D) Activation function of the output layer
E) None of the above

19) True/False: Changing the sigmoid activation to ReLU will help to get over the vanishing gradient issue?
A) TRUE
B) FALSE

20) In a CNN, does having max pooling always decrease the number of parameters?
A) TRUE
B) FALSE

21) [True or False] Backpropagation cannot be applied when using pooling layers
A) TRUE
B) FALSE
22) What value would be in place of the question mark? Here we see a convolution being applied to an input.
A) 3
B) 4
C) 5
D) 6

23) For a binary classification problem, which of the following architectures would you choose?
A) 1
B) 2
C) Any one of these
D) None of these
24) Suppose there is an issue while training a neural network: the training loss/validation loss remains constant. What could be the possible reason?
A) The architecture is not defined correctly
B) The data given to the model is noisy
C) Both of these

25) The red curve above denotes training accuracy with respect to each epoch in a deep learning algorithm. Both the green and blue curves denote validation accuracy. Which of these indicates overfitting?
A) Green Curve
B) Blue Curve

26) Which of the following statements is true regarding dropout?
1: Dropout gives a way to approximate combining many different architectures
2: Dropout demands high learning rates
3: Dropout can help prevent overfitting
A) Both 1 and 2
B) Both 1 and 3
C) Both 2 and 3
D) All 1, 2 and 3

27) Gated recurrent units can help prevent the vanishing gradient problem in RNNs.
A) True
B) False

29) [True or False] Sentiment analysis using deep learning is a many-to-one prediction task
A) TRUE
B) FALSE
30) What steps can we take to prevent overfitting in a neural network?
A) Data Augmentation
B) Weight Sharing
C) Early Stopping
D) Dropout
E) All of the above