The questions cover the following topics:

2. Review of regression modeling and analysis; implementation in R. [Chap. 3]

3. Classification problems and classification tools. Logistic regression and review of linear discriminant

analysis. [Chap. 4]

4. Resampling methods; bootstrap. [Chap. 5 and lecture notes].

5. High-dimensional data and shrinkage. Ridge regression. LASSO. Model selection methods and

dimension reduction. [Chap. 6]

6. Nonlinear trends and splines. [Chap. 7; 7.4-7.5]

7. Regression trees and decision trees [Chap. 8]

8. Introduction to support vector machines [If time permits; Chap. 9]

9. Clustering methods [Chap. 10]

The following are the questions:

True or False. Justify your answer.

The lasso, relative to least squares, is more flexible and hence will give improved prediction accuracy

when its increase in bias is less than its decrease in variance.
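To explore this claim numerically, one might compare least squares against the lasso on simulated high-dimensional data. The sketch below is illustrative only: the coordinate-descent lasso, the penalty value, and the data-generating parameters are my own choices, not from the course.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 30
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]        # only 3 truly nonzero coefficients
y = X @ beta_true + rng.normal(size=n)

# Least squares fit
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Lasso via coordinate descent with soft-thresholding (illustrative settings)
lam = 20.0
beta = np.zeros(p)
for _ in range(200):
    for j in range(p):
        r = y - X @ beta + X[:, j] * beta[j]   # partial residual
        rho = X[:, j] @ r
        z = X[:, j] @ X[:, j]
        beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z

# The lasso shrinks coefficients and sets many exactly to zero,
# trading some bias for a reduction in variance.
print("lasso zeros:", int(np.sum(beta == 0)), " OLS zeros:", int(np.sum(beta_ols == 0)))
```

Comparing the two coefficient vectors makes the trade-off concrete: the lasso solution is sparser and smaller in L1 norm than the least squares solution, which is the sense in which it is less flexible.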

Suppose you are given a dataset of MRI images from patients with and without brain cancer, and

suppose that we have a method to effectively extract features from those images. If you are required to

train a classifier that predicts the probability that the patient has brain cancer, you would prefer to use

decision trees over logistic regression.

Suppose the dataset in the previous question had 900 brain cancer-free MRI images and 100 images

from brain cancer patients. If we train a classifier which achieves 85% accuracy on this dataset, then

we can consider this classifier a good classifier.
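A useful sanity check for this question is to compare the reported accuracy against the trivial majority-class baseline. The numbers below come directly from the question; the comparison itself is a minimal sketch.

```python
# Majority-class baseline for the 900/100 MRI dataset from the question.
n_negative = 900   # cancer-free images
n_positive = 100   # images from brain cancer patients
total = n_negative + n_positive

# A classifier that always predicts "no cancer" is correct on every
# cancer-free image, so its accuracy equals the negative-class fraction.
baseline_accuracy = n_negative / total
print(baseline_accuracy)           # 0.9

# An 85% classifier does worse than this do-nothing baseline,
# which is why raw accuracy is misleading on imbalanced data.
print(0.85 < baseline_accuracy)    # True
```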

A classifier that attains 99.95% accuracy on the training set and 65% accuracy on the test set is better

than a classifier that attains 76.5% accuracy on the training set and 72% accuracy on the test set.
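The train/test gap in this question can be reproduced in miniature with polynomial regression: a high-capacity model wins on training error but loses on test error. Everything below (the sine target, sample sizes, degrees) is an illustrative choice of mine, not from the course.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_tr, y_tr = make_data(20)
x_te, y_te = make_data(200)

def poly_mse(degree):
    # Fit a degree-`degree` polynomial on the training data,
    # then report train and test mean squared error.
    coefs = np.polyfit(x_tr, y_tr, degree)
    mse_tr = np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2)
    mse_te = np.mean((np.polyval(coefs, x_te) - y_te) ** 2)
    return mse_tr, mse_te

tr_low, te_low = poly_mse(3)    # modest capacity
tr_hi, te_hi = poly_mse(15)     # high capacity, overfits the 20 points

# The degree-15 fit has the lower TRAINING error but the higher TEST
# error -- generalization, not training accuracy, is what matters.
print(tr_hi < tr_low, te_hi > te_low)
```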

Decision trees with interaction depth one will always give a linear decision boundary.

A tree with depth of 3 has higher variance than a tree with depth of 1.

Models which underfit have high variance.

The bootstrap method involves sampling without replacement.
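To reason about this question concretely, one can draw a bootstrap sample and inspect it. The sketch below uses only the standard library; the sample size is an arbitrary choice.

```python
import random

random.seed(42)
n = 1000
data = list(range(n))

# A bootstrap sample draws n observations from the data, and because the
# draws are independent, some observations repeat while others are left out.
boot = [random.choice(data) for _ in range(n)]

unique_fraction = len(set(boot)) / n
print(unique_fraction)   # close to 1 - 1/e ~ 0.632 on average
```

The fraction of distinct observations in a bootstrap sample concentrates around 1 − 1/e ≈ 63.2%, a useful fact when justifying the answer.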

Using the kernel trick, one can get non-linear decision boundaries using algorithms designed originally

for linear models.
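A minimal sketch of this idea, using a kernel perceptron on XOR data (my choice of algorithm and kernel parameters, not from the course): the perceptron is a linear algorithm, but written purely in terms of inner products it can be "kernelized" to produce a non-linear boundary.

```python
import numpy as np

# XOR-style data: not linearly separable in the original 2-D space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

def rbf(a, b, gamma=2.0):
    # Radial basis function kernel: an inner product in an implicit feature space.
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Gram matrix of pairwise kernel values
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

# Kernel perceptron: the linear perceptron update rewritten so it only
# touches inner products, which the kernel supplies implicitly.
alpha = np.zeros(len(X))
for _ in range(100):
    for i in range(len(X)):
        if np.sign(np.sum(alpha * y * K[:, i])) != y[i]:
            alpha[i] += 1

preds = np.array([np.sign(np.sum(alpha * y * K[:, i])) for i in range(len(X))])
print(preds)   # recovers y: a non-linear boundary from a linear algorithm
```

The same substitution of kernel values for inner products is what turns a linear support vector classifier into a non-linear support vector machine.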

The maximum margin decision boundaries that support vector machines construct have the lowest

generalization error among all linear classifiers.

For each of the following learning problems, indicate whether it is a prediction, regression, or classification

problem.

(a) A biologist has given different amounts of food to different rats in his laboratory. He has recorded the

weight of each rat after two months. Now he wants to learn how the weight of the rats depends on the

amount of food they get.

(b) Each spring a farmer counts the number of newborn sheep. Based on his counts from previous years,

he wants to estimate the number of newborn sheep in the coming year.

(c) A computer program tries to determine whether a newspaper article is about politics based on the

number of times the article contains the following words/phrases: 'law', 'sports', 'newspaper', 'hockey',

'elections', 'human rights' and 'party'.