Big Data Stats II Name:__________
STAT4650 Sample Final Exam Solution
Spring 2020
Instructions
1. Make sure to read or parse the entire exam.
2. Open book and open note.
3. You are allowed to use a scientific calculator.
4. Exiting the exam under any circumstances is final; you will NOT be allowed a
second attempt.
5. If any two or more of you submit identical answers to essay questions, it will result in a score
of zero for that question.
6. If you copy the content from the textbook/handout/solution, it will result in a score of zero for
that question.
7. You have 120 minutes to complete the test.
8. Make sure you hit the ‘Submit’ button once you are done with your exam.
9. If you have any questions, simply enter my online meeting room via Zoom and I will help you
with your questions.
10. Don't panic.
Students in my class are required to adhere to the standards of conduct set by Clark University
and GSOM. Please sign the following Honor pledge that signifies your understanding of the
rules set by the code of conduct.
“I pledge my honor that I have not violated Clark University's Code of Conduct during this
examination.”
Please sign here to acknowledge_____________________________
1. This question is about trees and random forests.
(a) [5 pts] Sketch the tree corresponding to the partition of the predictor space illustrated in the
following Figure. The Ri inside the boxes indicate region i.
Sol:
(b) [5 pts] Create a partition, using the tree illustrated in the following figure. Drag items onto
the image.
Solution:
(c) [5 pts] What is a Bootstrap Aggregation of Decision Trees? Explain (2-3 sentences).
Sol:
We first generate B different bootstrapped training datasets by sampling the training data with
replacement. We then fit a separate decision tree on each of the B datasets. For a regression
problem, the prediction is the average of the predictions from all B trees. For a classification
problem, we take a majority vote among the B trees.
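The procedure above can be sketched in a few lines of Python. This is only an illustration, not the course's reference code: the toy data, the one-split "stump" used as the tree, and the names `fit_stump` and `bagged_predict` are all assumptions made for the example.

```python
import random
from statistics import mean

def fit_stump(data):
    """Fit a one-split regression tree (a stump) on (x, y) pairs by least squares."""
    xs = sorted({x for x, _ in data})
    best = None
    for lo, hi in zip(xs, xs[1:]):
        cut = (lo + hi) / 2
        left = [y for x, y in data if x <= cut]
        right = [y for x, y in data if x > cut]
        ml, mr = mean(left), mean(right)
        rss = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or rss < best[0]:
            best = (rss, cut, ml, mr)
    if best is None:  # only one distinct x in the sample: predict the overall mean
        m = mean([y for _, y in data])
        return lambda x: m
    _, cut, ml, mr = best
    return lambda x: ml if x <= cut else mr

def bagged_predict(data, x_new, B=50, seed=0):
    """Average the predictions of B stumps, each fit on its own bootstrap sample."""
    rng = random.Random(seed)
    preds = []
    for _ in range(B):
        boot = [rng.choice(data) for _ in data]  # n draws with replacement
        preds.append(fit_stump(boot)(x_new))
    return mean(preds)  # regression: average (classification would use a majority vote)

# Toy data, made up for illustration: y jumps from 1 to 5 at x = 5
toy = [(x, 1.0 if x < 5 else 5.0) for x in range(10)]
print(bagged_predict(toy, 2), bagged_predict(toy, 8))
```

For a classification problem the last line of `bagged_predict` would return the most common predicted label instead of the mean.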
(d) [5 pts] How does a Random Forest differ from a Bootstrap Aggregation of Decision Trees?
Explain (2-3 sentences).
Solution:
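A random forest also builds a number of decision trees on bootstrapped training samples, but
each time a split in a tree is considered, only a random sample of m predictors is chosen as split
candidates from the full set of p predictors (usually $m \approx \sqrt{p}$). Bagging considers all p
predictors at every split (m = p), so the trees in a random forest are less correlated with one another.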
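The only step that differs from bagging is how the split candidates are chosen. A minimal sketch of that step, assuming a hypothetical helper `split_candidates` and an illustrative p = 16:

```python
import math
import random

def split_candidates(p, rng, m=None):
    """Return the indices of the m predictors considered at one split."""
    if m is None:
        m = max(1, round(math.sqrt(p)))  # the usual choice, m close to sqrt(p)
    return sorted(rng.sample(range(p), m))

rng = random.Random(1)
p = 16  # illustrative number of predictors
for split in range(3):
    # A fresh random subset is drawn every time a split is considered
    print("split", split, "-> candidates", split_candidates(p, rng))

# Bagging is the special case m = p: every predictor is a candidate at every split
print(split_candidates(p, rng, m=p))
```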
2. This question relates to Support Vector Machines and uses the data below.
(a) [5 pts] We are given n = 7 observations in p = 2 dimensions. The horizontal axis corresponds
to $X_1$ and the vertical axis corresponds to $X_2$. For each observation, there is an associated class
label. Sketch the observations in the coordinate grid.
(b) [5 pts] Provide $\beta_0$, $\beta_1$, $\beta_2$ for the maximum margin separating hyperplane defined by
$\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0$.
(c) [5 pts] Indicate the support vectors for the maximal margin classifier (you may answer
this question by writing down the coordinates of the support vectors).
(d) [5 pts] Argue that a slight movement of the seventh observation would not affect the
maximal margin hyperplane.
Obs.  X1  X2  Y
1     3   4   Red
2     2   2   Red
3     4   4   Red
4     1   4   Red
5     2   1   Blue
6     4   3   Blue
7     4   1   Blue
Solution: (a)
(b) $0.5 - X_1 + X_2 = 0$ (Note: any equation close to this one is acceptable.)
If a point falls above this line, meaning that $0.5 - X_1 + X_2 > 0$, we classify the point as
“red”, while if it falls below the line, meaning that $0.5 - X_1 + X_2 < 0$, we classify the
point as “blue”.
(c) The support vectors are the four points that lie on the gray margin lines.
These points are (2,1), (2,2), (4,3), (4,4).
(d) The seventh point is located at (4, 1), which is well outside the margin and is not one of
the support vectors. Only the support vectors determine the separating hyperplane, so small
movements in the seventh point's location won't change it.
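The answers to (b), (c) and (d) can be checked numerically. The sketch below uses the seven observations from the table and the hyperplane $0.5 - X_1 + X_2 = 0$; the helper names `f` and `distance` are illustrative.

```python
import math

# Observations from the table: (X1, X2, label)
points = [(3, 4, "Red"), (2, 2, "Red"), (4, 4, "Red"), (1, 4, "Red"),
          (2, 1, "Blue"), (4, 3, "Blue"), (4, 1, "Blue")]
b0, b1, b2 = 0.5, -1.0, 1.0  # hyperplane 0.5 - X1 + X2 = 0 from part (b)

def f(x1, x2):
    """Signed value of the hyperplane equation at a point."""
    return b0 + b1 * x1 + b2 * x2

def distance(x1, x2):
    """Perpendicular distance from a point to the hyperplane."""
    return abs(f(x1, x2)) / math.hypot(b1, b2)

margin = min(distance(x1, x2) for x1, x2, _ in points)
# Support vectors are the points that sit exactly on the margin
support = [(x1, x2) for x1, x2, _ in points if math.isclose(distance(x1, x2), margin)]

for i, (x1, x2, label) in enumerate(points, start=1):
    predicted = "Red" if f(x1, x2) > 0 else "Blue"
    print(i, (x1, x2), label, predicted, round(distance(x1, x2), 3))
print("support vectors:", sorted(support))
```

Observation 7 lies about five margin widths from the hyperplane, which is why nudging it leaves the support vectors, and hence the hyperplane, unchanged.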
3. (a) [5 pts] Considering the two methods “k-means clustering” and “k-nearest neighbors”,
which is a supervised learning algorithm and which is an unsupervised learning algorithm?
Solution:
“k-means clustering” is unsupervised learning, while “k-nearest neighbors” is supervised
learning.
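(b) [5 pts] What quantity does PCA minimize when it generates each principal component?
Solution: the sum of the squared perpendicular distances from the data points to the component's
direction (equivalently, it maximizes the variance of the data projected onto that direction)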
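One way to see why minimizing perpendicular distances amounts to maximizing projected variance is the Pythagorean identity below. The notation is assumed for this sketch: centered observations $x_1, \dots, x_n$ and a unit-length direction $\phi$.

```latex
% Assumed notation: centered observations x_1, ..., x_n and a unit vector \phi (\phi^\top \phi = 1).
\|x_i\|^2 = (\phi^\top x_i)^2 + \|x_i - (\phi^\top x_i)\,\phi\|^2
\quad\Longrightarrow\quad
\sum_{i=1}^{n} \|x_i - (\phi^\top x_i)\,\phi\|^2
  = \sum_{i=1}^{n} \|x_i\|^2 - \sum_{i=1}^{n} (\phi^\top x_i)^2 .
```

The first sum on the right does not depend on $\phi$, so making the total squared perpendicular distance as small as possible is the same as making the sum of squared projections, and hence the projected variance, as large as possible.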
(c) [5 pts] What is the optimal number of principal components in the figure below?
Solution:
The figure shows that 30 components give the highest explained
variance with the fewest components.
4. Suppose that we have four observations, for which we compute a dissimilarity matrix, given
by
For instance, the dissimilarity between the first and second observations is 0.3, and the
dissimilarity between the second and fourth observations is 0.8.
(a) [5 pts] On the basis of this dissimilarity matrix, sketch the dendrogram that results from
hierarchically clustering these four observations using complete linkage. Be sure to indicate on
the plot the height at which each fusion occurs, as well as the observations corresponding to each
leaf in the dendrogram. Drag items
(b) [5 pts] Repeat (a), this time using single linkage clustering.
(c) [5 pts] Suppose that we cut the dendrogram obtained in (a) such that two clusters result.
Which observations are in each cluster?
(d) [5 pts] Suppose that we cut the dendrogram obtained in (b) such that two clusters result.
Which observations are in each cluster?
(e) [5 pts] It is mentioned in the chapter that at each fusion in the dendrogram, the position of the
two clusters being fused can be swapped without changing the meaning of the dendrogram.
Draw a dendrogram that is equivalent to the dendrogram in (a), for which two or more of the
leaves are repositioned, but for which the meaning of the dendrogram is the same.
Solution:
(a)
(b)
(c)
(1,2), (3,4)
(d)
(1, 2, 3), (4)
(e)
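The cluster memberships in (c) and (d) can be reproduced with a short agglomerative-clustering sketch. Important assumption: the dissimilarity matrix itself is missing from this copy, and only d(1,2) = 0.3 and d(2,4) = 0.8 are stated in the question. The other four entries below are illustrative values chosen to be consistent with the answers to (c) and (d), so the printed fusion heights should be read as an example rather than as the exam's figures.

```python
from itertools import combinations

# ASSUMPTION: only d(1,2) = 0.3 and d(2,4) = 0.8 come from the question text.
# The remaining four entries are illustrative, not taken from the exam.
D = {
    frozenset({1, 2}): 0.3, frozenset({1, 3}): 0.4, frozenset({1, 4}): 0.7,
    frozenset({2, 3}): 0.5, frozenset({2, 4}): 0.8, frozenset({3, 4}): 0.45,
}

def agglomerate(linkage):
    """Return the fusions as (height, cluster_a, cluster_b); linkage is max or min."""
    clusters = [frozenset({i}) for i in (1, 2, 3, 4)]
    merges = []
    while len(clusters) > 1:
        def dist(pair):
            a, b = pair
            return linkage(D[frozenset({i, j})] for i in a for j in b)
        # Fuse the pair of clusters with the smallest inter-cluster dissimilarity
        a, b = min(combinations(clusters, 2), key=dist)
        merges.append((dist((a, b)), sorted(a), sorted(b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

complete = agglomerate(max)  # complete linkage: largest pairwise dissimilarity
single = agglomerate(min)    # single linkage: smallest pairwise dissimilarity
print("complete:", complete)
print("single:  ", single)
```

The two clusters joined in the final fusion are exactly the two clusters obtained by cutting the dendrogram just below its top: (1,2) and (3,4) for complete linkage, (1,2,3) and (4) for single linkage.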
5. Time Series Forecasting
(a) [10 pts] What are the differences between autoregressive and moving average models?
Solution: Autoregressive models specify the current value of a series $y_t$ as a function of its
previous p values and the current value of an error term, $u_t$, while moving average models
specify the current value of a series $y_t$ as a function of the current and previous q values
of an error term, $u_t$. AR and MA models have different characteristics in terms of the
length of their “memories”, which has implications for the time it takes shocks to $y_t$ to die
away, and for the shapes of their autocorrelation and partial autocorrelation functions.
An autoregressive process has a geometrically decaying acf and a pacf whose number of
non-zero points equals the AR order. A moving average process has an acf whose number of
non-zero points equals the MA order and a geometrically decaying pacf.
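The difference in the acf shapes can be seen in a small simulation. This is a sketch under assumed settings: the coefficients 0.7, the sample size, and the helper `sample_acf` are illustrative and not part of the exam. It compares only the acf; the pacf is not computed here.

```python
import random

def sample_acf(series, max_lag):
    """Sample autocorrelations at lags 1..max_lag."""
    n = len(series)
    mu = sum(series) / n
    dev = [v - mu for v in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(1, max_lag + 1)]

rng = random.Random(42)
n, phi, theta = 20000, 0.7, 0.7  # illustrative values, not from the exam
u = [rng.gauss(0, 1) for _ in range(n + 1)]  # white-noise error terms

ar = [0.0]
for t in range(1, n + 1):
    ar.append(phi * ar[-1] + u[t])  # AR(1): y_t = phi * y_{t-1} + u_t
ma = [u[t] + theta * u[t - 1] for t in range(1, n + 1)]  # MA(1): y_t = u_t + theta * u_{t-1}

acf_ar = sample_acf(ar, 4)
acf_ma = sample_acf(ma, 4)
print("AR(1) acf:", [round(r, 2) for r in acf_ar])  # decays roughly like phi**k
print("MA(1) acf:", [round(r, 2) for r in acf_ma])  # close to zero beyond lag 1
```

A shock to the AR(1) series keeps influencing later values with geometrically shrinking weight, while a shock to the MA(1) series disappears after one period, which is the "memory" difference described above.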
(b) [10 pts] A researcher wants to test the order of integration of some time series data. He
decides to use the DF test. He estimates a regression of the form
$\Delta y_t = \mu + \psi y_{t-1} + u_t$
and obtains the estimate $\hat{\psi} = -0.023$ with standard error = 0.009. What are the null and
alternative hypotheses for this test? Given the data, and a critical value of −2.86, perform
the test. What is the conclusion from this test and what should be the next step?
Solution: The null hypothesis is of a unit root against a one-sided stationary alternative, i.e. we
have
H0 : $y_t$ is a non-stationary process
H1 : $y_t$ is a stationary process
which is also equivalent to
H0 : $\psi = 0$
H1 : $\psi < 0$
The test statistic is given by $\hat{\psi} / SE(\hat{\psi})$, which equals -0.023 / 0.009 = -2.556. Since this is not more
negative than the critical value of -2.86, we do not reject the null hypothesis.
We therefore conclude that there is at least one unit root in the series (there could be 1, 2, 3 or
more). What we would do now is to regress $\Delta^2 y_t$ on $\Delta y_{t-1}$ and test if there is a further unit root. The
null and alternative hypotheses would now be:
H0 : $\Delta y_t \sim I(1)$, i.e. $y_t \sim I(2)$
H1 : $\Delta y_t \sim I(0)$, i.e. $y_t \sim I(1)$
If we rejected the null hypothesis, we would therefore conclude that the first differences are
stationary, and hence the original series was I(1). If we did not reject at this stage, we would
conclude that $y_t$ must be at least I(2), and we would have to test again until we rejected.
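The arithmetic of the first test can be checked directly. The numbers are those given in the question; the variable names are illustrative.

```python
coef_hat = -0.023       # estimated coefficient on the lagged level of the series
std_err = 0.009         # its standard error
critical_value = -2.86  # critical value given in the question

test_stat = coef_hat / std_err
# The DF test is one-sided: reject the unit-root null only if the
# statistic is more negative than the critical value.
reject_unit_root = test_stat < critical_value

print(f"test statistic = {test_stat:.3f}")  # prints: test statistic = -2.556
print("reject H0 (unit root)?", reject_unit_root)  # prints: reject H0 (unit root)? False
```

Because -2.556 is greater than -2.86, the null is not rejected, which is what leads to the follow-up test on the differenced series.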