Final Exam
Stat 131A
Spring 2020
NAME:
Directions – PLEASE READ CAREFULLY
• The exam is due on Gradescope Thursday night at 11:59pm.
• You must show your work (where appropriate).
• You may refer to your notes or online notes from this class. (It is technically allowed
to read other materials online, but I very much doubt that will help you at all).
• You may NOT discuss the exam questions with anyone else, whether they are in the
class or out of the class. You may not post questions on message boards relating to
problems on this exam. Any violations of this policy will result in a zero for the exam.
• If you have any clarification questions, please send an email to Will and George, NOT
Piazza.
• DO NOT post on Piazza about exam questions.
1. Answer the following TRUE/FALSE questions. Briefly explain your reasoning.
(a) (5 points) Because a regression coefficient measures the effect of a predictor vari-
able on the response variable when we hold everything else constant, we can
interpret the coefficient as a causal effect of that predictor on the response.
(b) (5 points) Suppose we have a data set with p variables, x(1) up to x(p), and we
calculate the loadings a1, . . . , ap for the first principal component. Then, suppose
we center and scale each of the variables to get new y variables with mean 0 and
standard deviation 1. That is, the value of variable y(j) for the ith observation is
y(j)_i = (x(j)_i − mean(x(j))) / sd(x(j))
True or false: If we calculate new first principal component loadings b1, . . . , bp on
the new data set with the y variables, then the new loadings might be different
from the old loadings.
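(For reference, this centering and scaling is what the scale function in R does by default; a minimal sketch, assuming the x variables are the columns of a data frame called dat:)

# center each column to mean 0 and rescale it to standard deviation 1
dat_scaled <- scale(dat)

# first principal component loadings before and after rescaling
prcomp(dat)$rotation[, 1]
prcomp(dat_scaled)$rotation[, 1]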
(c) (5 points) If somehow we knew the true values of β in a linear regression, then
those values would give us an even smaller RSS than the estimated coefficients β̂
that we get from lm.
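(Recall that RSS is the residual sum of squares: RSS = Σ_i (y_i − ŷ_i)^2, where ŷ_i is the fitted value for the ith observation.)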
(d) (5 points) Suppose we fit a regression of response y on variables x(1) and x(2),
using the lm function. We get β̂2 = 5 with a p-value in the regression table equal
to 0.04. But after we add a third variable x(3) to the regression, the estimate
changes to β̂2 = −15, and the p-value in the new regression table is now 10^(−12).
True or false: This means that the first estimate almost certainly had the wrong
sign, and the p-value must have been small the first time just by chance.
(e) (5 points) In part d, it is impossible for the first model (with predictor variables
x(1) and x(2)) to have a strictly lower RSS than the second model (with
predictor variables x(1), x(2), and x(3)).
(f) (5 points) In agglomerative hierarchical clustering, if we use the "single linkage"
method to measure distances between clusters, then we will sometimes see a "rich
get richer" effect in intermediate steps of the algorithm, where there are a few
very large clusters and many much smaller clusters.
2. Bar plot for two categorical variables
The bar plot below shows the joint distribution of two categorical variables measured
in the Wellbeing survey that we discussed in class. The participants in the survey
were asked to report their General Satisfaction and their Job Satisfaction, and their
responses were recorded.
There are four levels of Job Satisfaction (Very satisfied, Moderately satisfied, A little
dissatisfied, and Very dissatisfied), and three levels of General Satisfaction (Very happy,
Pretty happy, and Not too happy).
[Bar plot: Frequency (vertical axis, 0 to 2500) of responses for each Job Satisfaction level (Very satisfied, Moderately satisfied, A little dissatisfied, Very dissatisfied), with a separate bar within each group for each General Satisfaction level (Very happy, Pretty happy, Not too happy).]
Note: to make the question simpler, I have removed categories like "Don't know"
and "Not applicable" from the data set; you should answer the question as if these
categories never existed and the data shown here represent the entire data set.
Answer the following questions and explain how you can tell. You can assume the
data set is large enough so that the observed probabilities in the figure are very close
to the probabilities in the overall population.
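(Recall that a conditional probability P(A | B) equals P(A and B) / P(B): among the observations in category B, it is the proportion that are also in category A.)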
(a) (5 points) Overall, what is the most common response for General Satisfaction?
(b) (5 points) Which is higher: (A) the conditional probability of being Pretty happy
in general given that someone is Very satisfied with their job, or (B) the condi-
tional probability of being Pretty happy in general given that someone is Moder-
ately satisfied with their job?
(c) (5 points) Given that someone is Not too happy in general, which response were
they most likely to give about their Job Satisfaction?
3. Hierarchical clustering
The plot below shows a data set with p = 2 variables and n = 7 data points. The
seven data points are labeled A,B,C,D,E,F,G.
[Scatter plot of the seven labeled points A, B, C, D, E, F, G, with x(1) on the horizontal axis and x(2) on the vertical axis.]
The Euclidean distances between pairs of points are shown in the matrix below. For
example, the value corresponding to row "C" and column "D" gives the Euclidean
distance between the points C and D.
## A B C D E F G
## A 0.0 0.2 0.9 2.3 2.1 2.6 4.1
## B 0.2 0.0 0.7 2.1 2.1 2.5 3.9
## C 0.9 0.7 0.0 1.4 1.7 2.2 3.2
## D 2.3 2.1 1.4 0.0 2.0 2.3 1.8
## E 2.1 2.1 1.7 2.0 0.0 0.4 3.2
## F 2.6 2.5 2.2 2.3 0.4 0.0 3.3
## G 4.1 3.9 3.2 1.8 3.2 3.3 0.0
For the following questions, assume (except where expressly stated otherwise) that we
are using agglomerative hierarchical clustering with Euclidean distances and complete
linkage.
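(For reference, this corresponds to the following R call, assuming the distance matrix shown above is stored as a matrix D with row and column names A through G:)

# agglomerative hierarchical clustering with complete linkage;
# plot(hc) would draw the dendrogram shown in part (c)
hc <- hclust(as.dist(D), method = "complete")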
(a) (10 points) What two "clusters" will the algorithm join in each of the first four
steps, and what will the algorithm use as the distance between each pair of clus-
ters? (You only need to give the distance between the two "clusters" that were
joined in each step).
(b) (10 points) Would any part of your answer to part a be different if we were using
single linkage clustering instead?
(c) (5 points) The dendrogram for the complete linkage clustering is shown below,
with the seven "leaves" (dangling ends) of the tree unlabeled:
[Cluster Dendrogram produced by hclust (*, "complete"), with Height on the vertical axis ranging from 0 to 4 and the seven unlabeled leaves along the bottom.]
Write the letters A to G below the leaves to give a valid labeling that corresponds
to the complete linkage clustering (note there is more than one possible labeling,
you only need to do one of them).
4. Consider the Ames Housing Dataset that was introduced in class. This dataset contains
information on sales of houses in Ames, Iowa. The original dataset has been made
smaller to make the analysis easier. The variables in the dataset are:
• Lot.Area: Lot Size (Land Area) in square feet.
• Total.Bsmt.SF: Total square feet of basement area.
• Gr.Liv.Area: Total living area in square feet.
• Garage.Cars: Size of garage in terms of car capacity.
• Fireplaces.YN: a yes (Y) or no (N) variable that indicates whether the house
has a fireplace or not.
• Year.Built: the year in which the house was built.
• SalePrice: the sale price of the house in dollars.
There are n = 1314 observations in the dataset and some observations are listed below:
head(ames)
## Lot.Area Total.Bsmt.SF Gr.Liv.Area Garage.Cars Fireplaces.YN
## 1 11622 882 896 1 N
## 2 14267 1329 1329 1 N
## 3 4920 1338 1338 2 N
## 4 5005 1280 1280 2 N
## 5 7980 1168 1187 2 N
## 6 8402 789 1465 2 Y
## Year.Built SalePrice
## 1 1961 105000
## 2 1958 172000
## 3 2001 213500
## 4 1992 191500
## 5 1992 185000
## 6 1998 180400
(a) (5 points) I fit a regression equation with SalePrice as the response variable in
terms of all the other variables (note that Fireplaces.YN is treated as a factor).
This gave me the following:
m1 <- lm(SalePrice ~ ., data = ames)
m1
##
## Call:
## lm(formula = SalePrice ~ ., data = ames)
##
## Coefficients:
## (Intercept) Lot.Area Total.Bsmt.SF Gr.Liv.Area
## -9.795e+05 8.489e-01 3.230e+01 5.130e+01
## Garage.Cars Fireplaces.YNY Year.Built
## 9.694e+03 8.768e+03 5.125e+02
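(Written out, the fitted equation is: predicted SalePrice = −979,500 + 0.8489 × Lot.Area + 32.30 × Total.Bsmt.SF + 51.30 × Gr.Liv.Area + 9,694 × Garage.Cars + 8,768 × Fireplaces.YNY + 512.5 × Year.Built, where Fireplaces.YNY is 1 if the house has a fireplace and 0 otherwise.)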
What does this regression equation say about the change in our prediction for
SalePrice for a 100 square ft. increase in Gr.Liv.Area provided the other vari-
ables remain unchanged?
(b) (5 points) According to the regression equation m1, find the predicted SalePrice
for a house with a lot area of 10000 square feet, total basement square footage
of 1000 square feet, 1000 square feet of total living area, having a fireplace and a
garage that can hold two cars and which was built in the year 2000.
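(For reference, this prediction could also be computed in R with the predict function; a minimal sketch, supplying the house described above as a one-row data frame:)

# the new house's predictor values
new_house <- data.frame(Lot.Area = 10000, Total.Bsmt.SF = 1000, Gr.Liv.Area = 1000,
                        Garage.Cars = 2, Fireplaces.YN = "Y", Year.Built = 2000)

# predicted SalePrice under model m1
predict(m1, newdata = new_house)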
(c) (5 points) My friend who has a lot of experience with the Ames real estate market
suggests to me that I should add an interaction between Fireplaces.YN and
Year.Built, because she thinks that fireplaces used to be more common and now
they are a luxury. I fit a new regression equation with an interaction term added:
m2 <- lm(SalePrice ~ . + Year.Built:Fireplaces.YN, data = ames)
m2
##
## Call:
## lm(formula = SalePrice ~ . + Year.Built:Fireplaces.YN, data = ames)
##
## Coefficients:
## (Intercept) Lot.Area
## -9.009e+05 8.571e-01
## Total.Bsmt.SF Gr.Liv.Area
## 3.248e+01 5.078e+01
## Garage.Cars Fireplaces.YNY
## 9.579e+03 -2.351e+05
## Year.Built Fireplaces.YNY:Year.Built
## 4.727e+02 1.240e+02
The estimate of the interaction term is 124. What is the interpretation of that
number in the regression equation?
(d) (5 points) For the house in part b, assume there is a newer house which is exactly
the same except that it was built 5 years later, in 2005. Compared to the older
house, how much more would we predict the newer house to sell for, if we use
model m2 to make our predictions?
(e) (5 points) Assume that we make the usual assumptions to justify parametric in-
ference in regression. Give a parametric 95% confidence interval for the coefficient
of the interaction, based on the summary table below (you may use 2 for the tα/2
value). Based on the result, can we feel confident that the interaction term is
helping us make better predictions?
summary(m2)
##
## Call:
## lm(formula = SalePrice ~ . + Year.Built:Fireplaces.YN, data = ames)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87632 -11006 16 10112 94784
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.009e+05 4.994e+04 -18.039 < 2e-16 ***
## Lot.Area 8.571e-01 1.428e-01 6.004 2.49e-09 ***
## Total.Bsmt.SF 3.248e+01 2.154e+00 15.081 < 2e-16 ***
## Gr.Liv.Area 5.078e+01 2.643e+00 19.216 < 2e-16 ***
## Garage.Cars 9.579e+03 9.098e+02 10.529 < 2e-16 ***
## Fireplaces.YNY -2.351e+05 7.959e+04 -2.954 0.00320 **
## Year.Built 4.727e+02 2.575e+01 18.359 < 2e-16 ***
## Fireplaces.YNY:Year.Built 1.240e+02 4.048e+01 3.064 0.00223 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18300 on 1306 degrees of freedom
## Multiple R-squared: 0.7459,Adjusted R-squared: 0.7445
## F-statistic: 547.6 on 7 and 1306 DF, p-value: < 2.2e-16
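(Recall that the parametric confidence interval for a coefficient has the form estimate ± tα/2 × standard error, using the Estimate and Std. Error columns of the table above.)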
(f) (5 points) My friend gets very excited and wants to put in more interaction terms
with Fireplaces.YN. After adding interactions with two more variables, we get
the model:
m3 <- lm(SalePrice ~ . + Year.Built:Fireplaces.YN + Fireplaces.YN:Lot.Area
+ Fireplaces.YN:Gr.Liv.Area,
data = ames)
She points out that the R² value has improved, so the new model is a better fit:
summary(m3)$r.squared
## [1] 0.7459505
summary(m2)$r.squared
## [1] 0.7458855
She says this means the new interaction terms must be making the model better
so we should keep them in. Do you agree?
(g) (5 points) My friend wants to find more support for her theory that we need
some more interactions with Fireplaces.YN, so she looks for the best model
with Year.Built:Fireplaces.YN plus two additional interactions. That is, she
always keeps the interaction with Year.Built in the model, and tries out every
single way to add two more interactions with Fireplaces.YN, trying to make the
R² as big as possible.
She finds that the best two variables to interact with Fireplaces.YN are Lot.Area
and Total.Bsmt.SF, and she wants to show that this model makes better out-
of-sample predictions. She uses cross-validation to compare the model she chose
(with three interactions) to the model m2 (with only one interaction), and she
finds that her model has slightly better prediction error, as measured by cross-
validation. Does this show her model is better?
(h) (5 points, extra credit) I now fit a regression equation to the same dataset with the
logarithm of SalePrice as the response variable (I left the explanatory variables
unchanged). This gave me the following:
m4 <- lm(log(SalePrice) ~ ., data = ames)
m4
##
## Call:
## lm(formula = log(SalePrice) ~ ., data = ames)
##
## Coefficients:
## (Intercept) Lot.Area Total.Bsmt.SF Gr.Liv.Area
## 3.641e+00 7.349e-06 2.146e-04 3.842e-04
## Garage.Cars Fireplaces.YNY Year.Built
## 7.578e-02 5.699e-02 3.739e-03
Assume that our predicted SalePrice for a certain house with no fireplace, using
model m4, is $150,000. What would the predicted SalePrice become if we add a
fireplace to the house and leave all of the other variables unchanged?
(i) (5 points) We decide to investigate a bit more how fireplaces relate to some of the
other variables in the data set, so we create a new binary outcome variable called
Fireplaces.10, which is 1 if Fireplaces.YN is Y and 0 otherwise. We estimate
a logistic regression using the glm function in R, and we get the following model:
ames$Fireplaces.10 <- ifelse(ames$Fireplaces.YN == "Y", 1, 0)
m5 <- glm(Fireplaces.10 ~ Year.Built + SalePrice, data = ames,
family = binomial)
m5
##
## Call: glm(formula = Fireplaces.10 ~ Year.Built + SalePrice, family = binomial,
## data = ames)
##
## Coefficients:
## (Intercept) Year.Built SalePrice
## 2.942e+01 -1.784e-02 3.535e-05
##
## Degrees of Freedom: 1313 Total (i.e. Null); 1311 Residual
## Null Deviance: 1729
## Residual Deviance: 1493 AIC: 1499
According to model m5, how likely is it that a house built in 1950, whose current
SalePrice is $100,000, would have a fireplace? What about a house built in
2000? Give your answers as probabilities.
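(For reference: a logistic regression models the log-odds of the outcome, so a fitted probability comes from applying the inverse logit to the linear predictor, e.g. with plogis in R. A minimal sketch, assuming a data frame new_houses holding the predictor values of interest:)

# predicted probability of having a fireplace for each row of new_houses
predict(m5, newdata = new_houses, type = "response")

# equivalently, apply the inverse logit to the linear predictor
plogis(predict(m5, newdata = new_houses))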