Final Exam 
Stat 131A 
Spring 2020 
NAME: 
Directions – PLEASE READ CAREFULLY 
• The exam is due on Gradescope Thursday night at 11:59pm. 
• You must show your work (where appropriate). 
• You may refer to your notes or online notes from this class. (It is technically allowed 
to read other materials online, but I very much doubt that will help you at all). 
• You may NOT discuss the exam questions with anyone else, whether they are in the 
class or out of the class. You may not post questions on message boards relating to 
problems on this exam. Any violations of this policy will result in a zero for the exam. 
• If you have any clarification questions, please send an email to Will and George, NOT 
Piazza. 
• DO NOT post on Piazza about exam questions. 
1. Answer the following TRUE/FALSE questions. Briefly explain your reasoning. 
(a) (5 points) Because a regression coefficient measures the effect of a predictor vari- 
able on the response variable when we hold everything else constant, we can 
interpret the coefficient as a causal effect of that predictor on the response. 
(b) (5 points) Suppose we have a data set with p variables, x(1) up to x(p), and we 
calculate the loadings a1, . . . , ap for the first principal component. Then, suppose 
we center and scale each of the variables to get new y variables with mean 0 and 
standard deviation 1. That is, the value of variable y(j) for the ith observation is 
y(j)_i = ( x(j)_i − mean(x(j)) ) / sd(x(j))
True or false: If we calculate new first principal component loadings b1, . . . , bp on 
the new data set with the y variables, then the new loadings might be different 
from the old loadings. 
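For reference, the comparison described here can be carried out with prcomp in R; a minimal sketch on made-up data (not the exam data):

# Illustration only: compare first-PC loadings before and after
# centering and scaling each variable (dat is made-up data).
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = 100 * rnorm(100))
pc_raw <- prcomp(dat, center = TRUE, scale. = FALSE)
pc_scaled <- prcomp(dat, center = TRUE, scale. = TRUE)
pc_raw$rotation[, 1]    # loadings a1, ..., ap on the original variables
pc_scaled$rotation[, 1] # loadings b1, ..., bp on the standardized variables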
(c) (5 points) If somehow we knew the true values of β in a linear regression, then
those values would give us an even smaller RSS than the estimated coefficients β̂
that we get from lm.
(d) (5 points) Suppose we fit a regression of response y on variables x(1) and x(2),
using the lm function. We get β̂2 = 5 with a p-value in the regression table equal
to 0.04. But after we add a third variable x(3) to the regression, the estimate
changes to β̂2 = −15, and the p-value in the new regression table is now 10⁻¹².
True or false: This means that the first estimate almost certainly had the wrong 
sign, and the p-value must have been small the first time just by chance. 
(e) (5 points) In part d, it is impossible for the first model (with predictor variables
x(1) and x(2)) to have a strictly lower RSS than the second model (with
predictor variables x(1), x(2), and x(3)).
(f) (5 points) In agglomerative hierarchical clustering, if we use the "single linkage"
method to measure distances between clusters, then we will sometimes see a "rich
get richer" effect in intermediate steps of the algorithm, where there are a few
very large clusters and many much smaller clusters.
2. Bar plot for two categorical variables 
The bar plot below shows the joint distribution of two categorical variables measured 
in the Wellbeing survey that we discussed in class. The participants in the survey 
were asked to report their General Satisfaction and their Job Satisfaction, and their 
responses were recorded. 
There are four levels of Job Satisfaction (Very satisfied, Moderately satisfied, A little 
dissatisfied, and Very dissatisfied), and three levels of General Satisfaction (Very happy, 
Pretty happy, and Not too happy). 
[Bar plot of the joint distribution: Frequency (0 to 2500) for each combination of Job Satisfaction (Very satisfied, Moderately satisfied, A little dissatisfied, Very dissatisfied) and General Satisfaction (Very happy, Pretty happy, Not too happy).]
Note: to make the question simpler, I have removed categories like "Don't know"
and "Not applicable" from the data set; you should answer the question as if these
categories never existed and the data shown here represent the entire data set.
Answer the following questions and explain how you can tell. You can assume the 
data set is large enough so that the observed probabilities in the figure are very close 
to the probabilities in the overall population. 
(a) (5 points) Overall, what is the most common response for General Satisfaction? 
(b) (5 points) Which is higher: (A) the conditional probability of being Pretty happy 
in general given that someone is Very satisfied with their job, or (B) the condi- 
tional probability of being Pretty happy in general given that someone is Moder- 
ately satisfied with their job? 
(c) (5 points) Given that someone is Not too happy in general, which response were 
they most likely to give about their Job Satisfaction? 
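For reference, marginal and conditional probabilities like the ones asked about above can be read off a two-way table of counts in R; a minimal sketch with made-up counts (not the survey data):

# Made-up counts: rows = General Satisfaction, columns = Job Satisfaction
tab <- matrix(c(800, 600, 60,    # column: Very satisfied (Job)
                1200, 1800, 300, # column: Moderately satisfied
                150, 400, 200,   # column: A little dissatisfied
                20, 80, 120),    # column: Very dissatisfied
              nrow = 3,
              dimnames = list(General = c("Very happy", "Pretty happy", "Not too happy"),
                              Job = c("Very sat", "Mod sat", "Little dissat", "Very dissat")))
rowSums(tab) / sum(tab)     # marginal distribution of General Satisfaction
prop.table(tab, margin = 2) # conditional distribution of General given Job
prop.table(tab, margin = 1) # conditional distribution of Job given General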
3. Hierarchical clustering 
The plot below shows a data set with p = 2 variables and n = 7 data points. The 
seven data points are labeled A,B,C,D,E,F,G. 
[Scatter plot of the seven labeled points A–G plotted on variables x(1) (horizontal axis) and x(2) (vertical axis).]
The Euclidean distances between pairs of points are shown in the matrix below. For
example, the value corresponding to row "C" and column "D" gives the Euclidean
distance between the points C and D.
## A B C D E F G 
## A 0.0 0.2 0.9 2.3 2.1 2.6 4.1 
## B 0.2 0.0 0.7 2.1 2.1 2.5 3.9 
## C 0.9 0.7 0.0 1.4 1.7 2.2 3.2 
## D 2.3 2.1 1.4 0.0 2.0 2.3 1.8 
## E 2.1 2.1 1.7 2.0 0.0 0.4 3.2 
## F 2.6 2.5 2.2 2.3 0.4 0.0 3.3 
## G 4.1 3.9 3.2 1.8 3.2 3.3 0.0 
For the following questions, assume (except where expressly stated otherwise) that we 
are using agglomerative hierarchical clustering with Euclidean distances and complete 
linkage. 
(a) (10 points) What two "clusters" will the algorithm join in each of the first four
steps, and what will the algorithm use as the distance between each pair of clus-
ters? (You only need to give the distance between the two "clusters" that were
joined in each step).
(b) (10 points) Would any part of your answer to part a be different if we were using 
single linkage clustering instead? 
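For reference, the answers to parts (a) and (b) can be checked by running hclust directly on the distance matrix shown above; a minimal sketch:

# Distance matrix from above, entered by hand
d <- matrix(c(0.0, 0.2, 0.9, 2.3, 2.1, 2.6, 4.1,
              0.2, 0.0, 0.7, 2.1, 2.1, 2.5, 3.9,
              0.9, 0.7, 0.0, 1.4, 1.7, 2.2, 3.2,
              2.3, 2.1, 1.4, 0.0, 2.0, 2.3, 1.8,
              2.1, 2.1, 1.7, 2.0, 0.0, 0.4, 3.2,
              2.6, 2.5, 2.2, 2.3, 0.4, 0.0, 3.3,
              4.1, 3.9, 3.2, 1.8, 3.2, 3.3, 0.0),
            nrow = 7, dimnames = list(LETTERS[1:7], LETTERS[1:7]))
hc_complete <- hclust(as.dist(d), method = "complete")
hc_single <- hclust(as.dist(d), method = "single")
hc_complete$merge  # which clusters are joined at each step
hc_complete$height # the distance used for each join
plot(hc_complete)  # dendrogram like the one shown in part (c)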
(c) (5 points) The dendrogram for the complete linkage clustering is shown below,
with the seven "leaves" (dangling ends) of the tree unlabeled:
[Cluster Dendrogram from hclust (*, "complete"): Height axis from 0 to 4, with the seven leaves unlabeled.]
Write the letters A to G below the leaves to give a valid labeling that corresponds
to the complete linkage clustering (note there is more than one possible labeling;
you only need to give one of them).
4. Consider the Ames Housing Dataset that was introduced in class. This dataset contains 
information on sales of houses in Ames, Iowa. The original dataset has been made 
smaller to make the analysis easier. The variables in the dataset are: 
• Lot.Area: Lot Size (Land Area) in square feet. 
• Total.Bsmt.SF: Total square feet of basement area. 
• Gr.Liv.Area: Total living area in square feet. 
• Garage.Cars: Size of garage in terms of car capacity. 
• Fireplaces.YN: a yes (Y) or no (N) variable that indicates whether the house 
has a fireplace or not. 
• Year.Built: the year in which the house was built. 
• SalePrice: the sale price of the house in dollars. 
There are n = 1314 observations in the dataset and some observations are listed below: 
head(ames) 
## Lot.Area Total.Bsmt.SF Gr.Liv.Area Garage.Cars Fireplaces.YN 
## 1 11622 882 896 1 N 
## 2 14267 1329 1329 1 N 
## 3 4920 1338 1338 2 N 
## 4 5005 1280 1280 2 N 
## 5 7980 1168 1187 2 N 
## 6 8402 789 1465 2 Y 
## Year.Built SalePrice 
## 1 1961 105000 
## 2 1958 172000 
## 3 2001 213500 
## 4 1992 191500 
## 5 1992 185000 
## 6 1998 180400 
(a) (5 points) I fit a regression equation with SalePrice as the response variable in
terms of all the other variables (note that Fireplaces.YN is treated as a factor).
This gave me the following:
m1 <- lm(SalePrice ~ ., data = ames) 
m1 
## 
## Call: 
## lm(formula = SalePrice ~ ., data = ames) 
## 
## Coefficients: 
## (Intercept) Lot.Area Total.Bsmt.SF Gr.Liv.Area 
## -9.795e+05 8.489e-01 3.230e+01 5.130e+01 
## Garage.Cars Fireplaces.YNY Year.Built 
## 9.694e+03 8.768e+03 5.125e+02 
What does this regression equation say about the change in our prediction for 
SalePrice for a 100 square ft. increase in Gr.Liv.Area provided the other vari- 
ables remain unchanged? 
(b) (5 points) According to the regression equation m1, find the predicted SalePrice
for a house with a lot area of 10000 square feet, total basement square footage
of 1000 square feet, 1000 square feet of total living area, a fireplace, a garage
that can hold two cars, and which was built in the year 2000.
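For reference, a prediction like the one in part (b) can be computed with predict or directly from the printed coefficients; a minimal sketch, assuming the fitted object m1 above is available:

# The house described in part (b)
newhouse <- data.frame(Lot.Area = 10000, Total.Bsmt.SF = 1000,
                       Gr.Liv.Area = 1000, Garage.Cars = 2,
                       Fireplaces.YN = "Y", Year.Built = 2000)
predict(m1, newdata = newhouse)
# Equivalently, by hand from the printed coefficients:
-9.795e+05 + 8.489e-01 * 10000 + 3.230e+01 * 1000 + 5.130e+01 * 1000 +
  9.694e+03 * 2 + 8.768e+03 * 1 + 5.125e+02 * 2000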
(c) (5 points) My friend who has a lot of experience with the Ames real estate market 
suggests to me that I should add an interaction between Fireplaces.YN and 
Year.Built, because she thinks that fireplaces used to be more common and now 
they are a luxury. I fit a new regression equation with an interaction term added: 
m2 <- lm(SalePrice ~ . + Year.Built:Fireplaces.YN, data = ames) 
m2 
## 
## Call: 
## lm(formula = SalePrice ~ . + Year.Built:Fireplaces.YN, data = ames) 
## 
## Coefficients: 
## (Intercept) Lot.Area 
## -9.009e+05 8.571e-01 
## Total.Bsmt.SF Gr.Liv.Area 
## 3.248e+01 5.078e+01 
## Garage.Cars Fireplaces.YNY 
## 9.579e+03 -2.351e+05 
## Year.Built Fireplaces.YNY:Year.Built 
## 4.727e+02 1.240e+02 
The estimate of the interaction term is 124. What is the interpretation of that 
number in the regression equation? 
(d) (5 points) For the house in part b, assume there is a newer house which is exactly 
the same except that it was built 5 years later, in 2005. Compared to the older 
house, how much more would we predict the newer house to sell for, if we use 
model m2 to make our predictions? 
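For reference, a comparison like the one in part (d) can be made by predicting for both houses with m2; a minimal sketch, assuming m2 is available:

# Same house as in part (b), built in 2000 versus 2005
house_2000 <- data.frame(Lot.Area = 10000, Total.Bsmt.SF = 1000,
                         Gr.Liv.Area = 1000, Garage.Cars = 2,
                         Fireplaces.YN = "Y", Year.Built = 2000)
house_2005 <- transform(house_2000, Year.Built = 2005)
predict(m2, newdata = house_2005) - predict(m2, newdata = house_2000)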
(e) (5 points) Assume that we make the usual assumptions to justify parametric in- 
ference in regression. Give a parametric 95% confidence interval for the coefficient 
of the interaction, based on the summary table below (you may use 2 for the tα/2 
value). Based on the result, can we feel confident that the interaction term is 
helping us make better predictions? 
summary(m2) 
## 
## Call: 
## lm(formula = SalePrice ~ . + Year.Built:Fireplaces.YN, data = ames) 
## 
## Residuals: 
## Min 1Q Median 3Q Max 
## -87632 -11006 16 10112 94784 
## 
## Coefficients: 
## Estimate Std. Error t value Pr(>|t|) 
## (Intercept) -9.009e+05 4.994e+04 -18.039 < 2e-16 *** 
## Lot.Area 8.571e-01 1.428e-01 6.004 2.49e-09 *** 
## Total.Bsmt.SF 3.248e+01 2.154e+00 15.081 < 2e-16 *** 
## Gr.Liv.Area 5.078e+01 2.643e+00 19.216 < 2e-16 *** 
## Garage.Cars 9.579e+03 9.098e+02 10.529 < 2e-16 *** 
## Fireplaces.YNY -2.351e+05 7.959e+04 -2.954 0.00320 ** 
## Year.Built 4.727e+02 2.575e+01 18.359 < 2e-16 *** 
## Fireplaces.YNY:Year.Built 1.240e+02 4.048e+01 3.064 0.00223 ** 
## --- 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 18300 on 1306 degrees of freedom 
## Multiple R-squared: 0.7459, Adjusted R-squared: 0.7445
## F-statistic: 547.6 on 7 and 1306 DF, p-value: < 2.2e-16 
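For reference, the interval can be computed from the interaction row of the summary table; a minimal sketch, using 2 in place of the exact t quantile as the question allows:

est <- 1.240e+02 # estimate for Fireplaces.YNY:Year.Built
se <- 4.048e+01  # its standard error
c(lower = est - 2 * se, upper = est + 2 * se)
# The exact parametric interval could also be obtained with confint(m2)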
(f) (5 points) My friend gets very excited and wants to put in more interaction terms 
with Fireplaces.YN. After adding interactions with two more variables, we get 
the model: 
m3 <- lm(SalePrice ~ . + Year.Built:Fireplaces.YN + Fireplaces.YN:Lot.Area
           + Fireplaces.YN:Gr.Liv.Area,
         data = ames)
She points out that the R² value has improved, so the new model is a better fit:
summary(m3)$r.squared 
## [1] 0.7459505 
summary(m2)$r.squared 
## [1] 0.7458855 
She says this means the new interaction terms must be making the model better 
so we should keep them in. Do you agree? 
(g) (5 points) My friend wants to find more support for her theory that we need 
some more interactions with Fireplaces.YN, so she looks for the best model 
with Year.Built:Fireplaces.YN plus two additional interactions. That is, she 
always keeps the interaction with Year.Built in the model, and tries out every 
single way to add two more interactions with Fireplaces.YN, trying to make the 
R² as big as possible.
She finds that the best two variables to interact with Fireplaces.YN are Lot.Area 
and Total.Bsmt.SF, and she wants to show that this model makes better out- 
of-sample predictions. She uses cross-validation to compare the model she chose 
(with three interactions) to the model m2 (with only one interaction), and she 
finds that her model has slightly better prediction error, as measured by cross- 
validation. Does this show her model is better? 
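For reference (this is not necessarily how she ran her comparison), a minimal sketch of one way to do a k-fold cross-validation comparison of the two models in R:

# Illustrative 5-fold cross-validation comparing two model formulas
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(ames)))
cv_rmse <- function(formula) {
  mse <- sapply(1:k, function(i) {
    fit <- lm(formula, data = ames[folds != i, ])
    pred <- predict(fit, newdata = ames[folds == i, ])
    mean((ames$SalePrice[folds == i] - pred)^2)
  })
  sqrt(mean(mse))
}
cv_rmse(SalePrice ~ . + Year.Built:Fireplaces.YN)
cv_rmse(SalePrice ~ . + Year.Built:Fireplaces.YN + Fireplaces.YN:Lot.Area
        + Fireplaces.YN:Total.Bsmt.SF)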
(h) (5 points, extra credit) I now fit a regression equation to the same dataset with the 
logarithm of SalePrice as the response variable (I left the explanatory variables 
unchanged). This gave me the following: 
m4 <- lm(log(SalePrice) ~ ., data = ames) 
m4 
## 
## Call: 
## lm(formula = log(SalePrice) ~ ., data = ames) 
## 
## Coefficients: 
## (Intercept) Lot.Area Total.Bsmt.SF Gr.Liv.Area 
## 3.641e+00 7.349e-06 2.146e-04 3.842e-04 
## Garage.Cars Fireplaces.YNY Year.Built 
## 7.578e-02 5.699e-02 3.739e-03 
Assume that our predicted SalePrice for a certain house with no fireplace, using 
model m4, is $150,000. What would the predicted SalePrice become if we add a 
fireplace to the house and leave all of the other variables unchanged? 
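For reference, model m4 predicts log(SalePrice), so a prediction on the log scale is converted back to dollars with exp; a minimal sketch of the kind of computation involved, using the printed Fireplaces.YNY coefficient:

# Adding a fireplace adds the Fireplaces.YNY coefficient on the log scale
log_pred_no_fireplace <- log(150000) # given prediction, on the log scale
log_pred_fireplace <- log_pred_no_fireplace + 5.699e-02
exp(log_pred_fireplace) # predicted SalePrice with a fireplace, in dollars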
(i) (5 points) We decide to investigate a bit more how fireplaces relate to some of the 
other variables in the data set, so we create a new binary outcome variable called 
Fireplaces.10, which is 1 if Fireplaces.YN is Y and 0 otherwise. We estimate 
a logistic regression using the glm function in R, and we get the following model: 
ames$Fireplaces.10 <- ifelse(ames$Fireplaces.YN == "Y", 1, 0) 
m5 <- glm(Fireplaces.10 ~ Year.Built + SalePrice, data = ames,
          family = binomial)
m5 
## 
## Call: glm(formula = Fireplaces.10 ~ Year.Built + SalePrice, family = binomial, 
## data = ames) 
## 
## Coefficients: 
## (Intercept) Year.Built SalePrice 
## 2.942e+01 -1.784e-02 3.535e-05 
## 
## Degrees of Freedom: 1313 Total (i.e. Null); 1311 Residual 
## Null Deviance: 1729 
## Residual Deviance: 1493 AIC: 1499 
According to model m5, how likely is it that a house built in 1950, whose current 
SalePrice is $100,000, would have a fireplace? What about a house built in 
2000? Give your answers as probabilities.
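For reference, fitted probabilities from a logistic regression can be obtained with predict(..., type = "response") or by applying the inverse logit to the linear predictor; a minimal sketch, assuming m5 (or the printed coefficients) is available:

# Probability of a fireplace for the two houses described above
newhouses <- data.frame(Year.Built = c(1950, 2000), SalePrice = 100000)
predict(m5, newdata = newhouses, type = "response")
# Equivalently, by hand from the printed coefficients (plogis is the inverse logit):
plogis(2.942e+01 - 1.784e-02 * c(1950, 2000) + 3.535e-05 * 100000)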