CIS 520: Machine Learning Due: Apr 14, 2020 (10:29 AM) 
Problem Set 4 
Submission Instructions: 
Unless otherwise announced, all problem sets require submission of two components on Gradescope and 
Canvas: 
• PDF Submission: This is your main submission, and must contain your solutions to all problems that you are attempting, including both mathematical problems and programming problems. It must be submitted as a single PDF file to Gradescope and Canvas, compiled in LaTeX using the LaTeX template provided on Canvas. Each problem should be on a new page. Mathematical problems can be either typed in LaTeX or written by hand and scanned into images included in your LaTeX solution, but they must be readable by our staff to be graded; we recommend that you not take photos of your hand-written solutions. For programming problems, in addition to including any plots, observations, or explanations as requested, be sure to include a snippet of your code in your LaTeX solution; for this, we encourage you to use the mcode or listings packages for LaTeX.
Special instruction for submitting on Gradescope: For each problem, select the pages containing your 
solution for that problem. 
• Code Submission: Submit all your (Matlab) code in a single zip file under the problem set’s code 
submission component on Canvas. Submit all your (Matlab) code files (unzipped .m files) to the 
problem set’s code submission component on Gradescope. Note that these are in addition to writing 
your code inline in the PDF file above. 
Summary: The PDF file should be submitted to both Gradescope and Canvas. Matlab code files should 
be submitted to Canvas in a single zip file. Also, Matlab code files should be submitted to Gradescope 
unzipped. All components of the problem set must be received in the right places and in the right formats
before the submission deadline. Plan to submit early! 
Collaboration Policy: 
You are allowed to discuss problems in groups, but you must write all your solutions and code individually, 
on your own. All collaborators with whom problems were discussed must be listed in your PDF submission. 
1. (20 points) Principal Component Analysis. In this exercise, you are provided part of the MNIST 
digit data set, containing 3,600 images of handwritten digits (data is provided in X_train.mat). Each 
example is a 28 × 28 grayscale image, leading to 784 pixels. These images correspond to 3 different 
digits (’3’, ’6’, and ’9’), although you do not have the labels for them. You will perform PCA in an 
unsupervised manner. 
(a) To get familiar with what the data looks like, generate the last example (the 3,600th row of X_train) as an image. Provide your code and the resulting image. (Hint: You can reshape the data back into a 28 × 28 image using reshape in MATLAB, and you can use imagesc and colormap gray to see the picture. Make sure the picture faces the right way; it should look like a ’9’.)
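A minimal sketch of the hint above; the variable name X_train inside X_train.mat is an assumption:

% Sketch: display the 3,600th training example as a 28 x 28 image.
load('X_train.mat');              % assumed to contain a 3600 x 784 matrix named X_train
x = X_train(end, :);              % last example (the 3,600th row)
imagesc(reshape(x, 28, 28)');     % the transpose may be needed so the digit faces the right way
colormap gray; axis image; axis off;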
(b) Run PCA on the data. You can use the built-in pca function in MATLAB. (Reminder: Make sure 
you understand the MATLAB function you used, especially whether the function standardizes the 
data set before running PCA. Make sure that the data is mean-centered.) Just like in the previous 
part, generate the first principal component vector as an image. Repeat for the second and third. 
Do these images make sense? Explain. 
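A possible sketch for this part; note that MATLAB's pca (Statistics and Machine Learning Toolbox) mean-centers the data by default but does not rescale it to unit variance:

% Sketch: run PCA and view the leading principal components as images.
[coeff, score, latent] = pca(X_train);        % columns of coeff are the principal component directions
for k = 1:3
    subplot(1, 3, k);
    imagesc(reshape(coeff(:, k), 28, 28)');   % k-th principal component rendered as a 28 x 28 image
    colormap gray; axis image; axis off; title(sprintf('PC %d', k));
end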
(c) Create a 2D scatter plot showing all the data points projected in the first 2 PC dimensions, 
clearly labeling your axes. Make sure the data points are mean-centered before projecting onto 
the principal components. Now create a similar plot showing the same data points projected in 
the 100th and 101st PC dimensions. What do you observe? Can you explain why you might 
expect this? 
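A sketch of the projection step, reusing the coeff output from the pca call in the previous sketch:

% Sketch: scatter plot of the data projected onto a pair of PC directions.
Xc = X_train - mean(X_train, 1);              % mean-center before projecting
proj = Xc * coeff(:, [1 2]);                  % coordinates in the first two PC dimensions
plot(proj(:, 1), proj(:, 2), '.');
xlabel('PC 1'); ylabel('PC 2');
% for the second plot, project onto coeff(:, [100 101]) and relabel the axes accordingly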
(d) Graph the (in-sample) fractional reconstruction accuracy as a function of the number of principal 
components that are included. Also give a table listing the number of principal components 
needed to achieve each of 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10% reconstruction 
accuracy (i.e., to explain X% of the variance). 
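Since reconstruction accuracy here means the fraction of variance explained, the latent output of pca (the PC variances) is enough; a sketch, reusing latent from the earlier pca call:

% Sketch: cumulative fraction of variance explained vs. number of components.
frac = cumsum(latent) / sum(latent);
plot(frac);
xlabel('number of principal components'); ylabel('fractional reconstruction accuracy');
for thresh = 0.1:0.1:0.9
    k = find(frac >= thresh, 1);              % smallest number of PCs explaining >= thresh of the variance
    fprintf('%2.0f%%: %d components\n', 100 * thresh, k);
end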
(e) Using the numbers you found in the previous part, reconstruct the 1000th, 2000th, and 3000th 
examples using each of the different numbers of principal components. (Hint: We have provided a 
function plot_reconstruction.m for your convenience. Make sure you read the documentation 
within this function to understand its input arguments.) For instance, the original handout shows what the last example looks like when reconstructed with each of these numbers of principal components.
2. (20 points) EM Practice: Red and Blue Coins. Your friend has two coins: a red coin and a blue coin, with biases p_r and p_b, respectively (i.e. the red coin comes up heads with probability p_r, and the blue coin does so with probability p_b). She also has an inherent preference π for the red coin. She conducts a sequence of m coin tosses: for each toss, she first picks either the red coin with probability π or the blue coin with probability 1 − π, and then tosses the corresponding coin; the process for each toss is carried out independently of all other tosses. You don't know which coin was used on each toss; all you are told are the outcomes of the m tosses (heads or tails). In particular, for each toss i, define a random variable X_i as

X_i = 1 if the i-th toss results in heads, and 0 otherwise.

Then the data you see are the values x_1, . . . , x_m taken by these m random variables. Based on this data, you want to estimate the parameters θ = (π, p_r, p_b). To help with this, for each toss i, define a latent (unobserved) random variable Z_i as follows:

Z_i = 1 if the i-th toss used the red coin, and 0 otherwise.
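To make the generative process concrete, here is a short simulation sketch; the numerical parameter values are arbitrary placeholders, and only the sampling scheme comes from the description above:

% Sketch: simulate m tosses from the red/blue coin model with theta = (pi, p_r, p_b).
m = 1000;  piR = 0.3;  pr = 0.9;  pb = 0.2;    % placeholder parameter values
Z = rand(m, 1) < piR;                          % Z_i = 1: red coin picked (probability pi)
X = zeros(m, 1);
X(Z)  = rand(nnz(Z), 1)  < pr;                 % red tosses land heads with probability p_r
X(~Z) = rand(nnz(~Z), 1) < pb;                 % blue tosses land heads with probability p_b
% only x_1, ..., x_m (the entries of X) are observed; Z is latent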
(a) Let X be a random variable denoting the outcome of a coin toss according to the process described 
above, and let Z be the corresponding latent random variable indicating which coin was used, 
also as described above (both X and Z take values in {0, 1} as above). Write an expression for 
the joint distribution of X and Z. Give your answer in the form 
p(x, z; θ) = . 
(b) Write an expression for the complete-data log-likelihood, ln L_c(θ) = ∑_{i=1}^m ln p(x_i, z_i; θ).
(c) Suppose you knew the values z_i taken by the latent variables Z_i. What would be the maximum-likelihood parameter estimates θ̂? Give expressions for π̂, p̂_r, and p̂_b (in terms of x_i and z_i). Show your calculations.
(d) In the absence of knowledge of z_i, one possibility for estimating θ is to use the EM algorithm. Recall that the algorithm starts with some initial parameter estimates θ^0, and then on each iteration t, performs an E-step followed by an M-step. Let θ^t denote the parameter estimates at the start of iteration t. In the E-step, for each toss i, the algorithm requires computing the posterior distribution of the latent variable Z_i under the current parameters θ^t. Calculate the posterior probability P(Z_i = 1 | X_i = x_i; θ^t).
(e) For each toss i, denote the posterior probability computed in part (d) above by γ_i^t (so that γ_i^t = P(Z_i = 1 | X_i = x_i; θ^t)). Then the expected complete-data log-likelihood with respect to these posterior distributions is

∑_{i=1}^m ( γ_i^t · ln p(x_i, 1; θ) + (1 − γ_i^t) · ln p(x_i, 0; θ) ) .

The M-step of the EM algorithm requires finding parameters θ^{t+1} that maximize this expected complete-data log-likelihood. Determine the updated parameters θ^{t+1}. Give expressions for π^{t+1}, p_r^{t+1}, and p_b^{t+1} (in terms of x_i and γ_i^t). Show your calculations.
3. (25 points) Programming Exercise: Gaussian Mixture Models. Write a piece of MATLAB code 
to implement the EM algorithm for learning a Gaussian mixture model (GMM) given a data set X, 
where X is an m× d matrix (m instances, each of dimension d), and a target number of Gaussians K: 
your program should take the training data X and the target number of Gaussians K as input and 
should output estimated parameters of a mixture of K Gaussians (specifically, it should output mixing 
coefficients π̂ ∈ ∆_K, mean vectors µ̂_1, . . . , µ̂_K ∈ R^d, and covariance matrices Σ̂_1, . . . , Σ̂_K ∈ R^{d×d}).
For this problem, you are provided a 2-dimensional data set generated from a mixture of (3) Gaussians. 
The data set is divided into training and test sets; the training set has 480 data points and the test 
set has 120 data points. The goal is to learn a GMM from the training data; the quality of the learned 
GMM will be measured via its log-likelihood on the test data. 
EM initialization and stopping criterion: Initialize the mixing coefficients to a uniform distribution, and each of the covariance matrices to be the d × d identity matrix: π^0 = (1/K, . . . , 1/K)^⊤ and Σ_1^0 = . . . = Σ_K^0 = I_d. Initializations for the means will be provided to you (if you like, you can write your program to take the mean initialization as an additional input). For the stopping criterion, on each iteration t, keep track of the (incomplete-data) log-likelihood on the training data, ∑_{i=1}^m ln p(x_i; θ^t); stop when either the change in log-likelihood, ∑_{i=1}^m ln p(x_i; θ^{t+1}) − ∑_{i=1}^m ln p(x_i; θ^t), becomes smaller than 10^{−6}, or when the number of iterations reaches 1000 (whichever occurs earlier; here θ^t denotes the parameter estimates on iteration t).
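A minimal sketch of an EM loop using the initialization and stopping criterion described above; it assumes mvnpdf from the Statistics and Machine Learning Toolbox is available, and the function and variable names are illustrative only, not a reference solution:

function [pi_hat, mu, Sigma] = gmm_em(X, K, mu0)
% Sketch of EM for a K-component GMM on an m x d data matrix X;
% mu0 is a K x d matrix of initial means (provided separately).
[m, d] = size(X);
pi_hat = ones(1, K) / K;                          % uniform mixing coefficients
mu     = mu0;
Sigma  = repmat(eye(d), [1 1 K]);                 % each covariance initialized to I_d
llh_old = -inf;
for t = 1:1000
    % E-step: responsibilities gamma(i,k) proportional to pi_k * N(x_i; mu_k, Sigma_k)
    dens = zeros(m, K);
    for k = 1:K
        dens(:, k) = pi_hat(k) * mvnpdf(X, mu(k, :), Sigma(:, :, k));
    end
    llh = sum(log(sum(dens, 2)));                 % incomplete-data log-likelihood
    gamma = dens ./ sum(dens, 2);
    % M-step: update mixing coefficients, means, and covariances
    Nk = sum(gamma, 1);
    pi_hat = Nk / m;
    for k = 1:K
        mu(k, :) = (gamma(:, k)' * X) / Nk(k);
        Xc = X - mu(k, :);
        Sigma(:, :, k) = (Xc' * (Xc .* gamma(:, k))) / Nk(k);
    end
    if llh - llh_old < 1e-6, break; end           % stop when the change in log-likelihood is small
    llh_old = llh;
end
end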
(a) Known K. For this part, assume you are given K = 3. 
i. Learning curve. Use your implementation of EM for GMMs to learn a mixture of 3 Gaus- 
sians from increasing fractions of the training data (10% of the training data, then 20% of 
the training data, then 30% and so on up to 100%). You must use the subsets provided 
in the folder TrainSubsets; in each case, to initialize the means of the Gaussians, use the 
initializations provided in the folder MeanInitialization. In each case, calculate the nor- 
malized (incomplete-data) log-likelihood of the learned model on the training data, as well 
as the normalized log-likelihood on the test data (here normalizing means dividing the log- 
likelihood by the number of data points); you can use the provided file compute_llh.m to 
compute the log-likelihood. In a single plot, give curves showing the normalized train and 
test log-likelihoods (on the y-axis) as a function of the fraction of data used for training (on 
the x-axis). (Note: the training log-likelihood should be calculated only on the subset of 
examples used for training, not on all the training examples available in the given data set.) 
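One way to structure this loop is sketched below; the file names inside TrainSubsets and MeanInitialization, the variable names they contain, and the argument order of compute_llh are all assumptions that must be checked against the provided materials:

% Sketch of the learning-curve loop.  All file and variable names below, and
% the argument order of compute_llh, are hypothetical placeholders.
T = load('X_test.mat');  X_test = T.X_test;                         % hypothetical names
fracs = 0.1:0.1:1.0;
train_llh = zeros(size(fracs));  test_llh = zeros(size(fracs));
for j = 1:numel(fracs)
    S  = load(sprintf('TrainSubsets/X_train_%d.mat', j));           % hypothetical name
    M0 = load(sprintf('MeanInitialization/mu_init_%d.mat', j));     % hypothetical name
    [pi_hat, mu, Sigma] = gmm_em(S.X_train, 3, M0.mu);              % gmm_em: sketch above
    train_llh(j) = compute_llh(S.X_train, pi_hat, mu, Sigma) / size(S.X_train, 1);
    test_llh(j)  = compute_llh(X_test,    pi_hat, mu, Sigma) / size(X_test, 1);
end
plot(fracs, train_llh, '-o', fracs, test_llh, '-s');
xlabel('fraction of training data used'); ylabel('normalized log-likelihood');
legend('train', 'test');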
ii. Analysis of learned models. For each of the learned models (corresponding to the different 
subsets of the training data), show a plot of the Gaussians in the learned GMM overlaid on 
the test data. What do you observe? For the final model (learned from 100% of the training 
data), also write down all the parameters of the learned GMM, together with the (normalized) 
train and test log-likelihoods. 
Hints: 
i. You may use the function we provide in plot_contours.m to assist in plotting the density curves. Be sure to carefully read the function header, so you know how to structure your inputs. If you choose to do this, you may find it convenient to format the parameters output by your GMM function in the data structures expected by plot_contours.m.
(b) Unknown K. Now suppose you do not know the right number of Gaussians in the mixture 
model to be learned. Select the number of Gaussians K from the range {1, . . . , 5} using 5-fold 
cross-validation on the full training data (use the folds provided in the folder CrossValidation). 
For each K, to initialize the means of the K Gaussians, use the initializations provided in the 
folder MeanInitialization. In a single plot, draw 3 curves showing the (normalized) train, test, 
and cross-validation log-likelihoods (on the y-axis) as a function of number of Gaussians K (on 
the x-axis); again, you can use the provided file compute_llh.m to compute the normalized log- 
likelihood. Write down the chosen value of K (with the highest cross-validation log-likelihood). 
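A sketch of the model-selection loop; again, the fold file names, their contents, and the compute_llh signature are hypothetical placeholders to be replaced with the actual provided names:

% Sketch: choose K in {1,...,5} by 5-fold cross-validation on the training data.
Ks = 1:5;  cv_llh = zeros(size(Ks));
for K = Ks
    M0 = load(sprintf('MeanInitialization/mu_init_K%d.mat', K));   % hypothetical name
    fold_llh = zeros(1, 5);
    for f = 1:5
        F = load(sprintf('CrossValidation/fold_%d.mat', f));       % hypothetical name/contents
        [pi_hat, mu, Sigma] = gmm_em(F.X_train_fold, K, M0.mu);
        fold_llh(f) = compute_llh(F.X_val_fold, pi_hat, mu, Sigma) / size(F.X_val_fold, 1);
    end
    cv_llh(K) = mean(fold_llh);
end
[~, K_best] = max(cv_llh);    % K with the highest cross-validation log-likelihood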
4. (35 points) Hidden Markov Models. On any given day, Alice is in one of two states: happy or sad. 
You do not know her internal state, but get to observe her activities in the evening. Each evening, she 
either sings, goes for a walk, or watches TV. 
Alice’s state on any day is random. Her state Z1 on day 1 is equally likely to be happy or sad: 
P (Z1 = happy) = 1/2 . 
Given her state Zt on day t, her state Zt+1 on the next day is governed by the following probabilities 
(and is conditionally independent of her previous states and activities): 
P (Zt+1 = happy |Zt = happy) = 4/5 P (Zt+1 = happy |Zt = sad) = 1/2 . 
Alice’s activities are also random. Her activities vary based on her state; given her state Zt on day t, 
her activity Xt on that day is governed by the following probabilities (and is conditionally independent 
of everything else): 
P (Xt = sing |Zt = happy) = 5/10 P (Xt = sing |Zt = sad) = 1/10 
P (Xt = walk |Zt = happy) = 3/10 P (Xt = walk |Zt = sad) = 2/10 
P (Xt = TV |Zt = happy) = 2/10 P (Xt = TV |Zt = sad) = 7/10 . 
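These probabilities can be collected into matrices for use in part (b); the following is a direct transcription (state 1 = happy, state 2 = sad; observation 1 = sing, 2 = walk, 3 = TV):

init = [1/2; 1/2];               % P(Z_1 = happy), P(Z_1 = sad)
A = [4/5 1/5;                    % A(i,j) = P(Z_{t+1} = j | Z_t = i)
     1/2 1/2];
B = [5/10 3/10 2/10;             % B(i,k) = P(X_t = k | Z_t = i)
     1/10 2/10 7/10];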
(a) (15 points) Suppose you observe Alice singing on day 1 and watching TV on day 2, i.e. you 
observe x1:2 = (sing,TV). Find the joint probability of this observation sequence together with 
each possible hidden state sequence that could be associated with it, i.e. find the four probabilities 
below. Show your calculations. 
P(X1:2 = (sing, TV), Z1:2 = (happy, happy))
P(X1:2 = (sing, TV), Z1:2 = (happy, sad))
P(X1:2 = (sing, TV), Z1:2 = (sad, happy))
P(X1:2 = (sing, TV), Z1:2 = (sad, sad))
Based on these probabilities, what is the most likely hidden state sequence z1:2? What is the 
individually most likely hidden state on day 2? 
(b) (20 points) Write a small piece of Matlab code to implement the Viterbi algorithm, and use this 
to calculate both the most likely hidden state sequence z1:10 and the corresponding maximal joint 
probability p(x1:10, z1:10) for each of the following observation sequences: 
x1:10 = (sing,walk,TV, sing,walk,TV, sing,walk,TV, sing) 
x1:10 = (sing, sing, sing,TV,TV,TV,TV,TV,TV,TV) 
x1:10 = (TV,walk,TV, sing, sing, sing,walk,TV, sing, sing) 
In your code, rather than multiplying probabilities (which can lead to underflow), you may find 
it helpful to add their logarithms (and later exponentiate to obtain the final result). Include a 
snippet of your code in your LaTeX submission.
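A possible sketch of such a log-space Viterbi decoder, using the init, A, and B matrices defined above (this is one of many valid ways to organize the code, not a required structure):

function [z, p] = viterbi_log(obs, init, A, B)
% Log-space Viterbi: returns the most likely state sequence z and the joint
% probability p(x_{1:T}, z_{1:T}); obs is a vector of observation indices.
T = numel(obs);  S = numel(init);
delta = -inf(S, T);                              % max log joint probability ending in each state
psi   = zeros(S, T);                             % back-pointers
delta(:, 1) = log(init) + log(B(:, obs(1)));
for t = 2:T
    for s = 1:S
        [val, arg]  = max(delta(:, t-1) + log(A(:, s)));
        delta(s, t) = val + log(B(s, obs(t)));
        psi(s, t)   = arg;
    end
end
z = zeros(1, T);
[logp, z(T)] = max(delta(:, T));
for t = T-1:-1:1
    z(t) = psi(z(t+1), t+1);                     % trace back the optimal path
end
p = exp(logp);                                   % exponentiate only at the end to avoid underflow
end

Under the encoding used above (sing = 1, walk = 2, TV = 3), the first observation sequence would be passed in as obs = [1 2 3 1 2 3 1 2 3 1].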