CS 471 Intro Artificial Intelligence
2020 Spring
Practice Final
Note: These questions are meant only to help guide your studying for the exam. It
is not guaranteed that this practice final covers all of the material (though we
have tried to cover as much as possible), and it is not guaranteed that every topic
on this practice will show up on the final exam. Please note that while the final
exam will have an emphasis on material covered since Midterm 2, material from
earlier parts of the course may be tested as well.
We will go over this practice final and its solutions in class in the review session.
Good luck!
Bayes Net
1. After consulting a behavioral psychologist, we obtained the following complete set of
conditional relationships:
• HW depends only on Party and Smart
• Mac depends only on Smart and Creative
• Project depends only on Smart and Creative
• Success depends only on HW and Project
• Happy depends only on Party, Mac, and Success
i. Draw the Bayesian network.
ii. Write joint distribution as a product of conditional probabilities.
iii. What is the number of independent parameters needed for each conditional probability table?
iv. What is the total number of independent parameters?
v. Using only the Bayesian network structure from part (i), answer the following True/False
questions and provide a brief explanation:
a. Party is independent of Success given HW.
b. Party is independent of Smart given Success.
c. Party is independent of Creative given Happy.
vi. Given the simple Bayes net shown above, what is the probability that it is raining, given that
the grass is wet (P(R = T | G = T))? Show your work.
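The figure for this sub-question is not reproduced here. As a study aid, the sketch below (plain Python, using made-up CPT values for a minimal Rain → Grass network, not the exam's figure) shows how such a query is answered by enumerating the joint and normalizing.

    # Hypothetical two-node network: Rain -> Grass (values are made up, not from the exam).
    P_R = {True: 0.2, False: 0.8}                      # P(R)
    P_G_given_R = {True: {True: 0.9, False: 0.1},      # P(G | R=T)
                   False: {True: 0.3, False: 0.7}}     # P(G | R=F)

    # P(R=T | G=T) = P(R=T, G=T) / P(G=T), with P(G=T) obtained by summing out R.
    joint_T = P_R[True] * P_G_given_R[True][True]      # P(R=T, G=T)
    joint_F = P_R[False] * P_G_given_R[False][True]    # P(R=F, G=T)
    posterior = joint_T / (joint_T + joint_F)
    print(posterior)                                   # approx 0.429 for these made-up numbers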
2.
(a) Write down the joint probability distribution associated with the following Bayes Net.
Express the answer as a product of terms representing individual conditional probability tables
associated with this Bayes Net:
(b) Draw the Bayes net associated with the following joint distribution: P(A) · P(B) · P(C|A,
B) · P(D|C) · P(E|B, C)
(c) Do the following products of factors correspond to a valid joint distribution over the
variables A, B, C, D? (Circle TRUE or FALSE and give a sentence of explanation)
(i) TRUE FALSE P(A) · P(B) · P(C|A) · P(C|B) · P(D|C)
(ii) TRUE FALSE P(A) · P(B|A) · P(C) · P(D|B, C)
(iii) TRUE FALSE P(A) · P(B|A) · P(C) · P(C|A) · P(D)
(iv) TRUE FALSE P(A|B) · P(B|C) · P(C|D) · P(D|A)
(d) What factor can be multiplied with the following factors to form a valid joint distribution?
(Write “none” if the given set of factors can’t be turned into a joint by the inclusion of exactly
one more factor.)
(i) P(A) · P(B|A) · P(C|A) · P(E|B, C, D)
(ii) P(D) · P(B) · P(C|D, B) · P(E|C, D, A)
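One way to reason about parts (c) and (d): a product of conditional factors defines a valid joint exactly when each variable appears on the left of exactly one factor and the implied parent graph is acyclic. The sketch below (plain Python, my own helper names, not part of the exam) checks those two conditions for factor lists written as (child, parents) pairs.

    # Each factor P(X | Pa) is written as (child, parents); e.g. P(C|A,B) -> ("C", {"A", "B"}).
    def is_valid_joint(factors, variables):
        children = [child for child, _ in factors]
        if sorted(children) != sorted(variables):      # every variable defined exactly once
            return False
        # Detect cycles in the implied parent graph with a depth-first search.
        parents = {child: set(pa) for child, pa in factors}
        visiting, done = set(), set()
        def has_cycle(v):
            if v in done:
                return False
            if v in visiting:
                return True
            visiting.add(v)
            cyclic = any(has_cycle(p) for p in parents.get(v, set()))
            visiting.discard(v)
            done.add(v)
            return cyclic
        return not any(has_cycle(v) for v in variables)

    # Example (not one of the exam's items): P(A) P(B|A) P(C|A,B) P(D|C) is valid.
    print(is_valid_joint([("A", set()), ("B", {"A"}), ("C", {"A", "B"}), ("D", {"C"})],
                         ["A", "B", "C", "D"]))        # True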
(e) Answer the next questions based on the Bayes Net below:
All variables have domains of {-1, 0, 1}
(i) Before eliminating any variables or including any evidence, how many entries does the
factor at G have?
(ii) Now we observe e = 1 and want to query P(D | e = 1), and you get to pick the first variable
to be eliminated.
• Which choice would create the largest new factor f1 (the factor with the most entries)?
• Which choice would create the smallest new factor f1 (the factor with the fewest entries)?
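The figure for part (e) is not reproduced here, but the size bookkeeping works the same for any network: with every domain of size 3, a factor over k variables has 3^k entries, and eliminating a variable X joins every factor that mentions X into a new factor over the remaining variables those factors touch. A tiny illustration (made-up factor, not the exam's):

    # A factor over k ternary variables has 3**k entries.
    # e.g. a hypothetical conditional table P(G | A, B) mentions 3 variables:
    num_vars = 3
    print(3 ** num_vars)   # 27 entries

    # Eliminating X joins all factors that mention X, so the new factor f1 ranges over the
    # union of their other variables; its size is 3 ** len(that union).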
Regression
3. We would like to classify some data. We have N samples, where each sample consists of a
feature vector x = {x1, · · · , xk} and a label y ∈ {0, 1}. We introduce a new type of classifier
called logistic regression, which produces predictions as follows:
P(y = 1 | x; w) = s(Σj wj xj)
where s(γ) = 1 / (1 + exp(−γ)) is the logistic function, exp(x) = e^x, and w = {w1, · · · , wk} are the learned weights.
Let’s find the weights wj for logistic regression using stochastic gradient descent. We would like
to minimize the following loss function for each sample:
(a) Find dL/dwi. Hint: s′(γ) = s(γ)(1 − s(γ)).
(b) Write the stochastic gradient descent update for wi. Our step size is η.
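The exam's loss expression is not reproduced above. For studying, the sketch below (plain Python/NumPy) implements one stochastic gradient step for logistic regression under the usual cross-entropy loss; if the exam's loss differs, only the gradient line changes.

    import numpy as np

    def sigmoid(gamma):
        return 1.0 / (1.0 + np.exp(-gamma))            # s(gamma)

    def sgd_step(w, x, y, eta):
        """One stochastic gradient descent update on a single sample (x, y).

        Assumes the cross-entropy loss L = -[y log s(w.x) + (1 - y) log(1 - s(w.x))],
        whose gradient is dL/dw_i = (s(w.x) - y) * x_i.
        """
        pred = sigmoid(np.dot(w, x))
        return w - eta * (pred - y) * x                # w_i <- w_i - eta * dL/dw_i

    # Tiny usage example with made-up numbers.
    w = np.zeros(3)
    w = sgd_step(w, np.array([1.0, 2.0, -1.0]), 1, eta=0.1)
    print(w)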
4. In many real-world scenarios our data has millions of dimensions, but a given example has
only hundreds of non-zero features. For example, in document analysis with word counts for
features, our dictionary may have millions of words, but a given document has only hundreds of
unique words. In this question we will make l2-regularized SGD efficient when our input data is
sparse. Recall that in l2-regularized logistic regression, we want to maximize the following
objective (in this problem we have excluded w0 for simplicity):
where l(x^(j), y^(j), w) is the logistic objective function and the remaining sum is our
regularization penalty.
When we do stochastic gradient descent on point (x^(j), y^(j)), we are approximating the objective
function as
Definition of sparsity: Assume that our input data has d features, i.e. x^(j) ∈ R^d. In this problem,
we will consider the scenario where x^(j) is sparse. Formally, let s be the average number of nonzero
elements in each example. We say the data is sparse when s ≪ d. In the following questions,
your answer should take the sparsity of x^(j) into consideration when possible. Note: When we use
a sparse data structure, we can iterate over the non-zero elements in O(s) time, whereas a dense
data structure requires O(d) time.
a) Let us first consider the case when λ = 0. Write down the SGD update rule for wi when λ = 0,
using step size η, given the example (x^(j), y^(j)).
b) If we use a dense data structure, what is the average time complexity to update wi when λ = 0?
What if we use a sparse data structure? Justify your answer in one or two sentences.
c) Now let us consider the general case when λ > 0. Write down the SGD update rule for wi when
λ > 0, using step size η, given the example (x^(j), y^(j)).
d) If we use a dense data structure, what is the average time complexity to update wi when λ > 0?
e) Let wi^(t) be the value of wi after the t-th update. Now imagine that we perform k SGD updates
on w using examples (x^(t+1), y^(t+1)), · · · , (x^(t+k), y^(t+k)), where xi^(j) = 0 for every example in the
sequence (i.e. the i-th feature is zero for all of the examples in the sequence). Express the new
weight wi^(t+k) in terms of wi^(t), k, η, and λ.
f) Using your answer in the previous part, come up with an efficient algorithm for regularized
SGD when we use a sparse data structure. What is the average time complexity per example?
(Hint: when do you need to update wi?)
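Part (f) is pointing at lazy ("just-in-time") regularization. Below is a minimal sketch in Python, assuming the per-step update wi ← (1 − 2ηλ)wi + η(y − s(w·x))xi (logistic loss); if the update you derived in part (c) has a different constant, only the shrink factor changes. The helper names are my own.

    import math

    def sparse_regularized_sgd(examples, d, eta, lam):
        """Lazy l2-regularized SGD sketch for sparse inputs.

        Each example is (x, y) with x a dict {feature index: value}; cost is O(s) per example,
        because a weight is only "caught up" on its skipped shrink steps when its feature
        is actually non-zero in the current example.
        """
        sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
        w = [0.0] * d
        last = [0] * d                       # step at which w[i] was last brought up to date
        shrink = 1.0 - 2.0 * eta * lam       # assumed per-step multiplicative shrinkage
        for t, (x, y) in enumerate(examples, start=1):
            for i in x:                      # apply the skipped shrink steps all at once
                w[i] *= shrink ** (t - last[i])
                last[i] = t
            pred = sigmoid(sum(w[i] * xi for i, xi in x.items()))
            for i, xi in x.items():
                w[i] += eta * (y - pred) * xi
        for i in range(d):                   # final catch-up so every weight is current
            w[i] *= shrink ** (len(examples) - last[i])
        return w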
5. Below is a data set to be linearly separated into two classes (+) and (o) by the blue line. The blue
line is initialized as x2 = −2·x1 by the weights w = [0 1 0.5]′ (bias weight first). Use data points from the set
below to test their classification by the blue line. Once you find a point that is incorrectly
classified, use it to update the weights, and repeat until correct weights are found. Use η = 0.2.
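A minimal sketch of the binary perceptron update used here (Python; the data points themselves are in the figure, which is not reproduced, so the example point below is made up):

    # Perceptron update for a misclassified point, with w = [w0, w1, w2] (bias first)
    # and augmented input [1, x1, x2].  Label convention assumed: +1 for (+), -1 for (o).
    def perceptron_update(w, point, label, eta=0.2):
        x = [1.0] + list(point)                       # prepend the bias input
        activation = sum(wi * xi for wi, xi in zip(w, x))
        predicted = 1 if activation >= 0 else -1
        if predicted != label:                        # only misclassified points change w
            w = [wi + eta * label * xi for wi, xi in zip(w, x)]
        return w

    w = [0.0, 1.0, 0.5]
    w = perceptron_update(w, (1.0, 1.0), -1)          # hypothetical misclassified 'o' point
    print(w)                                          # [-0.2, 0.8, 0.3] for this made-up point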
6. We are dealing with samples x, where x is a single value. We would like to test two alternative
regression models:
1. y = ax + e
2. y = ax + bx² + e
We make the same assumptions we had in class about the distribution of e (e ∼ N(0, σ²)).
a) Assume we have n samples x1 … xn with their corresponding y values y1 … yn. Derive
the value assigned to b in model 2. You can use a in the equation for b.
b) Which of the two models is more likely to fit the training data better? Explain your answer.
i) model 1.
ii) model 2.
iii) both will fit equally well
iv) impossible to tell
c) Which of the two models is more likely to fit the test data better? Explain your answer.
i) model 1.
ii) model 2.
iii) both will fit equally well
iv) impossible to tell
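For intuition on parts (b) and (c), the sketch below (NumPy, made-up data, not the exam's derivation) fits both models by least squares and compares training error; the model with the extra bx² term can never fit the training data worse, which is the core of the train/test distinction being asked about.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-2, 2, size=30)
    y = 1.5 * x + rng.normal(0, 0.5, size=30)          # data generated by model 1

    # Least-squares fit of model 1 (y = a*x) and model 2 (y = a*x + b*x^2).
    a1 = np.linalg.lstsq(x[:, None], y, rcond=None)[0]
    ab2 = np.linalg.lstsq(np.column_stack([x, x**2]), y, rcond=None)[0]

    err1 = np.mean((y - x * a1[0]) ** 2)
    err2 = np.mean((y - (x * ab2[0] + x**2 * ab2[1])) ** 2)
    print(err1, err2)   # err2 <= err1: the richer model always fits training data at least as well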
Perceptron
7. You are learning a multi-class perceptron between 4 classes. The weight vectors are currently
[1, 0]^T, [0, 1]^T, [−1, 0]^T, [0, −1]^T for the classes A, B, C, and D. The next training example x
has a label of A and feature vector f(x). For the following questions, do not make any
assumptions about tie-breaking. (Do not write down a solution that creates a tie.) Show your
work.
(i) Write down a feature vector in which no weight vectors will be updated.
f(x) = [ __ , __ ] or Not possible
(ii) Write down a feature vector in which only wA will be updated by the perceptron.
f(x) = [ __ , __ ] or Not possible
(iii) Write down a feature vector in which only wA and wB will be updated by the perceptron.
f(x) = [ __ , __ ] or Not possible
(iv) Write down a feature vector in which only wA and wC will be updated by the perceptron.
f(x) = [ __ , __ ] or Not possible
The weight vectors are the same as before, but now there is a bias feature with value of 1 for all x
and the weight of this bias feature is 0, −2, 1, −1 for classes A, B, C, and D respectively.
As before, the next training example x has a label of A and a feature vector f(x). The always “1”
bias feature is the first entry in f(x).
(v) Write down a feature vector in which only wB and wC will be updated by the perceptron.
f(x) = [1, __ , __ ] or Not possible
(vi) Write down a feature vector in which only wA and wC will be updated by the perceptron.
f(x) = [1, __ , __ ] or Not possible
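A minimal sketch of the multi-class perceptron rule these parts exercise (Python; the example call uses a label and point that are not one of the asked scenarios):

    # Multi-class perceptron: predict the class whose weight vector scores f(x) highest;
    # on a mistake, add f(x) to the true class's weights and subtract it from the predicted class's.
    def multiclass_step(weights, fx, true_class):
        scores = {c: sum(wi * xi for wi, xi in zip(w, fx)) for c, w in weights.items()}
        predicted = max(scores, key=scores.get)
        if predicted != true_class:
            weights[true_class] = [wi + xi for wi, xi in zip(weights[true_class], fx)]
            weights[predicted] = [wi - xi for wi, xi in zip(weights[predicted], fx)]
        return predicted, weights

    weights = {"A": [1, 0], "B": [0, 1], "C": [-1, 0], "D": [0, -1]}
    print(multiclass_step(weights, [1, 0], "B"))
    # predicted "A" is wrong, so wB becomes [1, 1] and wA becomes [0, 0]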
8. We would like to use a perceptron to train a classifier for drone position in a 3D world
coordinate system, with labels 1 (active tracking of adversaries) or 0 (passive
surveillance). Let's use a learning rate of α = 0.25. Consider the following labeled training
data:
Features – Drone Position (x, y, z)    Label y*
(-1, 2, 3)                             1
(3, -1, 3)                             0
(1, 2, 3)                              0
(3, 1, 3)                              1
(a) Our perceptron weights have been initialized to w1 = 2, w2 = −2, w3 = 0.5. After
processing the first point with the perceptron learning rule, what will be the updated values for
these weights?
(b) After how many steps will the perceptron algorithm converge? Write “never” if it will never
converge. Note: one step means processing one point. Points are processed in order and then
repeated, until convergence.
(c) Instead of the perceptron, we decide to treat the perceptron as a single node neural network
and update the weights using gradient descent on the loss function.
The loss function for one data point is Loss(y, y*) = (y − y*)², where y* is the training label for
a given point and y is the output of our single-node network for that point.
(i) Given a general activation function g(z) and its derivative g′(z), what is the derivative of the
loss function with respect to w2 in terms of g, g′, y*, x1, x2, x3, w1, w2, and w3?
(ii) For this question, the ReLU activation function is used:
g(z) = z if z ≥ 0, and g(z) = 0 if z < 0
Use the gradient descent update equation wi = wi − α ∂Loss/∂wi to update the weights for a single
data point. With initial weights of w1 = 2, w2 = −2, w3 = 0.5, what are the updated weights after
processing the first point? Show all of your work.
(iii) For this question, a tanh activation function is used:
g(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z))
Use the gradient descent update equation wi = wi − α ∂Loss/∂wi to update the weights for a single
data point. With initial weights of w1 = 2, w2 = −2, w3 = 0.5, what are the updated weights after
processing the second point? Show all of your work. (You can round values to 1 digit after the
decimal.)
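A minimal numeric sketch of the update in parts (ii)/(iii) (Python; it assumes the chain-rule gradient ∂Loss/∂wi = 2(y − y*)·g′(z)·xi, which is what part (i) asks you to derive, and uses a made-up point rather than one from the table above):

    import math

    def gd_step(w, x, y_star, g, g_prime, alpha=0.25):
        """One gradient descent step on Loss = (y - y*)^2 for a single-node network y = g(w.x)."""
        z = sum(wi * xi for wi, xi in zip(w, x))
        y = g(z)
        dloss_dz = 2.0 * (y - y_star) * g_prime(z)          # chain rule through the activation
        return [wi - alpha * dloss_dz * xi for wi, xi in zip(w, x)]

    tanh = math.tanh
    tanh_prime = lambda z: 1.0 - math.tanh(z) ** 2
    print(gd_step([2.0, -2.0, 0.5], [1.0, 1.0, 1.0], 1, tanh, tanh_prime))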
Neural networks
9. The network below is a neural network with inputs x1 and x2, and outputs y1 and y2. The
internal nodes are computed below. All variables are scalar values.
Note that ReLU(x) = max(0, x).
The expressions for the internal nodes in the network are given here for convenience:
h1 = w11·x1 + w12·x2    h2 = w21·x1 + w22·x2    h3 = w31·x1 + w32·x2
r1 = ReLU(h1)    r2 = ReLU(h2)    r3 = ReLU(h3)
s1 = x1 + r1 + r2    s2 = x2 + r2 + r3
y1 = exp(s1) / (exp(s1) + exp(s2))    y2 = exp(s2) / (exp(s1) + exp(s2))
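A compact forward-pass sketch of this exact architecture (NumPy), handy for checking arithmetic in the parts below; plug in the given weights and inputs yourself rather than taking any values from here.

    import numpy as np

    def forward(x1, x2, W):
        """W is a 3x2 matrix of weights [[w11, w12], [w21, w22], [w31, w32]]."""
        h = W @ np.array([x1, x2])                    # h1, h2, h3
        r = np.maximum(0.0, h)                        # ReLU
        s1 = x1 + r[0] + r[1]
        s2 = x2 + r[1] + r[2]
        y1 = np.exp(s1) / (np.exp(s1) + np.exp(s2))   # softmax over (s1, s2)
        return h, r, (s1, s2), (y1, 1.0 - y1)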
(a) Forward Propagation. Suppose for this part only that x1 = 3, x2 = 5, w11 = −10, w12 = 7, w21 = 2,
w22 = 5, w31 = 4, w32 = −4. What are the values of the following internal nodes? Please simplify
any fractions.
(i) h1 = (ii) s1 = (iii) y2 =
(b) Back Propagation. Compute the following gradients analytically. The answer should be an
expression of any of the nodes in the network (x1, x2, h1, h2, h3, r1, r2, r3, s1, s2, y1, y2) or weights
w11, w12, w21, w22, w31. In the case where the gradient depends on the values of nodes in the
network, please list all possible analytical expressions (caused by active/inactive ReLUs),
separated by commas.
Hint 1: If z is a function of y, and y is a function of x, the chain rule for derivatives is dz/dx = (dz/dy)(dy/dx).
Hint 2: Recall that for functions of the form y1 = exp(s1) / (exp(s1) + exp(s2)), we have ∂y1/∂s1 = y1(1 − y1) and ∂y1/∂s2 = −y1·y2.
(i)
(ii)
(iii)
(iv)
(v)
(vi)
(c) In roughly 15 words, what role could the non-negative values in node r2 play, given its
location in the network architecture?
10. For each of the piecewise-linear functions below, mark all networks from the list above that can
represent the function exactly on the range x ∈ (−∞, ∞). In the networks above, relu denotes the
element-wise ReLU nonlinearity: relu(z) = max(0, z). The networks Gi use 1-dimensional layers,
while the networks Hi have some 2-dimensional intermediate layers. Briefly justify your answers.
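The network list and target functions are figures that are not reproduced here. As a study aid, the sketch below (Python, my own example) shows why layer width matters: a single 1-dimensional ReLU layer can only introduce one kink, whereas two ReLU units combined by a sum can build a function like |x|, which no single scaled and shifted ReLU can match on all of (−∞, ∞).

    # |x| = relu(x) + relu(-x): a 2-unit hidden layer followed by a sum reproduces it exactly,
    # while a single relu(w*x + b) composed with an affine map can only produce one "kink".
    def relu(z):
        return max(0.0, z)

    def abs_via_two_relus(x):
        return relu(x) + relu(-x)

    for x in (-3.0, -0.5, 0.0, 2.0):
        assert abs_via_two_relus(x) == abs(x)
    print("ok")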