MAT 3373: HOMEWORK 1
Homework Policy Basics
(1) You can find homework deadlines in the course schedule, available on
Brightspace.
(2) You can find detailed homework grading policies in the first week’s
lecture notes, available on Brightspace.
(3) You can find overall grading policies in the syllabus, available on Brightspace.
(4) Homework should be written in R Markdown format.[1] You should
always submit both the raw workbook (in .Rmd format) and a compiled
version where I can see the charts (in .html or .pdf format). Marks
will be taken off if the homework is not compiled correctly. I
suggest using the set.seed() command near the top of each homework set
(a sketch of a workbook header appears below), and only re-compiling
chunks as you need to. This will reduce the chance of bugs popping up
in previously-completed parts of the homework.
(5) Homework solutions should appear in the same order as the questions
in the document. I will dock 3 percent on assignments with out-of-order
solutions, and I will generally not post grades of individual questions.
(6) It is possible to submit several versions of the homework. I will always
grade the last one submitted before the deadline. With this in mind,
I strongly encourage you to submit drafts of your homework
well before the deadline. In particular, you might as well submit a
draft every time you finish a question.
[1] See the first week's lecture notes for comments aimed at Python users.
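For item (4), a minimal sketch of the top of a homework workbook; the title and seed value are placeholders, not requirements:

---
title: "MAT 3373: Homework 1"
output: html_document
---

```{r setup}
# Placeholder seed; any fixed value keeps the compiled results reproducible.
set.seed(3373)
```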
Introduction and Homework Focus
This homework is concerned with material from Chapters 1-4 of the textbook,
with one introductory question from Section 5.2. It focuses on getting
practice with programming, and on the methods of regression, logistic regression,
and k-nearest neighbours. It uses only the basic statistical tests seen
in Chapters 1-4. In particular, the homework doesn’t use the main tools for
statistical analysis of a collection of very different models, which are introduced
in Chapter 5. As such, many of the questions are either straightforward
applications of the methods (1,2,7,10) or ask you to identify and resolve an
enormous problem that appears when you try a straightforward application
(3,4,5,8,9).
Finding and fixing these sorts of enormous problems is of course important
- all of the mistakes highlighted in this homework are fairly easy to make, even
for professionals. Still, this might give the impression that machine learning is
mostly about identifying these sorts of gross violations of modelling assumptions.
In the following homework set, we’ll have access to more of the machine
learner’s standard toolkit, and we’ll do more examples where a “normal”
workflow gives sensible results that can be improved with small tweaks and good
statistical practice.
1. Introduction to Data Exploration
Do Question 8 from Chapter 2 of the textbook.
2. Empirical Study: KNN for MNIST Data
Open the (small subsample of the) MNIST testing and training datasets:
MNIST_train = read.csv("mnist_train.csv")
MNIST_test = read.csv("mnist_test.csv")
Fit the knn classifier using the training data for k ∈ {3, 4, 5, 6}. Select the
value of k that minimizes the test error, and calculate the confusion matrix.
Comment on the results. If you saw a collection of new datapoints from the
same source, would you expect the error to be larger, smaller, or about the
same as the observed test error?
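A minimal sketch of one possible workflow, assuming the first column of each file holds the digit label and the remaining columns hold pixel values (check this against the actual CSVs):

library(class)  # provides the knn() function
train_X = MNIST_train[, -1]; train_y = MNIST_train[, 1]
test_X = MNIST_test[, -1]; test_y = MNIST_test[, 1]
errs = sapply(3:6, function(k) {
  pred = knn(train_X, test_X, cl = factor(train_y), k = k)
  mean(pred != test_y)  # test error for this value of k
})
best_k = (3:6)[which.min(errs)]
pred_best = knn(train_X, test_X, cl = factor(train_y), k = best_k)
table(predicted = pred_best, truth = test_y)  # confusion matrix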
3. Short Conceptual Questions
(1) Do Question 3 from Chapter 3 of the textbook.
(2) Imagine that you’re going to collect data, and have committed to doing
a one-dimensional linear regression analysis with intercept parameter
β0 known to be 0 and variance parameter σ known to be 1 (so you are
just trying to learn the slope term β1). Furthermore, you are absolutely
certain that the linear regression model is true.[2]
Before collecting the response variable, you need to choose the predictors.
You have the following options:
X^(1) = (−12, −9, −6, −3, 0, 3, 6, 9, 12)
X^(2) = (−1, −0.9, −0.8, . . . , 0.8, 0.9, 1)
X^(3) = (−28, −2, 76, 412).
Which of those should you choose? Why?
[2] Of course this is unrealistic, but please take it seriously for this question.
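One standard fact worth recalling for this model (no intercept, known σ): the least-squares slope satisfies

Var(β̂1) = σ² / (X1² + · · · + Xn²),

so the choice of predictors directly controls how precisely β1 is estimated.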
(3) People often use statistical models to do optimization, as follows. You
have some function f, and would like to find the largest value max_x f(x).
You try the following procedure:
(a) Collect data (X1, Y1), . . . , (Xn, Yn).
(b) Based on this data, estimate a model f̂.
(c) Use the location of the maximum of the predicted values, argmax_x f̂(x),
as a prediction for the true location of the maximum, argmax_x f(x).
This is a reasonable thing to try: you get an estimate f̂ of an entire
function, and you can try to use that as a surrogate for the real function
when doing optimization (or any other task). For some models, this
approach works rather well. Explain why it will almost never work well
when the model is a linear regression model on all of R.
4. Lying with Linear Regression
It is often possible to get very misleading results by deliberately choosing a
bad model with a certain structure. In this question, we’ll practice doing this
in a simple setting.[3]
Throughout the question, we’ll assume that every X-value is paired with
the observed Y-value Y = sin(X); there is no measurement error. We will
then fit the usual linear regression model, with unknown β0, β1, σ, to this data.
(1) Find a collection X1, . . . , Xn of predictors so that the 99-percent
confidence interval for the slope β1 is contained in the interval (−∞, −0.95].
Display the collection of points, fit the model in R, and give the output
of the summary function applied to the fit.
(2) Find a collection X1, . . . , Xn of predictors so that the 99-percent
confidence interval for the slope β1 is contained in the interval [0.95, ∞).
Display the collection of points, fit the model in R, and give the output
of the summary function applied to the fit.
(3) Would it be possible to do part (2) of this question if I replaced the
interval [0.95, ∞) by the interval [100, ∞)? Why or why not?
Note: A complete proof is not required, but will be considered for
bonus marks.
[3] In case it wasn’t clear: you shouldn’t actually do this! However, seeing
very misleading results can help you learn to diagnose problems with
statistical analyses.
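A minimal sketch of the fitting workflow for parts (1) and (2); the X-values shown are placeholders, not a solution:

x = seq(2, 4, by = 0.1)           # placeholder predictors: replace with your choice
y = sin(x)                        # responses observed without measurement error
plot(x, y)                        # display the collection of points
fit = lm(y ~ x)                   # usual simple linear regression
summary(fit)
confint(fit, "x", level = 0.99)   # 99-percent confidence interval for the slope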
5. Simulation Study: Post-Selection Inference
Generate predictors “Pred” and response variables “Resp” according to the
following R code:
m = 80   # You can experiment with your own large value.
n = 100  # You can experiment with your own large value.
Pred = matrix(rnorm(m*n, 0, 1), nrow = n, ncol = m)
Resp = rnorm(n, 0, 1)
We interpret this as n datapoints, with the j’th datapoint looking like
(X1^(j), . . . , Xm^(j), Y^(j)).
(1) Do the following two steps:
(a) For each i ∈ {1, 2, . . . , m}, compute the correlation of
(Xi^(1), . . . , Xi^(n)) with (Y^(1), . . . , Y^(n)).
Denote by i1, i2 the two indices with the largest correlation in absolute
value.
(b) Fit the linear model
Y = β0 + β1 X_{i1} + β2 X_{i2} + ε
based on the response variable Y and the two chosen predictors.
Comment on the fit and the confidence intervals for these quantities
(a sketch of steps (a) and (b) appears after this question).
(2) Split the data into two equal-sized parts: a training dataset and a
testing dataset.
Repeat the above question, but use only the training dataset in part
(a) and only the testing dataset in part (b). Comment on the difference
between the results you see.
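A minimal sketch of steps (a) and (b), using the Pred and Resp objects defined above:

cors = apply(Pred, 2, function(col) cor(col, Resp))  # correlation of each predictor with Resp
idx = order(abs(cors), decreasing = TRUE)[1:2]       # the indices i1, i2
fit = lm(Resp ~ Pred[, idx[1]] + Pred[, idx[2]])     # model with the two selected predictors
summary(fit)                                         # fit and p-values
confint(fit)                                         # confidence intervals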
6. Theory: Consistency of K-NN Classification
We will prove that the k-nearest neighbour classification algorithm eventually
gets the right answer, at least in a simple setting.
Define the function f : [0, 1] → {0, 1} by the piecewise-constant formula
f(x) = 0 for x < 0.5,
f(x) = 1 for x ≥ 0.5.
Consider data X1, X2, . . . i.i.d. ∼ Unif[0, 1], and let Yi = f(Xi). For integer
k ∈ {3, 5, 7, . . .} and n ∈ {k + 1, k + 2, . . .}, let f̂k,n be the k-nearest neighbour
classifier associated with the dataset {(X1, Y1), . . . , (Xn, Yn)}.
Prove that, for any fixed x ∈ [0, 0.5) ∪ (0.5, 1] and any fixed k ∈ {3, 5, 7, . . .},
we have
lim_{n→∞} P[f̂k,n(x) = f(x)] = 1.
Hint 1: You don’t have to directly estimate the probability of the event
An = {f̂k,n(x) = f(x)} in the question. It is enough to estimate the probability
of some event Bn contained in An.
Hint 2: If you’re stuck proving this for all x, start with a concrete number
such as x = 0.75. Can you write down a formula for the probability of the
event “at least k of the points X1, . . . , Xn are in the interval (0.5, x)”? When
that event happens, what can you say about f̂k,n(x)?
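One way to start on Hint 2: each Xi lands in the interval (0.5, x) independently with probability p = x − 0.5 (the length of the interval under Unif[0, 1]), so the probability in question is a binomial tail:

P[at least k of X1, . . . , Xn are in (0.5, x)] = Σ_{j=k}^{n} C(n, j) p^j (1 − p)^{n−j}.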
Hint 3: You may find the following fact useful: for any constants α, β > 0,
lim_{n→∞} n^α e^{−βn} = 0.
You may remember this as a phrase like “exponentials beat polynomials” from
a plotting section in calculus class.
7. Empirical Study: Bayes Rate of Classifiers
Open the dataset weight-height.csv. This dataset was downloaded from
Kaggle.com. Split this dataset (randomly) into a training dataset and a
testing dataset, each with 2500 men and 2500 women.
(1) For both genders separately, calculate the sample average weight, the
sample average height, and the sample variances of the weight and height,
based on this training dataset. Note: you should have eight numbers
at this point; for each of the two genders, and each of the two measurements,
you should have both the mean and the variance.
(2) For the remainder of the question, assume that the male data is exactly
bivariate Gaussian with means and variances as calculated in the previous
part of the question (and 0 covariance). Similarly, assume that
the female data is exactly bivariate Gaussian with means and variances
as calculated in the previous part of the question (and 0 covariance).[4]
Based on this modelling assumption and using only the observed
height and weight, predict the gender of the 5000 elements of the testing
set according to the Bayes classifier (see page 38 of the textbook
for a definition).
Comment on the quality of this classifier. Can you think of anywhere
that the model can be improved, or does it look about as good as you
could hope for (given the data)?
Hint: You may wish to make an illustrative plot along the following
lines: let the x-axis and y-axis be the observed weight and height, and
colour each point in one of 4 colours corresponding to true/predicted
gender. You need not make exactly this plot (and indeed you will
probably need to make small adjustments to end up with a usable
picture).
[4] Of course, this is not really the true data-generating process. I’m making this
assumption because it is only possible to calculate a Bayes rate if you know the true
data-generating process. As you can probably guess, this means that you can almost
never calculate the true Bayes rate. Nonetheless, it is useful to think about the Bayes
rate to try to understand the fundamental limits of any classification procedure.
(3) Compute the average Bayes error rate, using the same modelling
assumption as in the previous part of this question.[5] Use a plot or other
data-summarization technique to check whether the Bayes error rate seems
sensible, and explain/interpret the broad features of your plot.
[5] Again, computing the Bayes error rate requires you to know (or assume you know)
the true data-generating process.
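A minimal sketch of the classifier in part (2). Here test, and the eight training summaries male_mu_h, male_var_h, . . . , female_var_w, are placeholders for your own test split and part (1) numbers; the column names Gender, Height and Weight are an assumption about the file:

# Product of two univariate normal densities (the 0-covariance assumption).
dens = function(h, w, mu_h, var_h, mu_w, var_w) {
  dnorm(h, mu_h, sqrt(var_h)) * dnorm(w, mu_w, sqrt(var_w))
}
p_male = dens(test$Height, test$Weight, male_mu_h, male_var_h, male_mu_w, male_var_w)
p_female = dens(test$Height, test$Weight, female_mu_h, female_var_h, female_mu_w, female_var_w)
pred = ifelse(p_male > p_female, "Male", "Female")  # equal class priors: 2500 each
mean(pred != test$Gender)                           # test error under the assumed model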
8. Data Analysis: House Sizes and Collinearity
Open the file
House_Data.csv
using the command:
house = read.csv("House_Data.csv")
This dataset is a version of the dataset posted at
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview. The data
itself is real; I just pruned various outliers to make the following data analysis
slightly easier.
In the following, I denote by Ai, Fi and Si the lot area, first floor area and
second floor area of the i’th house. In the dataset, these can be accessed by
writing:
house$LotArea[i], house$X1stFlrSF[i], house$X2ndFlrSF[i]
We will consider the following “full” regression model
Ai = α + β1 Fi + β2 Si + εi    (8.1)
and the two “sub-”models
Ai = α + β1 Fi + εi    (8.2)
and
Ai = α + β2 Si + εi.    (8.3)
Note: Equations (8.1), (8.2) and (8.3) are three different models; I’m not
writing down one model where all three of these equations hold simultaneously!
In reality none of these models are quite true, and when you fit them you will
not get the same estimates for α, β1, β2, εi in all three models.
(1) Plot Fi vs. Si. Do you notice anything about their relationship?
(2) Use the lm function to fit all three models and find p-values for the
parameters α, β1, β2.
(3) When you fit model (8.1), are the p-values for β1, β2 below 0.05? What
about the p-values when you fit models (8.2) and (8.3)? Comment on
any discrepancy, perhaps in light of the plot in the first part of the
question.
(4) In the first three parts of the question, I told you to plot Fi and Si together
and then fit three models for Ai. In light of the plot from part (1) (and
common-sense observations about the relationship between Ai, Fi and Si),
suggest a modelling procedure that would have produced better results.
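A minimal sketch of parts (1) and (2), using the column names given above:

plot(house$X1stFlrSF, house$X2ndFlrSF)                        # part (1): Fi vs. Si
fit_full = lm(LotArea ~ X1stFlrSF + X2ndFlrSF, data = house)  # model (8.1)
fit_F = lm(LotArea ~ X1stFlrSF, data = house)                 # model (8.2)
fit_S = lm(LotArea ~ X2ndFlrSF, data = house)                 # model (8.3)
summary(fit_full)  # p-values appear in the coefficient tables
summary(fit_F)
summary(fit_S)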
9. Simulation Study: Bootstrap
There are several closely related algorithms called “the bootstrap.” We first
describe the bootstrap that will be studied in this question.
We have some parametric model {fθ}θ∈Θ, data X1, . . . , Xn ∈ R, and an
estimator θ̂ : R^n → Θ taking a dataset to a parameter value. A single
bootstrap replicate is obtained by the following algorithm:
(1) Sample X*1, . . . , X*n uniformly and with replacement from the dataset
X1, . . . , Xn.
(2) Return Θ* = θ̂(X*1, . . . , X*n).
Denote by θ0 ∈ Θ the “true” parameter, F the true distribution of
θ̂(X1, . . . , Xn), and Fboot the distribution of Θ*.
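A minimal sketch of a single bootstrap replicate in R, for a generic estimator theta_hat (any function taking a numeric vector to a parameter value):

boot_replicate = function(x, theta_hat) {
  x_star = sample(x, size = length(x), replace = TRUE)  # step (1): resample with replacement
  theta_hat(x_star)                                     # step (2): return Θ* = θ̂(X*1, . . . , X*n)
}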
(1) Consider the parametric model fθ = N(θ, 1) with the usual MLE
θ̂ = (1/n) Σ_{i=1}^{n} Xi. Fix n = 100 and θ = 0. Do a simulation study and plot
the densities of both F and Fboot. Do they look similar?[6]
(2) Consider the parametric model fθ = Unif[0, θ]. Show that the MLE is
θ̂ = max(X1, . . . , Xn).
(3) Set n = 100 and θ = 1. Do a simulation study to plot the densities
of both F and Fboot for the setup in part (2).[7] Do they look similar?
If not, comment on what would “go wrong” if you used Fboot as a
surrogate for F.
Depending on how you plot initially, the densities might be “squished”
and hard to see; rescale your x- and y-axes until you get a good look,
and comment in particular on the values in intervals that are very close
to 1, such as [0.998, 1].
[6] It is possible to compute F exactly, but computing Fboot is much harder. Rather than
trying to do this, you should take many samples from Fboot using the bootstrap algorithm
given at the start of the question, then plot the histogram or density plot of this sample.
[7] As in the previous part of the question, you can estimate F and Fboot by simulating
from the appropriate densities and plotting the empirical distributions. However,
it is not terribly difficult to compute both of these densities exactly using only tools from
MAT2371/2377. If you can do these computations, you may find it easier to plot and
compare them.
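A minimal sketch of the simulation for part (1), reusing boot_replicate from above; the number of replicates (2000) is an arbitrary choice:

set.seed(3373)
n = 100; theta = 0
theta_hat = function(x) mean(x)      # the usual MLE for N(theta, 1)
x = rnorm(n, theta, 1)               # one observed dataset
F_samples = replicate(2000, theta_hat(rnorm(n, theta, 1)))     # fresh datasets: samples from F
Fboot_samples = replicate(2000, boot_replicate(x, theta_hat))  # resamples: samples from Fboot
plot(density(F_samples), main = "F (solid) vs Fboot (dashed)")
lines(density(Fboot_samples), lty = 2)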
10. Logistic Regression Practice
Do Parts (a,b,c,d,e) of Question 10 from Chapter 4 of the textbook.
Comment on the quality of the result.