Final Project
Stat 428
I. Simulation Problem (50 points)
In the lecture, we discussed nearest neighbor tests and the energy distance test for the two-sample testing
problem. We consider two more tests: the two-sample Hotelling's T-square test and the graph-based two-sample
test. Suppose we observe data $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$, where $X_i, Y_j \in \mathbb{R}^d$ are
multivariate random vectors. Here, $X_1, \dots, X_n$ are drawn from distribution $F$ and $Y_1, \dots, Y_m$ are
drawn from distribution $G$. The hypotheses of interest in the two-sample testing problem are
$$H_0 : F = G \quad \text{and} \quad H_1 : F \neq G.$$
The graph-based two-sample test is defined as follows. We pool all the data together:
$$(Z_1, \dots, Z_{n+m}) = (X_1, \dots, X_n, Y_1, \dots, Y_m).$$
Based on these $n + m$ observations, we construct a graph $G = (V, E)$ whose vertex set is
$V = \{1, \dots, n + m\}$ and in which there is an edge between $i$ and $j$ if $\|Z_i - Z_j\| \le Q$, where $Q$
is a positive number. Let $E$ be the collection of edges.
be the collection of edges. The graph-based two-sample test statistic is defined as
$$T = \frac{1}{|E|} \sum_{e \in E} I_e,$$
where $|E|$ is the number of edges in the edge set $E$. Here, $I_e = 1$ if the two vertices connected by $e$
have the same label (both from the $X$ sample or both from the $Y$ sample) and $I_e = 0$ otherwise.
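The construction above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the distance is taken to be Euclidean, the statistic is taken to be the proportion of edges whose two endpoints come from the same sample, and the function name `graph_stat` is hypothetical.

```r
# Sketch of the graph-based statistic.
# Assumptions: Euclidean distance; statistic = share of edges with I_e = 1.
graph_stat <- function(x, y, Q) {
  z <- rbind(x, y)                               # pooled sample, one row per observation
  label <- rep(c(1L, 2L), c(nrow(x), nrow(y)))   # 1 = X sample, 2 = Y sample
  d <- as.matrix(dist(z))                        # pairwise Euclidean distances
  # Keep each unordered pair (i < j) once, and only if it is within distance Q.
  edge <- which(d <= Q & upper.tri(d), arr.ind = TRUE)
  if (nrow(edge) == 0) return(NA_real_)          # no edges: statistic is undefined
  mean(label[edge[, 1]] == label[edge[, 2]])     # proportion of same-label edges
}

# Toy data: two well-separated pairs of points in the plane.
x <- rbind(c(0, 0), c(0, 1))
y <- rbind(c(10, 10), c(10, 11))

tight <- graph_stat(x, y, Q = 1.5)   # only within-sample edges exist
loose <- graph_stat(x, y, Q = 100)   # every pair is an edge
```

With a small Q only the two within-sample pairs are connected, so every edge has the same label at both ends; with a very large Q the graph is complete and the statistic no longer separates the samples. A reference distribution for the statistic could be obtained by permuting the labels, which is one possible approach rather than something the handout prescribes.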
Question 1 Report
A pharmaceutical company would like to test whether the effects of two treatments are similar. The
manager wants to choose one two-sample testing method from among the nearest neighbor test, the energy
distance test, Hotelling's T-square test and the graph-based two-sample test, and asks for your advice on
which to choose. First, could you help the manager by implementing these four methods from scratch: the
nearest neighbor test, the energy distance test, Hotelling's T-square test and the graph-based two-sample
test? Second, could you prepare a report with some suggestions for the manager? In this report, you need
to address at least four of the following points:
• Several parts of these tests can be customized, e.g., the threshold Q in the graph-based test, the
number of neighbors in the nearest neighbor test, and the specific form of distance in the energy distance
test and the graph-based test. Could you provide some suggestions on how to choose these customizable
parts? You need to show some numerical experiments as your evidence.
• Are these tests sensitive to the dimension d of the data?
• Are these tests sensitive to the specific distributions F and G?
• Which test has larger power, and under what conditions?
• Clearly, the power of a test depends on the sample sizes n and m and on how different F and G are under
the alternative hypothesis. Could you prepare a plot to show the effect of sample size on power? Could you
prepare another plot to show the effect of the difference between F and G on power?
• Are these methods able to control Type I error?
You need to submit both Rmd and pdf file of your report.
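Several of the points above (Type I error control, power against sample size, power against the difference between F and G) come down to estimating a rejection rate by simulation. The sketch below shows one way this could look for Hotelling's T-square test. It assumes multivariate normal data with identity covariance and a common mean shift in every coordinate, and it uses the standard F reference distribution for the statistic. The function names `hotelling_pvalue` and `rejection_rate` are hypothetical.

```r
# Two-sample Hotelling T-square test, p-value from the F reference distribution.
# Assumption: both samples share a common covariance matrix.
hotelling_pvalue <- function(x, y) {
  n <- nrow(x); m <- nrow(y); d <- ncol(x)
  diff <- colMeans(x) - colMeans(y)
  s_pooled <- ((n - 1) * cov(x) + (m - 1) * cov(y)) / (n + m - 2)
  t2 <- (n * m / (n + m)) * sum(diff * solve(s_pooled, diff))
  f_stat <- (n + m - d - 1) / (d * (n + m - 2)) * t2
  pf(f_stat, d, n + m - d - 1, lower.tail = FALSE)
}

# Monte Carlo estimate of the probability of rejecting H0 at level alpha.
# shift = 0 gives the Type I error; shift != 0 gives power.
rejection_rate <- function(n, m, d, shift, reps = 500, alpha = 0.05) {
  rejected <- replicate(reps, {
    x <- matrix(rnorm(n * d), ncol = d)                 # F: N(0, I)
    y <- matrix(rnorm(m * d, mean = shift), ncol = d)   # G: N(shift, I)
    hotelling_pvalue(x, y) < alpha
  })
  mean(rejected)
}

set.seed(428)
type1 <- rejection_rate(n = 30, m = 30, d = 3, shift = 0)    # F = G
power <- rejection_rate(n = 30, m = 30, d = 3, shift = 0.8)  # F != G
```

Calling `rejection_rate` over a grid of `n` or `shift` values would give the points for the two power plots. The same loop could wrap any of the other three tests once each returns a p-value or a reject/accept decision.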
Question 2 Presentation and Slides
Based on your report, could you prepare a 3-5 minute presentation to summarize your findings and suggestions?
Assume your audience is the manager from this pharmaceutical company, who has only a very limited
statistics background. For this question, you need to submit a video (I need to see you in this video) and
your slides (both Rmd and pdf).
Question 3 R package (Bonus question: extra 10 points for the final project)
Could you prepare an R package that includes all four of your two-sample testing methods, together with a
manual that explains how these methods can be used? To finish this question, you need to submit a
compressed R package.
II. Real Data Problem (50 points)
The data for this project describe payments for child support made to a government agency. A “case” refers
to a legal judgment that an absent parent (abbreviated in variable names as “AP”) must make child support
payments. The data is distributed in four CSV files, which can be downloaded from Compass2g. The data
are distributed “as is” as obtained from the agency (albeit anonymized). Most categorical variables are
self-explanatory.
The file cases.csv has six columns, with one row for each case:
• CASE_NUM Unique case identifier
• CASE_STATUS ACV (active), IN_ (inactive), IC_ (closed), IO_ (legal), IS_ (suspend)
• CASE_SUBTYPE AO (arrears), EF (foster), MA (medical), NO (arrears), RA (regular), RN (regular)
• CASE_TYPE AF (AFDC), NA (non-afdc), NI (other)
• AP_ID Identifying number for absent parent
• LAST_PYMNT_DT Recorded date of last payment
The file parents.csv has 10 columns, with one row for each parent:
• AP_ID Unique identifier for parent
• AP_ADDR_ZIP Coded na for missing, 0 for “known unknown”, 1 for city, 2 south state, 3 north state,
4 other
• AP_DECEASED_IND AP is deceased
• AP_CUR_INCAR_IND AP is incarcerated
• AP_APPROX_AGE
• MARITAL_STS_CD Self-explanatory
• SEX_CD
• RACE_CD Categorical
• PRIM_LANG_CD Language code
• CITIZENSHIP_CD Citizenship code
The file children.csv has 9 columns:
• CASE_NUM Case number
• ID Unique identifier for child
• SEX_CD
• RACE_CD
• MARITAL_STS_CD Marital status code
• PRIM_LANG_CD Primary language
• CITIZENSHIP_CD
• DATE_OF_BIRTH_DT
• DRUG_OFFNDR_IND Past drug offence
The file payments.csv has only six columns, but more than 1.5 million records:
• CASE_NUM Case number for the payment
• PYMNT_AMT Dollar amount of payment
• COLLECTION_DT Date of payment
• PYMNT_SRC A (regular), C (worker comp), F (tax offset), I (interstate), S (st tax), W (garnish)
• PYMNT_TYPE A (cash), B (bank), C (check), D (credit card), E (elec trans), M (money order)
• AP_ID Absent parent ID
Question 1 File linkage integrity
(a) Read the four CSV files into R, building four data frames with the names “Cases”, “Parents”, “Children”
and “Payments”. Show the dimensions of these data frames. (You may find it useful to save these data
frames as Rdata objects in a file using the save command. You can then recover them with the load
command more quickly than reading the CSV file.)
(b) What is the distribution of the number of children attached to a case? Show an appropriate plot of the
distribution, and mark the location of the average number in the plot.
(c) The file children.csv may have more than one record for each child. What is the largest number of
cases associated with a single child? Explain why you believe that these records do refer to the same child.
(d) Does every absent parent (AP_ID) identified in the payments data have an identifying record in the
parents data file?
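The linkage checks in this question can be sketched with base R on toy data. The data frames below are invented stand-ins that only borrow the column names from the handout. The real files would be read with `read.csv` first.

```r
# Toy stand-ins for the real data frames (all values invented for illustration).
Parents  <- data.frame(AP_ID = c(101, 102, 103))
Payments <- data.frame(AP_ID = c(101, 101, 103, 104),
                       PYMNT_AMT = c(50, 75, 20, 10))
Children <- data.frame(CASE_NUM = c(1, 1, 2, 3, 3, 3),
                       ID       = c(11, 12, 13, 14, 15, 11))

# (a) Saving and reloading is faster than re-reading large CSV files.
rdata_file <- file.path(tempdir(), "child_support.RData")
save(Parents, Payments, Children, file = rdata_file)
load(rdata_file)

# (b) Number of children attached to each case, and the average.
kids_per_case <- as.vector(table(Children$CASE_NUM))
mean_kids <- mean(kids_per_case)

# (c) Number of distinct cases each child ID appears in.
cases_per_child <- tapply(Children$CASE_NUM, Children$ID,
                          function(x) length(unique(x)))
max_cases <- max(cases_per_child)

# (d) AP_IDs that appear in Payments but have no record in Parents.
missing_ids <- setdiff(unique(Payments$AP_ID), Parents$AP_ID)
```

In this toy example child 11 appears under two cases and parent 104 has payments but no parent record. For part (c), matching IDs alone do not show that the records describe the same child. Comparing the other columns, such as date of birth and sex, would be one way to support that.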
Question 2 Recoding categories
Some categorical variables among these data frames are sparse (seldom observed). For example, the variable
PYMNT_SRC in Payments has category ‘M’ with 2 cases and category ‘R’ with 7. These are too few for
modeling in regression.
Write a function named “pool_categories” that recodes a categorical variable into a “simpler” factor with
fewer categories by pooling categories with counts below a threshold into a category labeled ‘Other’ (a factor
level which your function should check does not already exist!). You might find the R function %in% useful
for this exercise.
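One possible shape for such a function is sketched below. The argument names `threshold` and `other` are assumptions, as is the choice to stop with an error when the pooled label already exists. Other designs, such as picking a different label automatically, would also satisfy the description.

```r
# Sketch: pool categories with fewer than `threshold` observations into one level.
pool_categories <- function(x, threshold, other = "Other") {
  x <- as.character(x)
  # Refuse to overwrite a genuine category that already uses the pooled label.
  if (other %in% x) stop("label '", other, "' already exists in the data")
  counts <- table(x)                          # NA values are not counted
  rare <- names(counts)[counts < threshold]   # categories below the threshold
  x[x %in% rare] <- other                     # NA stays NA
  factor(x)
}

# Toy example loosely modeled on PYMNT_SRC (counts are invented).
src <- c(rep("A", 10), rep("W", 5), "M", "M", rep("R", 3))
pooled <- pool_categories(src, threshold = 5)
pooled_counts <- table(pooled)
```

Here 'M' (2 observations) and 'R' (3 observations) fall below the threshold of 5 and are merged, leaving the levels A, Other and W.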
Question 3 Payment counts and amounts
You must use ggplot2 for generating the plots asked for in this question.
(a) Make a variable Payments$DATE which is a viable R date by converting the COLLECTION_DT
variable. Use this variable to find (i) the range of dates of all payments and (ii) the percentage of the
total number of payments made before May 1, 2015.
(b) Show a sequence plot of the total number of payments made on each day from May 1, 2015 through the
end of the data.
(c) What explains the bimodal shape of the marginal distribution of the number of payments over this
time period? Explain with some evidence how you reached your opinion.
(d) Describe the distribution of the payment amounts. Do you have an explanation for its shape? (You
might find it useful to work with a sample for plotting. R takes a while to draw 1.5 million points.)
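The date handling in parts (a) and (b) can be sketched as follows. The handout does not state the format of COLLECTION_DT, so the `"%m/%d/%Y"` format string below is purely an assumption and must be checked against the real file. The toy rows are invented.

```r
# Toy payments (invented). ASSUMPTION: dates are stored as month/day/year text.
Payments <- data.frame(
  COLLECTION_DT = c("04/28/2015", "05/01/2015", "05/01/2015", "05/04/2015"),
  PYMNT_AMT     = c(100, 50, 25, 80)
)

# (a) Convert to a proper Date, then summarize.
Payments$DATE <- as.Date(Payments$COLLECTION_DT, format = "%m/%d/%Y")
date_range <- range(Payments$DATE)
cutoff <- as.Date("2015-05-01")
pct_before <- 100 * mean(Payments$DATE < cutoff)   # percentage of payments before May 1, 2015

# (b) Number of payments per day from the cutoff onward.
recent <- Payments[Payments$DATE >= cutoff, ]
daily <- aggregate(PYMNT_AMT ~ DATE, data = recent, FUN = length)
names(daily)[2] <- "N_PAYMENTS"
```

The `daily` data frame has one row per day and could then be passed to ggplot2, for example `ggplot(daily, aes(DATE, N_PAYMENTS)) + geom_line()`, to produce the sequence plot the question asks for.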
Question 4 Most common parent
(a) Identify the parent with the most cases.
(b) Identify all of the different children associated with the cases of the parent identified in (a).
(c) What is the average age of these children, in years? Use their age as of Jan 1, 2017. (Fractions of a
year are fine.)
(d) Show a plot of the payment history for this parent.
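Parts (a) to (c) chain together naturally, as the sketch below illustrates on invented data. It assumes DATE_OF_BIRTH_DT can be parsed as ISO dates (the real format may differ) and measures age in years as days divided by 365.25, which is one common convention for fractional ages.

```r
# Toy data (invented); only the column names come from the handout.
Cases <- data.frame(CASE_NUM = 1:4, AP_ID = c(7, 8, 7, 7))
Children <- data.frame(
  CASE_NUM = c(1, 3, 3, 2),
  ID = c(11, 12, 11, 13),
  DATE_OF_BIRTH_DT = c("2010-01-01", "2014-07-01", "2010-01-01", "2012-03-15")
)

# (a) Parent with the most cases (ties resolve to the first maximum).
top_parent <- names(which.max(table(Cases$AP_ID)))

# (b) Distinct children on that parent's cases.
parent_cases <- Cases$CASE_NUM[as.character(Cases$AP_ID) == top_parent]
kids <- unique(Children[Children$CASE_NUM %in% parent_cases,
                        c("ID", "DATE_OF_BIRTH_DT")])

# (c) Age in years as of Jan 1, 2017. ASSUMPTION: ISO date strings.
ref <- as.Date("2017-01-01")
dob <- as.Date(kids$DATE_OF_BIRTH_DT)
age_years <- as.numeric(difftime(ref, dob, units = "days")) / 365.25
mean_age <- mean(age_years)
```

In the toy data child 11 appears on two of the parent's cases, so `unique` is needed to avoid counting that child twice in the average.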
Question 5 Payments for cases
The unit of analysis for this question is the payment behavior of an absent parent. Hence, if the parent is
involved in several cases, you will need to accumulate the relevant information. You may find it useful for
this and the next question to build a data frame for parents that collects the relevant information for each
parent. You may find dplyr useful here and elsewhere, but you don’t have to use it.
(a) It has been conjectured that parents deemed responsible for more children are more likely to make
either a larger number of payments or a larger total payment amount over this period. Is that true?
(b) It has been conjectured that parents responsible for younger children are more likely to make more
payments. Is the average age of the children of an absent parent associated with the total amount of
payments made by the absent parent? (Define a child’s age as the age on Jan 1, 2017.)
(c) Does the location of the parent (AP_ADDR_ZIP) anticipate the total amount of payments made by
the absent parent?
(d) Does the combination of attributes of the parent with the number and average age of the children
involved predict the total amount of payments made by a parent? Explain your results briefly. (Note:
It makes no sense to remove cases with missing values of a categorical variable. Missingness just defines
another category of the variable.)
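The note about missing values in part (d) can be made concrete. A minimal sketch, assuming a per-parent data frame has already been built: `addNA` turns NA into an explicit factor level, so `lm` keeps those rows instead of dropping them. The values and the `"missing"` label are invented for illustration.

```r
# Toy per-parent data (invented): location code with some missing values.
zip   <- c(1, 2, NA, 1, 3, NA, 2, 1)
total <- c(120, 300, 0, 90, 410, 15, 280, 150)   # total payments per parent

# Turn NA into its own category rather than dropping those parents.
zip_f <- addNA(factor(zip))
levels(zip_f)[is.na(levels(zip_f))] <- "missing"

fit <- lm(total ~ zip_f)
n_used <- nobs(fit)   # number of rows actually used in the fit
```

Without the `addNA` step, the default `na.action` would silently drop the two parents with a missing location and the fit would use 6 rows instead of 8.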
Question 6 Consistency
Again, the unit of analysis for this question is an absent parent. An important aspect of payments is the
consistency of the payments over time. A steady income stream is, for many, preferable to a highly volatile,
unpredictable payment schedule, even if the latter has a higher average.
(a) Among all parents who made payments, is there any association between the SD of total daily payments
and the average of total daily payments?
(b) The coefficient of variation (CV) is the ratio of the SD of daily payments to the mean. Show time
sequence plots of the payments of 3 parents, with low, medium and high CV. That is, find three
representative parents who make payments. One of these three should have a high CV, another a
medium CV, and a third a low CV.
(c) Is the CV of payments associated with the total amount of payments over this time period?
(d) Do any attributes of the parent as revealed in these data anticipate that the parent will make consistent
payments, that is, have small CV?
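The per-parent SD, mean and CV can be sketched as below. One assumption needs stating: "daily payments" is taken here to mean the total paid on each day that has at least one payment, so days without payments are not counted as zeros. The other reading would give different values. The toy data are invented.

```r
# Toy payments (invented): parent 1 pays steadily, parent 2 erratically.
Payments <- data.frame(
  AP_ID = c(1, 1, 1, 1, 2, 2, 2),
  DATE = as.Date(c("2015-05-01", "2015-05-01", "2015-06-01", "2015-07-01",
                   "2015-05-01", "2015-06-01", "2015-07-01")),
  PYMNT_AMT = c(50, 50, 100, 100, 10, 100, 400)
)

# Total paid by each parent on each day with a payment.
daily <- aggregate(PYMNT_AMT ~ AP_ID + DATE, data = Payments, FUN = sum)

# Per-parent mean, SD and coefficient of variation of the daily totals.
daily_mean <- tapply(daily$PYMNT_AMT, daily$AP_ID, mean)
daily_sd   <- tapply(daily$PYMNT_AMT, daily$AP_ID, sd)
cv <- daily_sd / daily_mean
```

Parent 1 pays 100 on every payment day and so has a CV of 0, while parent 2 has a CV above 1. A parent with a single payment day has an undefined SD (`NA`), so such parents need to be handled explicitly before ranking by CV.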