STAT 3280 Spring 2020 HW3
Due on Apr 12.
Submit your homework by sending it to both and ,
with subject “STAT 3280-Section#-HW3: names”, where the section number is “001” if you
are in the 11-12:15 section and “002” if you are in the 12:30-1:45 section, and “names” should be
replaced by the last name(s) of the group members. Each group only has to submit once, but make
sure you include everyone’s name. Please use a separate page for each problem. The answer to each
problem cannot be longer than one page (with reasonable font size, line spacing, margins, etc.).
You can explain how you did it in R by submitting your code with detailed explanations, but
only include this part in an appendix. For this homework, you can use whatever R package
you want.
Because of the recent COVID-19 outbreak, I have decided to replace some of the problems I had
originally planned with a new problem on analyzing COVID-19 data. I strongly encourage you to work
on the problems with careful thought and sufficient effort. To allow for that, I have reduced the
number of problems in this homework (I originally planned for 5, but now there are only 3).
For Q1, you are supposed to give a single figure. For each of Q2 and Q3, you can use up
to 2 pages for your analysis and add text (including paragraphs) to explain your results.
However, your component cannot exceed one page for each of Q2 and Q3.
I will include both Q2 and Q3 as options for the presentation, and you can select
either one of the two to present (but you need to work on both for your submission). We will
work out a plan for the presentation in this remote teaching scenario later on. The presentation
will count for 5 points. Q1 is worth 5 points, and Q2 and Q3 are worth 15 points each. Together with
the presentation, HW3 is worth 40 points. I strongly recommend starting early on
this homework, especially because for the foreseeable future most of you will be working remotely
with your team members. The deadline for HW3 is a hard deadline. We need to reserve
enough time for HW4 and the group presentation, and there is no way we can extend it.
1. (5 pts) We will introduce how to generate the US airline plot in class. Now,
based on the larger data set in the airport folder, generate the global airline network
plot, similar to the one in the slides.
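As a rough starting point for Q1, the sketch below shows one way to turn an airport table and a route table into line segments that can be drawn. The data frames, column names (`id`, `lon`, `lat`, `src`, `dst`) and the helper `route_segments` are hypothetical stand-ins, not the actual layout of the files in the airport folder, and the straight lines in longitude/latitude are a simplification of what the slides show.

```r
# Toy stand-ins for the airport and route tables (hypothetical columns).
airports <- data.frame(
  id  = c("JFK", "LHR", "NRT"),
  lon = c(-73.78, -0.46, 140.39),
  lat = c(40.64, 51.47, 35.76),
  stringsAsFactors = FALSE
)
routes <- data.frame(
  src = c("JFK", "LHR", "JFK"),
  dst = c("LHR", "NRT", "NRT"),
  stringsAsFactors = FALSE
)

# Attach coordinates to both endpoints of every route.
# Routes whose airports are missing from the airport table are dropped.
route_segments <- function(airports, routes) {
  seg <- merge(routes, airports, by.x = "src", by.y = "id")
  names(seg)[names(seg) %in% c("lon", "lat")] <- c("lon0", "lat0")
  seg <- merge(seg, airports, by.x = "dst", by.y = "id")
  names(seg)[names(seg) %in% c("lon", "lat")] <- c("lon1", "lat1")
  seg
}

seg <- route_segments(airports, routes)

# Draw airports as points and routes as straight segments.
plot(airports$lon, airports$lat, pch = 19,
     xlim = c(-180, 180), ylim = c(-90, 90),
     xlab = "Longitude", ylab = "Latitude")
segments(seg$lon0, seg$lat0, seg$lon1, seg$lat1, col = "grey40")
```

With the real data, a world map backdrop and curved (great-circle) routes would likely read better than straight segments; packages such as maps and geosphere may help with that.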
2. (15 pts) In the folder “statisticians”, you can find data on statisticians’ publications
in 4 journals during 2002-2012. Look at the ReadMe file in the folder to understand
the data set. You can also consult the paper at https://arxiv.org/abs/1410.2840
for a detailed analysis if you want. We will explore the data set a bit in class and test
basic visualizations. You are expected to extend the analysis as follows:
1. Use the abstract keywords and visualize the keywords over time. Do you observe
any trend?
2. Ego network exploration: pick one of the few statisticians with a large number of
citations and/or collaborations.
(a) Visualize how his/her citation network changes over time. Note that it makes more
sense to visualize it cumulatively (e.g., for 2010, take all citation
relations he/she had up to 2010). Highlight and comment on the expansion
pattern.
(b) Similarly, visualize his/her collaboration network change over time. Highlight
and comment on the expansion pattern. Comment on the difference between
this one and the citation networks.
(c) Visualize the keywords from his/her articles in the data set. Is the trend for
this person similar to the global trend?
(d) Using Google Scholar, find all the paper titles under his/her name. Again, visualize
the word cloud from the titles over time, for the person’s whole academic career.
Does the period of 2002-2012 seem to match what you observe in (c)? If
not, give a brief, intuitive explanation. (Hint: you do not have to segment the
data by year if you feel that is too fine or you do not have enough data
for each year. It is fine to aggregate your data across several years.)
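The cumulative view suggested in (a) and (b) can be sketched as follows. This is only an illustration on a made-up edge list: the column names (`from`, `to`, `year`), the author labels and the helper `ego_edges_upto` are hypothetical and do not reflect the actual format described in the ReadMe file.

```r
# Toy citation edge list with the year in which each relation appears
# (hypothetical columns and labels).
citations <- data.frame(
  from = c("A", "B", "A", "C", "D"),
  to   = c("B", "A", "C", "D", "A"),
  year = c(2003, 2005, 2005, 2008, 2010),
  stringsAsFactors = FALSE
)

# Keep every edge that touches the ego and appeared on or before the cutoff.
ego_edges_upto <- function(edges, ego, cutoff) {
  keep <- (edges$from == ego | edges$to == ego) & edges$year <= cutoff
  edges[keep, , drop = FALSE]
}

# Number of distinct neighbours of ego "A" in each cumulative snapshot.
cutoffs <- c(2004, 2008, 2012)
ego_size <- sapply(cutoffs, function(y) {
  e <- ego_edges_upto(citations, "A", y)
  length(setdiff(unique(c(e$from, e$to)), "A"))
})
names(ego_size) <- cutoffs
print(ego_size)
```

Each snapshot returned by `ego_edges_upto` could then be handed to a graph package such as igraph for drawing, ideally with a fixed layout so that newly added nodes stand out from one period to the next.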
3. (15 pts) In class, we will be extracting the latest coronavirus incident data from Johns
Hopkins University’s Center for Systems Science and Engineering. There are several
data science tools globally following the progress of this disease as well as an R package
coronavirus: The 2019 Novel Coronavirus COVID-19 (2019-nCoV) that provides
a daily summary of the Coronavirus (COVID-19) cases by state/province. We will also
learn the join operation in class, as our preparation for database later on.
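As a small preview of the join operation mentioned above, here is a minimal sketch in base R using `merge`. The two tables and their numbers are invented for illustration. The point is that a left join keeps every row of the first table and fills in `NA` where the key has no match, which makes mismatched country names easy to spot.

```r
# Toy tables (made-up numbers); the key column is `country`.
cases <- data.frame(
  country   = c("Italy", "US", "Spain"),
  confirmed = c(100, 250, 80),
  stringsAsFactors = FALSE
)
pop <- data.frame(
  country    = c("Italy", "United States", "Spain"),
  population = c(60e6, 330e6, 47e6),
  stringsAsFactors = FALSE
)

# Left join: keep all rows of `cases`, attach `population` where the key matches.
joined <- merge(cases, pop, by = "country", all.x = TRUE)

# Keys that failed to match end up with NA in the joined column.
unmatched <- joined$country[is.na(joined$population)]
print(joined)
print(unmatched)
```

Different sources often spell country names differently, so checking for unmatched keys after a join and recoding them before continuing is usually worthwhile.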
1. In class, we will visualize the number of confirmed cases over time for countries.
Generate a graph to visualize the number infected per million people (instead of the total
number infected). The World Bank is a tremendous source of global socio-economic data
(for example, country populations) and can be accessed via the R package wbstats. Have a
look at its HTML documentation on CRAN.
2. Similar to (1), visualize the ratio between deaths and recoveries over time for different
countries.
3. (Only for presentation) As an extension, give an animated illustration of the results
in (1) and (2).
4. Imagine you are part of a working group that is to provide the US government with
recommendations on various policies or strategies to face the challenge of the
virus. Is the current spread of the virus in the US still under control? If
not, what else should be done? Would a travel ban (international, domestic, or both)
be effective? Would China's lockdown strategy be effective? As part of the
team, your task is to use the data provided by Johns Hopkins University to generate
a few key visualizations that will contribute towards making this decision.
Please justify why these graphics are useful and suggest the potential “decision” they
may lead to. Hints: you might start by checking the following aspects of the data:
• Are there certain states in the US that are more at risk than others?
• Compare the pattern or rate of increase in different countries, in different periods.
• Compare the cross-over point between the number of recovered and active cases.
• Examine the geographical distribution of confirmed cases around the world.
• Perhaps use some external information. For example, you might want to get an
overview of the categories of strategies used by other countries so far, especially
those countries that are further along than the US. As another example, the OpenFlights
data set you use for Q1 may provide helpful information about international
travel flow between countries, and thus give you some information about the impact
of travel.
Remark: note that the only way to rigorously discover the true effects of a policy
is causal inference. However, in practice, causal inference has many strong
requirements and may not be feasible in such emergencies. Therefore, exploring
observational data like these is usually the only option for data scientists. There
are hundreds of teams working intensively on evaluating various strategies. For example,
a recent paper published in Science uses statistical infection models to evaluate
the potential effects of travel restrictions within China and out of China. URL:
https://science.sciencemag.org/content/early/2020/03/05/science.aba9757. In this
problem, you do not need to use such advanced methods. Exploring your data with
carefully designed operations can already give you very insightful findings.
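As one example of the kind of simple operation meant here, the sketch below computes the two derived quantities from sub-questions 1 and 2, cases per million and the death-to-recovered ratio, on a made-up table. The countries, numbers and column names are all hypothetical and only illustrate the arithmetic, not the structure of the Johns Hopkins data.

```r
# Toy long-format table: one row per country and date (made-up numbers).
covid <- data.frame(
  country    = rep(c("X", "Y"), each = 3),
  date       = rep(as.Date("2020-03-01") + 0:2, times = 2),
  confirmed  = c(10, 20, 40, 5, 15, 45),
  deaths     = c(0, 1, 2, 0, 0, 3),
  recovered  = c(0, 2, 4, 0, 1, 6),
  population = rep(c(2e6, 5e6), each = 3),
  stringsAsFactors = FALSE
)

# Confirmed cases per million inhabitants.
covid$per_million <- covid$confirmed / covid$population * 1e6

# Deaths divided by recoveries; undefined (NA) while nobody has recovered yet.
covid$death_recovered <- ifelse(covid$recovered > 0,
                                covid$deaths / covid$recovered, NA)

# One line per country for the per-million series.
plot(range(covid$date), range(covid$per_million), type = "n",
     xlab = "Date", ylab = "Confirmed cases per million")
countries <- unique(covid$country)
for (i in seq_along(countries)) {
  d <- covid[covid$country == countries[i], ]
  lines(d$date, d$per_million, type = "b", lty = i)
}
legend("topleft", legend = countries, lty = seq_along(countries))
```

Setting the ratio to `NA` while there are no recoveries avoids division by zero. How to treat those early days in a plot is a design choice worth stating in your write-up.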