MSDS 16:954:597
Data Wrangling and Husbandry
Final Project Instructions
Spring 2020
1 Project Instructions
For the final project for the course, your assignment is essentially to wrangle some
data and to show off your skills. I am intentionally not specific about how you do
so, but you have the weekly assignments as models. Think of the project as the
equivalent of chaining together multiple weeks of assignments: you should bring
data into R, clean it, tidy it, perhaps create new variables, perhaps summarize your
data, and report on it with tables and figures. However, there are some required
elements:
• You must get your data from at least two distinct sources, at least one of
which must be at least somewhat difficult to work with (requires scraping or
cleaning).
• You must use Git and GitHub to manage your project. If you have not done
so already, please create a GitHub account.
• All of your code, including the R Markdown file, should run from within the
project's own directory, without relying on any files or code outside it.
• Every code chunk must be labeled.
• You must include a step where you save a tidy version of (perhaps just some
of) your data as a csv file. The idea is that the csv file would be an easy place
for someone else to start from.
• Your report, generated from an R Markdown file, should be as good looking
and well formatted as you can make it — that includes tables and figures. Do
not use echo = TRUE except as truly needed.
• We have not done statistical analyses more sophisticated than correlation and
linear regression in this course and there is no need for more advanced analyses
in your report. You can do so if you wish, however.
• If some parts of your project are relatively easy, you should balance that out
by going into more depth in other aspects.
• Your report should explain the steps you’ve taken and why — I do not want
to see just a collection of tables and figures. Feel free to describe approaches
that didn’t work or were more troublesome than expected.
• I expect that you will discuss this project with others, but please avoid using
datasets in common (I realize that might still happen by coincidence). All of
the work submitted must be your own. Be sure to credit the sources of your
data and any other material — it is better to over-credit than to under-credit.
If you have any questions about properly crediting others’ work, just see me
about it.
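As an illustration of the "save a tidy version as a csv file" requirement, here is a minimal base R sketch. The data frame, its column names, and the output path data/tidy_city_year.csv are all made up for the example; tidyverse functions would work equally well. The path is relative so that the code runs from within the project directory.

```r
# Hypothetical wide table: one row per city, one column per year.
wide <- data.frame(
  city = c("Newark", "Trenton"),
  `2018` = c(10, 20),
  `2019` = c(12, 18),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Tidy it: one row per city-year observation.
year_cols <- setdiff(names(wide), "city")
tidy <- data.frame(
  city  = rep(wide$city, times = length(year_cols)),
  year  = rep(as.integer(year_cols), each = nrow(wide)),
  value = unlist(wide[year_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Save with a relative path (hypothetical file name) inside the project.
out_path <- file.path("data", "tidy_city_year.csv")
dir.create(dirname(out_path), showWarnings = FALSE, recursive = TRUE)
write.csv(tidy, out_path, row.names = FALSE)
```

Someone cloning the repository could then start from read.csv("data/tidy_city_year.csv") without rerunning the scraping or cleaning steps.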
Data Wrangling and Husbandry April 2, 2020
2 Presentation or Written Project
Students will be given the following two options for their final project.
(a) Give a 5-10 minute presentation of your project during our last class on
Monday, May 4 (think 5-10 slides). Besides the presentation, you will also turn in
your slides and other components required for the project, including a 5-page
report of your project. Students who give a presentation will have until the
end of that week (May 8) to turn in their project. Because everything is now
virtual, be sure you have a working webcam with microphone.
(b) Produce a 10-page formal report. Students who hand in a formal report will
turn in their project at the time of the last class (May 4).
In either case, focus on why you were interested in the datasets, some of the issues
in wrangling them, and a few interesting figures or tables. While keeping in mind that
what was time-consuming for you may not be interesting for others, remember that
the course has emphasized mechanics and that your classmates may very well be
interested in, say, what regular expression you used to reformat a particular column.
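For instance, a regular-expression detail worth showing might look like the following sketch. The column contents and the target format are invented for illustration: it rewrites MM/DD/YYYY strings as YYYY-MM-DD with base R's sub() and leaves non-matching values untouched.

```r
# Hypothetical raw column with dates stored as MM/DD/YYYY strings.
raw_dates <- c("04/02/2020", "05/04/2020", "unknown")

# Capture month, day, and year, then reorder them as YYYY-MM-DD.
# Values that do not match the pattern are returned unchanged.
iso_dates <- sub(
  "^([0-9]{2})/([0-9]{2})/([0-9]{4})$",
  "\\3-\\1-\\2",
  raw_dates
)

print(iso_dates)
# "2020-04-02" "2020-05-04" "unknown"
```

Anchoring the pattern with ^ and $ keeps it from rewriting part of a longer string by accident.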
3 Procedures and Dates
Submit (via Canvas) a short description of your data and plans for it by Tuesday,
April 7. I will also ask for your preference for a presentation versus report as part
of that “assignment”. The description should include links to your data sources.
There is no grade associated with this part.
You will submit your final project (again, via Canvas) by giving the URL to
clone your GitHub repository. Also submit any API keys required using the format
api.key.. <- "abcdefg".
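If the keys are kept out of the repository itself, one possible arrangement is sketched below. This is an assumption rather than a requirement: the file name api_keys.R and the variable name api.key.example are hypothetical placeholders, and the file would hold assignments in the format given above.

```r
# Hypothetical file holding the key assignments; assumed not to be committed.
key_file <- "api_keys.R"

if (file.exists(key_file)) {
  # Defines the key variables in the current environment.
  source(key_file)
} else {
  message("No ", key_file, " found; chunks that need an API key cannot run.")
}

# Hypothetical variable name; check for it before calling the API.
have_key <- exists("api.key.example")
```

Chunks that call the API could then be guarded with if (have_key) so the rest of the report still knits without the key.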
Your final project will be graded holistically, but I will be looking at these elements:
• That you have demonstrated your ability to use R to accomplish your tasks.
• That your code is easy to understand and has supporting comments.
• That your report is well-written (i.e., clear and concise) with well-presented
tables and figures.