Take-home assignment #2
Applied Econometrics
Fall 2024
Due date: September 27 (Friday), by 10:00pm (U.S. Eastern Time)
Submission of late work: Please see syllabus for details on late submission policy
· Please see syllabus for the policy on the submission of late work (i.e., past the deadline). Please note that there are no exceptions to this policy.
· You may complete this assignment individually (by yourself) or within a group of up to three people. In other words, a group may not have more than three students.
· If you work within a group, then only one of the group members needs to upload the assignment into Canvas. Please make sure that the names of each group member are clearly written on the first page of the assignment, otherwise credit will not be given to names that are not written on that first page. Please note that there are no exceptions to this rule; if you worked within a group but your name is not written on the first page of the assignment, then you will not receive credit for the assignment.
· When an assignment involves working with data and/or coding, you do not need to submit the data or the code that you used to solve the assignment. You only need to submit your tables, charts, and discussions.
· Your submitted work should all be within one single PDF document. Please note that there are no exceptions to this rule. That is, assignments that are not in PDF format will not be accepted. (Please note that “Google”-format documents are not accepted.) I would recommend that you first do your work in (say) a Word document, then convert it to PDF, and finally upload it into Canvas. Additionally, please make sure to check the format and content of your final PDF document before submitting it, to make sure it is what you intend it to be.
· If the assignment involves creating charts, I suggest that you take a screenshot of each chart, paste the screenshots (one at a time) into the Word document, and then convert it to PDF. (Please recall that we ask that you submit only one document.)
· Please note that submission by email is not permitted. You will not receive credit if you submit your assignment by email. There are no exceptions to this policy.
· An individual (or your “group,” if you are working within a group) is not allowed to share information with any other group or individual. Doing so would be a violation of Northeastern’s policy on academic integrity. Please note that there are no exceptions to this policy.
· Please note that Turnitin will be used to check submitted work for plagiarism. Turnitin compares documents across all submissions in this class and also against submissions from other classes. It also checks against documents submitted in past classes.
· Please show all your work to receive full credit. If you provide just the answer, without showing how you derived it, then you will not receive full credit.
· Instructions on how to upload your (or your group’s) assignment into Canvas: https://community.canvaslms.com/docs/DOC-10663-421254353 .
· Please do not hesitate to reach out with any questions!
1. (Summarizing data numerically and visually) One of the most important aspects of analyzing data is the ability to summarize it efficiently. An important tool that allows us to summarize data efficiently is graphs. Another important tool that also allows us to summarize data efficiently is a table of descriptive statistics. This type of table displays, within one single page, some key statistics such as average, standard deviation, skew, min, max, and percentiles for each of the variables within a dataset. Below is an example of such a table, which summarizes a dataset on 270 MSAs (Metropolitan Statistical Areas) across the U.S. That table takes the original dataset, which is at the MSA-level, and provides statistics on the share of the insured population across the MSAs, on real GDP per capita across the MSAs, on the number of microbreweries across the MSAs, and on several other variables. (Please note that the “p”s represent percentiles; for example, “p5” is fifth percentile.)
This question asks you to create some graphs and a table of descriptive statistics for a different dataset. Specifically, attached to this assignment is a cross-sectional dataset that contains information at the county level for the year 2012. For each county in the U.S., it contains data on, for example, the percentage of the population that is obese, median income, the number of fast-food restaurants per 100,000 people, the share of the population that is Hispanic, the percentage of the population that has a high school diploma or higher, and so on. This data was compiled from the following sources:
· U.S. Census Bureau
· U.S. Centers for Disease Control and Prevention (CDC)
· U.S. Bureau of Economic Analysis (BEA)
· U.S. Department of Agriculture (USDA)
· U.S. Department of Health and Human Services (DHHS)
· FBI Uniform Crime Reporting
a) Please use statistical software (R, Stata, Python, etc.; please do not use Excel) to create histograms for two of the variables (of your choice). Next, please also create two scatter diagrams for variables of your choice. Finally, please provide a brief write-up about your findings. What are some meaningful insights that you observe from your graphs? Are there any interesting patterns in the data?
As you construct these graphs with your favorite statistical software, please keep in mind the “do’s” and “don’ts” of graphs, as discussed in the recorded lecture video about graphing that is posted in the Modules section of our Canvas course site. If you haven’t watched that lecture video yet, please do so. Points will be deducted if those graphing guidelines are not followed. In other words, for this particular question, your grade will depend on, among other things, “how good and ‘neat’ your graphs look, and whether the graphs are displaying the information in an effective manner.”
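For those working in Python, the sketch below shows one possible way to draw a labeled histogram and a labeled scatter diagram with matplotlib. It is only an illustration: the helper names `plot_histogram` and `plot_scatter` are hypothetical, the numbers are simulated placeholders rather than the assignment data, and the styling choices are assumptions that do not replace the guidelines from the lecture video.

```python
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt


def plot_histogram(values, xlabel, title, bins=20):
    """Draw a histogram with a title and labeled axes (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.hist(values, bins=bins, edgecolor="black")
    ax.set_xlabel(xlabel)
    ax.set_ylabel("Number of counties")
    ax.set_title(title)
    return fig, ax


def plot_scatter(x, y, xlabel, ylabel, title):
    """Draw a scatter diagram with a title and labeled axes (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(x, y, s=10, alpha=0.6)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    return fig, ax


# Simulated placeholder values, NOT the assignment data.
random.seed(0)
income = [random.gauss(45, 10) for _ in range(300)]
obesity = [35 - 0.1 * m + random.gauss(0, 3) for m in income]

fig1, ax1 = plot_histogram(
    obesity,
    "Obesity rate (% of population)",
    "Distribution of obesity rates (simulated data)",
)
fig2, ax2 = plot_scatter(
    income,
    obesity,
    "Median income (thousands of dollars)",
    "Obesity rate (% of population)",
    "Obesity rate versus median income (simulated data)",
)
```

Each figure can then be exported with, for example, `fig1.savefig("histogram.png", dpi=200)` and pasted into the write-up.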
b) Next, please use statistical software (R, Stata, Python, etc.; please do not use Excel) to create a table of descriptive statistics for this dataset. Please follow the format of the table that is displayed below, because that is the format typically used in economics. In other words, your table should have an informative title, should fit within one single page (and not spill over across multiple pages), should have a footnote, the description of the variables should be in the first column and be self-explanatory, the names of the statistics should be on top (as a row), the table should be as clutter-free as possible (for example, there is no need to display more than two decimal places for each number), it should be free of typos and grammatical errors, its columns should be aligned, etc. In other words, the “look” of the table matters, and its look will be reflected in the grade that you receive.
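As a rough illustration of the layout described in part (b), the following sketch computes the listed statistics with Python's standard library and prints them with the variables as rows and the statistic names as a header row. The function names `describe` and `format_table`, the variable labels, and the numbers are all hypothetical placeholders. The skewness formula used here (third standardized moment, population version) is an assumption, since statistical packages differ slightly in how they define it.

```python
import statistics

STAT_NAMES = ["N", "Mean", "SD", "Skew", "Min", "p5", "p25",
              "Median", "p75", "p95", "Max"]


def describe(values):
    """Return the summary statistics for one variable as a dict."""
    n = len(values)
    mean = statistics.fmean(values)
    # Skewness as the third standardized moment (population version).
    # Assumption: other software may apply a small-sample correction.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    skew = m3 / m2 ** 1.5
    # 99 cut points: pct[k - 1] is the k-th percentile.
    pct = statistics.quantiles(values, n=100, method="inclusive")
    return {
        "N": n, "Mean": mean, "SD": statistics.stdev(values), "Skew": skew,
        "Min": min(values), "p5": pct[4], "p25": pct[24],
        "Median": statistics.median(values), "p75": pct[74],
        "p95": pct[94], "Max": max(values),
    }


def format_table(columns):
    """columns maps a self-explanatory variable label to a list of numbers."""
    lines = [f"{'Variable':<32}" + "".join(f"{s:>9}" for s in STAT_NAMES)]
    for label, values in columns.items():
        stats = describe(values)
        cells = "".join(
            f"{stats[s]:>9d}" if s == "N" else f"{stats[s]:>9.2f}"  # two decimals
            for s in STAT_NAMES
        )
        lines.append(f"{label:<32}" + cells)
    return "\n".join(lines)


# Placeholder numbers, NOT the assignment data.
example = {
    "Obesity rate (% of population)": [24.1, 28.3, 30.5, 31.2, 33.8, 36.0],
    "Median income ($ thousands)": [31.0, 38.5, 42.0, 44.7, 51.3, 88.9],
}
table = format_table(example)
print(table)
```

An informative title and a footnote (for example, listing the data sources) would still need to be added around the printed rows.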
c) From directly inspecting the table that you just created, and directly using the numbers in the table (and not in the raw data), please answer the following:
i. Which variables (if any) are moderately skewed right? Please explain why.
ii. Which variables (if any) are strongly skewed left? Please explain why.
iii. Suppose there is a particular county whose obesity rate is 19.7%. What is the z-score of this county for obesity rate?
iv. Please note that, for certain variables, the mean is meaningfully larger (say, more than 50% larger) than the median. Please explain, making reference to one of the statistics displayed in the table, why that is the case.
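The calculations behind part (c) can be sketched in a few lines of Python. The z-score is the distance of an observation from the mean, measured in standard deviations. The helper names, the mean and standard deviation values, and the skewness cutoffs of 0.5 and 1 are assumptions for illustration only: the actual mean and standard deviation must be read from your own table, and the cutoffs are a common rule of thumb that may differ from the convention used in class.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


def skew_label(skew):
    """Classify a skewness statistic using rule-of-thumb cutoffs (an assumption)."""
    size = abs(skew)
    if size < 0.5:
        return "approximately symmetric"
    strength = "moderately" if size <= 1.0 else "strongly"
    direction = "right" if skew > 0 else "left"
    return f"{strength} skewed {direction}"


# Hypothetical table entries, NOT taken from the assignment data.
hypothetical_mean = 30.0
hypothetical_sd = 4.0

print(z_score(19.7, hypothetical_mean, hypothetical_sd))
print(skew_label(0.8))
print(skew_label(-1.6))
```

A negative z-score means the county lies below the mean, and a positive skewness statistic indicates a longer right tail.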
d) Next, let’s run a single-variable OLS regression. Specifically, please select, from the dataset, a “Y” variable of your choice (the “dependent” variable in the regression) and a single “X” variable of your choice (the “independent” variable, also called the “regressor”). These two variables should be chosen such that it makes sense that an economist would want to estimate “the effect of the X on the Y,” and not the other way around. (For example, it makes economic sense to estimate the effect of house size (X) on house price (Y), and not the other way around.)
Next, please run an OLS regression of Y on X. Carefully interpret the meaning of the two “beta” coefficients that you obtained in the regression. Finally, also carefully interpret the meaning of the “R²” coefficient that you obtained in the regression.
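To make the pieces of the regression output concrete, here is a minimal sketch of single-variable OLS written out by hand in Python. The function name `ols_fit` and the house-size numbers are hypothetical; in practice a statistical package would report the same quantities along with standard errors. The slope is the sample covariance of X and Y divided by the sample variance of X, the intercept makes the fitted line pass through the point of means, and R² is the share of the variation in Y that the fitted line accounts for.

```python
def ols_fit(x, y):
    """Fit y = beta0 + beta1 * x by ordinary least squares.

    Returns (beta0, beta1, r_squared).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = s_xy / s_xx              # slope: change in Y per one-unit change in X
    beta0 = y_bar - beta1 * x_bar    # intercept: predicted Y when X equals zero
    ss_res = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot  # share of the variation in Y explained by X
    return beta0, beta1, r_squared


# Hypothetical numbers, NOT the assignment data:
# house size in thousands of square feet (X), price in thousands of dollars (Y).
size = [1.0, 1.5, 2.0, 2.5, 3.0]
price = [200.0, 260.0, 310.0, 390.0, 430.0]

beta0, beta1, r2 = ols_fit(size, price)
print(f"intercept = {beta0:.2f}, slope = {beta1:.2f}, R-squared = {r2:.2f}")
```

With these units, the slope would be read as the estimated change in price (in thousands of dollars) associated with one additional thousand square feet.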