Assignment 3: Exploratory data analysis
For SOC252H1, Fall 2024
Exploratory data analysis for a research paper
For this assignment, you will choose a dataset to analyze.
Many websites and repositories host datasets suitable for this type of analysis. However, remember that data found online may not be clean and ready for analysis. Here are a few ideas:
1. Any of the datasets that we have used in this course.
2. The following website lists datasets used for teaching in R, with documentation: https://vincentarelbundock.github.io/Rdatasets/datasets.html
3. City of Toronto Open Data (similar portals exist for Ontario and for national Canadian data): https://www.toronto.ca/city-government/data-research-maps/open-data/
4. For census and survey data from around the world: https://www.ipums.org/
5. A variety of surveys and data deposited by researchers: https://www.icpsr.umich.edu/web/pages/
Once you have identified the dataset of your choice, find its documentation and read it to understand how the data were collected and which variables the dataset contains.
Caution
1. Read the available documentation carefully, as the questions require you to be familiar with the dataset.
2. If your submission (code, output, and explanation) is not easily readable as a PDF or Word document, there will be a 10% penalty.
Question 1:
In your own words, very briefly describe the dataset you have chosen (for example, what type of survey it is) and explain why you chose it (max 200 words). [10 points]
Question 2:
Univariate analysis: Provide a summary table of the variables you will use and describe their distributions.
Grading criteria: [20 points]
• Maximum 2 tables
• Maximum 2 figures
• Max 150 words explanation of the distributions (fewer words are ok).
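As an illustration of the kind of output Question 2 asks for, here is a minimal sketch in base R. It assumes the built-in mtcars dataset as a stand-in for your own data; the variable choices and the name am_f are illustrative assumptions, not requirements of the assignment.

```r
# Illustration only: mtcars ships with base R and stands in for your own dataset.
data(mtcars)

# Treat transmission type as a categorical variable (0 = automatic, 1 = manual).
mtcars$am_f <- factor(mtcars$am, levels = c(0, 1),
                      labels = c("automatic", "manual"))

# Continuous variables to summarize.
vars <- mtcars[, c("mpg", "wt", "hp")]

# Summary table: one row per continuous variable.
summary_table <- data.frame(
  variable = names(vars),
  n        = sapply(vars, function(x) sum(!is.na(x))),
  mean     = sapply(vars, mean, na.rm = TRUE),
  sd       = sapply(vars, sd, na.rm = TRUE),
  median   = sapply(vars, median, na.rm = TRUE),
  min      = sapply(vars, min, na.rm = TRUE),
  max      = sapply(vars, max, na.rm = TRUE),
  row.names = NULL
)
print(summary_table, digits = 3)

# Frequency table for the categorical variable.
am_counts <- table(mtcars$am_f)
print(am_counts)

# One figure: histogram of the outcome variable.
hist(mtcars$mpg, main = "Distribution of fuel economy",
     xlab = "Miles per gallon")
```

With your own data, replace mtcars and the variable names, and check the documentation for how missing values are coded before computing summaries.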
Question 3:
Bivariate relationship: Identify your dependent variable (outcome) and 2-3 independent variables. Explore the relationship between your dependent variable and each of your independent variables using appropriate methods and visualizations.
Grading criteria: [30 points]
• Use a maximum of 1 table and/or 1 figure for each relationship explored.
– For example, if you have two independent variables, you are allowed two tables and two figures - fewer is ok.
• Max 100 words explanation of each relationship.
– For example, if you have two independent variables, you are allowed 200 words.
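The bivariate exploration in Question 3 could be sketched as follows in base R. This again assumes the built-in mtcars dataset purely for illustration, with mpg as the outcome and wt and am_f (a hypothetical recoded factor) as independent variables; the appropriate method depends on the measurement level of your own variables.

```r
# Illustration only: mtcars stands in for your own dataset.
data(mtcars)
mtcars$am_f <- factor(mtcars$am, levels = c(0, 1),
                      labels = c("automatic", "manual"))

# Continuous outcome vs. continuous predictor: correlation and scatter plot.
r_mpg_wt <- cor(mtcars$mpg, mtcars$wt, use = "complete.obs")
print(r_mpg_wt)
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Continuous outcome vs. categorical predictor: group means and box plot.
group_means <- aggregate(mpg ~ am_f, data = mtcars, FUN = mean)
print(group_means)
boxplot(mpg ~ am_f, data = mtcars,
        xlab = "Transmission", ylab = "Miles per gallon")
```

For two categorical variables, a cross-tabulation with table() and prop.table() would be the analogous starting point.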
Question 4:
Identify your main regression model. [20 points]
i. Clearly identify the dependent and independent variables. You choose how many independent variables you want to include.
ii. Create a directed acyclic graph (DAG) of the relationships you expect between the variables and briefly explain it.
iii. Identify the type of regression (linear or logistic) and write out the regression formula, identifying each component of the formula in a brief text.
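For reference, the two regression types named in part iii are commonly written as below. This is a generic sketch with two independent variables; the symbols Y_i, X_{1i}, X_{2i}, and p_i are placeholders, and you should substitute your own variable names and number of predictors.

```latex
% Linear regression: continuous outcome Y_i for observation i
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i

% Logistic regression: binary outcome, with p_i = \Pr(Y_i = 1)
\log\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}
```

Here \beta_0 is the intercept, \beta_1 and \beta_2 are the coefficients on the independent variables, and \varepsilon_i is the error term of the linear model. In the logistic model, the left-hand side is the log-odds of the outcome.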
Question 5:
Run the identified regression model in R. Provide a brief interpretation of the coefficients.
Grading criteria: [20 points]
• A regression table.
• Interpretation of the coefficients and their confidence intervals.
• Write a short summary of the findings from the regression. (Max 250 words)
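A minimal sketch of how a regression table with confidence intervals might be produced in base R is shown below. It assumes a linear model on the built-in mtcars dataset for illustration only; the model formula and the names fit and reg_table are hypothetical. For a binary outcome, glm() with family = binomial would take the place of lm().

```r
# Illustration only: mtcars stands in for your own dataset.
data(mtcars)
mtcars$am_f <- factor(mtcars$am, levels = c(0, 1),
                      labels = c("automatic", "manual"))

# Linear regression of fuel economy on weight and transmission type.
fit <- lm(mpg ~ wt + am_f, data = mtcars)

# Regression table: estimates, 95% confidence intervals, and p-values.
reg_table <- cbind(
  estimate = coef(fit),
  confint(fit, level = 0.95),
  p_value  = summary(fit)$coefficients[, "Pr(>|t|)"]
)
print(round(reg_table, 3))
```

Each row of the table corresponds to one term in the model. The coefficient for a factor level (here am_fmanual) is interpreted relative to the reference category, holding the other variables constant.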