PUBPOL 5310
Applied Multivariate Statistics
PROBLEM SET 2
Fall 2025
Due 11:55pm, Monday, September 22, via Canvas
• Work in teams of 2-3 (or solo, if you strongly prefer). You can choose your own teams. Clearly indicate all the members of the team at the beginning of the problem set. Turn in one problem set per team (not one per person).
• Keep answers as brief as possible, and include key Stata output (charts and descriptive statistics) with your answers. Be sure to label your charts and output clearly, and to indicate which question each chart is intended to answer.
• Turn in the main problem set as one file only, not several documents. PDF preferred. Clearly label the problem set file (example: “PS 1 – PADM 5310 – Olivero Miller.pdf”).
• NEW: Also turn in the related .log and .do files (if relevant) as separate uploads.
• Include relevant Stata commands and output (such as tables or “summarize” output) in your answers, so that we know which commands led to which results.
• When cutting and pasting Stata results into your Word document, use “Courier”, “Courier New”, or another fixed-width font that preserves Stata’s neat formatting.
1. Bivariate regression with cross-sectional data.
For this question, use the data on hourly wages and education of US residents in 2022 (for individuals working “full time and full year”), given in the data set CPS-ASEC-2024_fall25.dta on the course website (in the “Data” folder).
a. Look at the variables. Use “summarize” for the variables wage, age, education, male, and female, and “tabulate” for education. Do the summary statistics look consistent with your expectations? Explain.
b. Run a linear regression of wage on years of education using the “regress” command.
c. Interpret the slope coefficient – what does it tell us in words? Is this reasonable?
d. Interpret the intercept coefficient. What does it tell us in words? Is this reasonable?
e. Based on the regression output, what is the predicted wage for someone with 12 years of education? Show your work.
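The arithmetic behind a prediction like the one in part (e) can be sketched as follows. This is only an illustration on Stata’s built-in “auto” example data, not the course data: the variables price and weight and the value 3000 are stand-ins chosen for the example.

```stata
* Illustration only: built-in example data, not the CPS file
sysuse auto, clear

* Bivariate regression: price is the outcome, weight is the predictor
regress price weight

* Predicted value at weight = 3000: intercept + slope * 3000
* _b[_cons] and _b[weight] hold the coefficients from the last regression
display _b[_cons] + _b[weight]*3000
```

The same pattern applies to any bivariate regression: substitute the outcome, the predictor, and the value of interest.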
2. Predicting after a regression
a. Immediately following the regression in the previous problem, generate a new variable that is “predicted wages”. You can do this in Stata with “predict wage_hat”. (If you want to give your new variable a different name than wage_hat, that is fine too.) Next, generate a new variable that is the residual from the regression. You can do this in Stata with “predict wage_residual, resid”. (If you want to give your new variable a different name than wage_residual, that is fine too.)
b. Find someone in the dataset with 12 years of schooling. Confirm that their predicted wage is the same as your answer to Question 1(e).
c. Confirm for a few sample cases that the residual is indeed equal to the actual value minus the predicted value. (Hint: try “list wage_per_hour wage_hat wage_residual in 1/10” to get a listing of these variables for the first 10 observations in the dataset.) You only need to show your calculation for one observation, but confirm for yourself that this is what is going on.
d. Graphically show what is going on with the predicted values. You can do this with a command such as:
i. graph twoway (line wage_hat years_education)
ii. or alternatively:
graph twoway (line wage_hat years_education) (scatter wage_per_hour years_education)
e. Do the predicted values make sense? Do they look like a “best fit line”? For individuals with low levels of education, the predictions seem systematically strange. Why do you think this is happening?
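The predict-and-check workflow in this question can be sketched on Stata’s built-in “auto” example data. This is a hedged illustration, not the assignment’s answer: price, weight, price_hat, price_residual, and resid_check are names chosen for the example and do not appear in the course data.

```stata
* Illustration only: built-in example data, not the CPS file
sysuse auto, clear
regress price weight

* Fitted values (the default for predict after regress) and residuals
predict price_hat
predict price_residual, residuals

* The residual should equal actual minus predicted
generate resid_check = price - price_hat
list price price_hat price_residual resid_check in 1/5

* Fitted line over the raw scatter; the sort option draws the line
* in order of the x variable
graph twoway (scatter price weight) (line price_hat weight, sort)
```

In the listing, price_residual and resid_check should agree up to rounding for every observation.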
3. Creating conditional averages as a way to clarify data presentation. For this question, you will need the Stata data set CPS-ASEC-2024_fall25.dta, available for download in the “Data” folder on our course website. Let’s start with a graphical representation of the relationship between years of schooling (“years_education”) and wage (“wage_per_hour”).
a. Create a scatter plot with wage on the y-axis and years of schooling on the x-axis. This will look a little weird! Why do you think the graph has these vertical lines?
b. It’s even worse than it looks. Most of the data are “smooshed together” down in the lower range of the wage y-axis. You can get a better sense of this by making the “marker size” smaller. Add an option to your command “ , msize(tiny)”, and show the resulting graph. This helps, but it’s still not clear.
c. Before proceeding further, use Stata’s “help” command to look up the following commands: preserve, restore, collapse. We will use collapse to compute “conditional averages”. But this will alter the data in Stata’s memory, and we will later want to return to the main data. The commands “preserve” and “restore” will help us with that part.
d. Use the “collapse” command to compute average wages for each value of years of schooling. Hint: this will require use of the “, by(years_education)” option.
e. Using the “list” command, confirm that your new dataset now has only one observation per year of schooling. Create a scatter plot on this transformed data. Is the relationship in this graph more clear, or less clear, than in your answers to (a) and (b) above?
f. Next, let’s return to our main data, and add a twist. [You can use the “restore” command, if you previously preserved the data.] Now (after “preserve”-ing again) collapse your data to “years of education by female” cells. You should end up with a dataset with 28 observations: one for each year of schooling for men (female == 0) and for women (female == 1). Create a scatter plot with two different colors, one for men and one for women. You can do this with a command like:
graph twoway (scatter wage_per_hour years_education if female == 1) (scatter wage_per_hour years_education if female == 0)
What do we learn from this graph about how wages vary across education and gender? How does the gender gap in wages change across different levels of education?
g. What is the magnitude (in $/hour) of the gender wage gap for those with 12 years of schooling? For those with 16 years of schooling?
h. Now restore the main data set, and then repeat part (d), except instead of looking at “years of schooling” collapse to “age by gender” cells. Plot out the life-cycle pattern of wages for men and women. Is the gap consistent over all ages, or does it grow at certain parts of the life-cycle?
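The preserve / collapse / restore pattern used throughout this question can be sketched on Stata’s built-in “auto” example data. This is a minimal illustration under stated assumptions, not the assignment’s answer: price, rep78, and foreign are variables from the example dataset, standing in for an outcome and two grouping variables.

```stata
* Illustration only: built-in example data, not the CPS file
sysuse auto, clear

* Save a copy of the full data before collapsing
preserve

* One observation per (rep78, foreign) cell, holding the cell's mean price
* (rep78 has some missing values, which form their own cell)
collapse (mean) price, by(rep78 foreign)
list

* One scatter series per group, so each group gets its own color
graph twoway (scatter price rep78 if foreign == 1) (scatter price rep78 if foreign == 0)

* Bring back the full, uncollapsed data
restore
describe, short
```

After “restore”, the “describe, short” output should again show the original number of observations, confirming that the collapse did not permanently change the data in memory.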
4. IPUMS Account creation and exploration
The most natural place to go for data for your independent research project is the IPUMS website. For data related to demographics and labor market outcomes in the US economy, the best data is the CPS. This is the dataset that the government collects to calculate the monthly unemployment rate, and to measure trends in poverty, etc. You can access the raw data at https://cps.ipums.org/cps/. This problem has two parts.
NOTE: Please include an answer for (a) and (b) for each member of the team, not just one answer per team.
a – Each member of your team should request an account to use IPUMS-CPS. For this part of the problem set, just verify that this has taken place. (It’s okay if the account has not yet been approved by the submission deadline.) Let us know in the problem set, with a screenshot or something similar, that this has taken place for each member of the team.
b – Each member of your team should browse the variable listings, and daydream/brainstorm about which variables you would each like to explore. In particular, I’d like you to think of 1-4 “outcome variables” that you would like to predict, and 2-6 “predictor” variables that you would like to use to predict them. You can search for variables (and see for which years/months those variables are available) at https://cps.ipums.org/cps-action/variables/group . For this part of the problem, for each member of the team, clearly indicate the “outcome” and “predictor” variables you are interested in, and provide a very brief motivation for why you think it would be interesting to examine relationships between these variables. Also, indicate which time periods and/or other sample restrictions you would like to examine.