Homework 4
Statistical Machine Learning • Spring 2025
Assigned: April 22
Due: May 12
Instructions:
• You may work with others on this homework assignment, but all solutions must be written up and submitted individually.
• All homework assignments must be submitted in PDF format and should be at most 12 pages in length. Any material beyond 12 pages will not be graded.
• You must submit all code used to complete this homework assignment as an appendix (this can go beyond the 12 pages). Failure to submit code will result in a 20% reduction.
• You are permitted 4 total late days on homework assignments throughout the semester. Any late days taken beyond these 4 will incur a 20% reduction per day.
• This homework assignment has 110 total points available and any points earned above 100 will count as extra credit.
1. (50 points) Trees, Ensembles, & Neural Networks. This problem will use the Adult Dataset available from https://archive.ics.uci.edu/ml/datasets/adult. The dataset is already split into a training and test set and you should report accuracy results on the given test set.
(a) (25 points) Trees & Tree-Based Ensembles.
i. (10 points) Hyperparameters. Try to purposefully overfit each of the following approaches: Decision Trees, AdaBoost, Gradient Boosting, and Random Forests. Were you able to overfit? What hyperparameters did you use to overfit these approaches? Plot the training and some form of validation error as a function of some of the hyperparameters for various approaches to show which hyperparameters can lead to overfitting. You may want to try varying the tree depth, number of iterations (Boosting), learning rate (Boosting), and/or the number of features to consider for each split (Random Forests), among others.
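As a starting point, a minimal sketch of the tree-depth experiment using scikit-learn (synthetic data stands in here for the Adult data set; substitute your own train/validation split):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Adult data; replace with your own X, y.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

for depth in [1, 3, 10, None]:  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    print(f"depth={depth}: train={tree.score(X_tr, y_tr):.3f}, "
          f"val={tree.score(X_val, y_val):.3f}")
```

Plotting the two accuracy curves against depth makes the growing train/validation gap, i.e. overfitting, visible; the analogous loops over `n_estimators`, `learning_rate`, or `max_features` cover the other methods.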
ii. (10 points) Accuracy & Runtime. Now properly tune each method. How did you do so? Justify your approach. Compare and contrast the test accuracy of each method for the optimal tuning. Which performs best? Why? What hyperparameters led to the best test error? How long did each approach take to train and tune?
iii. (5 points) Interpretation. Interpret the best performing model from each model family. Plot the classification tree and examine the feature importance measures for each approach. Do all methods deem similar features important? Discuss your interpretations.
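For the importance comparison, each scikit-learn tree ensemble exposes a `feature_importances_` attribute; a sketch (again on synthetic data) of lining two families up:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)
gb = GradientBoostingClassifier(random_state=0).fit(X, y)

# Rank features from most to least important for each ensemble.
for name, model in [("random forest", rf), ("gradient boosting", gb)]:
    order = np.argsort(model.feature_importances_)[::-1]
    print(name, "top features:", order[:3])
```

The fitted classification tree itself can be drawn with `sklearn.tree.plot_tree`.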
(b) (25 points) Feedforward Neural Networks.
i. (15 points) Hyperparameters. Build feedforward neural networks and report some form of validation error for a range of hyperparameters. You should consider:
• One or two hidden layers.
• Varying the number of hidden units per layer.
• At least two activation functions.
• At least two optimizers (with varying learning rates).
• Different random initializations.
Note that you do not need to vary all of these at once in a grid search, but may instead consider these in groups or randomly. Visualize the training and validation error for these hyperparameters. Which hyperparameters are best? Discuss your results.
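One possible grouping of the sweep, sketched with scikit-learn's `MLPClassifier` on synthetic data (architecture and activation varied together; optimizer, learning rate, and seed held fixed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1500, n_features=20, random_state=0)
X = StandardScaler().fit_transform(X)  # networks are sensitive to feature scale
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for hidden in [(16,), (64,), (64, 64)]:  # one or two hidden layers, varied width
    for act in ["relu", "tanh"]:         # two activation functions
        net = MLPClassifier(hidden_layer_sizes=hidden, activation=act,
                            solver="adam", learning_rate_init=1e-3,
                            max_iter=300, random_state=0)
        net.fit(X_tr, y_tr)
        results[(hidden, act)] = net.score(X_val, y_val)

for config, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(config, round(acc, 3))
```

A second group could fix the best architecture and instead vary `solver` (e.g. `"adam"` vs. `"sgd"`), `learning_rate_init`, and `random_state` for the different initializations.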
ii. (10 points) Accuracy & Runtime. Now properly tune your multi-layer perceptron (feedforward neural network). How did you do so? Justify your approach. Which hyperparameters were selected? Compare the test accuracy and runtime (training and tuning) to those of trees and tree-based ensembles studied previously. Which method is best for this data set? If you could use more computational resources for training, which hyperparameters would you explore further that may yield additional improvements in test accuracy? Interpret and discuss your results.
2. (60 points) Machine Learning Pipeline. For this problem, you will develop your own machine learning pipeline to analyze a data set of your choosing. Here are some requirements of this pipeline development and data analysis:
(a) Data & Pre-processing. The data set you choose should be (i) reasonable in size and complexity (e.g. not too small or simple) and (ii) have a clear analysis objective that can be addressed via machine learning; this can be supervised or unsupervised in nature. Some good resources to find data sets include:
• UCI ML Repository: https://archive.ics.uci.edu/.
• Kaggle Datasets: https://www.kaggle.com/datasets.
• OpenML Datasets: https://www.openml.org/.
You should also wrangle and pre-process your data to prepare it for the machine learning task. Justify any choices you make.
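A typical pre-processing choice for mixed numeric/categorical data is to scale the numeric columns and one-hot encode the categorical ones; a sketch with a tiny hypothetical frame (the column names and the "?"-as-missing convention are illustrative, not assumptions about your data):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical frame; your own data set's columns will differ.
df = pd.DataFrame({"age": [25, 38, 52, 41],
                   "workclass": ["Private", "Private", "State-gov", "?"],
                   "hours": [40, 50, 45, 40]})
df = df.replace("?", pd.NA).dropna()  # one choice: treat "?" as missing and drop

# Scale numeric columns, one-hot encode categorical ones.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "hours"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
])
X = pre.fit_transform(df)
print(X.shape)
```

Whatever choices you make (dropping vs. imputing missing values, encoding scheme, scaling) should be stated and justified in the write-up.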
(b) Exploratory Data Analysis & Visualization. Use unsupervised learning techniques among others to explore and visualize your data. Are there any findings from EDA that will influence your choice of models or modeling approach? Justify any choices you make.
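A common starting point for the unsupervised part of EDA is to project the scaled features onto their first two principal components and scatter-plot them; a minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for your chosen data set.
X, _ = make_classification(n_samples=500, n_features=10, random_state=0)

# Scale first so no single feature dominates the components.
pca = PCA(n_components=2)
Z = pca.fit_transform(StandardScaler().fit_transform(X))
print(Z.shape, pca.explained_variance_ratio_)
# Z[:, 0] and Z[:, 1] can now be scattered (e.g. with matplotlib),
# coloured by any candidate label or cluster assignment.
```

Low explained variance in the first components, visible clusters, or outliers are all EDA findings that could reasonably influence the modeling choices in part (c).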
(c) Modeling & Model Validation. Fit several machine learning models that are appropriate to address the stated objective. Which models did you choose and why? How did you properly tune hyperparameters and validate models? Which model is ultimately the best and why? Justify any choices you make.
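For the validation step, one defensible baseline protocol is k-fold cross-validation with the same folds for every candidate model; a sketch comparing two models (the candidates shown are placeholders for whatever fits your objective):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for your chosen data set.
X, y = make_classification(n_samples=600, n_features=15, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)  # same folds for all models

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Holding out a final test set before any of this tuning keeps the reported performance of the chosen model honest.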
(d) Communication of Results & Interpretation. Communicate the findings from your final (best) model and interpret your results. You must provide at least one visual summary of your findings.