COM6012 Assignment 2 - Deadline: 4:00 PM, Wed 13 May 2020  
Assignment Brief   
How and what to submit  
A. Create a .zip file containing the following:  
1) AS2_report.pdf: A report in PDF containing answers to ALL questions. The report             
should be concise. You may include appendices/references for additional          
information but marking will focus on the main body of the report.  
2) Code, script, and output files: all files used to generate the answers to the individual questions above, except the data. These files should be named starting with the question number, e.g., your Python code as Q2_xxx.py (one per question), your script for HPC as Q2_HPC.sh, and your output files on HPC such as Q2_output.txt or Q2_figB.jpg. If you develop your answers in Jupyter Notebook, you MUST prepare and submit these files (code, script, output, images, etc., as specified above in bold) after you finalise your code and output. The results should be generated on the HPC, not on your local machine.  
B. Upload your .zip file to MOLE before the deadline above. Name your .zip file USERNAME_STUDENTID_AS2.zip, where USERNAME is your username, such as abc18de, and STUDENTID is your student ID, such as 18xxxxxxx.  
C. NO DATA UPLOAD: Please do not upload the data files used; we already have a copy. Instead, use a relative file path in your code (data files under the folder ‘Data’), as in the lab notebooks, so that we can run your code smoothly (see the short sketch after this list).  
D. Code and output. 1) Use PySpark as covered in the lecture and lab sessions to                 
complete the tasks; 2) Submit your PySpark job to HPC with qsub to obtain the output.  
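For reference, here is a minimal, hedged sketch of the kind of relative-path loading we expect. The application name and the file name HIGGS.csv are only placeholders; use whichever file the question requires.

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; the application name is arbitrary.
spark = SparkSession.builder.appName("COM6012_AS2").getOrCreate()

# Read the data with a RELATIVE path under the 'Data' folder, so the same
# code runs unchanged when we execute it; HIGGS.csv is only an example name.
df = spark.read.csv("Data/HIGGS.csv", header=False, inferSchema=True)
df.printSchema()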
Assessment Criteria (Scope: Sessions 6-9; Total marks: 20)  
1. Being able to use pipelines, cross-validators, and a range of different supervised learning methods for large datasets  
2. Being able to analyse a large-scale data analytics challenge and put in place a suitable course of action to address it  
Late submissions: We follow the Department's guidelines on late submissions, i.e., a deduction of 5% of the mark for each working day the work is late after the deadline. NO late submission will be marked more than one week after the deadline, because we will release a solution by then. Please see this link.  
Use of unfair means: "Any form of unfair means is treated as a serious academic offence and action may be taken under the Discipline Regulations." (from the MSc Handbook). Please read this link carefully on what constitutes Unfair Means if you are not sure.  
Please only use interactive HPC sessions when you work with small data, to test that your algorithms are working correctly. If you use the rse-com6012 queue in an interactive session, performance for the whole group of students will be better if you use at most four cores and 15G per core. When you want to produce your results for the assignment and/or want access to more cores and more memory, PLEASE USE BATCH HPC; this is mandatory. We will monitor the time your jobs take to run and will automatically “qdel” a job if it is taking much longer than expected. We want to promote good coding practices (e.g., sensible memory usage), so, once more, please make sure that whatever you run on the HPC has already been tested sufficiently on a smaller dataset. It is fine to produce your results several times on the HPC, but please be mindful that long-running jobs affect other users' access to the shared pool of resources.  
Question 1. Searching for exotic particles in high-energy physics using classic            
supervised learning algorithms  [15 marks]  
In this question, you will explore the use of supervised classification algorithms to identify Higgs bosons from particle collisions, such as those produced at the Large Hadron Collider. In particular, you will use the HIGGS dataset.  
About the data: “The data has been produced using Monte Carlo simulations. The first 21                
features (columns 2-22) are kinematic properties measured by the particle detectors in the              
accelerator. The last seven features are functions of the first 21 features; these are               
high-level features derived by physicists to help discriminate between the two classes.             
There is an interest in using deep learning methods to obviate the need for physicists to                 
manually develop such features. Benchmark results using Bayesian Decision Trees from a             
standard physics package and 5-layer neural networks are presented in the original paper.              
The last 500,000 examples are used as a test set.”  
You will apply Random Forests and Gradient Boosting to a subset of the dataset and to the full dataset. As performance measures, use classification accuracy and area under the curve (AUC).  
1. Use pipelines and cross-validation to find the best configuration of parameters for each model (a first sketch is given after this list of steps).  
a. For finding the best configuration of parameters, use 5% of the data, chosen randomly from the whole set (2 marks).  
b. Use a sensible grid for the parameters (for example, three options for each parameter) for each predictive model (3 marks).  
c. Use the same splits of training and test data when comparing performances among the algorithms (1 mark).  
Please use batch mode to work on this. Although this subset is not large, batch mode allows jobs to be queued and lets the cluster allocate resources more effectively.  
2. Working with the larger dataset. Once you have found the best parameter configuration for each algorithm on the smaller subset of the data, use the full dataset to compare the performance of the algorithms on the cluster. Remember to use batch mode to work on this (a second sketch is given after this list of steps).  
a. Use the best parameters found for each model on the smaller dataset in the previous step when training the models in this step (2 marks).  
b. Once again, use the same splits of training and test data when comparing performances between the algorithms (1 mark).  
c. Provide training times when using 10 CORES and 20 CORES (2 marks). Based on our own solution, with the proper setting, you need at most 10 minutes to run this exercise when using 10 cores (with the rse-com6012 queue).  
3. Report the three most relevant features according to each method in step 2 (1               
mark).  
4. Discuss at least three observations (e.g., anything interesting), with one to three             
sentences for each observation (3 marks).  
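As a rough guide for step 1, here is a minimal PySpark sketch for one of the models (Random Forest shown). The column handling, seed, grid values and fold count are illustrative assumptions, not prescribed settings.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("Q1_tuning").getOrCreate()

# HIGGS.csv has no header: _c0 is the class label, _c1.._c28 are the features.
df = spark.read.csv("Data/HIGGS.csv", inferSchema=True).withColumnRenamed("_c0", "label")
assembler = VectorAssembler(inputCols=["_c{}".format(i) for i in range(1, 29)],
                            outputCol="features")

# 1a. A 5% random subset for tuning; the fixed seed keeps the train/test
#     split identical when the other algorithms are tuned (1c).
small = df.sample(fraction=0.05, seed=42)
train, test = small.randomSplit([0.8, 0.2], seed=42)

rf = RandomForestClassifier(labelCol="label", featuresCol="features", seed=42)
pipeline = Pipeline(stages=[assembler, rf])

# 1b. A sensible grid: three options per parameter (the values are examples).
grid = (ParamGridBuilder()
        .addGrid(rf.maxDepth, [5, 10, 15])
        .addGrid(rf.numTrees, [5, 10, 20])
        .build())

evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train)

# Report the best parameter combination found on the 5% subset and its test AUC.
best_index = max(range(len(cv_model.avgMetrics)), key=lambda i: cv_model.avgMetrics[i])
best_params = {p.name: v for p, v in cv.getEstimatorParamMaps()[best_index].items()}
print("Best parameters:", best_params)
print("Test AUC:", evaluator.evaluate(cv_model.transform(test)))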
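And, for steps 2-3, a sketch of refitting on the full dataset with the best parameters, timing the fit, and reading off feature importances. Again, the parameter values, seeds and timing approach are only illustrative.

import time
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("Q1_full").getOrCreate()
df = spark.read.csv("Data/HIGGS.csv", inferSchema=True).withColumnRenamed("_c0", "label")
assembler = VectorAssembler(inputCols=["_c{}".format(i) for i in range(1, 29)],
                            outputCol="features")

# 2b. The same split for every algorithm: one fixed seed on the FULL dataset.
train, test = df.randomSplit([0.8, 0.2], seed=42)
train, test = assembler.transform(train), assembler.transform(test)

# 2a. Plug in the best parameters found on the 5% subset (values are examples).
rf_best = RandomForestClassifier(labelCol="label", featuresCol="features",
                                 maxDepth=10, numTrees=20, seed=42)

# 2c. Wall-clock training time; submit the same script with 10 and 20 cores.
start = time.time()
model = rf_best.fit(train)
print("Training time: {:.1f} s".format(time.time() - start))

pred = model.transform(test)
auc = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC").evaluate(pred)
acc = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy").evaluate(pred)
print("AUC: {:.4f}  Accuracy: {:.4f}".format(auc, acc))

# 3. The three most relevant features for a tree-ensemble model.
importances = model.featureImportances.toArray()
top3 = sorted(enumerate(importances), key=lambda x: -x[1])[:3]
print("Top 3 (feature index, importance):", top3)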
Do not try to upload the dataset to MOLE when returning your work. It is 2.6 GB.  
HINTS: 1) An old but very powerful engineering principle says: divide and conquer. If you are unable to analyse your dataset out of the box, you can always start with a smaller one and build your way up from it. 2) This dataset was used in the paper “Searching for Exotic Particles in High-energy Physics with Deep Learning” by P. Baldi, P. Sadowski, and D. Whiteson, published in Nature Communications 5 (July 2, 2014). You can compare the results that you get against Table 1 of the paper.  
Question 2. Senior Data Analyst at Intelligent Insurances Co.  [15 marks]  
You are hired as a Senior Data Analyst at Intelligent Insurances Co. The company wants to develop a predictive model that uses vehicle characteristics to accurately predict insurance claim payments. Such a model will allow the company to assess the potential risk that a vehicle represents.  
The company puts you in charge of coming up with a solution for this problem and provides you with a historic dataset of previous insurance claims. The claimed amount can be zero or greater than zero, and it is given in US dollars. A more detailed description of the problem and the available historic dataset is given here. The website contains several files; you only need to work with the .csv file in train_set.zip. The uncompressed file is 2.66 GB.  
1. Preprocessing (a preprocessing sketch is given after this list of steps)  
a. The dataset has several fields with missing data. Choose a method to deal with missing data (e.g., remove the rows with missing fields, or use an imputation method) and justify your choice (1 mark).  
b. Convert categorical values to a suitable representation (1 mark).  
c. The data are highly imbalanced: most of the records contain zero claims. When designing your predictive model, you need to account for this (2 marks).  
2. Prediction using linear regression. You can see the problem as a regression problem where the variable to predict is continuous. Be careful about the preprocessing step above: the performance of the regression model will depend on the quality of your training data (see the linear-regression sketch after this list of steps).  
a. Use linear regression in PySpark as the predictive model. Partition your data into training and test sets (percentages of your choice) and report the mean absolute error and the mean squared error (2 marks).  
b. Provide training times when using 10 CORES and 20 CORES. Remember to use batch mode to work on this (1 mark). Based on our own solution, with the proper setting, you need at most 5 minutes to run this exercise when using 10 cores (with the rse-com6012 queue).  
3. Prediction using a combination of two models. You can build a prediction model based on two separate models applied in tandem (one after the other). Once again, be careful about the preprocessing step above. For this step, use the same training and test data that you used in 2.a (see the tandem-model sketch after this list of steps).  
a. The first model will be a binary classifier (of your choice) that tells whether the claim was zero or different from zero. The performance of the classifier will depend on the quality of your training data (2 marks).  
b. For the second model, if the claim was different from zero, train a Gamma regressor (a GLM) to predict the value of the claim. Report the mean absolute error and the mean squared error (2 marks).  
c. Provide training times when using 10 CORES and 20 CORES. Remember to use batch mode to work on this (1 mark). Based on our own solution, with the proper setting, you need at most 15 minutes to run this exercise when using 10 cores (with the rse-com6012 queue).  
4. Discuss at least three observations (e.g., anything interesting), with one to three             
sentences for each observation (3 marks).  
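For step 1, here is a minimal preprocessing sketch. The column names Cat1, Cat2, NVCat and Claim_Amount are assumptions based on the claim dataset's description and should be checked against the actual file, and the balancing strategy shown is only one defensible option.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler

spark = SparkSession.builder.appName("Q2_preprocess").getOrCreate()

# Relative path, as required; column names below are assumed, not guaranteed.
raw = spark.read.csv("Data/train_set.csv", header=True, inferSchema=True)

# 1a. One option: drop rows with missing fields (an imputer would be an equally
#     valid choice if you justify it); markers such as "?" may also need handling.
clean = raw.dropna()

# 1b. Categorical columns -> string index -> one-hot encoding.
cat_cols = ["Cat1", "Cat2", "NVCat"]                 # illustrative subset
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
            for c in cat_cols]
encoder = OneHotEncoderEstimator(inputCols=[c + "_idx" for c in cat_cols],
                                 outputCols=[c + "_vec" for c in cat_cols])
# (OneHotEncoderEstimator is the Spark 2.x name; in Spark 3.x it is OneHotEncoder.)

# 1c. The data are highly imbalanced: one simple option is to subsample the
#     zero-claim rows so the two groups are roughly the same size.
zeros = clean.filter(F.col("Claim_Amount") == 0)
nonzeros = clean.filter(F.col("Claim_Amount") > 0)
balanced = nonzeros.union(zeros.sample(fraction=float(nonzeros.count()) / zeros.count(),
                                       seed=42))

# Assemble the encoded columns into a single 'features' vector; a full solution
# would also include the numeric columns here.
pipeline = Pipeline(stages=indexers + [encoder,
                    VectorAssembler(inputCols=[c + "_vec" for c in cat_cols],
                                    outputCol="features")])
data = pipeline.fit(balanced).transform(balanced)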
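For step 2, a sketch of the linear-regression model, continuing from the preprocessing sketch above: it assumes a DataFrame data with an assembled features column and the claim amount in Claim_Amount, and the split percentages are an example.

import time
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# 2a. One fixed split, reused in step 3 so the comparisons are fair.
train, test = data.randomSplit([0.7, 0.3], seed=42)

lr = LinearRegression(featuresCol="features", labelCol="Claim_Amount")

# 2b. Wall-clock training time; submit the same script with 10 and 20 cores.
start = time.time()
lr_model = lr.fit(train)
print("Training time: {:.1f} s".format(time.time() - start))

pred = lr_model.transform(test)
mae = RegressionEvaluator(labelCol="Claim_Amount", metricName="mae").evaluate(pred)
mse = RegressionEvaluator(labelCol="Claim_Amount", metricName="mse").evaluate(pred)
print("MAE: {:.2f}  MSE: {:.2f}".format(mae, mse))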
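For step 3, a sketch of the two-model (tandem) approach, reusing the same train/test split as above. LogisticRegression stands in for whichever binary classifier you choose, and GeneralizedLinearRegression with family="gamma" is PySpark's Gamma GLM (the Gamma family needs strictly positive labels, hence the non-zero filter).

from pyspark.sql import functions as F
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# 3a. Binary label: was the claim zero or non-zero?
train_cls = train.withColumn("nonzero", (F.col("Claim_Amount") > 0).cast("double"))
test_cls = test.withColumn("nonzero", (F.col("Claim_Amount") > 0).cast("double"))
cls_model = LogisticRegression(featuresCol="features", labelCol="nonzero").fit(train_cls)

# 3b. Gamma GLM trained only on the strictly positive claims.
#     (3c. As in step 2, wrap the two .fit() calls with time.time() to report times.)
glm = GeneralizedLinearRegression(featuresCol="features", labelCol="Claim_Amount",
                                  family="gamma", link="log")
glm_model = glm.fit(train.filter(F.col("Claim_Amount") > 0))

# Chain the two models: predict a claim value only where the classifier says
# the claim is non-zero, otherwise predict zero.
cls_pred = (cls_model.transform(test_cls)
            .select("features", "Claim_Amount", "prediction")
            .withColumnRenamed("prediction", "is_claim"))
final = (glm_model.transform(cls_pred)
         .withColumn("final_pred",
                     F.when(F.col("is_claim") > 0.5, F.col("prediction"))
                      .otherwise(F.lit(0.0))))

mae = RegressionEvaluator(labelCol="Claim_Amount", predictionCol="final_pred",
                          metricName="mae").evaluate(final)
mse = RegressionEvaluator(labelCol="Claim_Amount", predictionCol="final_pred",
                          metricName="mse").evaluate(final)
print("Tandem MAE: {:.2f}  MSE: {:.2f}".format(mae, mse))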
HINT: An old but very powerful engineering principle says: divide and conquer. If you are unable to analyse your dataset out of the box, you can always start with a smaller one and build your way up from it.