The University of Western Ontario
Computer Science 2035B
Instructor: Dr. David Champredon
Final Examination
Take-Home Version
Handed out: April 7, 2020 at 12:01AM
Due Date: April 20, 2020 at 11:55PM
Marking Scheme
• This exam consists of 7 questions worth a total of 100 points.
• This exam comprises 22% of your overall mark for this course.
• A relative bonus of 10% is added to your exam grade if your entire work
is submitted before April 16, 2020, 11:55PM.
For example, if your grade for this exam is 80 points, you earn an extra 8 points.
There is no cap, so the maximum mark is potentially 110 points for work that is
submitted in full by that time.
• This is an exam, so no late work will be accepted. A late submission
will be marked zero.
GOOD LUCK!
COMPSCI-2035B Final Examination - Take Home Page 2 of 7
Important
• Submit your assignment on OWL in the form of well-commented Matlab script files
in the “Assignments” section. The file names must follow this convention:
Final_STUDENTNUMBER_EXn.m where STUDENTNUMBER is your 9-digit student number
and n represents the exercise number (n is 1 to 7). Functions (if any) can either
be in the same script (“all-in-one” style) or in separate files (with the file name
the same as the function name).
• Make sure your code runs. Your grade will be based on what the grader can
run. A program that does not run (i.e., stops because of any error) will
be graded zero. Make sure you submit all the necessary files such that the grader
can run your programs. This includes the csv data files (the ones provided with this
exam). Unlike assignments, there will be no exception or second chance.
• Your submission must reflect your own work. Programs that are suspiciously
similar will be graded zero.
• You are encouraged to submit regularly to OWL. This will avoid unnecessary last-
minute stress.
1.(7 points) Speed of Light. An astrophysics laboratory has purchased a new set of instruments
to connect to their telescope. To verify that the new instruments are well calibrated,
the scientists performed a set of independent experiments measuring the speed of light.
The measurements, in meters per second (m/s), are recorded in the file speed-light.csv.
The exact theoretical value for the speed of light is 299,792,458 m/s.
Are the instruments well calibrated? Justify your answer with Matlab code that
implements a rigorous statistical analysis, together with a short explanation.
2.(7 points) Algorithm. Write a Matlab function named find_pos_occ that returns the number
of occurrences and the positions of a single character in a text. For example, the text
I have an apple in my bag. You can take it!
contains 6 occurrences of the character a at the positions 4, 8, 11, 24, 33 and 37.
You may use only the following Matlab commands in your algorithm:
for, while, if, then, sum(), size(), length(), zeros().
3.(12 points) Decathlon. The results of a decathlon in which 33 athletes competed are saved in the
file deca.csv. The first ten columns give the performance in each of the ten
events (100-meter run, long jump, etc.). The last (11th) column is the total score
calculated for each athlete from his 10 performances.
a) Create a figure with 2-by-5 panels that plots, for each event, the athletes’ total
score against their performance in that event.
b) Create a 10-row table named cor_event_score where the first column is the name
of the event and the second column is the correlation between the total score and
the numeric performance of the event. The rows should be sorted in descending
order from the highest absolute value of the correlation to the lowest.
c) Perform a principal component analysis on this dataset. Provide a short explanatory
paragraph and a figure that illustrate your interpretation of the PCA.
4.(16 points) Influenza Evolution. The file h3n2.csv contains n = 950 genetic sequences of
the H3N2 strain of influenza viruses that have circulated among humans since 1968.
The first column is the name of the genetic sample and the second column is its
molecular sequence expressed with amino acids. Each letter of the molecular sequence
represents an amino acid. The third column is the year of the sample collection. Each
sequence has 566 amino acids. We refer to the position of an amino acid simply as
its position in the genetic sequence. For example, in the sequence NGTMVK, the amino
acid G is in position 2. We want to focus this analysis only on the amino acids that
are between positions 100 and 500 (inclusive). We define the distance between two
sequences as the number of positions at which their amino acids differ. For example,
the sequences NGTMVK and NGAMTK have a distance equal to 2 because their amino
acids in positions 3 and 5 differ.
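As an illustration of this distance definition only, the example pair above can be checked with a minimal Matlab sketch. It assumes both sequences are character vectors of equal length; the variable names s1, s2, d and pos are hypothetical and not part of the exam material.

```matlab
% Illustrative sketch: count the positions at which two equal-length
% sequences differ. Variable names are hypothetical.
s1 = 'NGTMVK';
s2 = 'NGAMTK';

% Element-wise comparison yields a logical vector (true where letters differ).
isDifferent = (s1 ~= s2);

% The distance is the number of mismatching positions.
d = sum(isDifferent);

% Positions at which the two sequences differ.
pos = find(isDifferent);

fprintf('distance = %d, differing positions: %s\n', d, mat2str(pos));
```

Restricting the comparison to a window of positions amounts to indexing both vectors with the same range before comparing them.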
a) Calculate (in an n × n matrix) the pairwise distances between all sequences.
b) Perform a classical multi-dimensional scaling (MDS) in dimension 2. Create a
scatter plot of the projected data points, colouring each projected point by the
decade of its collection year (for example, the 1970s is the decade for the years
1970, 1971, ..., 1979). Make a legend to identify the decades.
c) Antigenic drift is a kind of genetic variation in viruses resulting from the
accumulation, over time, of mutations in the virus genes that code for the
virus-surface proteins that human antibodies recognize. This results, year after year,
in new strains of influenza virus that “look” different from the strains of previous
years (they are more “distant”), making it easier for the changed virus to spread
throughout a partially immune population. Does the MDS performed in b) illustrate the
antigenic drift of influenza? Explain in one short paragraph.
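The decade grouping described in b) can be sketched as follows. This is only a hedged illustration of the definition of a decade: the years below are made-up values, not data from h3n2.csv, and the variable names are hypothetical.

```matlab
% Hypothetical collection years, for illustration only (not from h3n2.csv).
years = [1968 1970 1979 1980 2004];

% Rounding down to the nearest multiple of 10 maps 1970, ..., 1979 to 1970.
decades = floor(years / 10) * 10;

% Text labels such as '1970s', usable as legend entries.
labels = arrayfun(@(d) sprintf('%ds', d), decades, 'UniformOutput', false);

disp(decades);
disp(labels);
```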
5.(14 points) Ice Cream and Weather. An ice cream truck is a commercial vehicle that serves
as a mobile retail outlet for ice cream. The owner of an ice cream truck has noticed
that, unsurprisingly, her sales are better on warm, sunny days. She wants to better
understand this relationship between the weather and her business. Her daily sales
revenues (in dollars) are recorded in the file sales.csv. Daily records of temperature
(in degrees Celsius) and rainfall (in mm) for the 2019 summer season, for the location
where she usually sells ice cream, were downloaded from a weather website to the file
weather.csv. Note that the ice cream truck owner does not operate every day, for
various reasons (sickness, technical problems with the truck, etc.).
a) Perform a multivariate linear regression that models the ice cream sales revenues
as a function of the two weather variables. Make one figure of your choice that
illustrates this linear regression (possibly with multiple panels).
b) The 5-day weather forecast for next week is:
Day         Temperature (°C)   Rainfall (mm)
Monday      32                 0
Tuesday     31                 5
Wednesday   26                 10
Thursday    23                 15
Friday      23                 2
What is the total expected revenue for the next 5 days (assuming the owner works
every day)? What is the 90% confidence interval for your estimate?
6.(22 points) Spam Emails. A charity has contacted you to ask if you could help them filter
the hundreds of spam emails they receive every day. Spam directly affects the quality
of the service offered by the charity because it crowds out important and urgent
emails received from people in need. One of your friends thinks spam can be detected
by calculating the frequency of certain words and characters in the text of an email.
Your friend developed a program that calculates the frequencies of 48 selected keywords
and 9 other variables that capture various other metrics from the content of the email.
Hence, in total, there are 57 metrics extracted from a given email. The charity gave
you a random sample of 1,000 emails they received last month, and your friend ran the
program on those emails. Then, your friend read through all of the 1,000 emails and
annotated each email to indicate whether it was indeed spam. The result of this hard
work is saved in the file spam-train.csv, where the first 57 columns represent the
various metrics calculated by your friend’s program, and the last (58th) column is the
spam annotation: 1 if the email was spam, 0 otherwise.
a) Based on the “training” dataset in spam-train.csv, develop a predictive model
based on a logistic regression coupled with a ROC analysis that classifies emails
as spam or not. The charity accepts that at most 2 out of 100 non-spam emails
may be wrongly classified as “spam”.
b) Now that your predictive model is developed, your friend has retrieved new emails
received yesterday, computed the 57 metrics on them and saved the results in the file
spam-test.csv. Run your predictive model on this new data set and classify each
email as spam or not. What proportion of the emails in this new data set does
your model identify as spam?
c) A software company has approached the charity, claiming they have new
state-of-the-art software that can detect spam like never before. The director of the
charity is tempted to buy this software but hesitates because of its hefty price.
The software company has run its state-of-the-art program on the same training
set spam-train.csv as you did. The software gives a numerical score to an email:
the higher the score, the more likely the email is spam. The scores of the 1,000
emails of the training dataset are saved in spam_comp.txt. The director asks you
whether the software company does better than the method you and your friend provided
through your benevolent (free) work. Answer the director with a short paragraph that
explains your comparative analysis, along with one single figure of your choice.
7.(22 points) Bike Sharing. A city put in place a bike sharing system a year ago. Through this
system, users can easily rent a bike from a particular station and return it to
another station (possibly the same one). There are complaints that some stations often have
no bike available to rent. The logistics of making sure there are enough bikes available at
all renting stations are complex and partially based on the duration of the bike ride for
each user. Municipal employees have noticed that the ride duration tends to be longer
when the weather is nice. If this is true, the municipal staff who move bikes to empty
stations could plan their activity in advance, based on weather forecasts, to improve bike
availability. The manager of the bike sharing program wants to be sure that weather
influences the bike ride duration and hires you as a data analyst consultant to study
this.
The manager provides a dataset of 2,000 bike rides randomly selected over the last year
in the file bike_trips.csv, as well as the file bike_stations.csv, which translates the
numerical station codes into English station names. You also have access to weather
data for the city in the file bike_weather.csv. The file bike_INFO.txt contains
important additional information about those three files.
a) Merge the information from all three datasets (about bike rides, station names
and weather) in a single table that must have the following format (note that the
codes for weather and stations are replaced by their “names”):
day   station_start      station_end        duration   weather
1     BotanicalGardens   MainStreetSouth    56.78      Sunny
1     MontagueStreet     MontagueStreet     12.34      Storm
1     BakerStreet        AdelaideStreet     8.76       Storm
1     TrainStation       BotanicalGardens   6.31       Light Rain
2     ShoppingMall       CravenAvenue       69.31      Sunny
...   ...                ...                ...        ...
The table above is illustrative and does not represent the values of the actual data
contained in the file. The variable duration is the duration of the bike ride in
minutes.
Hint: this question involves several steps of joining tables.
b) Produce and display a table that summarizes the average bike ride duration for
each of the four types of weather. Create a well-annotated boxplot that shows the
distribution of bike ride durations by weather type.
c) Conduct a rigorous statistical analysis that determines whether bike ride durations
differ with the type of weather. Write a short paragraph that summarizes your
analysis.