Module Code: COMP 5840M01
Module Title: Data Mining and Text Analytics © UNIVERSITY OF LEEDS
School of Computing
Semester Two 2018/2019
Calculator instructions:
You are allowed to use a non-programmable calculator only from the following list of
approved models in this exam: Casio fx-82 (all variants), Casio fx-83 (all variants),
Casio fx-85 (all variants).
Dictionary instructions:
You are not allowed to use your own dictionary in this exam. A basic English dictionary
is available to use: raise your hand and ask an invigilator, if you need it.
Exam information:
There are 7 pages to this exam.
There will be 2 hours to complete this exam.
Answer all 3 questions.
The number in brackets [ ] indicates the marks available for each question or part
question.
You are reminded of the need for clear presentation in your answers.
The total number of marks for this examination paper is 60.
You are allowed to use annotated materials.
Question 1
(a) Marvel Studios makes films about Marvel super-heroes, such as Iron Man, and wants to
promote diversity and inclusion by including more women and minority-group super-heroes in
its movies, such as Captain Marvel, Black Widow, and T’Challa, the Black Panther. Marvel
Studios wants to learn whether movie-goers who have liked women and minority Marvel
super-heroes will like its latest movie, “Endgame”. It asked a group of 7 movie-goers
whether they liked movies starring Captain Marvel, Black Widow, and T’Challa, the Black
Panther; it then asked the group to watch “Endgame” and report whether they liked this
new movie.
The following CSV file represents data about the 7 movie-goers and which super-heroes they
liked:
1 = yes, 0 = no;
E = Endgame, M = Captain Marvel, W = Black Widow, and T = T’Challa, the Black Panther
E,M,W,T
1,1,0,1
0,0,1,0
1,0,1,1
0,1,1,1
0,1,0,0
1,0,0,1
1,0,1,1
Construct a J48-style decision tree from this training data, to predict class E= “like the movie
Endgame” with at least 85% accuracy when evaluated on the training set. Justify your choice
of features for decision points.
[6 marks: 2 method, 2 justification, 2 full decision tree]
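The feature choice at a decision point can be motivated by information gain. The following is a minimal sketch, not WEKA's J48 implementation: it assumes plain information gain as the splitting criterion (J48 is based on C4.5, which uses the related gain ratio), and the names `entropy`, `information_gain` and `stump` are illustrative only. It ranks M, W and T on the training data above and measures the training-set accuracy of a one-level tree built on the best feature.

```python
import csv
import io
from collections import Counter
from math import log2

# Training data copied from the question (E is the class to predict).
CSV_TEXT = """E,M,W,T
1,1,0,1
0,0,1,0
1,0,1,1
0,1,1,1
0,1,0,0
1,0,0,1
1,0,1,1
"""

rows = [{k: int(v) for k, v in r.items()}
        for r in csv.DictReader(io.StringIO(CSV_TEXT))]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, feature, target="E"):
    """Entropy of the class minus the weighted entropy after splitting."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


gains = {f: information_gain(rows, f) for f in ("M", "W", "T")}
best = max(gains, key=gains.get)

# One-level tree: predict the majority class within each branch of `best`.
stump = {}
for value in (0, 1):
    branch = [r["E"] for r in rows if r[best] == value]
    stump[value] = Counter(branch).most_common(1)[0][0]

accuracy = sum(stump[r[best]] == r["E"] for r in rows) / len(rows)

for feature, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{feature}: gain = {gain:.3f}")
print(f"best feature = {best}, training accuracy = {accuracy:.3f}")
```

Applying `information_gain` again to the instances inside an impure branch gives the next decision point, which is how such a tree could be extended.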
(b) Apply your decision tree from (a) to a new movie-goer, who likes Black Widow and Black
Panther and Captain Marvel; will they like Endgame?
[1 mark for answer with justification]
(c) Extend your J48-style decision tree from (a) to a decision tree with 100% accuracy when
evaluated on the training set. Draw or write down your decision tree, and justify your choice of
features for decision points.
[4 marks: 100% accurate decision tree, justification]
(d) Apply your decision tree from (c) to the new movie-goer, who likes Black Widow and Black
Panther and Captain Marvel; does the revised decision tree predict they will like Endgame?
[1 mark for answer with justification]
(e) Marvel Studios put the same questions to a new, different group of 7 movie-goers, to
collect a separate test data-set. The decision tree from (c) is more accurate than the decision
tree from (a) when evaluated on the training set. Which decision tree, (a) or (c), is better to use
for predicting whether the test-set of movie-goers will like Endgame? Justify your answer.
[2 marks: (a) or (c) with justification]
(f) Apply the Apriori association rule mining algorithm to the Marvel Studios movie-goer
training data-set, to find all association rules linking 2 or more features, with at least 90%
accuracy and coverage of at least 4 instances.
[6 marks]
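The level-wise search behind Apriori can be sketched as follows. This is a minimal illustration under stated assumptions, not WEKA's Apriori associator: each item is taken to be an attribute=value pair, "coverage" is read as the number of instances containing every item of the rule, and "accuracy" as coverage divided by the number of instances matching the antecedent. The function names `support`, `apriori` and `rules` are hypothetical.

```python
from itertools import combinations

HEADER = ("E", "M", "W", "T")
DATA = [
    (1, 1, 0, 1),
    (0, 0, 1, 0),
    (1, 0, 1, 1),
    (0, 1, 1, 1),
    (0, 1, 0, 0),
    (1, 0, 0, 1),
    (1, 0, 1, 1),
]

# Each instance becomes a set of (attribute, value) items.
transactions = [frozenset(zip(HEADER, row)) for row in DATA]


def support(itemset):
    """Number of instances that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions)


def apriori(min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support}
    frequent = {}
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        size = len(next(iter(level))) + 1
        # Join step: merge frequent itemsets that differ in one item.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == size}
        # Prune step: every subset one item smaller must be frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in frequent
                        for s in combinations(c, size - 1))
                 and support(c) >= min_support}
    return frequent


def rules(frequent, min_confidence):
    """Generate (antecedent, consequent, coverage, accuracy) tuples."""
    found = []
    for itemset, count in frequent.items():
        for k in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), k):
                antecedent = frozenset(combo)
                confidence = count / frequent[antecedent]
                if confidence >= min_confidence:
                    found.append((antecedent, itemset - antecedent,
                                  count, confidence))
    return found


frequent = apriori(min_support=4)
found = rules(frequent, min_confidence=0.9)
for antecedent, consequent, count, confidence in found:
    print(dict(antecedent), "=>", dict(consequent), count, confidence)
```

The pruning step relies on the fact that an itemset cannot reach the coverage threshold unless all of its subsets do, which keeps the candidate set small.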
[Question 1 total: 20 marks]
Question 2
English and Arabic are both official languages in Sudan. English is used in many official and
scientific documents, but most Sudanese people speak Arabic as their first or main language.
The Sudan government wants to promote use of the Arabic language terms for plants and
animals found in Sudan, by replacing English terms for these plants and animals with Arabic
words (transliterated to the Roman alphabet) in all Sudanese English-language government
documents. For example, in National Park official documents, references to palm trees will be
replaced with the Arabic word for “palm”.
To achieve this, the Sudan government has acquired some text analytics data-set resources
which could be useful: a list of plants and animals found in Sudan, in both English and Arabic;
a large text corpus of existing Sudan government English-language documents; and the text
of a Sudanese English dictionary, a Sudanese equivalent of LDOCE, the Longman Dictionary of
Contemporary English.
However, some English words are ambiguous; for example, “palm” can be a type of plant but
also has another sense, “the inside of a hand”. It is important that an English word is replaced
by its Arabic translation only when the word in context is used in a plant or animal sense. The
government needs a method to automatically classify such ambiguous words in English
documents, to solve the problem of identifying plant or animal senses of ambiguous words.
(a) Outline a supervised machine learning solution to the problem of classifying the sense of
words in context. Explain why it could be expensive to develop the training data-set.
[4 marks: 3 marks for outline supervised ML solution, 1 mark for cost explanation]
(b) If the training data-set from (a) is converted to ARFF format, then you could load it into
WEKA to test several different classifiers on the data-set, and comparatively evaluate results,
to identify the best classifier. The WEKA Explorer Classify tab offers four Test options for
evaluating classifiers:
(i) Use training set
(ii) Supplied test set
(iii) Cross-validation
(iv) Percentage split
State an advantage and a disadvantage of each of these Test options in selecting the best
classifier for this task.
[8 marks: 1 mark advantage, 1 mark disadvantage for each option]
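The difference between options (iii) and (iv) lies in how instances are partitioned into training and test portions. The sketch below is an illustration only, not WEKA code: it assumes a hypothetical data-set of 100 instances, a 66% training split and 10 folds, and it uses simple shuffled, unstratified partitions, so WEKA's own procedure may differ in details such as stratification. The function names are made up for this example.

```python
import random


def percentage_split(n, train_percent=66, seed=1):
    """Shuffle instance indices once and cut them into train and test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_percent / 100)
    return indices[:cut], indices[cut:]


def cross_validation(n, folds=10, seed=1):
    """Yield (train, test) index lists; each instance is tested once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    parts = [indices[i::folds] for i in range(folds)]
    for i, test in enumerate(parts):
        train = [idx for j, part in enumerate(parts) if j != i
                 for idx in part]
        yield train, test


N = 100  # hypothetical number of labelled instances
train, test = percentage_split(N)
cv_folds = list(cross_validation(N))

print(len(train), "training and", len(test), "test instances in the split")
print(len(cv_folds), "folds, each with", len(cv_folds[0][1]), "test instances")
```

With a percentage split, only the held-out portion is ever used for testing, whereas cross-validation tests on every instance exactly once at the cost of training one model per fold.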
(c) Outline an unsupervised or semi-supervised method for the task of identifying plant or
animal senses of ambiguous words. Explain why this model could be less expensive to build
than your answer to (a).
[4 marks: 3 marks for outline un/semi-supervised solution, 1 mark for cost explanation]
(d) The Sudan government wants to train Sudanese university computer scientists in data
mining and text analytics, by offering them postgraduate scholarships to study overseas; and
as a first step, they want to collate a database of MSc and PhD programmes related to Data
Mining and Text Analytics offered by universities worldwide. Each degree programme is to be
represented in a database record, with fields for degree title, names of modules included,
location, cost, information source, etc.
The Sudan government seeks advice on what techniques to use to gather this information
from Web sources. Should they use Information Retrieval, or Information Extraction, or both?
Outline the difference between IR and IE, and give an overall recommendation to the Sudan
government, with justification.
[4 marks: 2 marks difference between IR/IE, 2 marks justified recommendation]
[Question 2 total: 20 marks]
Question 3
Kaggle.com offers online discussion forums, where users can post comments and questions.
Kaggle wants to develop a Machine Learning classifier to detect forum comments that use
offensive language and could be offensive to other users. The Kaggle forums receive a large
volume of comments. Only a small proportion of these comments are offensive, but to maintain
the Kaggle reputation for quality and fairness, it is important that all offensive comments are
identified and dealt with urgently. For Kaggle, it is most important that a classifier flags all
offensive instances, to be investigated further by customer service experts; the customer
service experts will focus on comments flagged by the classifier, and they do not want
offensive comments to “slip through the net” and not be dealt with. It does not matter as much
if the classifier incorrectly labels some innocent comments as offensive, because the customer
service experts should spot these mistaken instances and discount them. However, the
customer services managers would prefer to minimize time wasted on examining innocent
comments incorrectly flagged as offensive.
(a) Outline how to apply the CRISP-DM methodology to this data mining consultancy project.
[6 marks: 1 mark for each CRISP-DM phase applied to this task]
(b) Kaggle has provided a sample data-set of 100 comments, where each instance is labelled
with Class value: OFF or NOT. This data-set was used in experiments with three classifiers,
which we will call X, Y, and Z. The following are Confusion Matrix outputs for each classifier;
for example, for Classifier X, 90 NOT (innocent) instances were classified as NOT, and 10
OFF offensive instances were classified as NOT; in other words, all 100 comments were
classified as NOT.
X:
  a   b   <-- classified as
 90   0 |   a = NOT
 10   0 |   b = OFF
Y:
  a   b   <-- classified as
 10  80 |   a = NOT
  0  10 |   b = OFF
Z:
  a   b   <-- classified as
 80  10 |   a = NOT
  5   5 |   b = OFF
Which WEKA classifier behaves like X? What is the accuracy of X?
[2 marks]
(c) For classifiers Y and Z, calculate
(i) accuracy
(ii) precision in predicting offensive comments
(iii) recall in predicting offensive comments
[6 marks]
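The three measures can be read directly off a confusion matrix. The sketch below assumes the WEKA layout shown in (b), where rows are actual classes and columns are predicted classes, and treats OFF as the positive class; the function name `metrics` is illustrative. Precision is reported as `None` when a classifier never predicts OFF, since the ratio is then undefined.

```python
def metrics(matrix):
    """Return (accuracy, precision, recall) for the OFF class.

    matrix[actual][predicted], with index 0 = NOT and index 1 = OFF.
    """
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Precision: of the comments flagged OFF, how many really are OFF.
    precision = tp / (tp + fp) if (tp + fp) else None
    # Recall: of the truly OFF comments, how many were flagged.
    recall = tp / (tp + fn) if (tp + fn) else None
    return accuracy, precision, recall


MATRICES = {
    "X": [[90, 0], [10, 0]],
    "Y": [[10, 80], [0, 10]],
    "Z": [[80, 10], [5, 5]],
}

results = {name: metrics(m) for name, m in MATRICES.items()}
for name, (accuracy, precision, recall) in results.items():
    print(name, accuracy, precision, recall)
```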
(d) Which of X, Y and Z is worst and which is best in meeting Kaggle’s requirements, and
why?
[6 marks: 3 for worst and reason; 3 for best and reason]
[Question 3 total: 20 marks]