ISAD363
ADVANCED DATABASES
COURSEWORK ASSIGNMENT
MODULE LEADER
Dr Marco Palomino
DEADLINE
Tuesday 5 May 2020 at 16:00
ASSESSED MODULE LEARNING OUTCOMES
ALO1: Critically compare and contrast the differences between
relational and non-relational (NoSQL) databases.
ALO2: Critically appraise NoSQL database strengths and
weaknesses.
ALO3: Demonstrate the ability to perform all the CRUD
operations (namely, create, retrieve, update and delete) on
NoSQL databases.
1. BACKGROUND
Computer systems that make recommendations are called recommender systems. Netflix
recommends movies you may like to watch, Amazon recommends products you may be interested
in purchasing, and Facebook recommends people you may want to contact and become friends
with. The actual algorithms used by these companies to make such suggestions are closely
guarded trade secrets, and they are not available for public scrutiny. However, in this assignment,
you will have an opportunity to learn the basics behind recommender systems. You will use
Facebook as a case study for recommender systems. First, you will have to create a database to
store Facebook users and such a database will have to be a Neo4j graph database.
2. FRIENDSHIP GRAPHS
Figure 1 displays the relationships (friendships) graph for the main characters in Romeo and
Juliet, the tragedy published by William Shakespeare in 1597.
A link between character A and character B in Figure 1 means that A considers B a friend, and
also B considers A a friend—these links are undirected, despite the arrows. Note that Figure 1
shows the characters who belong to the House of Montague in orange, the characters who belong
to Ruling House of Verona in green, the characters who belong to the House of Capulet in pink,
the Franciscans in purple, and the Nurse who works for the House of Capulet in blue. The colours
are just to help us interpret the graph—colours are not mandatory in your assignment.
Figure 1. Romeo and Juliet friendship graph
3. FRIEND RECOMMENDATIONS
Formally speaking, recommending a new friend is equivalent to answering the following
question: “For user X, list some non-friends of X in order, starting with the best friend
recommendation and ending with the worst.” A non-friend is a user who is not X and is not a
friend of X. For example, for the user Mercutio in Figure 1, the list of non-friends is:
Benvolio
Capulet
Friar Lawrence
Juliet
Montague
The Nurse
Tybalt
3.1. Recommend by number of common friends
If non-friend Y is your friend’s friend, then maybe Y should be your friend too. If user Y is the
friend of many of your friends, then Y is an even better recommendation. The best friend
recommendation is the user with whom you have the largest number of mutual friends. For
example, consider Mercutio in Figure 1.
Mercutio has one friend in common with Benvolio (Romeo).
Mercutio has two friends in common with Capulet (Paris and Prince Escalus).
Mercutio has one friend in common with Friar Lawrence (Romeo).
Mercutio has one friend in common with Juliet (Romeo).
Mercutio has two friends in common with Montague (Prince Escalus and Romeo).
Mercutio has no friends in common with the Nurse.
Mercutio has no friends in common with Tybalt.
Therefore, Capulet and Montague are the best friend recommendations for Mercutio, and the
Nurse and Tybalt are the worst recommendations.
Note that the recommendations might not be symmetric: the best friend recommendation for
Montague might be Mercutio, but the best friend recommendation for Mercutio might be Capulet,
rather than Montague.
3.2. Recommend by influence scoring
We will now study a different algorithm. Consider the following hypothetical situation.
Two of your friends are J. K. Rowling and James Corden.
J. K. Rowling has only two friends (you and one other person).
James Corden has 7,000,000,000 friends.
J. K. Rowling and James Corden have no friends in common (apart from you).
Since J. K. Rowling is highly selective in terms of friendship, and is a friend of yours, you are
likely to have a lot in common with J. K. Rowling’s other friend. On the other hand, James Corden
is indiscriminate and there is little reason to believe that you should be friendly with any particular
one of James Corden’s other friends.
Let us incorporate the above idea into a friend recommendation algorithm. Here is the concrete
way to do so—note that we call the technique “influence scoring”: Suppose that PersonA and
PersonB have three friends in common: Friend1, Friend2, and Friend3. When recommending
friends by influence scoring, the score for PersonB as a friend of PersonA is 1/numfriends(Friend1)
+ 1/numfriends(Friend2) + 1/numfriends(Friend3), where numfriends(f) is the number of friends
that f has. In other words, each friend F of PersonA has a total influence score of 1 to contribute,
and divides it equally among all of her friends.
In the example above, J. K. Rowling’s other friend would have a score of 1/2, and each of James
Corden’s friends would have a score of 1/7000000000.
3.3. Algorithm comparison
Does the change of recommendation algorithm make any difference? As you should be able to
find out, Mercutio gets the same friend recommendations with both recommendation algorithms
(number of common friends and influence scoring). In fact, there are 5 characters in the Romeo
and Juliet scenario for whom the recommendations are the same, regardless of the algorithm.
However, there are 6 characters for whom the recommendations are different.
4. THE DATASET
The assignment employs a Facebook dataset provided as a courtesy by the Max Planck Institute
for Software Systems. The data is described in more detail in the following publication:
Viswanath, B., Mislove, A., Cha, M. and Gummadi, K.P., 2009, August. On
the Evolution of User Interaction in Facebook. In Proceedings of the 2nd ACM
Workshop on Online Social Networks (pp. 37-42). ACM.
The Facebook dataset contains, exactly, 63,731 Facebook users—no duplicates are included—
and 817,090 friendship relationships among them—this is the total number of links established
via friendships. Do not be alarmed if importing the Facebook dataset into Neo4j takes a while.
The number of friendships is large! However, if it takes longer than 10 minutes, it probably means
that you are doing something wrong—perhaps, you did not index the right attributes in your
database. Also, do not try to draw the whole Facebook graph at once—this may cause your
computer to “freeze”. Besides, even if you successfully draw the entire graph, you would not learn
much from a tangled mess of 817,090 links.
The Facebook dataset for your coursework can be downloaded from the module DLE website
under the Coursework Section—look for a number of files stored in a folder named Facebook
Dataset. Although the dataset is spread across different files, you are expected to import every
single data item into the database that you will submit—marks will be deducted for failing to
include all the data provided.
4. THE PROBLEM
4.1. Recommend by number of common friends
For every Facebook user in the dataset with an ID number that is a multiple of 980, provide a list
containing the first 10 friend recommendations, as determined by the number of common friends.
If there are fewer than 10 recommendations, provide all the recommendations. For each case, you
will have to provide the Neo4j CQL query that generated the friend recommendations, as well as
the actual recommendations. Note that the recommendations alone will not receive any marks,
and the Neo4j CQL queries must be provided in text format (screenshots will be awarded zero
marks). Also, note that your queries will be tested. Hence, queries that do not produce the
submitted results will invalidate the answers and zero marks will be awarded.
An example of the format in which your results should be delivered is listed below in Table 1.
Please, note that the CQL query and the recommendations listed in Table 1 are shown only to
illustrate the format, and they are not actual solutions to the problem in question.
User ID CQL Query Friend recommendation
… … …
13720 Match (friend:FacebookUser {id:
‘13790’})-[r:FRIENDS_WITH]-(
user:FacebookUser) RETURN friend
17125, 7033, 15462,
33049, 51105, 16424,
23, 7996, 1539,
17420
14700 Match (friend:FacebookUser {id:
‘14775’})-[r:FRIENDS_WITH]-(
user:FacebookUser) RETURN friend
14473, 14495, 17951,
19611, 22749, 23259,
30002, 3154, 8269,
862
15680 Match (friend:FacebookUser {id:
‘15760’})-[r:FRIENDS_WITH]-(
user:FacebookUser) RETURN friend
28606
… … …
Table 1
4.2. Recommend by influence scoring
For every Facebook user in the dataset with an ID that is a multiple of 980, provide a list
containing the first 10 friend recommendations, as determined by influence scoring. If there are
fewer than 10 recommendations, provide all the recommendations.
Once again, the output format should be the same shown in Table 1. You are expected to provide
the Neo4j CQL query that generated the friend recommendations and the actual
recommendations. Note that the recommendations alone will not receive any marks, and the
Neo4j CQL queries must be provided in text format (screenshots will be awarded zero marks).
Also, note that your queries will be tested. Hence, queries that do not produce the submitted results
will invalidate the answers and zero marks will be awarded.
4.3. Algorithm comparison
Considering only those 65 Facebook users with an ID that is a multiple of 980, identify the
Facebook users who have the same first 10 friend recommendations under both recommendation
algorithms, and the Facebook users who have different first 10 friend recommendations under the
two algorithms. Your answers should appear in the format depicted in Table 2 (see the following
page. Please, note that the answers listed in Table 2 are shown only to illustrate the format in
which you should submit the solutions; yet, they may not coincide with the actual solutions to the
problem in question.
The code used for comparing the output of both algorithms has to be implemented using C# or
Python (no other choices will be allowed), and it has to be included as part of your submission. If
you do it manually, without using Python or C#, your solution will not be accepted and zero marks
will be awarded.
4.4. Report: Which algorithm and database model do you propose to use?
Produce a report (1,000 words maximum) to explain which algorithm you propose to recommend
new friends in Facebook, and whether you will implement it using a relational or non-relational
database. Your report must be supported by evidence derived from the output you obtained in
Section 4.1, Section 4.2 and Section 4.3. For example, if you propose to use influence scoring,
you will have to use the results of the previous sections to explain why this is the best option.
IDs of users having the same output under both algorithms:
1960
4900
5880
…
IDs of users having different output under both algorithms:
980
2940
3920
…
Table 2
Before reaching a conclusion regarding which algorithm you wish to propose, you may want to
compare both algorithms using more than the 65 Facebook users whose ID is a multiple of 980,
but you will have to refer to the additional results as evidence to support your decision. It may
also be the case that you propose a combination of the two algorithms explored above. While this
is a suitable alternative, you have to be very clear in terms of explaining how exactly you attempt
to combine the algorithms (which one you will apply first, how much weight is assigned to each
algorithm, what happens if the algorithms agree or disagree, etc.).
Finally, you may want to combine the two algorithms explored with other ideas of your own (or
ideas you have researched separately). You will still have to be very clear in terms of how your
proposed algorithm will work and what evidence you present to support this.
5. DELIVERABLES
Submit the following two files via the DLE.
(1) A WORD document containing (documents that are not in WORD format will not be
reviewed and zero marks will be awarded):
(a) The solutions for Section 4.1 in the format illustrated by Table 1.
(b) The solutions for Section 4.2. Once again, the solutions should be submitted
following the format illustrated by Table 1.
(c) The solutions for Section 4.3. The solutions should be submitted following the
format of Table 2.
(d) A report explaining which algorithm you propose to use to make friend
recommendations and whether you will implement it using a relational or non-
relational database.
(2) The C# or Python code employed to obtain the answers for Section 4.3.
6. ASSESSMENT AND GRADE CRITERIA
Task Marks
1 Recommend by number of common friends
Explanation: See Section 4.1.
Note: This is a core module for the MSc Data Science and Business
Analytics; thus, if the submitted Neo4j CQL queries do not execute,
marks will not be granted.
Learning Outcomes: ALO2, ALO3.
25
2 Recommend by influence scoring
Explanation: See Section 4.2.
Note: This is a core module for the MSc Data Science and Business
Analytics; thus, if the submitted Neo4j CQL queries do not execute,
marks will not be granted.
Learning Outcomes: ALO2, ALO3.
25
3 Algorithm comparison (answers) and the source code (C# or
Python) employed to compare the results of both algorithms.
Explanation: See Section 4.3.
Note: The code you submit is expected to execute correctly. If the
submitted code does not compile and execute, the mark for this part
of the assignment will be zero (0).
Learning Outcomes: ALO2, ALO3.
25
5 Report: Which algorithm and database model do you propose to
use?
Explanation: See Section 4.4.
Learning Outcomes: ALO1, ALO2.
25
7. EXTENUATING CIRCUMSTANCES AND PLAGIARISM
Please note that the University enforces a penalty of zero percent for work submitted after the
published deadline without valid extenuating circumstances (ECs). The University’s policy on
ECs and the EC application form are available at:
https://www.plymouth.ac.uk/student-life/your-
studies/essential-information/exams/exam-rules-and-
regulations/extenuating-circumstances
Also, note that this is an individual assignment and must reflect your own work. Thus, while you
may discuss, in general terms, the assignment with your fellow students, the assignment MUST
be your own work. Do not share designs or code with anyone, OR submit a program design or
code that is, wholly or partially, someone else’s work.
Marks will be awarded according to the following criteria
Mark Grade Criteria
Unprofessional
(0-39%)
The quality of the work has not met the learning outcomes. Understanding
and application of fundamental concepts and techniques is questionable.
Work of this quality would not be acceptable in professional employment.
Poor
(40-49%)
The quality of work has only met the threshold level but still requires further
work to get it to a better standard. Your submission contains logical and
analytical errors related to analysis and design techniques. Also, it only
demonstrates a basic understanding of the subject competence. Further
improvement is required to demonstrate personal thoroughness, effort and
independent learning.
Fair
(50-59%)
The quality of work submitted suggests that you have demonstrated a fair
understanding of the analysis and design techniques. Still, the work you have
submitted contains some errors and incomplete analysis and design. Also, it
demonstrates you are able to apply your knowledge but need to improve
understanding of the subject competence and personal thoroughness, effort
and independent learning.
Good
(60-69%)
The quality of the work submitted suggests that you are able to apply the
analysis and design techniques well. The work you have submitted is
substantially correct and complete. Also, it demonstrates a good
understanding of subject competence and personal thoroughness, effort and
independent learning.
Excellent
(70-100%)
The quality of work is outstanding with no significant flaws. It demonstrates
a high level of subject knowledge and competence; personal thoroughness,
effort and independent learning; and possibly significant additional
analytical/critical thought. Well done!