ECE 795 -- Advanced Big Data Analytics
Final Project: Comprehensive Design of Big Data Analyses
Assigned: March 3, 2020 Spring 2020
Project Demonstration: April 14 and 16, 2020
Project and Report: April 16, 2020
In this project, you will need to leverage the knowledge and tools discussed in this course to
design a comprehensive big data analysis workflow. Please select one task from the following
(first come, first served); each task allows at most five people to work on it (Task 1 allows six
people, each aiming for a different format conversion path). Please make sure to provide
sufficient comments in your code to receive full credit. For the sake of space, the references,
hints, and some requirements are not included here. Please find the complete description of each
task on GitHub.
Task 1: Large Scale Web Record Format Conversion
1. Download the provided CSV data from the link and store it in HDFS.
2. Pick one of the data format conversion paths in the following:
a. CSV to XML to JSON
b. CSV to XML to YAML
c. CSV to JSON to XML
d. CSV to JSON to YAML
e. CSV to YAML to XML
f. CSV to YAML to JSON
3. Implement a PySpark application to pre-process the raw data if necessary and convert
the original CSV data to the first data format you chose in Step 2. Afterwards, convert the
data again to the second data format chosen in Step 2 (a minimal PySpark sketch is given
after this task description).
4. Repeat Step 3 after increasing the number of workers to 3 and 4 in the cluster. Compare
the computing times before and after the changes and plot the figure "Computing time vs.
#workers".
5. Note: there will be two sets of CSV files as inputs. One is a large number of small
CSV files and the other is a single large CSV file. Please make sure your PySpark
application can handle both cases, and provide a performance analysis comparing the two
input sets.
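As a rough illustration of Step 3 (here using conversion path d, CSV to JSON to YAML), the minimal
PySpark sketch below reads the CSV data from HDFS, writes it as JSON with Spark's built-in writer,
and then re-serializes each JSON record as YAML with PyYAML. The HDFS paths are placeholders and
PyYAML is assumed to be installed on every worker; adapt the sketch to your chosen path and cluster.

from pyspark.sql import SparkSession
import json
import yaml  # PyYAML; assumed to be installed on all workers

spark = SparkSession.builder.appName("csv-format-conversion").getOrCreate()

# Placeholder HDFS paths; replace with the actual dataset locations.
csv_path = "hdfs:///task1/input/*.csv"
json_path = "hdfs:///task1/json"
yaml_path = "hdfs:///task1/yaml"

# First conversion: CSV -> JSON using Spark's built-in reader and writer.
df = spark.read.option("header", True).csv(csv_path)
df.write.mode("overwrite").json(json_path)

# Second conversion: JSON -> YAML. Spark has no YAML writer, so each JSON
# line is parsed and re-serialized with PyYAML inside an RDD map.
# Flow style keeps every record on a single output line.
json_lines = spark.read.text(json_path).rdd.map(lambda row: row.value)
yaml_docs = json_lines.map(
    lambda line: yaml.safe_dump(json.loads(line), default_flow_style=True).strip())
yaml_docs.saveAsTextFile(yaml_path)

spark.stop()

For Step 4, the same job can be wrapped in simple wall-clock timing (e.g., time.time() before and
after the writes) and re-run after resizing the cluster to 3 and 4 workers to produce the
"Computing time vs. #workers" plot.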
Task 2: Stack Overflow Data Analysis in PySpark
1. Use the Google Cloud BigQuery API to load the provided data into HDFS.
2. Use PySpark to read the data from your clusters.
3. Analyze the data and answer the following questions (a minimal PySpark sketch is given
after this task):
a. How many questions were posted from Sept. 1, 2019 to Dec. 31, 2019?
b. What percentage of the questions posted over the above period have been
answered?
c. On average, how long did it take for questions over the above period to be
answered on the website?
4. Using the questions provided in Step 3 as examples, perform further analyses on the given
dataset and try to find different types of useful information. Please implement all the
analyses in PySpark and justify the conclusions of your analyses with the results of the
code. Your report can be designed to cover tasks such as the following.
a. Find one way to improve the answer rate for a question.
b. Generate an analysis of how the Stack Overflow user base has changed over the last twelve years.
c. Generate a review of topical trends during the previous twelve years.
5. The complexity and novelty of the analyses will have an impact on the scoring. External
data may be used along with the Stack Overflow data.
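As a hint for the first two questions in Step 3, the sketch below assumes the questions table has
been exported from BigQuery to Parquet in HDFS and keeps the column names of the public Stack
Overflow dataset (creation_date, answer_count); the path and column names should be adjusted to
your actual export.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stackoverflow-analysis").getOrCreate()

# Placeholder path; assumes the posts_questions table was exported to Parquet.
questions = spark.read.parquet("hdfs:///task2/posts_questions")

# (a) Number of questions posted from Sept. 1, 2019 to Dec. 31, 2019.
period = questions.filter(
    (F.col("creation_date") >= "2019-09-01") &
    (F.col("creation_date") < "2020-01-01")
)
num_questions = period.count()

# (b) Percentage of those questions with at least one answer.
answered = period.filter(F.col("answer_count") > 0).count()
answer_rate = 100.0 * answered / num_questions if num_questions else 0.0

print(num_questions, answer_rate)

Question (c) additionally requires joining the answers (or accepted answers) back to the questions
so that each answer timestamp can be compared with the question's creation_date.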
Task 3: Publication Analysis for Chosen Universities from Google Scholar
1. Pick a list of universities and search for them on the profile pages of the Google Scholar website.
2. Implement a web crawler to identify the top 300 professors (ranked by total citations) from
each university's homepage, find the complete paper list from the homepage of each
identified professor, and store all the related web pages in HDFS.
a. https://scholar.google.com/citations?view_op=view_org&org=16589592566991147599&hl=en&oi=io
(homepage of the University of Miami).
b. https://scholar.google.com/citations?hl=en&user=7fQX_pYAAAAJ (homepage of
Prof. A. Parasuraman, who has the most citations at the University of Miami).
c. The total number of papers collected should be no fewer than 1,000,000 (covering more
than 10 universities).
d. cstart and pagesize are the URL parameters used to scan the paper list (a minimal
crawling sketch is given after this task).
3. Find the fastest way to determine the best co-author of each professor and justify why
your method is the fastest.
4. Use PySpark to partition the collected papers in various ways and analyze the
collected data. Please justify the conclusions of your analysis with the results of the
code. Your report can be designed to cover tasks such as the following (the complexity
and novelty of the analyses will have an impact on the scoring).
a. Generate an analysis of the best department in each university.
b. Generate a review of popular research keywords in each university over recent
years.
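As a hint for the paging mentioned in Step 2d, the sketch below walks through one professor's
paper list by increasing cstart in steps of pagesize. The user ID is the example profile above,
the stopping check on the gsc_a_tr row class is an assumption about the page markup, and Google
Scholar may throttle automated requests; the downloaded pages would then be copied into HDFS for
the PySpark analyses.

import time
import requests

BASE = "https://scholar.google.com/citations"
user_id = "7fQX_pYAAAAJ"   # example profile from the task description
pagesize = 100             # papers fetched per request
max_pages = 50             # safety cap so the loop always terminates

pages = []
for page_idx in range(max_pages):
    params = {"hl": "en", "user": user_id,
              "cstart": page_idx * pagesize, "pagesize": pagesize}
    resp = requests.get(BASE, params=params, timeout=30)
    resp.raise_for_status()
    pages.append(resp.text)
    # Stop once the response no longer contains paper rows
    # (gsc_a_tr is assumed to be the row class of the paper table).
    if "gsc_a_tr" not in resp.text:
        break
    time.sleep(1)   # be polite; Scholar rate-limits aggressive crawlers

# `pages` now holds raw HTML to be stored in HDFS (e.g., via hdfs dfs -put).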
Task 4: Word Count on Streaming Tweets
1. Set up Cloud Dataflow and configure it correctly in Google Cloud Platform.
2. Use Cloud Dataflow to import tweets from the Twitter API with keywords of your
selection.
3. Use PySpark to perform a word count over all newly arriving tweets in a configurable
interval (as small as possible) and save the results (a minimal streaming sketch is given
after this task).
4. Please test the word count system and report the smallest interval it supports (e.g., 1
min). Please explain what the bottleneck is that prevents a smaller interval.
5. Write a PySpark application to count the number of tweets whose word count falls within
a given range.
6. Plot the distribution of tweet word count for a given time interval.
7. Compare the performance of computing the word count distribution from the raw data
versus from the saved results. Repeat the comparison multiple times with different numbers
of tweets and plot the figure "Computing time vs. the number of tweets".
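As a hint for Steps 3 and 4, the Structured Streaming sketch below assumes that the Dataflow
pipeline lands incoming tweets as text files (one tweet per line) in an HDFS directory; if your
pipeline delivers tweets through Pub/Sub or Kafka instead, only the source definition changes.
The trigger interval is the knob to shrink when probing the smallest supported interval.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tweet-wordcount").getOrCreate()

# Assumption: Cloud Dataflow writes each tweet as a line of text into this
# directory; replace the source with whatever your pipeline actually uses.
tweets = spark.readStream.text("hdfs:///task4/tweets")

# Split tweets into words and keep a running count per word.
words = tweets.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
counts = words.groupBy("word").count()

# Emit updated counts at a configurable interval (Step 3); shrink the trigger
# when probing the smallest interval the cluster can sustain (Step 4).
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="1 minute")
         .start())
query.awaitTermination()

The console sink is only for illustration; persisting the per-interval counts for the later steps
would typically go through foreachBatch so that each micro-batch can be written out to HDFS.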
Please turn in a written report of your project (no more than 6 pages, using the same template as
in the first project) including:
Instructions on how to compile and run your program
Documented program listings
The design of your implementation
Detailed discussion of your implementation and analyses
Necessary diagram(s), flowchart(s), pseudo code(s), etc. for your implementation
A conclusion, summarizing your understanding and analyses
A list of references, if any.
The final report (submitted to Blackboard) and code are due on April 16, 2020. Project
demonstrations (no more than 8 minutes each) are on April 14, 2020 (Tasks 1 and 2) and April 16, 2020
(Tasks 2, 3, and 4).