Data Gathering, Transformation, and Enrichment

Scenario

In the highly competitive movie streaming services market, your client has asked for help enriching their data with publicly available data from the Internet Movie Database (IMDb), one of the most popular websites for movie data. Using a dataset from your client that contains various data about movies, you are tasked with scraping publicly available online data, transforming the scraped data into a structured format, and integrating it with your client's data to produce an enriched dataset.

Key Tasks

This project has two parts with multiple tasks and separate deliverables for each part. Read each set of instructions carefully.

PART A: Data Gathering, Transformation, and Enrichment.

Download and unzip Project 03.zip, which contains the files needed for this project.

The project zip file includes three files:

Project_3_Part_A.ipynb

TopVoted_500_Movies_HTML.txt

Movies.csv

Perform data gathering using web scraping to enrich your client's dataset (Movies.csv), which contains the 500 top-voted movies released between 2018 and 2020. The dataset includes the following fields:

movie_id: alphanumeric unique identifier of the title.

originalTitle: original title, in the original language.

description: a short description of the movie.

ratingCategory: movie rating for empowering families to make informed movie choices.

genres: includes up to three genres associated with the title.

Rename the Project_3_Part_A.ipynb Jupyter notebook by adding your Group# or Lastname to the filename. Edit the code in the notebook to complete the following tasks:

1. Conduct Data Gathering:

Scrape this IMDb webpage of movies released between 2018 and 2020, sorted by number of user votes in descending order. Pull movie_id, rank, title, year, rating, and votes for the top 500 movies on that list.

Transform the scraped data into a structured format and write it to a CSV file (name it IMDb_TopVoted.csv).
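The gathering step above can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the markup in SAMPLE_HTML, its class names, and the two example movies are hypothetical stand-ins, so the real page (or the provided TopVoted_500_Movies_HTML.txt) must be inspected and the selectors adjusted accordingly. The helper names MovieListParser, scrape, and write_csv are likewise made up for this sketch. Values are written to the CSV as raw text; cleansing is left to the enrichment step.

```python
import csv
import io
import re
from html.parser import HTMLParser

# Hypothetical markup: the real page layout may differ and must be inspected first.
SAMPLE_HTML = """
<div class="lister-item">
  <span class="rank">1.</span>
  <a href="/title/tt0000001/">Example Movie One</a>
  <span class="year">(2019)</span>
  <span class="rating">8.4</span>
  <span class="votes">1,234,567</span>
</div>
<div class="lister-item">
  <span class="rank">2.</span>
  <a href="/title/tt0000002/">Example Movie Two</a>
  <span class="year">(I) (2018)</span>
  <span class="rating">7.9</span>
  <span class="votes">987,654</span>
</div>
"""

FIELDS = ["movie_id", "rank", "title", "year", "rating", "votes"]


class MovieListParser(HTMLParser):
    """Collect one dict per <div class="lister-item"> block."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # movie currently being filled in
        self._field = None  # field that the next text node belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css = attrs.get("class", "")
        if tag == "div" and css == "lister-item":
            self._row = {}
        elif self._row is not None and tag == "a":
            # The movie_id is taken from the title link, e.g. /title/tt0000001/
            match = re.search(r"/title/(tt\d+)/", attrs.get("href", ""))
            if match:
                self._row["movie_id"] = match.group(1)
                self._field = "title"
        elif self._row is not None and tag == "span" and css in FIELDS:
            self._field = css

    def handle_data(self, data):
        if self._row is not None and self._field:
            self._row[self._field] = data.strip()
            self._field = None

    def handle_endtag(self, tag):
        if tag == "div" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape(html):
    """Return a list of row dicts parsed from the HTML text."""
    parser = MovieListParser()
    parser.feed(html)
    return parser.rows


def write_csv(rows, handle):
    """Write the rows as CSV with a header, in FIELDS order."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


rows = scrape(SAMPLE_HTML)
# In the notebook, replace the buffer with:
# open("IMDb_TopVoted.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

A third-party parser such as BeautifulSoup would shorten the parsing code; the standard-library parser is used here only to keep the sketch free of extra dependencies.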

2. Conduct Data Enrichment:

Import the Movies.csv file into a pandas DataFrame called df1.

Import the scraped data from the IMDb_TopVoted.csv file into a pandas DataFrame called df2.

Implement data cleansing and transformation on df2.

Enrich the given dataset (df1) by merging it with the scraped data (df2).

Rearrange the dataset fields to be listed in the following order:

   o movie_id, rank, title, originalTitle, description, year, votes, rating, runtimeMinutes, ratingCategory, genres

Export the enriched dataset to a CSV file:

   o Use the following naming convention: Project_3_Part_A_Group#.csv
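The enrichment steps above might look roughly like the following in pandas. This is a sketch, not a reference solution: the two in-memory CSV snippets are made-up stand-ins for Movies.csv and IMDb_TopVoted.csv, the raw formats being cleaned ("1.", "(I) (2018)", "1,234,567") are assumptions about what the scrape produces, and runtimeMinutes is assumed to come from Movies.csv because it appears in the required column order but not in the scraped fields.

```python
import io

import pandas as pd

# Made-up stand-ins for Movies.csv and IMDb_TopVoted.csv (assumed layouts).
movies_csv = io.StringIO(
    "movie_id,originalTitle,description,runtimeMinutes,ratingCategory,genres\n"
    'tt0000001,Example Movie One,A made-up description.,120,PG-13,"Action,Drama"\n'
    "tt0000002,Film Exemple Deux,Another made-up description.,95,R,Comedy\n"
)
scraped_csv = io.StringIO(
    "movie_id,rank,title,year,rating,votes\n"
    'tt0000001,1.,Example Movie One,(2019),8.4,"1,234,567"\n'
    'tt0000002,2.,Example Movie Two,(I) (2018),7.9,"987,654"\n'
)

COLUMN_ORDER = [
    "movie_id", "rank", "title", "originalTitle", "description", "year",
    "votes", "rating", "runtimeMinutes", "ratingCategory", "genres",
]

df1 = pd.read_csv(movies_csv)
# Read everything as text so that values like "1." are not silently parsed as floats.
df2 = pd.read_csv(scraped_csv, dtype=str)

# Cleansing, under the assumed raw formats:
df2["rank"] = df2["rank"].str.rstrip(".").astype(int)                    # "1." -> 1
df2["year"] = df2["year"].str.extract(r"(\d{4})", expand=False).astype(int)  # "(I) (2018)" -> 2018
df2["votes"] = df2["votes"].str.replace(",", "", regex=False).astype(int)    # "1,234,567" -> 1234567
df2["rating"] = df2["rating"].astype(float)

# Enrich df1 with the scraped columns; validate guards against duplicate ids.
enriched = df1.merge(df2, on="movie_id", how="inner", validate="one_to_one")
enriched = enriched[COLUMN_ORDER]

# In the notebook: enriched.to_csv("Project_3_Part_A_Group#.csv", index=False)
csv_text = enriched.to_csv(index=False)
print(csv_text)
```

An inner join is used on the assumption that both files describe the same 500 movies; comparing len(enriched) with len(df1) afterwards is a quick way to confirm that no rows were dropped.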

PART B: Automate Data Transformation and Integration.

Use Alteryx to automate the process that you applied in Part A to clean, transform, and integrate the data.

1. Create an Alteryx workflow to:

a. Import the IMDb_TopVoted.csv dataset you created in Part A.

b. Do the necessary data cleansing and transformation.

c. Import the Movies.csv dataset.

d. Merge the two datasets to obtain the enriched dataset.

e. Sort the enriched dataset by rank in ascending order, and rearrange the dataset fields to be listed in the following order:

movie_id, rank, title, originalTitle, description, year, votes, rating, runtimeMinutes, ratingCategory, genres

f. Split genres into genre01, genre02, and genre03. Drop the original genres column after the splitting.

g. Export the enriched dataset to a CSV file:

Use the following naming convention: Project_3_Part_B_Group#.csv
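Part B itself is built in Alteryx, but the expected shape of the genre split can be cross-checked against the Part A output with a few lines of pandas. The sketch below assumes that genres is a comma-separated string holding up to three values; the sample rows are made up.

```python
import pandas as pd

# Made-up rows; the comma delimiter is an assumption about the genres field.
df = pd.DataFrame(
    {
        "movie_id": ["tt0000001", "tt0000002"],
        "genres": ["Action,Adventure,Sci-Fi", "Comedy"],
    }
)

# Split into at most three parts, one column per part.
parts = df["genres"].str.split(",", n=2, expand=True)
# Guarantee exactly three columns even if no row has three genres.
parts = parts.reindex(columns=range(3))
parts.columns = ["genre01", "genre02", "genre03"]

# Drop the original column and attach the three new ones.
result = pd.concat([df.drop(columns="genres"), parts], axis=1)
print(result)
```

Movies with fewer than three genres end up with missing values in the unused columns, which is the behaviour to look for in the Alteryx output as well.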

2. In a Word document, provide a brief report covering the following:

Screenshot of your PART B Alteryx workflow.

What data was used to enrich the client's data?

Describe the data cleaning and transformation that was implemented.

What to Submit:

PART A: Upload the following 4 files:

The edited Jupyter notebook in .IPYNB format with annotations that explain and document your work.

A copy of the Jupyter notebook in .HTML format.

CSV file for the scraped data (IMDb_TopVoted_Group#.csv).

CSV file for the enriched dataset (Project_3_Part_A_Group#.csv).

PART B: Upload the following 3 files:

Alteryx file for the workflow (Project_3_Part_B_Group#.yxmd).

CSV file for the output enriched dataset (Project_3_Part_B_Group#.csv).

Word document with written description (Project_3_Part_B_Group#.doc).

Note: do not submit as a zip file.







