Assignment 4 Web Scraping
STAT141B Spring 2024
Due: June 1, 10pm
Submit via Canvas
In this assignment, we will use Craigslist.org as an example Web site to explore Web scraping. Remember that the site's terms of service generally prohibit scraping, so you should not use the data for any purpose other than this assignment.
We will look at apartments/housing for rent and collect information such as
• the rent amount
• number of bedrooms
• number of bathrooms
• type of housing
• square-footage
• address or location information
• whether they allow pets
• amenities such as
– laundry,
– parking: garage (attached, detached, carport), street, off-street
– gym
– internet
– furnished
– pool
– EV charging
– air conditioning
– . . .
• length of lease
• security deposit
• utilities included or not?
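As a starting point for several of the variables above, the following is a minimal sketch in base R that pulls numeric attributes out of a posting's text with regular expressions. The function name parse_listing_text and every pattern in it are assumptions about common phrasing (e.g. "2BR", "900ft2"), not Craigslist's actual markup; inspect real pages and adjust before relying on it.

```r
# Hypothetical helper: extract a few attributes from one posting's text.
# The patterns are guesses about typical phrasing and will need tuning.
parse_listing_text <- function(txt) {
  # Return the first captured number for a pattern, or NA if no match.
  first_num <- function(pattern) {
    m <- regmatches(txt, regexec(pattern, txt, ignore.case = TRUE, perl = TRUE))[[1]]
    if (length(m) < 2) NA_real_ else as.numeric(gsub(",", "", m[2]))
  }
  data.frame(
    rent      = first_num("\\$\\s*([0-9][0-9,]*)"),
    bedrooms  = first_num("([0-9]+)\\s*br\\b"),
    bathrooms = first_num("([0-9]+(?:\\.[0-9]+)?)\\s*ba\\b"),
    sqft      = first_num("([0-9][0-9,]*)\\s*ft2"),
    pets      = grepl("cats are ok|dogs are ok|pets? (ok|allowed|welcome)",
                      txt, ignore.case = TRUE),
    stringsAsFactors = FALSE
  )
}

# Example on a made-up string (not real data).
parse_listing_text("$2,150 / 2BR / 1.5Ba 900ft2 - cats are OK")
```

Returning a one-row data.frame per posting makes it easy to stack the results later with rbind().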
We’ll look at areas around UC Davis and UC Berkeley and compare rental prices for equivalent housing.
The regions to focus on are
• 10 mile radius from UC Davis
• 6 mile radius from UC Berkeley but only in the East Bay.
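If the listings you collect carry latitude and longitude, one way to apply these radius limits is a great-circle (haversine) distance filter. The sketch below is only illustrative: the campus coordinates are approximate values assumed here, the lat and lon column names are hypothetical, and the radius alone does not enforce the "East Bay only" restriction.

```r
# Great-circle distance in miles between points given in decimal degrees.
haversine_miles <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  3958.8 * 2 * asin(pmin(1, sqrt(a)))  # 3958.8 = mean Earth radius in miles
}

# Approximate campus coordinates (assumed; verify before relying on them).
campuses <- list(
  davis    = c(lat = 38.5382, lon = -121.7617),
  berkeley = c(lat = 37.8719, lon = -122.2585)
)

# Keep only the rows of `listings` within `miles` of `center`.
# Assumes `listings` has numeric columns named lat and lon.
within_radius <- function(listings, center, miles) {
  d <- haversine_miles(center[["lat"]], center[["lon"]],
                       listings$lat, listings$lon)
  listings[!is.na(d) & d <= miles, , drop = FALSE]
}
```

For example, within_radius(listings, campuses$davis, 10) would keep the rows within 10 miles of the assumed UC Davis coordinates.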
You can, of course, look at other regions, in addition to these two.
You are to write functions that allow you to collect data from two different Craigslist.org sites
• sacramento.craigslist.org
• sf.craigslist.org
and for these different regions/areas of interest.
The top-level function should return a data.frame with one row for each apartment/house. The columns should contain the variables described above.
Use the functions to collect the information for all of the search results from the first 5 pages of results for these regions of interest.
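One possible shape for the top-level function is sketched below. It is not a prescribed design: collect_listings, fetch_page, and parse_page are hypothetical names, and the fetching and parsing functions are passed in as arguments so the skeleton can be tested without touching the network.

```r
# Hypothetical top-level collector.
#   fetch_page(site, page) returns the raw content of one results page.
#   parse_page(content)    returns a data.frame with one row per listing.
# Both are placeholders you would write yourself.
collect_listings <- function(site, region, fetch_page, parse_page, pages = 1:5) {
  per_page <- lapply(pages, function(p) {
    df <- parse_page(fetch_page(site, p))
    # Record where each row came from so results can be compared later.
    df$site   <- rep(site, nrow(df))
    df$region <- rep(region, nrow(df))
    df$page   <- rep(p, nrow(df))
    df
  })
  # Stack the per-page data.frames into one; assumes identical columns.
  do.call(rbind, per_page)
}
```

Keeping the page loop separate from fetching and parsing also makes it straightforward to run the same code on a single page first.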
Verify that the data are correct and compare the results for the two areas.
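A few automated sanity checks can complement reading postings by hand. The sketch below assumes a data.frame with columns named rent, bedrooms, and region (hypothetical names), and the rent thresholds are arbitrary cut-offs chosen for illustration only.

```r
# Count some common problems in a collected data.frame.
# The 300 and 20000 thresholds are arbitrary assumptions.
check_listings <- function(df) {
  c(
    n_rows           = nrow(df),
    missing_rent     = sum(is.na(df$rent)),
    implausible_rent = sum(df$rent < 300 | df$rent > 20000, na.rm = TRUE),
    duplicate_rows   = sum(duplicated(df))
  )
}

# Median rent for each region and bedroom count, so that like is
# compared with like. Rows with missing values are dropped by aggregate().
compare_rent <- function(df) {
  aggregate(rent ~ region + bedrooms, data = df, FUN = median)
}
```

Rows flagged by such checks are good candidates for comparing against the original posting.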
Note
Make certain not to make too many requests too rapidly, or you may get blocked. In other words, use Sys.sleep() to wait between requests.
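One way to make the waiting hard to forget is to wrap whatever function performs the request so that every call pauses first. This is a minimal sketch; make_polite is a hypothetical name and the default wait times are arbitrary assumptions, not limits published by the site.

```r
# Wrap any fetch function so that each call sleeps before running.
# min_wait and max_wait are in seconds; the defaults are arbitrary.
make_polite <- function(fetch, min_wait = 2, max_wait = 5) {
  function(...) {
    # A random pause avoids a perfectly regular request pattern.
    Sys.sleep(stats::runif(1, min_wait, max_wait))
    fetch(...)
  }
}
```

For example, polite_get <- make_polite(my_get) would give a version of a hypothetical my_get() that waits 2 to 5 seconds before each request.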
Test your functions on a single page rather than running them on all pages and assuming they will work.
Consider manually downloading some sample pages to work on when developing the extraction code, rather than re-fetching them each time as you refine your code.
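The same idea can be automated with a small on-disk cache, sketched below in base R. The names cached_fetch and cache_dir are hypothetical, and the download function is passed in as an argument so the caching logic does not depend on any particular HTTP package.

```r
# Return the content of `url`, reading a local copy when one exists.
# `download` is any function(url) that returns the page as text.
cached_fetch <- function(url, download, cache_dir = "pages") {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  # Build a file name from the URL by replacing unsafe characters.
  # Very long URLs may exceed file name limits on some systems.
  fname <- paste0(gsub("[^A-Za-z0-9]+", "_", url), ".html")
  path <- file.path(cache_dir, fname)
  if (!file.exists(path)) {
    writeLines(download(url), path)  # only hit the network on a cache miss
  }
  paste(readLines(path, warn = FALSE), collapse = "\n")
}
```

With this in place, repeated runs while refining the extraction code read from disk instead of issuing new requests.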
Report
In your report, describe
• the approach you took,
• challenges you encountered and how you overcame them
• how you verified the data were correct,
• the results and any insights you have.
Submit a PDF for your report and all of the code as an R file. If you used R markdown, also submit the Rmd file.
Potentially Useful Packages
• XML, xml2
• RCurl, curl, httr
• RJSONIO, jsonlite
• RSelenium