class: center, middle, inverse, title-slide

.title[
# ISA 401: Business Intelligence & Data Visualization
]
.subtitle[
## 06: Scraping Multiple Webpages
]
.author[
###
Fadel M. Megahed, PhD
Professor of Information Systems and Business Analytics
Farmer School of Business
Miami University
@FadelMegahed
fmegahed
fmegahed@miamioh.edu
Automated Scheduler for Office Hours
]
.date[
### Fall 2024
]

---

# Quick Refresher from Last Class

✅ Understand when we can scrape data (i.e., `robots.txt`)

✅ Scrape a webpage using
---

# Non-Graded Assessment of Your Understanding
Scrape the Names and Positions of the current cabinet members from <https://www.whitehouse.gov/administration/cabinet/> and save the results into a **data frame that contains two columns**: (a) name, and (b) position.

---

# Learning Objectives for Today's Class

- Scrape multiple webpages using
.
- Use loops and/or tidy modeling approaches to scrape data from multiple webpages.

---

class: middle, inverse, center

# Web Scraping Demos (Cont.)

---

# Demo 1: Scraping all Plane Crashes 2020-2024

- We will build on the previous example and scrape all the plane crashes recorded in the [plane crash database](http://www.planecrashinfo.com/) between 2020 and 2024.
- Then, we will create a single **data frame** for all crashes. It will contain the fields in the individual tables as well as the year of the crash.
- Finally, we will **export the results to a CSV** so that we can analyze them in a separate program if we want to.

---

# Practice Outside of Class

The most popular Netflix listings are rated and reviewed on IMDb at <https://www.imdb.com/search/title/?companies=co0144901>. Write an
script that will produce a tibble that contains the **following information for the first 300 entries**:

- title, which you will save in a column titled `title`
- year/years of show, which you will save in a column titled `year`
- 1-2 sentence summary of show, which you will save in a column titled `summary`

---

class: inverse, center, middle

# Recap

---

# Summary of Main Points

By now, you should be able to do the following:

- Scrape multiple webpages using
.
- Use loops and/or tidy modeling approaches to scrape data from multiple webpages.

---

# Kahoot Competition # 1

To assess your understanding and retention of the topics covered last week, you will **compete in a Kahoot competition (consisting of 16 questions)**:

- Go to <https://kahoot.it/>
- Enter the game pin, which will be shown during class
- Provide your first (preferred) and last name
- Answer each question within the allocated 20-second window (**fast and correct answers provide more points**)

<br>

**Winning the competition involves having as many correct answers as possible AND taking the shortest duration to answer these questions.** The winner
of the competition from each section will receive a $10 Starbucks gift card. Good luck!!!

.footnote[
<html>
<hr>
</html>

**P.S.:** The Kahoot competition will have **no impact on your grade**. It is a **fun** way of assessing your knowledge, motivating you to ask questions about covered topics that you do not fully understand, and providing me with some data that I can use to pace today's class.
]

---

# Things to Do to Prepare for Next Class

- Go over your notes, read through the supplementary material (below), and complete [Assignment 05](https://miamioh.instructure.com/courses/223961/assignments/2907509) on Canvas.

.pull-left[
.center[
<img src="data:image/png;base64,#../../figures/web_scrape_in_data_science.PNG" height="300px" style="display: block; margin: auto;" />
]

* [PDF of Published Paper](https://www.tandfonline.com/doi/pdf/10.1080/10691898.2020.1787116)
* [ePub of Published Paper](https://www.tandfonline.com/doi/epub/10.1080/10691898.2020.1787116?needAccess=true)
]

.pull-right[
.center[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/main/PNG/rvest.png" height="300px">]

* [Practical Web Scraping in R](https://www.r-bloggers.com/2019/04/practical-introduction-to-web-scraping-in-r/)
]
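
---

# Sketch: Looping Over Multiple Pages

The Demo 1 workflow (scrape one table per year, stack the tables into a single data frame with a `year` column, then export to CSV) could be sketched roughly as follows. This is a minimal illustration, not the in-class solution. The helper `scrape_year()`, the `pages` list, the column names, and the inline HTML are all hypothetical placeholders so that the sketch runs offline. It also assumes the crash data sits in the first `<table>` of each page, which you should verify against the actual site.

```r
library(rvest)  # read_html(), html_element(), html_table()

# Hypothetical helper: parse the first <table> on a page and record the year.
# `source` may be a URL, a file path, or a literal HTML string.
scrape_year <- function(source, year) {
  page <- read_html(source)
  crashes <- html_table(html_element(page, "table"))
  crashes$year <- year
  crashes
}

# Placeholder pages (made-up rows) so the sketch runs without a network call.
# In practice, replace each string with the URL of that year's page.
pages <- list(
  "2020" = "<table><tr><th>Date</th><th>Location</th></tr>
            <tr><td>Jan 02</td><td>Place A</td></tr></table>",
  "2021" = "<table><tr><th>Date</th><th>Location</th></tr>
            <tr><td>Feb 03</td><td>Place B</td></tr>
            <tr><td>Mar 04</td><td>Place C</td></tr></table>"
)

# Loop over the pages, storing one data frame per year.
results <- vector("list", length(pages))
for (i in seq_along(pages)) {
  results[[i]] <- scrape_year(pages[[i]], year = as.integer(names(pages)[i]))
}

# Stack the yearly tables into a single data frame and export it to CSV.
all_crashes <- do.call(rbind, results)
out_file <- file.path(tempdir(), "plane_crashes.csv")
write.csv(all_crashes, out_file, row.names = FALSE)
```

When pointing the loop at live URLs, consider adding a short pause (e.g., `Sys.sleep(1)`) between requests and checking the site's `robots.txt` first, as discussed last class.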