class: center, middle, inverse, title-slide .title[ # ISA 401: Business Intelligence & Data Visualization ] .subtitle[ ## 04: Scraping Webpages in R
] .author[ ###
Fadel M. Megahed, PhD
Professor of Information Systems and Business Analytics
Farmer School of Business
Miami University
@FadelMegahed
fmegahed
fmegahed@miamioh.edu
Automated Scheduler for Office Hours
] .date[ ### Fall 2024 ] --- # Quick Refresher from Last Class ✅ Subset data in R. ✅ Read text files, binary files (e.g., Excel, SAS, SPSS, Stata), JSON files, etc. ✅ Export data from R. --- # Learning Objectives for Today's Class - Understand when we can scrape data (i.e., `robots.txt`) - Scrape a webpage using R. --- class: middle, inverse, center # Web Technology
.footnote[ <html> <hr> </html> .left[ .large[Source: Slides 5-15 are from [Dr Earo Wang's STAT 220 Web Scraping Slides](https://stats220.earo.me/09-web-scrape.html#2), which were adapted from [Dr Emi Tanaka's](https://emitanaka.org/about.html) "Communicating with Data" course. ] ] ] --- # World Wide Web (WWW) The WWW (or the **Web**) is an information system in which documents (web pages) are identified by Uniform Resource Locators (**URL**s). A web page consists of: *
**HTML** provides the basic structure of the web page *
**CSS** controls the look of the web page (optional) *
**JS** (JavaScript) is a programming language that can modify the behavior of elements of the web page (optional) --- #
Hypertext Markup Language (HTML) HTML files are: * saved with the extension `.html`. * rendered by a web browser via a URL. * text files that follow a special syntax telling web browsers how to render them. .pull-left[ .center[**via a web browser** <img src="data:image/png;base64,#../../figures/browser_plane_crashes.PNG" width="100%" style="display: block; margin: auto;" /> ] ] .pull-right[ .center[**via a text editor** <img src="data:image/png;base64,#../../figures/text_plane_crashes.PNG" width="347" style="display: block; margin: auto;" /> ] ] --- #
HTML Structure ```html <!DOCTYPE html> <html> <!--This is a comment and ignored by web client.--> <head> <!--This section contains web page metadata.--> <title>ISA 401: Business Intelligence and Data Viz</title> <meta name="author" content="Fadel Megahed"> <link rel="stylesheet" href="css/styles.css"> </head> <body> <!--This section contains what you want to display on your web page.--> <h1>I'm a first level header</h1> <p>This is a <b>paragraph</b>.</p> </body> </html> ``` ??? * servr::httd() to serve * HTML: hier str: elements (`<tags>`) and optional attributes, and contents * > 100 elements: each html page must have `<head>` and `<body>`. (rich format -> md) * block tags: h1, p * inline tags: bold a --- #
HTML Syntax .center[`<span style="color:blue;">Author content</span>` <i class="fas fa-arrow-right"></i> <span style="color:blue;">Author content</span>] <table style="width:100%"> <tr> <td style="text-align:right;padding-right:30px;">start tag:</td><td><span class="remark-code" style="font-size:16pt"><span class="red">&lt;span style="color:blue;"&gt;</span><span class="grey">Author content&lt;/span&gt;</span></span> </td> </tr> <tr> <td style="text-align:right;padding-right:30px;">end tag: </td><td> <span class="remark-code" style="font-size:16pt"><span class="grey">&lt;span style="color:blue;"&gt;Author content</span><span class="red">&lt;/span&gt;</span></span> </td> </tr> <tr> <td style="text-align:right;padding-right:30px;">content: </td><td> <span class="remark-code" style="font-size:16pt"><span class="grey">&lt;span style="color:blue;"&gt;</span><span class="red">Author content</span><span class="grey">&lt;/span&gt;</span></span> </td> </tr> <tr> <td style="text-align:right;padding-right:30px;">element name: </td><td> <span class="remark-code" style="font-size:16pt"><span class="grey">&lt;</span><span class="red">span</span><span class="grey"> style="color:blue;"&gt;Author content&lt;/span&gt;</span></span> </td> </tr> <tr> <td style="text-align:right;padding-right:30px;">attribute: </td><td> <span class="remark-code" style="font-size:16pt"><span class="grey">&lt;span </span><span class="red">style="color:blue;"</span><span class="grey">&gt;Author content&lt;/span&gt;</span></span> </td> </tr> <tr> <td style="text-align:right;padding-right:30px;">attribute name: </td><td> <span class="remark-code" style="font-size:16pt"><span class="grey">&lt;span </span><span class="red">style</span><span class="grey">="color:blue;"&gt;Author content&lt;/span&gt;</span></span> </td> </tr> <tr> <td style="text-align:right;padding-right:30px;">attribute value: </td><td> <span class="remark-code" style="font-size:16pt"><span class="grey">&lt;span style=</span><span class="red">"color:blue;"</span><span class="grey">&gt;Author content&lt;/span&gt;</span></span> </td> </tr> </table> <hr> .center[**Not 
all HTML tags have an end tag**, for example:] .center[ <span style="font-size:18pt;">`<img height="40px" src="https://tinyurl.com/rlogo-svg">`</span>
<img height="40px" src="https://tinyurl.com/rlogo-svg"> ] --- #
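More Elements Without an End Tag

The `<img>` example is one of several HTML *void* elements, which are written with a start tag only. The fragment below is a minimal sketch that collects a few common ones; the page title, the `css/styles.css` path, the `alt` text, and the sample sentences are placeholders chosen for illustration.

```html
<!DOCTYPE html>
<html>
<head>
  <!-- <meta> and <link> are void elements: they never take an end tag -->
  <meta charset="utf-8">
  <link rel="stylesheet" href="css/styles.css">
  <title>Void elements (illustrative)</title>
</head>
<body>
  <!-- <br> inserts a line break inside the paragraph -->
  <p>First line<br>second line after the break.</p>
  <!-- <hr> draws a horizontal rule between sections -->
  <hr>
  <!-- <img> carries all of its information in attributes -->
  <img height="40" src="https://tinyurl.com/rlogo-svg" alt="R logo">
</body>
</html>
```

When scraping, keep in mind that a void element has no text content: whatever you need from it (for example, the `src` of an image) lives in its attributes.

--- #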
HTML Elements <table style="width:100%"> <tr> <td style="text-align:right;padding-right:30px;">block element:</td><td><span class="remark-code red" style="font-size:16pt">&lt;div&gt;<span class="grey">content</span>&lt;/div&gt;</span></td> </tr> <tr> <td style="text-align:right;padding-right:30px;">inline element:</td><td><span class="remark-code red" style="font-size:16pt">&lt;span&gt;<span class="grey">content</span>&lt;/span&gt;</span></td> </tr> <tr> <td style="text-align:right;padding-right:30px;">paragraph:</td><td><span class="remark-code red" style="font-size:16pt">&lt;p&gt;<span class="grey">content</span>&lt;/p&gt;</span></td> </tr> <tr> <td style="text-align:right;padding-right:30px;">header level 1:</td><td><span class="remark-code red" style="font-size:16pt">&lt;h1&gt;<span class="grey">content</span>&lt;/h1&gt;</span></td> </tr> <tr> <td style="text-align:right;padding-right:30px;">header level 2:</td><td><span class="remark-code red" style="font-size:16pt">&lt;h2&gt;<span class="grey">content</span>&lt;/h2&gt;</span></td> </tr> <tr> <td style="text-align:right;padding-right:30px;">italic:</td><td><span class="remark-code red" style="font-size:16pt">&lt;i&gt;<span class="grey">content</span>&lt;/i&gt;</span></td> </tr> <tr> <td style="text-align:right;padding-right:30px;">emphasised text:</td><td><span class="remark-code red" style="font-size:16pt">&lt;em&gt;<span class="grey">content</span>&lt;/em&gt;</span></td> </tr> <tr> <td style="text-align:right;padding-right:30px;">strong importance:</td><td><span class="remark-code red" style="font-size:16pt">&lt;strong&gt;<span class="grey">content</span>&lt;/strong&gt;</span></td> </tr> <tr> <td style="text-align:right;padding-right:30px;">link:</td><td><span class="remark-code red" style="font-size:16pt">&lt;a href="https://github.com/fmegahed/isa401"&gt;<span class="grey">content</span>&lt;/a&gt;</span></td> </tr> <tr> <td valign="top" style="text-align:right;padding-right:30px;">unordered list:</td><td><span class="remark-code red" style="font-size:16pt">&lt;ul&gt;<br>&lt;li&gt;<span class="grey">item 1</span>&lt;/li&gt;<br>&lt;li&gt;<span class="grey">item 2</span>&lt;/li&gt;<br>&lt;/ul&gt;</span></td> </tr> </table> ??? How these elements are rendered depends on the browser's default styles, the `style` attribute, or CSS. --- #
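Block and Inline Elements Together

To show how the elements in the table combine on a real page, here is a small illustrative fragment. The headings and sentences are invented for this sketch; only the link target is taken from the table.

```html
<div>
  <!-- block elements: each starts on a new line by default -->
  <h1>Course resources</h1>
  <h2>Where to find the material</h2>
  <p>
    <!-- inline elements: they flow within the surrounding text -->
    The slides are <em>emphasised</em> here, the deadline is
    <strong>important</strong>, and the code lives in
    <a href="https://github.com/fmegahed/isa401">the course repository</a>.
  </p>
  <ul>
    <li>item 1</li>
    <li>item 2</li>
  </ul>
</div>
```

With default browser styles, `<div>`, `<h1>`, `<h2>`, `<p>`, and `<ul>` each occupy their own block, while `<em>`, `<strong>`, and `<a>` stay on the same line as the text around them.

--- #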
Cascading Style Sheets (CSS) * CSS files have the extension `.css` * 3 ways to style elements in HTML: + **inline** by using the `style` attribute inside the HTML start tag: <center> <span class="remark-code grey" style="font-size:14pt;">&lt;h1 <span class="red">style="color:blue;"</span>&gt;Blue Header&lt;/h1&gt;</span> </center> + **externally** by using the `<link>` element: <center> <span class="remark-code red" style="font-size:14pt;">&lt;link rel="stylesheet" href="styles.css"&gt;</span> </center> + **internally** by defining rules within a `<style>` element: <div style="margin-left:35%; width:350px;"> ```html <style type="text/css"> h1 { color: blue; } </style> ``` </div> By convention, the `<style>` and `<link>` elements go in the `<head>` section of the HTML document. --- #
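The Three Styling Methods in One Page

The three methods can be combined in a single document. The page below is a minimal sketch, assuming a separate `styles.css` file exists next to the HTML file; the page title and header text are made up for illustration.

```html
<!DOCTYPE html>
<html>
<head>
  <title>Three ways to style (illustrative)</title>
  <!-- external: rules are kept in a separate file (assumed to exist) -->
  <link rel="stylesheet" href="styles.css">
  <!-- internal: rules are defined inside this document -->
  <style type="text/css">
    h1 { color: blue; }
  </style>
</head>
<body>
  <!-- picks up the internal rule above -->
  <h1>Blue Header</h1>
  <!-- inline: the style attribute applies to this one element only -->
  <h1 style="color:red;">Red Header</h1>
</body>
</html>
```

The second header is red because an inline `style` declaration takes precedence over ordinary stylesheet rules for the same property. The first header is blue provided `styles.css` does not contain a more specific or `!important` rule for `h1`.

--- #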
CSS Syntax .pull-left[ ```html <style type="text/css"> h1 { color: blue; } </style> <h1>This is a header</h1> ``` ] <div style="margin-left:55%; width:350px;"> <br> <h2 style="color:blue">This is a header</h2> </div> <table style="width:100%"> <tr> <td style="text-align:right;padding-right:30px;">selector:</td><td><span class="remark-code" style="font-size:16pt"><span class="red">h1</span><span class="grey"> { color: blue; }</span></span> </td> </tr> <tr> <td style="text-align:right;padding-right:30px;">property:</td><td><span class="remark-code" style="font-size:16pt; color:gray">h1 { <span class="red">color: blue;</span> }</span> </td> </tr> <tr> <td style="text-align:right;padding-right:30px;">property name:</td><td><span class="remark-code" style="font-size:16pt; color:gray">h1 { <span class="red">color</span>: blue; } </span></td> </tr> <tr> <td style="text-align:right;padding-right:30px;">property value:</td><td><span class="remark-code grey" style="font-size:16pt; color:gray">h1 { color: <span class="red">blue</span>; } </span></td> </tr> </table> .pull-left[ You may have multiple properties for a single selector.➡️ ] .pull-right[ ```css h1 { color: blue; font-size: 16pt; } ``` ] --- #
CSS Properties .center[ ```html <div>Sample text</div> ``` ] <table style="width:100%"> <tr> <td style="text-align:right;padding-right:30px;">background color:</td> <td><span class="remark-code grey" style="font-size:16pt; color:gray">div { <span class="red">background-color: yellow;</span> }</span> </td> <td> <div style="background-color: yellow;">Sample text</div> </td> </tr> <tr> <td style="text-align:right;padding-right:30px;">text color:</td> <td><span class="remark-code grey" style="font-size:16pt; color:gray">div { <span class="red">color: purple;</span> }</span> </td> <td> <div style="color: purple;">Sample text</div> </td> </tr> <tr> <td style="text-align:right;padding-right:30px;">border:</td> <td><span class="remark-code grey" style="font-size:16pt; color:gray">div { <span class="red">border: 1px dashed brown;</span> }</span> </td> <td> <div style="border: 1px dashed brown;">Sample text</div> </td> </tr> <tr> <td style="text-align:right;padding-right:30px;">left border only:</td> <td><span class="remark-code grey" style="font-size:16pt; color:gray">div { <span class="red">border-left: 10px solid pink;</span> }</span> </td> <td> <div style="border-left: 10px solid pink;">Sample text</div> </td> </tr> <tr> <td style="text-align:right;padding-right:30px;">text size:</td> <td><span class="remark-code grey" style="font-size:16pt; color:gray">div { <span class="red">font-size: 10pt;</span> }</span> </td> <td> <div style="font-size:10pt;">Sample text</div> </td> </tr> <tr> <td valign="top" style="text-align:right;padding-right:30px;">padding:</td> <td valign="top"><span class="remark-code grey" style="font-size:16pt; color:gray">div { background-color: yellow; <br>     <span class="red">padding: 10px;</span> }</span> </td> <td> <div style="background-color: yellow;padding:10px;">Sample text</div> </td> </tr> <tr> <td valign="top" style="text-align:right;padding-right:30px;">margin:</td> <td valign="top"><span class="remark-code grey" style="font-size:16pt; 
color:gray">div { background-color: yellow; <br>     <span class="red">margin: 10px;</span> }</span> </td> <td> <div style="background-color: yellow;margin:10px;">Sample text</div> </td> </tr> </table> --- #
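Combining Several Properties

The properties shown so far are usually combined in one rule. The fragment below is a small sketch using only properties from the previous table; the second `<div>` and its text are added for illustration so that the effect of `margin` is visible.

```html
<style type="text/css">
  div {
    background-color: yellow;  /* painted behind the content and padding, not the margin */
    border: 1px dashed brown;  /* line drawn around the padding */
    padding: 10px;             /* space between the text and the border */
    margin: 10px;              /* space between the border and neighbouring elements */
    font-size: 10pt;
  }
</style>
<div>Sample text</div>
<div>More sample text</div>
```

Padding adds space inside the border and is filled with the background colour, while margin adds empty space outside the border.

--- #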
CSS Properties .center[ ```html <div>Sample text</div> ``` ] <table style="width:100%"> <tr> <td valign="top" style="text-align:right;padding-right:30px;">center align text:</td> <td valign="top"><span class="remark-code grey" style="font-size:16pt; color:gray">div { background-color: yellow; <br>     padding-top: 20px;<br>     <span class="red">text-align: center;</span> }</span> </td> <td> <div style="background-color: yellow;text-align: center;padding-top: 20px;">Sample text</div> </td> </tr> <tr> <td valign="top" style="text-align:right;padding-right:30px;">font family:</td> <td valign="top"><span class="remark-code grey" style="font-size:16pt; color:gray">div { <span class="red">font-family: Marker Felt, times;</span> }</span> </td> <td> <div style="font-family: Marker Felt, times;">Sample text</div> </td> </tr> <tr> <td valign="top" style="text-align:right;padding-right:30px;">strike:</td> <td valign="top"><span class="remark-code grey" style="font-size:16pt; color:gray">div { <span class="red">text-decoration: line-through;</span> }</span> </td> <td> <div style="text-decoration: line-through;">Sample text</div> </td> </tr> <tr> <td valign="top" style="text-align:right;padding-right:30px;">underline:</td> <td valign="top"><span class="remark-code grey" style="font-size:16pt; color:gray">div { <span class="red">text-decoration: underline;</span> }</span> </td> <td> <div style="text-decoration: underline;">Sample text</div> </td> </tr> <tr> <td valign="top" style="text-align:right;padding-right:30px;">opacity:</td> <td valign="top"><span class="remark-code grey" style="font-size:16pt; color:gray">div { <span class="red">opacity: 0.3</span> }</span> </td> <td> <div style="opacity: 0.3;">Sample text</div> </td> </tr> </table> --- count: false .pull-left[ #
CSS Selector <table class="grey" style="width:98%;margin-left:10px;margin-right:10px;"> <tr class="red"> <td class="remark-code">* </td><td> </td><td>selects all elements</td> </tr> <tr> <td class="remark-code">div</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> elements</td> </tr> <tr> <td class="remark-code">div, p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> and <span class="remark-code" style="font-size:16pt"><p></span> elements</td> </tr> <tr> <td class="remark-code">div p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> within <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div > p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> one level deep in <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr valign="top"> <td class="remark-code">div + p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> immediately after a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div ~ p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> preceded by a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt;" class="red"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? 
<div class="child nice"> <p>Hello!</p> </div> </div> <p>Household 1</p> <div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div> <span class="child"> <span class="parent child rebel"> <p>Clean your room!</p> </span> </span> <p>End of households</p> </pre> ] --- count: false .pull-left[ #
CSS Selector <table class="grey" style="width:98%;margin-left:10px;margin-right:10px;"> <tr > <td class="remark-code">* </td><td> </td><td>selects all elements</td> </tr> <tr class="red"> <td class="remark-code">div</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> elements</td> </tr> <tr> <td class="remark-code">div, p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> and <span class="remark-code" style="font-size:16pt"><p></span> elements</td> </tr> <tr> <td class="remark-code">div p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> within <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div > p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> one level deep in <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr valign="top"> <td class="remark-code">div + p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> immediately after a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div ~ p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> preceded by a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <span class="red"><div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? 
<div class="child nice"> <p>Hello!</p> </div> </div></span> <p>Household 1</p> <span class="red"><div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div></span> <span class="child"> <span class="parent child rebel"> <p>Clean your room!</p> </span> </span> <p>End of households</p> </pre> ] --- count: false .pull-left[ #
CSS Selector <table class="grey" style="width:98%;margin-left:10px;margin-right:10px;"> <tr > <td class="remark-code">* </td><td> </td><td>selects all elements</td> </tr> <tr class="red"> <td class="remark-code">blockquote</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><blockquote></span> elements</td> </tr> <tr> <td class="remark-code">div, p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> and <span class="remark-code" style="font-size:16pt"><p></span> elements</td> </tr> <tr> <td class="remark-code">div p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> within <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div > p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> one level deep in <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr valign="top"> <td class="remark-code">div + p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> immediately after a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div ~ p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> preceded by a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <span class="red"><blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote></span> <div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? 
<div class="child nice"> <p>Hello!</p> </div> </div> <p>Household 1</p> <div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div> <span class="child"> <span class="parent child rebel"> <p>Clean your room!</p> </span> </span> <p>End of households</p> </pre> ] --- count: false .pull-left[ #
CSS Selector <table class="grey" style="width:98%;margin-left:10px;margin-right:10px;"> <tr > <td class="remark-code">* </td><td> </td><td>selects all elements</td> </tr> <tr> <td class="remark-code">div</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> elements</td> </tr> <tr class="red"> <td class="remark-code">div, p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> and <span class="remark-code" style="font-size:16pt"><p></span> elements</td> </tr> <tr> <td class="remark-code">div p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> within <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div > p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> one level deep in <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr valign="top"> <td class="remark-code">div + p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> immediately after a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div ~ p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> preceded by a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <span class="red"><p>Maybe stories are just data with a soul.</p></span> <footer>—Brene Brown</footer> </blockquote> <span class="red"><div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? 
<div class="child nice"> <p>Hello!</p> </div> </div></span> <span class="red"><p>Household 1</p></span> <span class="red"><div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div></span> <span class="child"> <span class="parent child rebel"> <span class="red"><p>Clean your room!</p></span> </span> </span> <span class="red"><p>End of households</p></span> </pre> ] --- count: false .pull-left[ #
CSS Selector <table class="grey" style="width:98%;margin-left:10px;margin-right:10px;"> <tr > <td class="remark-code">* </td><td> </td><td>selects all elements</td> </tr> <tr> <td class="remark-code">div</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> elements</td> </tr> <tr> <td class="remark-code">div, p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> and <span class="remark-code" style="font-size:16pt"><p></span> elements</td> </tr> <tr class="red"> <td class="remark-code">div p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> within <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div > p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> one level deep in <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr valign="top"> <td class="remark-code">div + p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> immediately after a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div ~ p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> preceded by a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <div id="p1" class="parent"> Hmm <span class="red"><p>Hi!</p></span> How are you? 
<div class="child nice"> <span class="red"><p>Hello!</p></span> </div> </div> <p>Household 1</p> <div class="parent"> <span class="red"><p>Hi!</p></span> <blockquote class="child rebel"> <span class="red"><p>Don't talk to me!</p></span> </blockquote> </div> <span class="child"> <span class="parent child rebel"> <p>Clean your room!</p> </span> </span> <p>End of households</p> </pre> ] --- count: false .pull-left[ #
CSS Selector <table class="grey" style="width:98%;margin-left:10px;margin-right:10px;"> <tr > <td class="remark-code">* </td><td> </td><td>selects all elements</td> </tr> <tr> <td class="remark-code">div</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> elements</td> </tr> <tr> <td class="remark-code">div, p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> and <span class="remark-code" style="font-size:16pt"><p></span> elements</td> </tr> <tr class="red"> <td class="remark-code">p div</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> within <span class="remark-code" style="font-size:16pt"><p></span></td> </tr> <tr> <td class="remark-code">div > p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> one level deep in <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr valign="top"> <td class="remark-code">div + p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> immediately after a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div ~ p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> preceded by a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? 
<div class="child nice"> <p>Hello!</p> </div> </div> <p>Household 1</p> <div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div> <span class="child"> <span class="parent child rebel"> <p>Clean your room!</p> </span> </span> <p>End of households</p> </pre> ] --- count: false .pull-left[ #
CSS Selector <table class="grey" style="width:98%;margin-left:10px;margin-right:10px;"> <tr > <td class="remark-code">* </td><td> </td><td>selects all elements</td> </tr> <tr> <td class="remark-code">div</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> elements</td> </tr> <tr> <td class="remark-code">div, p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> and <span class="remark-code" style="font-size:16pt"><p></span> elements</td> </tr> <tr > <td class="remark-code">div p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> within <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr class="red"> <td class="remark-code">div > p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> one level deep in <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr valign="top"> <td class="remark-code">div + p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> immediately after a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div ~ p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> preceded by a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <div id="p1" class="parent"> Hmm <span class="red"><p>Hi!</p></span> How are you? 
<div class="child nice"> <span class="red"><p>Hello!</p></span> </div> </div> <p>Household 1</p> <div class="parent"> <span class="red"><p>Hi!</p></span> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div> <span class="child"> <span class="parent child rebel"> <p>Clean your room!</p> </span> </span> <p>End of households</p> </pre> ] <div style="position:absolute;top:10px;left:900px;width:300px;background-color:white;border:1px solid black;font-size:16pt;padding:2px;"> <i class="fas fa-exclamation-triangle"></i> Ignores inline elements like <code>span</code>, <code>i</code>, <code>b</code>,... </div> --- count: false .pull-left[ #
CSS Selector <table class="grey" style="width:98%;margin-left:10px;margin-right:10px;"> <tr > <td class="remark-code">* </td><td> </td><td>selects all elements</td> </tr> <tr> <td class="remark-code">div</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> elements</td> </tr> <tr> <td class="remark-code">div, p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> and <span class="remark-code" style="font-size:16pt"><p></span> elements</td> </tr> <tr > <td class="remark-code">div p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> within <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div > p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> one level deep in <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr valign="top" class="red"> <td class="remark-code">div + p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> immediately after a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div ~ p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> preceded by a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? 
<div class="child nice"> <p>Hello!</p> </div> </div> <span class="red"><p>Household 1</p></span> <div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div> <span class="child"> <span class="parent child rebel"> <span class="red"><p>Clean your room!</p></span> </span> </span> <p>End of households</p> </pre> ] <div style="position:absolute;top:10px;left:900px;width:300px;background-color:white;border:1px solid black;font-size:16pt;padding:2px;"> <i class="fas fa-exclamation-triangle"></i> Ignores inline elements like <code>span</code>, <code>i</code>, <code>b</code>,... </div> --- count: false .pull-left[ #
CSS Selector <table class="grey" style="width:98%;margin-left:10px;margin-right:10px;"> <tr > <td class="remark-code">* </td><td> </td><td>selects all elements</td> </tr> <tr> <td class="remark-code">div</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> elements</td> </tr> <tr> <td class="remark-code">div, p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> and <span class="remark-code" style="font-size:16pt"><p></span> elements</td> </tr> <tr > <td class="remark-code">div p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> within <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div > p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> one level deep in <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div + p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> immediately after a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr class="red"> <td class="remark-code">div ~ p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> preceded by a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? 
<div class="child nice"> <p>Hello!</p> </div> </div> <span class="red"><p>Household 1</p></span> <div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div> <span class="child"> <span class="parent child rebel"> <span class="red"><p>Clean your room!</p></span> </span> </span> <span class="red"><p>End of households</p></span> </pre> ] --- count: false .pull-left[ #
CSS Selector <table class="grey" style="width:98%;margin-left:10px;margin-right:10px;"> <tr> <td class="remark-code" valign="top">.classname</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">class="classname"</span>. </td> </tr> <tr> <td class="remark-code" valign="top">.c1.c2</td><td> </td><td>selects all elements with <em>both</em> <span class="remark-code" style="font-size:16pt">c1</span> and <span class="remark-code" style="font-size:16pt">c2</span> within its class attribute. </td> </tr> <tr> <td class="remark-code" valign="top">.c1 .c2</td><td> </td><td>selects all elements with class <span class="remark-code" style="font-size:16pt">c2</span> that is a descendant of an element with class <span class="remark-code" style="font-size:16pt">c1</span>. </td> </tr> <tr> <td class="remark-code" valign="top">#idname</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">id="idname"</span>. </td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? <div class="child nice"> <p>Hello!</p> </div> </div> <p>Household 1</p> <div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div> <span class="child"> <span class="parent child rebel"></span> <p>Clean your room!</p> </span></span> </span> <p>End of households</p> </pre> ] --- count: false .pull-left[ #
CSS Selector <table class="grey" style="width:98%;margin-left:10px;margin-right:10px;"> <tr class="red"> <td class="remark-code" valign="top">.parent</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">class="parent"</span>. </td> </tr> <tr> <td class="remark-code" valign="top">.c1.c2</td><td> </td><td>selects all elements with <em>both</em> <span class="remark-code" style="font-size:16pt">c1</span> and <span class="remark-code" style="font-size:16pt">c2</span> within its class attribute. </td> </tr> <tr> <td class="remark-code" valign="top">.c1 .c2</td><td> </td><td>selects all elements with class <span class="remark-code" style="font-size:16pt">c2</span> that is a descendant of an element with class <span class="remark-code" style="font-size:16pt">c1</span>. </td> </tr> <tr> <td class="remark-code" valign="top">#idname</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">id="idname"</span>. </td> </tr> </table> <div style="position:absolute;top:10px;left:900px;width:300px;background-color:white;border:1px solid black;font-size:16pt;padding:2px;"> <i class="fas fa-exclamation-triangle"></i> Note some offspring do not inherit class from their parents. </div> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <span class="red"><div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? <div class="child nice"> <p>Hello!</p> </div> </div></span> <p>Household 1</p> <span class="red"><div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div></span> <span class="child"> <span class="red"><span class="parent child rebel"></span> <p>Clean your room!</p> <span class="red"></span></span> </span> <p>End of households</p> </pre> ] --- count: false .pull-left[ #
CSS Selector <table class="grey" style="width:98%;margin-left:10px;margin-right:10px;"> <td class="remark-code" valign="top">.classname</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">class="classname"</span>. </td> </tr> <tr class="red"> <td class="remark-code" valign="top">.child.rebel</td><td> </td><td>selects all elements with <em>both</em> <span class="remark-code" style="font-size:16pt">child</span> and <span class="remark-code" style="font-size:16pt">rebel</span> within its class attribute. </td> </tr> <tr> <td class="remark-code" valign="top">.c1 .c2</td><td> </td><td>selects all elements with class <span class="remark-code" style="font-size:16pt">c2</span> that is a descendant of an element with class <span class="remark-code" style="font-size:16pt">c1</span>. </td> </tr> <tr> <td class="remark-code" valign="top">#idname</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">id="idname"</span>. </td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? <div class="child nice"> <p>Hello!</p> </div> </div> <p>Household 1</p> <div class="parent"> <p>Hi!</p> <span class="red"><blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote></span> </div> <span class="child"> <span class="red"><span class="parent child rebel"></span> <p>Clean your room!</p> <span class="red"></span></span> </span> <p>End of households</p> </pre> ] --- count: false .pull-left[ #
CSS Selector <table class="grey" style="width:98%;margin-left:10px;margin-right:10px;"> <tr> <td class="remark-code" valign="top">.classname</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">class="classname"</span>. </td> </tr> <tr> <td class="remark-code" valign="top">.c1.c2</td><td> </td><td>selects all elements with <em>both</em> <span class="remark-code" style="font-size:16pt">c1</span> and <span class="remark-code" style="font-size:16pt">c2</span> within its class attribute. </td> </tr> <tr class="red"> <td class="remark-code" valign="top">.parent .rebel</td><td> </td><td>selects all elements with class <span class="remark-code" style="font-size:16pt">rebel</span> that is a descendant of an element with class <span class="remark-code" style="font-size:16pt">parent</span>. </td> </tr> <tr> <td class="remark-code" valign="top">#idname</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">id="idname"</span>. </td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? <div class="child nice"> <p>Hello!</p> </div> </div> <p>Household 1</p> <div class="parent"> <p>Hi!</p> <span class="red"><blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote></span> </div> <span class="child"> <span class="parent child rebel"> <p>Clean your room!</p> </span> </span> <p>End of households</p> </pre> ] --- count: false .pull-left[ #
CSS Selector <table class="grey" style="width:98%;margin-left:10px;margin-right:10px;"> <td class="remark-code" valign="top">.classname</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">class="classname"</span>. </td> </tr> <tr> <td class="remark-code" valign="top">.c1.c2</td><td> </td><td>selects all elements with <em>both</em> <span class="remark-code" style="font-size:16pt">c1</span> and <span class="remark-code" style="font-size:16pt">c2</span> within its class attribute. </td> </tr> <tr> <td class="remark-code" valign="top">.c1 .c2</td><td> </td><td>selects all elements with class <span class="remark-code" style="font-size:16pt">c2</span> that is a descendant of an element with class <span class="remark-code" style="font-size:16pt">c1</span>. </td> </tr> <tr class="red"> <td class="remark-code" valign="top">#p1</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">id="p1"</span>. </td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <span class="red"><div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? <div class="child nice"> <p>Hello!</p> </div> </div></span> <p>Household 1</p> <div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div> <span class="child"> <span class="parent child rebel"> <p>Clean your room!</p> </span> </span> <p>End of households</p> </pre> ] <div style="position:absolute;top:10px;left:900px;width:300px;background-color:white;border:1px solid black;font-size:16pt;padding:2px;"> <i class="fas fa-exclamation-triangle"></i> Unlike <code style="font-size:16pt">class</code>, you can only have one <code style="font-size:16pt">id</code> value and must be unique in the whole HTML document. </div> --- #
JavaScript (JS)*

* JS is a programming language that enables interactive components in HTML documents.
* There are two ways to insert JS into an HTML document:
  + **internally**, by defining it within a `<script>` element:

```html
<script>
document.getElementById("p1").innerHTML = "content";
</script>
```

  + **externally**, by using the `src` attribute to refer to an external file:

```html
<script src="js/myjs.js"></script>
```

---
class: inverse, center, middle

# Web Scraping 🕸

---

# rvest: Step 1 - Reading Static HTML Pages

.pull-left[
.center[
<img src="data:image/png;base64,#../../figures/browser_isa_courses.PNG" width="100%" style="display: block; margin: auto;" />
]
]

.pull-right[

Use {rvest} `>= v1.0.2` (if not, update)

<img src="data:image/png;base64,#../../figures/rvest.PNG" width="100%" style="display: block; margin: auto;" />

<br>

``` r
# Install pacman if needed, then use it to load (and, if missing, install) rvest
if(require(pacman)==FALSE) install.packages("pacman")
pacman::p_load(rvest)

# Download and parse the static HTML of the ISA course listing
isa_courses = read_html("http://bulletin.miamioh.edu/courses-instruction/isa/")
isa_courses
```

```
## {html_document}
## <html xml:lang="en" lang="en" dir="ltr">
## [1] <head>\n<title>Information System ...
## [2] <body>\n\n\n\n\n\n<!-- Google Tag ...
```
]

---

# rvest: Step 2 - Selecting HTML Elements

.pull-left[

##
Inspector <br> .center[ <img src="data:image/png;base64,#../../figures/browser_isa_courses_titles_inspector.png" width="564" style="display: block; margin: auto;" /> ] ] .pull-right[ ##
[Selector Gadget](https://selectorgadget.com/) <br> .center[ <img src="data:image/png;base64,#../../figures/browser_isa_courses_titles.PNG" width="567" style="display: block; margin: auto;" /> ] ] --- class: middle # rvest: Step 2 - Selecting HTML Elements .pull-left[ .center[ <img src="data:image/png;base64,#../../figures/browser_isa_courses_titles.PNG" width="567" style="display: block; margin: auto;" /> ] ] .pull-right[ ``` r isa_course_titles = isa_courses |> html_elements(css = "p.courseblocktitle") isa_course_titles ``` ``` ## {xml_nodeset (50)} ## [1] <p class="courseblocktitle"><str ... ## [2] <p class="courseblocktitle"><str ... ## [3] <p class="courseblocktitle"><str ... ## [4] <p class="courseblocktitle"><str ... ## [5] <p class="courseblocktitle"><str ... ## [6] <p class="courseblocktitle"><str ... ## [7] <p class="courseblocktitle"><str ... ## [8] <p class="courseblocktitle"><str ... ## [9] <p class="courseblocktitle"><str ... ## [10] <p class="courseblocktitle"><str ... ## [11] <p class="courseblocktitle"><str ... ## [12] <p class="courseblocktitle"><str ... ## [13] <p class="courseblocktitle"><str ... ## [14] <p class="courseblocktitle"><str ... ## [15] <p class="courseblocktitle"><str ... ## [16] <p class="courseblocktitle"><str ... ## [17] <p class="courseblocktitle"><str ... ## [18] <p class="courseblocktitle"><str ... ## [19] <p class="courseblocktitle"><str ... ## [20] <p class="courseblocktitle"><str ... ## ... ``` ] --- # rvest: Step 3 - Getting HTML Text .pull-left[ .center[ <img src="data:image/png;base64,#../../figures/browser_isa_courses_titles.PNG" width="567" style="display: block; margin: auto;" /> ] ] .pull-right[ ``` r isa_course_titles_en = isa_course_titles |> html_text2() isa_course_titles_en ``` ``` ## [1] "ISA 125. Introduction to Business Statistics. (3)" ## [2] "ISA 177. Independent Studies. (0-6)" ## [3] "ISA 211. Information Technology and Data Driven Decision Making in Business. (3)" ## [4] "ISA 225. 
Principles of Business Analytics. (3)" ## [5] "ISA 235. Information Technology and the Intelligent Enterprise. (3)" ## [6] "ISA 241. Database for Analytics. (1.5)" ## [7] "ISA 242. Programming for Analytics. (1.5)" ## [8] "ISA 245. Database Systems and Data Warehousing. (3)" ## [9] "ISA 250. Basic Math for Analytics. (3)" ## [10] "ISA 277. Independent Studies. (0-6)" ## [11] "ISA 281. Concepts in Business Programming. (3)" ## [12] "ISA 291. Applied Regression Analysis in Business. (3)" ## [13] "ISA 301. Business Data Communications and Security. (3)" ## [14] "ISA 303. Enterprise Systems. (3)" ## [15] "ISA 305. Information Technology Governance, Risk Management, Security and Audit. (3)" ## [16] "ISA 321. Optimization in Business Analytics. (3)" ## [17] "ISA 333. Nonparametric Statistics. (3)" ## [18] "ISA 335. Blockchain and Business Applications. (3)" ## [19] "ISA 340. Internship. (0-20)" ## [20] "ISA 365. Statistical Monitoring and Design of Experiments. (3)" ## [21] "ISA 377. Independent Studies. (0-6)" ## [22] "ISA 387. Designing Business Systems. (3)" ## [23] "ISA 401/ISA 501. Business Intelligence and Data Visualization. (3)" ## [24] "ISA 403. Building Web and Mobile Business Applications. (3)" ## [25] "ISA 405. Information Security. (3)" ## [26] "ISA 406. IT Project Management. (3)" ## [27] "ISA 412/ISA 512. Data Warehousing and Business Intelligence. (3)" ## [28] "ISA 414/ISA 514. Managing Big Data. (3)" ## [29] "ISA 419. Data Driven Security. (3)" ## [30] "ISA 424. Data Infrastructure for the Enterprise. (3)" ## [31] "ISA 444/ISA 544. Business Forecasting. (3)" ## [32] "ISA 477. Independent Studies. (0-6)" ## [33] "ISA 480. Topics in Business Analytics. (1-3; maximum 3)" ## [34] "ISA 481. Topics in Information Systems. (3-4; maximum 3)" ## [35] "ISA 491/ISA 591. Introduction to Data Mining in Business. (3)" ## [36] "ISA 495. Managing the Intelligent Enterprise. (3)" ## [37] "ISA 496. Business Analytics Practicum. (3)" ## [38] "ISA 616. 
Communicating with Data. (3)" ## [39] "ISA 621. Enabling Technology Topics I. (3)" ## [40] "ISA 628. Information Technology and Analytic's Role in the Enterprise. (1.5)" ## [41] "ISA 629. Leveraging IT and Data Across the Business. (1.5)" ## [42] "ISA 630. Machine Learning Applications in Business. (3)" ## [43] "ISA 632. Big Data Analytics and Modern AI. (3)" ## [44] "ISA 633. Prescriptive Analytics in Business. (3)" ## [45] "ISA 634. Analytics Solution Deployment and Lifecycle Management. (3)" ## [46] "ISA 641. Data Discovery Through Business Analytics for Managers. (2)" ## [47] "ISA 645. Business Analytics for the Executive. (3)" ## [48] "ISA 650. Business Analytics Practicum. (3; maximum 6)" ## [49] "ISA 677. Independent Studies. (0-6)" ## [50] "ISA 681. Studies-Management Information Systems. (1-3)" ``` ] --- # Demo: Scraping the Course Descriptions - We will build on the previous example and we will scrape the **course descriptions** associated with these courses. - Then, we will create a **data frame** containing **both** the **course titles** and **descriptions** - Then, we will **export the results to a CSV** so that we can analyze that in a separate program if we wanted to. --- # Non-Graded Class Activity
.panelset[

.panel[.panel-name[Activity]

.small[
- Go to [this database on plane crashes](http://www.planecrashinfo.com/2024/2024.htm)
- Scrape the HTML table. **Note the difference from text elements:**
  + The CSS selector for `html_elements()` will be different.
  + You will extract a table (in its **entirety**), so we will use `html_table()` instead of `html_text2()`.
- Store the scraped data in an appropriate location on your computer (e.g., within the data folder for ISA 401).
]
]

.panel[.panel-name[Your Solution]

.small[
> _Over the next 4 minutes, use an R script file to perform the tasks outlined in the activity panel._
]
]

.panel[.panel-name[My Solution]

**Please refer to our discussion in class**
]
]

---
class: inverse, center, middle

# Legal and Ethical Issues with Web Scraping

---

# `Robots.txt`

When scraping/crawling the web, you need to be aware of `robots.txt`.

> _The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned_. --- [Wikipedia](https://en.wikipedia.org/wiki/Robots_exclusion_standard)

Use the excellent [robotstxt](https://cran.r-project.org/package=robotstxt/vignettes/using_robotstxt.html) package to check whether scraping/crawling a specific directory is allowed.

``` r
# Install robotstxt if needed, then ask whether any bot ("*") may access the 2024/ path
if(require(robotstxt)==FALSE) install.packages("robotstxt")
robotstxt::paths_allowed(paths = "2024/",
                         domain = "planecrashinfo.com",
                         bot = "*")
```

```
## [1] TRUE
```

---

# Terms of Service

Most large companies have **terms of service** that supplement what is permitted and/or disallowed in their `robots.txt` file. Examples include:

- [Yelp's US Terms of Service](https://terms.yelp.com/tos/en_us/20200101_en_us/)
- [LinkedIn Terms of Service](https://www.linkedin.com/legal/l/service-terms)

---
counter: false

# Ethical/Legal Considerations

- **Use of publicly available reviews as a part of your service:** Would you classify the [Yelp vs Google Feud as such an example](https://www.nytimes.com/2017/07/01/technology/yelp-google-european-union-antitrust.html)?

<center>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Wow Google, congrats on a new low. Consumer searches for Yelp gets "reviews" which are Google Ads. <a href="https://t.co/gKSeOOhzWG">pic.twitter.com/gKSeOOhzWG</a></p>— Jeremy Stoppelman (@jeremys) <a href="https://twitter.com/jeremys/status/876978936177082368?ref_src=twsrc%5Etfw">June 20, 2017</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</center>

---
counter: false

# Ethical/Legal Considerations

- **Use of publicly available profiles as a part of your service:**
  + [LinkedIn vs hiQ Labs: Ninth Circuit Decision in 2019](https://cdn.ca9.uscourts.gov/datastore/opinions/2019/09/09/17-16783.pdf)
  + [Revival of Case in 2021 by Supreme Court](https://techcrunch.com/2021/06/14/supreme-court-revives-linkedin-bid-to-protect-user-data-from-web-scrapers/)

---
counter: false

# Ethical/Legal Considerations

- **What about scraping entire websites/webpages for the purpose of archiving the internet?**

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#../../figures/wayback_google.PNG" alt="The evolution of the
home page for Google per the Wayback Machine" width="80%" />
<p class="caption">The evolution of the home page for Google per the Wayback Machine</p>
</div>

---
class: inverse, center, middle

# Recap

---

# Summary of Main Points

By now, you should be able to do the following:

- Understand when we can scrape data (i.e., `robots.txt`)
- Scrape a webpage using R.

---

# Things to Do to Prepare for Next Class

- Go over your notes, read through the supplementary material (below), and complete [Assignment 04](https://miamioh.instructure.com/courses/223961/assignments/2903164) on Canvas.

.pull-left[
.center[
<img src="data:image/png;base64,#../../figures/web_scrape_in_data_science.PNG" height="300px" style="display: block; margin: auto;" />
]

* [PDF of Published Paper](https://www.tandfonline.com/doi/pdf/10.1080/10691898.2020.1787116)
* [ePub of Published Paper](https://www.tandfonline.com/doi/epub/10.1080/10691898.2020.1787116?needAccess=true)
]

.pull-right[
.center[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/main/PNG/rvest.png" height="300px">]

* [Selector Gadget](https://rvest.tidyverse.org/articles/articles/selectorgadget.html)
* [Getting Started with rvest](https://rvest.tidyverse.org/articles/articles/selectorgadget.html)
* [Practical Web Scraping in R](https://www.r-bloggers.com/2019/04/practical-introduction-to-web-scraping-in-r/)
]
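
---

# Sketch: Course Titles and Descriptions to a CSV

The demo described earlier (scrape the course descriptions, combine them with the titles in a data frame, and export a CSV) might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the in-class solution: the selector `p.courseblockdesc`, the helper `scrape_courses()`, the stand-in page, the placeholder descriptions, and the output file name are all hypothetical. The stand-in page only keeps the sketch runnable offline.

``` r
library(rvest)

# Hypothetical helper: pairs each course title with its description.
# ASSUMPTION: descriptions live in <p class="courseblockdesc"> elements;
# confirm the selector with the Inspector or Selector Gadget first.
scrape_courses = function(page) {
  titles = page |>
    html_elements(css = "p.courseblocktitle") |>
    html_text2()
  descriptions = page |>
    html_elements(css = "p.courseblockdesc") |>
    html_text2()
  # Stop early if the two vectors cannot be paired row by row
  stopifnot(length(titles) == length(descriptions))
  data.frame(title = titles, description = descriptions)
}

# Stand-in page that mimics the assumed structure (placeholder descriptions)
sample_page = minimal_html('
  <div class="courseblock">
    <p class="courseblocktitle"><strong>ISA 401/ISA 501. Business Intelligence and Data Visualization. (3)</strong></p>
    <p class="courseblockdesc">Placeholder description for the first course.</p>
  </div>
  <div class="courseblock">
    <p class="courseblocktitle"><strong>ISA 444/ISA 544. Business Forecasting. (3)</strong></p>
    <p class="courseblockdesc">Placeholder description for the second course.</p>
  </div>
')

courses_df = scrape_courses(sample_page)

# Export to a CSV (written to a temporary folder here; the file name is illustrative)
csv_path = file.path(tempdir(), "isa_courses.csv")
write.csv(courses_df, csv_path, row.names = FALSE)
```

For the live bulletin, the same helper could be called on the `isa_courses` object returned by `read_html()`. If the length check fails there, some course likely lacks a description and the two vectors would need to be matched per course block instead.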