class: center, middle, inverse, title-slide .title[ # ISA 401: Business Intelligence & Data Visualization ] .subtitle[ ## 08: Tidy Data in
] .author[ ###
Fadel M. Megahed, PhD
Professor of Information Systems and Business Analytics
Farmer School of Business
Miami University
@FadelMegahed
fmegahed
fmegahed@miamioh.edu
Automated Scheduler for Office Hours
] .date[ ### Fall 2024 ] --- # Quick Refresher from Last Class ✅ Describe what is an API ✅ Download data using APIs --- # Learning Objectives for Today's Class - Define tidy data - Perform pivot and rectangling operations in
--- class: inverse, center, middle # Tidy Data 🧹 --- background-image: url("data:image/png;base64,#https://images.unsplash.com/photo-1587654780291-39c9404d746b?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1950&q=80") background-size: 90% 90% -- <img src = "https://images.unsplash.com/photo-1615465502839-71d5974f5087?ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&ixlib=rb-1.2.1&auto=format&fit=crop&w=1350&q=80", width = 45%, style = "position:absolute; top: 22.5%; left: 5.5%; box-shadow: 3px 5px 3px 1px #00000080;"></img> -- <img src = "https://images.unsplash.com/photo-1615750198206-6e632e147ef9?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1350&q=80", width = 45%, style = "position:absolute; top: 22.5%; left: 52.5%; box-shadow: 3px 5px 3px 1px #00000080;"></img> .footnote[ <html> <hr> </html> **Sources:** All three images are obtained from Upsplash ] ??? * each piece is a scalar, a vector, a tibble, a list * each piece is vector functions or table functions * bc they are standard and consistent blocks, * ensemble them to a simple mario, or complex nintendo * so far you worked with several datasets now, * hopefully you find a pattern is using these dplyr verbs in different ways to solve various problems. --- # The R for Data Science Workflow <img src="data:image/png;base64,#https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png" width="100%" style="display: block; margin: auto;" /> .footnote[ <html> <hr> </html> **Source:** Image is from Wickham, H. Grolemnund, G. (2017). "R for Data Science", O'Reily. <https://r4ds.had.co.nz/introduction.html> ] --- # The Rationale for Tidy Data - The **tidy framework** provides a **consistent way to organize your data** in
. - Getting your data into this format requires some **upfront work, but that work pays off in the long term.** - Once you have tidy data and the tidy tools provided by packages in the `tidyverse`, you will spend **much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.** .footnote[ <html> <hr> </html> **Source:** Slide is based on Wickham, H. Grolemnund, G. (2017). "R for Data Science", O'Reily. <https://r4ds.had.co.nz/tidy-data.html>. ] --- background-image: url(data:image/png;base64,#https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_1.jpg) background-size: 95% 95% .footnote[ **Source:** Illustration is from the Openscapes blog [Tidy Data for reproducibility, efficiency, and collaboration](https://www.openscapes.org/blog/2020/10/12/tidy-data/) by Julia Lowndes and Allison Horst ] ??? * In database, this is schema. * Tidy data principles are a rephrase of third norm in a database schema design. <https://en.wikipedia.org/wiki/Third_normal_form>, to data scientists. * tidy data is for human consumption. * Tabular data is column-oriented format --- background-image: url(data:image/png;base64,#https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_2.jpg) background-size: contain .footnote[ **Source:** Illustration is from the Openscapes blog [Tidy Data for reproducibility, efficiency, and collaboration](https://www.openscapes.org/blog/2020/10/12/tidy-data/) by Julia Lowndes and Allison Horst ] --- # tidy data
clean data .blue[.center[The `movies` data is tidy but not clean.]] ``` r movies <- tibble::as_tibble(jsonlite::read_json( "https://vega.github.io/vega-editor/app/data/movies.json", simplifyVector = TRUE)) movies |> dplyr::relocate(Release_Date, US_DVD_Sales) |> # move cols to front dplyr::slice(37:39, 268:269) |> # filter specific row numbers print(width = 80) # print nicely ``` ``` ## # A tibble: 5 × 16 ## Release_Date US_DVD_Sales Title US_Gross Worldwide_Gross Production_Budget ## <chr> <int> <chr> <int> <dbl> <int> ## 1 9-Mar-94 NA Four Wed… 52700832 242895809 4500000 ## 2 18-Oct-06 NA 51 Birch… 84689 84689 350000 *## 3 1963-01-01 NA 55 Days … 10000000 10000000 17000000 *## 4 <NA> NA Drei 0 0 7200000 ## 5 16-Jan-98 NA The Dress 16556 16556 2650000 ## # ℹ 10 more variables: MPAA_Rating <chr>, Running_Time_min <int>, ## # Distributor <chr>, Source <chr>, Major_Genre <chr>, Creative_Type <chr>, ## # Director <chr>, Rotten_Tomatoes_Rating <int>, IMDB_Rating <dbl>, ## # IMDB_Votes <int> ``` --- # Non-graded Activity: Tidy or Not?
−
+
05
:
00
.panelset[ .panel[.panel-name[Activity] .small[ - In the next five panels, there five tables all displaying the number of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000. - The data contains values associated with four variables (country, year, cases, and population), but each table organizes the values in a different layout. - **Based on the information in the previous slide, please document which of the table(s) is(are) tidy and if not, which rules are violated.** - **Discuss your answer with your neighboring colleague.** > _Note that you have a total of five minutes for this non-graded activity._ ] ] .panel[.panel-name[`table1`] ``` ## table1 from the tidyr package is printed below: ``` ``` ## # A tibble: 6 × 4 ## country year cases population ## <chr> <dbl> <dbl> <dbl> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` .can-edit.key-activity1[ tidy/not-tidy: .............................. data observations? data variables? ........... rules broken (if any): ........................ ] ] .panel[.panel-name[`table2`] ``` ## table2 from the tidyr package is printed below: ``` ``` ## # A tibble: 12 × 4 ## country year type count ## <chr> <dbl> <chr> <dbl> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583 ``` .can-edit.key-activity2[ tidy/not-tidy: .............................. & rules broken (if any): ........................ ] ] .panel[.panel-name[`table3`] ``` ## table3 from the tidyr package is printed below: ``` ``` ## # A tibble: 6 × 3 ## country year rate ## <chr> <dbl> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583 ``` .can-edit.key-activity3[ tidy/not-tidy: .............................. data observations? data variables? ................... rules broken (if any): ........................ ] ] .panel[.panel-name[`table4a`] ``` ## table4a from the tidyr package is printed below: ``` ``` ## # A tibble: 3 × 3 ## country `1999` `2000` ## <chr> <dbl> <dbl> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766 ``` .can-edit.key-activity4a[ tidy/not-tidy: .............................. data observations? data variables? .................. rules broken (if any): ........................ ] ] .panel[.panel-name[`table4b`] ``` ## table4b from the tidyr package is printed below: ``` ``` ## # A tibble: 3 × 3 ## country `1999` `2000` ## <chr> <dbl> <dbl> ## 1 Afghanistan 19987071 20595360 ## 2 Brazil 172006362 174504898 ## 3 China 1272915272 1280428583 ``` .can-edit.key-activity4b[ tidy/not-tidy: .............................. data observations? data variables? ....................... rules broken (if any): ........................ ] ] ] --- class: center, inverse, middle # Getting Data into Tidy Format --- # Key Functions from the `tidyr`
.pull-right-2[ .center[[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/tidyr.png" width="240px">](http://tidyr.tidyverse.org)] ] .pull-left-2[
type
function()
function()
pivoting
pivot_longer()
pivot_wider()
splitting/combining
separate()
unite()
nesting/unnesting
nest()
unnest()
missing
complete()
fill()
] .footnote[ <html> <hr> </html> **Source:** Slide is based on [Earo Wang's STAT 220 Slides](https://stats220.earo.me/06-tidy-data.html#15) ] --- # Wide Vs Long Data <img src="data:image/png;base64,#https://raw.githubusercontent.com/gadenbuie/tidyexplain/main/images/static/png/original-dfs-tidy.png" width="50%" style="display: block; margin: auto;" /> .footnote[ <html> <hr> </html> **Source:** Image is from Garrick Aden-Buie's excellent [tidyexplain GitHub Repository](https://github.com/gadenbuie/tidyexplain/blob/main/images/static/png/original-dfs-tidy.png) ] --- # `pviot_()` to Transform Wide from/to Long <img src="data:image/png;base64,#https://github.com/gadenbuie/tidyexplain/raw/main/images/tidyr-pivoting.gif" width="40%" style="display: block; margin: auto;" /> .footnote[ <html> <hr> </html> **Source:** Image is from Garrick Aden-Buie's excellent [tidyexplain GitHub Repository](https://github.com/gadenbuie/tidyexplain/blob/main/images/tidyr-pivoting.gif) ] --- # The `pivot_longer()` Function <img src="data:image/png;base64,#https://github.com/gadenbuie/tidyexplain/raw/main/images/static/png/tidyr-pivot_longer.png" width="25%" style="display: block; margin: auto;" /> .footnote[ <html> <hr> </html> **Source:** Image is from Garrick Aden-Buie's excellent [tidyexplain GitHub Repository](https://github.com/gadenbuie/tidyexplain/blob/main/images/static/png/tidyr-pivot_longer.png) ] --- # `pivot_longer()` for table4a [1] To tidy a dataset like this, we need to pivot the **offending columns into a new pair of variables**. To describe that operation we need **three parameters:** - The set of columns whose names are values, not variables. In this example, those are the columns `1999` and `2000`. - The name of the variable to move the column names to. Here it is `year`. - The name of the variable to move the column values to. Here it’s `cases`. --- # `pivot_longer()` for table4a [2] <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://d33wubrfki0l68.cloudfront.net/3aea19108d39606bbe49981acda07696c0c7fcd8/2de65/images/tidy-9.png" alt="Pivoting table4a into a longer, tidy form" width="100%" /> <p class="caption">Pivoting table4a into a longer, tidy form</p> </div> .footnote[ <html> <hr> </html> **Source:** Slide is based on Wickham, H. Grolemnund, G. (2017). "R for Data Science", O'Reily. <https://r4ds.had.co.nz/tidy-data.html>. ] --- # `pivot_longer()` for table4a [3] ``` r tidyr::pivot_longer(table4a, c(`1999`, `2000`), names_to = "year", values_to = "cases") ``` ``` ## # A tibble: 6 × 3 ## country year cases ## <chr> <chr> <dbl> ## 1 Afghanistan 1999 745 ## 2 Afghanistan 2000 2666 ## 3 Brazil 1999 37737 ## 4 Brazil 2000 80488 ## 5 China 1999 212258 ## 6 China 2000 213766 ``` --- # The `pivot_wider()` Function <img src="data:image/png;base64,#https://github.com/gadenbuie/tidyexplain/raw/main/images/static/png/tidyr-pivot_wider.png" width="30%" style="display: block; margin: auto;" /> .footnote[ <html> <hr> </html> **Source:** Image is from Garrick Aden-Buie's excellent [tidyexplain GitHub Repository](https://github.com/gadenbuie/tidyexplain/blob/main/images/static/png/tidyr-pivot_wider.png) ] --- # `pivot_wider()` for table2 [1] - `pivot_wider()` is the opposite of `pivot_longer()`. - You use it when an observation is scattered across multiple rows. - For example, take table2: an observation is a country in a year, but each observation is spread across two rows. ``` r head(table2, n = 3) ``` ``` ## # A tibble: 3 × 4 ## country year type count ## <chr> <dbl> <chr> <dbl> *## 1 Afghanistan 1999 cases 745 *## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ``` --- # `pivot_wider()` for table2 [2] <img src="data:image/png;base64,#https://d33wubrfki0l68.cloudfront.net/8350f0dda414629b9d6c354f87acf5c5f722be43/bcb84/images/tidy-8.png" style="display: block; margin: auto;" /> .footnote[ <html> <hr> </html> **Source:** Slide is based on Wickham, H. Grolemnund, G. (2017). "R for Data Science", O'Reily. <https://r4ds.had.co.nz/tidy-data.html>. ] --- # `pivot_wider()` for table2 [3] ``` r tidyr::pivot_wider(table2, names_from = type, values_from = count) ``` ``` ## # A tibble: 6 × 4 *## country year cases population *## <chr> <dbl> <dbl> <dbl> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` --- # `separate()` for table3 [1] ``` ## # A tibble: 6 × 3 ## country year rate ## <chr> <dbl> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583 ``` `table3` has a different problem: - we have one column (`rate`) that contains two variables (`cases` and `population`). - To fix this problem, we’ll need the `separate()` function. --- # `separate()` for table3 [2] <img src="data:image/png;base64,#https://d33wubrfki0l68.cloudfront.net/f6fca537e77896868fedcd85d9d01031930d76c9/637d9/images/tidy-17.png" width="100%" style="display: block; margin: auto;" /> .footnote[ <html> <hr> </html> **Source:** Slide is based on Wickham, H. Grolemnund, G. (2017). "R for Data Science", O'Reily. <https://r4ds.had.co.nz/tidy-data.html>. ] --- # `separate()` for table3 [3] ``` r tidyr::separate(table3, rate, into = c("cases", "population"), convert = TRUE) ``` ``` ## # A tibble: 6 × 4 *## country year cases population *## <chr> <dbl> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` --- # Non-graded Class Activity
−
+
05
:
00
.panelset[ .panel[.panel-name[Activity] .small[ > _In this five minute non-graded activity, please do the following_ - Go to <https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/> - Download the data for `Deaths` by clicking on the tab to the right of the page. - **Tidy this data based on the information you have learned in today's class.** ] ] .panel[.panel-name[`Your Solution`] In your RStudio Session, please read the data, load the required packages and write the code needed to transform the `deaths` data into a tidy format. ] ] --- class: center, inverse, middle # Recap --- # Summary of Main Points By now, you should be able to do the following: - Define tidy data - Perform pivot and rectangling operations in
--- background-image: url(data:image/png;base64,#https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_5.jpg) background-size: contain # Advantages of Tidy Data * one set of consistent tools for different datasets * easier for automation and iteration .footnote[ **Source:** Illustration is from the Openscapes blog [Tidy Data for reproducibility, efficiency, and collaboration](https://www.openscapes.org/blog/2020/10/12/tidy-data/) by Julia Lowndes and Allison Horst ] ??? * To work with messy data, every time you need to switch to different gears * learn new tools that just works for that specific dataset * It's much more pleasant to work with tidy data, help you to build a good taste of data analysis * spent less fighting with different tools, focus more on data analysis bc one set of consistent tools * the {tidyverse} philosophy to work with the tidy data structures * build automatic workflow for analysis, feed different data sets. --- # Things to Do Prior to Next Class Please go through the following two supplementary readings and complete [assignment 06: tidy data](https://miamioh.instructure.com/courses/223961/quizzes/665878). .pull-left[ .center[[<img src="https://d33wubrfki0l68.cloudfront.net/b88ef926a004b0fce72b2526b0b5c4413666a4cb/24a30/cover.png" height="400px">](https://r4ds.had.co.nz)] ] .pull-right[ * [Tidy data](https://r4ds.had.co.nz/tidy-data.html) * [{tidyr} cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-import.pdf) ]