class: center, middle, inverse, title-slide .title[ # ISA 401: Business Intelligence & Data Visualization ] .subtitle[ ## 03: Importing and Exporting Data in R ] .author[ ###
Fadel M. Megahed, PhD
Professor of Information Systems and Business Analytics
Farmer School of Business
Miami University
@FadelMegahed
fmegahed
fmegahed@miamioh.edu
Automated Scheduler for Office Hours
] .date[ ### Fall 2024 ] --- # Quick Refresher from Last Class ✅ Describe why we are using
in this course? ✅ Understand the syntax, data structures and functions ✅ Utilize the project workflow in
and create your second
script. --- # Going Over Assignment 02 Solutions Let us go over the solutions for assignment 02 together. --- # Learning Objectives for Today's Class - Subset data in
. - Read text-files, binary files (e.g., Excel, SAS, SPSS, Stata, etc), json files, etc. - Export data from
. --- class: inverse, center, middle # Subsetting Data --- # Recall: Atomic Vectors (1D) .center[**Atomic vectors** are 1D data structures in R, where all elements **must have the same type.**] .pull-left[Since they are **1D data structures**, they are subsetted using `[element_no(s)]`. ``` r x_vec = rnorm(3) x_vec ``` ``` ## [1] -1.5571314 -0.4981731 0.3998520 ``` ``` r x_vec[2] ``` ``` ## [1] -0.4981731 ``` ``` r x_vec[c(1,3)] ``` ``` ## [1] -1.557131 0.399852 ``` ] .pull-right[<img src="https://d33wubrfki0l68.cloudfront.net/eb6730b841e32292d9ff36b33a590e24b6221f43/57192/diagrams/vectors/summary-tree-atomic.png" width="60%"> <br><br> <img src="https://d33wubrfki0l68.cloudfront.net/8a3d360c80da1186b1373a0ff0ddf7803b96e20d/254c6/diagrams/vectors/atomic.png" width="80%">] .footnote[ <html> <hr> </html> **Sources:** Images are from [Hadley Wickham's Advanced R Book, Chapter 3 on Vectors](https://adv-r.hadley.nz/vectors-chap.html). ] --- # Recall: Lists ``` r lst <- list( 1:5, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9) ) ``` .pull-left[ Subset by `[]` ``` r lst[4] ``` ``` ## [[1]] ## [1] 2.3 5.9 ``` ] .pull-right[ Subset by `[[]]` ``` r lst[[4]] ``` ``` ## [1] 2.3 5.9 ``` ] .center[<img src="../../figures/pepper.png" width="38%">] .footnote[ <html> <hr> </html> **Sources:** Image is from [Hadley Wickham's Tweet on Indexing lists in R](https://twitter.com/hadleywickham/status/643381054758363136?lang=en). ] --- # Recall: Matrices (2D) A matrix is a **2D data structure** made of **one/homogeneous data type.** .pull-left[ **A 2 `\(\times\)` 2 numeric matrix** ``` r x_mat = matrix( sample(1:10, size = 4), nrow = 2, ncol = 2 ) ``` ``` r x_mat # printing it nicely print('-----------------') *x_mat[1, 2] # subsetting ``` ``` ## [,1] [,2] ## [1,] 7 9 ## [2,] 4 8 ## [1] "-----------------" ## [1] 9 ``` ] .pull-right[ **A 3 `\(\times\)` 4 character matrix** ``` r x_char = matrix( sample(letters, size = 12), nrow = 3, ncol =4 ) x_char ``` ``` ## [,1] [,2] [,3] [,4] ## [1,] "o" "l" "w" "z" ## [2,] "p" "v" "f" "m" ## [3,] "r" "y" "k" "a" ``` ``` r *x_char[1:2, 2:3] # subsetting ``` ``` ## [,1] [,2] ## [1,] "l" "w" ## [2,] "v" "f" ``` ] --- # Tibbles .pull-left[ ``` r dept = c('ACC', 'ECO', 'FIN', 'ISA', 'MGMT') nfaculty = c(18L, 19L, 14L, 25L, 22L) fsb_tbl <- tibble::tibble( department = dept, count = nfaculty, percentage = count / sum(count)) fsb_tbl ``` ``` ## # A tibble: 5 × 3 ## department count percentage ## <chr> <int> <dbl> ## 1 ACC 18 0.184 ## 2 ECO 19 0.194 ## 3 FIN 14 0.143 ## 4 ISA 25 0.255 ## 5 MGMT 22 0.224 ``` ] .pull-right[ .center[<img src="../../figures/legos-jbryan-structures.png" width="92%">] ] .left[ .footnote[ <html> <hr> </html> **Source:** The image is from the excellent [lego-rstats GitHub Repository by Jenny Bryan](https://github.com/jennybc/lego-rstats#readme) ] ] --- # Subsetting Tibbles .left-column[ ## **to <br> 1d** ] .right-column[ * with `[[]]` or `$` ``` r fsb_tbl[["count"]] # column name ``` ``` ## [1] 18 19 14 25 22 ``` ``` r fsb_tbl[[2]] # column position ``` ``` ## [1] 18 19 14 25 22 ``` ``` r fsb_tbl$count # column name ``` ``` ## [1] 18 19 14 25 22 ``` ] .footnote[ <html> <hr> </html> **Source:** Slide is based on [Earo Wang's STAT 220 Slides](https://stats220.earo.me/02-import-export.html#19). ] --- # Subsetting Tibbles .left-column[ ## **by <br> columns** ] .right-column[ * with `[]` or `[, col]` .pull-left[ ``` r fsb_tbl["count"] ``` ``` ## # A tibble: 5 × 1 ## count ## <int> ## 1 18 ## 2 19 ## 3 14 ## 4 25 ## 5 22 ``` ] .pull-right[ ``` r fsb_tbl[2] # for data.frames -> fsb_tbl[, 2] ``` ``` ## # A tibble: 5 × 1 ## count ## <int> ## 1 18 ## 2 19 ## 3 14 ## 4 25 ## 5 22 ``` ] ] .footnote[ <html> <hr> </html> **Source:** Slide is based on [Earo Wang's STAT 220 Slides](https://stats220.earo.me/02-import-export.html#20). ] --- # Subsetting Tibbles .left-column[ ## **by rows** ] .right-column[ * with `[row, ]` .pull-left[ ``` r fsb_tbl[c(1, 3), ] ``` ``` ## # A tibble: 2 × 3 ## department count percentage ## <chr> <int> <dbl> ## 1 ACC 18 0.184 ## 2 FIN 14 0.143 ``` ] .pull-right[ ``` r fsb_tbl[-c(2, 4), ] ``` ``` ## # A tibble: 3 × 3 ## department count percentage ## <chr> <int> <dbl> ## 1 ACC 18 0.184 ## 2 FIN 14 0.143 ## 3 MGMT 22 0.224 ``` ] ] .footnote[ <html> <hr> </html> **Source:** Slide is based on [Earo Wang's STAT 220 Slides](https://stats220.earo.me/02-import-export.html#21). ] --- # Subsetting Tibbles .left-column[ ## **by <br> rows <br> and <br> columns** ] .right-column[ * with `[row, col]` ``` r fsb_tbl[1:3, 2:3] ## ## fsb_tbl[-4, 2:3] # same as above ## ## fsb_tbl[1:3, c("count". "percentage")] # same result ## ## fsb_tbl[c(rep(TRUE, 3), FALSE), 2:3] # same as above ``` ``` ## # A tibble: 3 × 2 ## count percentage ## <int> <dbl> ## 1 18 0.184 ## 2 19 0.194 ## 3 14 0.143 ``` ] .footnote[ <html> <hr> </html> **Source:** Slide is based on [Earo Wang's STAT 220 Slides](https://stats220.earo.me/02-import-export.html#22). ] --- # Subsetting Tibbles * Use `[[` to extract 1d vectors from 2d tibbles * Use `[` to subset tibbles to a new tibble + numbers (positive/negative) as indices + characters (column names) as indices + logicals as indices ``` r fsb_tbl[["count"]] # will produce 1-D vector fsb_tbl$count # will produce 1D vector # Resulting in tibbles fsb_tbl[, 2] fsb_tbl[1:3, 2:3] ``` .footnote[ <html> <hr> </html> **Source:** Slide is based on [Earo Wang's STAT 220 Slides](https://stats220.earo.me/02-import-export.html#23). ] --- class: inverse middle # Data import ⬇️ --- .left-column[ .center[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/readr.png" width="60%">] ] .right-column[ # Reading Plain-Text Rectangular
## .small[(a.k.a. flat or spreadsheet-like files)] * delimited text files with `read_delim()` + `.csv`: comma separated values with `read_csv()` + `.tsv`: tab separated values `read_tsv()` * `.fwf`: fixed width files with `read_fwf()` <hr> ] --- # Some Details on Reading CSV Data Files ## `read_csv()` arguments with [`?read_csv()`](https://readr.tidyverse.org/reference/read_delim.html) .left-column[ .center[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/readr.png" width="60%">] ] .right-column[ ``` r readr::read_csv( file, col_names = TRUE, col_types = NULL, locale = default_locale(), na = c("", "NA"), quoted_na = TRUE, quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), progress = show_progress(), skip_empty_rows = TRUE ) ``` ] ??? * w/o using arguments, readr makes smart guesses, which means take a little longer * more specific, speed up the reading --- # Demo: Reading CSV Data
In this hands-on demo, you will learn how to: - Import CSV files into your
environment based on: * files that are located on your
, see **Canvas** for downloading an example CSV * files that are hosted on the web. + **Data in Webpages
:** we will cover the following example in class: - **FRED Data:** e.g., [Unempolyment Rate (UNRATE)](https://fred.stlouisfed.org/series/UNRATE) + **GitHub**
Repositories, e.g., - [SuperBowl Ads](https://github.com/rfordatascience/tidytuesday/blob/2e9bd5a67e09b14d01f616b00f7f7e0931515d24/data/2021/2021-03-02/youtube.csv) - [Women's Rights Around the World](https://github.com/glosophy/women-data) - focusing on `WomenTotal.csv` --- # Advanced: Reading CSVs with the vroom
.left-column[ .center[<img src="https://github.com/r-lib/vroom/raw/main/man/figures/logo.png" width="75%">] .center[] ] .right-column[ ### Faster delimited reader at **1.4GB/sec** - [vroom](https://www.tidyverse.org/blog/2019/05/vroom-1-0-0/) is a relatively new `tidyverse` package that can **read** and **write** delimited files very efficiently. - It is recommended for large CSV files, see [tidyverse blog](https://www.tidyverse.org/blog/2019/05/vroom-1-0-0/) for a detailed introduction on the package. ``` r if(require(vroom)==FALSE) install.packages('vroom') fast_df <- vroom::vroom("your_file.csv") ``` ] --- .left-column[ .center[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/readxl.png" width="60%">] ] .right-column[ # Reading Proprietary Binary Files **Microsoft Excel
** (with extensions `.xls`for MSFT Excel 2003 and earlier **OR** `.xlsx` for MSFT Excel 2007 and later) **Non-Graded Class Activity**
−
+
03
:
00
.panelset[ .panel[.panel-name[Activity] .small[ - Download the [AIAAIC Repository.xlsx file from Canvas](https://miamioh.instructure.com/courses/223961/files/32784954?module_item_id=5444124). - Store the data in an appropriate location on your computer (e.g., within the data folder for ISA 401) - Use an appropriate function from the `readxl` package to read the data (either `read_xlsx()` or `read_xls()`). - Report the number of observations, variables and the class of each variable from the data. ] ] .panel[.panel-name[Your Solution] .small[ > _Over the next 3 minutes, use an R script file to answer the questions from the activity and record your answers below_ .can-edit.key-activity1[ Number of observations and variables: ...... and ...... The class of each variable ...... ] ] ] .panel[.panel-name[My Solution] **Please refer to our discussion in class** ] ] ] ??? * contrasting to plain-text, binary files have to be opened by a certain app --- .left-column[ .center[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/haven.png" width="60%">] ] .right-column[ # Reading Proprietary Binary Files Several functions from the [haven](https://haven.tidyverse.org/)
can be used to read and write formats used by other statistical packages. Example functions include: - SAS + `.sas7bdat` with `read_sas()` - Stata + `.dta` with `read_dta()` - SPSS + `.sav` with `read_sav()` **Please refer to the help files for each of those packages for more details.** ] --- # JSON Files > _JSON (JavaScript Object Notation) is an open standard file format and data interchange format that uses **human-readable** text to store and transmit data **objects** consisting of **attribute–value pairs** and **arrays**... It is a common data format with diverse uses ... including that of web applications with servers._ --- [Wikipedia's Definition of JSON](https://en.wikipedia.org/wiki/JSON) * **object:** `{}` * **array:** `[]` * **value:** string/character, number, object, array, logical, `null` --- # JSON Files .pull-left[ ### JSON ```json { "firstName": "Mickey", "lastName": "Mouse", "address": { "city": "Mousetown", "postalCode": 10000 } "logical": [true, false] } ``` ] .pull-right[ ### R list ```r list( firstName = "Mickey", lastName = "Mouse", address = list( city = "Mousetown", postalCode = 10000 ), logical = c(TRUE, FALSE) ) ``` ] --- # Demo We will use the [jsonlite](https://cran.r-project.org/web/packages/jsonlite/index.html)
to read an example from one of the [awesome-json-datasets](https://github.com/jdorfman/awesome-json-datasets). Please note the following from the demo: - **Setting up the package**, which should be a one-time event if you are using the same computer. - **Which function** are we using from the package to read the json data? - What is the **type of object returned** by the function? - How are we **converting the object to a tibble?** --- class: inverse, center, middle # Data export ⬆️ --- # From Read to Write `read_*()` to `write_*()` Here are some ideas: **do they come from the same package?** ``` r readr::write_csv(example_tbl, file = "example.csv") haven::write_sas(example_tbl, path = "example.sas7bdat") jsonlite::write_json(example_tbl, path = "example.json") ``` --- class: inverse, center, middle # Recap --- # Summary of Main Points By now, you should be able to do the following: - Subset data in
. - Read text-files, binary files (e.g., Excel, SAS, SPSS, Stata, etc), json files, etc. - Export data from
. --- # Supplementary Reading .pull-left[ .center[[<img src="https://d33wubrfki0l68.cloudfront.net/b88ef926a004b0fce72b2526b0b5c4413666a4cb/24a30/cover.png" height="320px">](https://r4ds.had.co.nz)] * [Tibbles](https://r4ds.had.co.nz/tibbles.html) * [Data import](https://r4ds.had.co.nz/data-import.html) ] .pull-right[ .center[[<img src="https://d33wubrfki0l68.cloudfront.net/565916198b0be51bf88b36f94b80c7ea67cafe7c/7f70b/cover.png" height="320px">](https://adv-r.hadley.nz)] * [Subsetting](https://adv-r.hadley.nz/subsetting.html#subset-single) ] --- # Things to Do to Prepare for Our Next Class - Go over your notes and complete [Assignment 03](https://miamioh.instructure.com/courses/223961/quizzes/664159?module_item_id=5443108) on Canvas. - **Before attempting the assignment, you are encouraged to:** * Go over this slide deck as well as the [slide deck from last class](https://fmegahed.github.io/isa401/fall2024/class02/02_introduction_to_r.html) * Read the supplementary reading for today's class (see previous slide) - **While attempting the assignment, you are encouraged to:** * Google (
)/ChatGPT/[ChatISA](https://chatisa.fsb.miamioh.edu/) any
that you need. * Examine any
functions by utilizing on its help document using the `?function_name`