ISA 401: Business Intelligence & Data Visualization

class: center, middle, inverse, title-slide

.title[
# ISA 401: Business Intelligence & Data Visualization
]
.subtitle[
## 23: A Short Introduction to Exploratory Data Mining
]
.author[
### <br>Fadel M. Megahed, PhD <br><br>Professor of Information Systems and Business Analytics <br> Farmer School of Business<br> Miami University<br><br> <a href="https://twitter.com/FadelMegahed"><svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:white;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z" /></svg> <span class="citation">@FadelMegahed</span></a> <br> <a href="https://github.com/fmegahed/"><svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:white;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 
23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z" /></svg> fmegahed</a> <br> <a href="mailto:fmegahed@miamioh.edu"><svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:white;" xmlns="http://www.w3.org/2000/svg"> <path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z" /></svg> fmegahed@miamioh.edu</a><br> <a href="https://calendly.com/fmegahed"><svg viewBox="0 0 384 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:white;" xmlns="http://www.w3.org/2000/svg"> <path d="M202.021 0C122.202 0 70.503 32.703 29.914 91.026c-7.363 10.58-5.093 25.086 5.178 32.874l43.138 32.709c10.373 7.865 25.132 6.026 33.253-4.148 25.049-31.381 43.63-49.449 82.757-49.449 30.764 0 68.816 19.799 68.816 49.631 0 22.552-18.617 34.134-48.993 51.164-35.423 19.86-82.299 44.576-82.299 106.405V320c0 13.255 10.745 24 24 24h72.471c13.255 0 24-10.745 24-24v-5.773c0-42.86 125.268-44.645 125.268-160.627C377.504 66.256 286.902 0 202.021 0zM192 373.459c-38.196 0-69.271 31.075-69.271 69.271 0 38.195 31.075 69.27 69.271 69.27s69.271-31.075 69.271-69.271-31.075-69.27-69.271-69.27z" /></svg> Automated Scheduler for 
Office Hours</a><br><br>
]
.date[
### Fall 2024
]

---

# A Recap of What we Learned Last Week

- Define a “business report” & its main functions

- Understand the importance of the right KPIs

- Automate traditional business reports

- Dashboards as real-time business reporting tools

---

# Course Objectives Covered so Far

[Y]ou will be re-introduced to **how data should be explored** ... Instead, the focus is on understanding the underlying methodology and mindset of **how data should be approached, handled, explored, and incorporated back into the domain of interest.** ... You are expected to:

---

# Learning Objectives for Today's Class

- Describe the goals & functions of data mining

- Understand the statistical limits on data mining

- Describe the data mining process

- Define “frequent itemsets” & describe applications of this concept

- Explain how and why “association rules” are constructed

- Use <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg> to apply both concepts

---
class: inverse, center, middle

# An Overview of Data Mining

---

# What is Data Mining?

- The most common definition of data mining is the discovery of models from data.

- Discovery of **patterns and models that are:**  
  + **Valid:** hold on new data with some certainty  
  + **Useful:** should be possible to act on the item  
  + **Unexpected:** non-obvious to the system 
  + **Understandable:** humans should be able to interpret the pattern

- Subsidiary Issues:  
  + **Data cleansing:** detection of bogus data  
  + **Data visualization:** something better than MBs of output 
  + **Warehousing** of data (for retrieval)

.footnote[
<html>
<hr>
</html>

**Source:** The slide is adapted from Jure Leskovec, Stanford CS246, Lecture Notes, see <http://cs246.stanford.edu>
]

---

# A Simplistic View of Data Mining Models

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#../../figures/data_mining_models.png" alt="An Overview of Data Mining Models" width="100%" />
<p class="caption">A simplistic summary of data mining models. Note that, in ISA 401, we will only briefly cover descriptive/exploratory data mining models</p>
</div>

---

# Data Mining is Hard

Data mining is hard because of the following challenges:

- Scalability

- Dimensionality

- Complex and Heterogeneous Data

- Data Quality

- Data Ownership and Distribution

- Privacy Preservation

**Note that I have intentionally not included fitting/training a model since this is relatively easy if you understand the data, have engineered/captured the important predictors, and have the data in the "correct" shape/quality.**

---

# Association Rules

.panelset[
.panel[.panel-name[Data]

```
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage
```

]

.panel[.panel-name[Top 5 Rules]

```
##     lhs                                    rhs              support    
## [1] {Instant food products, soda}       => {hamburger meat} 0.001220132
## [2] {soda, popcorn}                     => {salty snack}    0.001220132
## [3] {flour, baking powder}              => {sugar}          0.001016777
## [4] {ham, processed cheese}             => {white bread}    0.001931876
## [5] {whole milk, Instant food products} => {hamburger meat} 0.001525165
##     confidence coverage    lift     count
## [1] 0.6315789  0.001931876 18.99565 12   
## [2] 0.6315789  0.001931876 16.69779 12   
## [3] 0.5555556  0.001830198 16.40807 10   
## [4] 0.6333333  0.003050330 15.04549 19   
## [5] 0.5000000  0.003050330 15.03823 15
```
]

.panel[
.panel-name[Scatter Plot of all Rules]

]

.panel[
.panel-name[Graph-based Plot of Top 5 Rules]

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#23_data_mining_overview_files/figure-html/rules_graph-1.png" alt="Graph-based visualization with items and rules as vertices."  />
<p class="caption">Graph-based visualization with items and rules as vertices.</p>
</div>
]

]

---

# Clustering of Traffic Volume on I-85

.panelset[
.panel[.panel-name[Data]
<img src="data:image/png;base64,#../../figures/i85.png" width="100%" style="display: block; margin: auto;" />

]

.panel[.panel-name[Calendar Plot of Clustered Data]
<img src="data:image/png;base64,#../../figures/tcluster.png" width="100%" style="display: block; margin: auto;" />
]

.panel[.panel-name[Insights from Chart?]

**Based on the previous tab, what are 2-3 main insights you have learned about the traffic volume in Montgomery, AL?** Write them down below

.can-edit.key-activity[Edit me and insert your solution here]

]
]

---

# Regression vs Classification

.center[
<img src="data:image/png;base64,#../../figures/reg_class.jpg" width="60%" style="display: block; margin: auto;" />
]

---

# An Overview of Common Data Mining Models

.center[
<img src="data:image/png;base64,#../../figures/ml_map.jpg" width="90%" style="display: block; margin: auto;" />
]

---
class: inverse, center, middle

# Limits on Data Mining

---

# Meaningfulness of Answers from DM Models

- .black[.bold[A big risk when data mining is that you will discover patterns that are meaningless.]]

- **Bonferroni’s Principle:** (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find meaningless patterns.

.center[
![](data:image/png;base64,#https://imgs.xkcd.com/comics/extrapolating.png)
]
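
To see the principle in action, here is a minimal simulated sketch. The seed, sample size, and number of tests are arbitrary choices made for illustration, not values from any real study. We test 200 pure-noise predictors against a pure-noise outcome, so every "significant" result is a false discovery.

```r
set.seed(401)    # arbitrary seed, used only for reproducibility
n_obs   <- 30    # assumed number of observations
n_tests <- 200   # assumed number of "places we look"

y <- rnorm(n_obs)                                   # outcome: pure noise
X <- matrix(rnorm(n_obs * n_tests), nrow = n_obs)   # predictors: pure noise

# p-value of the correlation test between y and each noise column
p_values <- apply(X, 2, function(x) cor.test(x, y)$p.value)

n_false_hits  <- sum(p_values < 0.05)   # "patterns" found by chance alone
expected_hits <- n_tests * 0.05         # what chance predicts on average

# Bonferroni correction: divide the significance level by the number of tests
n_hits_corrected <- sum(p_values < 0.05 / n_tests)
```

Comparing `n_false_hits` with `n_hits_corrected` shows how many chance findings survive once we account for how many places we looked.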

---

# Rhine's Paradox: An Example of Overzealous DM?

- Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had **Extra-Sensory Perception**.

- He devised an experiment where subjects were asked to guess the color of 10 hidden cards: .red[red] or .blue[blue].

- He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!

- He told these people they had ESP and called them in for another test of the same type.

- Alas, he discovered that almost all of them had lost their ESP.

- **What did he conclude?**  
  + He concluded that you should not tell people they have ESP; it causes them to lose it.  
  + **Why is this an incorrect conclusion?**
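
The arithmetic behind this question can be checked with a short sketch. It assumes that every subject is guessing at random (each card is a fair coin flip); the number of subjects is a hypothetical value chosen for illustration.

```r
n_cards <- 10
p_all_right <- 0.5^n_cards    # chance that a pure guesser gets all 10 right
one_in      <- 1 / p_all_right   # 1024, i.e., "almost 1 in 1000"

# Chance that a lucky guesser repeats the feat on the second test
p_repeat <- p_all_right

# With many subjects, someone is likely to look "gifted" by luck alone
n_subjects     <- 1000        # hypothetical number of people tested
p_at_least_one <- 1 - (1 - p_all_right)^n_subjects
round(p_at_least_one, 3)
```

Under these assumptions, a handful of perfect scores is what chance predicts, and "losing" ESP on the retest is simply a return to chance-level guessing.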
  
  
---

# Ethical Issues with Data Mining

.pull-left[
.center[
![](data:image/png;base64,#https://images-na.ssl-images-amazon.com/images/I/51eUw-v0X+L._SX329_BO1,204,203,200_.jpg)
]
]

.pull-right[
.center[
![](data:image/png;base64,#https://images-na.ssl-images-amazon.com/images/I/51obBtKNC5L._SX331_BO1,204,203,200_.jpg)
]
]

---

# In the News: AI Implementation Scandals

---
class: inverse, center, middle

# The Data Mining Process

---

# Frameworks for Data Mining Projects

.center[
[![](data:image/png;base64,#https://www.datascience-pm.com/wp-content/uploads/2020/10/process-google-search-volume-2019-2020.png)](https://www.datascience-pm.com/crisp-dm-still-most-popular/)
]

---

# The CRISP-DM Process

.pull-left[

- **You are expected to read the [original CRISP-DM paper](http://www.cs.unibo.it/~danilo.montesi/CBD/Beatriz/10.1.1.198.5133.pdf)**

- Each step has several substeps

- **Most of the project time is typically spent in steps 1-3**
]

.pull-right[

.center[
<img src="data:image/png;base64,#../../figures/CRISP_DM_Data_mining_management_process.jpg" width="100%" style="display: block; margin: auto;" />
]
]

---
class: inverse, center, middle

# Frequent Itemsets, Market Basket Analysis and Association Rule Mining

---

# Association Rule Discovery

**Supermarket shelf management – Market-basket model:**

- **Goal:** Identify items that are bought together by sufficiently many customers

- **Approach:** Process the sales data collected with barcode scanners to find dependencies among items

- **A classic rule:**  
  + If someone buys diapers and milk, then they are likely to buy beer  
  + Don’t be surprised if you find six-packs next to diapers!

.footnote[
<html>
<hr>
</html>
**Source:** J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

]

---

# The Market-Basket Model

.pull-left[
- A large set of **items**  
  + e.g., things sold in a supermarket

- A large set of **baskets**

- Each basket is a **small subset of items** 
  + e.g., the things one customer buys on one day

- Want to discover **association rules**  
  + People who bought {x,y,z} tend to buy {v,w} 
    * Amazon!
]

.pull-right[
.center[
**Input:**

<html>
<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-v0hj{background-color:#efefef;border-color:inherit;font-weight:bold;text-align:center;vertical-align:top}
.tg .tg-7btt{border-color:inherit;font-weight:bold;text-align:center;vertical-align:top}
.tg .tg-fymr{border-color:inherit;font-weight:bold;text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-v0hj">Basket #</th>
    <th class="tg-v0hj">Items</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-7btt">1</td>
    <td class="tg-fymr"><span style="color:#BEAED4">Bread</span><span style="color:#333">, </span><span style="color:#F0027F">Coke</span><span style="color:#333">, </span><span style="color:#386CB0">Milk</span></td>
  </tr>
  <tr>
    <td class="tg-7btt">2</td>
    <td class="tg-fymr"><span style="color:#FDC086">Beer</span><span style="color:#333">, </span><span style="color:#BEAED4">Bread</span></td>
  </tr>
  <tr>
    <td class="tg-7btt">3</td>
    <td class="tg-fymr"><span style="color:#FDC086">Beer</span>, <span style="color:#F0027F">Coke</span>, <span style="color:#7FC97F">Diaper</span>, <span style="color:#386CB0">Milk</span></td>
  </tr>
  <tr>
    <td class="tg-7btt">4</td>
    <td class="tg-fymr"><span style="color:#FDC086">Beer</span>, <span style="color:#BEAED4">Bread</span>, <span style="color:#7FC97F">Diaper</span>, <span style="color:#386CB0">Milk</span></td>
  </tr>
  <tr>
    <td class="tg-7btt">5</td>
    <td class="tg-fymr"><span style="color:#F0027F">Coke</span>,<span style="color:#7FC97F"> Diaper</span>, <span style="color:#386CB0">Milk</span></td>
  </tr>
</tbody>
</table>
</html>

<br>

**Output:** .black[.bold[Discovered Rules]]

{<span style="color:#386CB0">Milk</span>} --> {<span style="color:#F0027F">Coke</span>}  
{<span style="color:#7FC97F">Diaper</span>, <span style="color:#386CB0">Milk</span>} --> {<span style="color:#FDC086">Beer</span>}

]
]

.footnote[
<html>
<hr>
</html>
**Source:** J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, <http://www.mmds.org>
]

---

# Definitions: Support & Support Threshold

.pull-left[
- **Simplest question:** Find sets of items that appear together “frequently” in baskets

- **Support for itemset `\(I\)`:** Number of baskets containing all items in `\(I\)`
  + Often expressed as a fraction of the total number of baskets

- Given a **support threshold `\(s\)`**, sets of items that appear in at least `\(s\)` baskets are called **frequent itemsets**
]

.pull-right[

.center[
**Input:**

<br>

.black[.bold[Support of {<span style="color:#FDC086">Beer</span>, <span style="color:#BEAED4">Bread</span>}:]] = 2

]
]

.footnote[
<html>
<hr>
</html>
**Source:** J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, <http://www.mmds.org>
]
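
The support definition can be sketched in a few lines of base R, using the five example baskets from the market-basket slide. The helper name `support()` is our own for illustration; it is not a function from any package.

```r
# The five example baskets from the market-basket slide
baskets <- list(
  c("Bread", "Coke", "Milk"),
  c("Beer", "Bread"),
  c("Beer", "Coke", "Diaper", "Milk"),
  c("Beer", "Bread", "Diaper", "Milk"),
  c("Coke", "Diaper", "Milk")
)

# support(): number of baskets that contain every item in `itemset`
support <- function(itemset, baskets) {
  sum(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

supp_count    <- support(c("Beer", "Bread"), baskets)   # 2, as on the slide
supp_fraction <- supp_count / length(baskets)           # 0.4 as a fraction

# With an assumed support threshold of s = 3, {Beer, Bread} is not frequent
s <- 3
is_frequent <- supp_count >= s                          # FALSE
```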

---

# Non-graded Activity: Frequent Itemsets

.panelset[
.panel[.panel-name[Activity]

.black[.bold[Items]] = {<span style="color:#386CB0">Milk</span>, <span style="color:#F0027F">Coke</span>, <span style="color:#7FC97F">Pepsi</span>, <span style="color:#FDC086">Beer</span>,  <span style="color:#BEAED4">Juice</span>}

<br>

**With a support threshold of 3 baskets, find all frequent itemsets based on these 8 baskets:**

- `\(B_1 =\)` {<span style="color:#386CB0">Milk</span>, <span style="color:#F0027F">Coke</span>, <span style="color:#FDC086">Beer</span>}   `\(\qquad \qquad\)` `\(B_2 =\)` {<span style="color:#386CB0">Milk</span>, <span style="color:#7FC97F">Pepsi</span>, <span style="color:#BEAED4">Juice</span>}

- `\(B_3 =\)` {<span style="color:#386CB0">Milk</span>, <span style="color:#FDC086">Beer</span>}   `\(\qquad \qquad \qquad \quad\)` `\(B_4 =\)` {<span style="color:#F0027F">Coke</span>, <span style="color:#BEAED4">Juice</span>}

- `\(B_5 =\)` {<span style="color:#386CB0">Milk</span>, <span style="color:#7FC97F">Pepsi</span>, <span style="color:#FDC086">Beer</span>}   `\(\qquad \qquad\)` `\(B_6 =\)` {<span style="color:#386CB0">Milk</span>, <span style="color:#F0027F">Coke</span>,  <span style="color:#FDC086">Beer</span>, <span style="color:#BEAED4">Juice</span>}

- `\(B_7 =\)` {<span style="color:#F0027F">Coke</span>, <span style="color:#FDC086">Beer</span>, <span style="color:#BEAED4">Juice</span>}   `\(\qquad \qquad\)` `\(B_8 =\)` {<span style="color:#F0027F">Coke</span>,  <span style="color:#FDC086">Beer</span>}
]

.panel[.panel-name[Your Solution]

**Identify all frequent singletons, doubles, triples, etc.**

.can-edit.key-activity2[Edit me and insert your solution here]

]
]

---

# Association Rules

- **Association Rules:** If-then rules about the contents of baskets

- .orange[{i<sub>1</sub>, i<sub>2</sub>,…,i<sub>k</sub>} &#8594; j]  means: "if a basket contains all of `\(i_1,…,i_k\)` then it is likely to contain `\(j\)`"

- **In practice there are many rules, so we want to find the significant/interesting ones!**

- **Confidence** of this association rule is the probability of `\(j\)` given `\(I =\)` {i<sub>1</sub>,…,i<sub>k</sub>}

.center[
`\(conf(I \rightarrow j) = P (j \ | \ I) = \frac{support(I \ \cup \ \{j\})}{support(I)}\)`
]

- **Not all high-confidence rules are interesting**
  + The rule .black[.bold[X]] &#8594; .black[.bold[milk]] may have high confidence for many itemsets .black[.bold[X]] simply because .black[.bold[milk]] is purchased very often, independent of .black[.bold[X]]

- **Lift** of an association rule `\(I \rightarrow j\)` is the ratio between its confidence and the fraction of baskets containing `\(j\)`: `\(\qquad lift(I \rightarrow j) = \frac{conf(I \rightarrow j)}{Pr(j)}\)`

.footnote[
<html>
<hr>
</html>
**Adapted from** J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, <http://www.mmds.org>
]
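
As a minimal sketch of these two measures, the base R code below computes the confidence and lift of the rule {Diaper, Milk} &#8594; Beer on the five example baskets from the market-basket slide. The helper names `support()`, `confidence()`, and `lift()` are hypothetical and written here for illustration only.

```r
baskets <- list(
  c("Bread", "Coke", "Milk"),
  c("Beer", "Bread"),
  c("Beer", "Coke", "Diaper", "Milk"),
  c("Beer", "Bread", "Diaper", "Milk"),
  c("Coke", "Diaper", "Milk")
)

# Number of baskets containing every item in `itemset`
support <- function(itemset, baskets) {
  sum(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# conf(I -> j): baskets containing I and j, divided by baskets containing I
confidence <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) / support(lhs, baskets)
}

# lift(I -> j): confidence divided by the fraction of baskets containing j
lift <- function(lhs, rhs, baskets) {
  confidence(lhs, rhs, baskets) / (support(rhs, baskets) / length(baskets))
}

conf_beer <- confidence(c("Diaper", "Milk"), "Beer", baskets)  # 2/3
lift_beer <- lift(c("Diaper", "Milk"), "Beer", baskets)        # (2/3) / (3/5)
```

Here the lift is slightly above 1, so Beer appears a little more often in baskets with Diaper and Milk than it does overall.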

---

# Non-Graded Activity: Confidence and Lift

.panelset[
.panel[.panel-name[Activity]

<br>

**For the association rule:** {<span style="color:#386CB0">Milk</span>, <span style="color:#FDC086">Beer</span>} &#8594; <span style="color:#F0027F">Coke</span>, compute both its confidence and lift.

]

.panel[.panel-name[Your Solution]

**Computing the confidence and lift for the association rule** {<span style="color:#386CB0">Milk</span>, <span style="color:#FDC086">Beer</span>} &#8594; <span style="color:#F0027F">Coke</span>

.can-edit.key-activity3[Edit me and insert your solution here]

]
]

---

# Finding Association Rules

- **Problem:** .black[.bold[Find all association rules with support &ge; s and confidence &ge; c]]

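  + **Note:** Here, the support of an association rule is the support of the set of items on its left side. The `arules` package instead reports the support of all items in the rule (left and right sides together) and reports the left-side support as `coverage`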

- **Hard part:** .black[.bold[Finding the frequent itemsets!]]

  + If .red[{i<sub>1</sub>, i<sub>2</sub>,…,i<sub>k</sub>} &#8594; j] has high support and confidence, then both .orange[{i<sub>1</sub>, i<sub>2</sub>,… ,i<sub>k</sub>}] and .orange[{i<sub>1</sub>, i<sub>2</sub>,…,i<sub>k</sub>, j}] will be “frequent”
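
The two-step idea (find frequent itemsets, then form rules from them) can be sketched by brute force on the five example baskets from the market-basket slide. The thresholds `s` and `c_min` and the helper `support()` are assumptions made for illustration; enumerating every itemset is only feasible because there are five items.

```r
baskets <- list(
  c("Bread", "Coke", "Milk"),
  c("Beer", "Bread"),
  c("Beer", "Coke", "Diaper", "Milk"),
  c("Beer", "Bread", "Diaper", "Milk"),
  c("Coke", "Diaper", "Milk")
)
support <- function(itemset, baskets) {
  sum(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

s     <- 3     # assumed support threshold (number of baskets)
c_min <- 0.8   # assumed confidence threshold

items <- sort(unique(unlist(baskets)))

# Step 1 (the hard part at scale): keep the frequent itemsets of size >= 2
candidates <- unlist(
  lapply(2:length(items), function(k) combn(items, k, simplify = FALSE)),
  recursive = FALSE
)
frequent <- Filter(function(set) support(set, baskets) >= s, candidates)

# Step 2: from each frequent itemset, form the rules (set minus j) -> j
rules <- do.call(rbind, lapply(frequent, function(set) {
  do.call(rbind, lapply(set, function(j) {
    lhs <- setdiff(set, j)
    data.frame(
      lhs = paste(lhs, collapse = ", "),
      rhs = j,
      confidence = support(set, baskets) / support(lhs, baskets),
      stringsAsFactors = FALSE
    )
  }))
}))

strong_rules <- rules[rules$confidence >= c_min, ]
strong_rules
```

Step 2 is cheap once the frequent itemsets are known, which is why the effort goes into Step 1.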

---

# Naïve Approach to Counting Frequent Itemsets

- Naïve approach to finding frequent pairs

- **Read the file once, counting in main memory the occurrences of each pair:**
  + From each basket of `\(n\)` items, generate its `\(\frac{n(n-1)}{2}\)` pairs by two nested loops

- Fails if (#items)<sup>2</sup> exceeds main memory
  + Remember: #items can be 100K (Wal-Mart) or 10B (Web pages)
    * Suppose `\(10^5\)` items, counts are 4-byte integers
    * Number of pairs of items: `\(\frac{10^5(10^5-1)}{2} \approx 5 \times 10^9\)`
    * Therefore, about `\(2 \times 10^{10}\)` bytes (20 gigabytes) of memory are needed

.footnote[
<html>
<hr>
</html>
**Source:** J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, <http://www.mmds.org>
]
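
The memory arithmetic and the nested-loop counting above can be sketched in base R as follows. The helper `count_pairs()` and the `"a|b"` key format are our own illustrative choices, and the five small baskets from the market-basket slide stand in for a real transaction file.

```r
# Memory needed to hold one 4-byte count per pair of items
n_items      <- 1e5
n_pairs      <- n_items * (n_items - 1) / 2   # about 5 * 10^9 pairs
bytes_needed <- 4 * n_pairs                   # about 2 * 10^10 bytes
gb_needed    <- bytes_needed / 1e9            # about 20 gigabytes

# Naive counting: two nested loops generate the n(n-1)/2 pairs of each basket
count_pairs <- function(baskets) {
  counts <- integer(0)                  # named vector: "a|b" -> count
  for (basket in baskets) {
    items <- sort(unique(basket))       # canonical order, so a|b equals b|a
    n <- length(items)
    if (n < 2) next
    for (i in 1:(n - 1)) {
      for (j in (i + 1):n) {
        key <- paste(items[i], items[j], sep = "|")
        counts[key] <- if (is.na(counts[key])) 1L else counts[key] + 1L
      }
    }
  }
  counts
}

baskets <- list(
  c("Bread", "Coke", "Milk"),
  c("Beer", "Bread"),
  c("Beer", "Coke", "Diaper", "Milk"),
  c("Beer", "Bread", "Diaper", "Milk"),
  c("Coke", "Diaper", "Milk")
)
pair_counts <- count_pairs(baskets)
```

Every pair that occurs gets its own counter, which is exactly what becomes infeasible when the number of items is large.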

---

# A-Priori Algorithm

- A .black[.bold[two-pass]] approach called **A-Priori** limits the need for main memory

- **Key idea:** .black[.bold[monotonicity]]

  + If a set of items `\(I\)` appears at least `\(s\)` times, so does every subset `\(J\)` of `\(I\)`

- **Contrapositive for pairs:** If item `\(i\)` does not appear in at least `\(s\)` baskets, then no pair including `\(i\)` can appear in at least `\(s\)` baskets

<br>

- **So, how does A-Priori find frequent pairs?**

.footnote[
<html>
<hr>
</html>
**Source:** J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, <http://www.mmds.org>
]

---
count:false

# A-Priori Algorithm

.pull-left[
- **Pass 1:** Read baskets and count in main memory the occurrences of each **individual item**
  + Requires only memory proportional to #items

- **Items that appear `\(\ge s\)` times are the frequent items**

- **Pass 2:** Read baskets again and count in main memory **only those pairs where both elements are frequent (from Pass 1)**
]

.pull-right[
.center[
![](data:image/png;base64,#../../figures/apriori.png)
]
]

.footnote[
<html>
<hr>
</html>
**Source:** J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, <http://www.mmds.org>
]
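
The two passes can be sketched in base R as shown below. This is a simplified illustration for pairs only, not the `arules` implementation. It reuses the eight baskets and the support threshold of 3 from the earlier frequent-itemsets activity, so it can double as a check of your answer for singletons and pairs.

```r
baskets <- list(
  c("Milk", "Coke", "Beer"),            # B1
  c("Milk", "Pepsi", "Juice"),          # B2
  c("Milk", "Beer"),                    # B3
  c("Coke", "Juice"),                   # B4
  c("Milk", "Pepsi", "Beer"),           # B5
  c("Milk", "Coke", "Beer", "Juice"),   # B6
  c("Coke", "Beer", "Juice"),           # B7
  c("Coke", "Beer")                     # B8
)
s <- 3   # support threshold from the activity

# Pass 1: count individual items and keep the frequent ones
item_counts    <- table(unlist(baskets))
frequent_items <- names(item_counts)[item_counts >= s]

# Pass 2: count only those pairs whose two items are both frequent
pair_keys <- unlist(lapply(baskets, function(b) {
  kept <- sort(intersect(b, frequent_items))   # drop infrequent items first
  if (length(kept) < 2) return(character(0))
  combn(kept, 2, FUN = paste, collapse = "|")
}))
pair_counts    <- table(pair_keys)
frequent_pairs <- names(pair_counts)[pair_counts >= s]
frequent_pairs
```

Pepsi is dropped after Pass 1, so no pair containing Pepsi is ever counted in Pass 2; that pruning is where the memory savings come from.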

---

# Using <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg> to Mine Association Rules

**In class, we will go through this R code, explaining: (a) what each function is doing, and (b) the outputs from each step.**

.font90[

``` r
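# install pacman if it is missing, then load (and install if needed) the packages
if(require(pacman)==FALSE) install.packages('pacman')
pacman::p_load(arules, tidyverse)

data('Groceries') # loads the dataset; note its class (transactions)

summary(Groceries) # dimensions, density, and most frequent items

# relative frequency of each item, in the order the items are stored
itemFrequency(Groceries)
itemFrequency(Groceries) %>% sort(decreasing = TRUE)

itemFrequencyPlot(Groceries, support = 0.1) # items with support >= 10%
itemFrequencyPlot(Groceries, topN = 20)     # the 20 most frequent items

# mine association rules with a certain min support and confidence
grocery_rules <- apriori(
  Groceries, parameter = list(
    support = 0.01, confidence = 0.5, minlen = 2, maxlen = 5)  )

summary(grocery_rules) # number of rules and quality-measure summaries
inspect(grocery_rules) # print every rule with its quality measures

# show the three rules with the highest lift
sort(grocery_rules, by = 'lift', decreasing = TRUE)[1:3] %>% inspect()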
```
]

---
class: inverse, center, middle

# Recap

---

# Summary of Main Points

- Describe the goals & functions of data mining

- Understand the statistical limits on data mining

- Describe the data mining process

- Define “frequent itemsets” & describe applications of this concept

- Explain how and why “association rules” are constructed