class: center, middle, inverse, title-slide .title[ # ISA 401: Business Intelligence & Data Visualization ] .subtitle[ ## 24: A Short Introduction to Clustering ] .author[ ###
Fadel M. Megahed, PhD
Professor of Information Systems and Business Analytics
Farmer School of Business
Miami University
@FadelMegahed
fmegahed
fmegahed@miamioh.edu
Automated Scheduler for Office Hours
] .date[ ### Fall 2024 ] --- # A Recap of What We Learned Last Class - Describe the goals & functions of data mining - Understand the statistical limits on data mining - Describe the data mining process - Describe what “frequent itemsets” are & how this concept is applied - Explain how and why “association rules” are constructed - Use
R to implement both concepts --- # Kahoot: A Recap of Phase 3 of Class So Far Let us go to Kahoot and compete for a $10
Starbucks gift card. To evaluate your understanding of the material, please answer the questions correctly and as quickly as possible to get the most points. --- # Learning Objectives for Today's Class - Describe the different steps of the `\(k\)`-means algorithm - Cluster using `\(k\)`-means (by hand) - Cluster using `\(k\)`-means (software) + R
+ Tableau --- class: inverse, center, middle # An Overview of Clustering Techniques --- # The Problem of Clustering - Given a **set of (high-dimensional) observations**, with a notion of **distance** between observations, **group the observations** into **some number of clusters**, so that: + Members of a cluster are close/similar to each other + Members of different clusters are dissimilar - **Usually:** + The observations are in a high-dimensional space + Similarity is defined using a distance measure, e.g., * Euclidean, Cosine, Jaccard, edit distance, etc. .footnote[ <html> <hr> </html> **Source:** J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, <http://www.mmds.org> ] --- # Clustering in 2D Space .pull-left[ .center[ **Meet the Palmer penguins** [Artwork by Allison Horst](https://allisonhorst.github.io/palmerpenguins/) ] ] .pull-right[ .center[ **Anatomical description of the dataset:** [Bill anatomy artwork by Allison Horst](https://allisonhorst.github.io/palmerpenguins/) ] ] .footnote[ <html> <hr> </html> **Source:** The data are available by [CC-0 license](https://creativecommons.org/share-your-work/public-domain/cc0/) in accordance with the [Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data](http://pal.lternet.edu/data/policies). The artwork is by Allison Horst and available at <https://allisonhorst.github.io/palmerpenguins/>, and the data is downloaded using the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/)
package. ] --- # Clustering in 2D Space: Formulation - Given a **set of observations (each containing bill length and depth)**, with a notion of **Euclidean distance** between observations, **group the observations** into **3 clusters**, so that: + Members of a cluster are close/similar to each other + Members of different clusters are dissimilar - Note that we are assuming we do not have a "label/type" for each penguin. .footnote[ <html> <hr> </html> **Source:** The data are available by [CC-0 license](https://creativecommons.org/share-your-work/public-domain/cc0/) in accordance with the [Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data](http://pal.lternet.edu/data/policies). The artwork is by Allison Horst and available at <https://allisonhorst.github.io/palmerpenguins/>, and the data is downloaded using the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/)
package. ] --- # Clustering in 2D Space: Raw Data <img src="24_clustering_intro_files/figure-html/bill_length_depth1-1.png" style="display: block; margin: auto;" /> .footnote[ <html> <hr> </html> **Source:** The data are available by [CC-0 license](https://creativecommons.org/share-your-work/public-domain/cc0/) in accordance with the [Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data](http://pal.lternet.edu/data/policies). The artwork is by Allison Horst and available at <https://allisonhorst.github.io/palmerpenguins/>, and the data is downloaded using the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/)
package. ] --- # Clustering in 2D Space: Labeled Raw Data <img src="24_clustering_intro_files/figure-html/bill_length_depth2-1.png" style="display: block; margin: auto;" /> .footnote[ <html> <hr> </html> **Source:** The data are available by [CC-0 license](https://creativecommons.org/share-your-work/public-domain/cc0/) in accordance with the [Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data](http://pal.lternet.edu/data/policies). The artwork is by Allison Horst and available at <https://allisonhorst.github.io/palmerpenguins/>, and the data is downloaded using the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/)
package. ] --- # Clustering in 2D Space: Clustering Results <img src="24_clustering_intro_files/figure-html/bill_length_depth3-1.png" style="display: block; margin: auto;" /> .footnote[ <html> <hr> </html> **Source:** The data are available by [CC-0 license](https://creativecommons.org/share-your-work/public-domain/cc0/) in accordance with the [Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data](http://pal.lternet.edu/data/policies). The artwork is by Allison Horst and available at <https://allisonhorst.github.io/palmerpenguins/>, and the data is downloaded using the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/)
package. ] --- # Comments on the 2D Clustering Problem Even though the 2D clustering problem is the easiest to "solve," since we can inspect the data in a plot, **clustering is hard**. **Some important questions:** 1. With all the variables being numerical, we often assume **Euclidean distance**. This can be problematic when: - variables have significantly different scales - we are including information that is not pertinent to grouping 2. How do we determine the number of clusters (*k*)? 3. How do we represent a cluster of many points? 4. How do we determine the "nearness" of clusters? --- # An Overview of Clustering Methods [Taxonomy of clustering algorithms (figure from Fahad et al., 2014)](https://www.computer.org/csdl/journal/ec/2014/03/06832486/13rRUEgs2xB) .footnote[ <html> <hr> </html> **Source:** A. Fahad et al., "A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis" in IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 03, pp. 267-279, 2014. <https://doi.org/10.1109/TETC.2014.2330519> ] --- class: inverse, center, middle # `\(k\)`-means Algorithm --- # General Idea The `\(k\)`-means algorithm clusters data by trying to separate samples into `\(n\)` groups of equal variance, minimizing a criterion known as the **inertia** or **within-cluster sum-of-squares** (see below). This algorithm requires the **number of clusters to be specified**. .center[ `\(\sum_{i=0}^{n}\min_{\mu_j \in C}(||x_i - \mu_j||^2)\)` ] **Inertia is a measure of how internally coherent clusters are; however, it suffers from various drawbacks:** - Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters, or manifolds with irregular shapes. - Inertia is not a normalized metric: we just know that lower values are better and zero is optimal. But in very high-dimensional spaces, Euclidean distances tend to become inflated. .footnote[ <html> <hr> </html> **Source:** Clustering — scikit-learn 1.0.2 documentation <https://scikit-learn.org/stable/modules/clustering.html#k-means> ] --- # The Steps of the `\(k\)`-means Algorithm In basic terms, the algorithm has three steps. 0. Step 0 chooses the initial centroids, with the most basic method being to choose `\(k\)` samples from the dataset `\(X\)`. After initialization, `\(k\)`-means consists of looping between the remaining two steps. 1. Step 1 assigns each sample to its nearest centroid. 2. Step 2 creates new centroids by taking the mean value of all of the samples assigned to each previous centroid. The difference between the old and the new centroids is computed. **The algorithm repeats these last two steps until the centroids no longer move significantly.** (A minimal R sketch of these steps appears on the next slide.) .footnote[ <html> <hr> </html> **Source:** Clustering — scikit-learn 1.0.2 documentation <https://scikit-learn.org/stable/modules/clustering.html#k-means> ] --- # Out-Of-Class Activity: Finish by Friday Use the `\(k\)`-means algorithm to cluster the following observations. Use `\(k=2\)` and Euclidean distance.
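To connect the three steps from the previous slide to code, here is a minimal base-R sketch of a single assignment-and-update pass. The toy matrix, the choice of `\(k=2\)`, and the use of rows 1 and 3 as starting centroids are all illustrative assumptions — this is **not** the handout's prescribed setup:

``` r
# a tiny toy dataset (4 observations, 2 variables); k = 2 for illustration
X = matrix(c(1, 1,  2, 1,  4, 3,  5, 4), ncol = 2, byrow = TRUE)
centroids = X[c(1, 3), ] # Step 0: pick k rows as the initial centroids

# Step 1: assign each observation to its nearest centroid (Euclidean distance)
d = as.matrix(dist(rbind(centroids, X)))[-(1:2), 1:2]
assignment = apply(d, 1, which.min)

# Step 2: recompute each centroid as the mean of its assigned observations
new_centroids = rbind(colMeans(X[assignment == 1, , drop = FALSE]),
                      colMeans(X[assignment == 2, , drop = FALSE]))

# within-cluster sum of squares (the "inertia" from the General Idea slide)
sum((X - new_centroids[assignment, ])^2)

# in practice, Steps 1 and 2 repeat until the centroids stop moving
```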
**Use [this handout](https://miamioh.instructure.com/courses/223961/files/33712604?module_item_id=5523886) to work through the `\(k\)`-means algorithm by hand.** <style type="text/css"> .tg {border-collapse:collapse;border-spacing:0;} .tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; overflow:hidden;padding:10px 5px;word-break:normal;} .tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;} .tg .tg-baqh{text-align:center;vertical-align:top} .tg .tg-amwm{font-weight:bold;text-align:center;vertical-align:top} .tg .tg-0lax{text-align:left;vertical-align:top} </style> <table class="tg"> <thead> <tr> <th class="tg-amwm">Observation</th> <th class="tg-amwm">X1</th> <th class="tg-amwm">X2</th> </tr> </thead> <tbody> <tr> <td class="tg-baqh">1</td> <td class="tg-0lax">1.0</td> <td class="tg-0lax">1.0</td> </tr> <tr> <td class="tg-baqh">2</td> <td class="tg-0lax">1.5</td> <td class="tg-0lax">2.0</td> </tr> <tr> <td class="tg-baqh">3</td> <td class="tg-0lax">3.0</td> <td class="tg-0lax">4.0</td> </tr> <tr> <td class="tg-baqh">4</td> <td class="tg-0lax">5.0</td> <td class="tg-0lax">7.0</td> </tr> <tr> <td class="tg-baqh">5</td> <td class="tg-0lax">3.5</td> <td class="tg-0lax">5.0</td> </tr> <tr> <td class="tg-baqh">6</td> <td class="tg-0lax">4.5</td> <td class="tg-0lax">5.0</td> </tr> <tr> <td class="tg-baqh">7</td> <td class="tg-0lax">3.5</td> <td class="tg-0lax">4.5</td> </tr> </tbody> </table> .footnote[ <html> <hr> </html> **Solution:** Once you complete the handout, you can check your solution (starting from Saturday) by downloading [this file](https://miamioh.instructure.com/courses/223961/files/33712605?module_item_id=5523887).
] --- # Practical Issues with `\(k\)`-means Clustering .panelset[ .panel[.panel-name[Data] .font80[ ``` r penguins_tbl = palmerpenguins::penguins # our data for today penguins_tbl # printing it out ``` ``` ## # A tibble: 344 × 8 ## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## <fct> <fct> <dbl> <dbl> <int> <int> ## 1 Adelie Torgersen 39.1 18.7 181 3750 ## 2 Adelie Torgersen 39.5 17.4 186 3800 ## 3 Adelie Torgersen 40.3 18 195 3250 ## 4 Adelie Torgersen NA NA NA NA ## 5 Adelie Torgersen 36.7 19.3 193 3450 ## 6 Adelie Torgersen 39.3 20.6 190 3650 ## 7 Adelie Torgersen 38.9 17.8 181 3625 ## 8 Adelie Torgersen 39.2 19.6 195 4675 ## 9 Adelie Torgersen 34.1 18.1 193 3475 ## 10 Adelie Torgersen 42 20.2 190 4250 ## # ℹ 334 more rows ## # ℹ 2 more variables: sex <fct>, year <int> ``` ] ] .panel[.panel-name[Prep] .font80[ ``` r penguins_tbl = penguins_tbl |> # selecting relevant cols dplyr::select(species, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |> na.omit() |> # removing NAs dplyr::mutate_at(dplyr::vars(-species), scale) # scaling numeric variables penguins_tbl # printing it out ``` ``` ## # A tibble: 342 × 5 ## species bill_length_mm[,1] bill_depth_mm[,1] flipper_length_mm[,1] ## <fct> <dbl> <dbl> <dbl> ## 1 Adelie -0.883 0.784 -1.42 ## 2 Adelie -0.810 0.126 -1.06 ## 3 Adelie -0.663 0.430 -0.421 ## 4 Adelie -1.32 1.09 -0.563 ## 5 Adelie -0.847 1.75 -0.776 ## 6 Adelie -0.920 0.329 -1.42 ## 7 Adelie -0.865 1.24 -0.421 ## 8 Adelie -1.80 0.480 -0.563 ## 9 Adelie -0.352 1.54 -0.776 ## 10 Adelie -1.12 -0.0259 -1.06 ## # ℹ 332 more rows ## # ℹ 1 more variable: body_mass_g <dbl[,1]> ``` ] ] .panel[.panel-name[k-means (k=3)] ``` r # note: kmeans() starts from random centroids; call set.seed() first for reproducible results km_res = kmeans( x = penguins_tbl |> dplyr::select(-species), # input data with no label centers = 3) # k = 3 # tabulating the results with rows corresponding to true labels and the columns corresponding to clusters table(penguins_tbl$species, km_res$cluster) ``` ``` ## ## 1 2 3 ## Adelie 0 0 151 ## Chinstrap 0 1 67 ## Gentoo 66 57 0 ``` ] .panel[.panel-name[Optimal k] .pull-left[ .font70[ ``` r km_res_nbclust = NbClust::NbClust( data = penguins_tbl |> dplyr::select(-species), distance = "euclidean", min.nc = 2, max.nc = 10, method = "kmeans", index = "all") table(penguins_tbl$species, km_res_nbclust$Best.partition) ``` ] ] .pull-right[ .font60[ ``` ## *** : The Hubert index is a graphical method of determining the number of clusters. ## In the plot of Hubert index, we seek a significant knee that corresponds to a ## significant increase of the value of the measure i.e the significant peak in Hubert ## index second differences plot. ## ``` ``` ## *** : The D index is a graphical method of determining the number of clusters. ## In the plot of D index, we seek a significant knee (the significant peak in Dindex ## second differences plot) that corresponds to a significant increase of the value of ## the measure.
## ## ******************************************************************* ## * Among all indices: ## * 8 proposed 2 as the best number of clusters ## * 11 proposed 3 as the best number of clusters ## * 1 proposed 4 as the best number of clusters ## * 3 proposed 5 as the best number of clusters ## * 1 proposed 10 as the best number of clusters ## ## ***** Conclusion ***** ## ## * According to the majority rule, the best number of clusters is 3 ## ## ## ******************************************************************* ``` ``` ## ## 1 2 3 ## Adelie 8 0 143 ## Chinstrap 63 0 5 ## Gentoo 0 123 0 ``` ] ] ] .panel[.panel-name[Viz Clusters] .pull-left[ .font70[ ``` r factoextra::fviz_cluster( object = list( cluster = km_res_nbclust$Best.partition, data = penguins_tbl |> dplyr::select(-species) ), ellipse.type = "convex", palette = "jco", ggtheme = ggplot2::theme_minimal() ) ``` ] ] .pull-right[ .font70[ <img src="24_clustering_intro_files/figure-html/penguins6_out-1.png" style="display: block; margin: auto;" /> ] ] ] ] --- # Summary of Practical Issues - Rescale numeric data prior to `\(k\)`-means implementation (a minimal sketch of the first two options appears in the appendix slide at the end). The scaling can be: + a z-transformation similar to what we did in the example + a 0-1 (min-max) scaling + converting count data into percentages or rates per a fixed population size (e.g., cases per 100,000 people) + etc. - Use more than one metric to determine `\(k\)` when using `\(k\)`-means clustering - Your cluster solution is not the end result; you will need to: + visualize it in an appropriate way (simple representation as in the previous slide, [spatially](https://fmegahed.github.io/covid_analysis_final.html#33_Visualizing_the_Clustering_Results), [time-based](https://fmegahed.github.io/isa401/class23/23_data_mining_overview.html?panelset1=calendar-plot-of-clustered-data&panelset4=data3&panelset5=data4&panelset6=activity3&panelset7=activity4#10), etc.) + attempt to explain the cluster membership using an appropriate binomial/multinomial model (e.g., see [this analysis](https://fmegahed.github.io/covid_analysis_final.html#4_Explanatory_Modeling_of_Cluster_Assignments)) --- # `\(k\)`-means in Tableau Let us use Tableau to implement `\(k\)`-means clustering on the 60 sample observations from the penguins dataset shown in Slide 11 of this presentation. --- class: inverse, center, middle # Recap --- # Summary of Main Points - Describe the different steps of the `\(k\)`-means algorithm - Cluster using `\(k\)`-means (by hand) - Cluster using `\(k\)`-means (software) + R
+ Tableau
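--- # Appendix: Two Common Rescalings in R A minimal sketch of the first two rescalings listed on the Summary of Practical Issues slide, applied to a single numeric column; using the penguins' body-mass column here is an illustrative choice, not part of the earlier example:

``` r
x = na.omit(palmerpenguins::penguins$body_mass_g) # drop missing values first

z_scaled    = (x - mean(x)) / sd(x)            # z-transformation (what scale() does)
unit_scaled = (x - min(x)) / (max(x) - min(x)) # 0-1 (min-max) scaling

range(z_scaled)    # centered near 0, in standard-deviation units
range(unit_scaled) # exactly 0 to 1
```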