class: center, middle, inverse, title-slide .title[ # ISA 401: Business Intelligence & Data Visualization ] .subtitle[ ## 24: A Short Introduction to Clustering ] .author[ ###
Fadel M. Megahed, PhD
Professor of Information Systems and Business Analytics
Farmer School of Business
Miami University
@FadelMegahed
fmegahed
fmegahed@miamioh.edu
Automated Scheduler for Office Hours
] .date[ ### Fall 2024 ] --- # A Recap of What We Learned Last Class - Describe the goals & functions of data mining - Understand the statistical limits on data mining - Describe the data mining process - Describe what “frequent itemsets” are & how this concept is applied - Explain how and why “association rules” are constructed - Use
R to implement both concepts --- # Kahoot: A Recap of Phase 3 of Class So Far Let us go to Kahoot and compete for a $10
Starbucks gift card. To evaluate your understanding of the material, please answer the questions correctly and as quickly as possible to get the most points. --- # Learning Objectives for Today's Class - Describe the different steps of the `\(k\)`-means algorithm - Cluster using `\(k\)`-means (by hand) - Cluster using `\(k\)`-means (software) + R
+ Tableau --- class: inverse, center, middle # An Overview of Clustering Techniques --- # The Problem of Clustering - Given a **set of (high-dimensional) observations**, with a notion of **distance** between observations, **group the observations** into **some number of clusters**, so that: + Members of a cluster are close/similar to each other + Members of different clusters are dissimilar - **Usually:** + The observations are in a high-dimensional space + Similarity is defined using a distance measure, e.g., * Euclidean, Cosine, Jaccard, edit distance, etc. .footnote[ <html> <hr> </html> **Source:** J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, <http://www.mmds.org> ] --- # Clustering in 2D Space .pull-left[ .center[ **Meet the Palmer penguins** [Artwork by Allison Horst](https://allisonhorst.github.io/palmerpenguins/) ] ] .pull-right[ .center[ **Anatomical description of the dataset:** [Bill anatomy artwork by Allison Horst](https://allisonhorst.github.io/palmerpenguins/) ] ] .footnote[ <html> <hr> </html> **Source:** The data are available by [CC-0 license](https://creativecommons.org/share-your-work/public-domain/cc0/) in accordance with the [Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data](http://pal.lternet.edu/data/policies). The artwork is by Allison Horst and available at <https://allisonhorst.github.io/palmerpenguins/>, and the data is downloaded using the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/)
package. ] --- # Clustering in 2D Space: Formulation - Given a **set of observations (each containing bill length and depth)**, with a notion of **Euclidean distance** between observations, **group the observations** into **3 clusters**, so that: + Members of a cluster are close/similar to each other + Members of different clusters are dissimilar - Note that we are assuming we do not have a "label/type" for each penguin. .footnote[ <html> <hr> </html> **Source:** The data are available by [CC-0 license](https://creativecommons.org/share-your-work/public-domain/cc0/) in accordance with the [Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data](http://pal.lternet.edu/data/policies). The artwork is by Allison Horst and available at <https://allisonhorst.github.io/palmerpenguins/>, and the data is downloaded using the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/)
package. ] --- # Clustering in 2D Space: Raw Data <img src="24_clustering_intro_files/figure-html/bill_length_depth1-1.png" style="display: block; margin: auto;" /> .footnote[ <html> <hr> </html> **Source:** The data are available by [CC-0 license](https://creativecommons.org/share-your-work/public-domain/cc0/) in accordance with the [Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data](http://pal.lternet.edu/data/policies). The artwork is by Allison Horst and available at <https://allisonhorst.github.io/palmerpenguins/>, and the data is downloaded using the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/)
package. ] --- # Clustering in 2D Space: Labeled Raw Data <img src="24_clustering_intro_files/figure-html/bill_length_depth2-1.png" style="display: block; margin: auto;" /> .footnote[ <html> <hr> </html> **Source:** The data are available by [CC-0 license](https://creativecommons.org/share-your-work/public-domain/cc0/) in accordance with the [Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data](http://pal.lternet.edu/data/policies). The artwork is by Allison Horst and available at <https://allisonhorst.github.io/palmerpenguins/>, and the data is downloaded using the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/)
package. ] --- # Clustering in 2D Space: Clustering Results <img src="24_clustering_intro_files/figure-html/bill_length_depth3-1.png" style="display: block; margin: auto;" /> .footnote[ <html> <hr> </html> **Source:** The data are available by [CC-0 license](https://creativecommons.org/share-your-work/public-domain/cc0/) in accordance with the [Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data](http://pal.lternet.edu/data/policies). The artwork is by Allison Horst and available at <https://allisonhorst.github.io/palmerpenguins/>, and the data is downloaded using the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/)
package. ] --- # Comments on the 2D Clustering Problem Even though the 2D clustering problem is the easiest to "solve," since we can inspect the data in a plot, **clustering is hard**. **Some important questions:** 1. With all the variables being numerical, we often assume **Euclidean distance**. This can be problematic when: - variables have significantly different scales - we are including information that is not pertinent to grouping 2. How do we determine the number of clusters (*k*)? 3. How do we represent a cluster of many points? 4. How do we determine the "nearness" of clusters? --- # An Overview of Clustering Methods [Taxonomy of clustering algorithms (figure from Fahad et al., 2014)](https://www.computer.org/csdl/journal/ec/2014/03/06832486/13rRUEgs2xB) .footnote[ <html> <hr> </html> **Source:** A. Fahad et al., "A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis" in IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 03, pp. 267-279, 2014. <https://doi.org/10.1109/TETC.2014.2330519> ] --- class: inverse, center, middle # `\(k\)`-means Algorithm --- # General Idea The `\(k\)`-means algorithm clusters data by trying to separate samples into `\(n\)` groups of equal variance, minimizing a criterion known as the **inertia** or **within-cluster sum-of-squares** (see below). This algorithm requires the **number of clusters to be specified**. .center[ `\(\sum_{i=0}^{n}\min_{\mu_j \in C}(||x_i - \mu_j||^2)\)` ] **Inertia is a measure of how internally coherent clusters are; however, it suffers from various drawbacks:** - Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters, or manifolds with irregular shapes. - Inertia is not a normalized metric: we just know that lower values are better and zero is optimal. But in very high-dimensional spaces, Euclidean distances tend to become inflated. .footnote[ <html> <hr> </html> **Source:** Clustering — scikit-learn 1.0.2 documentation <https://scikit-learn.org/stable/modules/clustering.html#k-means> ] --- # The Steps of the `\(k\)`-means Algorithm In basic terms, the algorithm has three steps. 0. Step 0 chooses the initial centroids, with the most basic method being to choose `\(k\)` samples from the dataset `\(X\)`. After initialization, `\(k\)`-means consists of looping between the remaining two steps. 1. Step 1 assigns each sample to its nearest centroid. 2. Step 2 creates new centroids by taking the mean value of all of the samples assigned to each previous centroid. The difference between the old and the new centroids is computed. **The algorithm repeats these last two steps until the centroids no longer move significantly.** (A minimal R sketch of these steps appears on the next slide.) .footnote[ <html> <hr> </html> **Source:** Clustering — scikit-learn 1.0.2 documentation <https://scikit-learn.org/stable/modules/clustering.html#k-means> ] --- # Out-Of-Class Activity: Finish by Friday Use the `\(k\)`-means algorithm to cluster the following observations. Use `\(k=2\)` and Euclidean distance.
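To connect the three steps from the previous slide to code, here is a minimal base-R sketch of a single assignment-and-update pass. The toy matrix, the choice of `\(k=2\)`, and the use of rows 1 and 3 as starting centroids are all illustrative assumptions — this is **not** the handout's prescribed setup:

``` r
# a tiny toy dataset (4 observations, 2 variables); k = 2 for illustration
X = matrix(c(1, 1,  2, 1,  4, 3,  5, 4), ncol = 2, byrow = TRUE)
centroids = X[c(1, 3), ] # Step 0: pick k rows as the initial centroids

# Step 1: assign each observation to its nearest centroid (Euclidean distance)
d = as.matrix(dist(rbind(centroids, X)))[-(1:2), 1:2]
assignment = apply(d, 1, which.min)

# Step 2: recompute each centroid as the mean of its assigned observations
new_centroids = rbind(colMeans(X[assignment == 1, , drop = FALSE]),
                      colMeans(X[assignment == 2, , drop = FALSE]))

# within-cluster sum of squares (the "inertia" from the General Idea slide)
sum((X - new_centroids[assignment, ])^2)

# in practice, Steps 1 and 2 repeat until the centroids stop moving
```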
**Use [this handout](https://miamioh.instructure.com/courses/223961/files/33712604?module_item_id=5523886) to work through the `\(k\)`-means algorithm by hand.** <style type="text/css"> .tg {border-collapse:collapse;border-spacing:0;} .tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; overflow:hidden;padding:10px 5px;word-break:normal;} .tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;} .tg .tg-baqh{text-align:center;vertical-align:top} .tg .tg-amwm{font-weight:bold;text-align:center;vertical-align:top} .tg .tg-0lax{text-align:left;vertical-align:top} </style> <table class="tg"> <thead> <tr> <th class="tg-amwm">Observation</th> <th class="tg-amwm">X1</th> <th class="tg-amwm">X2</th> </tr> </thead> <tbody> <tr> <td class="tg-baqh">1</td> <td class="tg-0lax">1.0</td> <td class="tg-0lax">1.0</td> </tr> <tr> <td class="tg-baqh">2</td> <td class="tg-0lax">1.5</td> <td class="tg-0lax">2.0</td> </tr> <tr> <td class="tg-baqh">3</td> <td class="tg-0lax">3.0</td> <td class="tg-0lax">4.0</td> </tr> <tr> <td class="tg-baqh">4</td> <td class="tg-0lax">5.0</td> <td class="tg-0lax">7.0</td> </tr> <tr> <td class="tg-baqh">5</td> <td class="tg-0lax">3.5</td> <td class="tg-0lax">5.0</td> </tr> <tr> <td class="tg-baqh">6</td> <td class="tg-0lax">4.5</td> <td class="tg-0lax">5.0</td> </tr> <tr> <td class="tg-baqh">7</td> <td class="tg-0lax">3.5</td> <td class="tg-0lax">4.5</td> </tr> </tbody> </table> .footnote[ <html> <hr> </html> **Solution:** Once you complete the handout, you can check your solution (starting from Saturday) by downloading [this file](https://miamioh.instructure.com/courses/223961/files/33712605?module_item_id=5523887).
] --- # Practical Issues with `\(k\)`-means Clustering .panelset[ .panel[.panel-name[Data] .font80[ ``` r penguins_tbl = palmerpenguins::penguins # our data for today penguins_tbl # printing it out ``` ``` ## # A tibble: 344 × 8 ## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## <fct> <fct> <dbl> <dbl> <int> <int> ## 1 Adelie Torgersen 39.1 18.7 181 3750 ## 2 Adelie Torgersen 39.5 17.4 186 3800 ## 3 Adelie Torgersen 40.3 18 195 3250 ## 4 Adelie Torgersen NA NA NA NA ## 5 Adelie Torgersen 36.7 19.3 193 3450 ## 6 Adelie Torgersen 39.3 20.6 190 3650 ## 7 Adelie Torgersen 38.9 17.8 181 3625 ## 8 Adelie Torgersen 39.2 19.6 195 4675 ## 9 Adelie Torgersen 34.1 18.1 193 3475 ## 10 Adelie Torgersen 42 20.2 190 4250 ## # ℹ 334 more rows ## # ℹ 2 more variables: sex <fct>, year <int> ``` ] ] .panel[.panel-name[Prep] .font80[ ``` r penguins_tbl = penguins_tbl |> # selecting relevant cols dplyr::select(species, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |> na.omit() |> # removing NAs dplyr::mutate_at(dplyr::vars(-species), scale) # scaling numeric variables penguins_tbl # printing it out ``` ``` ## # A tibble: 342 × 5 ## species bill_length_mm[,1] bill_depth_mm[,1] flipper_length_mm[,1] ## <fct> <dbl> <dbl> <dbl> ## 1 Adelie -0.883 0.784 -1.42 ## 2 Adelie -0.810 0.126 -1.06 ## 3 Adelie -0.663 0.430 -0.421 ## 4 Adelie -1.32 1.09 -0.563 ## 5 Adelie -0.847 1.75 -0.776 ## 6 Adelie -0.920 0.329 -1.42 ## 7 Adelie -0.865 1.24 -0.421 ## 8 Adelie -1.80 0.480 -0.563 ## 9 Adelie -0.352 1.54 -0.776 ## 10 Adelie -1.12 -0.0259 -1.06 ## # ℹ 332 more rows ## # ℹ 1 more variable: body_mass_g <dbl[,1]> ``` ] ] .panel[.panel-name[k-means (k=3)] ``` r # note: kmeans() starts from random centroids; call set.seed() first for reproducible results km_res = kmeans( x = penguins_tbl |> dplyr::select(-species), # input data with no label centers = 3) # k = 3 # tabulating the results with rows corresponding to true labels and the columns corresponding to clusters table(penguins_tbl$species, km_res$cluster) ``` ``` ## ## 1 2 3 ## Adelie 0 0 151 ## Chinstrap 0 1 67 ## Gentoo 66 57 0 ``` ] .panel[.panel-name[Optimal k] .pull-left[ .font70[ ``` r km_res_nbclust = NbClust::NbClust( data = penguins_tbl |> dplyr::select(-species), distance = "euclidean", min.nc = 2, max.nc = 10, method = "kmeans", index = "all") table(penguins_tbl$species, km_res_nbclust$Best.partition) ``` ] ] .pull-right[ .font60[ ``` ## *** : The Hubert index is a graphical method of determining the number of clusters. ## In the plot of Hubert index, we seek a significant knee that corresponds to a ## significant increase of the value of the measure i.e the significant peak in Hubert ## index second differences plot. ## ``` ``` ## *** : The D index is a graphical method of determining the number of clusters. ## In the plot of D index, we seek a significant knee (the significant peak in Dindex ## second differences plot) that corresponds to a significant increase of the value of ## the measure.
## ## ******************************************************************* ## * Among all indices: ## * 8 proposed 2 as the best number of clusters ## * 11 proposed 3 as the best number of clusters ## * 1 proposed 4 as the best number of clusters ## * 3 proposed 5 as the best number of clusters ## * 1 proposed 10 as the best number of clusters ## ## ***** Conclusion ***** ## ## * According to the majority rule, the best number of clusters is 3 ## ## ## ******************************************************************* ``` ``` ## ## 1 2 3 ## Adelie 8 0 143 ## Chinstrap 63 0 5 ## Gentoo 0 123 0 ``` ] ] ] .panel[.panel-name[Viz Clusters] .pull-left[ .font70[ ``` r factoextra::fviz_cluster( object = list( cluster = km_res_nbclust$Best.partition, data = penguins_tbl |> dplyr::select(-species) ), ellipse.type = "convex", palette = "jco", ggtheme = ggplot2::theme_minimal() ) ``` ] ] .pull-right[ .font70[ <img src="24_clustering_intro_files/figure-html/penguins6_out-1.png" style="display: block; margin: auto;" /> ] ] ] ] --- # Summary of Practical Issues - Rescale numeric data prior to `\(k\)`-means implementation (a minimal sketch of the first two options appears in the appendix slide at the end). The scaling can be: + a z-transformation similar to what we did in the example + a 0-1 (min-max) scaling + converting count data into percentages or rates per a fixed population size (e.g., cases per 100,000 people) + etc. - Use more than one metric to determine `\(k\)` when using `\(k\)`-means clustering - Your cluster solution is not the end result; you will need to: + visualize it in an appropriate way (simple representation as in the previous slide, [spatially](https://fmegahed.github.io/covid_analysis_final.html#33_Visualizing_the_Clustering_Results), [time-based](https://fmegahed.github.io/isa401/class23/23_data_mining_overview.html?panelset1=calendar-plot-of-clustered-data&panelset4=data3&panelset5=data4&panelset6=activity3&panelset7=activity4#10), etc.) + attempt to explain the cluster membership using an appropriate binomial/multinomial model (e.g., see [this analysis](https://fmegahed.github.io/covid_analysis_final.html#4_Explanatory_Modeling_of_Cluster_Assignments)) --- # `\(k\)`-means in Tableau Let us use Tableau to implement `\(k\)`-means clustering on the 60 sample observations from the penguins dataset shown in Slide 11 of this presentation. --- class: inverse, center, middle # Recap --- # Summary of Main Points - Describe the different steps of the `\(k\)`-means algorithm - Cluster using `\(k\)`-means (by hand) - Cluster using `\(k\)`-means (software) + R
+ Tableau
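--- # Appendix: Two Common Rescalings in R A minimal sketch of the first two rescalings listed on the Summary of Practical Issues slide, applied to a single numeric column; using the penguins' body-mass column here is an illustrative choice, not part of the earlier example:

``` r
x = na.omit(palmerpenguins::penguins$body_mass_g) # drop missing values first

z_scaled    = (x - mean(x)) / sd(x)            # z-transformation (what scale() does)
unit_scaled = (x - min(x)) / (max(x) - min(x)) # 0-1 (min-max) scaling

range(z_scaled)    # centered near 0, in standard-deviation units
range(unit_scaled) # exactly 0 to 1
```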