Clustering Algorithm

Updated May 04, 2023 ·

Overview

Clustering uses machine learning algorithms to group data into categories, or clusters. This technique reveals patterns in complex datasets, aiding in tasks such as customer segmentation, image categorization, and behavior analysis.

Clustering is also a type of Unsupervised Learning.

Supervised Learning requires data with labels and features
Unsupervised Learning only uses features

Unsupervised Learning's features makes clustering especially useful when detailed information about the dataset is not available.

Example: Identifying New Flower Species

Scenario: You are a botanist exploring a new island and you discover several types of flowers you haven’t seen before. You want to determine how many new species you’ve found and how to classify them.

Start by defining features. With measurements from over 100 flowers, you can use these as features for the clustering model. This is an unsupervised problem because you have features but lack labels or information on the number of species.
- Flower colors
- Petal length and width
- Sepal length and width
- Number of petals
Specify the number of clusters. This decision affects how the algorithm segments your data. By plotting features like petal width, sepal length, and number of petals, you can visualize how different cluster counts affect the results.

The algorithm’s results may vary with different cluster numbers:

Two clusters: May indicate one new species.
Three clusters: Adds more segmentation.
Four and eight clusters: Eight might be excessive, as it may not clearly define distinct groups.

Clustering helps in making informed decisions but doesn’t specify the exact number of clusters. In our case, three or four clusters may be ideal for identifying new species. We can also leverage your botanical knowledge, such as understanding petal width variance, to improve clustering results.

Overview​

Example: Identifying New Flower Species​

Overview

Example: Identifying New Flower Species