Clustering Algorithms
Clustering is the task of grouping data points into clusters such that points within a cluster are similar and points in different clusters are dissimilar.
Hierarchical Algorithms
Hierarchical clustering builds hierarchies or trees of clusters. It can be either:
Agglomerative Clustering (Bottom-Up)

Each data point starts in its own cluster.

Clusters are merged as we move up the hierarchy.

More practical than divisive clustering, although its time complexity (typically O(n²) or worse) still limits very large datasets.

Merging is based on the similarity between clusters using:

Single-linkage: Minimum distance between points in two clusters.

Complete-linkage: Maximum distance between points in two clusters.

Average linkage: Average distance between points in two clusters.


Produces a dendrogram showing the cluster hierarchy.
Steps:

Assign each point to its own cluster

Calculate the distance matrix

Merge the closest pair of clusters

Update the distance matrix

Repeat steps 3 and 4 until all points are in one cluster
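The steps above can be sketched in plain Python. This is an illustrative single-linkage example, not tied to any particular library; the naive all-pairs search is O(n³) overall and is written for clarity, not speed.

```python
from itertools import combinations

def single_linkage(points, k):
    """Naive agglomerative clustering with single linkage.
    Merges the two closest clusters until k clusters remain."""
    clusters = [[p] for p in points]          # step 1: each point is its own cluster
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    while len(clusters) > k:
        # steps 2-3: find and merge the closest pair of clusters
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(dist(p, q)
                               for p in clusters[ij[0]]
                               for q in clusters[ij[1]]))
        clusters[i] += clusters.pop(j)        # step 4: distances recomputed next loop
    return clusters

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(single_linkage(data, 2))
```

Stopping at k clusters corresponds to cutting the dendrogram at a chosen height; running the loop down to a single cluster reproduces the full hierarchy.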
Divisive Clustering (Top-Down)

All points start in one cluster.

Splits clusters as we move down the hierarchy.

Performs poorly for large datasets due to computational complexity.

Splits are based on measures such as within-cluster variance.
Steps:

Start with all points in one cluster

Split the cluster with the largest variance into two subclusters

Continue splitting clusters with the largest variance

Stop when each point is in its own cluster
Partitional Algorithms
Partitional clustering algorithms partition the data into a set of disjoint clusters.
K-Means Clustering

The most popular partitional clustering algorithm.

Assumes that clusters are spherical in shape with similar spread.
Steps:

Choose the number of clusters K

Randomly initialize K cluster centers

Assign each point to the nearest cluster centre

Recalculate the cluster centres

Repeat steps 3 and 4 until convergence.
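The four steps map directly onto Lloyd's algorithm. Below is a minimal pure-Python sketch for 2-D points; the data and the random-sample initialization are illustrative assumptions, and a real implementation would use a library routine with smarter seeding (e.g. k-means++).

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal Lloyd's algorithm on 2-D points (illustrative sketch only)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)                    # step 2: random initialization
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # step 3: assign each point to the nearest cluster centre
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p[0] - centres[i][0]) ** 2
                                                + (p[1] - centres[i][1]) ** 2)
            groups[nearest].append(p)
        # step 4: recalculate each centre as the mean of its assigned points
        new = [(sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
               if g else centres[i] for i, g in enumerate(groups)]
        if new == centres:                             # converged
            break
        centres = new
    return centres, groups

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, groups = kmeans(data, 2)
print(sorted(len(g) for g in groups))    # [3, 3]
```

Note that convergence is only to a local optimum, which is why the initialization sensitivity listed under the disadvantages matters in practice.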
Advantages:

Simple and easy to implement

Scales well to large datasets
Disadvantages:

Requires specifying the number of clusters K

Sensitive to initialization

Assumes spherical clusters
Expectation-Maximization (EM)

Assumes that data points are generated from a mixture of underlying probability distributions (e.g. Gaussian, Poisson).

Estimates model parameters using an iterative refinement technique.
Steps:

Initialize model parameters randomly

E-step: Calculate the probability that each point belongs to each cluster

M-step: Re-estimate model parameters to maximize the likelihood

Repeat the E and M steps until convergence.
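The E and M steps can be shown concretely for the simplest case: a two-component 1-D Gaussian mixture. This is an illustrative sketch; the data are made up, and initializing the means at the data extremes (rather than randomly, as in step 1) is a simplification to keep the example deterministic.

```python
import math

def em_gmm_1d(xs, iters=50):
    """E-M for a two-component 1-D Gaussian mixture (illustrative sketch).
    For simplicity the means are initialized at the data extremes."""
    mu = [min(xs), max(xs)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
    for _ in range(iters):
        # E-step: probability (responsibility) of each component for each point
        resp = []
        for x in xs:
            w = [pi[k] * pdf(x, mu[k], var[k]) for k in (0, 1)]
            total = sum(w)
            resp.append([wk / total for wk in w])
        # M-step: re-estimate mixing weights, means and variances
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, xs)) / nk + 1e-6
    return mu, var, pi

data = [0.1, 0.2, -0.1, 0.0, 5.0, 5.2, 4.9, 5.1]
mu, var, pi = em_gmm_1d(data)
print([round(m, 2) for m in mu])    # means converge near 0.05 and 5.05
```

Unlike k-means, the E-step makes soft assignments: each point contributes fractionally to every component.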
Advantages:

Can estimate the number of clusters when paired with a model-selection criterion (e.g. BIC).

Can handle non-spherical clusters.
Disadvantages:

Computationally expensive

Can get stuck in local optima
Clustering Large Databases
Traditional clustering algorithms do not scale well to large datasets due to their time and memory requirements. Various algorithms have been proposed to cluster large databases:
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)

Agglomerative clustering algorithm.

Builds a CF tree (Clustering Feature tree) to cluster large amounts of numeric data.

The CF-tree summarizes subclusters using clustering features: the number of points, the linear sum, and the sum of squared values.

Performs global clustering on the compact leaf entries of the CF-tree and then assigns the remaining points to the resulting clusters.
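The clustering-feature idea can be demonstrated in a few lines. This is a sketch of the CF triple only (made-up points, no tree structure): because a CF is three sums, two subclusters can be merged without revisiting their raw points, which is what makes BIRCH incremental.

```python
def cf(points):
    """Clustering feature (N, LS, SS) for a set of d-dimensional points:
    point count, per-dimension linear sum, and sum of squared values."""
    n = len(points)
    ls = [sum(p[d] for p in points) for d in range(len(points[0]))]
    ss = sum(x * x for p in points for x in p)
    return n, ls, ss

def merge(cf1, cf2):
    """CFs are additive, so subclusters merge without touching raw points."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, [a + b for a, b in zip(ls1, ls2)], ss1 + ss2

def centroid(feature):
    """The centroid is recoverable from the summary alone."""
    n, ls, _ = feature
    return [x / n for x in ls]

a = cf([(1.0, 2.0), (2.0, 2.0)])
b = cf([(3.0, 4.0)])
print(merge(a, b))               # (3, [6.0, 8.0], 38.0)
print(centroid(merge(a, b)))     # [2.0, 2.6666666666666665]
```

Cluster radius and diameter are likewise computable from (N, LS, SS), which is how the CF-tree decides whether a new point fits an existing leaf entry.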
Advantages:

Very efficient for large datasets

Can handle data streams
Disadvantage:
 Performance degrades for non-spherical clusters
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Density-based clustering algorithm.

Finds clusters based on areas of high density separated by areas of low density.

Uses two parameters: epsilon (the neighbourhood radius) and minPoints (the minimum number of points within that radius for a point to count as a core point).

Can identify noise points and handle clusters of arbitrary shape.
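A minimal sketch of the algorithm follows, using a naive O(n²) neighbourhood search (the data and parameter values are illustrative; real implementations use the spatial indexes discussed below):

```python
def dbscan(points, eps, min_pts):
    """Naive DBSCAN sketch: returns labels[i] = cluster id, or -1 for noise."""
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]
    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:         # not a core point: tentatively noise
            labels[i] = -1
            continue
        labels[i] = cid                  # start a new cluster from a core point
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid          # noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            nb = neighbours(j)
            if len(nb) >= min_pts:       # j is also core: expand through it
                queue.extend(nb)
        cid += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=2))  # [0, 0, 0, 1, 1, -1]
```

The isolated point at (50, 50) has too few neighbours to be a core point, so it is labelled noise; the two dense regions become clusters regardless of their shape.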
To scale to large datasets:

Use spatial indexes like R-trees to index the data

Only calculate density within each point's epsilon-neighbourhood rather than over all pairs of points
Advantage:
 Very efficient for clustering large spatial datasets
CURE (Clustering Using Representatives)

Agglomerative clustering algorithm.

Uses a representative set of points to summarize clusters.

The representatives are chosen as well-scattered points within each cluster and then shrunk towards the cluster centroid by a fixed fraction.
To scale to large datasets:
 Use a small representative set of points instead of all points.
Advantage:
 Can handle clusters of arbitrary shape and density.
Evaluation of Clusters
Various metrics can be used to evaluate the quality of clusters produced by clustering algorithms:
Silhouette Coefficient

Measures how closely points are grouped within a cluster compared to other clusters.

Calculates a silhouette score for each data point which ranges from −1 to +1.

A score near +1 indicates the point is well-matched to its own cluster and poorly matched to neighbouring clusters.

A score near −1 indicates the point is probably assigned to the wrong cluster.

The average of all silhouette scores is the Silhouette Coefficient for the overall clustering.

A high Silhouette Coefficient indicates a good clustering.
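The definition can be computed directly: for each point, a is the mean distance to its own cluster and b is the mean distance to the nearest other cluster, and the score is (b − a) / max(a, b). The sketch below (with made-up points) averages these per-point scores:

```python
def silhouette(points, labels):
    """Mean silhouette score over all points (pure-Python sketch)."""
    def d(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q is not p]
        if not own:                       # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        a = sum(d(p, q) for q in own) / len(own)         # mean intra-cluster distance
        b = min(sum(d(p, q) for q in qs) / len(qs)       # nearest other cluster
                for m, qs in clusters.items() if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(silhouette(pts, [0, 0, 1, 1]))     # close to +1: tight, well-separated clusters
```

Swapping two labels in the example drives the score negative, which is exactly the "wrong cluster" signal described above.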
Davies-Bouldin Index

Measures the ratio of within-cluster distances to between-cluster distances.

Lower values of the index indicate better clustering.

Calculated as the average similarity measure between each cluster and its most similar cluster.

Compact clusters that are well separated give a lower index.
Formula:
DB = (1/n) ∑ (i=1 to n) max (j≠i) { (Ri + Rj) / d(Ci, Cj) }
Where:

Ri is the within-cluster distance for cluster i

d(Ci, Cj) is the distance between clusters i and j
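The formula translates directly to code. In this sketch (made-up data; Ri taken as the mean distance of cluster i's points to their centroid), tightening the clusters or pushing them apart lowers the index:

```python
def davies_bouldin(clusters):
    """DB index: clusters is a list of lists of points."""
    def d(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    cents = [tuple(sum(x) / len(c) for x in zip(*c)) for c in clusters]
    # R_i: mean distance of cluster i's points to its centroid
    r = [sum(d(p, cents[i]) for p in c) / len(c) for i, c in enumerate(clusters)]
    n = len(clusters)
    total = 0.0
    for i in range(n):
        # worst-case (most similar) partner cluster for cluster i
        total += max((r[i] + r[j]) / d(cents[i], cents[j])
                     for j in range(n) if j != i)
    return total / n

tight = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
loose = [[(0, 0), (0, 5)], [(6, 0), (6, 5)]]
print(davies_bouldin(tight))   # compact, well-separated: low index
print(davies_bouldin(loose))   # spread-out, closer clusters: higher index
```
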
Association Rule Mining
Association rule mining aims to find interesting associations or correlation relationships among items in a transaction database.
For example, in a supermarket transaction database:

80% of customers who buy milk also buy bread.

60% of customers who buy nappies also buy beer.
An association rule has two parts:

Antecedent (if)

Consequent (then)
For example:

{milk} → {bread}

{nappies} → {beer}
Two important metrics for association rules:

Support: how often the items in the rule occur together in the dataset. Measures importance.

Confidence: how often the items in the consequent appear in transactions that contain the antecedent. Measures reliability.
For example:

Support = 20% (out of all transactions, 20% contain both milk and bread)

Confidence = 80% (out of transactions with milk, 80% also contain bread)
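Both metrics are simple ratios over the transaction database, as this small sketch with a made-up five-transaction dataset shows:

```python
transactions = [
    {"milk", "bread"}, {"milk", "bread", "eggs"},
    {"milk"}, {"bread"}, {"milk", "bread", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"milk", "bread"}))          # 0.6 (3 of 5 transactions)
print(confidence({"milk"}, {"bread"}))     # 0.75 (3 of the 4 milk transactions)
```
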
Popular algorithms:

Apriori: generates candidate itemsets and then scans the database to determine frequent itemsets.

FP-Growth: builds a frequent-pattern tree and mines it to find frequent itemsets.
To scale to large datasets:

Use parallel and distributed algorithms that partition the dataset across multiple machines.

Apriori and FP-Growth can both be parallelized.
Apriori Algorithm
The Apriori algorithm is a popular method for finding frequent itemsets and generating association rules.
How it works

It makes multiple passes over the transaction database.

In the first pass, it counts item occurrences to find frequent 1-itemsets.

In subsequent passes, it generates candidate k-itemsets from the frequent (k−1)-itemsets found in the previous pass.

It then prunes the candidates that have any subset which is not frequent.

Finally, it scans the database to determine which of the remaining candidates are frequent.

This process repeats until no further frequent k-itemsets are found.
Pseudocode

Find all frequent 1-itemsets

Join frequent k-itemsets to generate candidate (k+1)-itemsets

Prune (k+1)-itemsets whose k-subsets are infrequent

Scan the database and calculate support for candidates

Return all frequent (k+1)-itemsets
Ck: Candidate itemset of size k
Lk: frequent itemset of size k
L1 = {frequent items}
for (k = 1; Lk != ∅; k++) do begin
    C(k+1) = candidates generated from Lk
    for each transaction t in the database do
        increment the count of all candidates in C(k+1) that are contained in t
    L(k+1) = candidates in C(k+1) with min_support
end
return ∪k Lk
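The pseudocode can be turned into a compact runnable sketch. This pure-Python version (with a made-up four-transaction dataset) keeps the join and prune steps explicit; it is illustrative only and makes no attempt at the counting optimizations a real implementation would use:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Plain-Python Apriori: returns all frequent itemsets as frozensets."""
    n = len(transactions)
    def frequent(candidates):
        return {c for c in candidates
                if sum(c <= t for t in transactions) / n >= min_support}
    # L1: frequent 1-itemsets
    items = {frozenset([i]) for t in transactions for i in t}
    lk = frequent(items)
    result = set(lk)
    k = 1
    while lk:
        # join step: candidates of size k+1 from pairs of frequent k-itemsets
        candidates = {a | b for a in lk for b in lk if len(a | b) == k + 1}
        # prune step: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in lk for s in combinations(c, k))}
        lk = frequent(candidates)
        result |= lk
        k += 1
    return result

tx = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"milk"}, {"bread", "eggs"}]
print(sorted(sorted(s) for s in apriori(tx, 0.5)))
```

With min_support = 0.5, {milk, eggs} appears in only one of four transactions and is eliminated, which in turn prunes the candidate {milk, bread, eggs} without ever counting it against the database.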
Pros and cons
Pros:

Simple and easytounderstand algorithm

Guarantees to find all frequent itemsets
Cons:

Multiple passes over the dataset required

Generates a lot of candidate sets, especially for large itemsets

Performance degrades with lower minimum support
FP-Growth Algorithm
The FP-growth algorithm is an efficient method for mining frequent itemsets without candidate generation.
How it works

FP-growth works by constructing a frequent-pattern tree (FP-tree), which stores transaction data in a compressed form.

The root of the tree is labelled null; each node below it represents an item and stores a count of the transactions sharing that prefix path.

The FP-tree is constructed by first scanning the database to find frequent items. Transactions are then inserted into the tree.

Pattern growth is performed by constructing a conditional FP-tree for each frequent item, built from the prefix paths of the transactions containing that item.

Frequent itemsets are generated by mining each conditional FP-tree. The itemsets found are combined with the conditional item to generate all frequent itemsets.
Pseudocode

Scan the database to find frequent items and sort them in descending order by frequency

Construct the FP-tree by inserting transactions

Call FP-growth on the FP-tree to generate frequent itemsets

FP-growth works as follows:

If the FP-tree contains a single path, generate itemsets from the path

Otherwise, recursively generate itemsets from conditional FP-trees

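The tree-construction half of the algorithm is short enough to sketch. This illustrative pure-Python version (made-up transactions; the mining recursion is omitted) shows the compression: frequency-sorted transactions share common prefixes, so each shared prefix is stored once with a count:

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_count):
    """First pass: count items. Second pass: insert each transaction's
    frequent items in descending global frequency order."""
    freq = Counter(i for t in transactions for i in t)
    order = {i: rank for rank, (i, c) in enumerate(freq.most_common())
             if c >= min_count}
    root = Node(None, None)
    for t in transactions:
        node = root
        # infrequent items are dropped; the rest follow the global order
        for item in sorted((i for i in t if i in order), key=order.get):
            if item in node.children:
                node.children[item].count += 1   # shared prefix: just bump count
            else:
                node.children[item] = Node(item, node)
            node = node.children[item]
    return root

tx = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"milk"}, {"bread", "eggs"}]
root = build_fp_tree(tx, 2)
# the most frequent item heads one branch covering three transactions
print(sorted(child.count for child in root.children.values()))   # [1, 3]
```

A full implementation would also keep header-table links between nodes of the same item, which is what makes extracting the prefix paths for each conditional FP-tree cheap.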
Pros and cons
Pros:

Efficient for mining frequent itemsets without candidate generation

Uses less memory than Apriori

Scales well for large datasets
Cons:

Requires sorting of items by frequency before tree construction

Performance degrades with low minimum support
Parallel and Distributed Association Rule Mining
With the advent of big data, traditional association rule mining algorithms like Apriori and FP-growth face scalability challenges. Parallel and distributed techniques are needed to mine large datasets efficiently.
MapReduce-based Apriori

MapReduce can be used to parallelize the Apriori algorithm.

In the map phase, transactions are partitioned among map tasks. Each map task finds local frequent 1-itemsets.

The reduce phase aggregates the local frequent itemsets from all maps to find global frequent 1-itemsets.

This process is repeated to find higher-order frequent itemsets.

The candidate generation and frequency counting steps of Apriori are well suited for the map and reduce functions.
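The first round (frequent 1-itemsets) can be sketched as plain map and reduce functions; this is a single-process illustration with made-up partitions, standing in for what a framework like Hadoop or Spark would distribute across machines:

```python
from collections import Counter

# map phase: each partition of transactions counts its own items locally
def map_count(partition):
    return Counter(i for t in partition for i in t)

# reduce phase: merge the local counts and keep globally frequent items
def reduce_counts(local_counts, min_count):
    total = Counter()
    for counts in local_counts:
        total.update(counts)
    return {item for item, n in total.items() if n >= min_count}

partitions = [
    [{"milk", "bread"}, {"milk"}],
    [{"bread", "eggs"}, {"milk", "bread"}],
]
print(reduce_counts([map_count(p) for p in partitions], min_count=2))
```

Note that an item can be infrequent in every partition yet frequent globally, which is why the reduce step must merge raw counts rather than each partition's frequent sets alone.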
Distributed FP-Growth

The FP-tree structure can be partitioned and distributed across nodes.

Each node constructs a local FP-tree from its partition of the dataset.

Local frequent itemsets are found by mining each local FP-tree.

The local frequent itemsets are aggregated to find global frequent itemsets.

The conditional FP-trees can also be distributed in the same way.

This approach scales well for large datasets since the FP-tree is divided and mined in parallel.