Hands-On with MultiDendrograms: Step-by-Step Examples and Code### Introduction
MultiDendrograms are a variation of hierarchical clustering that handle ties in distance (or similarity) by allowing simultaneous agglomeration of more than two clusters when equal distances occur. This avoids arbitrary tie-breaking and produces clusterings that reflect the true structure in data where ties are present. This article gives an intuitive explanation, shows when and why they matter, walks through algorithms and math, and provides step-by-step examples with code (Python) to build and visualize MultiDendrograms.
Why MultiDendrograms?
Hierarchical agglomerative clustering (HAC) typically merges the two closest clusters at each step, producing a binary tree (dendrogram). However, in many datasets — especially with discrete or coarsely measured distances — multiple pairs of clusters can share the same minimum distance. Standard HAC forces an arbitrary ordering or tie-breaking, which can lead to inconsistent or misleading dendrograms. MultiDendrograms address this by allowing simultaneous merges of all clusters tied at the same distance, creating multifurcating (non-binary) trees that more faithfully represent the equivalence among clusters.
Key benefit: MultiDendrograms avoid arbitrary tie-breaking and reflect true simultaneous merges.
The algorithm (conceptual)
- Compute the pairwise distance matrix for n items.
- At each iteration:
- Find the minimum distance d among all current inter-cluster distances.
- Identify all pairs of clusters whose distance equals d.
- Compute the connected components of the graph formed by linking clusters with edges for each such pair; each connected component becomes a single merged cluster (this can merge more than two clusters at once).
- Update the distance matrix using the chosen linkage criterion (single, complete, average, Ward, etc.), computing distances between the new merged clusters and remaining clusters.
- Repeat until a single cluster remains.
This procedure produces merges at particular heights; when a merge involves k>2 clusters, the dendrogram shows a multifurcation at that height.
Linkage and tie handling
MultiDendrograms can be combined with any linkage rule. However, the update formula must be applied to the multifurcating merge properly—typically by computing the new cluster’s distances as weighted combinations of its members according to the linkage. For example:
- Single linkage: distance between merged cluster C and cluster X is min_{i in C} d(i, X).
- Complete linkage: max_{i in C} d(i, X).
- Average linkage (UPGMA): average of pairwise distances between members.
- Ward: based on increase in within-cluster variance — requires cluster sizes and centroids.
Using average linkage in the presence of simultaneous merges requires careful averaging across all constituent elements.
Example dataset
We’ll use a small toy dataset with clear ties to demonstrate a case where MultiDendrograms differ from standard HAC.
Data points (1D):
- A: 0
- B: 0
- C: 1
- D: 1
- E: 3
Pairwise distances (absolute difference) produce ties: d(A,B)=0, d(C,D)=0, d(B,C)=1, d(A,C)=1, etc.
Python implementation — step-by-step
Below is a minimal Python implementation of MultiDendrograms using average linkage. It is educational and not optimized for very large datasets.
# multi_dendrogram.py import numpy as np from collections import defaultdict, deque def pairwise_distances(points): n = len(points) D = np.zeros((n, n), dtype=float) for i in range(n): for j in range(i+1, n): d = abs(points[i] - points[j]) D[i, j] = D[j, i] = d return D def find_min_pairs(D, active): # find minimum positive distance among active clusters min_d = np.inf pairs = [] for i in active: for j in active: if j <= i: continue d = D[i, j] if d < min_d: min_d = d pairs = [(i, j)] elif d == min_d: pairs.append((i, j)) return min_d, pairs def connected_components_from_pairs(pairs, active): # build adjacency and extract connected components adj = defaultdict(set) for i in active: adj[i] # ensure node present for i, j in pairs: adj[i].add(j); adj[j].add(i) seen = set() components = [] for node in active: if node in seen: continue # BFS comp = [] q = deque([node]) seen.add(node) while q: u = q.popleft() comp.append(u) for v in adj[u]: if v not in seen: seen.add(v) q.append(v) components.append(sorted(comp)) return components def average_linkage_update(D, members, cluster_map): # cluster_map: map cluster_id -> list of original indices # compute new distances between merged cluster and others by averaging newD = D.copy() clusters = list(cluster_map.keys()) for i in clusters: for j in clusters: if j <= i: continue mi, mj = cluster_map[i], cluster_map[j] # compute average pairwise distance s = 0.0; cnt = 0 for a in mi: for b in mj: s += D[a, b] cnt += 1 newD[i, j] = newD[j, i] = s / cnt if cnt else 0.0 return newD def multi_dendrogram(points): n = len(points) D = pairwise_distances(points) # initial clusters: each index maps to itself's list cluster_map = {i: [i] for i in range(n)} active = set(range(n)) history = [] # list of (components_merged, height, members) next_cluster_id = n while len(active) > 1: min_d, pairs = find_min_pairs(D, active) comps = connected_components_from_pairs(pairs, active) # Only keep components that actually have an edge (size>1) merging = [c for c in comps if len(c) > 1] if not merging: # no ties? fallback to single minimal pair (shouldn't happen) i, j = pairs[0] merging = [[i, j]] for comp in merging: # create new cluster members = [] for cid in comp: members.extend(cluster_map[cid]) active.remove(cid) del cluster_map[cid] cluster_map[next_cluster_id] = sorted(members) history.append((comp, min_d, sorted(members))) active.add(next_cluster_id) next_cluster_id += 1 # rebuild distance matrix using average linkage among active cluster ids D = average_linkage_update(D, points, cluster_map) return history if __name__ == "__main__": pts = [0, 0, 1, 1, 3] # A,B,C,D,E hist = multi_dendrogram(pts) for step in hist: print("merged clusters", step[0], "at height", step[1], "members", step[2])
Visualizing MultiDendrograms
To visualize multifurcating trees you can adapt matplotlib’s dendrogram function or draw custom trees. Scipy’s dendrogram assumes binary merges, so you must convert multifurcations into a plotting structure that supports polytomies, or use network/graph plotting.
Here’s a simple matplotlib approach to draw merges as horizontal lines and vertical connectors, handling multifurcations by connecting multiple leaves at the same height.
# visualize_multi.py import matplotlib.pyplot as plt def plot_multi_dendrogram(history, labels): # history: list of (comp_ids, height, members) # labels: original labels in order of indices n = len(labels) # compute x positions for original leaves x_pos = {i: i for i in range(n)} cluster_x = {} cluster_height = {} for step in history: comp_ids, height, members = step # x positions are average of member leaves xs = [x_pos[m] for m in members] x_c = sum(xs) / len(xs) cid = tuple(members) # use members tuple as key cluster_x[cid] = x_c cluster_height[cid] = height # draw horizontal line spanning min/max of xs at y=height plt.hlines(y=height, xmin=min(xs), xmax=max(xs), colors='k') # draw vertical lines from each leaf/child to the horizontal for m in members: y0 = 0 if m < n else cluster_height.get(tuple([m]), 0) x0 = x_pos[m] plt.vlines(x=x0, ymin=y0, ymax=height, colors='k') # set new x position for merged cluster to allow further merges new_idx = max(x_pos.keys()) + 1 x_pos[new_idx] = x_c # plot leaves for i, lab in enumerate(labels): plt.text(i, -0.05 * max(cluster_height.values()), lab, ha='center') plt.ylim(-0.1 * max(cluster_height.values()), max(cluster_height.values()) * 1.1) plt.show()
Worked example interpretation
Using the toy data A,B (0), C,D (1), E (3):
- A and B merge at height 0.
- C and D merge at height 0.
- If A/B and C/D are at equal distance to each other (depending on linkage), MultiDendrograms may merge all four simultaneously at height 1, producing a 4-way merge rather than a sequence of binary merges.
This highlights why enforcing binary merges can hide simultaneous equivalences.
When to use MultiDendrograms
- Data with discrete or coarse distances (categorical data with simple matching distances).
- When ties are common and tie-breaking would be arbitrary.
- When interpretability requires reflecting simultaneous relationships (phylogenetics with polytomies, social network clustering with equal strengths).
Limitations and considerations
- Implementations are more complex than standard hierarchical clustering.
- Visualization tools often assume binary trees; custom plotting is needed for polytomies.
- Linkage calculations must correctly aggregate distances across multiple merged clusters.
Further reading and libraries
- Look for the original MultiDendrograms paper by Fernández and colleagues for formal definitions and proofs.
- Some specialized implementations or packages may exist; check for updates in clustering libraries.
Leave a Reply