Every developer who has shipped a news or social clustering feature knows the pain: clusters are easy, stable identities are hard. One minute an event looks clean; five minutes later, a bridging article lands and two clusters snap together. Or a weak connection evaporates and a cluster fractures. Users don’t care about the algorithm—they care that the event ID they bookmarked yesterday still resolves to the same story today.


What’s on the table (quick facts)

  • A news ingestion system clusters incoming articles into events powering maps and graph views.
  • Pipeline today: embeddings → cosine hierarchical agglomerative clustering (HAC) at a fixed threshold → periodic recluster every ~5 minutes.
  • Granularity, time-decay, and summarization are already tuned; the pain point is stable event identity under streaming updates.
  • As new articles arrive, clusters should sometimes merge (a real bridge appears) or split (a spurious bridge fades).
  • Goal: persistent user-facing event IDs across merges/splits to minimize label churn between snapshots.
  • Constraint: respect hierarchical/threshold logic and keep latency low.
  • Ask: What algorithmic approach (and open-source references) best supports evolutionary/streaming hierarchical clustering with persistent labels and explicit merge/split handling?

Why this is hard (and why it matters)

Batch HAC is deterministic: run it, cut at a threshold, and you’re done. Streaming news is not batch—it’s a moving target. A single new article can create a path that collapses two clusters (single-link behavior), while time decay can dissolve that same path and split them later. If event IDs reshuffle every snapshot, downstream UX breaks: bookmarks, notifications, deduplication, and analytics all get noisy.

At AI Tech Inspire, we’ve seen teams succeed by separating two concerns:

  • Event detection (which clusters exist right now?)
  • Identity continuity (which current clusters inherit which prior IDs?)

Keeping these separate lets you upgrade your clustering method without rewriting your identity logic.

Three practical approaches that play well in streaming

Below are patterns used in production pipelines that need merge/split awareness and sticky IDs.

1) Microclusters + periodic hierarchical reclustering

Instead of reclustering all articles every 5 minutes, maintain microclusters (compact summaries) online and run HAC only on those summaries periodically. This adds temporal inertia and reduces flip-flops.

  • How: Maintain a CF-tree (cluster feature tree) or density-based microclusters with decay. New articles update the nearest microcluster; stale microclusters fade out.
  • Why it helps: Microclusters act as memory. Spurious single-link bridges are less likely to immediately merge entire events; splits/merges manifest first at the microcluster level, which is less jittery.
  • Refs: BIRCH (available in scikit-learn), DenStream/DBSTREAM (implementations in River and scikit-multiflow).

You can still cut a hierarchy at a cosine threshold—just run it on microcluster centroids/weights instead of raw articles.
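
To make that concrete, here is a minimal sketch using scikit-learn's Birch and SciPy's hierarchy tools; the dimensions, thresholds, and random stand-in embeddings are illustrative, not tuned values:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.cluster import Birch

# Stand-in for L2-normalized article embeddings (e.g., from a sentence model).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 384))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Online step: absorb new articles into CF-tree microclusters.
# Note: this threshold is the BIRCH subcluster radius, not your HAC cut.
mc = Birch(threshold=0.3, n_clusters=None)
mc.partial_fit(X)

# Periodic step: HAC over the much smaller set of microcluster centroids,
# cut at the same cosine threshold you previously applied to raw articles.
centroids = mc.subcluster_centers_
Z = linkage(pdist(centroids, metric="cosine"), method="average")
event_labels = fcluster(Z, t=0.35, criterion="distance")
```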

2) Evolutionary clustering (temporal smoothness)

Evolutionary clustering formalizes the idea of "don't change the partition too fast." At each time step t, optimize a combined objective:

F_t = α · SnapshotCost(C_t | X_t) + (1 − α) · DriftCost(C_t, C_{t−1})

  • Snapshot cost: your normal clustering criterion (e.g., HAC threshold consistency).
  • Drift cost: penalizes deviation from the previous clustering to reduce churn.
  • Refs: evolutionary clustering (Chakrabarti et al., KDD ’06) and evolutionary spectral clustering (Chi et al., KDD ’07); FacetNet (Lin et al., WWW ’08) for overlapping communities; dynamic community detection libraries like CDlib and tnetwork offer trackers and evaluation tools.

In practice, you can preserve your HAC mindset by first building a similarity graph (articles or microclusters as nodes, edges above threshold), then running a dynamic community detection algorithm with a temporal smoothness parameter α.
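
One lightweight way to get that smoothness while keeping the HAC machinery is to blend similarity matrices across snapshots before cutting the hierarchy. A minimal sketch, assuming both snapshots share a node set (which microclusters make realistic); the function name and defaults are ours, not from any library:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def evolutionary_labels(S_t, S_prev, alpha=0.8, cut=0.35):
    """Cluster snapshot t with inertia toward t-1.

    S_t, S_prev: symmetric cosine-similarity matrices over the same nodes.
    alpha: weight on the current snapshot; 1.0 disables smoothing entirely.
    """
    S = alpha * S_t + (1.0 - alpha) * S_prev   # drift-penalized similarity
    D = 1.0 - S                                # convert to cosine distance
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=cut, criterion="distance")
```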

3) Graph connectivity with hysteresis

For single-link style pipelines, treat the system as a dynamic graph and define events as connected components at a fixed edge threshold. Handle updates via:

  • Merges: new high-similarity edges union two components (cheap via union-find).
  • Splits: remove expired edges and recompute components; to reduce thrashing, require edges to be present for k snapshots before merging (hysteresis) and/or absent for k snapshots before splitting (see the sketch after this list).
  • Refs: Fully dynamic connectivity is a classic problem; if you need scale, specialized structures (e.g., Holm–de Lichtenberg–Thorup) exist in literature, but many teams simply use snapshot recomputation with caching, thanks to the 5-minute cadence.
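
A minimal sketch of the merge-side hysteresis over union-find; K, the edge bookkeeping, and the immediate split-side reset are illustrative choices, not a canonical API:

```python
from collections import defaultdict

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

K = 2                      # snapshots an edge must persist before it can merge
streak = defaultdict(int)  # consecutive presence count per canonical edge

def components(nodes, edges):
    """Recompute components for one snapshot; only 'stable' edges may merge.
    edges: iterable of (u, v) pairs whose similarity cleared the threshold."""
    present = {tuple(sorted(e)) for e in edges}
    for e in list(streak):        # an absent edge resets its streak; a
        if e not in present:      # split-side cooldown would track absence
            del streak[e]         # counts the same way
    uf = UnionFind()
    for n in nodes:
        uf.find(n)
    for u, v in present:
        streak[(u, v)] += 1
        if streak[(u, v)] >= K:
            uf.union(u, v)
    return {n: uf.find(n) for n in nodes}
```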

Persistent IDs: treat it as an optimization problem

Regardless of clustering method, the ID assignment between snapshots is where the churn dies (or lives). Model it explicitly.

Step A: Build an inter-snapshot overlap graph

  • Let Prev = {A_i} be clusters at time t−1 and Curr = {B_j} at time t.
  • Define weights w(i, j) = overlap(A_i, B_j). Practical choices: Jaccard on member sets; weighted Jaccard on article IDs with time decay; or centroid cosine if sets are large (a sketch follows).
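
A minimal sketch of the overlap computation with plain Jaccard; the helper names are ours, and weighted or decayed variants slot in the same way:

```python
def jaccard(a, b):
    """Plain Jaccard on article-ID sets; swap in a weighted/decayed variant."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def overlap_weights(prev, curr):
    """prev, curr: {cluster_id: set(article_ids)} at t-1 and t.
    Returns {(i, j): w} for cluster pairs with nonzero overlap."""
    w = {}
    for i, A in prev.items():
        for j, B in curr.items():
            s = jaccard(A, B)
            if s > 0.0:
                w[(i, j)] = s
    return w
```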

Step B: Solve a min-cost mapping with merge/split penalties

Map clusters to preserve continuity while allowing many-to-one (merges) and one-to-many (splits). A simple, fast strategy (sketched in code after this list):

  • Primary assignment: one-to-one maximum weight matching (Hungarian) for the bulk of clusters.
  • Merge handling: if multiple A_i map strongly to a single B_j, pick a canonical ID (e.g., min(article_timestamp_of_seed)) to survive; store redirects from other IDs.
  • Split handling: the largest B_j inheriting from A_i keeps the old ID; other fragments get new IDs with lineage parent=A_i.
  • Penalties: add a fixed cost for creating a new ID and another for abandoning an old one. Tune these to match your tolerance for churn vs. drift.
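
A minimal sketch of the primary assignment using SciPy's linear_sum_assignment (the Hungarian step); the new-ID floor stands in for the churn penalty, and the merge/split rules above run as post-processing on the result:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_ids(prev_ids, curr_ids, w, new_id_floor=0.2):
    """prev_ids, curr_ids: lists of cluster ids; w: {(i, j): overlap}.
    Returns {curr_id: inherited prev_id, or None to mint a fresh ID}.
    new_id_floor plays the 'cost of creating a new ID': matches weaker
    than this are rejected even if the matching proposes them."""
    W = np.zeros((len(prev_ids), len(curr_ids)))
    pi = {p: r for r, p in enumerate(prev_ids)}
    ci = {c: r for r, c in enumerate(curr_ids)}
    for (i, j), v in w.items():
        W[pi[i], ci[j]] = v
    rows, cols = linear_sum_assignment(-W)   # maximize total overlap
    mapping = {c: None for c in curr_ids}    # None -> mint a fresh ID
    for r, c in zip(rows, cols):
        if W[r, c] >= new_id_floor:
            mapping[curr_ids[c]] = prev_ids[r]
    return mapping
```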

For higher fidelity, cast it as optimal transport (OT): distribute each previous cluster’s “mass” to current clusters with a cost matrix C = 1 − w(i, j). Solve with entropic OT (Sinkhorn) and recover a flow; then derive ID transfers from dominant flows per cluster. Libraries: POT (Python Optimal Transport). Many production systems stick to greedy overlap + thresholds—it’s fast and works.
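
For the OT variant, a minimal sketch with POT; the regularization value and the dominant-flow rule are illustrative choices:

```python
import numpy as np
import ot  # POT: pip install pot

def ot_id_flows(prev_sizes, curr_sizes, W, reg=0.05):
    """prev_sizes, curr_sizes: cluster masses (e.g., article counts).
    W: overlap matrix, shape (n_prev, n_curr). Returns, for each current
    cluster, the index of the previous cluster sending it the most mass."""
    a = np.asarray(prev_sizes, dtype=float)
    b = np.asarray(curr_sizes, dtype=float)
    a, b = a / a.sum(), b / b.sum()          # Sinkhorn expects distributions
    M = 1.0 - W                              # cost = 1 - overlap
    T = ot.sinkhorn(a, b, M, reg)            # entropic transport plan
    return T.argmax(axis=0)                  # dominant inflow per B_j
```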

Key takeaway: Separate “which clusters exist” from “which IDs persist,” and make ID persistence an explicit optimization with penalties for churn.

Operational heuristics that reduce jitter

  • Inertia windows: require an overlap above θ for two snapshots before committing to an ID transfer; use a cooldown before splitting.
  • Alias history: keep ID → canonical_ID redirects for at least 24–48 hours so clients can catch up without 404s.
  • Canonical tie-breaks: for merges, pick the ID with the earliest seed article or the largest cumulative weight; consistency beats cleverness.
  • Message schema: emit events with fields {id, parent_id, merge_of, split_from, version} so downstream systems can render lineage-aware UIs (an example payload follows).
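
An illustrative payload for that schema; the ID format and values are hypothetical:

```python
# Illustrative event-update payload only; field values are hypothetical.
event_update = {
    "id": "evt_8f3a",            # canonical, user-facing event ID
    "parent_id": None,           # lineage pointer for split fragments
    "merge_of": ["evt_11c0"],    # IDs absorbed into this event (keep redirects)
    "split_from": None,          # source event when a fragment minted a new ID
    "version": 42,               # bump on every identity-affecting change
}
```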

Implementation sketch

For embeddings, most teams use Hugging Face sentence models and ANN backends like Faiss. For training or custom encoders, PyTorch is a natural fit. A pragmatic stack for the clustering and identity layer:

  • Similarity graph: maintain edges over a sliding window; compute neighbors per insert via ANN; decay edges with timestamps.
  • Clustering: microclusters with BIRCH/DenStream or dynamic community detection via CDlib (Leiden/Louvain variants have incremental flavors).
  • ID mapping: compute overlaps, run Hungarian for base matches, then apply merge/split rules; for OT, use POT.
  • Evaluation: measure churn as the fraction of items whose event ID changed between t−1 and t after mapping (see the sketch below); also track ARI or VI between partitions for sanity.
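
A minimal sketch of the churn metric; the dictionary shapes are assumptions about your bookkeeping, not a fixed interface:

```python
def churn(prev_assign, curr_assign, redirects):
    """Fraction of shared articles whose event ID changed after mapping.
    prev_assign, curr_assign: {article_id: event_id} at t-1 and t;
    redirects: {old_event_id: canonical_event_id} from the ID mapper."""
    shared = prev_assign.keys() & curr_assign.keys()
    if not shared:
        return 0.0
    moved = sum(
        redirects.get(prev_assign[a], prev_assign[a]) != curr_assign[a]
        for a in shared
    )
    return moved / len(shared)
```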

One subtle optimization: if the similarity graph's CSR adjacency and the cluster memberships haven't changed much, you can cache overlaps for stable pairs and only recompute affected neighborhoods—handy when working under a 5-minute SLA.

Putting it together: a merge/split-aware playbook

  • Stay with HAC? Add hysteresis and identity mapping. Represent clusters as connected components over an edge-thresholded graph; use overlap-based mapping plus deterministic canonical IDs.
  • Need less jitter? Insert microclusters. Update online; recluster microclusters periodically; it’s the sweet spot for stability vs. responsiveness.
  • Want formal guarantees on smoothness? Adopt evolutionary clustering. Tune α to balance snapshot fit and temporal continuity.

None of these approaches prevent merges/splits—nor should they. They make merges/splits interpretable and identities predictable.

Open-source references worth exploring

  • scikit-learn: Birch for CF-tree microclusters
  • River and scikit-multiflow: DenStream/DBSTREAM streaming clusterers
  • CDlib and tnetwork: dynamic community detection, trackers, and evaluation tools
  • POT (Python Optimal Transport): entropic OT via Sinkhorn
  • Faiss: ANN search for building the similarity graph
  • Hugging Face sentence models and PyTorch: the embedding layer

For developers building news maps, alerting, or knowledge graphs, the message is simple: treat identity persistence as a first-class objective. Whether you add microclusters, adopt evolutionary clustering, or layer a min-cost ID mapping atop HAC, the outcome is the same—less churn, clearer lineage, happier users. And in a streaming world, that’s the difference between a neat demo and a dependable product.
