Most vision pipelines nail object detection, then stall when the task turns into structured reasoning. One example that caught the AI Tech Inspire team’s attention: converting per-frame detections into clean, left-to-right strand groupings like 1-2-3. It’s a deceptively simple output that demands a smarter approach than throwing a generic classifier at the data.

When the label you want is a structured sequence, treat the problem as structure first — not just classification.


What we know from the setup

  • A YOLO-based CV system detects the target object across videos.
  • Visualizations include x vs t, y vs t, and x vs y vs t for all detections; point size reflects bounding box area.
  • Goal: cluster these detected strands into left-to-right groups by spatial separation and output a string like 1-2-3 (counts per group, ordered).
  • Example targets per video: 1-2-3, 1-2-3-2-3, 1-1-2-3-3-3-3.
  • Background false positives exist; bounding box area helps identify them, and they should be excluded from the grouping.
  • An XGBoost classifier reached ~70% accuracy; better seems possible.
  • Constraints: at most 8 groups; each group has at most 3 strands.

Why a straight classifier underperforms

Per-detection classification ignores the underlying structure: this is a spatiotemporal grouping problem with global consistency requirements (left-to-right ordering, limited group sizes). Even a strong tabular model will struggle when its features are local to a detection and the target is a label for the whole sequence.

The fix: model the problem as a pipeline that converts detections into stable tracks, then performs 1D clustering on their spatial statistics with constraints, and finally emits a compact sequence string.


A practical pipeline that tends to work

At AI Tech Inspire, the most reliable blueprint for this kind of task looks like this:

  • Step 1: Track detected objects across time. Convert YOLO boxes to trajectories with a multi-object tracker such as SORT, DeepSORT, or ByteTrack (ByteTrack is fast and holds up well in crowded scenes). Each track ≈ one physical strand.
  • Step 2: Filter background via temporal consistency. Background hits often have short lifetimes, unstable areas, or low confidence. Remove tracks with (a) short duration, (b) high area variance, or (c) low median confidence. Filtering on persistence across frames is more reliable than filtering single frames.
  • Step 3: Reduce each track to a 1D spatial statistic. Compute per-track x features such as median_x, mean_x, and a robust spread (e.g., MAD). If the strands are laid out top-to-bottom rather than left-to-right, use y instead. Optionally, smooth positions with a Kalman filter first.
  • Step 4: Segment left-to-right groups using 1D gaps. Sort tracks by median_x, compute adjacent gaps, then split where gaps are “large.” A simple and effective trick: fit a 2-component Gaussian mixture to the gap distribution (small = within-group, large = between-group) and cut at the gap value where the two weighted component densities intersect.
  • Step 5: Enforce constraints (≤8 groups, ≤3 strands/group). If a segment has >3 tracks, split it at its largest internal gaps. If there are >8 groups, merge the adjacent pair with the smallest boundary gap, as long as the merged group still has ≤3 strands, and repeat.
  • Step 6: Emit the sequence string. Count strands in each group, left-to-right, and join with dashes: 1-2-3, etc.

Compared to a monolithic classifier, this pipeline exploits the geometry, time, and known constraints — all the signal you already have.
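
Steps 5 and 6 are mostly bookkeeping, and a minimal sketch may help make them concrete. It assumes each group is a list of per-track x positions; the function names (split_oversized, merge_excess, finalize) are hypothetical, and the rule of merging only when the result stays within 3 strands is one reasonable reading of the constraints, not the only one.

```python
def split_oversized(group, max_size=3):
    """Recursively split a sorted group at its widest internal gap."""
    if len(group) <= max_size:
        return [group]
    gaps = [b - a for a, b in zip(group, group[1:])]
    cut = gaps.index(max(gaps)) + 1
    return (split_oversized(group[:cut], max_size)
            + split_oversized(group[cut:], max_size))


def merge_excess(groups, max_groups=8, max_size=3):
    """While over the group cap, merge the adjacent pair with the smallest gap."""
    groups = [list(g) for g in groups]
    while len(groups) > max_groups:
        # Only consider merges that keep the merged group within max_size.
        candidates = [
            (groups[i + 1][0] - groups[i][-1], i)
            for i in range(len(groups) - 1)
            if len(groups[i]) + len(groups[i + 1]) <= max_size
        ]
        if not candidates:
            break  # no legal merge left; the video deserves a manual look
        _, i = min(candidates)
        groups[i:i + 2] = [groups[i] + groups[i + 1]]
    return groups


def finalize(groups, max_groups=8, max_size=3):
    """Apply both constraints, then emit the dash-joined count string."""
    parts = [p for g in groups for p in split_oversized(sorted(g), max_size)]
    parts = merge_excess(parts, max_groups, max_size)
    return "-".join(str(len(p)) for p in parts)


print(finalize([[10, 12, 14, 30], [100, 104]]))
```

Splitting runs before merging so that the group cap is checked against the final number of segments.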


Tracking, then clustering: the details that matter

Tracking: Turn detections into tracks using a robust online tracker. ByteTrack is a popular option; DeepSORT adds appearance features. The detector usually runs in a deep learning framework such as PyTorch, but a motion-only tracker consumes just boxes and confidence scores, so it can stay framework-agnostic.

  • Track state: use bounding box (x, y, w, h) and optionally a centroid.
  • Stability: smooth positions over time; discard tracks shorter than a minimum frame count (Fmin in the pseudocode below).
  • Background suppression: filter tracks with high normalized area variance or irregular temporal presence.
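
The filtering and per-track summary described in these bullets could look roughly like the sketch below. The Detection record, the helper names, and the default thresholds (f_min, c_min, a_max) are illustrative assumptions to tune on your own footage, not values from the original setup.

```python
from dataclasses import dataclass
from statistics import mean, median, pvariance


@dataclass
class Detection:
    frame: int
    x: float     # box centre x
    y: float     # box centre y
    w: float
    h: float
    conf: float


def keep_track(track, f_min=15, c_min=0.4, a_max=0.25):
    """Return True if a track looks like a real strand rather than background."""
    if len(track) < f_min:                       # too short-lived
        return False
    if median(d.conf for d in track) < c_min:    # low median confidence
        return False
    areas = [d.w * d.h for d in track]
    m = mean(areas)
    if m == 0:
        return False
    # Normalized (unitless) area variance: variance over squared mean.
    return pvariance(areas) / m ** 2 <= a_max


def summarize(track):
    """Robust 1D summary: median x and median absolute deviation (MAD)."""
    xs = [d.x for d in track]
    med = median(xs)
    mad = median(abs(x - med) for x in xs)
    return med, mad
```

Each surviving track then contributes a single median_x to the grouping step, with the MAD available as a sanity check on how stable that position is.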

1D grouping from gaps: After computing median_x per track:

  • Sort tracks by median_x.
  • Compute adjacent gaps g_i = x_{i+1} - x_i.
  • Fit a 2-Gaussian mixture on g. Estimate the threshold g* where the two weighted component densities intersect. Split where g_i > g*.

This is elegant because it adapts to changing densities across videos without hardcoding distance thresholds. Alternatives: hierarchical clustering with complete linkage on 1D positions, DBSCAN (with care on eps), or optimal 1D k-means (e.g., Jenks natural breaks).
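
To make the threshold step concrete, here is a dependency-free sketch: a tiny hand-written EM for a two-component 1D Gaussian mixture, followed by a bisection for the point between the two means where the weighted densities are equal. It is a stand-in for a library mixture model, not a reference implementation; the function names, the initialisation at the smallest and largest gap, and the standard-deviation floor are all assumptions.

```python
import math


def fit_two_gaussians(gaps, iters=100):
    """EM for a 2-component 1D mixture; returns (weights, means, stds)."""
    g = sorted(gaps)
    if len(g) < 2:
        raise ValueError("need at least two gaps")
    span = g[-1] - g[0]
    floor = max(1e-3 * span, 1e-9)      # std floor so no component collapses
    mu = [g[0], g[-1]]                  # start at the two extremes
    sigma = [max(span / 4, floor)] * 2
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of the "large gap" component for each gap.
        # The 1/sqrt(2*pi) factor cancels in the ratio, so it is omitted.
        resp = []
        for x in g:
            p = [w[k] * math.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2) / sigma[k]
                 for k in (0, 1)]
            total = p[0] + p[1]
            if total > 0:
                resp.append(p[1] / total)
            else:  # both densities underflowed: assign to the nearer mean
                resp.append(float(abs(x - mu[1]) < abs(x - mu[0])))
        # M-step: update weight, mean, and std of each component.
        for k in (0, 1):
            r = resp if k == 1 else [1 - v for v in resp]
            nk = sum(r)
            if nk < 1e-9:
                continue
            w[k] = nk / len(g)
            mu[k] = sum(ri * x for ri, x in zip(r, g)) / nk
            var = sum(ri * (x - mu[k]) ** 2 for ri, x in zip(r, g)) / nk
            sigma[k] = max(math.sqrt(var), floor)
    return w, mu, sigma


def gap_threshold(gaps):
    """Gap value between the two means where the weighted densities cross."""
    w, mu, sigma = fit_two_gaussians(gaps)

    def log_ratio(x):
        a, b = (math.log(w[k]) - math.log(sigma[k])
                - 0.5 * ((x - mu[k]) / sigma[k]) ** 2 for k in (0, 1))
        return a - b

    lo, hi = mu[0], mu[1]
    if not (log_ratio(lo) > 0 > log_ratio(hi)):
        return 0.5 * (lo + hi)          # no clean crossing: use the midpoint
    for _ in range(60):                 # bisection on the sign change
        mid = 0.5 * (lo + hi)
        if log_ratio(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

When all gaps are nearly equal, the two components coincide and the sketch falls back to the midpoint, so no gap exceeds the threshold and everything stays in one group. With only a handful of gaps per video, it is worth checking the fitted means by eye before trusting the cut.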


Reference pseudocode

Here’s a compact outline to use as a starting point:

tracks = track_objects(detections)  # e.g., ByteTrack/DeepSORT; one track per strand
# keep long-lived, confident tracks
tracks = [t for t in tracks if len(t) >= Fmin and conf_median(t) >= Cmin]
# area filter (optional): normalized (unitless) area variance
tracks = [t for t in tracks if area_var(t) / area_mean(t)**2 <= Amax]

X = [(t.id, median_x(t)) for t in tracks]
X.sort(key=lambda x: x[1])

# compute gaps
G = [X[i+1][1] - X[i][1] for i in range(len(X)-1)]

# fit 2-Gaussian mixture on G and get threshold g*
g_star = fit_gmm_and_intersection(G)

# segment into groups by cutting where gap > g*
segments = cut_by_gaps(X, G, g_star)

# enforce constraints
segments = split_if_size_gt_3(segments) # split using internal largest gaps
segments = merge_if_groups_gt_8(segments) # merge by smallest boundary gaps

# stringify
result = '-'.join(str(len(seg)) for seg in segments)

Replace fit_gmm_and_intersection with a quick EM fit (scikit-learn’s GaussianMixture works well), or use a percentile-based threshold if you want fewer dependencies.
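
For the percentile-based fallback, a minimal sketch might look like the following. The helper names and the choice of percentile are hypothetical; note that a fixed percentile always cuts roughly the same fraction of gaps, so it suits videos whose group structure is fairly uniform.

```python
import math


def percentile(values, p):
    """Linear-interpolation percentile, p in [0, 100]."""
    s = sorted(values)
    if not s:
        raise ValueError("need at least one value")
    k = (len(s) - 1) * p / 100
    lo, hi = math.floor(k), math.ceil(k)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)


def split_at_large_gaps(xs, threshold):
    """Split sorted positions wherever the adjacent gap exceeds threshold."""
    if not xs:
        return []
    groups = [[xs[0]]]
    for prev, cur in zip(xs, xs[1:]):
        if cur - prev > threshold:
            groups.append([cur])        # large gap: start a new group
        else:
            groups[-1].append(cur)      # small gap: same group
    return groups


def group_by_percentile(median_xs, p=60):
    """Sort positions, derive the threshold from the gaps, and segment."""
    xs = sorted(median_xs)
    if len(xs) < 2:
        return [xs] if xs else []
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    return split_at_large_gaps(xs, percentile(gaps, p))


groups = group_by_percentile([50, 10, 12, 55, 53, 90])
print("-".join(str(len(g)) for g in groups))
```

The constraint step still has to run afterwards, since nothing here limits group size or group count.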


Edge cases and how to tame them

  • Crossing tracks or curved strands: If a strand’s x position drifts over time (its trace in the x vs t plot is not parallel to the t axis), compute median_x over a stable time window, or fit a line per track and use its value at a canonical reference time.
  • Occlusions and fragments: Use track stitching (IoU/appearance matching) to merge fragments. Short, jittery fragments are likely background — drop them early.
  • Threshold sensitivity: Instead of a single cut, do a small beam search over plausible thresholds and pick the segmentation that best satisfies constraints plus an internal compactness score.
  • Background with similar sizes: Train a small verifier on the cropped detections (mobile CNN head) to label strand vs background. It can run post-tracker for only long-lived tracks to keep compute light.
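
For the drifting-strand case in the first bullet, the line-fit idea can be sketched with ordinary least squares. The function name and the reference frame t_ref are assumptions; any frame at which every strand is visible would serve.

```python
def x_at_time(frames, xs, t_ref):
    """Fit x = a + b*t by least squares and return the fitted x at t_ref."""
    n = len(frames)
    if n == 0:
        raise ValueError("empty track")
    t_mean = sum(frames) / n
    x_mean = sum(xs) / n
    stt = sum((t - t_mean) ** 2 for t in frames)
    if stt == 0:                 # single frame: no slope to estimate
        return x_mean
    slope = sum((t - t_mean) * (x - x_mean) for t, x in zip(frames, xs)) / stt
    return x_mean + slope * (t_ref - t_mean)


# A strand drifting right by 2 px per frame, evaluated at frame 0.
print(x_at_time([0, 1, 2, 3], [10, 12, 14, 16], t_ref=0))
```

Evaluating every track at the same reference frame keeps the left-to-right ordering consistent even when strands are observed over different time ranges.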

Metrics beyond plain accuracy

When the output is a sequence such as 1-2-3-2-3, scalar accuracy can be misleading. Consider:

  • Boundary F1: Evaluate how often group boundaries (cuts) are placed correctly.
  • Edit distance on sequences: Levenshtein distance between predicted and target strings.
  • Group-size deviation: Penalize groups exceeding 3 and reward compact intra-group spread.

These metrics guide the thresholding and constraint logic more concretely than a single accuracy score.
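
Two of these metrics are easy to sketch directly on the output strings. The code below treats each group count as one token for the edit distance, and places boundaries at cumulative strand counts for the F1; that second choice assumes predicted and target strands line up by index, which only holds approximately when the total counts differ. The function names are hypothetical.

```python
from itertools import accumulate


def parse(seq):
    """'1-2-3' -> [1, 2, 3]"""
    return [int(tok) for tok in seq.split("-")]


def edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def boundary_f1(pred, target):
    """F1 over cut positions, measured in cumulative strand index."""
    p = set(accumulate(parse(pred)[:-1]))
    t = set(accumulate(parse(target)[:-1]))
    if not p and not t:
        return 1.0
    tp = len(p & t)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(t)
    return 2 * precision * recall / (precision + recall)


print(edit_distance(parse("1-2-3"), parse("1-2-3-2-3")))
print(boundary_f1("1-2-3", "3-3"))
```

Edit distance tells you how far off the whole string is, while boundary F1 separates "cut in the wrong place" from "too many or too few cuts", which is more useful when tuning the threshold.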


If you really want a learned end-to-end model

The structured pipeline above tends to be simpler and more reliable. Still, if a learned approach is required, two directions are promising:

  • Pairwise linking with a graph model: Build a graph of tracks, edges weighted by spatial proximity and trajectory similarity (e.g., Hausdorff distance). Train a small GNN to predict “same-group” edges, then run community detection with constraints.
  • Set-to-sequence transformer: Encode per-track features and predict group boundaries with a transformer, using a Hungarian loss to match predicted groups to ground-truth segments (a la DETR-style assignment). More complex, but can learn nontrivial priors.

In practice, the 1D gap-segmentation approach is the natural first candidate for beating the 70% baseline: it needs little engineering and no labeled training data beyond what you use to tune thresholds.


Why it matters for developers and engineers

This pattern appears everywhere: lane counting, fiber or cable bundling, manufacturing QC, even biological microscopy where filaments form bundles. The key insight is to honor the data’s structure — temporal coherence and 1D spatial separability — instead of flattening everything into per-detection classification. With a good tracker, robust statistics like median_x, and a principled gap-based cut, you get a crisp left-to-right sequence that reflects the physical layout.

If you’re sitting on a detector and a 70% ceiling, this is the sign to switch perspectives. Treat the output as a sequence, use temporal consistency to fight background, segment in 1D with a data-driven threshold, and enforce constraints directly. The payoff is not just better accuracy — it’s a cleaner, more interpretable pipeline that’s easier to debug and extend.
