Most vision pipelines nail object detection, then stall when the task turns into structured reasoning. One example that caught the AI Tech Inspire team’s attention: converting per-frame detections into clean, left-to-right strand groupings like 1-2-3. It’s a deceptively simple output that demands a smarter approach than throwing a generic classifier at the data.
When the label you want is a structured sequence, treat the problem as structure first — not just classification.
What we know from the setup
- A YOLO-based CV system detects the target object across videos.
- Visualizations include x vs t, y vs t, and x vs y vs t for all detections; point size reflects bounding box area.
- Goal: cluster these detected strands into left-to-right groups by spatial separation and output a string like 1-2-3 (counts per group, ordered).
- Example targets per video: 1-2-3, 1-2-3-2-3, 1-1-2-3-3-3-3.
- Background false positives exist and should be ignored; bounding box area helps identify them.
- An XGBoost classifier reached ~70% accuracy; better seems possible.
- Constraints: at most 8 groups; each group has at most 3 strands.
Why a straight classifier underperforms
Per-detection classification ignores the underlying structure: this is a spatial-temporal grouping problem with global consistency requirements (left-to-right ordering, limited group sizes). Even a strong tabular model will struggle if the features are local and the target is a sequence-level label.
The fix: model the problem as a pipeline that converts detections into stable tracks, then performs 1D clustering on their spatial statistics with constraints, and finally emits a compact sequence string.
A practical pipeline that tends to work
At AI Tech Inspire, the most reliable blueprint for this kind of task looks like this:
- Step 1 — Track detected objects across time. Convert YOLO boxes to trajectories. Use a multi-object tracker like SORT, DeepSORT, or ByteTrack (e.g., ByteTrack is fast and robust for crowded scenes). Each track ≈ one physical strand.
- Step 2 — Filter background via temporal consistency. Background hits often have short lifetimes, unstable areas, or low confidence. Remove tracks with (a) short duration, (b) high area variance, or (c) low median confidence. Filtering on persistence across frames is more reliable than filtering single frames.
- Step 3 — Reduce each track to a 1D spatial statistic. Compute per-track x features such as median_x, mean_x, and robust spread (e.g., MAD). If the scene is vertical, flip to y. Optionally, smooth with a Kalman filter.
- Step 4 — Segment left-to-right groups using 1D gaps. Sort tracks by median_x, compute adjacent gaps, then split where gaps are “large.” A simple and effective trick: fit a 2-component Gaussian Mixture to the gap distribution (small = within-group, large = between-group) and cut at the intersection.
- Step 5 — Enforce constraints (≤8 groups, ≤3 strands/group). If a segment has >3 tracks, split it using its internal gaps. If there are >8 groups, merge the adjacent pair with the smallest boundary gap, and repeat until the limit is met.
- Step 6 — Emit the sequence string. Count strands in each group, left-to-right, and join with dashes: 1-2-3, etc.
Compared to a monolithic classifier, this pipeline exploits the geometry, time, and known constraints — all the signal you already have.
Tracking, then clustering: the details that matter
Tracking: Turn detections into tracks using a robust online tracker. ByteTrack is a popular option; DeepSORT adds appearance features. The detector often runs in PyTorch, but the tracker itself can be framework-agnostic, since it works from the detector's boxes and scores (plus appearance embeddings in the DeepSORT case).
- Track state: use bounding box (x, y, w, h) and optionally a centroid.
- Stability: smooth positions over time; discard tracks below a frame-count threshold (e.g., Fmin).
- Background suppression: filter tracks with high normalized area variance or irregular temporal presence.
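The three track-level checks above can be sketched in a few lines. This is a minimal illustration, not a tuned filter: the per-frame dict keys (w, h, conf), the function name keep_track, and every threshold value are assumptions made for the example. It also measures area stability as variance divided by squared mean, which is one scale-free reading of "normalized area variance."

```python
from statistics import mean, median, pvariance


def keep_track(track, f_min=15, c_min=0.4, a_max=0.25):
    """Return True if a track looks like a real strand rather than background.

    `track` is a list of per-frame dicts with hypothetical keys
    'w', 'h' and 'conf'. All thresholds are illustrative defaults.
    """
    # (a) short-lived tracks are likely background
    if len(track) < f_min:
        return False
    # (c) low median detector confidence
    if median(d["conf"] for d in track) < c_min:
        return False
    # (b) unstable box area: variance / mean^2 (squared coefficient of variation)
    areas = [d["w"] * d["h"] for d in track]
    m = mean(areas)
    if m <= 0 or pvariance(areas) / (m * m) > a_max:
        return False
    return True


stable = [{"w": 10, "h": 20, "conf": 0.9} for _ in range(30)]
jittery = [{"w": 10 if i % 2 else 40, "h": 20, "conf": 0.9} for i in range(30)]
print(keep_track(stable), keep_track(stable[:5]), keep_track(jittery))
```

Thresholds like these are best chosen by looking at the duration and area distributions of tracks already known to be background.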
1D grouping from gaps: After computing median_x per track:
- Sort tracks by median_x.
- Compute adjacent gaps g_i = x_{i+1} - x_i.
- Fit a 2-Gaussian mixture on g. Estimate the threshold g* at the intersection. Split where g_i > g*.
This is elegant because it adapts to changing densities across videos without hardcoding distance thresholds. Alternatives: hierarchical clustering with complete linkage on 1D positions, DBSCAN (with care on eps), or optimal 1D k-means (e.g., Jenks natural breaks).
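For readers who want to see the gap threshold concretely, here is a dependency-free sketch of a 2-component 1D Gaussian mixture fitted by EM, with g* taken as the point between the two means where the weighted densities are equal. The function name gap_threshold, the initialization at the smallest and largest gap, the variance floor, and the midpoint fallback are all assumptions of this sketch. With only a handful of gaps per video the fit can be fragile, so treat the result as a starting point.

```python
import math


def _log_pdf(x, mu, var):
    """Log density of a 1D Gaussian."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def gap_threshold(gaps, iters=100, min_var=1e-6):
    """Fit a 2-component mixture to the gaps and return the cut point g*.

    Returns None when the gaps give no evidence of two populations.
    """
    g = sorted(gaps)
    if len(g) < 2 or g[0] == g[-1]:
        return None
    mu = [g[0], g[-1]]  # start one component at each extreme
    var = [max(((g[-1] - g[0]) / 2) ** 2, min_var)] * 2
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibilities, computed in log space for stability
        resp = []
        for x in g:
            a = math.log(w[0]) + _log_pdf(x, mu[0], var[0])
            b = math.log(w[1]) + _log_pdf(x, mu[1], var[1])
            m = max(a, b)
            ea, eb = math.exp(a - m), math.exp(b - m)
            resp.append((ea / (ea + eb), eb / (ea + eb)))
        # M-step: update weight, mean and (floored) variance per component
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            if nk < 1e-9:
                continue
            mu[k] = sum(r[k] * x for r, x in zip(resp, g)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, g)) / nk, min_var)
            w[k] = nk / len(g)
    s, l = (0, 1) if mu[0] <= mu[1] else (1, 0)  # small-gap vs large-gap component

    def log_odds(x):
        return (math.log(w[s]) + _log_pdf(x, mu[s], var[s])) - (
            math.log(w[l]) + _log_pdf(x, mu[l], var[l]))

    lo, hi = mu[s], mu[l]
    if not (log_odds(lo) > 0 > log_odds(hi)):
        return (lo + hi) / 2  # fallback: no clean crossing between the means
    for _ in range(60):  # bisection on the sign change
        mid = (lo + hi) / 2
        if log_odds(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


g_star = gap_threshold([5, 6, 5, 40, 42, 6])
print(g_star)
```

A library mixture model would replace the hand-written EM loop; the intersection step stays the same idea.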
Reference pseudocode
Here’s a compact outline many engineers use as a starting point:
tracks = track_objects(detections) # e.g., ByteTrack/DeepSORT
tracks = [t for t in tracks if len(t) >= Fmin and conf_median(t) >= Cmin]
# area filter (optional):
tracks = [t for t in tracks if area_var(t)/area_mean(t) <= Amax]
X = [(t.id, median_x(t)) for t in tracks]
X.sort(key=lambda x: x[1])
# compute gaps
G = [X[i+1][1] - X[i][1] for i in range(len(X)-1)]
# fit 2-Gaussian mixture on G and get threshold g*
g_star = fit_gmm_and_intersection(G)
# segment into groups by cutting where gap > g*
segments = cut_by_gaps(X, G, g_star)
# enforce constraints
segments = split_if_size_gt_3(segments) # split using internal largest gaps
segments = merge_if_groups_gt_8(segments) # merge by smallest boundary gaps
# stringify
result = '-'.join(str(len(seg)) for seg in segments)
Replace fit_gmm_and_intersection with a quick EM fit (scikit-learn’s GaussianMixture works well), or use a percentile-based threshold if you want fewer dependencies.
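The segmentation and constraint helpers named in the outline are left undefined there, so here is one hedged way to fill them in. The function names and signatures below are illustrative and operate on a sorted list of positions instead of the (id, median_x) pairs used above. Refusing a merge that would push a group past three strands is an extra assumption of this sketch, added so the two constraints do not undo each other.

```python
def cut_by_gaps(xs, g_star):
    """Split sorted positions into groups wherever the adjacent gap exceeds g_star."""
    if not xs:
        return []
    groups = [[xs[0]]]
    for prev, cur in zip(xs, xs[1:]):
        if cur - prev > g_star:
            groups.append([cur])
        else:
            groups[-1].append(cur)
    return groups


def split_oversized(groups, max_size=3):
    """Recursively split any group larger than max_size at its largest internal gap."""
    out = []
    stack = list(reversed(groups))  # pop() then yields groups left to right
    while stack:
        g = stack.pop()
        if len(g) <= max_size:
            out.append(g)
            continue
        i = max(range(1, len(g)), key=lambda j: g[j] - g[j - 1])
        stack.append(g[i:])  # right half is processed after the left half
        stack.append(g[:i])
    return out


def merge_excess(groups, max_groups=8, max_size=3):
    """Merge adjacent groups with the smallest boundary gap until the count fits."""
    groups = [list(g) for g in groups]
    while len(groups) > max_groups:
        # only consider merges that keep the merged group within max_size
        cands = [i for i in range(len(groups) - 1)
                 if len(groups[i]) + len(groups[i + 1]) <= max_size]
        if not cands:
            break  # both constraints cannot be met; leave the result as is
        i = min(cands, key=lambda j: groups[j + 1][0] - groups[j][-1])
        groups[i:i + 2] = [groups[i] + groups[i + 1]]
    return groups


xs = [10, 12, 14, 19, 50, 52, 90]
groups = merge_excess(split_oversized(cut_by_gaps(xs, g_star=20)))
print("-".join(str(len(g)) for g in groups))  # -> 3-1-2-1
```

When the merge step gives up because no legal merge exists, that video is a good candidate for manual review.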
Edge cases and how to tame them
- Crossing tracks or curved strands: If strands aren’t vertical in t (that is, x drifts over time), compute median_x over a stable time window, or fit a line per track and evaluate it at a canonical time.
- Occlusions and fragments: Use track stitching (IoU/appearance matching) to merge fragments. Short, jittery fragments are likely background, so drop them early.
- Threshold sensitivity: Instead of a single cut, do a small beam search over plausible thresholds and pick the segmentation that best satisfies constraints plus an internal compactness score.
- Background with similar sizes: Train a small verifier on the cropped detections (mobile CNN head) to label strand vs background. It can run post-tracker for only long-lived tracks to keep compute light.
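For the drifting-strand case in the first bullet, a per-track line fit can be sketched with ordinary least squares. The function name x_at_time and the choice of reference time are assumptions for illustration; any shared time works as long as every track is evaluated at the same one.

```python
def x_at_time(ts, xs, t_ref):
    """Least-squares fit x = a + b*t for one track, evaluated at t_ref.

    ts and xs are the frame times and x positions of a single track.
    Falls back to the mean position if all samples share one timestamp.
    """
    n = len(ts)
    t_mean = sum(ts) / n
    x_mean = sum(xs) / n
    denom = sum((t - t_mean) ** 2 for t in ts)
    if denom == 0:
        return x_mean
    slope = sum((t - t_mean) * (x - x_mean) for t, x in zip(ts, xs)) / denom
    return x_mean + slope * (t_ref - t_mean)


# A strand drifting right by 2 px per frame, projected back to frame 0
print(x_at_time([0, 1, 2, 3], [10, 12, 14, 16], t_ref=0))
```

The fitted value can then stand in for median_x when sorting tracks and computing gaps.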
Metrics beyond plain accuracy
When the output is a sequence such as 1-2-3-2-3, scalar accuracy can be misleading. Consider:
- Boundary F1: Evaluate how often group boundaries (cuts) are placed correctly.
- Edit distance on sequences: Levenshtein distance between predicted and target strings.
- Group-size deviation: Penalize groups exceeding 3 and reward compact intra-group spread.
These metrics guide the thresholding and constraint logic more concretely than a single accuracy score.
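Two of these metrics are easy to sketch directly on the output strings. The version below makes two assumptions worth flagging: edit distance is computed over group counts (tokens), not raw characters, and boundary positions are expressed as cumulative strand counts, which only lines up between prediction and target when both refer to the same left-to-right strand order. Function names are illustrative.

```python
from itertools import accumulate


def parse(seq):
    """'1-2-3' -> [1, 2, 3]"""
    return [int(p) for p in seq.split("-")]


def edit_distance(a, b):
    """Levenshtein distance between two sequences of group counts."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def boundaries(counts):
    """Cut positions as cumulative strand counts, excluding the final total."""
    return set(list(accumulate(counts))[:-1])


def boundary_f1(pred, target):
    """F1 over cut positions between a predicted and a target string."""
    p, t = boundaries(parse(pred)), boundaries(parse(target))
    if not p and not t:
        return 1.0
    tp = len(p & t)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(t)
    return 2 * precision * recall / (precision + recall)


print(edit_distance(parse("1-2-3"), parse("1-2-3-2-3")))  # -> 2
print(boundary_f1("1-2-3", "3-3"))
```

Tracking both numbers per video makes it easier to tell whether errors come from misplaced cuts or from missing and spurious strands.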
If you really want a learned end-to-end model
The structured pipeline above tends to be simpler and more reliable. Still, if a learned approach is required, two directions are promising:
- Pairwise linking with a graph model: Build a graph of tracks, edges weighted by spatial proximity and trajectory similarity (e.g., Hausdorff distance). Train a small GNN to predict “same-group” edges, then run community detection with constraints.
- Set-to-sequence transformer: Encode per-track features and predict group boundaries with a transformer, using a Hungarian loss to match predicted groups to ground-truth segments (a la DETR-style assignment). More complex, but can learn nontrivial priors.
In many cases, the 1D gap-segmentation approach can be expected to beat the ~70% baseline with little engineering and far less training data than an end-to-end model, though that is worth confirming on held-out videos.
Why it matters for developers and engineers
This pattern appears everywhere: lane counting, fiber or cable bundling, manufacturing QC, even biological microscopy where filaments form bundles. The key insight is to honor the data’s structure — temporal coherence and 1D spatial separability — instead of flattening everything into per-detection classification. With a good tracker, robust statistics like median_x, and a principled gap-based cut, you get a crisp left-to-right sequence that reflects the physical layout.
If you’re sitting on a detector and a 70% ceiling, this is the sign to switch perspectives. Treat the output as a sequence, use temporal consistency to fight background, segment in 1D with a data-driven threshold, and enforce constraints directly. The payoff is not just better accuracy — it’s a cleaner, more interpretable pipeline that’s easier to debug and extend.