Pixel-perfect segmentation sounds easy until the foreground is barely there. Think whiteboard photos where actual ink strokes make up only a sliver of the image. At AI Tech Inspire, we spotted a preprint that zeroes in on this scenario and asks a deceptively simple question: when just ~1.8% of pixels are positive, how should segmentation be evaluated so the numbers actually mean something?
Key facts from the preprint
- Investigates semantic segmentation under extreme foreground sparsity (~1.8% positive pixels) in a whiteboard digitization setting.
- Target application: identify ink strokes (foreground) versus background or smudges and export clean results to a note-taking workflow.
- Focus is on evaluation methodology and stress-testing, not on a new loss function.
- Analyzes region metrics such as F1 and IoU.
- Emphasizes boundary metrics including BF1 (Boundary F1) and Boundary-IoU.
- Studies equity across core-stroke vs. thin-stroke subsets.
- Uses multi-seed training to capture variability across random initializations.
- Reports per-image robustness statistics instead of only dataset averages.
- Preprint link: arxiv.org/abs/2603.00163.
Why thin structures quietly break common metrics
Region metrics like IoU and F1 are popular because they’re easy to compute and widely comparable. But they can be misleading for ultra-sparse, thin structures. When positives are rare, small improvements to the background can dominate the score, while tiny geometric errors along a line can destroy the usefulness of the output without moving the metric much.
“If your positives are sparse and thin, region metrics will flatter you long before users do.”
Consider a whiteboard line one pixel thick. Predicting a slightly dilated, slightly misaligned line might still yield a reasonable IoU or F1, yet the output is unusable if the goal is crisp digital ink. This is why boundary-aware metrics—BF1 and Boundary-IoU—matter: they evaluate whether predictions hug the ground-truth edges within a tolerance band, better capturing the visual and functional quality of thin structures.
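To make the failure mode concrete, here is a minimal NumPy sketch (our own illustration, not code from the preprint) of how region scores can stay flattering on a thin line:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

# Ground truth: a 1-px-thick horizontal stroke in a 100x100 image
# (only 0.8% of pixels are foreground).
gt = np.zeros((100, 100), dtype=bool)
gt[50, 10:90] = True

# Prediction: the same stroke dilated to 3 px thick -- it hugs the line
# but is unusable as crisp digital ink.
pred = np.zeros_like(gt)
pred[50:53, 10:90] = True

print(f"pixel accuracy: {(pred == gt).mean():.3f}")  # 0.984: background dominates
print(f"IoU: {iou(pred, gt):.2f}")                   # 0.33: nonzero despite sloppy edges
```

Pixel accuracy looks near-perfect and IoU stays well above zero, yet every stroke edge is wrong, which is exactly the gap boundary metrics are designed to expose.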
Boundary metrics in practice
Boundary-aware metrics shift the question from “How much region overlap is there?” to “How well do the predicted edges align with the true edges?” Two families highlighted in the preprint are:
- BF1 (Boundary F1): Computes precision and recall over boundary pixels, using a distance tolerance (e.g., a few pixels) to allow for small misalignments.
- Boundary-IoU: Applies an IoU-style evaluation, but only within a band around the ground-truth boundaries, reducing the impact of thick-region agreement that’s irrelevant to thin lines.
That tolerance band—often implemented as a morphological dilation of ground-truth edges—becomes your quality knob. Too small, and you unfairly punish small shifts. Too large, and you let sloppy edges pass. Teams building stroke, edge, or vessel detectors can make this a tunable parameter in their evaluation script:
```python
# pseudo-interface: tune boundary tolerance
evaluate(pred_mask, gt_mask, tolerance_px=2, metric="BF1")
```
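One hedged way to flesh out that interface with NumPy/SciPy (our own sketch; `boundary_f1` and the dilation-based tolerance are illustrative, not the preprint's reference implementation):

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def boundary(mask):
    """Boundary pixels: the mask minus its erosion."""
    return mask & ~binary_erosion(mask)

def boundary_f1(pred, gt, tolerance_px=2):
    """Boundary F1: precision/recall over boundary pixels, where a
    boundary pixel counts as matched if the other mask's boundary
    passes within `tolerance_px` (implemented via dilation)."""
    pb, gb = boundary(pred.astype(bool)), boundary(gt.astype(bool))
    if not pb.any() or not gb.any():
        return float(pb.any() == gb.any())
    # Square structuring element of radius `tolerance_px` = the tolerance band.
    struct = np.ones((2 * tolerance_px + 1,) * 2, dtype=bool)
    precision = (pb & binary_dilation(gb, struct)).sum() / pb.sum()
    recall = (gb & binary_dilation(pb, struct)).sum() / gb.sum()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With `tolerance_px=2`, a one-pixel misalignment still scores well; with `tolerance_px=0`, the same prediction is penalized, which is the knob described above in action.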
In everyday pipelines, this approach aligns with what users actually see and need. If your app sends “digital ink” to a notebook or a vector layer, a one-pixel wobble is a user-visible bug; BF1 helps make that bug visible to your metrics.
Core vs. thin-subset equity: are you overfitting to the obvious strokes?
Another intriguing piece is the equity analysis across “core” regions versus the thinnest strokes. A model can look solid on bolder writing (or thicker roads/vessels) but crumble on the faint, narrow parts. If you only report a single average, that underperformance is hidden.
A practical pattern for engineering teams:
- Define subsets (e.g., thin vs. core) based on local stroke width or skeletonization.
- Report IoU/F1 and BF1/Boundary-IoU per subset.
- Check for gaps. If thin-stroke BF1 lags far behind, you know where to focus data collection or augmentation.
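A simple way to approximate that split is a distance transform (a hedged sketch; `split_thin_core` and the 2 px threshold are our own illustration, and a skeleton-based width estimate would be more faithful):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def split_thin_core(gt, thin_width_px=2):
    """Split a GT mask into thin and core subsets by local width.

    The Euclidean distance transform gives each foreground pixel its
    distance to the background; twice that distance approximates the
    local stroke width. Note: the 1-px outer ring of thick regions also
    lands in 'thin' under this proxy.
    """
    width = 2 * distance_transform_edt(gt)
    thin = gt & (width <= thin_width_px)
    core = gt & (width > thin_width_px)
    return thin, core
```

Per-subset metrics then follow by intersecting predictions with each subset, e.g. thin-stroke recall as `(pred & thin).sum() / thin.sum()`.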
For whiteboard capture, that could mean additional training samples of light marker pressure, low-contrast colors, or glancing reflections. For other domains—like road centerlines, blood vessels, or wire harnesses—the same idea applies.
Multi-seed training and per-image robustness
It’s easy to cherry-pick a seed that looks great. It’s much harder to build a model that’s robust across seeds and images. The preprint leans into this by running multiple seeds and by reporting per-image stats—not just dataset averages. That’s a nudge we endorse.
- Multi-seed: Show means and standard deviations across random initializations to communicate model stability.
- Per-image: Surface the long tail. Which images consistently fail? What patterns do they share (lighting, angle, marker color)?
For production teams, consider tracking distributions alongside averages. An internal dashboard that plots per-image BF1 over time can catch regressions faster than a single IoU number ever will.
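The aggregation itself is a few lines of NumPy (scores below are made up purely for illustration):

```python
import numpy as np

# Hypothetical per-image BF1 scores: rows = seeds, cols = images.
scores = np.array([
    [0.91, 0.88, 0.42, 0.95],
    [0.90, 0.85, 0.38, 0.94],
    [0.93, 0.87, 0.45, 0.96],
])

# Dataset-level stability: mean and std across random seeds.
seed_means = scores.mean(axis=1)
print(f"mean={seed_means.mean():.3f} std={seed_means.std():.3f}")

# Per-image robustness: surface the long tail, not just the average.
per_image_mean = scores.mean(axis=0)
worst = np.argsort(per_image_mean)[:1]  # worst-k image indices
print("worst image index:", worst[0])
```

Here image 2 fails consistently across all three seeds, which is exactly the kind of pattern a dataset average hides.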
Practical tips for developers and engineers
- Match metric to use case: If the output is meant for vectorization (e.g., exporting strokes), prioritize BF1 and Boundary-IoU in your gates.
- Tune boundary tolerance: Empirically pick a pixel tolerance that matches your device resolution and end-user expectations. A phone camera at 12MP might justify a smaller tolerance than a 2MP webcam.
- Report subset equity: Split by stroke width or confidence strata to avoid hiding weaknesses.
- Audit seeds and images: Maintain a “nasty nine” set of failure images and run all seeds against them before release.
- Pre/post-processing counts: Small moves—like contrast normalization, adaptive thresholding, or non-maximum suppression—can improve boundary faithfulness more than another 10 epochs of training.
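The seed-and-image audit could look something like this in a release script (names, thresholds, and scores are entirely hypothetical):

```python
# Hypothetical release gate: every seed must clear a BF1 floor on a
# curated set of known-hard images before shipping.
HARD_IMAGES = ["glare_01", "lowlight_02", "thin_marker_03"]
BF1_FLOOR = 0.70

def gate(scores):
    """scores: {seed: {image_name: bf1}} -> True only if all pass."""
    return all(
        scores[seed][img] >= BF1_FLOOR
        for seed in scores
        for img in HARD_IMAGES
    )

results = {
    0: {"glare_01": 0.81, "lowlight_02": 0.74, "thin_marker_03": 0.72},
    1: {"glare_01": 0.79, "lowlight_02": 0.71, "thin_marker_03": 0.73},
}
print("release ok:", gate(results))
```

The point is the shape of the check: per-seed, per-image, against a boundary metric, rather than one dataset-level average.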
If you’re prototyping in PyTorch or TensorFlow, it’s straightforward to drop in boundary metrics during validation. For acceleration-heavy pipelines, ensure distance transforms and morphological ops are efficient; a CUDA-backed implementation or a compiled library can keep these evaluations fast. If you share models on Hugging Face, consider attaching boundary-metric reports to model cards for clarity.
Whiteboard digitization as a proving ground
The whiteboard digitization use case provides a great stress test for thin-structure segmentation. The task is simple to describe—“find only the ink”—but gnarly in practice due to glare, ghosting, parallax, and inconsistent marker pressure. Region metrics can quietly reward flooding in a bit of extra predicted ink (false positives), but users want clean, precise strokes.
Ideas to operationalize:
- Collect a “studio” set (controlled lighting) and a “field” set (meeting rooms, classrooms, glass boards) and report per-split results.
- Try a width-aware loss or thin-structure augmentations, then see if BF1 gains translate to lower human correction time.
- For apps that export to vector paths, evaluate on stroke continuity: count breaks per line and measure average gap size.
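Break counting can be approximated with connected components (a hedged sketch; `continuity_stats` is our own hypothetical helper, and measuring average gap size would need extra bookkeeping):

```python
import numpy as np
from scipy.ndimage import label

def continuity_stats(pred, gt):
    """Count stroke breaks: extra 8-connected components in the
    prediction relative to the ground truth, restricted to GT strokes."""
    eight = np.ones((3, 3), dtype=int)
    # Restrict the prediction to GT strokes so spurious blobs elsewhere
    # don't inflate the component count.
    on_stroke = pred & gt
    _, n_pred = label(on_stroke, structure=eight)
    _, n_gt = label(gt, structure=eight)
    return max(0, n_pred - n_gt)  # breaks introduced by the prediction
```

A prediction that splits one ground-truth stroke into two pieces reports one break, which maps directly onto the user-visible failure of a line that vectorizes into two paths.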
In workflows, it’s often the last 5% of quality that determines whether users trust the feature. Boundary metrics make that last 5% visible.
Beyond whiteboards: where this matters
- Medical imaging: vessel and nerve tracing where pixel-precise boundaries matter.
- Mapping: road centerlines, power lines, and footpaths from aerial imagery.
- Manufacturing: wire harness checks, PCB trace inspection, sealant bead validation.
- OCR/sketch: pen strokes, architectural sketches, and edge extraction for vectorization.
Each of these domains shares the same pain: tiny misalignments hurt usability. Region metrics alone under-report the damage; boundary metrics illuminate it.
A lightweight evaluation checklist
Here’s a checklist you can fold into your next experiment:
- Compute both region (IoU, F1) and boundary (BF1, Boundary-IoU) metrics.
- Pick and justify a boundary tolerance; document it near the numbers.
- Break out thin vs. core subsets; report gaps and track them over time.
- Run 3–5 seeds; report mean and std for each metric.
- Publish per-image metrics; highlight worst 10 samples with thumbnails.
Bonus: Add a keyboard-friendly toggle in your internal viewer—press T to switch between region and boundary overlays—to build intuition about failure cases.
Why it matters
This preprint’s contribution isn’t a new architecture or loss; it’s a clear-eyed evaluation lens for a class of problems many teams face but rarely benchmark well. If your foreground is sparse and your structures are thin, boundary-aware metrics can prevent you from shipping a model that “looks good on paper” but frustrates users in practice.
At AI Tech Inspire, the takeaway is simple: when only ~1.8% of pixels carry the meaning, measure what the pixels mean—not just how many you matched.