Machine translation quality is often judged by proxy metrics and distant signals. But every so often, a clean, human-annotated resource drops that invites a closer look at what models actually get wrong. At AI Tech Inspire, this new release caught our eye because it fills a practical gap: professional, fine-grained error labels you can actually use.


Key facts at a glance

  • Open-source MQM-annotated machine translation dataset.
  • 362 translation segments.
  • 16 language pairs.
  • 48 professional linguists (not crowdsourced).
  • Full MQM error annotations: category, severity, and span.
  • Multiple annotators per segment for inter-annotator agreement (IAA) analysis.
  • Methodology follows WMT guidelines (error typology, severity levels).
  • Reported IAA of Kendall’s τ = 0.317, roughly 2.6× that of typical WMT campaigns.
  • Dataset link: alconost/mqm-translation-gold on Hugging Face.
  • Authors welcome feedback on annotation process and methodology.

Why this matters for practitioners

Most open MT test sets lean on crowdsourcing or are locked behind paywalls. That creates two pain points for engineers: noisy annotations and limited access. This release aims straight at both by offering professionally curated MQM labels in the open, with enough signal density to power careful error analyses.

MQM (Multidimensional Quality Metrics) goes beyond thumbs-up/down quality judgments. Each error is tagged with a category (e.g., Accuracy, Fluency, Terminology, Style), a severity level (minor, major, critical), and the exact span in the translation. For anyone building translation systems—or prompting general-purpose LLMs like GPT to translate—these labels are gold. They turn vague “it’s wrong” feedback into structured, model-actionable guidance.
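To make the structure concrete, here is a minimal sketch of what one span-level MQM annotation looks like as a record. The field names are illustrative, not the dataset's actual schema — inspect a real sample before relying on any of them:

```python
from dataclasses import dataclass

# A minimal sketch of one MQM error annotation; field names are
# illustrative, not the dataset's actual schema.
@dataclass
class MQMError:
    category: str   # e.g. "Accuracy", "Fluency", "Terminology", "Style"
    severity: str   # "minor", "major", or "critical"
    start: int      # character offset of the error span in the translation
    end: int        # exclusive end offset

    def span_text(self, translation: str) -> str:
        """Return the exact substring the annotator flagged."""
        return translation[self.start:self.end]

translation = "The cat sat on the math."
err = MQMError(category="Accuracy", severity="major", start=19, end=23)
print(err.span_text(translation))  # -> "math"
```

Having the span as explicit offsets is what makes the feedback "model-actionable": you can point a reviewer, a metric, or a prompt at the exact substring that failed.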

“Span-level MQM is where debugging meets measurement: you don’t just know that a system underperforms—you learn how and where.”

What’s inside—and what’s notable

There are 362 segments across 16 language pairs, each annotated by professional linguists. The dataset follows WMT guidelines for error taxonomy and severities, aligning with how many research-grade evaluations are already conducted. Multiple annotators per segment unlock inter-annotator agreement (IAA) studies—critically important when you’re calibrating your own reviewers or comparing vendor quality.

One standout metric: Kendall’s τ = 0.317 for IAA, which is reported to be roughly 2.6× what typical WMT campaigns see. That suggests consistent annotator training and instructions can move the needle in a meaningful way. In practical terms, higher IAA means you can place more trust in the signal you’re getting from the labels—and build more confident correlations with automated metrics.
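If you want to replicate that kind of agreement number on your own annotator pairs, here is a pure-Python sketch of Kendall's τ (the simple tau-a variant, which ignores tie corrections; production code would typically use `scipy.stats.kendalltau`, which handles ties):

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a between two equal-length score lists.
    Counts concordant vs. discordant pairs; tied pairs count as neither."""
    assert len(x) == len(y) and len(x) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs

# Two annotators' severity-weighted scores for the same five segments
a = [0, 1, 5, 10, 2]
b = [0, 2, 4, 9, 1]
print(round(kendall_tau_a(a, b), 3))  # -> 0.8
```

Running this over per-segment scores from pairs of annotators gives you a directly comparable agreement figure.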

How developers can use this right now

  • Calibrate automatic metrics: Test how well BLEU, chrF, COMET, or BLEURT track MQM categories and severities on your models.
  • Error profiling: Identify whether your system fails more on Accuracy vs. Fluency, or if Terminology issues spike for certain language pairs.
  • Prompt engineering for LLM translation: Use span-level feedback to craft prompts that emphasize faithfulness or style depending on consistent failure modes.
  • Quality Estimation (QE): Train or fine-tune QE models that predict MQM-like labels or severities from source–hypothesis pairs.
  • Review pipeline design: Compare your in-house linguist reviews with this dataset’s patterns to validate training, rubrics, and reviewer agreement.

If you’re working in PyTorch or building evaluation scripts around the Hugging Face ecosystem, it’s straightforward to get started:

from datasets import load_dataset
from collections import Counter

# Load the dataset from the Hugging Face Hub
mqm = load_dataset("alconost/mqm-translation-gold")

# Inspect the available splits and the schema
print(mqm)
split = list(mqm.keys())[0]  # use whichever split exists
print(mqm[split][0])

# Example: group errors by category/severity and compute simple counts.
# Field names here are illustrative -- adjust them after inspecting
# the sample printed above.
counts = Counter(
    (err["category"], err["severity"])
    for row in mqm[split]
    for err in row.get("errors", [])
)
print(counts.most_common(10))

Because annotations include spans, you can go beyond counts and compute per-token or per-character error rates, or visualize heatmaps over translations to see which regions systematically break.
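A per-character error rate from span offsets can be computed like this — a small sketch that merges overlapping spans so a doubly-annotated region is only counted once (span format assumed to be `(start, end)` with an exclusive end):

```python
def char_error_rate(spans, text_len):
    """Fraction of characters covered by at least one error span.
    spans: list of (start, end) offsets, end exclusive; overlaps are merged."""
    covered = 0
    last_end = 0
    for start, end in sorted(spans):
        start = max(start, last_end)  # clip overlap with earlier spans
        if end > start:
            covered += end - start
            last_end = end
    return covered / text_len if text_len else 0.0

# Two overlapping error spans over a 20-character translation
print(char_error_rate([(2, 8), (5, 12)], 20))  # -> 0.5
```

The same merged-interval logic feeds naturally into heatmaps: instead of a single ratio, accumulate a per-position counter across many segments.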

Small but sharp: thinking about the size

With 362 segments, the dataset isn’t large. But high-fidelity human annotation is expensive, and the depth of each labeled segment is nontrivial. In many workflows, a carefully constructed, professionally annotated slice is worth more than a sea of noisy labels—especially for diagnosing failure modes and creating targeted improvements.

A practical approach is to treat this dataset as a calibration and validation set for your own evaluations. Use it to tune how you score, aggregate, and weight severities. Then, scale similar scoring to your larger, noisier pools of translations to maintain consistency across projects and languages.
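Severity weighting is the main knob to tune in that calibration step. A minimal sketch, with weights that are purely illustrative (published MQM setups differ; pick values that predict impact in your product):

```python
# Illustrative severity weights -- tune these for your product;
# published MQM-style campaigns use a range of conventions.
WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}

def mqm_score(errors, weights=WEIGHTS):
    """Total weighted penalty for one segment; lower is better.
    errors: list of dicts with a 'severity' key."""
    return sum(weights[e["severity"]] for e in errors)

segment_errors = [
    {"category": "Terminology", "severity": "minor"},
    {"category": "Accuracy", "severity": "major"},
]
print(mqm_score(segment_errors))  # -> 6.0
```

Once the weights are fixed against this dataset, applying the same `mqm_score` to your larger, noisier review pools keeps scoring consistent across projects.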

Agreement that actually informs decisions

The reported Kendall’s τ = 0.317 is a useful signal. Perfect agreement is unrealistic in linguistics, and agreement measures can vary widely by task and domain. Still, a jump in IAA—achieved here through consistent annotator training—speaks to process discipline. For teams building review pipelines, that’s a reminder: investing in clear rubrics, shared examples, and calibration sessions can raise reviewer consistency and make your metrics more stable over time.

Even with improved IAA, disagreement is expected. Consider using adjudication passes or majority-vote aggregation on critical segments, and track how robust your conclusions are under different aggregation rules (e.g., severity-weighted MQM scores versus plain counts).
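The robustness check at the end of that paragraph is worth making concrete: the same two systems can rank differently under plain error counts versus severity-weighted scores. A toy sketch with hypothetical error lists:

```python
WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def plain_count(severities):
    return len(severities)

def weighted_penalty(severities):
    return sum(WEIGHTS[sev] for sev in severities)

# Hypothetical per-segment severity lists for two systems
system_a = [["minor", "minor", "minor"], ["minor"]]  # many small slips
system_b = [["critical"], []]                        # one critical error

for name, rule in [("plain count", plain_count),
                   ("severity-weighted", weighted_penalty)]:
    a = sum(rule(seg) for seg in system_a)
    b = sum(rule(seg) for seg in system_b)
    better = "A" if a < b else "B"
    print(f"{name}: A={a}, B={b} -> system {better} looks better")
```

Here the ranking flips between the two rules, which is exactly why conclusions should be reported alongside the aggregation rule that produced them.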

Comparisons and complementary tools

This resource slots in nicely alongside automated metrics. For example, if you observe that COMET correlates strongly with Accuracy but weakly with Style on this dataset, you could adjust how you stack or select metrics depending on your product goals. If latency or compute cost matters, automated metrics remain essential for scale, but MQM labels show where those metrics miss. The blend—fast automated screening plus targeted human MQM—is often the pragmatic sweet spot.

It also helps with broader model evaluation beyond MT. If you prompt general-purpose LLMs (e.g., GPT) for translation, MQM spans can steer prompt templates or system messages, and provide concrete examples for few-shot corrections. The same span-level error taxonomy can inspire tooling for summarization and data-to-text tasks where Accuracy versus Fluency tensions appear in different forms.

Limitations and gotchas to keep in mind

  • Sample size: 362 segments won’t cover every domain or linguistic phenomenon; treat it as a high-quality microscope, not a telescope.
  • Domain bias: Without broad domain detail, avoid overgeneralizing system rankings; instead, use the data to profile error types.
  • Metric selection: MQM is manual and costly; combine it with automated measures for sustainable evaluation at scale.
  • IAA ≠ perfection: A higher τ is encouraging, but always inspect disagreement distributions and edge cases.

Ideas to explore

  • Severity weighting experiments: What weights best align with user impact in your product? Try tuning minor/major/critical weights to predict real-world complaints or support tickets.
  • Category-specific regressions: Train lightweight predictors for each error category to guide targeted model improvements.
  • Span-aware diffing: Use spans to compute token-level confusion matrices and uncover brittle lexical or morphological patterns.
  • Cross-pair transfer: Are error profiles stable across related language pairs? Could you bootstrap annotator training from one pair to another?

Getting started quickly

Access the dataset on Hugging Face here: alconost/mqm-translation-gold. A simple workflow many teams follow:

  1. Load the data and print a few examples to confirm schema and fields.
  2. Aggregate by error category and severity to create baseline dashboards.
  3. Correlate MQM-derived scores with your automated metrics and human acceptability ratings.
  4. Translate a small in-house sample with your systems and replicate the annotation rubric for comparison (even with one or two trained reviewers).

As the authors note, they welcome feedback on the annotation process and methodology. That openness invites replication studies and improvements—exactly what the community needs to converge on reliable evaluation practices.


Final takeaway from AI Tech Inspire: This dataset is a compact, professional lens on translation errors, not a one-stop benchmark. Use it to sharpen your evaluation stack, validate your reviewer training, and generate targeted hypotheses for model or prompt improvements. In an era awash with generic scores, span-level MQM is a refreshing reminder that where a model fails often matters more than by how much.
