Working with the Yelp Open Dataset is a rite of passage for many recommendation and ranking experiments. But there’s a deceptively simple field that can tilt your results without you noticing: business.review_count. At AI Tech Inspire, this came up in a reader question that hits a core issue in ML practice: what counts as data leakage, and when is a metadata field fair game?

Quick context

  • The dataset includes business.json (loaded here as business_df) and review.json (loaded as review_df).
  • business.json has a review_count field for each business.
  • Open question: Is review_count computed directly from the interactions in review.json?
  • Concern: When splitting data into train/test for a recommender, should review_count be recomputed from only training interactions to avoid exposing test data?
  • Goal: Use review_count as a feature/embedding input—ideally without leakage.

What review_count typically represents

In the Yelp dataset, review_count is a snapshot aggregate: the number of reviews Yelp has recorded for that business at the time the dataset was generated. It’s provided by Yelp as part of the business metadata. It may roughly track the number of rows in review.json for the same business_id, but the two can differ because the dump ships only a subset of reviews, so it’s best treated as dataset-era metadata—not something derived on the fly by your pipeline.

That distinction matters. Using Yelp’s provided snapshot-era value means you might be incorporating information from interactions that you intend to hold out for testing—especially if you’re simulating a “train on the past, predict the future” scenario.

Rule of thumb: If your evaluation simulates the future, only use metadata and aggregates available up to your training cutoff date.

So… is it data leakage?

It depends on your split strategy and your problem framing:

  • Random split (no temporal semantics): Using Yelp’s review_count will leak information about items’ overall popularity into the model, because the snapshot count includes interactions that land in your test set. It’s not catastrophic for a toy experiment, but the offline metrics will be optimistic.
  • Temporal split (simulating production): This is where leakage is real. If the model “knows” a business has 8,000 reviews when you train on data up to January, yet many of those reviews happen in February and beyond, your features are peeking into the future (a minimal split sketch follows this list).
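
To make the temporal case concrete, here is a minimal sketch of a date-based split. The 80/20 ratio and the train_df/test_df names are illustrative choices, not something prescribed by the dataset:

# Illustrative temporal split: train on the earliest 80% of reviews, test on the rest
import pandas as pd

review_df['date'] = pd.to_datetime(review_df['date'])  # Yelp ships dates as strings
review_df = review_df.sort_values('date')
cutoff_date = review_df['date'].iloc[int(len(review_df) * 0.8)]
train_df = review_df[review_df['date'] <= cutoff_date]  # the model and its features see only this
test_df = review_df[review_df['date'] > cutoff_date]    # held-out “future” interactions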

In other words, review_count is a legitimate feature, but the timing of how you compute or source it must match your evaluation protocol.

Best practice by goal

  • You’re predicting future interactions (recommended):
    If you want credible offline metrics, recalc review_count from only the training interactions (or more generally, from interactions dated before your cutoff). Doing so makes the feature “point-in-time correct.” This mirrors how a production system would behave: the model only sees what existed then.
  • You’re doing a quick random-split baseline:
    You could use Yelp’s static review_count, note the caveat about optimism, and treat it as a popularity prior. For rigorous comparisons, prefer recomputing from train-only interactions.
  • You’re modeling exposure-based ranking:
    Sometimes you want popularity priors regardless of time (e.g., to reflect external prominence). In that case, using the snapshot count can be reasonable—but be explicit in your write-up that you include a full-snapshot popularity signal.

How to recompute counts safely

Here’s a simple approach using a temporal cutoff and your training interactions. You’d derive features from review_df filtered to your train window and then join them onto business_df:

# pandas: recompute review counts from train-window interactions only
import numpy as np
import pandas as pd

review_df['date'] = pd.to_datetime(review_df['date'])  # no-op if already parsed above
train_reviews = review_df[review_df['date'] <= cutoff_date]
train_counts = train_reviews.groupby('business_id').size().rename('train_review_count').reset_index()
features = business_df.merge(train_counts, on='business_id', how='left').fillna({'train_review_count': 0})
# Often helpful: log scale and clip the heavy tail
features['log_train_review_count'] = np.log1p(features['train_review_count'].clip(0, 1e6))

Why log1p? Count features tend to be heavy-tailed. A log transform stabilizes gradients and keeps models, whether built in TensorFlow or PyTorch, from over-weighting mega-popular items.

If you process at scale, remember aggregation is embarrassingly parallel. On large dumps, you might accelerate with GPUs via CUDA-backed dataframes, but for most experiments the CPU path is fine.
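
If you do want the GPU route, here is a minimal sketch assuming RAPIDS cuDF is installed; it mirrors the pandas aggregation above and only uses cudf.from_pandas, groupby().size(), and to_pandas():

# GPU-backed aggregation with RAPIDS cuDF (assumes a CUDA-capable GPU and cudf installed)
import cudf

gpu_reviews = cudf.from_pandas(train_reviews)           # copy train-window reviews to the GPU
gpu_counts = gpu_reviews.groupby('business_id').size()  # same aggregation, GPU-accelerated
train_counts = gpu_counts.to_pandas().rename('train_review_count').reset_index()  # small result back on CPU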

Point-in-time correctness: a mental model

Think of your feature pipeline like pressing Ctrl+Z on reality. When you train on data up to day T, you must also roll back every derived feature to reflect what you knew at T. That includes counts, averages, and even seemingly harmless “summary” fields. If your downstream task is sequential or time-aware, this discipline will save you from publishing over-optimistic results.

Other fields to sanity-check for leakage

  • Average rating (business.stars): If it reflects reviews beyond the train cutoff, it leaks. Consider recomputing a train_avg_rating and smoothing it toward the global mean with a prior: (alpha*global_mean + sum_ratings) / (alpha + count). A short sketch follows this list.
  • Check-in counts, tip counts, or similar aggregates: Same story—recompute in your train window.
  • Open/closed flags and hours: If these change over time, ensure the value aligns to the train cutoff if your task depends on it.
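
Here is a minimal pandas sketch of that smoothed, train-window average rating. The alpha value and the train_avg_rating name are illustrative assumptions, not fields in the dataset:

# Smoothed train-window average rating: pulls low-count businesses toward the global mean
alpha = 10.0  # prior strength; illustrative value, tune for your data
global_mean = train_reviews['stars'].mean()
agg = train_reviews.groupby('business_id')['stars'].agg(sum_ratings='sum', count='size').reset_index()
agg['train_avg_rating'] = (alpha * global_mean + agg['sum_ratings']) / (alpha + agg['count'])
features = features.merge(agg[['business_id', 'train_avg_rating']], on='business_id', how='left')
features['train_avg_rating'] = features['train_avg_rating'].fillna(global_mean)  # unseen businesses fall back to the prior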

Modeling with popularity-aware features

Used correctly, review_count is a powerful signal. A few practical patterns developers use:

  • Cold-start smoothing: Bucketize counts (e.g., 0, 1–5, 6–20, 21–100, 100+) and learn a small embedding per bucket. This avoids giving rare items noisy continuous values (a short PyTorch sketch follows this list).
  • Two-tower recommenders: Pass log_train_review_count into the item tower alongside text/category embeddings. Tooling like TensorFlow Recommenders or a custom PyTorch setup handles this cleanly.
  • Calibration and fairness: Popularity priors can drown out niche items. Consider adding a loss term or post-ranking step that balances relevance and diversity.
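
As a sketch of the bucketization pattern, assuming PyTorch for the embedding side; the bucket edges follow the example above, and names like popularity_emb and the embedding size are purely illustrative:

# Bucketize train-window counts and learn a small embedding per bucket (PyTorch sketch)
import numpy as np
import torch
import torch.nn as nn

edges = np.array([0, 5, 20, 100])                 # buckets: 0, 1–5, 6–20, 21–100, 100+
counts = features['train_review_count'].to_numpy()
bucket_ids = np.searchsorted(edges, counts, side='left')  # integer bucket 0..4 per business

popularity_emb = nn.Embedding(num_embeddings=5, embedding_dim=8)  # one learned vector per bucket
item_popularity = popularity_emb(torch.as_tensor(bucket_ids, dtype=torch.long))
# item_popularity can be concatenated with text/category embeddings in the item tower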

What about using Yelp’s snapshot as-is?

There are cases where using the provided review_count makes sense—e.g., if your goal is to benchmark algorithms under identical, widely-used metadata. But the key is transparency. If the count includes interactions that appear in your test set, call it out and frame your results appropriately. For production-like evaluation, stick to train-only aggregates.

Practical decision checklist

  • Is your test set simulating the future? Recompute counts from data available before the split time.
  • Are you doing a quick, random baseline? You can use the snapshot count, but expect optimistic metrics.
  • Do you care about popularity as an external signal? Use the snapshot, document it, and consider fairness calibration.

Takeaway

The short answer: review_count in business.json is a static field provided by Yelp at the dataset snapshot time. It is not tied to your train/test split, and yes, it can leak information if your evaluation is temporal. The reliable path is to recompute a train_review_count from interactions within your training window, optionally log-transform and smooth it, and feed it into your model.

It’s a small change with a big impact on result credibility. As always, the headline is simple but the engineering is in the details, exactly the kind of rigor the AI Tech Inspire community values. Curious how point-in-time features fit into your stack, or into pipelines built on Hugging Face Datasets? Apply these ideas in your workflow and pressure-test your metrics. The next time a leaderboard jumps, you’ll know it’s for the right reasons.
