
How We Test

Creator: Tracker Benchmark
Published: 2026-05-17T00:00:00.000Z

Every Tracker Benchmark review is scored on the same 100-point rubric. The protocol is published below in enough detail that an outside party could replicate it.

The 100-point rubric

Scoring rubric
Criterion	Weight	What we measure
Accuracy	30%	Mean Absolute Percentage Error (MAPE) vs weighed reference meals
Database quality	20%	Coverage, verification source, freshness, noise resilience
AI photo recognition	20%	Top-1 / top-3 identification, portion-size MAPE, graceful failure
Speed	10%	Median time-to-log across a standardized 20-task battery
UX	10%	Friction-of-correction, accessibility, onboarding clarity, absence of dark patterns
Price	10%	Real cost after 12 months, useful free-tier surface
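As an illustration of how the weights in the table could combine into a single number, here is a minimal sketch. It assumes each criterion produces a sub-score on a 0 to 100 scale and that the total is a plain weighted average; the function name `overall_score` and the dictionary keys are hypothetical, and this is not the published scoring code.

```python
# Criterion weights in percentage points, copied from the rubric table.
WEIGHTS = {
    "accuracy": 30,
    "database_quality": 20,
    "ai_photo_recognition": 20,
    "speed": 10,
    "ux": 10,
    "price": 10,
}


def overall_score(subscores):
    """Weighted average of per-criterion sub-scores (each assumed to be 0-100)."""
    missing = set(WEIGHTS) - set(subscores)
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    total = sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)
    # The weights sum to 100, so dividing returns a value on the 0-100 scale.
    return total / sum(WEIGHTS.values())
```

Under these assumptions, an app scoring 80 on accuracy and 50 on every other criterion would land at 59 overall.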

How we measure accuracy

The reference battery is built from USDA FoodData Central composition values, with portions weighed on a calibrated kitchen scale (precision 0.1g). We compute the Mean Absolute Percentage Error (MAPE) of each app's predicted kcal vs the reference value across the battery.

Scoring anchor: accuracy score = 100 − (overall MAPE × 4), with MAPE expressed in percentage points and the result clamped to the 0 to 100 range. A 5% MAPE earns 80 points; a 15% MAPE earns 40; a MAPE of 25% or more earns zero.
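The two steps above (MAPE, then the scoring anchor) can be sketched as follows. This assumes MAPE is averaged per meal over the whole battery and that reference kcal values are non-zero; the function names are hypothetical, and this is an illustration of the formula rather than the scoring code we will publish.

```python
def mape(predicted_kcal, reference_kcal):
    """Mean Absolute Percentage Error, in percentage points."""
    if len(predicted_kcal) != len(reference_kcal) or not reference_kcal:
        raise ValueError("need two non-empty lists of equal length")
    total = sum(
        abs(pred - ref) / ref
        for pred, ref in zip(predicted_kcal, reference_kcal)
    )
    return 100.0 * total / len(reference_kcal)


def accuracy_score(mape_percent):
    """Scoring anchor: 100 - 4 * MAPE, clamped to the 0-100 range."""
    return max(0.0, min(100.0, 100.0 - 4.0 * mape_percent))
```

For example, `accuracy_score(5)` gives 80, `accuracy_score(15)` gives 40, and any MAPE of 25 or above gives 0, matching the anchor points stated above.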

Sample size, equipment model numbers, and the full reference-meal list will be published as a downloadable CSV alongside the first batch of reviews. We will publish the scoring code on GitHub.

How we measure database quality

Coverage is sampled by querying each app for the same 50-item list (single ingredients, composed plates, regional dishes). Verification source is graded on a 4-tier scale: USDA / manufacturer label / verified user / unverified user. Noise resilience scores how often a common-foods search surfaces a usable result in the top three hits.
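The noise-resilience measure can be read as a top-three hit rate. The sketch below assumes a human grader has already marked each search result as usable or not, in ranked order; the function name and input shape are hypothetical, and how the hit rate converts into rubric points is not covered here.

```python
def top3_hit_rate(results_per_query):
    """Fraction of queries with at least one usable result in the top three.

    results_per_query: one list per search query, holding booleans in ranked
    order (True = the grader judged that result usable).
    """
    if not results_per_query:
        raise ValueError("need at least one query")
    hits = sum(1 for ranked in results_per_query if any(ranked[:3]))
    return hits / len(results_per_query)
```

A query whose first usable result appears in fourth position counts as a miss under this reading.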

How we score AI photo recognition

For each AI-photo-capable app we run a 30-plate photo battery across three lighting conditions, three angles, and three plate sizes. Sub-scoring:

Top-1 identification correctness (40 of 100 AI-subscore points)
Top-3 identification correctness (20)
Portion-size MAPE (30)
Graceful failure when confidence is low (10)
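The four components above add up to a 100-point AI sub-score. A minimal sketch of that combination follows, assuming each component is first expressed as a fraction between 0 and 1 of its available points and then scaled linearly. That linear scaling, the function name, and the parameter names are assumptions for illustration; in particular, the mapping from portion-size MAPE to a fraction is left to the caller.

```python
# Points available per component, from the sub-scoring list.
AI_POINTS = {
    "top1": 40,
    "top3": 20,
    "portion": 30,
    "graceful_failure": 10,
}


def ai_subscore(top1, top3, portion, graceful_failure):
    """Combine component fractions (each 0.0-1.0) into a 0-100 AI sub-score."""
    fractions = {
        "top1": top1,
        "top3": top3,
        "portion": portion,
        "graceful_failure": graceful_failure,
    }
    for name, value in fractions.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return sum(AI_POINTS[name] * fractions[name] for name in AI_POINTS)
```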

How we score speed

Speed is the median time from app-open to logged meal across a standardized 20-task battery covering five input modes: barcode scan, search-and-select, photo AI, custom food entry, and recurring-meal recall. Tasks are timed with a stopwatch by the same logger across all apps to remove procedural variance.

Per-task targets (full speed sub-score = meeting all five):

Barcode → logged: ≤ 10 seconds
Search common food → logged: ≤ 20 seconds
Photo AI → logged (where supported): ≤ 15 seconds
Custom food entry (first-time): ≤ 60 seconds
Re-log a recent meal: ≤ 5 seconds

An entry logged incorrectly (wrong food or wrong portion) is counted as taking infinite time for that task: speed without accuracy doesn't earn speed points.

Macro tracking quality is not a standalone weight. It is evaluated as part of the Database quality criterion (per-meal macro breakdown completeness) and the UX criterion (macro target editing flow).
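The speed rules can be sketched as below: incorrect entries become infinite times before the median is taken, and each input mode is checked against its target. The mode keys, function names, and the choice to check one representative time per mode are assumptions for illustration, not a description of our timing sheet.

```python
import math
from statistics import median

# Per-task targets in seconds, from the list in the speed section.
TARGETS_SECONDS = {
    "barcode": 10,
    "search": 20,
    "photo_ai": 15,
    "custom_entry": 60,
    "relog": 5,
}


def median_time_to_log(attempts):
    """Median time over (seconds, logged_correctly) pairs.

    An incorrect entry is treated as taking infinite time.
    """
    times = [secs if ok else math.inf for secs, ok in attempts]
    if not times:
        raise ValueError("need at least one attempt")
    return median(times)


def targets_met(time_by_mode):
    """Map each measured mode to True if its time is within the target.

    Modes the app does not support (e.g. photo AI) are simply left out.
    """
    return {
        mode: secs <= TARGETS_SECONDS[mode]
        for mode, secs in time_by_mode.items()
    }
```

Because an infinite time sorts last, a single failed task shifts the median toward slower times, and a battery where more than half the tasks fail has an infinite median.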

Test cadence

Top-tier apps are re-tested quarterly. Mid-tier apps are re-tested semi-annually. A vendor release that changes scoring methodology, database source, or core AI model triggers a 30-day re-test window.

Quality control

Until we publish named contributor bios, all writing and scoring are done by the editorial group (currently small) and reviewed against the test data before publication. Substantive corrections are logged with date and reason (see our corrections policy).

How we use AI

We use AI tools (Anthropic Claude, OpenAI ChatGPT) for research summarization and copy editing. AI does not write reviews, does not generate scores, and is never the source of a factual claim. Full disclosure: how we use AI.

Why we don't take affiliate money

We don't maintain affiliate accounts with any of the apps we cover. Our reasoning is documented in our no-affiliate disclosure.