How We Test
Every Tracker Benchmark review is scored on the same 100-point rubric. The protocol is published below in enough detail that an outside party could replicate it.
The 100-point rubric
| Criterion | Weight | What we measure |
|---|---|---|
| Accuracy | 30% | Mean Absolute Percentage Error (MAPE) vs weighed reference meals |
| Database quality | 20% | Coverage, verification source, freshness, noise resilience |
| AI photo recognition | 20% | Top-1 / top-3 identification, portion-size MAPE, graceful failure |
| Speed | 10% | Median time-to-log across a standardized 20-task battery |
| UX | 10% | Friction-of-correction, accessibility, onboarding clarity, absence of dark patterns |
| Price | 10% | Real cost after 12 months, useful free-tier surface |
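
To illustrate how the weights combine, here is a minimal Python sketch. It assumes each criterion is first scored on its own 0 to 100 scale and then weighted. The names `WEIGHTS` and `overall_score` are hypothetical and chosen for illustration; this is not the scoring code itself.

```python
# Weights from the rubric table, expressed as fractions of the total.
WEIGHTS = {
    "accuracy": 0.30,
    "database_quality": 0.20,
    "ai_photo_recognition": 0.20,
    "speed": 0.10,
    "ux": 0.10,
    "price": 0.10,
}


def overall_score(criterion_scores: dict[str, float]) -> float:
    """Combine per-criterion scores (assumed 0-100 each) into a 0-100 total."""
    missing = set(WEIGHTS) - set(criterion_scores)
    if missing:
        raise ValueError(f"missing criterion scores: {sorted(missing)}")
    # Weighted sum: each criterion contributes its score times its weight.
    return sum(WEIGHTS[name] * criterion_scores[name] for name in WEIGHTS)
```

Under that assumption, an app scoring 80 on accuracy, 70 on database quality, 60 on AI photo recognition, 90 on speed, 75 on UX, and 50 on price would total about 71.5.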
How we measure accuracy
The reference battery is built from USDA FoodData Central composition values, with portions weighed on a calibrated kitchen scale (precision 0.1 g). We compute the Mean Absolute Percentage Error (MAPE) of each app's predicted kcal vs the reference value across the battery.
Scoring anchor: accuracy score = 100 − (overall MAPE × 4), with MAPE expressed in percentage points, capped at 100 and floored at 0. A 5% MAPE earns 80 points; a 15% MAPE earns 40; a MAPE of 25% or more earns zero.
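The anchor can be sketched in a few lines of Python. This is an illustrative sketch, not the scoring code we will publish. The function names `mape` and `accuracy_score` are hypothetical, and the sketch assumes every reference value is greater than zero.

```python
def mape(predicted_kcal: list[float], reference_kcal: list[float]) -> float:
    """Mean Absolute Percentage Error, in percent, across paired meals.

    Assumes every reference value is non-zero (a weighed meal always has kcal > 0).
    """
    if len(predicted_kcal) != len(reference_kcal) or not reference_kcal:
        raise ValueError("need two non-empty lists of equal length")
    errors = [
        abs(pred - ref) / ref * 100.0
        for pred, ref in zip(predicted_kcal, reference_kcal)
    ]
    return sum(errors) / len(errors)


def accuracy_score(mape_pct: float) -> float:
    """Scoring anchor: 100 - (MAPE x 4), clamped to the 0-100 range."""
    return max(0.0, min(100.0, 100.0 - mape_pct * 4.0))
```

For example, `accuracy_score(5)` gives 80, `accuracy_score(15)` gives 40, and any MAPE of 25 or more gives 0, matching the anchor points stated above.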
Sample size, equipment model numbers, and the full reference-meal list will be published as a downloadable CSV alongside the first batch of reviews. We will publish the scoring code on GitHub.
How we measure database quality
Coverage is sampled by querying each app for the same 50-item list (single ingredients, composed plates, regional dishes). Verification source is graded on a 4-tier scale: USDA / manufacturer label / verified user / unverified user. Noise resilience scores how often a common-foods search surfaces a usable result in the top three hits.
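The noise-resilience measure can be sketched as a top-three hit rate. This is a minimal illustration under stated assumptions: the function name `top3_hit_rate` is hypothetical, and judging whether a result is "usable" is assumed to be done by the tester beforehand, with only the resulting rank passed in.

```python
from typing import Optional


def top3_hit_rate(usable_rank_per_query: list[Optional[int]]) -> float:
    """Fraction of queries whose first usable result ranks in the top three.

    Each entry is the 1-based rank of the first usable result for one query,
    or None if no usable result appeared at all.
    """
    if not usable_rank_per_query:
        raise ValueError("need at least one query")
    hits = sum(
        1 for rank in usable_rank_per_query if rank is not None and rank <= 3
    )
    return hits / len(usable_rank_per_query)
```

With four queries whose first usable results rank 1st, 3rd, 4th, and not at all, the hit rate is 0.5.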
How we score AI photo recognition
For each AI-photo-capable app we run a 30-plate photo battery across three lighting conditions, three angles, and three plate sizes. Sub-scoring:
- Top-1 identification correctness (40 of 100 AI-subscore points)
- Top-3 identification correctness (20)
- Portion-size MAPE (30)
- Graceful failure when confidence is low (10)
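The point split above can be sketched as follows. The split itself (40 / 20 / 30 / 10) comes from the list; how each raw measurement is converted into a share of its points is not specified here, so the sketch takes those shares as inputs. The names `AI_COMPONENT_POINTS` and `ai_photo_subscore` are hypothetical.

```python
# Maximum points per component, from the sub-scoring list (sums to 100).
AI_COMPONENT_POINTS = {
    "top1": 40,
    "top3": 20,
    "portion": 30,
    "graceful_failure": 10,
}


def ai_photo_subscore(shares: dict[str, float]) -> float:
    """Combine component results into a 0-100 AI sub-score.

    `shares` maps each component to the share of its points earned (0.0-1.0).
    Converting a raw measurement into that share is an assumption left to
    the caller; only the point split comes from the published list.
    """
    total = 0.0
    for name, max_points in AI_COMPONENT_POINTS.items():
        share = shares[name]
        if not 0.0 <= share <= 1.0:
            raise ValueError(f"{name} share must be between 0 and 1")
        total += max_points * share
    return total
```

As a usage example, shares of 0.8 (top-1), 0.9 (top-3), 0.5 (portion), and 1.0 (graceful failure) add up to 32 + 18 + 15 + 10 = 75 points.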
How we score speed
Speed is the median time from app-open to logged meal across a standardized 20-task battery covering five input modes: barcode scan, search-and-select, photo AI, custom food entry, and recurring-meal recall. Tasks are timed with a stopwatch by the same logger across all apps to remove procedural variance.
Per-task targets (the full speed sub-score requires meeting all five):
- Barcode → logged: ≤ 10 seconds
- Search common food → logged: ≤ 20 seconds
- Photo AI → logged (where supported): ≤ 15 seconds
- Custom food entry (first-time): ≤ 60 seconds
- Re-log a recent meal: ≤ 5 seconds
An entry logged incorrectly (wrong food, wrong portion) is counted as taking infinite time for that task: speed without accuracy doesn't earn speed points.
Macro tracking quality is evaluated as part of the Database quality criterion (per-meal macro breakdown completeness) and the UX criterion (macro target editing flow), rather than as a standalone weight.
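The speed rules can be sketched in Python as follows. The per-mode targets and the infinite-time rule come from the text above. Comparing each input mode's median time against its target is an assumption of this sketch, and the names `TARGET_SECONDS`, `effective_time`, `median_time`, and `targets_met` are hypothetical.

```python
import math
from statistics import median

# Per-mode targets in seconds, from the per-task target list.
TARGET_SECONDS = {
    "barcode": 10,
    "search": 20,
    "photo_ai": 15,
    "custom_entry": 60,
    "relog": 5,
}


def effective_time(seconds: float, logged_correctly: bool) -> float:
    """An incorrectly logged entry counts as taking infinite time."""
    return seconds if logged_correctly else math.inf


def median_time(attempts: list[tuple[float, bool]]) -> float:
    """Median effective time over (seconds, logged_correctly) attempts."""
    return median(effective_time(s, ok) for s, ok in attempts)


def targets_met(
    attempts_by_mode: dict[str, list[tuple[float, bool]]]
) -> dict[str, bool]:
    """Assumed check: a mode meets its target if its median time is within it."""
    return {
        mode: median_time(attempts) <= TARGET_SECONDS[mode]
        for mode, attempts in attempts_by_mode.items()
    }
```

Because the median ignores the magnitude of outliers, a single incorrect entry does not sink a mode on its own. Once half or more of the attempts in a mode are logged incorrectly, the median itself becomes infinite and the target is missed.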
Test cadence
Top-tier apps are re-tested quarterly. Mid-tier apps are re-tested semi-annually. A vendor release that changes scoring methodology, database source, or core AI model triggers a re-test within 30 days.
Quality control
Until we publish named contributor bios, all writing and scoring are done by the editorial group (currently small) and reviewed against the test data before publication. Substantive corrections are logged with date and reason; see our corrections policy.
How we use AI
We use AI tools (Anthropic Claude, OpenAI ChatGPT) for research summarization and copy editing. AI does not write reviews, does not generate scores, and is never the source of a factual claim. Full disclosure: how we use AI.
Why we don't take affiliate money
We don't maintain affiliate accounts with any of the apps we cover. Our reasoning is documented in our no-affiliate disclosure.