Honest numbers on a real election dataset.
We benchmark FieldworkIQ on Uchaguzi-2022 — the expert-annotated dataset of citizen reports from Kenya's 2022 election. Same test set every release. The numbers are on this page. The eval harness is in the public repo.
Uchaguzi-2022: 14,169 expert-annotated reports from the Kenyan election.
Published in 2024 by Ushahidi with academic collaborators. Two-pass annotation by trained annotators with Cohen's κ reported per category. The 500-sample expert test split is sacred — FieldworkIQ never trains on it, never tunes prompts against it, never uses its examples in few-shot.
Every FieldworkIQ release reports its accuracy on the same split. When the test set is exhausted, the next benchmark is Uchaguzi-2027 — fresh expert annotation, same protocol.
Dataset paper & data-access form
What we ran, against what, with what comparison.
The same split
FieldworkIQ v1 runs against Uchaguzi-2022's expert-test-500 split. Topic accuracy, macro-F1, tag micro-F1. Translation tested separately on FLORES-200 Swahili-English.
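For reference, the three headline metrics can be computed with scikit-learn roughly as below. This is a minimal sketch with made-up labels, not the harness's actual code, which lives in the repo.

```python
# Illustrative only: topic accuracy, topic macro-F1, and tag micro-F1
# as reported by the harness. Labels and data here are placeholders.
from sklearn.metrics import accuracy_score, f1_score

# One topic label per report in expert-test-500 (illustrative values)
y_true = ["ballot_issue", "violence", "logistics", "violence"]
y_pred = ["ballot_issue", "violence", "violence", "violence"]

topic_accuracy = accuracy_score(y_true, y_pred)             # fraction of exact matches
topic_macro_f1 = f1_score(y_true, y_pred, average="macro")  # every category weighted equally

# Tags are multi-label, so micro-F1 is computed over a binary indicator matrix.
tag_true = [[1, 0, 1], [0, 1, 0]]
tag_pred = [[1, 0, 0], [0, 1, 0]]
tag_micro_f1 = f1_score(tag_true, tag_pred, average="micro")

print(topic_accuracy, topic_macro_f1, tag_micro_f1)
```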
Two baselines
The paper's own BERT-class model gives an automated baseline. An unaided human-volunteer triage round on a 200-report sub-sample gives the practical baseline — what your team would do without FieldworkIQ.
Every result is reproducible
The eval harness is open source. Request dataset access, run make benchmark, and get the same numbers on your hardware. If the numbers move, the repo's README shows when and why.
FieldworkIQ vs. the published baseline.
FieldworkIQ v1.0, evaluated on Uchaguzi-2022 expert-test-500 in November 2027. Translation evaluated on FLORES-200 Swahili-English devtest. Numbers below are the unmodified eval harness output, deltas computed against the paper's published baseline.
The single metric that matters to your team. FieldworkIQ does the translation, category, geocoding, and corroboration lookup before the verifier opens the case. The verifier then reviews and decides — typically in 15–30 seconds for clear cases, longer for held ones.
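As an illustration only, that pre-work can be pictured as a case object the verifier opens. The names and fields below are assumptions made for the sketch, not FieldworkIQ's actual API.

```python
# Illustrative sketch of the pre-assembled case -- not FieldworkIQ's actual API.
from dataclasses import dataclass, field

@dataclass
class TriagedReport:
    raw_text: str                                  # original message, always preserved
    translation: str | None = None                 # machine translation for the verifier
    suggested_topic: str | None = None             # model's category suggestion
    topic_confidence: float = 0.0                  # uncertainty shown to the verifier
    geocode: tuple[float, float] | None = None     # resolved location, if any
    related_report_ids: list[str] = field(default_factory=list)  # corroboration lookup
    held_for_review: bool = True                   # the verifier decides every public post

def prepare_case(raw_text: str) -> TriagedReport:
    """Do the in-between work so the verifier opens a pre-assembled case."""
    report = TriagedReport(raw_text=raw_text)
    # translation, classification, geocoding, corroboration lookup happen here
    return report
```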
Where FieldworkIQ does well, and where it doesn't yet.
F1 score per topic category on the expert test set. The thin baseline mark is the published BERT-class baseline. Categories below 0.65 are flagged — FieldworkIQ holds these cases for verifier review rather than auto-suggesting a category.
The weaker categories share a common shape: small training set (under 100 examples), high stakes, and frequently ambiguous. FieldworkIQ won't auto-suggest these — it surfaces them to a verifier with the model's uncertainty visible. The verifier decides every public post regardless of category.
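The hold rule can be expressed as a small routing check. The 0.65 threshold is the one quoted above; the function and label names are illustrative, not the shipped implementation.

```python
# Illustrative sketch of the flag rule: categories whose test-set F1 falls
# below 0.65 are never auto-suggested; reports routed to them are held.
from sklearn.metrics import f1_score

AUTO_SUGGEST_F1_THRESHOLD = 0.65  # threshold stated on this page

def low_confidence_categories(y_true, y_pred, labels):
    """Return the categories whose per-class F1 is below the auto-suggest threshold."""
    per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)
    return {label for label, score in zip(labels, per_class_f1)
            if score < AUTO_SUGGEST_F1_THRESHOLD}

def route(report_category: str, flagged: set[str]) -> str:
    # Flagged categories go to a verifier with the model's uncertainty visible;
    # everything else is surfaced as a suggestion the verifier can accept or change.
    return "hold_for_verifier" if report_category in flagged else "suggest_to_verifier"
```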
Things FieldworkIQ doesn't do well — and why your verifier is still essential.
We publish what doesn't work for the same reason we publish what does: a measurement you can't reproduce is closer to marketing than to evidence. None of the below is hidden from operators in the dashboard.
Rare topics drop to ~F1 0.4–0.5
Categories with fewer than 50 training examples (cyber attacks on results portals, mass disenfranchisement claims) fall off sharply. FieldworkIQ holds these for review rather than guessing.
Geocoding fails 41% of the time on <50-char SMS
"Issue at the station, urgent" with no location name can't be auto-geocoded. FieldworkIQ flags these for a verifier to follow up with the reporter or set aside.
Verifiers override severity 28% of the time
Severity is contextual: "a small queue" in one ward can mean disenfranchisement, in another it's normal. Verifier adjustments are logged and feed the next benchmark.
Translation degrades on mixed Swahili-Sheng
Pure Swahili scores BLEU 38; reports written half in Sheng (Nairobi slang) drop to BLEU 27. The original is always preserved on file.
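The translation figures are corpus BLEU. Assuming plain-text reference and hypothesis files, they can be reproduced with sacrebleu roughly as follows; the file names are placeholders, not the repo's layout.

```python
# Minimal sketch of the translation eval: corpus BLEU on FLORES-200 Swahili-English devtest.
# File paths are placeholders; the harness wires these up from the dataset download.
import sacrebleu

with open("flores200.devtest.eng") as f:           # reference English translations
    references = [line.strip() for line in f]
with open("fieldworkiq.devtest.eng.hyp") as f:     # system output for the Swahili source
    hypotheses = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```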
Auto-publishing is never enabled when a person is named
Even on high-confidence classification, FieldworkIQ holds reports that name a third party. The verifier's job, not the model's. Same rule applies to violence allegations.
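Expressed as a guard, the rule looks roughly like the sketch below. The confidence threshold shown is illustrative; the point is that no score bypasses the hold when a person is named or violence is alleged.

```python
# Illustrative sketch of the publishing guard described above -- policy, not the shipped code.
def may_auto_publish(confidence: float, names_third_party: bool, alleges_violence: bool) -> bool:
    """No confidence score can bypass the hold when a person is named or violence is alleged."""
    if names_third_party or alleges_violence:
        return False           # always held for the verifier
    return confidence >= 0.95  # illustrative threshold only
```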
No coordinated-inauthentic-behavior detection in v1
If three reports from spoofed numbers describe the same fake event, FieldworkIQ won't currently catch it. Reporter-reputation scoring is roadmapped for v1.1.
Same dataset, same harness, same release notes.
- No silent test-set swaps. Uchaguzi-2022 expert-test-500 is the canonical benchmark for v1. If we move, the release notes say when and why.
- Test set is sacred. Examples from the test split never appear in agent prompts, few-shot examples, or fine-tuning corpora; a minimal overlap spot-check is sketched after this list.
- Every release publishes numbers. v0.1 through v1.0 numbers are in the README. Regressions are flagged explicitly.
- Open harness. The eval/ directory in the repo runs against the dataset and reproduces the numbers above.
- A v2 benchmark is coming. When we've implicitly tuned to Uchaguzi-2022, we'll commission fresh annotation — Uchaguzi-2027 is the natural opportunity.
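One mechanical way to honour the "test set is sacred" rule is a verbatim-overlap check between the test split and everything that feeds prompts or fine-tuning. The sketch below assumes JSONL files with a text field; that layout is an assumption for illustration, not the repo's actual structure.

```python
# Illustrative contamination spot-check: confirm no expert-test-500 example
# appears verbatim in prompt templates, few-shot pools, or fine-tuning corpora.
# File names and the JSONL "text" field are placeholders.
import hashlib
import json

def fingerprints(texts):
    return {hashlib.sha256(t.strip().lower().encode()).hexdigest() for t in texts}

with open("expert_test_500.jsonl") as f:
    test_texts = [json.loads(line)["text"] for line in f]
with open("training_and_prompt_corpus.jsonl") as f:
    corpus_texts = [json.loads(line)["text"] for line in f]

overlap = fingerprints(test_texts) & fingerprints(corpus_texts)
assert not overlap, f"{len(overlap)} test examples leaked into the training/prompt corpus"
```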
The numbers matter. The verifier matters more.
Every public post on your map goes through a human review. FieldworkIQ exists to do the in-between work so your verifier can spend their time on judgement, not on translation or sorting. The benchmark above measures the in-between work. Your verifier is still the one who decides.