Honest numbers on a real election dataset.
We benchmark FieldworkIQ on Uchaguzi-2022 — the expert-annotated dataset of citizen reports from Kenya's 2022 election. Same test set every release. The numbers are on this page. The eval harness is in the public repo.
Uchaguzi-2022: 14,169 expert-annotated reports from the Kenyan election.
Published in 2024 by Ushahidi with academic collaborators. Two-pass annotation by trained annotators with Cohen's κ reported per category. The 500-sample expert test split is sacred — FieldworkIQ never trains on it, never tunes prompts against it, never uses its examples in few-shot.
Every FieldworkIQ release reports its accuracy on the same split. When the test set is exhausted, the next benchmark is Uchaguzi-2027 — fresh expert annotation, same protocol.
Dataset paper & data-access form
What we ran, against what, with what comparison.
The same split
FieldworkIQ v1 runs against Uchaguzi-2022's expert-test-500 split. Topic accuracy, macro-F1, tag micro-F1. Translation tested separately on FLORES-200 Swahili-English.
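For reference, the three headline metrics can be computed with scikit-learn roughly as below. This is a minimal sketch with made-up labels, not the harness's actual code, which lives in the repo.

```python
# Illustrative only: topic accuracy, topic macro-F1, and tag micro-F1
# as reported by the harness. Labels and data here are placeholders.
from sklearn.metrics import accuracy_score, f1_score

# One topic label per report in expert-test-500 (illustrative values)
y_true = ["ballot_issue", "violence", "logistics", "violence"]
y_pred = ["ballot_issue", "violence", "violence", "violence"]

topic_accuracy = accuracy_score(y_true, y_pred)             # fraction of exact matches
topic_macro_f1 = f1_score(y_true, y_pred, average="macro")  # every category weighted equally

# Tags are multi-label, so micro-F1 is computed over a binary indicator matrix.
tag_true = [[1, 0, 1], [0, 1, 0]]
tag_pred = [[1, 0, 0], [0, 1, 0]]
tag_micro_f1 = f1_score(tag_true, tag_pred, average="micro")

print(topic_accuracy, topic_macro_f1, tag_micro_f1)
```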
Two baselines
The paper's own BERT-class model gives an automated baseline. An unaided human-volunteer triage round on a 200-report sub-sample gives the practical baseline — what your team would do without FieldworkIQ.
Every result is reproducible
The eval harness is open source. Request dataset access, run make benchmark, and get the same numbers on your hardware. If the numbers move, the repo's README shows when and why.
FieldworkIQ vs. the published baseline.
FieldworkIQ v1.0, evaluated on Uchaguzi-2022 expert-test-500 in November 2027. Translation evaluated on FLORES-200 Swahili-English devtest. Numbers below are the unmodified eval harness output, deltas computed against the paper's published baseline.
The single metric that matters to your team. FieldworkIQ does the translation, category, geocoding, and corroboration lookup before the verifier opens the case. The verifier then reviews and decides — typically in 15–30 seconds for clear cases, longer for held ones.
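As an illustration only, that pre-work can be pictured as a case object the verifier opens. The names and fields below are assumptions made for the sketch, not FieldworkIQ's actual API.

```python
# Illustrative sketch of the pre-assembled case -- not FieldworkIQ's actual API.
from dataclasses import dataclass, field

@dataclass
class TriagedReport:
    raw_text: str                                  # original message, always preserved
    translation: str | None = None                 # machine translation for the verifier
    suggested_topic: str | None = None             # model's category suggestion
    topic_confidence: float = 0.0                  # uncertainty shown to the verifier
    geocode: tuple[float, float] | None = None     # resolved location, if any
    related_report_ids: list[str] = field(default_factory=list)  # corroboration lookup
    held_for_review: bool = True                   # the verifier decides every public post

def prepare_case(raw_text: str) -> TriagedReport:
    """Do the in-between work so the verifier opens a pre-assembled case."""
    report = TriagedReport(raw_text=raw_text)
    # translation, classification, geocoding, corroboration lookup happen here
    return report
```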
Where FieldworkIQ does well, and where it doesn't yet.
F1 score per topic category on the expert test set. The thin baseline mark is the published BERT-class baseline. Categories below 0.65 are flagged — FieldworkIQ holds these cases for verifier review rather than auto-suggesting a category.
The weaker categories share a common shape: small training set (under 100 examples), high stakes, and frequently ambiguous. FieldworkIQ won't auto-suggest these — it surfaces them to a verifier with the model's uncertainty visible. The verifier decides every public post regardless of category.
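The hold rule can be expressed as a small routing check. The 0.65 threshold is the one quoted above; the function and label names are illustrative, not the shipped implementation.

```python
# Illustrative sketch of the flag rule: categories whose test-set F1 falls
# below 0.65 are never auto-suggested; reports routed to them are held.
from sklearn.metrics import f1_score

AUTO_SUGGEST_F1_THRESHOLD = 0.65  # threshold stated on this page

def low_confidence_categories(y_true, y_pred, labels):
    """Return the categories whose per-class F1 is below the auto-suggest threshold."""
    per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)
    return {label for label, score in zip(labels, per_class_f1)
            if score < AUTO_SUGGEST_F1_THRESHOLD}

def route(report_category: str, flagged: set[str]) -> str:
    # Flagged categories go to a verifier with the model's uncertainty visible;
    # everything else is surfaced as a suggestion the verifier can accept or change.
    return "hold_for_verifier" if report_category in flagged else "suggest_to_verifier"
```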
Things FieldworkIQ doesn't do well — and why your verifier is still essential.
We publish what doesn't work for the same reason we publish what does: a measurement you can't reproduce is closer to marketing than to evidence. None of the below is hidden from operators in the dashboard.
Rare topics drop to ~F1 0.4–0.5
Categories with fewer than 50 training examples (cyber attacks on results portals, mass disenfranchisement claims) fall off sharply. FieldworkIQ holds these for review rather than guessing.
Geocoding fails 41% of the time on <50-char SMS
"Issue at the station, urgent" with no location name can't be auto-geocoded. FieldworkIQ flags these for a verifier to follow up with the reporter or set aside.
Verifiers override severity 28% of the time
Severity is contextual: "a small queue" in one ward can mean disenfranchisement, in another it's normal. Verifier adjustments are logged and feed the next benchmark.
Translation degrades on mixed Swahili-Sheng
Pure Swahili scores BLEU 38; reports written half in Sheng (Nairobi slang) drop to BLEU 27. The original is always preserved on file.
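The translation figures are corpus BLEU. Assuming plain-text reference and hypothesis files, they can be reproduced with sacrebleu roughly as follows; the file names are placeholders, not the repo's layout.

```python
# Minimal sketch of the translation eval: corpus BLEU on FLORES-200 Swahili-English devtest.
# File paths are placeholders; the harness wires these up from the dataset download.
import sacrebleu

with open("flores200.devtest.eng") as f:           # reference English translations
    references = [line.strip() for line in f]
with open("fieldworkiq.devtest.eng.hyp") as f:     # system output for the Swahili source
    hypotheses = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```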
Auto-publishing is never enabled when a person is named
Even on high-confidence classification, FieldworkIQ holds reports that name a third party. The verifier's job, not the model's. Same rule applies to violence allegations.
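Expressed as a guard, the rule looks roughly like the sketch below. The confidence threshold shown is illustrative; the point is that no score bypasses the hold when a person is named or violence is alleged.

```python
# Illustrative sketch of the publishing guard described above -- policy, not the shipped code.
def may_auto_publish(confidence: float, names_third_party: bool, alleges_violence: bool) -> bool:
    """No confidence score can bypass the hold when a person is named or violence is alleged."""
    if names_third_party or alleges_violence:
        return False           # always held for the verifier
    return confidence >= 0.95  # illustrative threshold only
```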
No coordinated-inauthentic-behavior detection in v1
If three reports from spoofed numbers describe the same fake event, FieldworkIQ won't currently catch it. Reporter-reputation scoring is roadmapped for v1.1.
Same dataset, same harness, same release notes.
- No silent test-set swaps. Uchaguzi-2022 expert-test-500 is the canonical benchmark for v1. If we move, the release notes say when and why.
- Test set is sacred. Examples from the test split never appear in agent prompts, few-shot examples, or fine-tuning corpora; a minimal overlap spot-check is sketched after this list.
- Every release publishes numbers. v0.1 through v1.0 numbers are in the README. Regressions are flagged explicitly.
- Open harness. The eval/ directory in the repo runs against the dataset and reproduces the numbers above.
- A v2 benchmark is coming. When we've implicitly tuned to Uchaguzi-2022, we'll commission fresh annotation — Uchaguzi-2027 is the natural opportunity.
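One mechanical way to honour the "test set is sacred" rule is a verbatim-overlap check between the test split and everything that feeds prompts or fine-tuning. The sketch below assumes JSONL files with a text field; that layout is an assumption for illustration, not the repo's actual structure.

```python
# Illustrative contamination spot-check: confirm no expert-test-500 example
# appears verbatim in prompt templates, few-shot pools, or fine-tuning corpora.
# File names and the JSONL "text" field are placeholders.
import hashlib
import json

def fingerprints(texts):
    return {hashlib.sha256(t.strip().lower().encode()).hexdigest() for t in texts}

with open("expert_test_500.jsonl") as f:
    test_texts = [json.loads(line)["text"] for line in f]
with open("training_and_prompt_corpus.jsonl") as f:
    corpus_texts = [json.loads(line)["text"] for line in f]

overlap = fingerprints(test_texts) & fingerprints(corpus_texts)
assert not overlap, f"{len(overlap)} test examples leaked into the training/prompt corpus"
```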
The numbers matter. The verifier matters more.
Every public post on your map goes through a human review. FieldworkIQ exists to do the in-between work so your verifier can spend their time on judgement, not on translation or sorting. The benchmark above measures the in-between work. Your verifier is still the one who decides.