
Why AI Benchmarks Miss Real-World Deployment Risks

Imagine a loan officer at a savings cooperative in rural Kenya. She has just received an AI-generated credit score for a smallholder farmer applying for a loan to buy seed before the planting season. The score says: high risk—not because of poor repayment behavior, but because the model places significant weight on formal indicators such as documented land ownership and standardized income records. This farmer’s land is family-held without a formal title deed, a common situation among smallholder farmers, which makes it difficult for the system to fully recognize his creditworthiness.
This farmer is a long-standing customer. The loan officer looks through his recent mobile transactions, his repayment history with the cooperative, his track record across two harvests. She overrides the system’s decision—and is probably right to do so—but she should not have to.
This scenario reflects a real tension in how AI credit tools are being deployed across Africa. For decades, smallholder farmers were excluded from the formal credit system. This is not because they were uncreditworthy, but because traditional collateral-based lending, which relies on land titles, salary slips, or formal credit bureau records, had no way to see them or to account for the community-based banking practices common in informal economies.
Most African countries still lack comprehensive credit bureaus, so lenders have long had to work around this gap. That context matters: the shift to alternative data is not new; it is the direction the industry has been moving in for years. Research presented at the 2024 International Conference on Electrical, Computer, Communications and Mechatronics Engineering shows that models using mobile phone usage patterns, such as call frequency, data usage, and mobile money transactions, can reach prediction accuracy of around 90%, compared to under 80% for traditional methods. In Kenya, platforms are adjusting their scoring algorithms during agricultural seasons to account for different cash flow patterns among farming communities. This is why contextualising AI models matters.
So the question is not whether AI can score this farmer’s creditworthiness; it increasingly can. The question is whether the systems being deployed are evaluated against the real conditions they will face.
The AI model in the above example did not fail because data was absent. It failed because the data was distributed across platforms and payment channels in ways that published benchmark evaluations rarely simulate. Many production systems are modular and proprietary; banks can weigh different variables depending on their customer base, and internal benchmarks are seldom made public. But the academic benchmarks that drive research, influence procurement, and shape how funders assess AI tools tend to use curated datasets that assume relatively complete and structured records. Neither type of benchmark transparently answers the question that matters most: does this system perform reliably for the people it is meant to serve, under the conditions they actually live in?
That is the gap this piece is concerned with—and it runs deeper than credit scoring alone. It shapes how AI systems are built, funded, and trusted across every sector where the Global Majority depends on them most.
Researchers have criticised AI benchmarks before. Raji et al. showed that AI auditing frameworks are often more about public perception than meaningful accountability. The AI Now Institute has documented how benchmark scores can hide deep structural problems in how AI systems are built and used. More recently, important work has emerged to build safety benchmarks in low-resource languages, including several African languages, to ensure that AI systems do not cause harm in these contexts. This is valuable and necessary work.
There remains a gap that this research has not yet closed. Even the best new benchmarks still evaluate AI models in isolation, measuring what a model predicts under controlled test conditions. What they do not evaluate is whether an AI system (the model plus the data environment, the infrastructure, the human operator, the institution) actually works where it is deployed. That is the distinction this piece argues matters most, and it is most consequential in Global Majority contexts.
What We Are Missing: The Deployment Benchmark
A deployment benchmark does not just ask: "Does this model predict accurately on a test dataset?" It asks: "Does this system perform reliably in the conditions where it will actually be used?" Those are very different questions.
Consider what that would mean for a credit scoring system deployed in rural Kenya. A genuine deployment benchmark would examine how the model behaves when a borrower's transaction history is spread across multiple mobile money platforms, a normal pattern for people managing finances informally. It would assess whether the system remains useful when connectivity is intermittent, or when the person interpreting the output is a cooperative loan officer rather than a data scientist. These are not edge cases; they are baseline realities in many deployment contexts.
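As a rough illustration of the difference, the sketch below compares a toy credit model's F1 score on complete records against the same model scored on records where a share of features is unobserved, a crude stand-in for transaction histories fragmented across mobile money platforms. The features, data, and the 40% missingness rate are illustrative assumptions, not details of any production system.

```python
# Minimal sketch of a deployment-style stress test: score the model on a
# clean test split, then on the same split with features randomly dropped
# to mimic records fragmented across platforms. All data here is synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-in for borrower features (e.g. mobile-money volume, repayment
# history, airtime top-ups). Real deployments would use platform data.
X = rng.normal(size=(5_000, 6))
y = (X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=5_000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)

def fragment(X, missing_rate, rng):
    """Simulate records split across platforms: some features unobserved."""
    X_frag = X.copy()
    mask = rng.random(X.shape) < missing_rate
    X_frag[mask] = 0.0  # unobserved channels contribute nothing
    return X_frag

# "Benchmark" score on complete records vs. deployment-like fragmented records
clean_f1 = f1_score(y_test, model.predict(X_test))
frag_f1 = f1_score(y_test, model.predict(fragment(X_test, 0.4, rng)))
print(f"F1 on complete records:   {clean_f1:.2f}")
print(f"F1 with 40% data missing: {frag_f1:.2f}")
```

The point of the comparison is not the specific numbers; it is that the gap between the two scores is exactly the information a single benchmark figure leaves out.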
The challenge is not that banks and fintechs are ignoring these issues. Many are actively working on them. Credit scoring systems used in core banking software are often modular, allowing lenders to weigh variables differently depending on their customer base and adjust for local conditions. Some of the most innovative work in alternative data scoring is happening precisely in markets where formal credit infrastructure is thin. Yet the benchmarks used to evaluate, compare, and procure AI systems, particularly the published academic benchmarks that shape research agendas and inform funding decisions, rarely reflect these conditions. They tend to assume relatively complete data, stable infrastructure, and a technically literate user. And because proprietary benchmarks are seldom made public, there is no way to independently verify whether the systems being deployed have been tested against the realities they will actually face.
That transparency gap is the problem. It is not that the technology cannot handle these conditions; increasingly, it can. It is that the evaluation frameworks used to assess and procure it do not require proof that it does.
Why This Hits Hardest in the Global Majority
In well-resourced settings, when an AI system underperforms, there are usually backstops: specialist review, alternative providers, institutional protocols. Loan applicants go to another lender. Workers escalate to a manager. In Global Majority contexts, those buffers often do not exist. The AI system may be the only formal credit assessment available to a smallholder farmer, and the savings cooperative may be the only place they can access the capital needed to scale their farming operations.
When these systems fail silently—performing well on benchmarks but poorly in deployment—the people least able to absorb that failure pay the price.
The Masakhane Research Foundation has shown how natural language technologies consistently underperform for African languages in real conditions, even when benchmark scores look reasonable. World Bank research on alternative credit scoring has documented how models built on conventional financial data misjudge risk for populations whose financial lives are informal. These are not edge cases. They are the primary use case for AI in much of the world.
What Needs to Change
The fix is not simply more representative datasets, though those matter. It is changing what counts as evidence that a system works. Governments and development organisations procuring AI systems for public services should require evidence of deployments in comparable environments, not just benchmark scores from laboratory evaluations. When a ministry of agriculture buys a crop stress prediction tool, the question should not only be: "What was its F1 score (the balance of precision and recall)?" It should be: "Was it tested in areas with sparse or missing weather station data? Did field officers find it useful? Did it improve decisions?"
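One concrete way to operationalise that question is to report accuracy broken out by deployment condition rather than as a single aggregate number. The sketch below is a minimal, hypothetical example: it groups predictions by weather-station coverage and scores each group separately; the stratum labels and the toy data are assumptions for illustration.

```python
# Minimal sketch: report F1 per deployment stratum (e.g. dense vs. sparse
# weather-station coverage) instead of one aggregate score.
from collections import defaultdict
from sklearn.metrics import f1_score

def f1_by_stratum(y_true, y_pred, strata):
    """Group predictions by a deployment condition and score each group."""
    groups = defaultdict(lambda: ([], []))
    for t, p, s in zip(y_true, y_pred, strata):
        groups[s][0].append(t)
        groups[s][1].append(p)
    return {s: f1_score(t, p) for s, (t, p) in groups.items()}

# Toy example: the same predictions, broken out by station coverage
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
strata = ["dense", "dense", "dense", "sparse",
          "sparse", "sparse", "sparse", "dense"]
print(f1_by_stratum(y_true, y_pred, strata))
# A gap between "dense" and "sparse" is exactly what an aggregate F1 hides.
```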
This kind of human-centred evaluation, measuring whether a system improves the decisions of the people who use it, not just the accuracy of its predictions, is the most important shift we can make. It would surface the loan officer's override as data rather than noise. It would reveal where models are failing users who cannot easily walk away. It would make benchmark scores answer to the realities of the deployment context, not just the lab.
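For the override to become data, the deployment itself has to record it. A minimal sketch of what such a record might look like follows; the field names, decision labels, and logging approach are assumptions, not a description of any existing system.

```python
# Minimal sketch: log human overrides of model decisions and track the
# override rate as a deployment-level evaluation signal.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Decision:
    applicant_id: str
    model_score: float       # model's risk estimate (higher = riskier)
    model_decision: str      # "approve" or "decline"
    officer_decision: str    # what the loan officer actually did
    reason: str = ""         # free-text justification for an override
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

    @property
    def overridden(self) -> bool:
        return self.model_decision != self.officer_decision

def override_rate(decisions: list[Decision]) -> float:
    """Share of cases where the officer disagreed with the model --
    treated here as a signal to investigate, not as noise."""
    if not decisions:
        return 0.0
    return sum(d.overridden for d in decisions) / len(decisions)

log = [
    Decision("A-101", 0.81, "decline", "approve",
             reason="strong repayment history with the cooperative"),
    Decision("A-102", 0.22, "approve", "approve"),
]
print(f"Override rate: {override_rate(log):.0%}")
```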
Researchers working on safety benchmarks in low-resource settings are asking the right questions about what AI says. We now need to ask equally hard questions about what AI does in the places, and for the people, where getting it wrong is least forgivable.
About the Author
Brian Gillo is an AI researcher and project practitioner focused on building intelligent systems and digital infrastructure across sub-Saharan Africa. His work spans key sectors including financial inclusion, agriculture, and public health, where he explores how technology can drive equitable access and better decision-making. This piece draws on insights from both his research and hands-on fieldwork.