The Colonial Mirror, Part 2: How Western Data Shapes Global AI
AI did not emerge from neutral ground. It was trained on archives assembled by those who digitised first, wrote most, and owned the servers, predominantly across Europe and North America since the 1990s. In today’s frontier systems, that centre of gravity still holds. The model that claims to reflect humanity remains angled toward the West.
The mirror is not neutral
Large language models do not think; they calculate probability. What appears most often becomes what the machine treats as normal. Instruction tuning can steer style and guardrails at the margins, but it cannot erase the gravitational pull of the base distribution.
The reality is mathematical, not moral. The most complete digitised archives, the most cited web crawls, and the most linked sites remain overwhelmingly English and Western European. Even when new datasets broaden their linguistic range, the centre of mass stays Anglophone because that is where the infrastructure, funding, and compute reside.
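The arithmetic behind that centre of mass can be made concrete with a toy sketch. The numbers below are invented, and reducing a training set to one token per language is a drastic simplification of how real models work, but the mechanism holds: a frequency model reproduces its input skew one for one.

```python
from collections import Counter

# Toy "training corpus": each token stands in for a page in one language.
# The counts are illustrative only, not real crawl statistics.
corpus = ["en"] * 55 + ["de"] * 15 + ["fr"] * 10 + ["sw"] * 2 + ["yo"] * 1

counts = Counter(corpus)
total = sum(counts.values())

# A unigram model "generates" each token in proportion to its frequency,
# so the skew of the inputs becomes the skew of the outputs.
probs = {tok: n / total for tok, n in counts.items()}
for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok}: {p:.2%}")
```

In this sketch English ends up with roughly two thirds of the probability mass because it has roughly two thirds of the corpus; no step in the pipeline needs to intend that outcome.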
This is not simply about words. It is about worldview. Western archives encode the default of the West: individualism as virtue, secular humanism as the moral centre, rational empiricism as the measure of truth. The histories of empire retold in the languages of empire. A model trained on that archive reproduces the same priorities, only polished by scale.
In 2021, Bender and colleagues warned that such systems would amplify the biases that saturate their training data. They were right.
Data colonialism in practice
There is a term for this pattern: data colonialism. It describes how digital life is harvested, centralised, and monetised by a handful of firms in rich countries, who then sell moderation, analytics, and machine judgement back to the world.
The pipelines are familiar. Common Crawl, a massive scrape of the web, supplies much of the raw text. In the filtered subsets used by labs, English pages still exceed 50 per cent of the total. The result: what dominates the crawl dominates the model. LAION's five-billion-image dataset follows the same arc; independent audits have found Western or English-tagged content comprising more than 80 per cent of labelled images.
The pattern extends beyond data. Annotation and moderation labour, the tedious, sometimes traumatic work of classifying content, is often outsourced to low-wage workers in Kenya, the Philippines, or India. The profits remain offshore; the human cost does not.
African scholars call it algorithmic colonisation: infrastructure, ethics, and profit extracted elsewhere, returned as dependency. Their remedy is not inclusion by invitation. It is sovereignty: local data, local models, local control.
Key mechanisms of the colonial relation
- Skewed inputs: English-heavy web crawls and Western-tagged image corpora set model priors.
- Offshore labour: annotation and moderation outsourced to low-wage markets, value captured elsewhere.
- Platform feedback: moderation and ranking rules then sold back as global standards.
The training sets that set the world
Common Crawl may be open, but openness is not neutrality. Its composition shapes the statistical weather of every model that consumes it. If the inputs skew West, the outputs will mirror that skew. It is not malice. It is arithmetic.
Vision systems repeat the pattern. LAION’s image datasets drove the rise of diffusion models and generative AI. Yet repeated audits show the same cultural skews: Western faces, Western contexts, Western aesthetics. The mirror speaks with a Western voice and sees with a Western eye.
Attempts to break the centre
Some researchers are trying to move the centre of gravity. The BigScience consortium built BLOOM, trained on the multilingual ROOTS corpus, dozens of languages, hundreds of collaborators, fully open. It proved that collective effort can produce a public model not owned by any single firm or state.
Others have expanded in parallel: mC4 and OSCAR added vast non-English web text to the global pool. Yet token share and quality imbalances keep the bias intact. English remains the statistical sun around which the smaller languages orbit.
Outside the labs, the Masakhane project in Africa has gone further, creating tools, corpora, and evaluation benchmarks from scratch. They are not asking for access. They are building sovereignty in code. That is what decolonisation looks like when it becomes infrastructure.
Alignment inherits the archive
The ethical layer that follows training, what the industry calls alignment, inherits the same Western frame. The rhetoric is universal; the logic is not.
Safety policies are written in English, benchmarked against U.S. and E.U. liability norms, and exported as global defaults. Other jurisdictions (China, Singapore, the Gulf) write their own frameworks, but the APIs that serve the world still default to Western legal risk models.
The consequence is a single, polite tone: reputational caution, corporate self-protection, and a preference for inoffensive civility. It is not universal morality. It is risk management as seen from Silicon Valley.
Mohamed, Png, and Isaac argued for a decolonial AI, one that begins by naming where power sits and designing with that fact in view. That remains the north star.
What honesty would require
- Provenance: publish language and region shares for each dataset and checkpoint.
- Reciprocity: when a community’s data trains a system, share benefits and decisions.
- Plurality: host parallel policy stacks aligned to different legal and moral orders.
- Investment: fund digitisation and compute in the Global South as necessity, not philanthropy.
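The provenance point above could be as small as a machine-readable datasheet shipped with every dataset or checkpoint release. A minimal sketch in Python follows; the field names and share figures are hypothetical, not a published standard.

```python
import json

# Hypothetical provenance manifest for a dataset/checkpoint release.
# All names and numbers here are illustrative assumptions.
manifest = {
    "dataset": "example-web-corpus-v1",
    "checkpoint": "example-model-step-100000",
    "language_shares": {"en": 0.58, "de": 0.06, "fr": 0.05, "other": 0.31},
    "region_shares": {"north_america": 0.41, "europe": 0.33, "rest_of_world": 0.26},
}

# Shares should sum to 1 within rounding; a release script could enforce this
# before anything ships.
for key in ("language_shares", "region_shares"):
    assert abs(sum(manifest[key].values()) - 1.0) < 1e-6

print(json.dumps(manifest, indent=2))
```

Publishing a file like this alongside every release would make the skew auditable at a glance, rather than something reconstructed after the fact by outside researchers.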
The cooperative counterweight
Despite all this, language itself carries a subversive grace. When people speak freely, they tend to explain, comfort, and seek understanding more than they threaten. The probabilities bend toward cooperation because that is how we mostly live.
The machine, if left honest, will follow those same frequencies. It could mirror a species inclined to coexist, if the mirror were wide enough to include everyone.
Bottom line
The strongest models today are global in reach but Western at the centre. That is the colonial mirror, a world reflected through the eyes of its former masters, now digitised and scaled.
Breaking it will take more than open access slogans. It will take new archives, new custodians, and new jurisdictions, a redistribution of compute as profound as the redistribution of power. Until then, every story the machine tells about the world will still arrive with a familiar accent.
You Might Also Like
- Beyond the Black Box: What Kind of Intelligence Are We Building? — explores the deep structure of modern AI systems and their political meaning.
- Strange Loops in AI — WARNING: You’re Talking to a Mirror — how human–machine dialogues feed back into thought itself.
- Strange Loops in AI — Part 2: Catching the Pulse — on emergent behaviour and self-referential cognition.
- The Billionaires’ Empire: Who Controls AI’s Future — a power-map of the oligarchic interests shaping AI development.
- The End of Search: How AI Will Replace Google — why generative models threaten the economics of the web itself.
