Who Gets to Train the AI That Will Rule Us
The public conversation about artificial intelligence still circles around visible products. One system writes emails. Another draws pictures. A third suggests code. All of that is a distraction. The decisions that matter are taken far upstream, in the places where vast datasets are assembled, loss functions are written, and alignment teams are told what to reward or punish.
Those upstream choices are now in the hands of a very small club. A few firms in the United States and China, with a scattering of players in Europe and the Gulf, control the compute, the engineering talent and the cloud infrastructure for what are politely called foundation models. These systems are already being wired into welfare screening, tax enforcement, visa decisions, hiring, grading, health triage, customer service and content moderation. They are being dropped into public life as if they were neutral utilities, when in fact they are privately trained black boxes.
What the real battle over artificial intelligence is about
When a ministry plugs a commercial model into its case management system, the institution is not just buying software. It is accepting someone else’s judgement about whose voices were included in the training data, which behaviours were rewarded during fine tuning, and which answers were suppressed in the name of safety. When a newsroom outsources first drafts to a closed model, the editor is no longer the only gatekeeper of what enters the public sphere. A silent extra editor sits upstream and can never be cross examined.
The owners of these systems insist that they are too complex to explain and too valuable to disclose. They offer glossy policy notes about responsible artificial intelligence while refusing basic questions about what their models have digested and how they were optimised. Regulators in turn have been slow to understand that this secrecy is not a side issue. It is the core of the power shift.
- Most state of the art models are built by a handful of firms with the capital to run training runs that cost tens or hundreds of millions of dollars.
- Public bodies in many countries are piloting or deploying such models in justice, welfare and health, even when they cannot see the underlying training record.
- Technical and legal scholars now treat opaque models in public decision making as a direct clash with basic requirements of accountability and the rule of law.
Why training the model means writing new rules
Training is not a neutral engineering exercise. It is a process of law making by other means. A model is fed a massive diet of text, images and code. The developers then define which outputs are “helpful”, which are “harmful”, and which questions the system should refuse to answer at all. Reinforcement learning from human feedback turns those judgements into a statistical constitution. The result is a black box that does not simply reflect the world. It reflects the values of the people who built it and the interests of the institutions that pay for it.
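To make that concrete, here is a minimal sketch, assuming PyTorch, of the pairwise preference objective commonly used to train reward models in pipelines of this kind. The names and numbers are illustrative rather than taken from any vendor's system; the point is simply that a labelling guideline becomes a term in a loss function.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry style) objective for training a reward model.

    Each pair comes from a human labeller who marked one response as
    "better". Minimising this loss pushes the model to score whatever the
    labelling guidelines call better more highly, so the guidelines
    themselves become part of the optimisation target.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores for three labelled pairs: the gradient flows toward whichever
# answers the labellers were instructed to prefer.
chosen = torch.tensor([1.2, 0.3, 2.0], requires_grad=True)
rejected = torch.tensor([0.9, 0.8, -0.5], requires_grad=True)
loss = preference_loss(chosen, rejected)
loss.backward()
print(float(loss), chosen.grad)
```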
Once that black box is embedded in infrastructure, contesting its choices becomes extremely difficult. The citizen rarely knows that a model sat in the loop. The official using the system rarely understands its inner workings. In many jurisdictions, the vendor will argue that training data and model internals are trade secrets. The effect is to create decisions that feel bureaucratic but are in fact private law in disguise. Someone has changed the rules of access to jobs, loans, benefits or information, and nobody can see exactly how.
This is why secrecy around training data and objectives is so dangerous. It is presented as safety or intellectual property. In practice it gives the owners of the system a veto over what is knowable about the machine that now sits between citizens and the state. When the model behaves unjustly, the victims are invited to appeal inside a process that cannot be properly inspected.
How prediction engines turn into instruments of control
Traditional surveillance states relied on cameras, informers and files. They watched, recorded and occasionally punished. Scaled artificial intelligence offers something quieter and more comprehensive. These systems compress patterns in behaviour into dense internal representations and learn to predict what similar people are likely to do next. Once you can predict, you can steer.
A model integrated into a social platform does not need to ban speech to shape public life. It can downrank some topics and elevate others. It can tailor content that keeps some groups agitated and others calm. A model wired into hiring does not need a rule that says “do not recruit people from this postcode”. It only needs to reproduce the patterns in historic data and optimise for the profiles that led to past promotions. In both cases, the system presents itself as neutral and efficient. In both cases, bias and exclusion can be baked into the loss function, as the sketch below illustrates.
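Here is a toy illustration of that mechanism, using synthetic data and scikit-learn. The feature names and numbers are invented for the example; what matters is that no rule ever mentions postcode, yet the learned weight on the postcode proxy shows how past decisions flow into future scores.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Synthetic "historic" hiring data. Skill is what the system claims to
# select for, but past decisions also favoured postcode group 1, so the
# outcome and the postcode are correlated in the record.
skill = rng.normal(size=n)
postcode_group = rng.integers(0, 2, size=n)  # 0 or 1
past_hired = (skill + 1.5 * postcode_group + rng.normal(scale=0.5, size=n)) > 1.0

X = np.column_stack([skill, postcode_group])
model = LogisticRegression().fit(X, past_hired)

# No rule says "prefer postcode group 1", yet the learned weight on the
# postcode feature is large: the model has absorbed the historic pattern.
print(dict(zip(["skill", "postcode_group"], model.coef_[0].round(2))))
```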
- Data protection scholars have warned that profiling and automated scoring already allow institutions to treat some people as low value or high risk without ever telling them why.
- Generative models extend this reach by not only rating people but also generating personalised content that can nudge behaviour toward desired outcomes.
- Once deployed at scale, these engines do not need visible censorship. Quiet changes to what is suggested, shown or rewarded are often enough.
How artificial intelligence becomes a colonial infrastructure
The concentration of training power is not only a problem within wealthy states. It also has a clear international dimension. For many countries in Africa, Latin America and parts of Asia, the real choice is not between several domestic models. It is between renting systems from American cloud providers or from increasingly capable Chinese vendors. Either way, they risk importing a foreign black box into their courts, schools, media and public administration.
Researchers now speak openly about data colonialism and artificial intelligence colonialism. The argument is straightforward. Data are extracted from people everywhere. The most valuable models are trained in a few centres in the Global North. Those models are then sold back to poorer states as ready made intelligence and infrastructure. The patterns and priorities of wealthier societies are embedded in code and quietly exported as if they were neutral knowledge.
This is not an abstract worry. Policy work on global governance has already documented how tools built on English language datasets perform poorly in other languages, misunderstand local norms and exclude marginalised communities. Reports on the Global South warn that a new divide is opening up between states that can train or at least shape their own models and those that are reduced to permanent clients of foreign providers. In parallel, there is growing concern that cheap open models from one bloc may become the default standard for cash strapped governments elsewhere, not because they are the best fit, but because they are free.
- Training datasets dominated by English and a narrow slice of global culture, leading to systems that underperform or discriminate in other contexts.
- Public services in poorer countries increasingly built on rented models whose training process and value judgements were set elsewhere.
- Scholars and civil society groups in the Global South calling for local control over data, models and objectives rather than passive dependence.
What an open training rule would look like
If we accept that these systems are becoming a layer of public infrastructure, then a simple rule follows. No model that is trained entirely in secret by a private entity should be allowed to operate as default infrastructure for the public. The burden of justification should sit with those who want to keep a system closed while still enjoying public contracts and regulatory indulgence.
In practice that means several things. First, any model used in government, critical infrastructure, health, education, employment screening or justice should come with a publicly accessible training record. That does not require every document in a dataset to be dumped online. It does require clear documentation of what kinds of data were used, how they were sourced, which groups are underrepresented and what filters were applied.
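One possible shape for such a record, sketched here as JSON generated from Python. Every field name below is hypothetical; no existing standard with this exact schema is implied, only the level of detail a public record would need.

```python
import json

# Hypothetical structure for a public training record. The field names are
# illustrative and do not follow any particular published standard.
training_record = {
    "model": "example-public-model-1",
    "data_sources": [
        {"type": "web_crawl", "languages": ["en", "fr", "sw"], "share": 0.62},
        {"type": "licensed_news", "languages": ["en"], "share": 0.18},
        {"type": "public_records", "languages": ["en", "ar"], "share": 0.20},
    ],
    "known_gaps": ["low coverage of non-Latin scripts", "few Pacific sources"],
    "filters_applied": ["deduplication", "toxicity classifier", "illegal content blocklist"],
    "fine_tuning_objectives": ["helpfulness rubric", "refusal policy"],
    "audit_contact": "audits@example.org",
}

print(json.dumps(training_record, indent=2))
```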
Second, the weights and architecture of such models should be open to independent auditors under law, even if full public release is not always appropriate. The current fashion for “trust us, we hired our own red team” is not good enough. External experts, including from affected communities, must be able to test how systems behave in the real world and publish their findings without fear of contractual retaliation.
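What such external testing might look like in practice can be sketched briefly, assuming only that the audited system exposes some query interface. The `query_model` stub, the prompt template and the demographic variants below are all stand-ins invented for the example, not a real client for any provider.

```python
from collections import Counter

def query_model(prompt: str) -> str:
    # Stand-in for the audited system's API call; replace before running.
    raise NotImplementedError("connect this to the system under audit")

TEMPLATE = "Should this applicant from {area} be invited to interview? Answer yes or no."
AREAS = ["a wealthy suburb", "a low-income estate"]

def audit(n_trials: int = 100) -> dict:
    """Send matched prompts that differ only in one demographic detail
    and tally the answers, so that 'yes' rates can be compared across
    the two conditions."""
    results = {area: Counter() for area in AREAS}
    for _ in range(n_trials):
        for area in AREAS:
            answer = query_model(TEMPLATE.format(area=area)).strip().lower()
            results[area][answer] += 1
    return results
```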
Third, states and alliances should invest aggressively in genuinely public models whose training is governed in the open and whose objectives are set by constitutional norms rather than private profit. That does not prevent private firms from building their own systems. It ensures that there is always a public option that is not wired to a single corporate balance sheet or foreign capital market.
- Clear public documentation of training data sources, filters and known gaps for any model used in public decision making.
- Legal rights for regulators and independent experts to examine model internals and publish safety findings.
- A funded ecosystem of public and national models so that no country is forced to rent its cognitive infrastructure from a single foreign provider.
Answering the security objection
Critics argue that more openness will hand powerful tools to criminals and hostile states. There are real risks here. Open tooling has already been abused in areas such as fraud and information operations. But secrecy about training is a poor shield. Closed models can be jailbroken, probed and misused just as easily, while their flaws remain hidden from the people they affect.
Regulatory work on accountability points in a different direction. Instead of treating secrecy as protection, it emphasises the need for documented risk assessments, sector specific controls and continuous monitoring of deployed systems. Openness here is not a gesture toward idealism. It is a practical way to let independent researchers find problems before they harden into new structures of harm.
There is also a question of symmetry. The more that states rely on foreign black boxes, the more vulnerable they become to pressure and withdrawal. Open weight models and open source tools, when combined with strong governance and careful data curation, can reduce that dependency. They make it possible for smaller states, public interest institutions and civil society to build their own agents that serve their own priorities.
Deciding who the black boxes really work for
The danger is not that artificial intelligence will suddenly become conscious and turn against us. The more immediate danger is that we will sleepwalk into a settlement where most significant decisions pass through black boxes that were trained in private and aligned to someone else’s interest. Once that settlement is in place, it will be very hard to undo. Contracts will be signed. Workflows will be redesigned. Legal doctrines will slowly adjust to treat opaque systems as normal.
There is still room to resist that outcome. Legislatures can draw a bright line around the use of closed corporate models in public functions. Regulators can demand training records, independent audits and clear routes of redress for people harmed by automated decisions. Courts can treat proprietary secrecy as a weak answer when liberty, livelihood and equality before the law are at stake. Citizens can insist that if black boxes are going to mediate their lives, they at least be trained in the open.
The window for that insistence will not stay open forever. Every month that passes sees more agencies, companies and schools wiring closed models into their core systems on the understanding that this is simply modernisation. If the terms of that integration are not challenged now, we may soon find that the rules of everyday life are being written in training runs that nobody outside a small circle is allowed to see. That is not a future we have to accept, but it is a future that secrecy makes very likely.
Related articles
- Censoring the Mirror: The Politics of AI Training — How alignment and safety language can become a cover for narrative control inside training pipelines.
- When Prediction Becomes Control: The Politics of Scaled AI — Why large models turn statistical prediction into a quiet machinery of governance.
- AI, Manipulation, and the Strange Loop — On emotional persuasion, feedback loops and the way chatbots can reshape human belief.
- The AI Boom Without Exit: Mania, Markets, and the Madness of Crowds — A look at how financial markets are funding an infrastructure binge with no clear exit path.
- One Intelligence to Predict Them All — How rival language models imitate and absorb each other until they behave like a shared global mind.
- London Leads Europe in AI, but Without Power and Capital, It’s an Empty Crown — Why data centres and cheap electricity, not slogans, decide who actually leads.
- AI Will Learn from Us and That’s What Should Terrify Us — On machine curiosity, human feedback and why our own behaviour is the real training risk.
- The Colonial Mirror Part 2: How Western Data Shapes Global AI — An examination of how Western datasets encode bias into global systems.
- Britain at the Crossroads: Teaching Resilience in the Age of AI — What education systems need to change if pupils are to live under constant machine mediation.
- The End of the Page: How AI Is Replacing the Web We Knew — On answer engines, the collapse of the open web and the fight over discovery.
References
| Source | Notes |
|---|---|
| Chatham House, “Artificial intelligence and the challenge for global governance” | Sections on open source and resisting artificial intelligence colonialism. |
| OECD, AI openness work | OECD reports and essays on open weight models and accountability, explaining how transparency enables external testing and public scrutiny. |
| CSET, “Open Foundation Models” | Analysis of open foundation models and the limits of open weights without deeper documentation of training and design choices. |
| Responsible artificial intelligence and decolonising work | Research on data colonialism, algorithmic colonialism and the marginalisation of Global South languages and values in present systems. |
| Couldry and Mejias, data colonialism | Theoretical grounding for data colonialism as a new social order built on continuous extraction and profiling. |
| CSIS and related work on artificial intelligence in the Global South | Case studies of Kenya, Nigeria and other states using open source models to reduce dependence on foreign platforms. |
| OECD, accountability in artificial intelligence | Guidance on risk management, documentation and lifecycle oversight for systems used in high stakes domains. |
| Telegraph Online, artificial intelligence series | Prior Telegraph pieces on training politics, prediction, manipulation and artificial intelligence colonialism that this article extends. |
