While ministers pose at safety summits and talk about science fiction risks, the real settlement is being written somewhere else. It is not about killer robots. It is about who owns the training data that feeds the engines running modern artificial intelligence, and who is quietly being turned into free raw material for systems they will never control.
The polite version of the story goes like this. Artificial intelligence companies collect great pools of text, images and audio. They use them to train models that answer questions, generate images, write code and summarise documents. Society benefits from new tools. Creators and publishers get more readers. Everyone wins.
The version that is actually playing out is different. A handful of very large model makers are buying peace from a handful of very large rights holders. They sign private deals with big news groups, image libraries and stock houses, while insisting that everything else on the open web is fair game. At the same time, they are fighting off lawsuits from authors, newspapers and media companies who were not invited to the signing ceremony.
Somewhere between those two poles sit the governments, including Britain’s, pretending that hosting a summit or publishing a strategy is the same as deciding where power lands. It is not. The true question is simple and entirely concrete. Who owns the training data? Who sets the price? Who is allowed to say no?
- Large models are trained on vast text and image collections: scraped public web pages, digitised books, code repositories, news archives and stock image libraries.
- Some of that material is openly licensed or genuinely public domain. Some is licensed under private contracts. A great deal is simply taken and defended after the fact as fair use or under text and data mining exceptions.
- Once a model is trained, the original sources are no longer visible, but the patterns and sometimes fragments of the works remain inside the system.
Whoever controls access to these corpora and the terms of their use controls the economic heart of modern artificial intelligence.
What is training data and why does it matter more than summits
Training data is not decoration. It is the engine oil and the fuel. Without vast corpora of human created text and images, the most advanced model is just an empty shell. That is why the quiet race underway in boardrooms is not about releasing flashy chat interfaces but about locking down privileged streams of data.
The pattern is already clear. Major model makers have signed content licensing deals with agencies like Shutterstock and news organisations including the Associated Press and Axel Springer, giving them access to structured archives on negotiated terms. In parallel, other news agencies such as Reuters have confirmed their own arrangements to supply content for training. These contracts are not charity; they are early moves in a market where a small number of privileged data providers will enjoy both cash and prominence inside answer engines that millions of people rely on.
None of this would be a problem if the rest of the ecosystem operated on transparent, standardised rules. It does not. Outside the handful of disclosed deals, training remains a mixture of scraping, litigation and silence. Small and independent publishers, bloggers, local outlets and non-English sources are either harvested without discussion or blocked outright by robots.txt files they did not write with artificial intelligence in mind.
The quiet deals between big models and big media
When OpenAI signs a multi-year agreement with a major European publisher that owns titles like Politico, Bild and Business Insider, the press release mentions innovation and partnership. The detail that matters is different. The publisher receives money and, by its own description, favourable treatment inside the answer engine. In other words, its content not only trains the model but is more likely to be surfaced when users ask questions.
Other deals sit in the same pattern. A stock image provider licenses millions of images for model training and promotes the arrangement as a new revenue stream for contributors. A news agency sells archive access. Industry analysts keep informal tables of which pairing was worth roughly what. The technicalities differ, but one thing is constant. Those with existing scale and negotiating leverage are given a formal place at the table. Everyone else is told, after the fact, that their works were part of the training mix under an implied or statutory licence.
- Global image libraries signing multi-year agreements with model makers so that millions of pictures and their metadata can be used to train and refine generative systems.
- News organisations licensing text archives for training, in some cases receiving both fixed payments and variable fees linked to the use of their content inside products.
- Technology and media groups confirming that their stories will enjoy better placement in conversational search results as part of broader cooperation deals.
None of these arrangements are unlawful in themselves. The problem is what happens to everybody who is not inside them.
When rights holders fight instead of signing
Not every rights holder has decided to make peace. Some have gone to court. The New York Times is pursuing a major lawsuit in the United States alleging that millions of its articles were used without permission to train language models and that, in some cases, model outputs can reproduce substantial parts of its work. Authors, through individual and class actions, have brought their own litigation, claiming that books were ingested from shadow libraries and that the output of the systems infringes their rights.
The picture is not one way. In Britain, Getty Images has just lost the central part of its copyright claim against Stability AI in the High Court. The judge concluded that the model weights of Stable Diffusion are not copies of the original images, even though the training dataset included works bearing the Getty watermark, and that the relevant acts of training did not take place within the territorial reach of United Kingdom copyright law. Some narrow trade mark issues survived, but the headline claim of large scale copyright infringement failed.
Elsewhere, an artificial intelligence company accused of training on pirated books has chosen to settle a class action rather than risk trial, after a federal judge in the United States accepted that training on lawfully obtained copies could fall within fair use but allowed claims over the retention of pirated copies to proceed. Other media groups have sued over alleged disregard of robots.txt files and removal of copyright management information. Courts are only beginning to draw the lines, and they are doing so in different ways in different jurisdictions.
From the point of view of a publisher or author, this does not add up to a stable regime. What it signals is that the law is unsettled and that, in the meantime, the real settlement will be achieved by contract between those with the balance sheet and the litigation budget to force their way into the first wave of deals.
Britain’s position: safety theatre and copyright fog
Britain presents itself as a pragmatic middle power on artificial intelligence. It hosts summits, stages photo opportunities with chief executives and talks about flexible frameworks. On the question that matters for ownership, it has been slow and hesitant.
The government’s own consultation on copyright and artificial intelligence, launched at the end of 2024, admitted what everyone involved already knows. Rights holders find it difficult to control the use of their works in training. Developers find it difficult to understand what they can legitimately do. Legal uncertainty is undermining investment and fuelling conflict. Proposals floated by ministers have included a wide text and data mining exception, together with an opt-out mechanism whereby rights holders would need to reserve their rights to prevent their works being used for training.
At the same time, parliamentary research briefings and legal commentaries have warned that a broad exception without real control for creators would amount to a subsidy to the very largest model makers, turning the creative and news industries into an involuntary input pool. Pressure will increase as the Data (Use and Access) Bill and other legislation move through the system, but for now there is no clear statutory settlement. Britain is drifting while others begin to write rules.
- In the European Union, the new artificial intelligence regulation will require providers of general purpose models to publish summaries of the content used for training, including information about copyright protected materials.
- In the United States, courts are hearing multiple consolidated cases against major artificial intelligence developers, while regulators have started to demand information about training data sources in competition and consumer protection investigations.
- In Britain, the main movement so far has been consultations and position papers, with proposals for an exception and opt-out model still under discussion rather than enacted law.
The result is that Britain risks importing whatever settlement emerges elsewhere instead of setting its own terms.
What this means for small publishers and independent voices
In all of this, almost nobody is speaking for small and independent publishers. The model companies want broad freedom to scrape. The biggest media groups want paid deals and visibility. Governments want to be seen as pro innovation without paying the political price of picking a side. The outlets left on the margins are the ones who can least afford to be treated as frictionless raw material.
For a site like Telegraph Online, there are only three real outcomes. The first is accidental invisibility, where robots.txt settings written for search engines ten years ago end up blocking or confusing the modern crawlers used by artificial intelligence systems. In that world, you are simply not present in the new discovery layer, however strong your journalism may be.
The second outcome is being strip-mined without compensation. If robots.txt files and site terms are ignored or interpreted aggressively by model makers, and if national law gives them the benefit of the doubt, then independent outlets become unpaid training data. Their work improves the answers, but their name is never spoken and their bank account never notices.
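Both of those outcomes turn on a few lines of plain text. A robots.txt file written for the search era addresses search crawlers and nothing else, so a publisher who wants search visibility without donating an archive to model training has to name the artificial intelligence crawlers explicitly. A minimal sketch, using crawler tokens that OpenAI, Google and Common Crawl have published (GPTBot, Google-Extended, CCBot); the /archive/ path is illustrative, and any real file would need to track a longer, shifting list of agents:

```
# Sketch: keep search indexing, withhold the site from model training.

# Traditional search crawlers may index everything.
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Crawlers used to gather training data are refused, pending a licence.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else may read public pages but not the archive (illustrative path).
User-agent: *
Disallow: /archive/
```

The limit of the mechanism is visible in the sketch itself. A robots.txt file is a convention, not a contract: a crawler that ignores it meets no technical barrier, which is why the disputes over disregarded robots.txt files described above end up in court.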
The third outcome is collective bargaining. Independent and smaller publishers pool their rights and negotiate together, either through new collecting societies or through co-operatives that offer standard licences to model makers on transparent terms. That requires a legal framework that recognises their position, rather than one that treats every scrape as fair game unless a court rules otherwise.
What a fair training data settlement could look like
A serious British approach would start from three simple principles. First, transparency. Providers of significant models should be required to publish meaningful summaries of their training sources, including categories of copyright protected content and major institutional contributors. Without that, rights holders and the public are arguing in the dark.
Second, choice. Creators and publishers should have clear, enforceable mechanisms to say yes or no and to change their minds. That could include standardised clauses in publishing contracts, recognised machine-readable signals in robots.txt files and page headers, and statutory rights for collecting bodies to represent groups of rights holders. An opt-out buried in a consultation document is not enough.
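One candidate for such a signal already exists on paper. The draft Text and Data Mining Reservation Protocol, developed by a W3C community group, lets a site declare in an ordinary HTTP header, or in a well-known file, that rights are reserved and where its licensing policy can be found. A minimal sketch, assuming the header and file names from that draft; the example.org policy address is a placeholder:

```
# Sketch of the draft TDM Reservation Protocol (TDMRep).

# Option 1: response headers served with every page.
tdm-reservation: 1
tdm-policy: https://example.org/licensing/tdm-policy.json

# Option 2: a site-wide declaration at /.well-known/tdmrep.json.
[
  {
    "location": "/",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.org/licensing/tdm-policy.json"
  }
]
```

The legal weight of such a signal is precisely what remains unsettled. In the European Union, an expressed machine-readable reservation removes a work from the commercial text and data mining exception; in Britain, for now, it has no statutory effect at all.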
Third, money and leverage. If artificial intelligence systems rely on news, books, images and other works at scale, then there should be a clean route for those sectors as a whole to be paid, not just for a few multinationals. That could mean sector-wide licences analogous to the way music is licensed for public performance. It could mean mandated collective bargaining schemes under which model makers must negotiate with recognised bodies. What it cannot mean, if the word fair is to retain any meaning, is that payment and prominence are reserved for the small club of organisations that were first through the door.
The choice in front of Britain
Britain does not have to accept that the ownership of training data will be settled entirely in American courts and private contracts. It can decide that if models trained on local work are going to shape what its citizens see and hear, then the terms of that training should be visible and contestable.
That would require ministers to do something harder than hosting summits. It would mean choosing between a regime that quietly gifts free inputs to a handful of global technology companies, and one that insists on negotiated access, transparency and a route for independent voices to be heard. It would mean admitting that the substance of artificial intelligence governance lies not in speculative debates about hypothetical existential risks but in very specific questions about who owns the data, who gets paid and who is allowed to say no.
If Britain refuses to make that choice, others will make it for it. The country will discover, as it already has in other areas of technology, that it has become a rule-taker in a system whose foundations were poured somewhere else. The argument over who owns the training data will not wait for Whitehall to catch up.
- Major model developers have signed disclosed licensing agreements with large media and image providers, while facing separate lawsuits from authors and publishers who allege unlicensed use of their works.
- British courts have begun to decide cases such as Getty Images against Stability AI, rejecting broad copyright claims while recognising narrower trade mark issues and leaving key questions open.
- The European Union has already embedded training data transparency obligations into its artificial intelligence regulation, while Britain is still consulting on basic copyright exceptions for model training.
Method and sources: This article draws on official government consultations and parliamentary briefings on copyright and artificial intelligence in Britain, European material on transparency obligations in the new artificial intelligence regulation, legal analysis and judgments in Getty Images v Stability AI and related cases, industry reporting on content licensing deals between model makers and publishers, and public documents from author and media lawsuits in the United States. No anonymous briefing and no uncheckable claims are used.
References
| Source | Relevance |
|---|---|
| United Kingdom government consultation on copyright and artificial intelligence and related parliamentary research briefings | Explains current uncertainty in United Kingdom law on training data, text and data mining exceptions and proposed opt out mechanisms for rights holders. |
| European artificial intelligence regulation and commentary on transparency obligations for training data | Shows how the European Union is imposing duties on general purpose model providers to disclose summaries of training content and respect copyright. |
| Getty Images v Stability AI judgment and legal commentaries | Provides the first major United Kingdom ruling on whether model training and weights amount to copyright infringement and how territorial issues are treated. |
| United States litigation involving the New York Times, authors and media companies against artificial intelligence developers | Illustrates the claims of unlicensed training, alleged memorisation of works and the use of shadow libraries, as well as judicial approaches to fair use. |
| Industry reporting and market analyses on content licensing deals between model makers and large publishers and image providers | Documents the emerging pattern of private data deals and the way that favourable treatment inside conversational systems is sometimes part of the bargain. |