Scrraaape! The World Wide Web is creaking under the weight of web-crawling bots moving from server to server, harvesting data to carry back to their hives, where hungry generative AI models await.

Ben Maling

But is it legal? Big tech is making big bets on this trillion-dollar question, arguing that the acts carried out to train generative AI models either fall outside the scope of copyright law or benefit from an exception to it. Not everyone agrees. Creatives, journalists, and others who make money from licensing or displaying digital works are sounding the alarm on what they view as an illegal appropriation of their work and a diversion of their revenue. New lawsuits emerge every week (at least 32 in the US at the time of writing), but still the scraping continues, the perpetrators betting that the courts will rule in their favour, or that by the time any legal certainty is reached, generative AI will be so embedded in society that something as quaint and antiquated as copyright law won’t be able to stand in its way.

Quite apart from those who feel robbed, the lack of certainty is itself a problem. AI model providers must operate in a legal grey area or risk falling behind their less scrupulous, or less questioning, competitors. Meanwhile, many who wish to use generative AI are unsure of the legality of the tools and the risks of integrating them into their businesses. Stakeholders on all sides are looking to governments to provide clarity.

That’s why I was excited to hear Feryal Clark, the new minister for AI and digital government, say the other week that she expects the AI copyright dilemma to be resolved in the UK 'in the very near future, by the end of this year'. Whether by updating existing policy or with new legislation, the government will aim to bring clarity to both the AI sector and the creative industries. As Clark acknowledges: 'They both are incredibly important to the UK’s economy, so we need to actually resolve this. It’s been going on for far too long'.

It’s not going to be easy, and we’ve been here before: the previous government considered a text and data mining (TDM) exception similar to that provided by the EU’s Copyright in the Digital Single Market Directive (CDSMD), only to withdraw it at the 11th hour in response to concerns from stakeholders. More recently, the Intellectual Property Office failed in its attempt to broker a voluntary code of practice, as parties with opposing interests were unable to see eye to eye. But assuming the government stays true to its word, which way will it go?

We do know that Google, OpenAI, and others have been lobbying for a 'fill your boots' copyright exception for generative AI training. Sam Altman, generative AI’s great travelling salesman, has gone door to door arguing that it is simply impossible to create AI tools if the training data has to be licensed (though generative AI models can be, and have been, trained without using unlicensed copyright works; examples include Adobe’s Firefly models and 273 Ventures’ Kelvin Legal Large Language Model, KL3M).

Mark Zuckerberg recently chipped in from behind a pair of AI spectacles to offer that 'individual creators or publishers tend to overestimate the value of their specific content', a kind of de minimis argument designed to emphasise the inevitability of progress irrespective of who gets with the programme, though one that seemingly justifies enriching oneself by stealing small amounts from a large number of people.

One risk of a blanket exception for AI training is that it may discourage creators and publishers from making their work available online, a closing of ranks that would ultimately harm the AI industry and consumers alike.

At the other end of the policy spectrum is a strict requirement that use of copyright works as training data must be disclosed and is permitted only with informed consent, on an opt-in basis. Some argue that this direction would put UK companies at a competitive disadvantage, and/or would stifle innovation, to use the terminology du jour. On the other hand, one could argue that such a provision would force AI providers to be more data-efficient, more selective in the data they use, and, dare I say it, more innovative. Some of the worst excesses of generative AI could be curbed.

To avoid penalising UK companies, it is important that any licensing requirement applies equally to any company using UK data (and, possibly, to any company putting a model into service in the UK), irrespective of where the training takes place.

There are a number of possible compromises. One is to allow training on copyright works, provided rights-holders are given an opportunity to opt out. Any such provision should come with guidance on what constitutes a valid opt-out, something still lacking in relation to the TDM exception in the EU’s CDSMD.
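To make the problem concrete: the nearest thing to a machine-readable opt-out today is the robots.txt protocol, for which OpenAI, Google, and Common Crawl have each published user-agent tokens (GPTBot, Google-Extended, and CCBot respectively). The sketch below, using Python’s standard library, shows how a well-behaved crawler would honour such a signal; the site address is hypothetical, and nothing in the protocol compels compliance, which is precisely the gap that guidance on valid opt-outs would need to close.

    # A publisher wishing to withhold its content from AI training crawlers
    # might serve a robots.txt like this (an assumed example):
    #
    #   User-agent: GPTBot
    #   Disallow: /
    #
    #   User-agent: Google-Extended
    #   Disallow: /
    #
    #   User-agent: CCBot
    #   Disallow: /

    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser()
    robots.set_url("https://example.com/robots.txt")  # hypothetical site
    robots.read()  # fetch and parse the live file

    # A compliant crawler checks before every fetch; an unscrupulous one
    # simply does not, and the protocol itself cannot stop it.
    if robots.can_fetch("GPTBot", "https://example.com/an-article"):
        print("crawling permitted")
    else:
        print("opted out; a compliant crawler moves on")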

A more nuanced approach could be to have different provisions in different cases, such as for different types of copyright works and/or different types of AI model. For example, use of factual text data could be permitted on an opt-out basis, while use of fictional works, images, or music could require an explicit opt-in. In the case of journalism, AI companies would still have to obtain licences to feed up-to-date data into the models at runtime. Such a hybrid approach is allowed for in Lord Holmes’ AI Bill, which is the most serious proposal so far on how to handle the copyright dilemma in the UK.

The government has an opportunity here to show leadership that will reverberate on the world stage. I hope it will approach the issue with a clear sense of whose interests it has a mandate to protect.


Ben Maling is managing associate, UK and European patent attorney, EIP
