Lawyers, wake up – your data is being scraped | Opinion

We should pay very close attention to the battles at the moment in the USA over AI. Our future depends on their outcome.

Jonathan Goldsmith

The Hollywood actors’ strike is the most prominent. It is shocking to read how AI is already eating away at roles. It has started with extras. According to the union, the studios propose that ‘our background performers should be able to be scanned, get paid for one day’s pay, and their company should own that scan, their image, their likeness, and should be able to use it for the rest of eternity in any project they want, with no consent and no compensation’ – which is scary.

Body scanning may not at present affect lawyers, but other data scraping definitely does. The seductive large language models like Bard and ChatGPT do not spew out their knowledge because of original minds and self-generated proposals. They have been fed vast quantities of material free of charge – your and my material – in order to store it, re-issue it on command, and eventually take away our income.

It is chilling to read from where data is scraped: not only from regular websites, but also from apps and other products which integrate large language models, from TikTok videos, photos from dating websites, location data, music preferences, conversations from Microsoft Teams, and whole books.

Significant lawsuits have recently been issued in the US to try to stop this.

OpenAI (of ChatGPT and DALL-E fame) faces an array of litigation, including a class action filed at the end of June in a San Francisco federal court against the company and its investor Microsoft. The action claims that OpenAI scraped the personal data of hundreds of millions of internet users in violation of privacy, intellectual property, and anti-hacking laws. Essentially, the claim is that these machines are trained on stolen private information, including from children.

Just a few days ago, a second and similar class action was brought against Google, also in a San Francisco federal court, claiming breach of privacy and property rights over the mass of material used to train Bard and Google’s other generative AI offerings (including, so I see, one which I use all the time, Google Translate).

The two class actions have been brought by the same law firm, which asked the court to allow the plaintiffs to remain anonymous in each case, citing violent threats reportedly received by individuals filing similar lawsuits.

The problem is that the law is unclear. It was not written for an age of mass data scraping from millions of places without consent. Is it breach of copyright? Breach of contract? Breach of privacy and data protection? Breach of the criminal law? If more than one law applies, which has priority? These lawsuits will presumably begin to sort out the mess.

At least one of the generative AI models, Google’s Bard, did not at first operate in the EU because of data and privacy concerns. But last week, Google announced that, having met regulators, it had reassured them on issues of transparency, choice and control. In a briefing with journalists, Bard’s engineering vice president said that users could opt out of their data being collected. I wonder how you do that. It is certainly not mentioned in Google’s own announcement of its launch in the EU.

But in the meantime, as the cases crawl through the US courts, and as we learn whether and how we can opt out of the scraping being undertaken by a variety of unconnected companies, our data continues to be taken without our consent.

The most immediate threat is, I suppose, to the vast amount of proprietary material put out by law firms on their websites, interpreting recent cases and legislation. Expensive lawyers and knowledge management experts put out analyses in the firm’s name, as part of the firm’s branding.

I sometimes use these analyses in pieces for the Gazette, always with a link to the firm’s website. But, as you will know if you use a large language model like ChatGPT, it does not provide links to source material. It spews the material out as its own. So your material is being scraped free of charge, and then fed back to you and a million others – threatening your future.

Anything else which your firm has on the web is susceptible to the same: all those briefings, blogs, podcasts, case studies, precedents, photos and CVs of your staff, summaries of cases won, introductory videos for clients. Solicitors have a vast amount of useful material on their websites, and all is vulnerable.

Having issued the warning, I am not enough of an expert to know the solution: sue ’em, opt out, or lock up your material?

Jonathan Goldsmith is Law Society Council member for EU & International, chair of the Law Society’s Policy & Regulatory Affairs Committee and a member of its board. All views expressed are personal and are not made in his capacity as a Law Society Council member, nor on behalf of the Law Society.