In the last days of 2023, The New York Times sued OpenAI and Microsoft for copyright infringement. A move like this had long been mooted, since large language models (LLMs) are trained on data gleaned from published work online, but a high-profile lawsuit finally being filed is still a watershed moment.
Generative AI's ambiguous standing under copyright law has troubled developers, publishers and lawmakers for years. But it is just one of many headaches brought on by the rapid ascent of AI, alongside the question of how to strike the right balance between quelling bad actors and encouraging sensible growth.
Take the EU AI Act: a month ago, lawmakers finally agreed on legislation demanding transparency from providers and prohibiting certain uses of the technology, albeit after struggling to keep up with the pace of progress since the Act was first proposed in 2021. Despite the long-awaited agreement, enforcement likely won't start for years.
The issue of copyright is more delicate to settle, particularly with models now becoming multimodal. We are very much exploring uncharted territory. But what solution could appease publishers and attribute creators appropriately? And, regardless of the outcome of The New York Times suit, how will developers deal with this in the short term?
Special treatment
After The New York Times filed its lawsuit, OpenAI expressed its "surprise and disappointment" at the move. In a statement, the company acknowledged that "regurgitation", generating unaltered pieces of memorised content, "…is a rare bug that we are working to drive to zero." Evidently, this progress was not gathering pace quickly enough for many creators.
It probably hasn't helped that OpenAI has pleaded for special treatment when it comes to paying licence fees, submitting evidence to the House of Lords Communications and Digital Committee that the development of artificial intelligence is doomed without the use of copyrighted materials. Aside from the fact that such an exemption would violate the Berne Convention, the move was met with a mixture of scorn and outrage.
Gary Marcus, in his Marcus on AI newsletter, put it best: "OpenAI's lobbying campaign, simply put, is based on a false dichotomy (give everything to us free or we will die) – and also a threat: either we get to use all the existing IP we want for free, or you won't get to use generative AI anymore."
It appears that the only solution that will appease creators and publishers is for foundational AI developers to pay up. A business owner who plays copyrighted music in a public place must pay for a PRS licence to ensure the rights holders are remunerated. Why should it be any different for AI companies?
What next?
If AI developers are ultimately forced to pay up through landmark court rulings, the cost of building and running LLMs will climb to even more eye-watering levels, perhaps leaving the technology in the hands of only the wealthiest private companies.
Regardless of whatever monetary reimbursement a court may decide, the issue of attribution still looms large. Can it actually be solved with current technology? The question is particularly timely now that LLMs have become multimodal, with the provenance of images especially tricky to pinpoint.
This is why, for now, OpenAI appears to be implementing new guardrails to steer users away from inadvertently creating copyrighted material. The success of guardrails to date has been mixed at best, which is not overly surprising when they must strike a balance between restrictiveness and permissiveness. Whether new guardrails will prove to be anything more than a sticking plaster remains to be seen.
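To see why that balance is so hard to strike, consider a deliberately naive guardrail: a keyword blocklist checked before a prompt ever reaches the model. Everything here is hypothetical, a toy sketch rather than anything OpenAI actually deploys, but it shows how easily a filter can be too strict and too lax at the same time.

```python
# A deliberately naive prompt guardrail. The blocklist entries are
# hypothetical stand-ins for protected characters; real systems use
# trained classifiers rather than keyword matching.
BLOCKLIST = {"mickey mouse", "darth vader"}

def allow_prompt(prompt: str) -> bool:
    """Reject any prompt that names a blocklisted character."""
    lowered = prompt.lower()
    return not any(term in lowered for term in BLOCKLIST)

# Too strict: an innocent art-history question gets refused.
print(allow_prompt("Who created Mickey Mouse?"))               # False
# Too lax: a thin paraphrase sails straight through.
print(allow_prompt("A cheerful cartoon mouse in red shorts"))  # True
```

Tightening the filter catches more paraphrases but blocks more legitimate requests; loosening it does the reverse. That trade-off is the sticking plaster problem in miniature.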
In the meantime, other developers are experimenting with tools to arm creators. One example is Nightshade, a project from the University of Chicago. Playfully compared to "putting hot sauce in your lunch so it doesn't get stolen from the workplace fridge", the tool perturbs an image so that a model in training interprets it as something different from what a human viewer sees, skewing any reproduction of the copyrighted material.
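Nightshade's actual algorithm is more sophisticated, but the underlying idea, nudging an image's pixels so that a model misreads it while a human notices nothing, can be sketched with a simple gradient-based perturbation. The `poison` function, the choice of a stock ResNet-18 as the victim model, and the FGSM-style update below are all illustrative assumptions, not Nightshade's method.

```python
# Toy illustration of image "poisoning": perturb pixels so a model leans
# towards a decoy class while the change stays barely visible to humans.
# This is an FGSM-style sketch, NOT Nightshade's actual algorithm; the
# stock ResNet-18 stands in for whatever model a scraper might train.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])

def poison(image: Image.Image, decoy_class: int, epsilon: float = 0.03) -> torch.Tensor:
    """Return a subtly perturbed copy of `image`, nudged towards `decoy_class`."""
    x = preprocess(image).unsqueeze(0).requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), torch.tensor([decoy_class]))
    loss.backward()
    # Stepping against the gradient lowers the loss for the decoy class,
    # i.e. makes the model "see" the decoy more strongly; clamping keeps
    # pixel values valid so the image still renders normally.
    return (x - epsilon * x.grad.sign()).clamp(0, 1).detach()
```

The point of the sketch is the asymmetry it exploits: a perturbation of a few percent per pixel is invisible to a person browsing a portfolio site, yet it can systematically mislead a model trained on scraped copies.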
Furthermore, if and when a technical solution is found, how will licence fees and royalties be tracked, quantified and paid? Letting AI developers police themselves in yet another area is inadvisable and surely not in the best interests of all involved; does the answer lie in a brand-new regulatory body to oversee the process? As generative AI becomes an ever greater part of our daily lives, this could be something to explore.
The only option
It seems that, no matter how often foundational AI developers try to kick the can down the road, paying to use copyrighted material will eventually be their only option. As is often the case, court rulings will be the catalyst for change.
But we should already be looking beyond that, to finding ways of ensuring creators are properly attributed. This will no doubt define the year to come and beyond. While the pace of progress is exciting, foundational AI developers must be shown no special treatment in their work, and their empty threats dismissed at every turn.
Bayezian utilises data science to offer wide-ranging solutions across the health, financial services, environment, social and government sectors. Glyn has over 20 years of experience in the tech sector, landing accolades such as Fast Track 100, Tech Track 100, Inner City 100 and Hot 100 listings in that time. He currently uses that experience to lead Bayezian and to develop talented young tech innovators entering the industry, as well as strategising with ambitious startups.