The AI Content Wars
The New York Times lawsuit against OpenAI and Microsoft is forcing a reckoning. Now marketers must consider the implications.
The New York Times filed a lawsuit against OpenAI and its lead investor, Microsoft, alleging intellectual property violations related to its journalistic content appearing in ChatGPT training data. Regardless of the outcome, the lawsuit adds another layer of concern for companies deploying a generative AI model.
At the heart of the concern are indemnification and liability. Does the AI model provider protect its users from potential legal action? Or is the brand itself liable for any answers the AI might produce, regardless of the domain-specific data it might use to train the AI?
The forced reckoning goes deeper than whether a brand may be liable for branded responses a chatbot or copilot tool might produce. Pundits and media are speculating about the long-term impacts of the lawsuit. The most dire of these analyses predicts a Napster-like collapse of OpenAI and the GPT family of technologies or, worse, of the entire generative AI sector.
Generative AI solution providers are also engaged in a race to mitigate the potential legal impacts of their models, protecting their customers and their long-term viability. Brands like Adobe and Getty Images proactively promote generative AI solutions that ethically source training data and indemnify customers.
The legal and marketing spin surrounding the generative AI content war will only increase as 2024 unfolds. Marketing leaders need to navigate the hype, fear, and concerns to determine the real impacts, if any, on the AI tools they are using and considering deploying, both internally and externally.
Potential Impacts on Brands
Brands need to determine the amount of risk they are willing to incur.
Enterprises can weigh the impact of a generative AI implementation by working through several questions. The first question brands must consider is whether to use AI tools at all. Abandoning AI in all forms over legal concerns would be foolhardy. Too many companies, including competitors, are using generative AI in some form to strengthen their marketing functions, even if only at a tactical level.
Instead, the determining factors should focus on acquiring generative AI solutions ethically and responsibly. Here are six additional questions to consider during your product evaluation.
Does the generative AI solution meet our needs in a qualitative manner?
One of the problems with the hype cycle and the resulting debate is the distraction it creates. Unvetted concerns can pull teams off mission and away from actual enterprise marketing needs. Vetting solutions begins with finding the model whose quality best meets your actual marketing needs.
For example, suppose you need to vet generative writing tools to support your sales team. You may want to use Jasper’s GPT-based solution but feel it carries too much legal risk to consider.
Pause and think it through. Don’t react to the hype and assume you must adopt Anthropic’s Claude, Writer’s Palmyra, or another writing tool instead of a GPT-based solution purely because of legal concerns.
Instead, vet the solutions based on your actual needs. Research the LLMs’ technical strength, effectiveness, and ability to achieve your objectives. Your normal vetting process should identify potential solutions, alternatives, and their risks. Those risks become factors that you either mitigate, accept, or weigh in your decision to buy.
Is your data protected from training?
Every brand must consider whether AI tools train on its data and prompts. As Samsung discovered, most public versions of these tools ingest prompts as training content. Licensed versions, however, often commit to protecting brands. Are the AI providers trustworthy enough to protect enterprise data? This is a core question brands need to ask and resolve before licensing pro or enterprise-level solutions.
In many cases, even when guarantees are offered, cybersecurity risk might demand that you create a private instance to keep your data from being ingested by a foundational AI model provider. Consider the regulations in your sector and the potential risks of a customer data leak when making these determinations.
Then there is the business concern. Some organizations may want to keep their more public content from being crawled in order to protect its value. They may consider erecting paywalls and adding no-crawl directives to their websites that bar LLM crawlers from using their content for training, adding a layer of protection.
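As an illustration, OpenAI documents a robots.txt rule for blocking its GPTBot crawler (a "User-agent: GPTBot" entry followed by "Disallow: /"). The sketch below uses only the Python standard library to check whether a site's robots.txt currently allows GPTBot to crawl it; "example.com" is a placeholder domain, not a recommendation.

```python
# Sketch: check whether a site's robots.txt permits OpenAI's GPTBot crawler.
from urllib import robotparser

def allows_gptbot(domain: str) -> bool:
    """Return True if the site's robots.txt lets GPTBot crawl the homepage."""
    parser = robotparser.RobotFileParser()
    parser.set_url(f"https://{domain}/robots.txt")
    parser.read()  # fetch and parse the live robots.txt file
    return parser.can_fetch("GPTBot", f"https://{domain}/")

if __name__ == "__main__":
    # "example.com" is a placeholder; substitute your own domain.
    print(allows_gptbot("example.com"))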
Depending on what your organization wants to achieve with its content, making data available to AI engines might make sense. In essence, some organizations and companies may want to train LLMs on their message and value proposition. Public affairs, marketing, and political organizations may all see value in ChatGPT mirroring their messaging.
However, if you are still concerned, don’t let the risks stop you. Even the world’s largest banks use generative AI, albeit within locked-down private instances. Keep in mind that one of the greatest risks of a data leak is your own employees. Governance and policy, a separate matter, also need to be considered.
Is the algorithm trained on ethically sourced data?
What separates good or well-trained algorithms from bad ones?
This factor is at the heart of most of the legal implications. Before the ChatGPT boom, it was common practice for algorithm providers to train their technologies on public web data. These web crawls included copyrighted text, imagery, and video.
To be clear, OpenAI’s solutions are not the only foundational AI models with training data issues. GitHub Copilot, Stable Diffusion, Midjourney, and DALL·E all face legal concerns. Others, like Anthropic, took a more fine-toothed-comb approach to training on public data but may still face copyright infringement claims if their algorithms accidentally quote stories.
Part of the content war involves a necessary change in how foundational models are trained. Adobe and Getty both trained their image generators on the photography and other content licensed to them by their contributor networks. While their image generators are less powerful than Midjourney or Stable Diffusion, they win enterprise customers by bundling ethical solutions into existing subscriptions.
Legacy vendors are also adapting to the legal implications of training on copyrighted data and the resulting new competition. OpenAI is actively changing how it sources data, giving publishers simple robots.txt directives to block its web crawlers from using their content to train algorithms. The company is also negotiating with some large content providers to license and legally access their data.
Another consideration is that an algorithm having been trained on public web data does not, by itself, make your brand liable for damages when it uses that algorithm. We will discuss indemnification in question 4.
Does the provider indemnify its customers against errors or against copyrighted material surfacing in its answers?
Adobe, Getty, and others may be winning the AI PR war with their claims of ethically sourced training data, but indemnification against damages is becoming a commonplace term of service for AI tools. For example, OpenAI offers a very clear indemnification clause for concerned brands:
“We agree to defend and indemnify you for any damages finally awarded by a court of competent jurisdiction and any settlement amounts payable to a third party arising out of a third party claim alleging that the Services (including training data we use to train a model that powers the Services) infringe any third party intellectual property right.”
More sophisticated procurement processes already require insurance and indemnification. If your enterprise doesn’t have a standard procurement process to vet legal implications, double down on vetting AI contracts. Bring your legal team in and make sure your brand is protected from AI-vendor training problems.
What happens if the vendor is found guilty?
Another risk facing a brand is the possibility of its foundational model being pulled from the market due to legal action. Sudden breaks in service are never good. While rare, such instances do occur, including the highly visible case of the Apple Watch Series 9 being pulled from the market over an IP dispute.
One request the New York Times made in its legal case against OpenAI was the removal of all algorithms trained on its data. What steps and protections are in place to replace a GPT model if the newspaper wins? Having a backup plan, or even switching away from GPT models preemptively if the case turns the wrong way, may be necessary.
Can the solution provider, your team, or a third party mitigate content risks by limiting responses to specified source data?
As noted above in the data training section, your solution provider, internal data science teams, or another vendor may be able to address brand concerns with third-party instances or a more complex model. For example, to ensure responses draw on brand-specific data and to prevent hallucinations, a vendor can build or add a retrieval-augmented generation (RAG) solution in a private instance.
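As a rough illustration of the idea (not any particular vendor's implementation), the sketch below retrieves the most relevant passages from a small set of approved brand documents and builds a prompt that instructs the model to answer only from those sources. The document titles, the toy keyword-overlap scoring, and the final generation step are all placeholder assumptions; production systems typically use vector embeddings and a hosted LLM inside the private instance.

```python
# Minimal RAG sketch: ground answers in approved brand documents rather than
# the model's open-ended training data. All content here is placeholder text.
from dataclasses import dataclass

@dataclass
class Document:
    title: str
    text: str

# Hypothetical approved source material supplied by the brand.
KNOWLEDGE_BASE = [
    Document("Warranty policy", "All products include a two-year limited warranty."),
    Document("Returns", "Items may be returned within 30 days with proof of purchase."),
]

def score(query: str, doc: Document) -> int:
    """Toy relevance score: count of query words appearing in the document."""
    words = {w.lower() for w in query.split()}
    return sum(1 for w in doc.text.lower().split() if w in words)

def retrieve(query: str, k: int = 1) -> list[Document]:
    """Return the k most relevant documents; real systems use vector embeddings."""
    return sorted(KNOWLEDGE_BASE, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str) -> str:
    """Constrain the model to answer only from retrieved, approved sources."""
    context = "\n".join(f"[{d.title}] {d.text}" for d in retrieve(query))
    return (
        "Answer using ONLY the sources below. If the answer is not in the "
        f"sources, say you don't know.\n\nSources:\n{context}\n\nQuestion: {query}"
    )

if __name__ == "__main__":
    # The resulting prompt would be sent to whichever LLM the private instance hosts.
    print(build_prompt("How long is the warranty?"))
```

Because the prompt is limited to retrieved, approved content, the model has far less room to reproduce copyrighted training data or to hallucinate claims the brand never made.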
Conclusion
OpenAI’s legal woes over ChatGPT’s training data are real and could impact the company’s long-term viability. However, while the risks discussed by influencers and media are real, they are not showstoppers. Brands that understand their needs and are willing to engage in strong procurement processes can vet potential solutions for ethical challenges and risks.
All images were developed on Midjourney.