Content Providers Face Giving Away Their IP
Last week, OpenAI announced the GPTBot web crawler and explained how to block it from using your site to train future versions of its GPT models and ChatGPT. Perhaps a gambit to avoid further legal repercussions like the Sarah Silverman lawsuit, the move puts the onus on content providers to opt out of giving away their intellectual property.
Basically, if your site is ungated, open on the web, and meets OpenAI's PII criteria and its otherwise unstated content policies, it will be crawled. This is not permission-based, so I expect there will be some blowback.
However, OpenAI's web crawler takes an approach similar to Google's and other web crawlers', and it should stand long-term as an accepted practice. Unlike Google, ChatGPT and the GPT models do not cite or link back to source sites. This creates a self-sabotage quandary for those vested in monetizing content.
To stop OpenAI from training on your content and using it without credit, web publishers need to add a few lines to their website's robots.txt file. This code either fully or partially blocks GPTBot, depending on how your organization structures its web content based on the value of permitting AI training, as shown below.
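Per OpenAI's published guidance, a full block looks like this:

    User-agent: GPTBot
    Disallow: /

A partial block lets GPTBot into some paths while protecting others; the directory names below are placeholders for your own site structure:

    User-agent: GPTBot
    Allow: /directory-1/
    Disallow: /directory-2/

One caveat: robots.txt is honor-system guidance for well-behaved crawlers, not an enforcement mechanism, which is part of why gating content remains the stronger protection.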
So let’s get into these choices and analyze why some organizations may value crawling and why others may opt out.
The Good
One of my favorite elements of the good, the bad, and the ugly as an article format is that you focus on the positives first. And there are a few with this news.
First, enterprises, NGOs, and individuals with significant advocacy agendas who want to influence the world are in luck. They should view GPTBot and other LLM web crawlers as an opportunity to train popular AIs from their point of view. After all, this is their raison d'être.
The same can be said for organizations with larger data repositories that may be of public interest. This includes libraries, wikis, and even general forums with significant data about products and topics of interest. Again, if the purpose is sharing knowledge, then this is another way to achieve that mission.
Companies with product information to disseminate in as many ways as possible should have their sites indexed. For example, Adobe should want product information and user forum posts about how to use Photoshop indexed by LLMs. This is the same logic as SEO, with the exception that you don't get cited.
The premise is simple: if you want the public to know about your information, welcome GPTBot into your site.
The Bad
What is the value of your content? Is it worth nothing? How does it build value for your business or for you individually? And does having that knowledge propagated by AI as non-original thought serve your purposes? If it doesn’t, then act to protect your content.
As a result, we will see more and more content providers move to gated hosting solutions to protect their intellectual property. These measures come at a cost, mostly developer time. Content providers may also face churn of potential customers and readers who are casually interested in a topic but not interested enough to surrender their personal information to a mailing list or pay to subscribe.
It might also force larger web publishers to split their content into two directories, crawlable and protected. Again, this requires labor, most notably the unenviable task of creating new governance and content/data management layers.
Individual content providers may not be able to afford to gate their WordPress blogs, or may not know how. Some will want to move their content to a gated solution at a cost that may be minimal to a business but painful for an individual. For example, the ability to publish on Medium costs $50 a year, plus the time needed to populate the account with content and build a following.
On Substack, publishing is free, with the same time and effort costs. Here, writers are encouraged to charge a subscription fee, of which Substack keeps 10%. This isn't horrible, but it involves effort and a price. It may not be optimal for someone who simply wants to write content without getting scraped by an AI bot.
Another “bad” point, one that has been made over and over again: GPT trains on public web content, and the accuracy of its output is only as good as what it learns from. Hmmm, it certainly seems the hit-or-miss responses will continue.
The Ugly
The overarching premise of GPTBot is that if it is on the public web, it can be crawled. If your company makes no choice, you are effectively opting in, and your site will be used for training data.
Forcing people and organizations to volunteer their content for AI training without permission is a recipe for ill will. This approach will create problems for OpenAI's brand down the road.
Public content does not mean that copyright laws are null and void. The whole question of what's fair and right when it comes to AI usage and citation of publishers' content still needs to be ironed out.
This is particularly true for well-known individual personalities and smaller brands that may lack the web sophistication to even know what a robots.txt file is. It may be challenging to hold OpenAI accountable, but several people and entities are trying to do just that, and others will follow.
Perhaps more important are the implications of the good above. Entities and people can feed the GPT algorithm more than just lousy content; they can also provide it with propaganda. And given how many businesses and consumers use GPT without considering accuracy, that can only create more digital literacy problems.
A Wild West on the Interwebs
Generative AI continues to unleash dramatic change on the Internet, forcing a rebalance of power. While OpenAI’s GPTBot may reduce its legal liability, it has opened another can of worms all over the metaphorical floor.
We know LLMs are only as good as their training data. It may be that OpenAI has caused the removal of its most valuable training data resource: high-quality public content.
What do you think?
P.S. Those who opt to block GPTBot should also consider blocking Common Crawl, whose public data gathering has been used to train several LLMs, including GPT-3.5, as well as Stable Diffusion. Here's how to do it via your robots.txt file.
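Common Crawl's crawler identifies itself as CCBot, so the equivalent robots.txt entry is:

    User-agent: CCBot
    Disallow: /

Note that blocking CCBot only affects future crawls; anything already in Common Crawl's public archives stays there.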