What if a generative AI accessed my site without permission, compiled information from it, and collected my photos unilaterally? With that concern in mind, I decided to deny crawl access to my own site.
Deny in robots.txt
The robots.txt file defines which crawlers, such as search engine bots, are allowed to access your site.
Major AI services crawl the web with bots, just as Google does.
To prevent this, you can list in robots.txt the crawlers you do not want to access your site.
Install the Virtual Robots.txt WordPress plugin.
The Virtual Robots.txt plugin manages robots.txt for WordPress.
It is simple and easy to use; its only function is editing robots.txt.
Change robots.txt
If you want to deny a crawler access to the entire site, put the following in robots.txt:
User-agent: [crawler's name]
Disallow: /
User-agent: Specifies which web crawler the rule applies to.
Disallow: Specifies which directories and pages the named User-agent is not allowed to access. / refers to the root directory of the website (i.e. all pages).
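For example, a rule does not have to cover the whole site. Assuming a hypothetical crawler named ExampleBot and a /private/ directory on your site, the following denies it access to that directory only, while leaving the rest of the site open:

```
User-agent: ExampleBot
Disallow: /private/
```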
Deny ChatGPT crawlers.
OpenAI, the developer of ChatGPT, has published documentation on its crawlers.
https://platform.openai.com/docs/bots
OpenAI’s UAs (user agents) are ‘GPTBot’ and ‘ChatGPT-User’.
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
Deny Gemini crawlers.
Google-Extended is the user agent token for controlling whether Google’s generative AI may use your web pages as training data.
User-agent: Google-Extended
Disallow: /
Deny Claude crawlers.
Claude is a large language model developed by Anthropic that performs well on natural language processing tasks, with an emphasis on safety and ethics in its design.
Claude’s UA is ClaudeBot.
User-agent: ClaudeBot
Disallow: /
Deny Common Crawl crawlers.
Common Crawl is a non-profit organisation that crawls the web and provides an archive of its data.
The project regularly crawls vast amounts of web pages on the internet and makes the data available free of charge. The data collected includes HTML, text and link information, which is used for analysis and machine learning by researchers and developers.
Common Crawl’s UA is ‘CCBot’.
User-agent: CCBot
Disallow: /
Summary of robots.txt descriptions
Putting all of the above together, robots.txt looks like this:
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
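As a quick sanity check, the combined rules can be tested locally with Python’s standard library. This is a sketch: the rules string mirrors the summary above, and example.com is a placeholder domain.

```python
# Verify the robots.txt rules with the standard library's robots.txt parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The listed AI crawlers are denied everywhere on the site...
print(rp.can_fetch("GPTBot", "https://example.com/any-page"))   # False
print(rp.can_fetch("CCBot", "https://example.com/"))            # False
# ...while unlisted agents (e.g. Googlebot) remain allowed,
# since there is no "User-agent: *" group.
print(rp.can_fetch("Googlebot", "https://example.com/"))        # True
```

Note that because no `User-agent: *` group is present, ordinary search engine crawlers are unaffected and your site stays indexed.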
There is a wide variety of other generative AI, and the internet is crawled by all kinds of information-gathering tools. It is worth considering denying unwanted access where necessary.