What if a generative AI accessed my site without permission, compiled information from it, and collected my photos unilaterally? With that concern in mind, I decided to deny crawl access to my own site.
Deny in robots.txt
The robots.txt file defines which crawlers, such as search engine bots, are allowed to access your site.
Major AI services crawl the web with bots, just as Google does.
To prevent this, you can list in robots.txt the crawlers you do not want to access your site.
Install the Virtual Robots.txt WordPress plugin.
The Virtual Robots.txt plugin manages robots.txt for WordPress.
It is simple and easy to use; its only function is editing robots.txt.
Change robots.txt
If you want to deny a crawler access to the entire site, put the following in robots.txt:
User-agent: [crawler's name]
Disallow: /
User-agent: Specifies which web crawler the rule applies to.
Disallow: Specifies which directories and pages the named User-agent is not allowed to access. / refers to the root directory of the website (i.e. all pages).
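For example, a rule does not have to cover the whole site. Assuming a hypothetical crawler named ExampleBot and a /private/ directory on your site, the following denies it access to that directory only, while leaving the rest of the site open:

```
User-agent: ExampleBot
Disallow: /private/
```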
Deny ChatGPT crawlers.
OpenAI, the developer of ChatGPT, has published documentation on its crawlers.
https://platform.openai.com/docs/bots
OpenAI’s UAs (user agents) are ‘GPTBot’ and ‘ChatGPT-User’.
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
Deny Gemini crawlers.
Google-Extended is the user agent token for controlling whether Google’s generative AI may use your web pages as training data.
User-agent: Google-Extended
Disallow: /
Deny Claude crawlers.
Claude is a large language model developed by Anthropic that performs well on natural language processing tasks, with an emphasis on safety and ethics in its design.
Claude’s UA is ClaudeBot.
User-agent: ClaudeBot
Disallow: /
Deny Common Crawl crawlers.
Common Crawl is a non-profit organisation that crawls the web and provides an archive of its data.
The project regularly crawls vast amounts of web pages on the internet and makes the data available free of charge. The data collected includes HTML, text and link information, which is used for analysis and machine learning by researchers and developers.
Common Crawl’s UA is ‘CCBot’.
User-agent: CCBot
Disallow: /
Summary of robots.txt descriptions
Putting all of the above together, robots.txt looks like this:
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
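As a quick sanity check, the combined rules can be tested locally with Python’s standard library. This is a sketch: the rules string mirrors the summary above, and example.com is a placeholder domain.

```python
# Verify the robots.txt rules with the standard library's robots.txt parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The listed AI crawlers are denied everywhere on the site...
print(rp.can_fetch("GPTBot", "https://example.com/any-page"))   # False
print(rp.can_fetch("CCBot", "https://example.com/"))            # False
# ...while unlisted agents (e.g. Googlebot) remain allowed,
# since there is no "User-agent: *" group.
print(rp.can_fetch("Googlebot", "https://example.com/"))        # True
```

Note that because no `User-agent: *` group is present, ordinary search engine crawlers are unaffected and your site stays indexed.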
There is a wide variety of other generative AI, and the internet is crawled by all kinds of information-gathering tools. It is worth considering denying unwanted access where necessary.