Why Do AI Crawlers Keep Hitting robots.txt Instead of My Content?

Sascha
Over the past few weeks I’ve been monitoring traffic from AI crawlers like OpenAI’s GPTBot and OAI-SearchBot, and Anthropic’s ClaudeBot. The data (see screenshots below) raises some interesting questions:


  • Why does GPTBot visit robots.txt so many times, sometimes multiple times per day?


  • Why does GPTBot prefer robots.txt over sitemap.xml?


  • Why do I see AI bot traffic but no crawling of fresh content? Just repeated hits to old resources.


(Screenshot 1: Vercel Observability Query Builder: Bot traffic)
1. robots.txt Obsession



(Screenshot 2: OpenAI GPTBot robots.txt traffic)
The charts clearly show GPTBot hammering robots.txt across multiple IPs, sometimes 7 times in 2 days from the same subnet. Unlike Googlebot, which fetches robots.txt a few times per day and caches the rules, GPTBot seems to re-check every time it rotates IPs or restarts.
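If you want to reproduce this kind of count without Vercel Observability, something as small as the sketch below will do; the log file name, the combined log format, and the user-agent matching are my assumptions here, not part of the setup shown in the screenshots.

```ts
// Minimal sketch: count robots.txt fetches per crawler and client IP from a
// combined-format access log. File name and format are assumptions.
import { readFileSync } from "node:fs";

const crawlers = /GPTBot|OAI-SearchBot|ClaudeBot/i;
const counts = new Map<string, number>();

for (const line of readFileSync("access.log", "utf8").split("\n")) {
  if (!line.includes("GET /robots.txt")) continue;
  const ip = line.split(" ")[0];                      // first field: client IP
  const ua = line.match(/"([^"]*)"\s*$/)?.[1] ?? "";  // last quoted field: UA
  const bot = ua.match(crawlers)?.[0];
  if (!bot) continue;
  const key = `${bot} ${ip}`;
  counts.set(key, (counts.get(key) ?? 0) + 1);
}

// Noisiest (crawler, IP) pairs first.
for (const [key, n] of [...counts].sort((a, b) => b[1] - a[1])) {
  console.log(`${String(n).padStart(4)}  ${key}`);
}
```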


(Screenshot 3: OpenAI AI crawlers traffic pattern)
This pattern suggests there’s no centralised “consent” store for the crawler. Every new instance behaves like a fresh bot, wasting its crawl budget on permission checks.
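Conceptually, the fix is not complicated: a shared robots.txt cache keyed by host, with a TTL (RFC 9309 explicitly allows caching, typically for up to 24 hours), so that every new crawler instance reuses the stored rules instead of re-fetching them. A rough sketch of the idea, purely illustrative and not how any specific crawler actually works:

```ts
// Illustrative shared robots.txt cache keyed by host. TTL and fetch details
// are assumptions for the sketch, not a real crawler implementation.
const TTL_MS = 24 * 60 * 60 * 1000; // reuse cached rules for up to 24 hours

type RobotsEntry = { body: string; fetchedAt: number };
const robotsCache = new Map<string, RobotsEntry>();

async function getRobots(host: string): Promise<string> {
  const cached = robotsCache.get(host);
  if (cached && Date.now() - cached.fetchedAt < TTL_MS) {
    return cached.body; // any crawler instance reuses this, no extra request
  }
  const res = await fetch(`https://${host}/robots.txt`);
  const body = res.ok ? await res.text() : "";
  robotsCache.set(host, { body, fetchedAt: Date.now() });
  return body;
}
```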

2. sitemap.xml Inconsistencies


I’ve tracked two different projects, and the behaviour is inconsistent. On one site, GPTBot fetched the sitemap exactly once in a month. On another, it skipped the sitemap entirely but went straight for content. Meanwhile, Anthropic’s ClaudeBot actually hit the sitemap multiple times.

The missing piece here is a smart algorithm that keeps score over time for each website. Google solved this years ago: it doesn’t blindly trust every lastmod tag, but instead builds a trust score for each domain based on history, accuracy, and freshness signals. That’s how it decides whether to treat a sitemap update seriously or to ignore it.
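Nobody outside Google knows the exact formula, but a toy version is easy to imagine: reward domains whose lastmod bumps turn out to match real content changes, penalise the ones that cry wolf, and only take freshness signals at face value above some threshold. All names, weights, and thresholds below are invented for illustration:

```ts
// Toy per-domain sitemap trust score (weights and threshold are made up).
// Each time a <lastmod> bump is verified against the fetched page, the score
// moves up or down; false freshness claims cost more than accurate ones gain.
function updateTrust(score: number, lastmodBumpWasReal: boolean): number {
  const delta = lastmodBumpWasReal ? 0.05 : -0.1;
  return Math.min(1, Math.max(0, score + delta));
}

// Only domains above the threshold get their lastmod bumps honoured eagerly.
function shouldRecrawlEagerly(trustScore: number): boolean {
  return trustScore >= 0.5; // arbitrary cut-off
}
```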

AI crawlers aren’t doing this yet. They either underuse sitemaps or waste fetches on them without consistency. To improve, AI labs need to adopt a similar scoring system. Or, as I strongly suspect from patterns I’ve seen, they may simply partner with Google Search and tap into its index instead of reinventing crawling from scratch.

Side note: I’ve even seen OpenAI API results that looked suspiciously close to Google Search outputs ...

(Screenshot 4: AI crawlers traffic pattern - Vercel Observability Query builder)
3. Crawling Old Content Repeatedly (and the Budget Problem)


This is where the inefficiency really shows. Bots keep returning to old content instead of discovering what’s new. Even when they’ve seen the sitemap, they often ignore it and waste their crawl budget revisiting stale pages.

There should be a smarter way to surface new material—and honestly, respecting lastmod in sitemap.xml would solve a lot of this. I really hope someone on the search teams at OpenAI and Anthropic is reading this.
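For reference, everything a crawler needs for this is already one line per URL in a standard sitemap; a placeholder example with invented URLs and dates:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Placeholder sitemap: URLs and dates are invented. lastmod is the signal
     this post is asking crawlers to respect. -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/blog/fresh-post</loc>
    <lastmod>2025-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/blog/old-post</loc>
    <lastmod>2023-06-01</lastmod>
  </url>
</urlset>
```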

From what I see:


  • Crawling budgets are tiny. Sometimes a bot “spends” its limited fetches just on robots.txt and pages it has already crawled.


  • No centralised rule cache. Each IP acts independently, re-checking permissions and burning requests on duplicate work.


  • Unstable sessions. The pattern of repeated restarts suggests crawler instances spin up and down often, leading to wasted quota.

And that’s why your fresh blog post doesn’t get fetched, while your robots.txt enjoys multiple visits per day.
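A budget-aware crawler could avoid most of this with very little logic: spend the per-host quota on URLs that are new or whose lastmod has changed since the last fetch, and skip everything else. The sketch below is invented pseudologic in that spirit, not a description of any real crawler:

```ts
// Invented sketch of budget-aware URL selection: prefer new or changed URLs
// over pages already fetched with an unchanged lastmod.
type SitemapUrl = { loc: string; lastmod?: string };

function pickUrls(
  sitemapUrls: SitemapUrl[],
  seen: Map<string, string | undefined>, // loc -> lastmod at previous fetch
  budget: number
): string[] {
  const fresh = sitemapUrls.filter(
    (u) => !seen.has(u.loc) || seen.get(u.loc) !== u.lastmod
  );
  return fresh.slice(0, budget).map((u) => u.loc);
}
```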

4. The Static Assets Surprise (a.k.a. Bots Running Headless Browsers)


Now here’s the real surprise: OpenAI’s crawler sometimes downloads static assets. Next.js chunks, CSS, polyfills. That almost certainly means it’s firing up a headless browser and actually rendering the page. Rendering at scale is expensive, so seeing this in the logs is like catching the bot red-handed burning VC money on your webpack bundles.

Developers, let’s be honest: we shouldn’t force AI labs to reinvent Google Search’s rendering farm from scratch. The sane thing is still to serve content via SSR/ISR so crawlers don’t have to play Chromium roulette just to see your page. Otherwise you’re basically making Sam Altman pay more to crawl your vibe-coded portfolio site.
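For Next.js in particular (the framework whose chunks keep showing up in these logs), opting a route into ISR is a one-line export in the App Router. A minimal sketch with a placeholder data source and an arbitrary revalidation window:

```tsx
// app/blog/page.tsx -- minimal ISR sketch (the data source is a placeholder).
// The HTML is generated on the server and revalidated at most once per hour,
// so a crawler gets real content without having to run a headless browser.
export const revalidate = 3600; // seconds

export default async function BlogIndex() {
  const posts: { slug: string; title: string }[] = await fetch(
    "https://example.com/api/posts"
  ).then((r) => r.json());

  return (
    <ul>
      {posts.map((p) => (
        <li key={p.slug}>
          <a href={`/blog/${p.slug}`}>{p.title}</a>
        </li>
      ))}
    </ul>
  );
}
```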

The funny bit? This is great news for vibe coders. All those sites built with pure CSR (the “AI slop” nobody thought would ever be indexable) might now actually get pulled into GPTBot’s memory. Your prayers have been heard... at least until the crawl budget runs out.

Fun fact: some vibe coding tools default to CSR, which is basically SEO-assisted suicide. If you care about visibility, whether in Google or in AI engines, please stop.


(Screenshot 5: GPTBot downloading static assets, including some hallucinated 404s)
5. What This Means for AI SEO


The good news:
OpenAI and Anthropic at least play by the rules. They ask permission before scraping, unlike the swarm of shady scrapers hitting your site daily.
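Playing by the rules also means they read the directives you publish, so robots.txt is where you express that consent. An illustrative file using the documented user-agent tokens; the policy, paths, and sitemap URL are placeholders:

```txt
# Illustrative robots.txt: user-agent tokens are the documented ones,
# the policy and URLs are placeholders.
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

Sitemap: https://example.com/sitemap.xml
```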

The bad news:

  • Crawl budgets are tiny and often wasted.
  • Fresh content gets ignored.
  • Sitemaps and lastmod aren’t respected.
  • JS rendering happens only occasionally, so CSR-only sites are still at risk of being invisible.
Closing Thought


Google has a 25-year head start in crawling, indexing, and ranking. AI crawlers are still in year one of that journey. They’re not true search engines yet, but the scaffolding is going up fast.

If anyone from OpenAI, Anthropic, or xAI is reading this: please, implement smarter crawl budgets and start respecting sitemap freshness. Otherwise, all we’ll get is bots lovingly revisiting robots.txt while the real content sits untouched.
Godspeed

