We don’t normally expect to wake up and find Majestic brand names slapped all over the Ahrefs blog, but that’s exactly what happened yesterday (Thursday 22nd May 2025). A new blog post by Patrick Stox and Xibeijia Guan popped up on our radar, titled “The SEO Bots That ~140 Million Websites Block the Most”.

Ahrefs were generous enough to give MJ12bot a few mentions. Joining MJ12Bot was a range of SEO crawlers, most notably AhrefsBot and SemrushBot. These three were highlighted as SEO crawlers with a significant presence across Ahrefs’ analysis of millions of robots.txt files.

At Majestic, we’ve had a fair few discussions on robots.txt analysis of late as we’ve worked to launch a new project. OpenRobotsTxt.org is intended to be a living archive of robots.txt files. The archive is updated often, and automated analysis of the data produces regularly refreshed reports. There is a clear similarity between the sort of analysis OpenRobotsTxt performs and the analysis Ahrefs performed and reported on in their recent post.

We aren’t sure if the launch of OpenRobotsTxt served as a catalyst for Ahrefs to release their own study, or whether the two studies were released so close together by coincidence.

Before we begin, we recognise that backlink indexes may not have the best record when it comes to comparative analysis. Many in our industry will be aware of a rich history of backlink database vendor studies. An approximation of this process is that a vendor commissions a study. The study then finds that the vendor sponsoring the study has the best product. Then other vendors argue about why the study is unfair and biased. Customers get bored and move on. Some time passes. The previous study becomes a distant, faint memory and some vendor decides now is a good time to commission another study. The process repeats. And repeats. And repeats. You get the idea.

We hope the following comes across as a constructive approach to feedback, but please do let us know if you feel we’ve missed the mark. We hope we can use this opportunity to highlight some of the underlying assumptions made in the analysis of data for the OpenRobotsTxt project, while passing what we hope is fair comment on Ahrefs’ latest analysis.

A quick detour into OpenRobotsTxt – and why we feel qualified to comment.

On Thursday, May 15th 2025, Majestic announced the launch of the OpenRobotsTxt.org project. OpenRobotsTxt aims to archive and analyse the world’s robots.txt files. The project was seeded with a HUGE data dump from Majestic, and now sees regular updates.

OpenRobotsTxt.org is a long term project. New data is added on an ongoing basis and is automatically analysed to produce reports. The robots.txt statistics produced are shared with the community under a Creative Commons licence.

The project has a mission to archive and analyse the world’s robots.txt files. OpenRobotsTxt aims to inform debate around robots.txt, user-agents and web crawling.

There is some overlap between the OpenRobotsTxt project and Ahrefs’ robots study.

A comparison of findings between the Ahrefs robots study and OpenRobotsTxt

The Ahrefs study’s key findings are presented by way of three data points, shown below:

% of websites in the study which block, based on a study of 140 million “root domains”:

  • MJ12Bot (Majestic): 6.49%
  • SemrushBot: 6.34%
  • AhrefsBot: 6.31%

This gives an average of (6.49% + 6.34% + 6.31%) / 3 = 6.38% across the three user agents. Ahrefs and Semrush are below the average; MJ12bot is above it.

Ahrefs state their study is based on data gathered by their own crawler, AhrefsBot. We therefore assume that the study will exclude sites that perform a server-side bot block on AhrefsBot. This aspect may add a margin of error to the data shown above. Given the relatively small differences between the disallow counts, and the single-vendor nature of this study, it seems fair to interpret the Ahrefs study as giving a general indication of an approximate 6.4% block rate on leading SEO bots in the robots.txt files they have analysed. This is a useful benchmark for crawler operators everywhere. We are grateful to Ahrefs for sharing this finding.

The OpenRobotsTxt project reports data on user-agent blocks slightly differently from the Ahrefs study, but we can add up the different columns relating to disallows to produce a total. The result is a percentage figure showing the number of times a bot is mentioned in a disallow context as a proportion of sites in the study:

% of websites which disallow something based on user-agent, from a study of ~600 million hostnames:

  • MJ12Bot (Majestic): 0.5%
  • SemrushBot: 0.46%
  • AhrefsBot: 0.93%

It should be noted that the high figure for Ahrefs in this table includes a disproportionately large number of path-based disallow directives. Further analysis would be needed to measure the impact of these directives.
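For readers who want to reproduce this kind of figure from the OpenRobotsTxt download themselves, a minimal sketch of the calculation follows. The file name and column names are placeholders rather than the actual headers in the download, so check the CSV before running; only the approach (sum the disallow-related counts, then divide by the hostnames in the study) is being illustrated.

```python
import csv

# Sketch only: sum the disallow-related columns for a user agent and express
# the total as a percentage of hostnames in the study. The column and file
# names below are placeholders, not the actual OpenRobotsTxt headers.

TOTAL_HOSTNAMES = 600_000_000  # approximate study size, as described above
DISALLOW_COLUMNS = ["disallow_all", "some_disallows", "mixed_allow_disallow"]

def disallow_percentage(row: dict) -> float:
    """Disallow-related mentions as a % of all hostnames in the study."""
    disallows = sum(int(row[col]) for col in DISALLOW_COLUMNS)
    return 100.0 * disallows / TOTAL_HOSTNAMES

with open("user_agent_stats.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        if row["user_agent"].lower() == "mj12bot":
            print(f"MJ12bot: {disallow_percentage(row):.2f}% of hostnames")
```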

There is a significant difference between the two sets of data: roughly an order of magnitude separates the Ahrefs figures from the OpenRobotsTxt figures.

Why?

It would be unfair for us to make too many assumptions about the Ahrefs study. We can, however, share more information about the make-up of the 600 million hostnames used in the OpenRobotsTxt project.

The most important principle is that the OpenRobotsTxt hostnames statistic is based on resolvable hostnames. A robots.txt file is not mandatory for inclusion, so this number includes hostnames that do not have a robots.txt file at all. This means that the above statistic suggests that 0.5% of websites explicitly disallow at least some path to MJ12Bot in robots.txt. Not all hostnames have a robots.txt, and a 404 on robots.txt is typically interpreted by most crawlers as permission to crawl.
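As an aside, Python’s standard library robots.txt parser follows the same convention, which makes the point easy to illustrate: a 404 when fetching robots.txt leaves the parser in an allow-all state.

```python
from urllib import robotparser

# Illustration of the convention described above using the standard library
# parser: a 404 when fetching robots.txt leaves the parser in an allow-all
# state, mirroring how most crawlers treat a missing robots.txt.

rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()  # a 404 here sets allow-all internally; 401/403 mean disallow-all

if rp.can_fetch("MJ12bot", "https://example.com/some/page"):
    print("No robots.txt (or no matching rule): crawling is permitted")
```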

Another important consideration with the OpenRobotsTxt dataset is that it aims to be protocol agnostic. This is based on the theory that most HTTPS websites serve the same content as their HTTP equivalents. Not being protocol agnostic in the analysis would risk double counting robots.txt files (and hence blocks) where they appear on both the HTTP and HTTPS versions of a website. We do not believe it is reasonable to think of 600 million hostnames as representing 1.2 billion possible robots.txt files.
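As a toy illustration of what protocol-agnostic counting means in practice (this is not the project’s actual pipeline), collapsing URLs to bare hostnames means the HTTP and HTTPS versions of a site contribute a single entry:

```python
from urllib.parse import urlsplit

# Toy example: collapse URLs to bare hostnames so that the HTTP and HTTPS
# versions of the same site are only counted once.

urls = [
    "http://example.com/robots.txt",
    "https://example.com/robots.txt",
    "https://blog.example.com/robots.txt",
]

hostnames = {urlsplit(u).hostname.lower() for u in urls}
print(sorted(hostnames))  # ['blog.example.com', 'example.com']
print(len(hostnames))     # 2, not 3 -- no double counting across protocols
```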

As with any study, it’s worth noting that the two studies are based on different datasets generated by different web crawlers. Different web crawlers will experience websites in different ways, may interpret root domain names differently, and will also have their own, differing noise-reduction techniques which may change how many subdomains are sampled to produce hostname lists.

Our view on methodology and reporting in the Ahrefs study

The post shares some interesting aspects on the methodology behind the study:

  1. The dataset appears to only include sites which host robots.txt files. We believe inclusion is limited to those files found by the AhrefsBot SEO crawler.
  2. The Ahrefs study ignores “other block types such as firewalls or IP blocks”. This may have a significant impact on the conclusions. An important feature of this impact is that if AhrefsBot is IP blocked on a server, the crawler may be prevented from accessing the robots.txt file. This failure to access may result in these sites being omitted from the study. That is to say, sites that only block AhrefsBot at the server level may be excluded from the study and hence not show a positive “allow” state for other crawlers.
  3. The blog post seems to report on three different sample sets: a 140 million root-domain-level test, a 461M hostname check, and a top sites (DR > 45) sample set.
  4. The hostname-level analysis comprises 461M robots.txt files, and finds that Semrush appears to be the most blocked in this dataset.
  5. The top sites report finds that Semrush is, again, the most blocked in this dataset.
  6. MJ12bot appears to be reported as the most blocked bot for the 140 million root domain sample.

There appear to be at least three data sets analysed. In two of them, Semrushbot appears to be identified as the most blocked bot; MJ12Bot is suggested to be the most blocked in only one of the three.

Some observations or thoughts are presented on MJ12bot in the post:

  1. “They’re a distributed crawler, meaning you can’t look up or block them by IPs, which makes them less trusted.”
  2. “They’ve been crawling the web for longer.”
  3. “They have a smaller user base than more popular SEO tools and therefore less leverage to remove any blocks.”

On the third point, we hold our hands up and admit to being the plucky underdog when measured against the SEO tool giants of Semrush and Ahrefs. We aren’t sure how size provides leverage. However, please rest assured we don’t intend to send the lads round to “educate” webmasters who may take issue with bots from time to time.

We also accept that we’ve been crawling the web for longer than Ahrefs and Semrush. Of the three, MJ12Bot was first, Ahrefs came some time later, with Semrush being a more recent entrant in the backlinks analysis field. Given that Semrushbot has operated for a shorter period of time, the level of disallows seemed noteworthy. We were interested to read that Ahrefs found that Semrushbot appears to be disallowed in a similar volume to Ahrefsbot and MJ12Bot. It will be interesting to see if and when the Semrushbot disallow count overtakes other SEO crawlers.

The remaining point refers to the distributed crawl model Majestic uses. It’s no secret that MJ12bot works on a distributed, community model of crawling. The preference for robots.txt level blocks over server-side blocks exhibited by the MJ12Bot user agent has been established for some time.

Mentions vs Disallows

Not every mention in a robots.txt file is problematic for an SEO crawler. Some can actually be great news.

The Ahrefs study focuses only on Disallows, whereas the OpenRobotsTxt project captures a range of signals from robots.txt files.

Some SEO crawlers, like Ahrefs and Majestic, are seen in significant numbers of Allow directives. To some degree, the proportion of mentions that are explicit Disallows could be interpreted as a webmaster sentiment score for user agents. That is, a mention in robots.txt that isn’t a disallow is evidence that webmasters know about the agent but don’t wish to block it.

OpenRobotsTxt produces this “Sentiment Score” figure for all User Agents.
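In essence the score is just the disallow share of all mentions. A minimal sketch of the idea follows; the counts are invented for illustration, and we aren’t claiming this matches the exact set of columns OpenRobotsTxt treats as a disallow.

```python
# Sketch of the sentiment idea: what share of a user agent's robots.txt
# mentions actually disallow something? Lower is friendlier. The counts
# below are invented purely for illustration.

def sentiment(disallow_mentions: int, total_mentions: int) -> float:
    """Percentage of mentions that are disallows (lower is better)."""
    return 100.0 * disallow_mentions / total_mentions if total_mentions else 0.0

print(f"{sentiment(disallow_mentions=340, total_mentions=1000):.0f}%")  # 34%
```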

For the “Big Three” we’ve discussed so far, OpenRobotsTxt reports the following sentiment:

Sentiment (% of mentions that are Disallows), where lower is better:

  • AhrefsBot: 34%
  • SemrushBot: 69%
  • MJ12Bot (Majestic): 40%

This interpretation of OpenRobotsTxt data suggests that while AhrefsBot may be the most mentioned SEO crawler of the three, it is also the most popular by this measure. SemrushBot seems to lag somewhat on this score. We suspect the reason MJ12Bot and AhrefsBot both do well is that both tools offer an enhanced service to domain owners who verify their website.

This exercise highlights that being mentioned in robots.txt is not necessarily bad.

And, credit where credit is due. Congratulations are due to all of the Ahrefs team for their achievement. Their effort to win over webmasters has resulted in the best webmaster sentiment score across the three SEO Crawlers listed.

Understanding the OpenRobotsTxt User-Agent analysis data

To produce the analysis shared on OpenRobotsTxt, archived robots.txt files are examined. As part of this process, user agents are normalised to reduce noise and better reflect robots.txt author intent.
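The exact normalisation rules aren’t spelled out here, so the sketch below is purely illustrative of the general idea: collapsing cosmetic variants of the same user-agent token so that the same author intent is not counted as several different agents.

```python
# Purely illustrative: OpenRobotsTxt's real normalisation rules may differ.
# The idea is to collapse cosmetic variants of a user-agent token so that
# the same author intent is not counted as several different agents.

def normalise_ua(token: str) -> str:
    """Lower-case a token and drop any trailing version, e.g. 'MJ12bot/2.0' -> 'mj12bot'."""
    return token.strip().lower().split("/")[0]

for raw in ("MJ12bot", "mj12bot/2.0", " Mj12Bot "):
    print(normalise_ua(raw))  # all three collapse to 'mj12bot'
```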

Following parsing of the files, a number of statistics are produced. These stats are available to download from the openrobotstxt.org website and are shared under a Creative Commons licence.

The columns shown are:

  • Mentions (when the User-Agent is mentioned in a robots.txt directive)
  • Mentions as a % of overall study
  • Disallow all
  • Disallow all as a percentage of mentions
  • Allow all
  • A mixture of allow directives following an explicit disallow
  • A range of disallow lines
  • A mixture of disallow and allow lines
  • Mentions where there is no impact on allow or disallow, such as crawl delay
  • Where rules conflict (fortunately tends to be relatively small)

With the exception of the percentage figures, the “mentions” column is a total of all of the columns to the right.

The resulting analysis of over 37,000 user agents (at the time of writing) is available to download now. The data is in CSV format, so it is easy to work with in Excel or Python. To get a feel for the data, a small number of summary tables are presented on the site.
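If you want to verify the relationship described above (that “mentions” is the total of the breakdown columns) once you have downloaded the CSV, a short sanity-check sketch follows. The file and column names are placeholders; substitute the real headers from the file you download.

```python
import csv

# Sanity-check sketch based on the column description above: for each user
# agent, "mentions" should equal the sum of the breakdown columns to its
# right (the percentage columns are excluded). Column and file names are
# placeholders -- check the header row of the real CSV before running.

BREAKDOWN_COLUMNS = [
    "disallow_all", "allow_all", "allow_after_disallow", "some_disallows",
    "mixed_allow_disallow", "no_allow_impact", "conflicting_rules",
]

with open("user_agent_stats.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        breakdown_total = sum(int(row[col]) for col in BREAKDOWN_COLUMNS)
        if breakdown_total != int(row["mentions"]):
            print(f"Mismatch for {row['user_agent']}: "
                  f"mentions={row['mentions']}, breakdown total={breakdown_total}")
```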

To wrap up.


Data studies are a good thing. They help inform our industry. Huge kudos is due to Xibeijia Guan for an awesome piece of analysis.

That said, we do not fully accept the conclusions as presented in Ahrefs’ post. Our objections and concerns are not with the study itself but with the presentation of it. There has clearly been a significant amount of work put into the analysis, which we respect and admire. We have tried to be constructive in our response and to use this as an opportunity to provide background on our perspective on the differences in approach between Ahrefs and OpenRobotsTxt.

We do welcome any thoughts you may have on OpenRobotsTxt, the Ahrefs study and our response.

There’s not much more to say here. If you are interested in robots.txt analysis, please head over to OpenRobotsTxt.org
