I had an interesting support issue this week which every SEO should know about.

Don't lock out the good guys. If you have not verified your website in Majestic’s Webmaster Tools, there is a very real chance that your hosting company is restricting traffic to your website without your knowledge.

Some hosting companies try to keep their bandwidth costs down by not telling you that they are blocking some of your traffic – in particular traffic from bots. The problem is, they can only (easily) block the “good” bots – those that choose to identify themselves and therefore obey Robots.txt.

So these hosting companies are blocking the good bots, but letting in the bad bots, and you may not have any idea that they are doing it.

Most SEOs will verify a website on Google Webmaster Tools to get some insight into their SEO data. Clever SEOs will also verify with Bing’s Webmaster Tools, as they also have some neat features. You would think that only the absolute cheapest web hosting company would try to block these bots, but you would be surprised at how many do. So I would suggest that in your Webmaster Tools on Google you occasionally press the “fetch as Googlebot” button to make sure that your site is not blocked by the hosting company directly.

Beyond those two bots, some hosting companies may choose to block other bots. The problem is that the way they block is not via the Robots.txt but rather by their firewall… so you as a website owner have no idea that the spiders are being blocked. This means:

  • You might not be in search engines like Yandex or Baidu.
  • You may also not rank quite so highly in Majestic’s own search engine, “Search Explorer”.
  • Google has other bots, like http://www.google.com/feedfetcher.html, which may also be blocked.
  • Microsoft has other bots, like its media bot: MSN Bot.
  • WordPress sites will not get into news feeds.

Many, MANY data sets take signals from the data crawled by good bots (and bad bots), so blocking bots ought to come with a warning: there are countless ways in which it costs you real traffic. Any WordPress user, for example, has an RSS feed built into default installations, and when a new post is created, WordPress uses a system called Pingomatic to tell these services by default. Now if you have a specific problem with one of those services, you can always switch the ping off or block the service in Robots.txt, but what if your hosting company is blocking these at source? A bit unfair, don’t you think?

Can I check in my logs if my host is doing this?

Unlikely. Hosting companies typically answer these bots with a 403 error (not a 404 or a 500), and the block is applied before the request ever reaches your site, so it will not show up in your own logs. You need to check from outside the hosting company.
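Your logs can still give you a hint, though: if a well-known bot never appears in them at all, that is worth following up from the outside. Here is a minimal sketch of that idea in Python. It assumes your server writes the common "combined" access log format, and the bot names in `bot_tokens` are just examples to search for; neither detail comes from your host, so adjust both to your own setup.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer and user-agent fields of a
# "combined" format access log line (an assumption about your log format).
LOG_LINE = re.compile(
    r'"[A-Z]+ \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def bot_status_counts(lines, bot_tokens=("Googlebot", "bingbot", "MJ12bot")):
    """Count HTTP status codes per bot, matching bots by user-agent substring."""
    counts = {token: Counter() for token in bot_tokens}
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent").lower()
        for token in bot_tokens:
            if token.lower() in agent:
                counts[token][int(match.group("status"))] += 1
    return counts


def never_seen(counts):
    """Return the bots that do not appear in the log at all."""
    return [token for token, statuses in counts.items() if not statuses]
```

To use it, pass an open log file, for example `never_seen(bot_status_counts(open("access.log")))`, where `access.log` is a hypothetical path. A bot missing from the list does not prove a block (it may simply not have visited yet), so treat the result as a prompt to test from outside the hosting company.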

Verify your site in Majestic.

Do it now. If you know your Google Webmaster Tools login, it takes seconds. If you don’t, you can still do it in a few minutes. It checks that we are genuinely able to crawl your site. If we can’t, then either get your hosting company to let our bot in AND tell you what else they are blocking, so you can make your own decisions, or change hosting company. There are not many hosting companies that do this, but one or two are quite large and generally don’t know or care that they are hurting businesses.

By using our verification service, you can just be confident that your hosting company is not playing games with you.

Why should I let Majestic’s bot spider my site?

It’s a fair question. Bear in mind that you are really not “hiding” your website much by blocking our bot, as we show you links INTO a website. We do not have to actually crawl your site to know about it. If you look at a road map, it says nothing about the size of a town, but you can pretty much deduce this just by looking at the size and number of roads into town. It’s the same with the link graph. So blocking our bot only stops us looking at the links OUT of your website to other sites.

So the benefit of blocking our bot is moot. As for the benefits of making sure it gets in, here are some:

  • Verification is free and easy. As soon as you have done it, you can use Site Explorer and generate advanced reports for your own sites. This data is invaluable for any SEO and some of it is impossible to find anywhere else on the internet.
  • If our bot is blocked, others assessing the “value” of your site to them will look for links from your site into theirs and come up empty. They may then pass you by and never come to engage with you or your users.
  • Don’t forget that we are now a search engine ourselves. Whilst we do not expect consumers to flock to search on MajesticSEO (we are not naive) we now have one of the largest crawlers on the planet and a search API in development. Without going into specifics, can you imagine how many global players might think that this is a pretty useful API to use in their company/website/business offering?
  • Having said that MajesticSEO.com is unlikely to be a consumer facing search engine, we are currently ranked well within the top 1,000 websites in the world according to Alexa. So don’t write off direct (and free) traffic just because your web host was not playing fair with you.
  • We show you 404 errors (and many other error types) on your site, as well as links from third party sites to your domain that do not “complete”, showing you where you are missing even more traffic.

Verify your site now for free at https://www.majesticseo.com/webmaster-tools and give some consideration to doing it for your clients as well.

I can’t verify my client’s site. Can I still check?

Yes. As long as you have a Majestic SEO subscription, checking is really easy. Log in, then type the site into the home page of MajesticSEO.com. Click on the “Pages” tab. If every page is recorded as “403 Forbidden” then your web host is almost certainly blocking legitimate bots like ours from the site.

What if I have to pay for my bandwidth?

Most hosting companies do not charge by bandwidth, although the biggest businesses will get charged in this way. Even then, blocking legitimate bots is likely to cost you more in the long run: our data is used in many web applications around the world, so blocking our bot will ultimately also block real visitors to your site. If you are worried about bandwidth, the correct solution is to use the “Crawl-Delay” directive in Robots.txt to slow the crawl down.
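As a rough illustration, here is a small Python sketch that parses an example Robots.txt with the standard library and reads back the crawl delay a bot would see. The five second delay and the rules themselves are made-up values for the example, not a recommendation.

```python
from urllib.robotparser import RobotFileParser

# Example Robots.txt content (hypothetical values): ask MJ12bot to wait
# 5 seconds between requests, and leave every other bot unrestricted.
ROBOTS_TXT = """\
User-agent: MJ12bot
Crawl-delay: 5

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The delay (in seconds) that applies to our bot, or None if none is set.
mj12_delay = parser.crawl_delay("MJ12bot")

# The bot is slowed down, not shut out: it may still fetch the home page.
mj12_allowed = parser.can_fetch("MJ12bot", "/")

print(f"MJ12bot crawl delay: {mj12_delay}, allowed: {mj12_allowed}")
```

The point of the design is that the bot stays allowed while its request rate drops, which keeps your site in the data without the bandwidth spike.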

Can I check what happens when you crawl my site?

Here are three third party tools that will let you do that:

  1. Screaming Frog (paid version only) is a desktop web crawler that lets you change the user agent.
  2. FREE: SEO Book lets you easily select our user agent (MJ12Bot) with their Server Header Checker, here.
  3. FREE: W3C has an experimental web tool here (click options to add MJ12Bot as the user agent).

Each of these will tell you if you get a 403 or 406 response. If you do, please contact your hosting company or modify your site’s security settings to allow MJ12Bot in. In these cases, Robots.txt will NOT work!
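If you would rather script the same check, the sketch below requests a page twice, once with a bot-style user agent and once with a browser-style one, and compares the status codes. Both user-agent strings are assumptions made for illustration (the real MJ12Bot string and version may differ), and `fetch_status` and `diagnose` are hypothetical helper names. Note that a host that blocks by IP address rather than by user agent will not be caught this way.

```python
import urllib.error
import urllib.request

# Assumed user-agent strings for illustration only; check the bot's own
# documentation for the string it really sends.
BOT_UA = "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code the server gives this user agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses arrive as exceptions; the code is what we want.
        return error.code


def diagnose(bot_status, browser_status):
    """Compare the two status codes and describe what they suggest."""
    if bot_status in (403, 406) and browser_status == 200:
        return "bot blocked"
    if bot_status == browser_status:
        return "same response"
    return "different response"
```

For example, `diagnose(fetch_status(url, BOT_UA), fetch_status(url, BROWSER_UA))` returns "bot blocked" when the bot-style request gets a 403 or 406 while the browser-style request gets a 200. Run it from a machine outside your hosting company.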

How do I verify my site in your Webmaster Tools?

Log in with a FREE account and go here.

Dixon Jones

Comments

  • George

    That’s pretty weird thing about cheap hosting.What about about unlimited bandwidth do they also blocks good bots? One more doubt what about yandex and Baidu?

    January 21, 2014 at 7:05 am
    • Dixon Jones

      I certainly think that some western hosting companies block Baidu and Yandex and you would not even know unless you checked. Looking at our stats, they crawl much more successfully in their own countries, but I can’t really say whether that is down to their crawlers’ decisions or to hosting company interference. To be honest… some are even blocking Bing, I expect!

      January 21, 2014 at 12:33 pm
  • Mark

    Why I don’t see my backlinks friends, and thanks for the tools.

    January 24, 2014 at 5:23 am
  • SEOinUS

    This is pretty cool..

    January 28, 2014 at 11:19 am

