This post has come from several imperatives. Firstly, I was talking about crawler issues with Vanessa Fox and David Burgess a couple of weeks ago at SMX Advanced in New York and wanted to share the presentation. Secondly, our users are saying that, in the light of Penguin, Google is not reflecting link removals fast enough for them. The same lag affects other link intelligence data sets, and frankly I think Majestic is doing better than most, but it is nowhere near perfect yet.
MajesticSEO cannot speak for Google. We do not “scrape” Google or try in any way to replicate their index. Our data is our own and there are remarkable differences. However, there are some interesting challenges for search engine spiders when it comes to scale. Even though you can replicate a spider over many, many machines, ultimately, however big your ability to crawl the web, you have to make choices about how best to marshal your crawl resources. If you are a “Johnny come lately” crawler that needs to build up its database as quickly as possible, then you would naturally focus on crawling new web pages, or pages on websites where you knew there were lots of outbound links (rather than pages you knew were of high quality, for example). You have limited resources, so you have to make choices. The problem here is that the more you push your efforts toward “discovery”, the less you have left for “verification”, that is, for revisiting pages you already know about to see whether they have changed.
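To make the discovery versus verification trade-off concrete, here is a minimal sketch in Python. It is not how Majestic's or Google's crawler works; the function name, the fixed-share policy and the `discovery_share` parameter are all assumptions made up for illustration.

```python
def plan_crawl(new_urls, known_urls, budget, discovery_share):
    """Split a fixed crawl budget between discovery and verification.

    new_urls:        URLs never fetched before (discovery).
    known_urls:      URLs fetched before that may have changed (verification).
    budget:          total number of fetches available in this round.
    discovery_share: fraction of the budget (0.0 to 1.0) reserved for new URLs.

    Hypothetical policy for illustration only.
    """
    discovery_slots = min(len(new_urls), round(budget * discovery_share))
    verification_slots = min(len(known_urls), budget - discovery_slots)
    # Any slots verification could not use are handed back to discovery.
    spare = budget - discovery_slots - verification_slots
    discovery_slots = min(len(new_urls), discovery_slots + spare)
    return new_urls[:discovery_slots], known_urls[:verification_slots]


new = [f"http://example.com/new/{i}" for i in range(10)]
known = [f"http://example.com/old/{i}" for i in range(10)]

# A crawler in a hurry to grow spends most of its budget on discovery,
# which leaves only a couple of fetches for rechecking old pages.
to_discover, to_verify = plan_crawl(new, known, budget=10, discovery_share=0.8)
print(len(to_discover), len(to_verify))
```

With ten fetches and an 80% discovery share, only two known pages get rechecked in the round. Raise the share further and removed links on old pages take even longer to be noticed.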
A search engine crawler’s primary goal is not to weed out the bad links. It is to find content so that another part of the algorithm can separate the good from the bad and return the GOOD to users. If, at the crawl level, a search engine can discount large amounts of the bad by not even recrawling it, perhaps because it is not of a quality that merits revisiting very often, then this can dramatically improve the efficiency of the crawler by focussing its resources on the stuff that matters.
If you are being penalised for the bad stuff, this really isn’t very helpful, because the low-quality pages your links were removed from are exactly the ones that get recrawled least often. Majestic solves its own dilemma of showing quality and current relevance by distinguishing between fresh data (links from URLs we have seen within a two-month timespan) and everything else going back over five years. After 60 days, if a link isn’t worth seeing, we won’t have recrawled it and it will have dropped out of the Fresh Index but will still be in the Historic Index. But Fresh helps to see the GOOD stuff, not the bad stuff. One way Google tries to solve it is by giving you “fetch as Googlebot” in Webmaster Tools, so you can effectively tell Google when a site has changed. Another way is by asking you to use caching commands… see the presentation below for more on this.
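The Fresh and Historic split can be pictured as a simple date check. The sketch below is an illustration only, not Majestic's actual code: the function name is made up, and treating a link seen exactly 60 days ago as still fresh is my assumption.

```python
from datetime import date, timedelta

# The 60-day window described above. Whether day 60 itself still counts
# as fresh is an assumption for this sketch.
FRESH_WINDOW = timedelta(days=60)


def in_fresh_index(last_crawled, today):
    """Return True if the page carrying the link was crawled recently enough
    for the link to count as fresh. Older links are historic only."""
    return today - last_crawled <= FRESH_WINDOW


today = date(2012, 10, 1)
print(in_fresh_index(date(2012, 9, 1), today))  # crawled 30 days ago
print(in_fresh_index(date(2012, 6, 1), today))  # crawled 122 days ago
```

The first link was last seen 30 days before the check, so it counts as fresh. The second was last seen 122 days before, so it would only show up in the historic data.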
If you have removed links and need that change to be picked up, maybe there is a need to add a step to your cleanup process: ask the site owner not only to remove the offending links, but also to “fetch the page as Googlebot” afterwards, to help Google update faster at its end. The only suggestion I can make is to reassure the site owner that this is quick and painless for them and will only serve to show Google that the site is improving its focus on quality. I cannot speculate on whether site owners will agree with you, though.
So back to the question of allocating resources for large spiders. Majestic does this through its Crawler Controller, and the thing is, our Crawler Controller will never really run out of things to crawl, so we need to maintain rules of engagement for the controller that keep a sense of balance. We need to look at new content, but also respect that old content may change, and keep one eye on the most important pages and maybe revisit these more often than others. Webmasters can help tremendously in making any crawler more efficient, not least by avoiding duplicate content, which forces spiders to crawl exactly the same thing twice or more. Often this happens on what is apparently, to a human, exactly the same URL, but a URL to a computer is like a phone number. Put +1 in front of a phone number and many humans will know that this is not required for people dialling within the US. For a computer, this is a different phone number unless the programmer has gone out of his way to merge different variations of the same number onto one record.
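Here is roughly what “going out of your way to merge variations” looks like for URLs, as a minimal Python sketch using the standard library's `urllib.parse`. The rules and the `TRACKING_PARAMS` list are assumptions chosen for illustration; which variations are really the same page differs from site to site, and this is not a description of any particular crawler.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of parameters that do not change the page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def normalise(url):
    """Collapse some common variations of a URL onto one record."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    port = parts.port
    is_default = (scheme == "http" and port == 80) or (scheme == "https" and port == 443)
    if port and not is_default:
        host = f"{host}:{port}"
    path = parts.path or "/"
    # Drop tracking parameters and put the rest in a stable order.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in TRACKING_PARAMS]
    query = urlencode(sorted(kept))
    # The fragment (#...) is dropped because it is never sent to the server.
    return urlunsplit((scheme, host, path, query, ""))


variants = [
    "HTTP://Example.com:80/page?b=2&a=1#top",
    "http://example.com/page?a=1&b=2&utm_source=newsletter",
    "http://example.com/page?a=1&b=2",
]
print({normalise(u) for u in variants})
```

All three variants collapse to a single record, so a crawler using rules like these would fetch the page once instead of three times. A webmaster can spare the crawler this guesswork by linking consistently to one form of each URL.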
So crawler issues affect search engines dramatically, which is probably why the session at SMX Advanced was so popular.
I have previously posted that presentation on the blog – but just for completeness, here it is again. I hope there are some takeaways for you on how to make your site more spider friendly. In the process you will be doing the world a favour and saving little spider legs from becoming over-tired.
I am sorry I do not have a transcript of what was said at the session.