Today we are announcing a jump in the QUALITY of our index, whilst also removing URLs that are SO bad that Google dropped them years ago. Over the last few months, people will have seen the size of our index increase dramatically, and we have now reached the point where we need to decide whether to crawl faster… or just smarter.
Kudos to SEOMoz for building this into their model some time ago. We don’t intend to decrease our URL count to anything like the Moz index (at the time of writing they are reporting 89 billion URLs, whilst we are reporting 498 billion in our Fresh Index alone!) but with our index at this size, we have been able to identify huge numbers of URLs on domains that (it seems) are designed just to mess up crawlers.
We do not intend to show any comprehensive list of URLs we are excluding from our index (what would be the point of an index of de-indexed URLs?), but suffice to say that any URL that has merit should be exempt from this cull.
What did you cull?
Here is an example:
http://tbod.asia did (in our previous index) have 9,157,905 URLs that we knew of. However – their value (and the value of the site) is such that even Google has dropped the whole domain:
Unfortunately, the .asia TLD seems particularly affected by this mechanical spam, and we (most likely) only found these dropped pages by crawling some of the domain's internal pages. Indeed – once the index has been culled, we STILL index web pages on dropped domains that have even the remotest merit.
Will this affect Penguin Investigations?
We do not believe so. In terms of value, the URLs we have discounted are not really different from session variables. Google are not penalizing these URLs per se. A cursory glance suggests they are dropping these URLs from their index (and most likely their crawl) as well. Believe it or not, there is a class of URLs that are even lower in value than those penalized by Penguin. These URLs never get indexed… let alone penalized… because Google wants to crawl smarter as well. Crawling the web is expensive – and being more efficient and smarter at it should be every crawler’s goal.
How many fewer links will I see to my site then?
Actually, for most people, none! To see a link TO your site, we need to have crawled the actual page that carries the link. The URLs we are removing are mostly ones that we have seen exist, but we have never had any signal prompting us to get around to crawling them, so they were never contributing links to your counts in the first place.
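To make the point above concrete, here is a minimal sketch of why dropping seen-but-never-crawled URLs leaves link counts unchanged. This is an illustration only and not how our crawler is actually built: the `UrlRecord` structure, the `signals` counter, the `cull` and `links_to` functions and the example URLs are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UrlRecord:
    """One URL known to a hypothetical index (illustrative structure only)."""
    url: str
    crawled: bool = False   # True once the page itself has been fetched
    signals: int = 0        # hypothetical count of reasons to crawl this URL
    outlinks: List[str] = field(default_factory=list)  # links found on the page


def cull(index: Dict[str, UrlRecord]) -> Dict[str, UrlRecord]:
    """Keep a URL if it was crawled or has at least one signal; drop the rest."""
    return {
        url: rec
        for url, rec in index.items()
        if rec.crawled or rec.signals > 0
    }


def links_to(index: Dict[str, UrlRecord], target: str) -> int:
    """Count links pointing at `target`. Only crawled pages have known outlinks."""
    return sum(rec.outlinks.count(target) for rec in index.values() if rec.crawled)


def demo_index() -> Dict[str, UrlRecord]:
    """Build a tiny made-up index: one crawled page, two never-crawled URLs."""
    records = [
        UrlRecord("http://blog.example/post", crawled=True, signals=3,
                  outlinks=["http://mysite.example/"]),
        UrlRecord("http://spam.example/page?id=1"),
        UrlRecord("http://spam.example/page?id=2"),
    ]
    return {rec.url: rec for rec in records}


if __name__ == "__main__":
    index = demo_index()
    target = "http://mysite.example/"
    print("URLs before cull:", len(index), "links:", links_to(index, target))
    culled = cull(index)
    print("URLs after cull:", len(culled), "links:", links_to(culled, target))
```

In this toy index the URL count shrinks while the number of links to the target site stays the same, because the removed URLs were never crawled and so never had any outlinks recorded against them.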
Why Does This Change Make Majestic Better?
Quite simply – scale. Imagine what ELSE we can do for you if we free up the machines spent on crawling and indexing 150 billion URLs. That is roughly 1.7 times the size of SEOMoz’s entire index of 89 billion URLs.
Taking these out will:
- Make everything else faster
- Allow us to look at collecting more data in the future
- Let us update Flow Metrics and our indexes more frequently
What if my site gets caught up in this change?
We have had one support request querying link count changes since the update, so if we get more then we can tweak this change. We really don’t think it will affect most normal sites (including penguined sites), but you are free to contact support and we will look at your site (you’ll need to be specific). Nothing is irretrievable, but crawling smarter is a move towards a better tool set.