The web is very big. We now know for a fact that there are at least 138 bln unique urls out there, and this number comes from just 24 bln unique urls that we crawled (some of them more than once). Big search engines like Google (G) and Yahoo (Y) have stopped telling us how many pages they have crawled for their index, and while we can estimate that they probably have 30-35 bln crawled pages, how can we be sure that we crawl the same urls they crawled? We can look at the backlink counts they report, but this number does not tell us whether we have actually crawled the same pages as they did. The purpose of this research was to help us know whether we are getting closer to the index size and quality used by those top search engines. But how do we do it if they obviously won’t allow us a direct comparison with the data they have, let alone publish the results? Read on to find out how we solved this tricky problem to guide ourselves towards a better quality index that is getting closer to what Google and Yahoo have.
Our approach was simple, yet effective: we took a set of 20 different urls (big and small sites were included) and obtained the list of backlinks shown for them by Google and Yahoo. We then compared those lists with all backlinks in our index for the same urls to check how many of them we matched. We assume here that (as claimed by those search engines) the backlinks they show are a fairly random sample from the complete set that they have but won’t show. So the higher the percentage of those backlinks that are also present in our database, the more likely it is that our database is close to what those search engines have.
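The comparison described above can be sketched in a few lines of Python. This is only a minimal illustration of the idea, not our production code: the function name `match_ratio` and the sample urls are hypothetical, and we assume that both backlink lists have already been normalised to plain url strings.

```python
def match_ratio(reported_backlinks, indexed_backlinks):
    """Fraction of search-engine-reported backlinks also present in our index.

    Both arguments are iterables of url strings (assumed already normalised).
    """
    reported = set(reported_backlinks)
    if not reported:
        # Nothing was reported, so there is nothing to match against.
        return 0.0
    indexed = set(indexed_backlinks)
    return len(reported & indexed) / len(reported)


# Hypothetical sample: backlinks a search engine shows for one test url,
# and the backlinks our own index holds for the same url.
reported = [
    "http://a.example/links.html",
    "http://b.example/blog/post",
    "http://c.example/",
    "http://d.example/resources",
]
indexed = [
    "http://a.example/links.html",
    "http://c.example/",
    "http://e.example/other",
]

ratio = match_ratio(reported, indexed)
print(f"{ratio:.0%} of reported backlinks matched")
```

Only the reported backlinks form the denominator: extra backlinks that we have but the search engine does not show neither help nor hurt the ratio.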
There are two sources of data: the main one is our own web crawl, which we have been running since late 2004, and for quality verification of that data we use backlinks reported by Google and Yahoo.
The secondary measure that we use is the actual number of backlinks that we have compared with the number they report. In the case of Google, however, it is not applicable, as they are, well, just giving a much lower number. This makes it harder to judge whether our index size is close to theirs, but it does not affect our methodology.
Unfortunately this is not the only issue with the links reported by those search engines. We took all those reported backlinks and ran a totally separate crawl and index creation from that data, just to see whether we would match 100% of them, as one would expect: in theory we would have crawled exactly the pages that those search engines report as having backlinks to our target URLs.
You can probably imagine our surprise when not only did we not get a 100% match ratio, but in some cases we were considerably off. How could that be? There are a few reasons why a backlink reported by a search engine (and this applies to us too) might not actually contain a link to the target site, or might not even be accessible at all. Some sites go down. Some sites are updated quickly, and the link that was there yesterday won’t be there tomorrow. Some links are marked as “nofollow”, and while we were faithfully (but naively) observing it, the other search engines actually do include those backlinks in their results (though they might value them less in ranking). All this has been known for some time. One of the lesser known things, however, is that Google specifically has a rather unexpected behavior: at least for the purposes of the link: command, it treats backlinks of a page that redirects elsewhere, possibly through a chain of redirects, as backlinks of the page at the end of that chain.
One very good example of this “redirect backlinks” behavior can be seen on one of the URLs that we used, a document that every half-serious search engine builder must have read many times over: The Anatomy of a Large-Scale Hypertextual Web Search Engine. It turns out that Google reports a lot of backlinks that do not actually contain a link to that page; they link instead to a URL that previously hosted it: http://www-db.stanford.edu/~backrub/google.html. If you click it you will see that you get redirected. We now take this into account for that test URL, which is why (as you will see below) the match ratio for that particular page has increased considerably. Yahoo does not appear to be as heavily affected by this “feature” as Google, but then again we can reasonably expect Google to be a lot more advanced when it comes to web links. At the moment our quality check only takes such known redirects into account rather than handling every redirect automatically, although we did apply that logic to our “Practical best” comparison to understand the best match ratio we can reasonably expect.
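To make the redirect handling concrete, here is a rough sketch of how known redirects could be folded into backlink lookup. It is an illustration under assumptions, not a description of how Google or our own indexer works internally: the helper names `resolve` and `backlinks_of`, the destination url and the linking pages are all hypothetical; only the old Stanford url comes from the example above.

```python
def resolve(url, redirects, max_hops=10):
    """Follow known redirects until a url with no further redirect is reached.

    `redirects` maps a redirecting url to its destination. The `seen` set
    and `max_hops` guard against redirect loops.
    """
    seen = set()
    while url in redirects and url not in seen and len(seen) < max_hops:
        seen.add(url)
        url = redirects[url]
    return url


def backlinks_of(target, links, redirects):
    """Pages whose outgoing links end up at `target` once redirects are followed.

    `links` is an iterable of (source_page, linked_url) pairs.
    """
    return {src for src, dst in links if resolve(dst, redirects) == target}


OLD_URL = "http://www-db.stanford.edu/~backrub/google.html"
NEW_URL = "http://new-host.example/anatomy.html"  # hypothetical destination

redirects = {OLD_URL: NEW_URL}
links = [
    ("http://a.example/reading-list", OLD_URL),   # links to the old location
    ("http://b.example/search-papers", NEW_URL),  # links to the new location
    ("http://c.example/unrelated", "http://other.example/"),
]

print(sorted(backlinks_of(NEW_URL, links, redirects)))
```

With an empty `redirects` table only the page linking directly to the new location is found, which is the kind of mismatch described above.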
Backlinks matching results
In the table below you can see how many of the backlinks reported by Google and Yahoo for the 20 test urls were matched in different builds of our growing index. The number of billions in the column titles refers to the total number of unique urls in each index. The matching was done either for all backlinks (internal and external) or for external backlinks only (in recent indices, as external backlinks are usually more important than internal ones), and also, in the most recent index, for short and long domains (with subdomains). Match ratios of over 65% for urls and over 85% for domains are marked in blue. The Practical best column refers to a totally separate matching done on an index built from the very backlinks that we use for matching. In theory the match ratio there should be 100%, but in practice it is not (for the reasons described above), so our match ratio should really be compared with that practical maximum.
Intuitively, one expects to see better match ratios as the index grows. However, the data above shows a much bigger increase in match ratios in the most recent index, created on 25/05/2008. Even more interesting is that domain match ratios are very high, and 90%+ is not rare at all. This means that even where the current index does not match the exact backlinks shown by Google and Yahoo for those urls, it does have backlinks from the domains on which Google and Yahoo claim to have found them; we just don’t match the exact URL.
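The difference between url matching and domain matching can be illustrated with a short sketch. It assumes that a reported backlink counts as domain-matched when our index holds at least one backlink from the same host; that definition, the function names and the sample urls are assumptions made for illustration. It compares full host names only (the "long" domain with subdomains); reducing a host to its "short" registered domain would need a public suffix list and is left out here.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Lower-cased host name of a url, or '' if it cannot be determined."""
    return (urlsplit(url).hostname or "").lower()


def domain_match_ratio(reported_backlinks, indexed_backlinks):
    """Fraction of reported backlinks whose host also appears among the
    hosts of the backlinks in our index (assumed definition)."""
    reported = list(reported_backlinks)
    if not reported:
        return 0.0
    indexed_hosts = {host_of(u) for u in indexed_backlinks}
    indexed_hosts.discard("")  # ignore urls without a usable host
    matched = sum(1 for u in reported if host_of(u) in indexed_hosts)
    return matched / len(reported)


# Hypothetical sample: no exact url matches, but one host is shared.
reported = [
    "http://blog.example.com/2008/05/search-engines",
    "http://news.example.org/story/123",
]
indexed = [
    "http://blog.example.com/links",
    "http://unrelated.example.net/",
]

print(domain_match_ratio(reported, indexed))
```

In this sample the exact url match ratio would be zero, while half of the reported backlinks come from a host we already have backlinks from.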
Substantial improvements in match ratio in the current index were due to improvements in crawling and analysis that we implemented in October 2007. Since then we have made further improvements and expect that match ratios will increase further after the next index update.
One interesting observation is that we seem to consistently match fewer of the backlinks shown by Google than of those shown by Yahoo. The reason is related to the different quality of backlinks shown by Google when compared to Yahoo; we will cover this in the next research article.
Another observation is that in some cases (http://www.youtube.com) we actually matched more than our practical best matching suggested. We believe this is a side effect of the improved indexing introduced in January 2008, as well as of deeper crawling (200 kb max per page rather than 100 kb), whereas the best match case was run in October 2007. We are going to rerun it using current indexing to see if it also improves (as we could reasonably expect).
- Our methodology allows us to estimate how close our index is in terms of quality to those used by other search engines.
- The current index shows a very substantial reduction in the gap between our index and market leaders such as Google and Yahoo.
- We have achieved a very high match ratio for domains, which means that even where we do not yet have the exact backlink shown by the search engines we used for testing, we still have other backlinks from that domain.