(This article was originally written on 23 April 2008, last updated on 18 July 2008, and has been moved here for convenience.)

Foreword

The web is very big. We now know for a fact that there are at least 138 bln unique urls out there, and this number comes from just 24 bln unique urls that we crawled (some of them more than once). Big search engines like Google (G) and Yahoo (Y) have stopped telling us how many pages are in their indices. We can estimate that they probably have 30-35 bln crawled pages, but how can we be sure that we crawl the same urls they do? We can look at the backlink counts they report, but those numbers do not tell us whether we have actually crawled the same pages. The purpose of this research was to find out whether we are getting closer to the index size and quality of those top search engines. But how do we do that when they obviously won’t allow a direct comparison with their data, let alone publish the results? Read on to find out how we solved this tricky problem to guide ourselves towards a better quality index that gets closer to what Google and Yahoo have.

Methodology

Our approach was simple, yet effective. We took a set of 20 different urls, covering both big and small sites, and obtained the lists of backlinks that Google and Yahoo show for them. We then compared those lists with all backlinks in our index for the same urls to check how many of them we matched. We assume here that (as claimed by those search engines) the backlinks they show are a fairly random sample of the complete set that they have but won’t show. So the higher the percentage of those backlinks that are also present in our database, the more likely it is that our database is close to what those search engines have.
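The match ratio described above can be sketched in a few lines of Python. This is only a minimal illustration under assumptions: the function name `match_ratio` and the sample urls are hypothetical, and urls are compared as exact strings, whereas a real comparison would first have to normalise them.

```python
def match_ratio(reported_backlinks, index_backlinks):
    """Fraction of backlinks reported by a search engine that are also in our index.

    Both arguments are iterables of url strings. Urls are compared as exact
    strings here, which is a simplifying assumption.
    """
    reported = set(reported_backlinks)
    if not reported:
        return 0.0
    matched = reported & set(index_backlinks)
    return len(matched) / len(reported)


# Hypothetical sample data for one test url.
reported_by_engine = [
    "http://a.example/page1",
    "http://b.example/links",
    "http://c.example/blog",
    "http://d.example/",
]
found_in_our_index = [
    "http://a.example/page1",
    "http://c.example/blog",
    "http://z.example/other",
]

print(f"{match_ratio(reported_by_engine, found_in_our_index):.2%}")
```

Backlinks that we have but the search engine does not report (such as `http://z.example/other` here) do not affect the ratio, because the reported list is treated as the sample being tested.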

Source data

There are two sources of data. The main one is our own web crawl, which we have been running since late 2004. To verify the quality of that data we use backlinks reported by Google and Yahoo.
The secondary measure that we use is the actual number of backlinks that we have compared with the number that they have. In the case of Google this is not applicable, because they simply report a much lower number. This makes it harder to tell whether our index size is close to theirs, but it does not affect our methodology.
Unfortunately this is not the only issue with the links reported by those search engines. We took all those reported backlinks and ran a totally separate crawl and index build from that data, just to see whether we would match 100% of them. In theory we should, because we would have crawled exactly the pages that those search engines report as having backlinks to our target URLs.
You can probably imagine our surprise when we not only failed to get a 100% match ratio, but in some cases were considerably off. How could that be? There are a few reasons why a backlink reported by a search engine (and this applies to us too) might not actually contain a link to the target site, or might not be accessible at all. Some sites go down. Some sites are updated quickly, so the link that was there yesterday won’t be there tomorrow. Some links are marked as “nofollow”, and while we were faithfully (but naively) observing that, the other search engines actually do include those backlinks in their results (although they might value them less in ranking). All this has been known for some time. A lesser known point is that Google in particular has a rather unexpected behavior (at least for the purposes of the link: command): if page A links to page B, and page B redirects to page C, then page A is reported as a backlink of page C.
One very good example of this “redirect backlinks” behavior can be seen on one of the URLs that we used, a document that every half-serious search engine builder must have read many times over: The Anatomy of a Large-Scale Hypertextual Web Search Engine. It turns out that Google reports a lot of backlinks that do not actually contain a link to that page, but do link to a URL that previously hosted it: http://www-db.stanford.edu/~backrub/google.html. If you click it you will see that you get redirected. We now take this into account for that test URL, which is why (as you will see below) the match ratio for that particular page has increased considerably. Yahoo does not appear to be as heavily affected by this “feature” as Google, but then again we can reasonably expect Google to be a lot more advanced when it comes to web links. At the moment our quality check only takes such known redirects into account and does not handle every redirect automatically, although we did apply that logic to our “Practical best” comparison to understand the best match ratio we can reasonably expect.
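A redirect-aware check of this kind could look roughly like the sketch below. It is an illustration under assumptions, not our actual implementation: the `redirects` mapping of known redirects, the function names and all urls are hypothetical.

```python
def resolve(url, redirects, max_hops=10):
    """Follow known redirects from url and return the final destination.

    redirects maps a source url to the url it redirects to. Loops and very
    long chains are cut short instead of being followed forever.
    """
    seen = set()
    while url in redirects and url not in seen and len(seen) < max_hops:
        seen.add(url)
        url = redirects[url]
    return url


def links_to_target(outlinks, target, redirects):
    """True if any outgoing link reaches the target, directly or via redirects."""
    final_target = resolve(target, redirects)
    return any(resolve(link, redirects) == final_target for link in outlinks)


# Hypothetical example: the paper moved from an old url to a new one.
known_redirects = {"http://old.example/paper.html": "http://new.example/paper.html"}
page_outlinks = ["http://old.example/paper.html", "http://unrelated.example/"]

print(links_to_target(page_outlinks, "http://new.example/paper.html", known_redirects))
```

Without the redirect table, the page in this example would count as a reported backlink that does not link to the target, which is exactly the kind of mismatch described above.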

Backlinks matching results

In the table below you can see how many of the backlinks reported by Google and Yahoo for the 20 test urls were matched in different builds of our growing index. The number of billions in each column title refers to the total number of unique urls in that index. The matching was done either for all backlinks (internal and external) or for external backlinks only (in recent indices, as external backlinks are usually more important than internal ones). In the most recent indices it was also done at the level of short domains and long domains (with subdomains). Match ratios of over 65% for urls and over 85% for domains are marked in blue. The “Practical best” column refers to a totally separate matching run on an index built from the very backlinks that we use for matching. In theory the match ratio there should be 100%, but in practice it is not (for the reasons described above), so our match ratio should really be compared with that practical maximum.

| # | URL | Search Engine | 30 bln (all) 01/09/07 | 54 bln (all) 06/09/07 | 115 bln (all) 23/09/07 | 138 bln (all) 16/01/08 | 115 bln (external) 23/09/07 | 138 bln (external) 16/01/08 | 200 bln (external) 25/05/08 | Practical best (external) 15/10/07 | 138 bln (short domains) 16/01/08 | 200 bln (short domains) 25/05/08 | 138 bln (long domains) 16/01/08 | 200 bln (long domains) 25/05/08 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | http://www.google.co… | Google | 6.80% | 9.70% | 12.50% | 47.60% | 12.50% | 47.50% | 60.82% | 80.90% | 94.70% | 96.00% | 91.90% | 93.10% |
| | | Yahoo | 38.00% | 40.10% | 41.50% | 81.70% | 41.50% | 81.40% | 84.53% | 85.60% | 98.10% | 98.50% | 95.70% | 96.00% |
| 2 | http://www.yahoo.com | Google | 8.30% | 11.10% | 12.80% | 29.60% | 12.10% | 29.20% | 37.80% | 93.40% | 96.20% | 98.10% | 91.00% | 93.70% |
| | | Yahoo | 37.40% | 43.10% | 50.50% | 71.40% | 34.20% | 65.10% | 74.69% | 84.20% | 94.70% | 95.30% | 93.90% | 94.50% |
| 3 | http://www.cnn.com | Google | 10.80% | 19.20% | 25.00% | 48.00% | 25.00% | 48.10% | 60.11% | 79.20% | 91.50% | 92.00% | 82.90% | 84.50% |
| | | Yahoo | 35.10% | 41.90% | 44.80% | 69.50% | 44.70% | 78.30% | 82.64% | 83.10% | 96.80% | 96.80% | 94.60% | 94.80% |
| 4 | http://news.bbc.co.u… | Google | 2.80% | 5.30% | 6.10% | 33.10% | 39.00% | 69.00% | 73.00% | 73.00% | 92.50% | 92.50% | 88.90% | 88.90% |
| | | Yahoo | 22.00% | 26.00% | 29.30% | 60.00% | 46.30% | 75.90% | 79.35% | 78.40% | 92.00% | 92.70% | 90.50% | 92.00% |
| 5 | http://www.majestic1… | Google | 29.60% | 31.50% | 64.80% | 69.40% | 23.70% | 28.90% | 31.58% | 57.90% | 67.70% | 74.20% | 68.80% | 75.00% |
| | | Yahoo | 25.20% | 26.40% | 53.30% | 58.10% | 14.30% | 20.80% | 23.81% | 33.80% | 75.40% | 82.00% | 70.80% | 78.50% |
| 6 | http://www.amazon.co… | Google | 9.40% | 15.90% | 20.40% | 30.20% | 20.40% | 30.40% | 32.31% | 36.40% | 94.30% | 96.00% | 87.30% | 88.80% |
| | | Yahoo | 26.90% | 35.10% | 40.20% | 55.80% | 56.80% | 78.60% | 81.35% | 78.90% | 93.80% | 95.00% | 92.60% | 93.90% |
| 7 | http://en.wikipedia…. | Google | 3.10% | 6.90% | 18.40% | 35.50% | 15.50% | 34.50% | 45.69% | 60.90% | 73.20% | 82.10% | 61.20% | 69.60% |
| | | Yahoo | 6.70% | 12.10% | 23.80% | 40.40% | 21.30% | 40.90% | 49.24% | 64.80% | 70.00% | 77.80% | 61.80% | 69.20% |
| 8 | http://www.searcheng… | Google | 11.00% | 13.90% | 17.10% | 41.80% | 14.10% | 41.40% | 54.05% | 89.00% | 65.30% | 66.70% | 52.80% | 54.90% |
| | | Yahoo | 15.50% | 26.60% | 32.50% | 61.80% | 34.30% | 74.00% | 79.11% | 95.90% | 95.60% | 96.10% | 91.70% | 94.30% |
| 9 | http://www.microsoft… | Google | 9.80% | 19.40% | 25.50% | 41.80% | 39.70% | 65.60% | 71.65% | 77.30% | 93.80% | 96.00% | 88.40% | 90.00% |
| | | Yahoo | 25.30% | 29.30% | 32.00% | 63.70% | 40.50% | 70.30% | 75.07% | 73.90% | 94.90% | 96.10% | 93.60% | 94.80% |
| 10 | http://www.php.net | Google | 8.90% | 14.80% | 20.90% | 58.40% | 21.10% | 59.10% | 64.77% | 85.10% | 93.90% | 94.20% | 88.80% | 91.00% |
| | | Yahoo | 18.00% | 20.20% | 21.80% | 75.60% | 27.90% | 73.00% | 75.03% | 80.80% | 91.20% | 92.70% | 86.90% | 88.50% |
| 11 | http://www.google.co… | Google | 5.20% | 10.90% | 22.40% | 32.80% | 22.90% | 34.60% | 45.21% | 52.70% | 84.30% | 87.60% | 78.90% | 83.50% |
| | | Yahoo | 9.80% | 17.30% | 24.00% | 46.10% | 25.50% | 54.50% | 65.10% | 78.30% | 82.50% | 87.20% | 73.40% | 79.80% |
| 12 | http://www.nethack.o… | Google | 12.90% | 30.70% | 40.90% | 51.90% | 39.00% | 50.60% | 53.39% | 67.70% | 84.00% | 86.40% | 74.00% | 75.00% |
| | | Yahoo | 14.00% | 30.70% | 45.20% | 63.10% | 44.10% | 63.30% | 66.20% | 75.70% | 86.90% | 89.60% | 78.70% | 81.50% |
| 13 | http://www.maximumco… | Google | 12.50% | 16.70% | 37.50% | 65.00% | 45.00% | 65.00% | 70.00% | 85.00% | 86.70% | 100.00% | 81.30% | 87.50% |
| | | Yahoo | 2.00% | 3.20% | 6.80% | 25.60% | 19.30% | 25.60% | 28.82% | 67.40% | 42.90% | 46.90% | 36.10% | 39.70% |
| 14 | http://infolab.stanf… | Google | 0.00% | 2.40% | 8.50% | 57.50% | 8.50% | 57.50% | 62.26% | 73.10% | 83.90% | 87.20% | 75.00% | 79.20% |
| | | Yahoo | 0.40% | 5.60% | 14.80% | 26.90% | 14.80% | 26.90% | 35.94% | 48.90% | 58.30% | 70.90% | 50.80% | 62.30% |
| 15 | http://www.youtube.c… | Google | 6.30% | 9.80% | 15.80% | 36.30% | 20.10% | 42.50% | 48.72% | 50.20% | 87.40% | 89.00% | 77.70% | 81.40% |
| | | Yahoo | 8.00% | 9.90% | 8.80% | 78.20% | 21.00% | 73.10% | 76.77% | 64.90% | 94.10% | 95.90% | 88.40% | 91.80% |
| 16 | http://www.gazeta.ru | Google | 1.50% | 4.90% | 9.40% | 16.90% | 11.40% | 14.90% | 19.90% | 38.30% | 79.40% | 85.70% | 68.00% | 73.30% |
| | | Yahoo | 5.00% | 9.80% | 18.10% | 38.80% | 18.30% | 37.90% | 39.68% | 47.60% | 81.00% | 84.00% | 66.70% | 70.50% |
| 17 | http://football.guar… | Google | 6.90% | 11.80% | 22.10% | 24.60% | 9.50% | 11.30% | 15.38% | 33.90% | 71.40% | 71.40% | 66.70% | 72.20% |
| | | Yahoo | 5.70% | 9.50% | 17.70% | 28.60% | 19.60% | 38.90% | 53.71% | 85.40% | 85.50% | 88.40% | 73.80% | 88.80% |
| 18 | http://www.darkageof… | Google | 15.70% | 30.10% | 47.70% | 56.00% | 47.70% | 56.00% | 61.25% | 80.70% | 84.80% | 87.00% | 78.80% | 82.80% |
| | | Yahoo | 17.20% | 41.10% | 67.80% | 91.40% | 71.10% | 92.40% | 94.95% | 98.30% | 95.50% | 97.70% | 90.60% | 93.80% |
| 19 | http://www.kwikbreak… | Google | 14.30% | 14.30% | 21.40% | 21.40% | 21.40% | 21.40% | 21.43% | 35.70% | 87.50% | 87.50% | 75.00% | 75.00% |
| | | Yahoo | 2.60% | 5.30% | 9.50% | 14.00% | 9.50% | 14.20% | 20.32% | 51.30% | 31.90% | 37.60% | 31.40% | 37.30% |
| 20 | http://www.techcrunc… | Google | 6.50% | 13.60% | 24.40% | 51.10% | 25.10% | 47.70% | 64.96% | 74.90% | 90.10% | 91.90% | 85.90% | 87.80% |
| | | Yahoo | 16.70% | 23.90% | 32.80% | 77.40% | 27.80% | 60.70% | 63.29% | 79.20% | 95.90% | 97.10% | 91.90% | 94.00% |

Intuitively, we would expect match ratios to improve as the index grows. However, the data above shows a much bigger increase in match ratios in the most recent index, created on 25/05/2008. Even more interesting is that domain match ratios are very high, and 90%+ is not rare at all. This means that even where the current index does not contain the exact backlinks shown by Google and Yahoo for those urls, it does contain backlinks from the same domains on which Google and Yahoo claim to have found backlinks; we just don’t match the exact URL.
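The difference between url-level and domain-level matching can be illustrated with the following sketch. Several things here are assumptions made for illustration only: a “long domain” is taken to be the full hostname including subdomains, a “short domain” is approximated by the last two labels of the hostname (a naive heuristic that is wrong for suffixes such as co.uk), and the ratio is counted per reported backlink. All names and urls are hypothetical.

```python
from urllib.parse import urlsplit


def long_domain(url):
    """Full hostname including subdomains, e.g. blog.example.com."""
    return (urlsplit(url).hostname or "").lower()


def short_domain(url):
    """Naive short domain: last two labels of the hostname.

    This is a deliberate simplification; a real implementation would need a
    list of public suffixes to handle hosts under e.g. co.uk correctly.
    """
    labels = long_domain(url).split(".")
    return ".".join(labels[-2:])


def domain_match_ratio(reported_backlinks, index_backlinks, key):
    """Fraction of reported backlinks whose domain also has a backlink in our index."""
    reported = list(reported_backlinks)
    if not reported:
        return 0.0
    indexed_domains = {key(url) for url in index_backlinks}
    hits = sum(1 for url in reported if key(url) in indexed_domains)
    return hits / len(reported)


# Hypothetical data: no exact url matches, but the domains overlap.
reported = ["http://blog.example.com/post1", "http://news.example.org/a"]
indexed = ["http://blog.example.com/post2", "http://www.example.org/b"]

print(domain_match_ratio(reported, indexed, long_domain))
print(domain_match_ratio(reported, indexed, short_domain))
```

In this example the url-level match ratio is zero, the long-domain ratio is one half and the short-domain ratio is one, which mirrors the ordering seen in the table: short domains match at least as often as long domains, which in turn match far more often than exact urls.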
The substantial improvement in match ratios in the current index is due to improvements in crawling and analysis that we implemented in October 2007. We have made further improvements since then and expect match ratios to rise again after the next index update.
One interesting observation is that we seem to consistently match fewer of the backlinks shown by Google than of those shown by Yahoo. The reason is related to a difference in the quality of backlinks shown by Google compared with Yahoo; we will cover this in the next research article.
Another observation is that in some cases (http://www.youtube.com) we actually matched more than our practical best suggested. We believe this is a side effect of the improved indexing that was introduced in January 2008, as well as of crawling more of each page (200 kb max per page rather than 100 kb), whereas the practical best run dates from October 2007. We are going to rerun it using the current indexing to see if it also improves (as we could reasonably expect).
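To compare a match ratio with the practical maximum, one simple option is to divide one by the other. The sketch below does this for two rows of the table (200 bln external versus Practical best); the function name is hypothetical and this normalisation is only one possible way of reading the numbers.

```python
def relative_to_practical_best(match_pct, practical_best_pct):
    """Express a match ratio as a fraction of the practical best for the same url."""
    if practical_best_pct <= 0:
        raise ValueError("practical best must be positive")
    return match_pct / practical_best_pct


# Values taken from the table: 200 bln (external) and Practical best (external).
row1_google = relative_to_practical_best(60.82, 80.90)    # url 1, Google
youtube_yahoo = relative_to_practical_best(76.77, 64.90)  # url 15, Yahoo

print(f"url 1, Google: {row1_google:.1%} of practical best")
print(f"url 15, Yahoo: {youtube_yahoo:.1%} of practical best")
```

A value above 100%, as for the Yahoo backlinks of http://www.youtube.com, is the case discussed above where the current index matched more than the older practical best run.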

Conclusions

  1. Our methodology allows us to estimate how close our index is, in terms of quality, to those used by other search engines.
  2. The current index shows a very substantial reduction in the gap between our index and those of market leaders such as Google and Yahoo.
  3. We have achieved a very high match ratio for domains, which means that even where we do not yet have the exact backlink shown by the search engines we used for testing, we still have other backlinks from that domain.

This article will be updated every time we create a new index (the next one is scheduled for Aug/Sep 2008).

History:
23 Apr 2008: First revision.
18 Jul 2008: Updated with recent index stats.

Feel free to contact us: contact@majesticseo.com.