Our crawl department recently sent through a message:
[02/02/2015 14:28:26] 3,004,506,542,426 – URLs so far crawled
So since Majestic started crawling the web in 2004, we have looked at 3 trillion URLs. We thought a few of you might like to know this, and we also want to explain why 3 trillion URLs crawled… whilst significant… is just the tip of the iceberg when it comes to Majestic’s data.
1 Trillion new URLs crawled in under 12 months
It took Majestic 5 years to crawl its first 1 trillion URLs, a milestone we announced in October 2009. We then took only 2 and a bit years to find our second trillion URLs, which we achieved in 2014. Shortly before we hit the 2 trillion mark, we stopped optimizing our crawl for discovery only, and instead started to balance our crawl between discovery and re-crawling URLs already discovered. Even so, we have now found our third trillion less than 12 months later.
What factors make Majestic’s crawl behaviour more intelligent these days?
Our Flow Metrics, Trust Flow and Citation Flow, are helping us to crawl the internet more intelligently, getting more data from the same crawl bandwidth. Majestic is now able to use multiple signals to make sure we crawl the right pages at the right time… not just lots of pages through link walking.
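To make the idea of "the right pages at the right time" a little more concrete, here is a minimal sketch of a priority-ordered crawl queue. It is not Majestic's actual scheduler: the `crawl_priority` scoring, the `trust` values and the bonus for never-seen URLs are all hypothetical, chosen only to show how a signal can reorder a crawl compared with plain link walking.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class FrontierEntry:
    priority: float                     # lower value is popped first
    url: str = field(compare=False)     # the URL itself never affects ordering


def crawl_priority(trust: float, is_new: bool) -> float:
    """Hypothetical score: trusted pages and never-seen URLs go first."""
    discovery_bonus = 10.0 if is_new else 0.0
    # heapq is a min-heap, so negate to pop the highest score first.
    return -(trust + discovery_bonus)


def build_frontier(candidates):
    """candidates: iterable of (url, trust, is_new) tuples (assumed shape)."""
    frontier = []
    for url, trust, is_new in candidates:
        heapq.heappush(frontier, FrontierEntry(crawl_priority(trust, is_new), url))
    return frontier


def next_urls(frontier, budget):
    """Pop up to `budget` URLs, best score first."""
    picked = []
    while frontier and len(picked) < budget:
        picked.append(heapq.heappop(frontier).url)
    return picked


if __name__ == "__main__":
    frontier = build_frontier([
        ("https://example.com/old-page", 5.0, False),
        ("https://example.com/trusted-hub", 50.0, False),
        ("https://example.com/just-found", 1.0, True),
    ])
    print(next_urls(frontier, 2))
```

With a fixed budget of fetches, the queue spends them on the highest-scoring URLs rather than on whichever links happened to be found first.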
Other factors help us to make our bot crawl more intelligently:
- Obeying Crawl-Delay in robots.txt lets a webmaster slow down the crawl rate of our bot on their site. This is in addition to obeying the other rules in robots.txt.
- Trust Flow allows us to re-visit important pages more often than lower-quality evergreen content.
- Recording changes to page content allows us to distinguish between pages which rarely change and pages with news or frequently updated content.
- Our URL submitter helps us to better prioritize the URLs YOU care about, by allowing you to upload thousands or tens of thousands of URLs for review.
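As an illustration of the first point in the list above, the sketch below reads a Crawl-Delay directive with Python's standard `urllib.robotparser` module. The bot name `ExampleBot`, the sample robots.txt and the one-second default delay are assumptions made for the example; this is not our crawler's code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a webmaster might publish.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 5
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""


def load_rules(robots_text: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def fetch_plan(parser, user_agent, url, default_delay=1.0):
    """Return the delay in seconds to wait before fetching, or None if disallowed."""
    if not parser.can_fetch(user_agent, url):
        return None
    delay = parser.crawl_delay(user_agent)   # None when no Crawl-delay applies
    return float(delay) if delay is not None else default_delay


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    for url in ("https://example.com/page", "https://example.com/private/x"):
        print(url, fetch_plan(rules, "ExampleBot", url))
```

A disallowed URL is never fetched at all, while an allowed one is fetched only after the delay the webmaster asked for.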
Aren’t there more than 3 Trillion Links?
Oh goodness yes! Most URLs have a large number of links on them, although most are internal. Whilst Majestic does not report on internal links, we do crawl them to get a better understanding of how link juice flows around a site. On top of this, we also report all the pages that LINK TO each URL. So let’s say the average page has 100 inbound links: that would mean 300 trillion link relationships… but that’s just guesswork right now. Just finding out the right answer would be a project in itself. What we DO know, though, is that we have worked out about 800 relative Topical Trust Flow scores for every URL. Of course, most URLs only have a Topical Trust Flow score for a small proportion of the 800 topics, but the sheer scale of the calculation would make your average “big computer” weep and grind to a halt.
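For anyone who wants to check the back-of-envelope sums, here they are as a few lines of Python. The URL count comes from the log line at the top of this post; the 100 inbound links per URL is the same guess used above, and the topic figure is an upper bound only, since most URLs score on just a few of the roughly 800 topics.

```python
URLS_CRAWLED = 3_004_506_542_426      # from the crawl department's message
ASSUMED_INBOUND_PER_URL = 100         # guesswork, as stated in the post
TOPICS = 800                          # approximate number of topics


def estimated_link_relationships(urls: int, inbound_per_url: int) -> int:
    """Rough count of link relationships under the assumed average."""
    return urls * inbound_per_url


def topical_score_upper_bound(urls: int, topics: int) -> int:
    """Most score slots there could be if every URL scored on every topic."""
    return urls * topics


if __name__ == "__main__":
    links = estimated_link_relationships(URLS_CRAWLED, ASSUMED_INBOUND_PER_URL)
    slots = topical_score_upper_bound(URLS_CRAWLED, TOPICS)
    print(f"Estimated link relationships: {links:,}")
    print(f"Upper bound on topical scores: {slots:,}")
```

The first figure comes out at roughly 300 trillion, matching the estimate above; the second is about 2.4 quadrillion, which is a ceiling rather than a count of scores actually stored.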