Our crawl department recently sent through a message:

We’re going to need more data centres…

[02/02/2015 14:28:26] 3,004,506,542,426 – URLs so far crawled

Since Majestic started crawling the web in 2004, we have looked at 3 trillion URLs. We thought a few of you might like to know this, and we also wanted to explain why 3 trillion URLs crawled, whilst significant, is only the tip of the iceberg when it comes to Majestic’s data.

1 Trillion new URLs crawled in under 12 months

It took Majestic 5 years to crawl its first 1 trillion URLs, a milestone we announced in October 2009. We then took only 2 and a bit years to find our second trillion URLs, which we achieved in 2014. Shortly before we hit the 2 trillion mark, we stopped optimizing our crawl for discovery only and instead started to balance our crawl between discovery and re-crawling URLs already discovered. Even so, we have now found our third trillion less than 12 months later.

What factors make Majestic’s crawl behaviour more intelligent these days?

Our Flow Metrics – Trust Flow and Citation Flow – are helping us to crawl the internet more intelligently, getting more data from the same crawl bandwidth. Majestic is now able to use multiple signals to make sure we crawl the right pages at the right time, not just lots of pages through link walking.

Other factors help us to make our bot crawl more intelligently:

  • Obeying Crawl-Delay in robots.txt lets a webmaster modify the crawl rate of our bot on their site, and we also obey the rest of robots.txt.
  • Trust Flow allows us to re-visit important pages more often than lower quality evergreen content.
  • Recording changes to page content allows us to distinguish between pages which rarely change and pages with news or frequently updated content.
  • Our URL submitter helps us to better prioritize the URLs YOU care about, by allowing you to upload thousands or tens of thousands of URLs for review.
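
To make the first three points a little more concrete, here is a minimal sketch in Python. It is not Majestic's crawler code: the bot name `ExampleBot`, the sample robots.txt, the function names and the weighting inside `recrawl_interval_days` are all assumptions made up for illustration, as is treating Trust Flow as a 0 to 100 score. The robots.txt handling uses the standard library's `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; "ExampleBot" is a made-up bot name.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def seconds_between_requests(parser: RobotFileParser, user_agent: str,
                             default: float = 1.0) -> float:
    """Use the site's Crawl-delay if one is set, else a polite default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def recrawl_interval_days(trust_flow: float, change_rate: float,
                          base_days: float = 90.0,
                          min_days: float = 1.0) -> float:
    """Toy heuristic (an assumption, not a published formula).

    trust_flow:  assumed to be a score from 0 to 100.
    change_rate: fraction of past visits on which the content had changed.
    Trusted, frequently changing pages get a shorter interval.
    """
    trust = min(max(trust_flow, 0.0), 100.0) / 100.0
    change = min(max(change_rate, 0.0), 1.0)
    urgency = 0.5 * trust + 0.5 * change  # 0 (ignore) .. 1 (revisit soon)
    return max(min_days, base_days * (1.0 - urgency))


rules = load_rules(ROBOTS_TXT)
print(seconds_between_requests(rules, "ExampleBot"))
print(rules.can_fetch("ExampleBot", "https://example.com/private/page"))
print(recrawl_interval_days(trust_flow=80, change_rate=0.5))
print(recrawl_interval_days(trust_flow=10, change_rate=0.5))
```

The even split between trust and change rate is arbitrary; the point is only that a crawler can combine several signals to decide when a page is worth another visit, while the webmaster's Crawl-delay caps how fast any of those visits happen.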

Aren’t there more than 3 Trillion Links?

Oh goodness yes! Most URLs have a large number of links on them, although most are internal. Even so, whilst Majestic does not report on internal links, we do crawl them to get a better understanding of how link juice flows around a site. On top of this, we are also reporting all the pages that LINK TO each URL. If we say the average page has 100 inbound links, that would mean 300 trillion link relationships… but that’s just guesswork right now. Just finding out the right answer would be a project in itself. What we DO know, though, is that we have worked out about 800 relative Topical Trust Flow scores for every URL. Of course, most URLs only have a Topical Trust Flow score for a small proportion of the 800 topics, but the sheer scale of the calculation would make your average “big computer” weep and grind to a halt.
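
The guesswork above is simple multiplication, and a few lines of Python show the scale. The figure of 100 inbound links per page is the illustrative guess from the paragraph, not a measurement, and the topic count is an upper bound because most URLs only score on a small proportion of the roughly 800 topics. The variable names are ours.

```python
# Crawl counter quoted at the top of this post.
urls_crawled = 3_004_506_542_426

# Guess from the text, not a measured average.
assumed_inbound_links_per_url = 100

# "About 800" topics; most URLs score on only a few of them.
topics = 800

link_relationships = urls_crawled * assumed_inbound_links_per_url
max_topical_scores = urls_crawled * topics

print(f"Link relationships (guess): {link_relationships:,}")
print(f"Topical score slots (upper bound): {max_topical_scores:,}")
```

That works out to roughly 300 trillion link relationships and, at most, about 2.4 quadrillion topical score slots.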

 

 

Dixon Jones

Comments

  • Nick Garner

    As Dixon will know, I’m a massive fan of Majestic. Having done a lot of data crunching recently using SEMrush as a way of correlating Trust Flow and Citation Flow with actual rankings, my confidence has only grown in these metrics.
    And Majestic is still the most important SEO tool in our agency.

    February 19, 2015 at 8:55 am
  • Tim

    Ah, so that’s how you f*+s got my e-mail address.

    February 24, 2015 at 1:47 pm
    • Dixon Jones

      Tim, we use this to map the internet; we do not record email addresses this way. If we have emailed you, it will be because your email preferences request this (which you can amend <a href="http://info.majestic.com/emailPreference/e/63022/306">here</a>) or, more likely if you came to this article from our last newsletter, because you have that preference switched on in your double opted-in <a href="https://majestic.com/account/my-details">registered account settings</a>. Either way, an anonymous comment doesn’t help either of us, I am afraid. We have no interest in contacting people who do not want our news.

      February 24, 2015 at 2:01 pm
