Most users rely on our Fresh Index from day to day because it updates continually and covers any link we see over a 90-day period.

However, some of our users do use the Historic Index. It’s much bigger, and the task of updating it is massive. We have recently made some infrastructure improvements, which have meant we have not been able to recalculate the Historic Index this month as we had hoped.

Firstly, apologies to anyone waiting for the update to the Historic Index. We could not foresee the blip, but we’re working to put things right as soon as we can, and we estimate that the update will be complete before the end of February.

That presents an opportunity to reveal a bit more about how Majestic works under the hood.

“If we were to crawl our entire database of 6,659,283,985,220 URLs, simple maths says that this would take approximately 3 years.”

The Historic Index is a BEAST of a dataset. Majestic has been crawling for over a decade and that’s a LOT of data.

But Majestic is so much more than just a list. The maths that Majestic runs over the dataset transforms the list into meaningful statistics upon which the industry relies.

Trust Flow and Citation Flow are not numbers plucked out of thin air, and without the maths being applied to the whole dataset, they would not converge into norms that help us understand “Page Strength”.

The Fresh Index is one thing, with 847,072,493,467 URLs to do maths on. The Historic Index is quite another, with 6,659,283,985,220 URLs. Majestic isn’t nearly as well funded as our search engine counterparts: we don’t have the same resources to crawl the internet. Our network of bots crawls up to 7 billion URLs a day and averages over 5 billion a day. If we were to crawl our entire database of 6,659,283,985,220 URLs, simple maths says that this would take approximately 3 years.
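To put that figure in perspective, here is the back-of-envelope sum, sketched in Python with the crawl rates quoted above (nothing cleverer than division going on):

```python
# Back-of-envelope crawl time for the Historic Index.
historic_urls = 6_659_283_985_220    # URLs in the Historic Index
avg_crawl_per_day = 5_000_000_000    # average daily crawl rate quoted above
peak_crawl_per_day = 7_000_000_000   # a very good day

for rate in (avg_crawl_per_day, peak_crawl_per_day):
    days = historic_urls / rate
    print(f"At {rate:,} URLs/day: {days:,.0f} days (~{days / 365:.1f} years)")

# At 5,000,000,000 URLs/day: 1,332 days (~3.6 years)
# At 7,000,000,000 URLs/day: 951 days (~2.6 years)
```

Even on a run of very good days, a full re-crawl of the Historic Index would take years rather than months.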

Of course, many of these URLs are associated with websites which are now closed, or the pages have been removed, or the page hasn’t changed since it was first created. So, to make sure that we crawl the sites which are important and updated regularly, we have a cluster of computers that examines all our URLs and chooses which should be crawled next, so that our index contains the most up-to-date, most relevant information possible.
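To give a flavour of the idea, here is a much-simplified sketch in Python. It is not our production scheduler (the fields, weights and scoring are invented for illustration), but it shows how an importance signal and an estimate of how likely a page is to have changed can be combined to decide what to fetch next:

```python
import heapq
from dataclasses import dataclass

@dataclass
class UrlRecord:
    url: str
    importance: float        # illustrative link-based score for the page
    days_since_crawl: float  # how long since we last fetched it
    change_rate: float       # fraction of past crawls where the page had changed

def crawl_priority(rec: UrlRecord) -> float:
    """Higher score = crawl sooner. A toy blend of importance and likely staleness."""
    likely_stale = min(1.0, rec.change_rate * rec.days_since_crawl)
    return rec.importance * likely_stale

def pick_next(records: list[UrlRecord], budget: int) -> list[UrlRecord]:
    """Choose the 'budget' most urgent URLs for the next crawl cycle."""
    return heapq.nlargest(budget, records, key=crawl_priority)

# Example: a dead page that never changes loses out to a busy, important one.
candidates = [
    UrlRecord("http://example.com/old-page", importance=0.1,
              days_since_crawl=400, change_rate=0.0),
    UrlRecord("http://example.com/homepage", importance=0.9,
              days_since_crawl=2, change_rate=0.8),
]
print([r.url for r in pick_next(candidates, budget=1)])
# -> ['http://example.com/homepage']
```

A dead page that never changes scores zero however long we leave it, while a busy, well-linked page floats to the top of the queue.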

The size of the internet is always increasing, as is the amount of spam and subversive information created to try to fool search engines. Search engines are getting better at recognising this stuff, and so must we.

To this end, we have recently updated our cluster. Unfortunately, this update had some knock-on effects on other parts of our build processes. As you can imagine, the whole process is interrelated, and this created problems in our Historic Index build, which is why it has been delayed.

Why we think our Flow Metrics create a better view of the Internet and some ways in which we use them

Flow metrics are built from the ground up on every Index Update, both fresh and historic. We don’t rush the maths and it takes time to build a system that analyses pages in a useful way.
We also use our own metrics to direct our crawl priorities – which URLs a bot should crawl next. That means we don’t waste resources crawling and re-crawling pages that never change and that nobody cares about.

We think Google uses a similar idea. Google is better at “discovery” than Majestic. No question. But the decision to revisit a URL, however many resources you have, is a matter of compromise for a web crawler. What percentage of your crawl resources should be spent discovering what you have not yet seen, versus auditing what you already know?

Flow Metrics help Majestic to get this balance right.
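To illustrate the shape of that compromise, here is a toy calculation in Python. The audit shares are invented for the example, and it pretends, purely for the arithmetic, that “auditing” means re-fetching every URL in the Fresh Index once:

```python
# A toy look at the discovery-vs-audit trade-off. The audit shares are
# invented for the example; "auditing" is simplified to mean re-fetching
# every URL in the Fresh Index once.
fresh_index_urls = 847_072_493_467   # Fresh Index size quoted above
daily_budget = 5_000_000_000         # average URLs crawled per day

for audit_share in (0.5, 0.7, 0.9):
    audit_per_day = daily_budget * audit_share
    discovery_per_day = daily_budget - audit_per_day
    days_to_revisit_everything = fresh_index_urls / audit_per_day
    print(f"audit {audit_share:.0%}: revisit the Fresh Index in "
          f"~{days_to_revisit_everything:.0f} days, "
          f"{discovery_per_day:,.0f} URLs/day left for discovery")
```

The more of the daily budget that goes on revisiting what we already know, the sooner the known index is refreshed, but the slower new corners of the web are discovered. Having a good measure of which pages matter is what lets a crawler spend that budget wisely.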

[Edited the crawl rate down from “7 billion a day”. 7 billion would be a VERY good day.]
