Majestic is upgrading its back end so that it can scale better moving forward. This is a good thing in the long run, but the change is taking longer than expected, so we want to share what is happening.

You may have noticed that our usual monthly Historic Index release has slipped this month. We wanted to explain why this has happened, and to give you a little more insight into the amount of data we process each month to bring you a Historic Index that lets you view, filter and search link data over the last 5 years.

As Matt Cutts said in 2012, quoting Douglas Adams in The Hitchhiker's Guide to the Galaxy:

‘Space is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it’s a long way down the road to the chemist, but that’s just peanuts to space…’

That quote goes for the Internet too: it's big, really big. Our Historic Index now records more than 3.5 trillion unique URLs, and as you can imagine, storing data on all these links for a full 5-year period is quite a task. Even with our own custom compression, written specifically to store link data in as little space as possible, this still fills just over a petabyte (around 1,000 well-specced laptop computers) of storage. Processing it on our dedicated cluster, comprising almost 300 processing cores and 7 TB of memory, takes somewhere in the order of a month, and ultimately produces our searchable index, which is around 400 TB in size.

To make this more sustainable, we have been working to further improve how we store and process this data. These changes will allow us to store far more pre-processed data, reducing the working set used to produce our Historic Index. As a side effect, this should also reduce processing time, so that we can continue to produce this index monthly even though the amount of data increases every single month. Unfortunately, working with this much data is never quick, which has led to the delay of this month's Historic Index. We are working hard to complete the changes and release our next index as soon as possible.
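To put those figures in perspective, here is a rough back-of-envelope calculation (illustrative only: real record sizes vary, and the index stores multiple observations per link over time):

```python
# Approximate figures from above, used for a rough per-URL estimate.
urls = 3.5e12                 # unique URLs in the Historic Index
raw_storage_bytes = 1e15      # just over a petabyte of compressed link data
index_bytes = 400e12          # ~400 TB searchable index

bytes_per_url_raw = raw_storage_bytes / urls      # compressed storage per URL
bytes_per_url_index = index_bytes / urls          # searchable index per URL

print(f"~{bytes_per_url_raw:.0f} bytes/URL stored, "
      f"~{bytes_per_url_index:.0f} bytes/URL in the searchable index")
```

Even at only a few hundred bytes per URL, the trillions of URLs dominate: small per-record savings from compression translate into hundreds of terabytes overall.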

Our Fresh Index is unaffected by these changes; it continues to update every few hours, and our current Historic Index still contains data for the period 25 Apr 2010 to 25 Nov 2015.

Some things you may not know about our indexes

Q. What is the difference between Fresh and Historic?

A. Our Fresh Index contains current link data. It indexes the last 90 days of data from our crawlers. Our bots visit the majority of the Internet at least once a month, so this index only includes links which are currently live or recently removed, regardless of when they were first created. Any links which have not been seen in the last 90 days are removed from this index. This index is most useful if you wish to view your current backlinks. All metrics in the Fresh Index are fully updated approximately every 30 hours, with extra data added every few hours.

Our Historic Index contains data from the last 5 years. Any link found at any time during this 5-year period will be recorded in this index, even if the link has since been removed or the website no longer exists. This index is most useful for data mining: finding links that were active at points in history for comparison or analysis. The Historic Index is usually updated monthly.
Date ranges for indexed data are always shown on our homepage for both indexes.
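The two membership rules can be sketched in a few lines. The field names and the simplified 5-year window below are hypothetical, purely to illustrate the policy, not Majestic's actual implementation:

```python
from datetime import date, timedelta

def in_fresh_index(last_seen: date, today: date) -> bool:
    """A link stays in the Fresh Index only while our crawlers
    have seen it within the last 90 days."""
    return today - last_seen <= timedelta(days=90)

def in_historic_index(first_seen: date, last_seen: date, index_end: date) -> bool:
    """A link belongs to the Historic Index if it was seen at any
    point inside the (roughly) 5-year window, even if it has since
    been removed or its site no longer exists."""
    index_start = index_end - timedelta(days=5 * 365)
    return last_seen >= index_start and first_seen <= index_end
```

So a link last crawled six months ago has dropped out of the Fresh Index but is still present in the Historic Index.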

Q. Are all links in the fresh index also in the Historic Index?

A. The same data is used for both indexes, but because the Fresh Index updates more quickly, new data appears there much sooner. The same data will also show in the Historic Index, but due to the longer update cycle it will take longer to appear. If you want to view only currently active links, you should most likely be looking at our Fresh Index.

Q. Why are all the links I check from the Historic Index not working, even though they are not showing as deleted?

A. In our index, a link is shown as deleted when we successfully recrawl a page and find that a link previously shown on it is no longer there. If we are unable to recrawl the page (perhaps because the site is temporarily down, has closed permanently, the page no longer exists, or the site now blocks our crawler), we cannot tell with certainty the final status of that link, so we keep the 'last seen' date but otherwise leave its attributes unchanged. Over 5 years a lot of websites and pages come and go; this is why filtering our Historic Index to specific date ranges gives the most useful data, returning only links which were known to be active in that period.
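The recrawl policy above can be sketched as follows (the record layout and function name are hypothetical, for illustration only):

```python
def update_link_status(link, recrawl_succeeded, link_still_present, crawl_date):
    """Only a successful recrawl can change a link's status.

    If the page is unreachable (site down, page gone, crawler blocked),
    the link's final status is unknown: the record, including its
    'last seen' date, is left exactly as it was.
    """
    if recrawl_succeeded:
        if link_still_present:
            link["last_seen"] = crawl_date   # confirmed still live
        else:
            link["deleted"] = True           # confirmed gone: mark deleted
    # else: page unreachable -> keep all attributes unchanged
    return link
```

This is why a link can be neither "deleted" nor reachable: no successful recrawl has confirmed its fate either way.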

Q. Can you tell me what backlinks my website had in December 2012?

A. Yes, our Historic Index covers the last 5 years and can be filtered down to show which backlinks were active in any month, or even on a specific day, although of course changes are only logged when our bots have visited the page and seen the change.
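That kind of historical filter amounts to an interval-overlap check: a link counts as active in a period if the window between when our bots first and last saw it overlaps the query window. A minimal sketch, using a hypothetical record format rather than Majestic's actual API:

```python
from datetime import date

def links_active_in(links, start: date, end: date):
    """Return links whose observed interval [first_seen, last_seen]
    overlaps the query window [start, end]."""
    return [l for l in links
            if l["first_seen"] <= end and l["last_seen"] >= start]

backlinks = [
    {"url": "http://example.com/a",
     "first_seen": date(2012, 3, 1), "last_seen": date(2013, 1, 10)},
    {"url": "http://example.com/b",
     "first_seen": date(2013, 5, 2), "last_seen": date(2014, 2, 1)},
]

# Which of these backlinks were active in December 2012?
dec_2012 = links_active_in(backlinks, date(2012, 12, 1), date(2012, 12, 31))
```

Here only the first link qualifies; the second was not first seen until May 2013.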

Q. Why can’t you just have 1 Index? Searching 2 indexes is complicated.

A. Actually, we seriously considered dropping the Historic Index; it is incredibly expensive to produce. But having 2 indexes has 2 big advantages. Firstly, the massive reduction in the amount of data in our Fresh Index means that we can update it almost daily, as opposed to the month it takes to process the 5 years of data in the Historic Index. Secondly, links which have not been seen for 90 days are completely removed from the Fresh Index, so that it only shows current or recent links, rather than including links in an unknown state because we have been unable to confirm whether they have gone or are still there. So if you want to view current data, use the Fresh Index; if you are looking to filter to specific dates, you most likely need the Historic Index.

How long will it take to get the next Historic Index out?

We usually release a Historic Index update once per month. It is our hope that February will be the only month we miss, but we cannot promise that right now. The architecture changes involve re-importing the entire index from tape backups. This is what is taking the time, because tapes are incredibly slow compared with disk. This is where we underestimated the time, and it is a painful wait. We believe the re-import will still take another week, and only THEN can we rebuild the next index. That will take us right up towards the end of the month.

Dixon Jones

Comments

  • Jack Farmer

    Given the massive amount of data, wouldn't it be easier to run everything on the Amazon, Google or Azure cloud?

    Anyway, great work thx for the update 😉

    March 5, 2016 at 10:13 am
    • Dixon Jones

      I cannot comment on our costs, but Moz are more open. They were doing that, and it was costing them $6.2 Million on Amazon Web Services. Going in-house saved them over $3 Million a year in cash terms, through server leasing. (<a href="http://www.xconomy.com/seattle/2014/01/30/moz-dumps-amazon-web-services-citing-expense-and-lacking-service/">source</a>). You can make your own assumptions as to our costs with a much larger historical data set. The mozscape index today is reporting 142 Billion URLs (source: OSE homepage). Our Historic Index is currently reporting 3.6 Trillion FOUND and 884 Billion crawled (source: majestic homepage). This is why the Historic Index is such a beast! 🙂

      Plus… I expect we would have had a lot of trouble restoring from tape to AWS.

      March 7, 2016 at 11:22 am

Comments are closed.