We feel we have the cleanest link data commercially available (unless you subscribe to an extra service that re-crawls links like ours on demand), so we are surprised when anyone reports that our Fresh Index contains large numbers of unreported dead links. This usually turns out to be a misinterpretation of the data. To make things totally transparent, for every link we now list the date our crawler first found it, the date we last saw it and, if the link disappears, the date the link was lost. The dates are relative to the index used (Fresh or Historic).
This should pretty much make everything clear, transparent and as easy to filter as possible.
The recent date functions are available in both the Fresh Index and the Historic Index. So here’s how it works:
We know there is baggage in our historic data, which is why we built the Fresh Index in the first place. EVERY link in the Fresh Index has been seen within the previous 60 days, and most are checked much more frequently than that. With this level of detail, we hope the UX is intuitive enough – but do let us know in support if we can make it any slicker.
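To make the date logic concrete, here is a minimal sketch of how the three dates could be used to filter a list of links. The `LinkRecord` class and its field names (`first_found`, `last_seen`, `link_lost`) are hypothetical and are not our export format; only the 60-day window comes from the description above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

FRESH_WINDOW_DAYS = 60  # the Fresh Index window described above


@dataclass
class LinkRecord:
    # Hypothetical field names, for illustration only.
    source_url: str
    first_found: date
    last_seen: date
    link_lost: Optional[date] = None  # None means the link was never reported lost


def is_live(link: LinkRecord) -> bool:
    """A link counts as live if no 'link lost' date has been recorded."""
    return link.link_lost is None


def in_fresh_window(link: LinkRecord, today: date) -> bool:
    """True if the link was last seen within the previous 60 days."""
    return today - link.last_seen <= timedelta(days=FRESH_WINDOW_DAYS)


def live_fresh_links(links: List[LinkRecord], today: date) -> List[LinkRecord]:
    """Keep only links that are still live and inside the fresh window."""
    return [l for l in links if is_live(l) and in_fresh_window(l, today)]
```

Filtering out every record with a "link lost" date is roughly what the default "exclude known deleted links" setting is meant to do for you.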
We needed to do a sanity check on our data: we had to verify that our Fresh Index reporting of one particular site was indeed accurate to within a reasonable tolerance. This means that over last weekend we found ourselves re-crawling 824355 backlinks to SearchEngineLand.com. As if we didn’t have better things to do.
As a by-product of this slight diversion, our brand spanking new “Lost Links” graph shows a big spike in the number of LOST links apparently “Lost” by Search Engine Land on a single day. We would like to sincerely apologize to Search Engine Land for any minor misconceptions this may cause, as on a typical day SEL loses maybe 2,000-5,000 or so links and tends to accumulate many more. This is natural, as sites use SEL’s news feed as a primary news source. As the news stories come off the front page of news sites, so do the links.
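A rough sketch of why a bulk recrawl shows up as a spike: a loss can only be recorded on the day a page is re-checked, so checking hundreds of thousands of pages in one weekend bunches the "link lost" dates together. The function names and the `factor` threshold below are assumptions made for illustration, not how our graph is built.

```python
from collections import Counter
from datetime import date
from statistics import median
from typing import Iterable, List, Optional


def lost_links_per_day(lost_dates: Iterable[Optional[date]]) -> Counter:
    """Count how many links were reported lost on each date (None = still live)."""
    return Counter(d for d in lost_dates if d is not None)


def spike_days(counts: Counter, factor: float = 10.0) -> List[date]:
    """Return the dates whose lost-link count exceeds `factor` times the median day."""
    if not counts:
        return []
    typical = median(list(counts.values()))
    return sorted(d for d, n in counts.items() if n > factor * typical)
```

A day flagged this way is worth reading as "the day we looked" rather than "the day the links vanished".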
Recrawling data in the range of ~1 million links requires pretty good technical capability, which we happen to have. Alex outlined the full process and data files so that anyone can check the data for themselves:
Step 1: Search for searchengineland.com in Site Explorer.
Step 2: Create an Advanced Report for searchengineland.com in the Fresh Index using our system, and download all data using the default settings, which exclude known deleted links and mentions but allow links marked as nofollow (we’ve created a download file with the data here: http://tinyurl.com/majesticseotestfile1 – 25 MB compressed).
Step 3: Extract the unique backlinks from that big file (download: http://tinyurl.com/majesticseotestfile2 ).
Step 4: This is where one needs to recrawl nearly 1 million URLs, parse the HTML correctly and build a backlinks index, in order to extract the data again and check how many links are still present – competent level of TECHNICAL EXPERTISE REQUIRED!
Step 5: We used our newly built index with this data to create the same CSV file as in Step 2 (download file here: http://tinyurl.com/majesticseotestfile3 ).
Step 6: Count the number of unique ref pages in the file above (download results here: http://tinyurl.com/majesticseotestfile4 ).
(In case anyone is interested, the final ratio of unique LIVE links still present turned out to be 91%!)
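For anyone who wants to repeat the check on a smaller scale, the core of Steps 3, 4 and 6 can be sketched as below. This is not our crawler: fetching the pages is left out entirely, the CSV column name `SourceURL` is a placeholder assumption rather than the real export header, and the check simply looks for any anchor pointing at the target domain (nofollow links included, as in Step 2).

```python
import csv
import io
from html.parser import HTMLParser
from typing import Dict, Set
from urllib.parse import urljoin, urlparse


class AnchorCollector(HTMLParser):
    """Collects absolute href values from <a> tags in one page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.hrefs.append(urljoin(self.base_url, value))


def unique_source_urls(csv_text: str, column: str = "SourceURL") -> Set[str]:
    """Step 3: de-duplicate the referring pages (column name is hypothetical)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column] for row in reader if row.get(column)}


def links_to_domain(html: str, page_url: str, target_domain: str) -> bool:
    """Step 4 (parsing part): does this page still link to the target domain?"""
    target = target_domain.lower()
    parser = AnchorCollector(page_url)
    parser.feed(html)
    for href in parser.hrefs:
        host = urlparse(href).hostname or ""
        if host == target or host.endswith("." + target):
            return True
    return False


def live_ratio(checked: Dict[str, bool]) -> float:
    """Step 6: share of unique referring pages where the link is still present."""
    if not checked:
        return 0.0
    return sum(checked.values()) / len(checked)
```

Feeding each re-fetched page through `links_to_domain` and the resulting dictionary through `live_ratio` gives a figure comparable in spirit to the ratio quoted above, although a real recrawl also has to deal with redirects, timeouts and broken markup.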
I hope the clarity of showing every “First found” date, every “Last seen” date and, where a link has been deleted, the “Link lost” date will save everyone from having to go through such a tortuous process themselves, and that these dates will help to show that our data is clean. We feel this is about as legitimate and transparent as we can get.
Of course, even 91% is not perfect for everyone, so we have a number of partners who have built significant added value on top of our link data. Some of these partners recrawl the source URLs and not only verify the links regularly, but can also index the content or other areas of the page. Example partners are rotated here. Majestic SEO, however, remains the Largest, Fastest, Freshest and now Cleanest full link map on the Internet.