Are certain phrases associated with different market sectors?
One of the joys of working in the development team here at Majestic is that when a question is raised about some aspect of the internet, quite often the answer is at our fingertips. All the information we need is only a few calls to the API and a handful of lines of code away, and with tools like Wordle, great visualisations are often just a click away.
To investigate the relationships between the anchor text of sites in the same market sector, I wrote a short bit of code that calls the Majestic API command “GetAnchorText” to retrieve the top 250 anchor phrases for a number of different websites. The output of the “GetAnchorText” calls was then processed to identify anchor text common across a number of sites, before using the Wordle advanced data import facility to produce a data visualisation.
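The processing step can be sketched roughly as follows. This is a minimal illustration, not the code used for the post: it assumes the anchor phrases have already been fetched and parsed into (phrase, referring domains) pairs per site, the function names and sample figures are made up, and it assumes Wordle's advanced import accepts one "phrase:weight" entry per line.

```python
from collections import defaultdict


def common_anchor_phrases(anchors_by_site, min_sites=2):
    """Return {phrase: total referring domains} for phrases that appear
    in the anchor text of at least `min_sites` different sites.

    `anchors_by_site` maps a site name to a list of
    (phrase, referring_domains) pairs, e.g. parsed API output.
    """
    sites_using = defaultdict(set)   # phrase -> sites linked to with it
    weight = defaultdict(int)        # phrase -> summed referring domains
    for site, phrases in anchors_by_site.items():
        for phrase, ref_domains in phrases:
            key = phrase.strip().lower()  # treat "Click here" and "click here" alike
            sites_using[key].add(site)
            weight[key] += ref_domains
    return {p: weight[p] for p, sites in sites_using.items()
            if len(sites) >= min_sites}


def to_wordle_lines(weights):
    """Format weights as 'phrase:weight' lines (assumed Wordle advanced format)."""
    ordered = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return "\n".join(f"{phrase}:{w}" for phrase, w in ordered)


# Made-up sample data standing in for parsed API responses.
sample = {
    "blog-a.example": [("click here", 120), ("read more", 40), ("sarah", 3)],
    "blog-b.example": [("Click here", 80), ("my blog", 15)],
    "blog-c.example": [("read more", 10), ("here", 200)],
}

common = common_anchor_phrases(sample, min_sites=2)
print(to_wordle_lines(common))
```

Raising `min_sites` makes the filter stricter: only phrases shared by that many sites survive into the visualisation.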
The result is kind of “bloggy”, but also kind of dull: “Read more via my blog here”? (And who is Sarah?)
I suspect that given the huge diversity of content on all of these blogging sites, the resulting anchor text diversity makes it very unlikely that we will find a huge number of correlating phrases in the top 250 matches for each site.
In case you are wondering, “probefahrt vereinbaren” means “arrange a test drive” (according to Google Translate). I was quite happy when a colleague guessed “German Car Manufacturers” from the resulting Wordle.
Filled with enthusiasm, I decided to try news-related sites from two different markets: the USA and Great Britain. For the US we have The Wall Street Journal, the New York Times, the Huffington Post and the Washington Post. For the UK, we have the Guardian, the Telegraph, the Times, and the Independent:
I found it interesting that, in the newspaper industry on both sides of the pond, publications appear to be linked to with the anchor text of a competitor. Some of this was certainly due to the weighting algorithm and thresholds used to produce the Wordle. Counting referring domains for a phrase with a zero threshold means that if 60,000 websites link to the Guardian with the phrase “the guardian” and one site accidentally links to the Telegraph with the same anchor text, then “the guardian” counts as a phrase shared by both sites and enters the charts quite prominently…
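A small sketch makes the threshold effect concrete. It uses the 60,000 and 1 figures from the example above; the function name, the site labels and the cut-off of 10 are hypothetical and chosen only for illustration.

```python
def sites_matching(phrase, anchors_by_site, min_ref_domains=0):
    """Return the sites for which `phrase` is used as anchor text by
    more than `min_ref_domains` referring domains."""
    return sorted(
        site
        for site, phrases in anchors_by_site.items()
        if phrases.get(phrase, 0) > min_ref_domains
    )


# Referring-domain counts per phrase, per site (illustrative only).
news = {
    "guardian": {"the guardian": 60000},
    "telegraph": {"the guardian": 1},
}

# Zero threshold: the single stray link makes the phrase "shared".
print(sites_matching("the guardian", news, min_ref_domains=0))

# A modest per-site threshold filters the stray link out.
print(sites_matching("the guardian", news, min_ref_domains=10))
```

With a zero threshold both sites match, so the phrase qualifies as common to two sites and carries the Guardian's full weight into the chart; with a per-site minimum it matches only the Guardian and drops out of the shared set.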
So I thought, what if we expanded the list of websites so that, rather than any two matches out of four websites, we required something a little more robust, say four matches across all of the news sites listed above? Let's have a look:
Increasing the sample sets and thresholds really seems to have created a news “fingerprint” for anchor text. This process also suggests that “here” is far more ubiquitous than “click here”, at least for the sample data sets above. Perhaps that is the result of a far too literal reading of a W3C Recommendation?
I'm the Chief cat herder and hirer for our development team. I work with Dixon and Alex to contribute to the design and oversee the development of software which is developed by our talented team of software engineers.