Whilst I was on holiday this summer, a Mathematics teacher approached me in a restaurant and asked me to explain the PageRank formula on my T-shirt, which is really the key to understanding Google’s algorithms. It made me think, and then try to create the best explanation of PageRank that I can. Hopefully it is better than the others I have seen on YouTube.
The PageRank Formula at the heart of Google’s Algorithms
Of course, being a geek, I was wearing the matrix form of the PageRank algorithm: the algorithm that has made Larry Page and Sergey Brin two of the richest, most powerful people in the world. This is the math that built Google.
Read literally, this says:
“The PageRank of a page in this iteration equals 1 minus a damping factor, PLUS… for every link into the page (except for links from the page to itself), add the PageRank of the linking page from the previous iteration, divided by the number of outbound links on that linking page and multiplied by the damping factor.”
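Written out per page rather than as a matrix, that sentence corresponds to the formula below. This is a reconstruction from the wording above, not a copy of the T-shirt. The symbols are labels chosen here for illustration: d is the damping factor (0.85 in the worked example later), B(p) is the set of pages linking to page p, and L(q) is the number of outbound links on page q.

```latex
PR_{t+1}(p) = (1 - d) + d \sum_{q \in B(p),\; q \neq p} \frac{PR_{t}(q)}{L(q)}
```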
Well – maybe for a few of you. But this algorithm is fundamental in understanding links and in particular, understanding why most links count for nothing or almost nothing. When you get to grips with Google’s algorithm, you will be light years ahead of other SEOs… but I never really see it properly explained. I guarantee that even if you know this algorithm inside out, you’ll see some unexpected results from this math by the end of this post and you will also never use the phrase “Domain Authority” in front of a customer again (at least in relation to links).
I am not asking anyone here to know much more than simple Excel.
PageRank in Practice
I am going to start by showing you how that maths applies to this representation of a VERY small Internet system with only 5 nodes. Then we will look at a very slightly different map which has profound consequences for our results.
Before we start, maybe have a look at this and GUESS which node has the highest PageRank. (The heads of the tadpole-shaped lines are the “arrows” that show the direction of the links.)
The PageRank algorithm is what is called an iterative algorithm. We start with some estimates and then we continually refine our understanding of the ecosystem we are measuring. So how can we see how this formula applies to this ecosystem?
Firstly, we need to create a matrix… we have nodes A to E. I’ll call them pages for now, because it is a terminology we understand, but the hardcore fans should know I mean “nodes”, as this is important later.
- The Start Value (in this case) is the number of actual links into each “node”. Most people actually set this to 1 to start, but there are two great reasons for using link counts. First, it is a better approximation to start with than giving everything the same value, so the algorithm stabilizes in fewer iterations. Second, it is very useful for checking my spreadsheet, as you will see in a second. So node A has one link in (from page C).
- Now let’s map out all the blanks in a matrix… starting with the fact that a page cannot link to itself (OK… it can… but not in Google’s algorithm).
- Node A ONLY links to C
- Node B ONLY links to C
- Node C links to A, B & E
- Node D links to B and 3 TIMES to E! Do you count it once or 3 times? I’m going to count it ONCE right now, but we’ll come back to that oddity later.
- Node E ONLY links to D
So here’s the grid. We can check a few things here… there are 8 green boxes, which equals the number of links in our model (as long as we count the 3 links from D to E only once).
Also – note that the majority of this grid is red… most pages on the Internet do not link to each other.
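For anyone who prefers code to a spreadsheet, the grid can be sketched in a few lines of Python. This is only an illustration of the link map listed above. The variable names are made up here, and D's three links to E are collapsed into one by using a set, matching the "count it ONCE" choice.

```python
# Link map from the walkthrough. D's three links to E are listed as written,
# then collapsed to a single link by converting each list to a set.
raw_links = {
    "A": ["C"],
    "B": ["C"],
    "C": ["A", "B", "E"],
    "D": ["B", "E", "E", "E"],
    "E": ["D"],
}
pages = sorted(raw_links)

# Drop duplicate links and any self-links.
links = {src: set(dsts) - {src} for src, dsts in raw_links.items()}

# grid[src][dst] is True for a "green box" (src links to dst), False for a "red box".
grid = {src: {dst: dst in links[src] for dst in pages} for src in pages}

green_boxes = sum(grid[s][d] for s in pages for d in pages)
inbound = {dst: sum(grid[src][dst] for src in pages) for dst in pages}
outbound = {src: len(links[src]) for src in pages}

print("green boxes:", green_boxes)
print("links in (start values):", inbound)
print("links out:", outbound)
```

The inbound counts are the start values used in the rest of the walkthrough, and the green box count is the check described above.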
This is a simplification of that formula. It’s not TOO scary now is it? So now we can add the multiplier to each column. This is how much of its value each link will pass on to pages it links to.
So, for example, Page A has a PR of 1, which is multiplied by 0.85 and divided by its single outbound link. So the multiplier is 0.85.
On page C, the PR is 2, so the multiplier is 2 × 0.85, all divided by the three outbound links. This means each link lends a score of 0.5666667.
(This presentation is not going to go into the case where the number of outlinks is zero.)
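The multiplier step can be sketched as below. It assumes the start values are the inbound link counts, the damping factor is 0.85 and D's links to E are counted once. The names are illustrative, not taken from the spreadsheet.

```python
DAMPING = 0.85  # damping factor used in the worked example

start_pr = {"A": 1, "B": 2, "C": 2, "D": 1, "E": 2}   # inbound link counts
out_links = {"A": 1, "B": 1, "C": 3, "D": 2, "E": 1}  # D -> E counted once

# Value that each individual outbound link passes on:
# PageRank x damping factor / number of outbound links.
multiplier = {p: start_pr[p] * DAMPING / out_links[p] for p in start_pr}

for page, value in multiplier.items():
    print(page, round(value, 7))
```

Under those assumptions A comes out at 0.85 and C at 0.5666667, as in the text.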
So now we go along the green boxes, filling each one in. So…
Page A gives one link TO page C… each link it gives has a value of 0.85… so we put 0.85 in this box.
Page C links to THREE pages, giving 0.5666667 to each one…. And so on until the green boxes are filled.
Now… if you remember, the multiplier took the damping off every link’s value before we started this, so we need to add it back: 0.15 (1 minus the damping factor of 0.85) goes onto every page. This means the total amount of PageRank will stay stable.
Then we add up the columns, to find new PageRank values for each page! Here’s the completed grid:
[EDIT Mar 7 2019: Thanks to Pablo Rodríguez Centeno for pointing out column C should add up to 2.7, not 1.85 in the table above. Nice spot!]
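One full pass over the grid can be sketched like this. It is a minimal illustration under the same assumptions as the walkthrough (start values equal to inbound link counts, damping factor 0.85, D to E counted once). The function name is invented for the example.

```python
DAMPING = 0.85
links = {"A": ["C"], "B": ["C"], "C": ["A", "B", "E"], "D": ["B", "E"], "E": ["D"]}
start_pr = {"A": 1.0, "B": 2.0, "C": 2.0, "D": 1.0, "E": 2.0}


def one_iteration(pr, links, damping=DAMPING):
    """Pass each page's share along its links, then add (1 - damping) to every page."""
    new_pr = {page: 1.0 - damping for page in pr}
    for src, targets in links.items():
        share = pr[src] * damping / len(targets)  # the multiplier for this page
        for dst in targets:
            new_pr[dst] += share
    return new_pr


first = one_iteration(start_pr, links)
for page in sorted(first):
    print(page, round(first[page], 4))
```

With these inputs page C comes to 0.85 + 1.7 + 0.15 = 2.7, which agrees with the correction noted above.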
Now that is really all there is to the PageRank Algorithm – but I did say it is iterative. So you need to do it again and again to get to the real PageRank for every page. I therefore cut and paste the values back into the start values to get the next iteration. My boxes are already referenced, so the next iteration is worked out instantly…
If you want to see my Excel spreadsheet, by the way, here’s what to do.
…I take the numbers at the bottom…
And put them into the top… giving me new numbers at the bottom, which I…
Cut and paste into the top again to get the third iteration… and again and again.
This is what happens to the numbers after 15 iterations…. Look at how each of the 5 nodes is settling down to a stable number of its own. If we had started with all pages being 1, by the way, which is what most people tell you to do, this would have taken many more iterations to get to a stable set of numbers (and in fact, in this model, would not have stabilized at all).
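The paste-the-bottom-into-the-top routine is just a loop. The sketch below repeats the single pass 15 times and shows how far each value would still move on a 16th pass. It assumes the same 5-node link map, link-count start values and damping factor of 0.85. It is not a copy of the spreadsheet, and the function name is made up.

```python
DAMPING = 0.85
links = {"A": ["C"], "B": ["C"], "C": ["A", "B", "E"], "D": ["B", "E"], "E": ["D"]}
start_pr = {"A": 1.0, "B": 2.0, "C": 2.0, "D": 1.0, "E": 2.0}


def iterate(pr, links, rounds, damping=DAMPING):
    """Repeat the 'paste the results back into the start values' step `rounds` times."""
    for _ in range(rounds):
        new_pr = {page: 1.0 - damping for page in pr}
        for src, targets in links.items():
            share = pr[src] * damping / len(targets)
            for dst in targets:
                new_pr[dst] += share
        pr = new_pr
    return pr


after_15 = iterate(start_pr, links, 15)
after_16 = iterate(start_pr, links, 16)
for page in sorted(after_15):
    change = abs(after_16[page] - after_15[page])
    print(f"{page}: {after_15[page]:.4f} (a 16th pass moves it by {change:.4f})")
```

Raising the `rounds` argument is the equivalent of pasting a few more times.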
Now we have done the math, we can see which is the most important page on our Internet.
Was it the one you guessed? Well whether you said “yes” OR “no”…. It’s now time to reveal the wider story.
You recall I said “nodes” instead of “pages”? That’s because this was doing the PageRank at the lowest common denominator I had…. 5 nodes. But what if these were actually domains, not pages?… Now I will put in the pages for each domain and start again…
So now we have 10 nodes, not 5… and IMPORTANTLY, we now have some internal linking….
Where do you think the power will lie in this version of the Internet?
Am I mad enough to do all of the calculations again? Oh Yeh…
… and here are the actual scores for every page.
The winning page being Node E1.
Some Interesting Observations
The winning domain was site C in the 5 node model, so if you had used domain level modelling, you would have hoped for links from pages which are amongst the WORST at the page level.
The internal links on a site you cannot control dramatically affect the PageRank of your own pages!
PageRank was only EVER done at the page level… Majestic does our calculations at top level, Subdomain level and Page level – and in the quest to show our customers higher link counts, we default to TLD first – as do our competitors… but it is the PAGE level that counts.
If you build a new site and only used Domain Authority to create links, you could EASILY have got linked from the worst page possible, even if it was from the best domain, because of the INTERNAL LINKS of the other web pages! How on earth are you going to be able to see the strength of a link if that strength depends on the internal links on an entirely different website?!
The second observation is that the data does not have to be complete, but it works best with a universal data set.
Back in 2014, one of our researchers wrote this blog post after a study using the PageRank algorithm ONLY on Wikipedia showed Carl Linnaeus as more influential than Jesus or Hitler.
Majestic’s Citation Flow, as a proxy to PageRank, could have told the researcher a different, more likely result, as our data uses a larger section of the Internet.
The next oddity is that the majority of pages have hardly any PageRank at all! The top three pages in this 10 node model account for 75-80% of the entire PageRank of the system.
“Link Counts as an initial estimate for PageRank sucks as a metric”
The next oddity is that the original guess, using Link Counts as an initial estimate for PageRank, sucks as a metric. This chart has plotted the PageRank of each of the pages as an area. When we started, page C3 was our best guess for the highest PageRank. But look at how much love it loses by the end of the modelling.
“PageRank doesn’t Leak”
In both versions of my model, I used the total of my initial estimate to check my math was not going south. After every iteration, the total PageRank remains the same. This means that PageRank doesn’t leak! 301 redirects cannot just bleed PageRank, otherwise the algorithm might not remain stable. On a similar note, pages with zero outbound links can’t be “fixed” by dividing by something other than zero. They do need to be fixed, but not by diluting the overall PageRank. I can maybe look at these cases in more depth if there is some demand.
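A similar sanity check can be sketched in code. This assumes the 5-node link map from earlier (D's links to E counted once), a damping factor of 0.85 and the per-page form of the formula, where 1 minus the damping factor is added to every page. The function names are invented for illustration. In this form every page hands on exactly 0.85 of its value, split across its links, and every page then receives 0.15, so nothing disappears along the links themselves. The grand total holds exactly steady when the start values add up to the number of pages, and from other start values it moves towards that figure.

```python
DAMPING = 0.85
links = {"A": ["C"], "B": ["C"], "C": ["A", "B", "E"], "D": ["B", "E"], "E": ["D"]}


def step(pr, damping=DAMPING):
    """One iteration; also returns how much value travelled along the links."""
    new_pr = {page: 1.0 - damping for page in pr}
    passed_on = 0.0
    for src, targets in links.items():
        share = pr[src] * damping / len(targets)
        for dst in targets:
            new_pr[dst] += share
            passed_on += share
    return new_pr, passed_on


def run_totals(start_pr, rounds=15):
    """Total PageRank before the first iteration and after each one."""
    pr, totals = dict(start_pr), [sum(start_pr.values())]
    for _ in range(rounds):
        pr, _passed = step(pr)
        totals.append(sum(pr.values()))
    return totals


from_ones = run_totals({page: 1.0 for page in links})
from_counts = run_totals({"A": 1.0, "B": 2.0, "C": 2.0, "D": 1.0, "E": 2.0})
print("start at 1:          ", [round(t, 3) for t in from_ones[:4]])
print("start at link counts:", [round(t, 3) for t in from_counts[:4]])
```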
The web is big
I’d like to leave you with these thoughts. I have shown you how this works in a world 10 pages big.
10 Pages X 10 calculations (albeit many multiplied by zero) and then 15 iterations is 1,500 bits of Maths.
Majestic does a similar (but different) calculation over 500 billion URLs a day for our Fresh index and currently 1.8 Billion pages a month on our Historic Index.
PageRank Proxies are HARD to build!
… which is just one reason why Google still cannot let it go… This is a Tweet from Google Gary.
Lastly, PageRank is not about rankings, because pure PageRank does NOT consider context. So be very wary of using page metrics that are based on search visibility. Majestic’s Citation Flow is about the purest correlation to PageRank currently available, although the algorithm is a little different.
References for the Reader
PageRank is a trademark of Google. The algorithm is protected (in the USA) by patent US6285999, which is assigned to Stanford University. Whilst Majestic’s formula correlates closely to PageRank in tests, it has some unique differences which we do not make public.