Our Ambassador in Ukraine has just returned from presenting some university research in Seoul. We thought it would be useful for blog readers to see some of the “bigger picture” things that Majestic is beginning to get involved in. This is Dmytro’s very different conference report.

Dmytro Filchenko holds a Doctorate in Mathematical Modeling and Computational Methods and is currently the Head of the Centre for Webometrics and Web Marketing at Sumy State University, Ukraine. See http://www.linkedin.com/in/filchenko for further details.

Hyperlinks are a subject of interest not only to SEOs and web marketing specialists but also to researchers in the fields of web mining, information retrieval and webometrics. The term ‘webometrics’ was coined more than 15 years ago, and the field it names has grown constantly since. It assesses web presence, web usage and web impact indicators, and discovers new patterns on the Web.

This subject area was discussed in depth at one of the most comprehensive events in the field, the 8th International Conference on Webometrics, Informetrics & Scientometrics, organized by the Global Interdisciplinary Research Network COLLNET. It was held recently in Seoul, South Korea, where I was invited to present my paper.

Every day, all of us use various web metrics: for example, Google PageRank when searching the web, Alexa Traffic Rank when looking for popular sites, or MajesticSEO’s Flow Metrics when estimating the impact of a given URL. However, have you ever considered the scale of the samples examined while computing these metrics (thousands of billions of URLs) and how difficult it is to solve the problem of precision (the public Google PageRank takes only 11 integer values, from 0 to 10)?

If one were to compute such metrics for a specific subject area, size and precision would be less of a problem and, more importantly, one would gain better insight into that subject area.

For instance, if you are going to estimate the authoritativeness of your company’s web domain alongside those of your competitors, you would need to calculate your own metrics on your own sample of web domains and not on the whole web, as most major metrics do. However, this requires a database of backlinks, which are the core data for any modern metric. This is where MajesticSEO comes in.

My team was given a project to evaluate the level of mutual impact between universities, and I thought, why not work on this from the webometric point of view? Placing a hyperlink to another university’s website requires extra effort, which raises the question: why would one university pay such a tribute to another? If a university decides to place a link to another on its site, this implies that it values what it is linking to.

More than ten years ago, there was a similar project called ‘G-Factor’, which was aimed at evaluating hyperlinks to universities. Using the Google search engine, it counted the number of hyperlinks to each university’s website from all other university websites. The more hyperlinks a university website obtained from other universities’ websites, the higher it was ranked.
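The counting rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the original G-Factor implementation: the function name `g_factor_ranking`, the `(source, target)` pair format and the example domains are all assumptions made for the sketch.

```python
from collections import Counter

def g_factor_ranking(links, sample):
    """Rank domains by the number of links received from other sampled domains.

    links  -- iterable of (source_domain, target_domain) pairs (assumed format)
    sample -- the university domains being compared
    """
    sample = set(sample)
    inbound = Counter({domain: 0 for domain in sample})
    for source, target in links:
        # Count only links between two different domains of the sample.
        if source in sample and target in sample and source != target:
            inbound[target] += 1
    # Highest count first; ties are broken alphabetically.
    return sorted(inbound.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical data, for illustration only.
links = [
    ("uni-a.example", "uni-b.example"),
    ("uni-c.example", "uni-b.example"),
    ("uni-b.example", "uni-a.example"),
    ("uni-a.example", "uni-a.example"),   # internal link, ignored
    ("news.example", "uni-a.example"),    # source outside the sample, ignored
]
ranking = g_factor_ranking(links, ["uni-a.example", "uni-b.example", "uni-c.example"])
```

Here `uni-b.example` comes first because it receives two links from other universities in the sample, regardless of how authoritative those linking sites are.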

Despite its obvious advantages, G-Factor in its classical form had some drawbacks. First and foremost, G-Factor used Google as its backlinks provider (expressed by the letter ‘G’ in the name), and Google can no longer be regarded as a hyperlink provider. Secondly, G-Factor did not take into account the authoritativeness of a hyperlink source: it ignored where a link to a university website comes from.

Back to our project: a new data provider was required, as well as a new algorithm which took into account the mutual authoritativeness of university websites. As for the backlinks, we tried different sources, but they were either poor in quality, extremely expensive, or had very primitive APIs. After months of evaluations, we finally found Majestic!

As we started using the Majestic database, we worked closely with the API Support Team. Webinars organized by Dixon Jones were exceptionally useful for us. All of this enabled us to create a special GUI tool for exporting the list of hyperlinks from the Majestic database for each university web domain. Another tool then parses that list, extracting only those hyperlinks that appear on university web domains included in the sample. This analyzer creates a so-called Backlinks Matrix.
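To make the Backlinks Matrix more concrete, here is a minimal sketch of how such a matrix might be assembled from an exported list of links. It does not reproduce our actual tools or any Majestic API call: the function name `build_backlinks_matrix`, the `(source, target)` pair format and the example domains are assumptions, and treating the matrix as a table of link counts is one plausible reading of the description above.

```python
def build_backlinks_matrix(backlinks, domains):
    """Return matrix[i][j] = number of links from domains[i] to domains[j].

    backlinks -- iterable of (source_domain, target_domain) pairs (assumed format)
    domains   -- ordered list of the university domains in the sample
    """
    index = {domain: i for i, domain in enumerate(domains)}
    size = len(domains)
    matrix = [[0] * size for _ in range(size)]
    for source, target in backlinks:
        # Keep only links whose source and target are both in the sample,
        # and skip links from a domain to itself.
        if source in index and target in index and source != target:
            matrix[index[source]][index[target]] += 1
    return matrix

# Hypothetical export, for illustration only.
domains = ["uni-a.example", "uni-b.example", "uni-c.example"]
backlinks = [
    ("uni-a.example", "uni-b.example"),
    ("uni-a.example", "uni-b.example"),
    ("uni-c.example", "uni-a.example"),
    ("news.example", "uni-a.example"),    # not a sampled university, dropped
]
matrix = build_backlinks_matrix(backlinks, domains)
```

Each row lists the links a university gives, and each column the links it receives.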

Afterwards, we concentrated on strengthening the algorithm of the original G-Factor. As the first step of the research, we proposed using the original Brin & Page model for PageRank, which uses the hyperlink structure of the Web to build a Markov chain with a transition probability matrix. Without getting too technical, I’ll just say that we used the power iteration technique to solve the resulting system of equations, which finally gave us the PageRank values.
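For readers who do want a little of the technical side, the following is a minimal sketch of PageRank by power iteration over a small link-count matrix. It is a textbook version under stated assumptions, not the code used in our project: the damping factor of 0.85 is the value suggested in the original Brin & Page paper (the value we used is not given here), rows without outgoing links are assumed to be spread uniformly, and the function name and example matrix are hypothetical.

```python
def pagerank(link_counts, damping=0.85, tol=1e-10, max_iter=1000):
    """Compute PageRank values from a square matrix of link counts.

    link_counts[i][j] is the number of links from site i to site j.
    """
    n = len(link_counts)
    # Build the transition probability matrix of the Markov chain:
    # each row is normalised to sum to 1.
    transition = []
    for row in link_counts:
        total = sum(row)
        if total == 0:
            # Assumption: a site with no outgoing links jumps anywhere uniformly.
            transition.append([1.0 / n] * n)
        else:
            transition.append([count / total for count in row])

    rank = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        new_rank = [
            (1.0 - damping) / n
            + damping * sum(rank[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
        # Stop once the values no longer change noticeably.
        if sum(abs(a - b) for a, b in zip(new_rank, rank)) < tol:
            return new_rank
        rank = new_rank
    return rank

# Hypothetical three-site example: site 0 links twice to site 1,
# site 1 links nowhere, site 2 links once to site 0.
ranks = pagerank([[0, 2, 0], [0, 0, 0], [1, 0, 0]])
```

In this toy example site 1 ends up with the highest value and site 2 with the lowest, and the values sum to 1, as expected for a probability distribution.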

All in all, for our project, which we named ‘Extended G-Factor’, we used the same concept as Google PageRank and G-Factor (estimating the universities’ mutual impact as a function of the mutual hyperlinks between their websites), although our model was much more intricate. We took 324 Ukrainian university web domains from the directory of ‘Ranking Web of World Universities’ and calculated our Extended G-Factor.

The project revealed a number of findings. Firstly, most Ukrainian universities were not fond of citing other universities on their websites. This may well be down to competition: a way to keep students and other categories of customers from being exposed to rival sites.

Moreover, those universities that do cite other universities on their websites do not actually link to papers and articles. Again, this could be to give less coverage to opponents’ works. However, academic institutions that had more in common in terms of specialty had more mutual hyperlinks. It also appeared that small schools ranked very high thanks to ‘hyperlinking’ with major universities, most likely because they are not regarded as ‘big competition’. As a result, the major universities, which usually obtain high positions in Webometrics Rankings, are not well represented in the Extended G-Factor Ranking.

It was also discovered that universities in the same region are more likely to link to each other.

In conclusion, our work showed that the hyperlink is a powerful tool which provides a good metric of mutual impact. Our project has been published at http://ranking.sumdu.edu.ua.

Dixon Jones

Comments

  • Jesse Leimgruber

    This is a phenomenal study, and it was interesting to see the quality of rankings that can come solely from link data. (Their rankings are very similar to the US News college rankings)

    That being said, I am curious to know whether or not they excluded link data from student and club pages. (Clicking my Website link above will bring you to my Stanford.edu student page; all students get one at Stanford.) In addition, many university citations appear in PDF reports, not on the actual web pages themselves. Would there be any way to count these?

    Awesome job Majestic and Dmytri Filchenko, keep it up!

    November 13, 2012 at 9:16 pm
    • Alex

      Hi,

      We exclude internal links in our counts, so if student club pages are on the same root domain such as stanford.edu then we’d exclude them from our counts BUT we do include internal linking in our Flow Metrics.

      November 14, 2012 at 11:18 am
    • Dmytro Filchenko

      Hi Jesse,

      Do the US News college rankings really use link data? This is a surprise to me.

      As for our research, we counted only backlinks (external hyperlinks). Fortunately, Majestic SEO provides this data.

      November 14, 2012 at 8:31 pm
      • Jesse Leimgruber

        US News does not use link data in their rankings. I was just pointing out that your rankings correlate very strongly to theirs, illustrating the accuracy in your findings. Keep up the good work!

        November 20, 2012 at 6:34 am
  • Nitmeare

    It would be interesting to count “quotations” and not only links: for example, when university 1 does not link to university 2 but mentions the name of the other university in one of its documents (.html, .pdf, .doc etc.).

    This way PageRank or other ranking algorithms could be used to check how often a university is mentioned in scientific texts. This would make it possible to rank the impact of one university on scientific discourse on a global scale.

    Or to check how often the works of a scientist are quoted by others, and so to calculate a ranking for the relevance of a scientific work or its author in the science community in just the same way as the rank of websites is calculated. (This might be useful for scientific search engines.)

    I think this could be done with existing ranking algorithms; there is just a need for a crawler which is able to extract the “quotations” and interpret them correctly. Because the “source directory” of scientific works is usually highly standardised, it should not be too complicated to develop such a system.

    Of course this might also be used to analyse how often the name of a specific brand, company or product is mentioned in social networks or on blogs, to measure its popularity. For example, you could create a popularity ranking of all companies traded on the NYSE, updated on a daily or even hourly basis. Such data might be extremely valuable for forecasting prices on the stock exchange, monitoring election campaigns, et cetera.

    This should even be possible without a full text index (which would cost too much money, I assume?). Just provide crawlers with a database of “quotes” they should search for, so that they extract all “quotes” the same way as they are already extracting links and send them back to a central server for further analysis and creation of statistics.

    “Quotes” could, for example, be the names of the top 10,000 NYSE-listed companies or some popular products such as “android” or “iPad”.

    The quality of “quotes” could be estimated using Trust Flow for the sites they have been found on.
    This might open up completely new horizons for data mining and statistics, especially if combined with the new GeoIP map MajesticSEO is offering.

    If this works well, there might be a Level 2: even deeper analysis, combining “quotes” with “words”. For example, if a “quote” (product name) is found in an HTML text block which contains “positive words”, we might assume that the person likes the new “quote” (product) of company X. If negative words are found near the text position where the “quote” is found, it would mean that someone does not like it. So companies could get statistics showing them how popular a new product is and how positively or negatively it is discussed on the internet. Using GeoIP, they might get these statistics for specific continents and nations. Combined with Trust Flow, they would find the most important sources where their product is discussed.
    The results could also be filtered, for example to show links on sites which criticise the product, so companies could more easily investigate the reasons why people like or don’t like it.

    This would also be very interesting for sociologists and economists, who could analyze data on a massive scale to get more precise and up-to-date information on any topic than would be possible with traditional (offline) ways of collecting information (such as surveys).

    November 15, 2012 at 12:47 am
    • Dmytro Filchenko

      Hi Nitmeare,

      You are completely right that quotations, in addition to hyperlinks, can really open up new horizons for webometrics.

      The same idea has been widely used for academic content by Google Scholar, MS Academic Search, Scopus, ISI and others. They count quotations in the bibliographic sections of papers and then compute different indices.

      Why not try this as the next improvement of the universities’ G-Factor? It would be very promising.

      November 19, 2012 at 8:10 am
