BCUHack Majestic Winners
Team Sapphire built crawlr, a Tinder-esque app for discovering which of the top 500 websites in the Majestic Million someone might like. A swipe left/right gesture builds a Topical Trust Flow profile of the kinds of things the user is interested in. Their solution was multi-tiered and well structured, and even included a Python script that grabs up-to-date screenshots of the websites in their database every ten minutes.
Max from Team Sapphire had this to add:
“When the user swipes right in the app, our API calls Majestic’s GetRefDomains endpoint. We then add five of the returned domains that aren’t already in the database to the sites table; these are scanned the next time the poller runs, which picks up site information and takes a screenshot of each site. If the Majestic API has outdated information for a website (for example, no title), we use Python’s urllib2 to grab it and populate the database.”
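To make that flow concrete, here is a minimal sketch of how such a back end could fit together. This is not Team Sapphire's code: the endpoint parameters, the response parsing, the sites table schema, and the use of wkhtmltoimage for screenshots are all assumptions for illustration. GetRefDomains is a real Majestic API command, but check the current API documentation for its exact parameters and payload shape.

```python
# A rough sketch of the flow Max describes, assuming Python 3 and SQLite.
import json
import re
import sqlite3
import subprocess
import urllib.parse
import urllib.request

MAJESTIC_API = "https://api.majestic.com/api/json"  # endpoint is an assumption
API_KEY = "YOUR_APP_API_KEY"  # placeholder


def handle_swipe_right(db, liked_domain):
    """On a right swipe, ask Majestic for the liked site's referring domains
    and queue up to five unseen ones in the sites table for the poller."""
    params = urllib.parse.urlencode({
        "app_api_key": API_KEY,
        "cmd": "GetRefDomains",
        "item0": liked_domain,
    })
    with urllib.request.urlopen(f"{MAJESTIC_API}?{params}") as resp:
        data = json.load(resp)
    # The response structure below is an assumption; adapt it to the real
    # GetRefDomains payload documented by Majestic.
    candidates = [row["Domain"] for row in data["Tables"][0]["Rows"]]
    added = 0
    for domain in candidates:
        if added == 5:
            break
        cur = db.execute(
            "INSERT OR IGNORE INTO sites (domain, title, scanned) "
            "VALUES (?, NULL, 0)", (domain,))
        added += cur.rowcount  # rowcount is 0 when the domain already existed
    db.commit()


def fetch_title(domain):
    """Fallback title scrape; the team used Python 2's urllib2, and
    urllib.request is its Python 3 equivalent."""
    try:
        with urllib.request.urlopen(f"http://{domain}", timeout=10) as resp:
            html = resp.read(65536).decode("utf-8", errors="replace")
    except OSError:
        return None
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    return match.group(1).strip() if match else None


def poll_once(db):
    """One pass of the poller (the team ran theirs every ten minutes):
    screenshot each queued site and backfill any missing title."""
    for domain, title in db.execute(
            "SELECT domain, title FROM sites WHERE scanned = 0").fetchall():
        # wkhtmltoimage is a stand-in; the team's screenshot tool isn't named.
        subprocess.run(["wkhtmltoimage", f"http://{domain}",
                        f"shots/{domain}.png"], check=False)
        if not title:
            db.execute("UPDATE sites SET title = ? WHERE domain = ?",
                       (fetch_title(domain), domain))
        db.execute("UPDATE sites SET scanned = 1 WHERE domain = ?", (domain,))
    db.commit()


# Usage sketch:
#   db = sqlite3.connect("crawlr.db")
#   db.execute("CREATE TABLE IF NOT EXISTS sites "
#              "(domain TEXT PRIMARY KEY, title TEXT, scanned INTEGER)")
#   handle_swipe_right(db, "example.com")
#   poll_once(db)
```

One appealing property of this polling design, which matches the description above, is that the swipe handler and the screenshot work are decoupled through the sites table, so a slow page load never blocks the app's swipe response.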
Here are a screenshot and a fantastic video showing crawlr in action:
We caught up with the winning team of Tom Bofry, John Hayes and Max Maton (left to right) from Birmingham City University and, after the prize-giving, asked them a few questions:
How did you come up with the idea for crawlr?
It’s really quite a blur as to how we came up with the idea. I’m pretty sure we sat down, and it was the second or third idea I [Max] threw out there; John and Tom liked it, and we knew the pressure was on, so we just started! I remember thinking that I only use about five sites regularly (reddit, arstechnica, y-combinator, facebook, and bbc news) and thought it would be nice to find new, similar websites I’d be interested in.
At Majestic we all thought the end result was fascinating. Did you expect it to be so good when you first started?
We were happy with the end result, considering the time constraints. Initially Tom mocked up the UI for extra features such as ‘Sites recommended to me’ and ‘Sites I like’; however, I didn’t have time to implement the API calls, so we didn’t hit our goals completely. But the front-end definitely exceeded our expectations. Tom did a brilliant job; he knuckled down, only turning around to tell us to stop breaking the API!
If you had more time, how would you take crawlr forward?
I would personally have loved to crawl each website’s top 5 pages, taking a snippet of each page’s text (perhaps the first x paragraphs after the first H1) and running them through IBM Watson’s personality identifier. We could then attach personality indexes to each site, which would allow us to tell if a website is perhaps open, liberal and happy, or closed-minded, conservative and sad, etc. This would further help us narrow our topical searches.
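As a rough illustration of the snippet-extraction half of that idea, the sketch below pulls the first few paragraphs that follow a page's first H1. The Watson scoring step is left as a stub, since the exact service and request format aren't specified above; the class and function names are hypothetical.

```python
# Hypothetical sketch: extract the first N <p> elements after the first <h1>,
# as input for a personality-analysis service. Not the team's code.
from html.parser import HTMLParser
import urllib.request


class H1SnippetParser(HTMLParser):
    """Collects text from the first N <p> elements after the first <h1>."""

    def __init__(self, max_paragraphs=3):
        super().__init__()
        self.max_paragraphs = max_paragraphs
        self.seen_h1 = False
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.seen_h1 = True
        elif (tag == "p" and self.seen_h1
              and len(self.paragraphs) < self.max_paragraphs):
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


def page_snippet(url, max_paragraphs=3):
    """Return the first few post-H1 paragraphs of a page as one string."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = H1SnippetParser(max_paragraphs)
    parser.feed(html)
    return " ".join(p.strip() for p in parser.paragraphs)


def personality_profile(text):
    """Stub: send `text` to a personality-analysis service (such as what was
    then IBM Watson Personality Insights) and return trait scores."""
    raise NotImplementedError
```

With per-site trait scores stored alongside the Topical Trust Flow profile, the matching step could then filter recommendations by personality as well as by topic.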
Is this your first hackathon, and if not, which others have you been to?
This was my second hackathon – I attended a machine learning hackathon in Rapallo, Italy, hosted by the company I worked for in my sandwich year. We used Watson’s cognitive advisor to narrow down a subset of search results based on a user’s natural language, building a small search engine. I really enjoyed that hackathon because we were all of similar ages and similar talents, and really enthusiastic! We took it in turns to sleep (albeit very little), and there was so much enthusiasm for what we were doing.
Which other hacks impressed you the most?
I really liked the prospect of combining geo-aware software with domain analysis for real-world business ranking; I think it has so much potential (as well as reversing it – if a store has a great site and is doing well on the street, why not bump it up in SERP rankings?). After chatting a lot with the guys from Team Jill, who built the education app, I really enjoyed their enthusiasm and big ideas; it could certainly make an impact on home learning (I envision it being implemented in universities’ open courses).
We loved meeting Max, Tom and John, and seeing what they could do with Majestic data.