Promotional image for a live podcast on optimising your website for website crawlers with guests Charlie Williams, Rejoice Ojiaku and Miruna Hadu.

When we talk about optimising a website for crawlers, we’re really talking about laying the groundwork for visibility. In this episode, we’ll explore what it takes to make your site as crawl-friendly as possible, from smart internal linking and crawl budget management to finding and removing any technical blockers that might prevent your most important pages from being indexed.

Joining our host David Bain were Charlie Williams, Rejoice Ojiaku, and Miruna Hadu.

Watch On-demand

Listen to the Podcast

Transcript

David Bain 

Why is website crawling important for SEO? Hello and welcome to the July 2025 edition of the Majestic SEO panel, where we’re discussing why website crawling is important for SEO. I’m your host David Bain, and joining me today are three great guests, so let’s meet them. Starting off with Miruna.

Miruna Hadu 

My name is Miruna Hadu. I am the Customer Support and Success Specialist at Sitebulb. So if you use the tool, or have ever contacted us, I’m behind that inbox. I have some years of agency experience in SEO, and I’m now using it to help people use the tool.

David Bain 

Thanks for joining us Miruna. Also with us today is Charlie.

Charlie Williams 

Hi there. I’m Charlie Williams and I’m an independent SEO consultant, a freelancer, essentially, having done SEO for quite a while now, over 15 years. I’ve had a lot of different jobs in different places, but always had a lot of focus on technical SEO. I now specialize in essentially on-site SEO, so that’s technical and content, and helping people build better websites.

What it boils down to is that crawling has always been a huge part of my work for many years, including doing training workshops for crawling tools like Screaming Frog, where I used to work as well, and also doing a training session on managing crawling and indexing for BrightonSEO.

David Bain 

Thank you, Charlie. And also with us today is Rejoice.

Rejoice Ojiaku 

Hi, I’m Rejoice. I’ve been doing SEO for around six years now. I’m the B2B SEO manager at Nelson Bostock, which is a PR and integrated comms agency. I’ve mostly focused around content, data and, most recently, AI search and all those fun, wonderful SEO aspects of things. It’s nice to be here.

David Bain 

So three great panelists there. Let’s go back to Miruna for the first question. So, Miruna, what is website crawling?

Miruna Hadu 

So when we talk about website crawling there’s a couple of things that we can refer to. Generally, in SEO, we talk about search engines like Google crawling our site, which is that process of discovery of our pages, which then leads on to things like rendering, processing, and indexing, to have those pages in the search results.

And on the other hand, we can talk about crawling as the process of auditing and discovery that we do on our website. So we might be using crawling tools such as Sitebulb, Screaming Frog, or Oncrawl. All of the crawlers out there essentially audit your website, discover all the pages and gather data about them: indexability, structured data, everything that you need to know in order to analyze and improve your website.
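
To make that auditing idea concrete, here is a minimal sketch of the kind of per-page data an audit crawl gathers, written in Python with the requests and BeautifulSoup libraries; the URL is a placeholder, and this is not code from Sitebulb, Screaming Frog or Oncrawl:

```python
import requests
from bs4 import BeautifulSoup

def audit_page(url):
    """Fetch one URL and collect the basic indexability signals an audit crawl records."""
    resp = requests.get(url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")

    title = soup.title.get_text(strip=True) if soup.title else None
    robots_meta = soup.find("meta", attrs={"name": "robots"})

    canonical = None
    for link in soup.find_all("link"):
        if "canonical" in (link.get("rel") or []):
            canonical = link.get("href")

    return {
        "url": url,
        "status": resp.status_code,
        "title": title,
        "meta_robots": robots_meta.get("content") if robots_meta else None,
        "canonical": canonical,
        "x_robots_tag": resp.headers.get("X-Robots-Tag"),  # header-level robots directives
    }

# Hypothetical example:
print(audit_page("https://www.example.com/"))
```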

David Bain 

Charlie, anything you’d like to add to that?

Charlie Williams 

I think that was a fantastic little summary there. Crawling, from an external point of view, which was that first part that Miruna was talking about, is the first step of the search engine process. Google, when they talk about, how they work and things like that, they break things down into that exact process, like crawling, rendering, indexing and serving results.

If a crawler cannot crawl your website and your pages, then a search engine bot, whatever it is, can’t do it either and can’t see what you have on it. So if you want your content to appear in results, whether that’s on AI engines, whether it’s on search engines, or whatever, then crawling is a fundamental aspect of kind of SEO work.

David Bain 

Rejoice, do you have any introductory remarks that you’d like to add?

Rejoice Ojiaku 

I pretty much think Charlie and Miruna kind of explained it. But I would say definitely website crawling looks at the whole idea of spidering. It’s just how Google and other robots look into your site structure and content, basically just browsing and trying to kind of pick out those best pages that serve a purpose or answer a user query. So crawling is definitely an important aspect of SEO. It’s literally how we tend to do the job and how we get discovered.

David Bain 

So Charlie, why is it important to understand what’s being crawled?

Charlie Williams 

It’s really a fundamental aspect of auditing a website, of understanding what a website consists of and what bots, whether that’s a search engine or anything else you’re interested in, are actually seeing of your site.

For a bot, let’s focus on Googlebot and Bingbot and so on, for them to crawl your website, they have to go through the process of visiting a page and then following the HTML links, the <a> tag links, from one page to another within the site. And that crawling process, understanding if a bot lands on your site what they see and where they go, is really important, because you fundamentally want to make sure your most important pages are in that crawl pattern, so they are served and seen by whatever it is that’s crawling your website.

Similarly, if you’re managing the performance of a website and looking at the data from a search engine, such as in Google Search Console or Bing Webmaster Tools or something like that, understanding which pages are being crawled, which ones are being crawled most often, and whether your important pages are being crawled regularly or not, we start to understand a little bit about our site structure, a little bit about how Google regards some of our pages, and therefore the chances of them then being indexed and so on, so they can actually appear in the search results.

So it’s a crucial aspect of auditing and monitoring your website health. Understanding what’s being crawled, and also then, determining what you want to be crawled. So that’s using the various tools and methods we have at our disposal for determining what we want to be crawlable within our site structure.
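
As a rough illustration of the link-following process Charlie describes, the sketch below follows internal <a href> links breadth-first and records each page’s click depth. It assumes Python with requests and BeautifulSoup, a placeholder start URL, and it skips the robots.txt and rate-limiting checks a real crawler would respect:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=200):
    """Follow internal <a href> links breadth-first and record each page's click depth."""
    domain = urlparse(start_url).netloc
    seen = {start_url: 0}          # URL -> number of clicks from the start page
    queue = deque([start_url])

    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen[link] = seen[url] + 1
                queue.append(link)
    return seen

# Hypothetical example: pages many clicks deep are harder for bots to reach
depths = crawl("https://www.example.com/")
```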

David Bain 

Rejoice, why would we actually choose certain pages that we really want to be crawled and other pages that we don’t want to be crawled?

Rejoice Ojiaku 

I think it really is down to site importance and what pages you prioritize and what pages you see as important. So you can include pages that have really good internal links, or pages that have really good content that you want to sort of serve to your audience as well.

I think it’s really important, when we look at why Google will crawl some pages over others, that it comes down to content quality and freshness, and I think sometimes people forget that aspect of things when they think about crawling. Google aims to focus on pages that are both high quality and relevant. So pages that are frequently updated and have strong, useful content tend to be crawled more often, because, again, the purpose of Google is to give users exactly the information that they’re looking for.

So I think that if you’re either building a website or you’re an SEO person, determining what content and what pages you want to present to users, and what you think users will find valuable, should be how you choose the pages and say, hey, I want this page to be crawled more often. And for some pages that tend to be gated, there’s no point crawling them, because there needs to be some sort of access towards them. So I think that’s how you can look at the importance of the different types of pages.

David Bain 

What are your thoughts in terms of what pages shouldn’t be crawled and what pages aren’t a good use of web crawling resource?

Miruna Hadu 

There’s a few ways of looking at this in terms of resources. Google and any other search engines and bots are going to assign what we call a crawl budget to your website. Google is not going to crawl every page on your website, or it’s going to take a while to get to every page on your website, so what you want to make sure is that the ones it crawls first are your most important pages, and that they are indexed and appear in search results, but also that they are re-crawled regularly, so that when you update the content on those pages, it reflects in your search results.

So there are pages, for example, for things like faceted navigation, where, if you’re working with a huge e-commerce site, you might have filtering in your menu that creates loads and loads of these faceted navigation pages. You don’t want all of those to be crawlable, because you’re going to be wasting that crawl budget that could otherwise be spent on your category pages, your product pages and stuff that brings in revenue to your website.

Similarly, like Rejoice said, there might be some pages that are gated that you only want users to reach once they’ve completed a certain step in the process. So those pages that perhaps don’t have value for your search results and for your SEO strategy, that you don’t necessarily want to spend that crawl budget on.
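
One common way to keep that kind of low-value URL out of the crawl is a robots.txt disallow rule. The sketch below tests illustrative rules with Python’s built-in urllib.robotparser; the paths are hypothetical, and note that this parser only does simple prefix matching, whereas Google also supports wildcards:

```python
import urllib.robotparser

# Illustrative directives keeping hypothetical filter/search paths out of the crawl
rules = """\
User-agent: *
Disallow: /filter/
Disallow: /search
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "https://www.example.com/category/shoes"))     # True
print(rp.can_fetch("Googlebot", "https://www.example.com/filter/colour-red"))  # False
```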

David Bain 

Charlie, what is crawl health and how do you improve your crawl health?

Charlie Williams 

Yeah, so crawl health is part of that crawl budget allocation that Miruna was just explaining. So crawl budget is how much of your website Google can and wants to crawl, and there’s two aspects to it: there’s the crawl rate and crawl health, and then there’s the crawl demand.

Crawl demand is Google’s assessment of how often it wants to crawl your pages and re-crawl the ones it knows about, and crawl rate is essentially how fast Google can crawl your website without knocking it over. So Google will try and crawl your website as fast as it can, but if it starts detecting that your server response time is getting higher, or that your site seems to be struggling with the rate it’s crawling, then it will pare back on how much it crawls, and that will affect your crawl budget, because Google’s budget is not just about number of pages, it’s also about time.

I’ll give you an example. I was working with a website, probably about a year ago, and they were struggling to get new pages indexed. So I went into the Crawl Stats report within Search Console, and you could see that Google was regularly reporting problems with crawling the website. Its crawl health was not good enough. Google was not able to crawl the website at a rate it felt it could get to those pages, so it simply wasn’t seeing the new pages in order to index them. Now that is an extreme example that doesn’t happen very often, and the solution was simply for them to go from paying one pound a month for hosting to something like 15 pounds a month to actually get decent hosting, and Google started crawling and indexing the new pages much more rapidly, as you’d hope.

But that’s an example of what we mean by crawl health. It’s this concept within your crawl budget of how easy it is for bots to crawl your site without things going wrong. There is the inverse as well. You might not always want these things to happen. Again, working with a large e-commerce store, they found they were essentially on hosting that charged per visit. And we found that Applebot was visiting the website multiple millions of times a week for no apparent reason. Why does Apple need to do that? But it was costing them thousands of pounds a month. We had to work out how to stop Applebot from crawling the website so much.

But the kind of basic idea of crawl health is how much can your website be crawled without causing problems? Anyone like me who’s been using tools badly for too long will remember doing things like crawling an old WordPress site and knocking it over because you crawl it too fast using your crawling software.

David Bain 

Sorry Charlie, you were saying that it cost the business thousands of pounds a month by letting Apple have access to crawling the site on a regular basis?

Charlie Williams 

Yeah, it’s quite a common thing for larger sites, or in certain situations, to not pay a set fee for their service per month, but to be on a platform which charges them by usage. So the more customers that come and visit the website, the more they’ll pay for their hosting, because they’re serving more people. And if bots are coming in and using up huge numbers of visits, it means that the business has to pay those fees to the hosting company. In this case, they were getting millions, literally millions of visits a month from Applebot, and there’s no benefit to that, and it was costing them thousands of pounds a month.

David Bain 

Rejoice, why does Google choose to crawl some pages and not others? Is there a certain structure that you can give to pages in your site that makes it more likely for Google to come along and actually want to crawl those pages? Are there ways that you might actually want to block Google from certain pages as well? And what’s the optimum way of going about structuring this?

Rejoice Ojiaku 

I categorize it in like three ways. So Google chooses based on site importance and page popularity, pages that are seen as important. This can include pages with many internal links, external backlinks, or even high traffic, and pages with low visibility or that are hard to reach may be crawled less often or skipped. You can also look at it from the sense of crawl budget, as we’ve been speaking about. So for larger sites, Google assigns a crawl budget, which is the number of pages that it will crawl within a certain timeframe. And again, the efficiency there is making sure the budget gets used on priority content. But if Googlebot spends time on low-value or similar pages, the important content may be overlooked. So that’s why Google sort of has to allocate these crawl budgets.

Then we have to also think about discoverability. So Google learns about new or existing pages through links from other sites, internal navigation, or, as every site should have a sitemap, through that sitemap. So pages that aren’t linked or listed in a sitemap are unlikely to be found in full, which is why within SEO there’s always a push about making sure all the pages that you do want Google to find are embedded in that sort of sitemap.

In terms of a structure, I would say the structure should always follow how you want users to go from the homepage down to whatever pages you want them to discover. So following some sort of path. It feeds into the idea of why we always encourage breadcrumbs, because breadcrumbs actually give you a structure in terms of where certain pages sit.

So if you have an overview of a navigation, how many clicks would it take for a user to get from point A to point B? If you think about the site structure in that way, how easily navigable it is for the user, it kind of allows you to then picture how Google is going to see that and link it all together. So when Google focuses its crawling on the pages that matter the most, it’s pages that are well linked, whether that’s through navigation or through internal links from other pages. Pages that are updated regularly are quite valuable, and those low-quality, hidden or blog pages are going to be overlooked, because Google doesn’t want to spend too much time on them when users are not necessarily going to access them.

So I think if you think about efficiency, and the journey users are going to take on your site, it should then help you think about how you can structure your site, how to make it easier for it to be crawled, as well as easier for users to find those valuable pages that you want them to find.
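
A quick way to sanity-check the discoverability Rejoice describes is to compare the URLs in your XML sitemap against the URLs you can actually reach by following internal links. A minimal Python sketch, with a placeholder sitemap URL and a hard-coded stand-in for the crawled set:

```python
import xml.etree.ElementTree as ET
import requests

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_url):
    """Return the <loc> entries from a standard XML sitemap."""
    xml = requests.get(sitemap_url, timeout=10).text
    root = ET.fromstring(xml)
    return {loc.text.strip() for loc in root.findall(".//sm:loc", namespaces=NS) if loc.text}

# In practice this set would come from a crawl; hard-coded here as a placeholder.
crawled = {"https://www.example.com/", "https://www.example.com/blog/post-1"}

in_sitemap = sitemap_urls("https://www.example.com/sitemap.xml")
print("In the sitemap but not found via internal links:", in_sitemap - crawled)
print("Linked internally but missing from the sitemap:", crawled - in_sitemap)
```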

David Bain 

Miruna, why should SEOs use crawling tools to help them and what practical things can actually be learned from them?

Miruna Hadu 

SEO crawling tools are like having eyes on your website and what’s happening on it. Without crawling your website with a tool dedicated to it, you’re kind of flying blind a little bit. You have to rely on gathering information about the pages that exist on your website from your CMS, your developers, or even the memory of your client, possibly, and then the Google index, which, again, might not contain all of your pages if they haven’t been crawled properly.

So SEO crawling tools essentially allow you to analyze the health of your website, analyze your indexing, your internal linking, and all of the different elements that are then going to influence how Google crawls and indexes your site, as well as other elements of site health, like site speed, structured data, canonicals, redirects, all of those things that influence your site health. And it’s all going to be packaged for you in one place. You don’t need to go to different sources to analyze that data, which is going to make that analysis that much easier.

David Bain 

Charlie, would you like to add something to that?

Charlie Williams 

I think for most technical reviews, a crawl of your website is your foundational building block of data to do all that great stuff that Miruna just laid out. Whether it’s the number of internal links, which Rejoice just explained is so important to your structure and to everything, or your title tags, or crawl errors, or whatever it might be, all the crawling tools now, and there’s a whole heap of them, and most of them are fantastic, give you a host of information.

In many cases, probably too much data, but there are so many reports in all of them to do all kinds of niche stuff. But even if you don’t ever go near that, they all give you this wealth of information about the health of your website. You asked me about crawl health before, but just a general technical health of your website, the easiest way to start finding problems is crawling your website and actually going, here is my site, what is it we’ve got here? What are we showing to search engines? What are we showing to bots? What is it we’re putting in front of them? And where are the improvements we can start making?

Miruna Hadu 

Most crawling tools will allow you to essentially replicate the behavior of certain bots and user agents. So I’m going to speak for Sitebulb right now, but you can replicate the behavior of Googlebot, and you can tell Sitebulb to crawl your website and your robots.txt file, for example, as Googlebot or any other user agent that you want to set it to. And that allows you that controlled test of what different user agents and bots might be able to see on your pages and on your website, without having to do guesswork based on the data after the fact, after Google has tried to crawl and index your pages.
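
Outside of a crawling tool, the same kind of controlled test can be approximated by requesting a page with different User-Agent strings and comparing what comes back. A small Python sketch with requests; the URL is a placeholder and the user-agent strings are illustrative:

```python
import requests

URL = "https://www.example.com/"  # placeholder

# Illustrative User-Agent strings
USER_AGENTS = {
    "default-browser": "Mozilla/5.0",
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "custom-audit-bot": "MyAuditBot/1.0",
}

for name, ua in USER_AGENTS.items():
    resp = requests.get(URL, headers={"User-Agent": ua}, timeout=10, allow_redirects=True)
    # Differences in status, size or final URL hint at user-agent-specific behaviour
    print(f"{name:18} status={resp.status_code} bytes={len(resp.content)} final_url={resp.url}")
```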

Charlie Williams 

I think that’s super important, because most people working on a site are not ever in a position to see that complete top down view of here’s everything that’s on your website. There’s always pages they’re not aware of. There’s always kind of extra directories they haven’t seen before. There’s always behaviors they weren’t aware of, that bots might see, that users might not, or vice versa. There’s all these kind of intrinsic aspects of websites that it’s hard just to understand when you’re just using a website, or even if you’re developing a website, it takes SEOs to come and stick their nose in and kind of go, have you thought about this? Have you thought about that? And to actually go and give this point of view, and that is the beauty of doing the crawling of your site yourself.

I couldn’t agree with Miruna more. Try to replicate what those bots are seeing, from this rather niche, specialized perspective that’s super important in terms of how many people are going to find your website, and go, here are these extra things you should be considering. Just to give a quick example, Miruna mentioned faceted navigation before. I’ve been working with a client that had uncontrolled faceted navigation. I gave up crawling the website after about 5 million URLs, and it was nearly all Product Landing Pages (PLPs), just with this uncontrolled faceted navigation. And it’s mad, because you have to explain to the client, well, Google at some point is just not going to crawl all these variants to find the good stuff; it will make a decision. At least up until AI came along, crawling was Google’s biggest cost. The computing power required to crawl the internet is Google’s biggest cost. And the biggest change we’re seeing in crawling behavior from Google, and I think from Bing as well, over the past few years, is that Google used to just crawl and index everything that was technically crawlable and indexable and allow the ranking part of the algorithm to decide what to show in the search results.

Now Google makes decisions at a crawling level about what it’s going to crawl. If it starts seeing lots and lots of uncontrolled, faceted, navigational, repetitive content, it will just stop crawling after a while. It has always done that a little bit, but it’s becoming more and more fussy, in a good way, about this. And this goes to Rejoice’s point about content quality. If it starts crawling lots and lots of pages and going, this is all the same content, or this is all absolute bumf, it’s not going to crawl as much, because it’s a waste of its time and of its money. So again, this different perspective you get by actually running a crawl on your website and going, what does it actually look like, is very, very insightful.

David Bain 

Let’s move on to talking about the impact on ranking a little bit. Rejoice, have you actually been able to pinpoint any SEO improvements or any ranking improvements as a result of website crawling? Or is that not something that you can actively tie together?

Rejoice Ojiaku 

You can tie it together. I don’t think ranking and crawlability are all the way separate. The whole purpose of crawling the website is discoverability and indexing. You can’t rank for anything if you’re not discovered or your pages aren’t indexed. So if we’re looking at Googlebot, for example, the whole aim is to discover your content. It never crawls a page that is hidden. So you need to be able to sort of do that.

Another way it can help with ranking is because crawling allows you to identify those issues, right? So it reveals blockers that are preventing your pages from being indexed. It also helps you identify those on-page elements, such as missing or duplicate metadata, which can hurt click-through rates in that regard, and it also provides those relevant signals that are all factored into Google’s ranking algorithm.

Another reason why crawling and ranking kind of go hand in hand is that consistent content accuracy, as we were saying, accurate metadata, accurate sitemaps, accurate structured URLs and clean navigation will all improve those relevant signals, and all of those can be discovered through crawling.

So we shouldn’t look at it as though crawlability and rankings are separate. Essentially crawlability will help you figure out why ranking is being impacted from a technical perspective, and that’s why we look at technical SEO and content SEO fairly differently, because the technical is the foundation, right? If your foundation is shaky, then whatever you put on top of it won’t necessarily be stable. So crawling actually helps you build that foundation in order to improve the rankings moving forward.

David Bain 

Miruna, how do you actually measure the positive impact of website crawling?

Miruna Hadu 

Are you referring to tool crawling, or kind of Google’s crawling?

David Bain 

It’s a relatively open-ended question, and I’m seeing where the guests take it. But obviously, the context of what Rejoice was referring to, and probably the previous question, was more about tool crawling. So in relation to tool crawling, can you actually measure the positive impact of website crawling from tools, acting on the results that you get from that, and measuring improved rankings?

Miruna Hadu 

Yeah, definitely. The next step after you’ve done a website crawl and audit and you analyze the data, is implementing certain changes. So your workflow and how you do that is going to be different depending on your experience, what you’re focusing on, and what tool you’re using. In a tool like Sitebulb, we have things like hints, which is going to prioritize the biggest issues for you. You can then go away and create a To-Do list of things that you need to implement, either yourself, your content team or your developers, and then you can always re-crawl the site and compare right?

So within Sitebulb we have a change log for every hint and every metric. So if you’re crawling within a project, you could be running audits on a daily, weekly or monthly basis, depending on how fast things change on your website and how fast changes get implemented. And then you can literally compare side by side all of the pages that, for example, got flagged as having poor speed performance last month when you crawled, and this month after you’ve implemented your changes. And if everything is going well, you should see the number of pages being flagged up as having issues go down.
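
The before/after comparison Miruna describes can also be done crudely by diffing two audit exports. A minimal Python sketch, assuming hypothetical CSV files with 'url' and 'issue' columns rather than any particular tool's export format:

```python
import csv

def flagged_urls(path, issue):
    """Read a hypothetical audit export with 'url' and 'issue' columns."""
    with open(path, newline="") as f:
        return {row["url"] for row in csv.DictReader(f) if row["issue"] == issue}

before = flagged_urls("audit_june.csv", "poor speed performance")   # hypothetical files
after = flagged_urls("audit_july.csv", "poor speed performance")

print("Fixed since last crawl:", sorted(before - after))
print("Still flagged:", sorted(before & after))
print("Newly flagged:", sorted(after - before))
```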

Charlie Williams 

I’m going to just jump on the back of that, if that’s okay, and say that beyond that, after you do all that stuff, you can also measure the impact of the good crawl health and best practices you’re putting into a site by looking at the data you’re getting from Google and Bing and so on.

You can integrate Search Console data into Sitebulb, Screaming Frog SEO Spider and things like that. But you can look at these reports and go, okay, as I’m improving the crawl behavior of my site, am I seeing the most important pages getting crawled more often? Am I actually seeing my indexing numbers go up or down, or whatever it is we’re trying to improve, because I’m controlling the pages in the crawl? If I go into the Crawl Stats report in Search Console, can I see if the improvements I’m making are leading to the kind of crawling ratios for discovery and refresh crawls, or whatever it is we’re looking to improve? There’s lots of things we could tangentially go into, but am I seeing that data come in there?

And then finally, if you really want to get into the real weeds of seeing crawling data, you can use log files. Crawl Stats is great, but log files integrate into a lot of crawling software, and what you can do is get actual answers: search engines are crawling this many pages, this section here. You’ve got a bit of a before-and-after possibility with the changes and improvements you’re making to your site, and you can go, right, is this moving things in the right direction?
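
As a rough sketch of that log-file approach, the Python below counts Googlebot requests per path from a combined-format access log. The file name and regex are assumptions about a typical setup, and matching on the user-agent string alone is only indicative, since it can be spoofed:

```python
import re
from collections import Counter

# Matches the request path and the trailing user-agent field of a combined-format log line
LINE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) HTTP/[^"]*".*"(?P<ua>[^"]*)"$')

hits = Counter()
with open("access.log") as f:          # hypothetical log file location
    for line in f:
        m = LINE.search(line.strip())
        if m and "Googlebot" in m.group("ua"):
            hits[m.group("path")] += 1

# Which paths Googlebot spends its crawl on, most-requested first
for path, count in hits.most_common(20):
    print(count, path)
```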

Normally, especially if you’ve been struggling to get certain pages indexed because they’re not getting crawled and things like that, as you improve that crawling behavior, you’ll see a relationship between that and improved traffic, because those pages that were not earning traffic before should now be, and should be moving the right way.

And to Rejoice’s point she made earlier about content quality being a big thing, I mentioned uncontrolled faceted navigation and things like that. It’s poor content, it’s a poor content experience, at least from the search engine’s point of view. So it’s not always a one-to-one input/output relationship, but crawling issues are often content quality issues: we’re producing extraneous content on our site that’s being crawled, and it’s not of benefit to a user, and it’s not of benefit to a search engine to see such content.

If you tighten up what’s being crawled, you simplify your site, you make it leaner, you’re going to have more focus on the pages that matter. You should then, if you’ve had any negative signals associated with kind of poor content experience or certain sections, see an improvement there. So yes, to answer the question, I think using these tools, whether it’s Google or your own crawling, you can start seeing a relationship between improved crawl health and improved organic performance.

David Bain 

There’s a couple of questions from our live stream viewers: “We have problems with AI bots ignoring robots.txt rules and crawling even blocked content. What should we do to stop that?” Does any one of our panelists want to jump in and have an opinion on that?

Rejoice Ojiaku 

I think in terms of AI bots ignoring the robots.txt files, there’s a few things you can do. So I know you can utilize bot-blocking services. Cloudflare being an example, they do have an AI scraper toggle that gets turned on by default for new users, which can block unauthorized crawlers regardless of their robots.txt behavior. You can also manually block known bots. So if you know, for example, a GPT bot is actually crawling and ignoring that robots.txt, you can manually add a user-agent disallow within the robots.txt file. So there are definitely a few things that you can sort of do.

Sometimes I’ve seen companies implement challenges like CAPTCHAs or proof-of-work scripts. So that, again, stops the AI bot, because they have to actually authenticate themselves. So there are a few ways you can kind of look at it, and I think going through what works best for your business would be a lot more helpful.
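
For crawlers that ignore robots.txt, one possible server-side fallback is to refuse requests by user agent, which is roughly what the bot-blocking services mentioned above do for you. A minimal Flask sketch in Python; the bot list is illustrative and incomplete, and user-agent strings can of course be spoofed:

```python
from flask import Flask, request, abort

app = Flask(__name__)

# Illustrative, incomplete list of user-agent substrings to refuse
BLOCKED_UA_SUBSTRINGS = ("GPTBot", "CCBot")

@app.before_request
def block_unwanted_bots():
    ua = request.headers.get("User-Agent", "")
    if any(token.lower() in ua.lower() for token in BLOCKED_UA_SUBSTRINGS):
        abort(403)  # refused regardless of what robots.txt says

@app.route("/")
def home():
    return "Hello"

if __name__ == "__main__":
    app.run()
```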

David Bain 

Rejoice, do you have any thoughts on identifying the bots that are good to have on your site and to be crawling on a regular basis, and the bots that aren’t and that you should be trying to block?

Rejoice Ojiaku 

I think it’s hard to determine which bots are good and which bots are not. I think it’s down to what you want bots to kind of discover. So in this day and age, now that we’re all talking about AI search and AI searchability, people are being more open to allowing AI bots to crawl, because we’re now getting into the whole conversation around AI brand mentions and what websites or what sources get pulled into AI when users actually type in a question. So that really does determine where you want to be found and how you want to be found, and what bots you want to actually crawl.

You know, Charlie gave an example about Applebot that clearly wasn’t good for that website, because they didn’t want Apple to crawl that much; certain bots might crawl way more frequently than others. So I think it’s looking into the different types of bots and kind of determining, from a business goal, where you are and what you want to occur, and going from there. So it’s hard to say these bots are bad, these bots are good. It’s all a case-by-case situation.

David Bain 

Okay, so it sounds more about frequency, about how many times you’re happy for the bot to come back and actually crawl your site. So how often is acceptable?

Rejoice Ojiaku 

Tough question. I don’t know how often is acceptable. I think, for me, the bots that kind of follow the frequency at which Google crawls are a good guide. Again, large sites kind of have to be crawled a bit more frequently because of how many pages there are that they have to go through. Smaller sites, you probably wouldn’t need a bot to crawl your pages constantly. So it’s hard to determine how many times your site needs to be re-crawled, but I’ll say a good rule of thumb is any bots that kind of mimic the frequency of what Google does would be a good standard, because I know Google is very careful around wasting crawl budget for particular sites, based on the fact that not a lot of websites have that refresh of pages consistently anyway. But I’m hoping maybe Charlie has a better idea in terms of the frequency, because it’s very hard to say.

Charlie Williams 

Absolutely. I mean I’d love to just say a number like seventy and then just leave it like that and be like, there’s your answer. I don’t have a number. It’s a great question, but I don’t have a number.

I agree with Rejoice’s points, though. So, for example, bots that are interested in the content. Let’s remove nefarious bots, or people just scraping content; if we’re talking about bots coming to our site for a good reason, they’re going to want to visit based on almost that crawl demand that we talked about as part of crawl budget before. So the crawl demand for my website, which is one page and doesn’t change, is very, very low, because there are not that many external links pointing to it, and it doesn’t change frequently. So Google knows it does not need to visit that page that often. There’s not the demand for it.

Whereas Google has probably never stopped crawling the BBC homepage, because that page is changing minute by minute every day, and the same goes for other large newspapers and things like that. So the crawl budget and what’s a reasonable thing will very much vary, and I hate to use ‘it depends’, obviously, in SEO, but it will vary depending on the circumstance and what is an appropriate thing for your website.

That’s the honest answer, and something that’s quite hard to measure, and it will vary. And I imagine most sites don’t really keep that as a number, and they probably should, because it would be very interesting.

David Bain 

How do you know how much your server can handle in terms of the resources that website crawling is taking on it before it falls over?

Charlie Williams 

You should have some kind of idea from your hosting package about your bandwidth. Some hosting packages still rely on total numbers per month and stuff like that, but ignore that; most packages are just based on the size of your storage or database and on your bandwidth. You can stress test them. So for example, if you have a setup where you just have a server, you can stress test it just by crawling it really fast and seeing if it slows down, or measuring the impact. There are tools to do this.
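
A very crude version of that stress test, to be run only against a server you are responsible for, is to fire concurrent requests and watch response times and error rates. A Python sketch with a placeholder URL and volumes:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://www.example.com/"   # only probe a site you are responsible for
REQUESTS = 200
WORKERS = 20

def timed_get(_):
    start = time.monotonic()
    try:
        status = requests.get(URL, timeout=15).status_code
    except requests.RequestException:
        status = None              # timeout or connection failure
    return status, time.monotonic() - start

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    results = list(pool.map(timed_get, range(REQUESTS)))

latencies = sorted(t for _, t in results)
failures = sum(1 for status, _ in results if status is None or status >= 500)
print(f"median={latencies[len(latencies)//2]:.2f}s slowest={latencies[-1]:.2f}s failures={failures}")
```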

There is also the point that Rejoice mentioned before about Cloudflare. If you’re running a CDN, or if you’re running a kind of protective layer such as Cloudflare, they will also handle some of that bandwidth, because, depending on your package with them, they can stretch to cope with that demand, to stop things like DDoS attacks and stuff like that.

I’m not going to be able to give a specific answer to that, but generally, when I’ve helped clients with it, we’ve set up bots from other tools, or sort of manually set up bots on other servers, to basically hammer a server and make lots and lots of requests very quickly. Now, normally, most server setups that are at all sophisticated will have certain things in place. They’ll have rate limiting. So if they see an IP address they don’t recognize among the established bots, such as Googlebot or something like that, and the visitor is clearly a bot, just requesting multiple pages every second, which is not human behavior, they will rate limit such visits.

So most sites will have some protective things in place, and that’s generally a good idea. It does mean, though, that when it’s us crawling that website, such as with Screaming Frog or Sitebulb, you need to either have your IP address whitelisted, or have a crawler set up with your own bot name, and it can be called anything, so that it can be whitelisted and you’re allowed to crawl at a full rate without having to slow down to respect the site or being blocked in some way.

I’d approach that question more from the other side: what things have you got in place to prevent over-crawling from bots you do not recognize? That’s kind of more the standard way of approaching it.

David Bain 

In terms of bots you do recognize, in terms of setting up your own website crawler for better SEO audits, Miruna, is there any particular frequency that you recommend for that?

Miruna Hadu 

It really all comes down to what your server can handle and what kind of restrictions you have on it, whether or not you’re whitelisted, which you should be if you’re working on the website. But then there are also some other things that you can do when you’re setting up your crawl, to think about both the resources that you’re utilizing on your machine, if you’re crawling on desktop, obviously that’s not a concern if you’re crawling in the cloud, on a cloud crawler, but also the resources, such as the demand you’re putting on your server.

So what we always recommend, to get started before you even set up your crawl, is to have clear intentions in mind in terms of what it is that you’re trying to achieve and figure out with your crawl. And in many instances, after you’ve gone through that initial process of discovery, where perhaps you did want to crawl everything on your site, you might not need to crawl everything, and you might not need every single piece of data that the crawling tool gives you, because, like Charlie said, it can be overwhelming. There’s going to be, like, a whole load of data. So you can think about it as setting up your crawl by only selecting the kind of reports or data that you need, and then also limiting what the crawler is going to do. You can limit your crawl to certain directories of your website, you can exclude pagination and that kind of faceted navigation that we were talking about earlier. You can even run what we call sample crawls, which is when you only crawl a certain number of pages at every depth of your website.

There are different things that you can do to kind of manage how much you’re going to request from your website’s server. And then there’s also thinking about when you’re going to crawl, right? If you’re working on a website that has high demand in terms of the users that visit it during the week or during certain hours, then perhaps schedule your crawl to happen overnight or over the weekend, taking into account all these different elements that influence how much you’re going to request from the website’s server.
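
The scoping Miruna describes, crawling only certain directories and excluding pagination or facet parameters, boils down to include/exclude rules applied before a URL is queued. A small Python sketch with illustrative patterns:

```python
import re

INCLUDE = [re.compile(r"^https://www\.example\.com/blog/")]     # crawl only this section
EXCLUDE = [
    re.compile(r"[?&](page|sort|colour|size)="),                 # pagination / facet parameters
    re.compile(r"/search"),
]

def in_scope(url):
    """Return True if a discovered URL should be queued for crawling."""
    if not any(p.search(url) for p in INCLUDE):
        return False
    return not any(p.search(url) for p in EXCLUDE)

print(in_scope("https://www.example.com/blog/crawl-budget"))          # True
print(in_scope("https://www.example.com/blog/crawl-budget?page=2"))   # False
print(in_scope("https://www.example.com/shop/shoes"))                 # False
```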

David Bain 

Let’s finish off by asking our panel: what’s the number one thing that you tend to gravitate towards when looking at a website crawling report from an SEO perspective? So if you’ve set off the crawler yourself and you’ve got those findings, what’s that number one thing that you look at? And then just remind the listener, the viewer, where people can find you. So Rejoice, shall we start off with you on that one? What’s the number one thing that you’d look for?

Rejoice Ojiaku 

I’m predominantly content based, so a lot of the quality issues I mostly focus on are around that on-page element, so a lot around the metadata. But when working with certain freelancers, especially around e-commerce and websites that have product SKUs and all of these things, crawling becomes a bit harder.

For me, what I look for is how I can kind of break it down. So I don’t necessarily look to crawl the entire site. I mostly look to crawl those priority pages. And one thing I love about certain crawling tools is the ability for me to specify what URL path I need to be crawled. So if I want to focus on the blog section or focus on a particular category, I will look for that. And I think, for me, the issues I’m trying to spot are: one, whether that page is indexed; is it canonicalized; is it self-referencing; does it have the right headings; and, most importantly, if it’s not pulling in that metadata, figuring out whether it’s hard-coded in, because certain CMSs can actually stop that from appearing. So that’s what I tend to look for. I tend to look for the issues stopping us from fixing certain on-page issues or on-page elements.

So once we discover those issues, then I want to understand how we can actually fix them, because that stops me actually doing my work in terms of optimizing. So I do have to sort of work with the technical team and figure that out.

You can find me on LinkedIn, Rejoice Ojiaku. I’m always on LinkedIn, either commenting, posting or doing whatever. I’m more than happy to connect with everyone, actually.

David Bain 

Thank you, Rejoice. Charlie, what’s the number one thing that you would look for in a website crawling report?

Charlie Williams 

I’m going to echo Rejoice on some of that, I think. So it’s twofold. Obviously, you do that kind of big audit to begin with, and there’s two major things I’m looking for. Normally, when I do a crawl like this, I always try to integrate with Search Console indexing data as well, so I can see which of these pages I’m finding and which of them Google is choosing to index, and which it seems to be choosing not to index.

So what I’m looking for, essentially, is content quality issues. I’m looking for this idea of going, these are pages, there’s a section here where Google’s only indexing half the pages, or it’s not even bothering to find some of them. It says a bunch of the pages I’m finding on your website it’s never heard of; they’re marked as URL unknown to Google. I’m looking at how much of the website that I can crawl right now Google’s actually interested in, because at the end of the day, that’s what we’re trying to do: understand which pages can actually appear in search results and start driving money for us and so on.

And of course, you can cross-reference with analytics to go, well, these are not being indexed by Google, but actually users are navigating there and they’re buying from it, so it must be useful; how can we bring these two things together? And then the other side is the other side of the same coin: technical waste. I’m looking at a website going, which extraneous pages is the CMS creating? Where are there extraneous pages on the website that we can find because it’s old stuff that people have forgotten about? I’m looking to basically come in wielding as big a machete as I can and chop off as much of the website that’s useless as I can, to make it leaner, meaner, faster, more efficient, and of higher content quality overall, bringing that kind of metric up on average, if nothing else.

So yeah, for me it’s about finding waste, whether it’s content waste, low-quality stuff that Google’s not interested in, or stuff that’s just technical duplication or emptiness that we can remove from the site.

David Bain 

Where can people find you, Charlie?

Charlie Williams 

I’m Charlie Williams, and you can find me on LinkedIn. You can also check out my site, chopped.io, that’s my consultancy. I’m also @pagesauce on Twitter or X, but I don’t think I’ve posted on there in 18 months. But if you hit me up there and send me a message, I will still get a ping.

David Bain 

Thank you Charlie. Miruna, what’s your number one thing that you look for in a website crawling report?

Miruna Hadu 

Well, without saying it depends, like I said before, I do think of crawling in terms of what it is that you’re trying to achieve. One thing that I love recommending, especially if you’re crawling a large website and kind of looking at where to get started, is running a sample crawl, something that’s going to give you a snapshot of what you’re going to encounter once you crawl the whole site.

So like Charlie said before, you don’t want to start a crawl and suddenly, 5 million URLs in, figure out that you’re mostly crawling filter pages. So run a sample crawl, like I said, and then I would look at what it is that has been crawled. Are the majority of pages filter pages, or have my kind of key pages from the homepage, those key links, been crawled and found?

I would also integrate, like Charlie said, things like Google Search Console and Google Analytics, adding your XML sitemaps in, and try to figure out if everything you want to be found is there, and what the structure of that website looks like, to allow crawlers like Google to find every single page on your website and index it.

You can find me on LinkedIn, but similarly, I’m a lurker. I don’t post much. However, you can also find me in the Sitebulb support inbox. So if you contact support on Sitebulb.com or if you ask any questions over the chat, you’ll probably reach me.

David Bain 

Thank you everyone for watching. Thank you so much to our panel as well. I’ve been your host, David Bain, and you’ve been listening to the Majestic SEO panel. If you want to join us live next time, sign up at majestic.com/webinars and, of course, check out SEOin2025.com.

Previous Episodes

Follow our Twitter/X (@Majestic) and Bluesky (@majestic.com) accounts to hear about more upcoming episodes!

Or if you want to catch up with all of our previous episodes, check out the full list of our Majestic SEO Podcast episodes.
