12/05/16 // Written by Emma Phillips

John Mueller Google Hangouts Session: The Importance Of ASCII Characters In URLs

Google frequently arranges live Google Hangout sessions online, hosted by its Webmaster Trends Analyst, John Mueller, who fields questions from a select group of search specialists.

Dan Picken, of Ingenuity Digital’s sister company WMG, attends these sessions to ensure he’s at the leading edge of search science, asking the questions that will enable us to keep our clients’ businesses fully optimised online.

The importance of using ASCII characters in URLs

In this example we encountered a website that uses non-ASCII characters in its URLs, together with a rewrite rule that converts the non-ASCII text into search-engine-friendly URLs built from ASCII characters. These are faceted navigation URLs, generated as the user applies filters while browsing through products, so there was a requirement to block Google from crawling them (a robots.txt sketch follows the list below), for two reasons:

  1. So that Google doesn’t waste crawl budget on these unnecessary faceted navigation URLs.

  2. So that Google keeps these URLs out of the results users see for general search queries.
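
To illustrate, blocking faceted navigation URLs of this kind usually comes down to a handful of Disallow patterns in robots.txt. The sketch below uses hypothetical filter parameters (colour and size), not the actual parameters from the site in question:

  User-agent: *
  # Block any faceted URL containing these hypothetical filter parameters
  Disallow: /*?*colour=
  Disallow: /*?*size=

Google treats * as a wildcard matching any sequence of characters, so these rules catch the filter parameters wherever they appear in the query string.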

The robots.txt directives were naturally written against the original non-ASCII version of the URLs, but when tested with the robots.txt Tester in Google Search Console, the URLs were reported as allowed to be crawled, indicating that the robots.txt was simply not working for Google in its current form.

On 8 April 2016, Dan put this to Google to get some guidance. This is what John Mueller had to say:

“It’s tricky if you use non-standard delimiters between parameters, because then we can’t figure out how these parameters actually work. With normal parameters we understand that the order is irrelevant, and we can learn, with the parameters setting in Search Console, how to swap them out, keep them or replace them with a default value. However, if you use non-standard delimiters then it’s really tricky for us to figure out.”
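
To make the distinction concrete, here is a hypothetical pair of URLs (not taken from the site in question). The first uses the standard ? and & delimiters, which Google’s parameter handling understands; the second uses non-standard delimiters, leaving Google unable to tell where one parameter ends and the next begins:

  Standard:      https://example.com/products?colour=red&size=10
  Non-standard:  https://example.com/products/colour:red;size:10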

John Mueller then goes on to say that he assumes we’re being over-specific, and that what we probably need to do in the robots.txt file is use the unescaped characters instead of the escaped versions.
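
In practice, that difference looks something like the sketch below, where the parameter name größe is a hypothetical stand-in rather than one taken from the site. The first rule spells out the raw UTF-8 characters, as Mueller suggests; the second is the percent-escaped form of exactly the same characters, which is what an over-specific rule might contain:

  # Unescaped form (what Mueller suggests trying)
  Disallow: /*größe=
  # Percent-escaped equivalent of the same rule
  Disallow: /*gr%C3%B6%C3%9Fe=

Testing both variants in the Search Console robots.txt Tester quickly shows which form Googlebot actually matches against the live URLs.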

He continues by saying this is “an edge case where it could go either this way or that way, but because of all the wildcards in there I imagine it’s really hard to make this deterministic in the sense that it always does the same thing”.

John Mueller’s general recommendation is to simplify the website.

The conclusion here is that if you are going to set up a new site, set it up with common ASCII characters in its URLs, so that no issues arise for any of the search engines further down the line.

Watch the Google Hangout session on YouTube: https://youtu.be/nIDZmac_rMI?t=2019