Detecting URL rewrites (SEO urls)
How could a client detect if a server is using Search Engine Optimization techniques such as using mod_rewrite to implement "SEO friendly URLs"?
For example:
Normal url:
http://somedomain.com/index.php?type=pic&id=1
SEO friendly URL:
http://somedomain.com/pic/1
Since mod_rewrite runs server side, there is no way a client can detect it for sure.
The only thing you can do client side is to look for some clues:
- Is the generated HTML dynamic, changing between calls? Then /pic/1 would need to be handled by some script and is most likely not the real URL.
- As said before: are there <link rel="canonical"> tags? Then the website is telling the search engine which URL, out of several with the same content, it should use.
- Modify parts of the URL and see if you get a 404. In /pic/1 I would modify the "1". If there is no mod_rewrite, it will return a 404. If there is, the error is handled by the server-side scripting language, which can return a 404 but in most cases returns a 200 page that prints an error.
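For the last check, here is a minimal sketch in Python (standard library only; the URLs are placeholders for whatever site you are probing) that requests the original path and a deliberately broken variant and compares the status codes:

```python
import urllib.request
import urllib.error

def status_of(url: str) -> int:
    """Return the HTTP status code for a GET request to url."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

# Hypothetical URLs: the real path and a deliberately mangled variant.
original = "http://somedomain.com/pic/1"
mangled = "http://somedomain.com/pic/999999999"

if status_of(original) == 200:
    if status_of(mangled) == 404:
        print("Mangled URL returns 404 - could be a static path or a strict rewrite.")
    else:
        print("Mangled URL still returns 200 - likely handled by a script behind a rewrite.")
```

Keep in mind this is only a weak signal, since a rewritten URL can also be mapped to a proper 404 by the application.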
You can use a <link rel="canonical" href="..." /> tag.
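To check for such a tag client side, a minimal sketch (assuming Python's standard library and a placeholder URL) that fetches the page and reports the canonical href if one is declared:

```python
import urllib.request
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of any <link rel="canonical"> tag."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            attrs = dict(attrs)
            if (attrs.get("rel") or "").lower() == "canonical":
                self.canonical = attrs.get("href")

url = "http://somedomain.com/pic/1"  # placeholder
with urllib.request.urlopen(url) as resp:
    charset = resp.headers.get_content_charset() or "utf-8"
    html = resp.read().decode(charset, "replace")

finder = CanonicalFinder()
finder.feed(html)
if finder.canonical and finder.canonical.rstrip("/") != url.rstrip("/"):
    print("Canonical URL differs from the requested one:", finder.canonical)
```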
The SEO aspect usually lies in the words in the URL, so you can probably ignore any parts that are numeric. Usually SEO is applied over a group of like content, such that it has a common base URL, for example:
Base www.domain.ext/article, with full URL examples being:
- www.domain.ext/article/2011/06/15/man-bites-dog
- www.domain.ext/article/2010/12/01/beauty-not-just-skin-deep
Such that the SEO aspect of the URL is the suffix. The algorithm to apply is to typify each "folder" after the common base, assigning it a "datatype" (numeric, text, alphanumeric), and then score as follows:
- HTTP Response Code is 200: should be obvious, but you can get a 404 like www.domain.ext/errors/file-not-found that would pass the other checks listed.
- Non-Numeric, with Separators, Spell Checked: separators are usually dashes, underscores, or spaces. Take each word and perform a spell check; score if the words are valid, including proper names.
- Spell-Checked URL Text on Page: if the text passes a spell check, analyze the page content to see if it appears there.
- Spell-Checked URL Text on Page Inside a Tag: if the prior is true, mark again if the text in its entirety is inside an HTML tag.
- Tag is Important: if the prior is true and the tag is a <title> or <h#> tag.
Usually with this approach you'll have a max of 5 points, unless multiple folders in the URL meet the criteria, with higher values being better. You can probably improve this by using a Bayesian probability approach that uses the above to featurize URLs (i.e. detect the occurrence of some phenomenon), plus come up with some other clever featurizations. But then you have to train the algorithm, which may not be worth it.
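A rough sketch of that scoring in Python; the small word set stands in for a real spell checker, and the page text and tag contents are placeholder inputs rather than an actual parsed page:

```python
import re

# Placeholder word list; a real implementation would use a proper spell checker.
KNOWN_WORDS = {"man", "bites", "dog", "beauty", "is", "not", "just", "skin", "deep"}
IMPORTANT_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}

def score_suffix(suffix: str, status_code: int, page_text: str, tag_texts: dict) -> int:
    """Score a URL suffix for SEO-friendliness per the heuristics above.

    tag_texts maps tag names (e.g. "title", "h1") to their text content,
    standing in for a parsed page.
    """
    score = 1 if status_code == 200 else 0          # HTTP response code is 200
    for folder in suffix.strip("/").split("/"):
        if folder.isdigit():
            continue                                 # numeric folders carry no SEO signal
        words = [w for w in re.split(r"[-_ ]+", folder) if w]
        if not words or not all(w.lower() in KNOWN_WORDS for w in words):
            continue
        score += 1                                   # non-numeric, separators, spell-checked
        phrase = " ".join(words).lower()
        if phrase in page_text.lower():
            score += 1                               # URL text appears on the page
        matching = [t for t, text in tag_texts.items() if phrase in text.lower()]
        if matching:
            score += 1                               # URL text appears inside a tag
            if any(t in IMPORTANT_TAGS for t in matching):
                score += 1                           # ...and the tag is <title> or <h#>
    return score

# Hypothetical usage for www.domain.ext/article/2011/06/15/man-bites-dog:
print(score_suffix("2011/06/15/man-bites-dog", 200,
                   page_text="Local news: man bites dog on Main Street",
                   tag_texts={"title": "Man bites dog", "h1": "Man bites dog"}))
```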
Now, based on your example, you also want to capture situations where the URL has been designed so that a crawler will index it because the query parameters are now part of the URL instead. In that case you can still typify the suffix folders to arrive at patterns of data types - in your example's case, a common prefix always trailed by an integer - and score those URLs as being SEO friendly as well.
I presume you would be using one of the curl variants.
You could try sending the same request but with different "User-Agent" values.
I.e. send the request once using the user agent "Mozilla/5.0" and a second time using the user agent "Googlebot"; if the server is doing something special for web crawlers, then there should be a different response.
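A minimal sketch of that comparison, assuming Python's standard library and a placeholder URL; it fetches the page once with a browser-like user agent and once with a crawler-like one, then diffs the responses:

```python
import urllib.request

def fetch(url: str, user_agent: str):
    """Return (status, body) for a GET request sent with the given User-Agent."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.status, resp.read()

url = "http://somedomain.com/pic/1"  # placeholder

browser_status, browser_body = fetch(url, "Mozilla/5.0")
crawler_status, crawler_body = fetch(
    url, "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")

if browser_status != crawler_status or browser_body != crawler_body:
    print("Different responses - the server is doing something user-agent specific.")
else:
    print("Identical responses - no user-agent-based special casing detected.")
```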
With the frameworks today and the URL routing they provide, I don't even need to use mod_rewrite to create friendly URLs such as http://somedomain.com/pic/1, so I doubt you can detect anything. I would create such URLs for all visitors, crawlers or not. Maybe you can spoof some bot headers to pretend you're a known crawler and see if there's any change. Dunno how legal that is, tbh.
For the dynamic URL pattern, it's better to use a <link rel="canonical" href="..." /> tag for the other duplicate URLs.