Transform title into dashed URL-friendly string [closed]
Closed 3 years ago.
I would like to write a C# method that would transform any title into a URL-friendly string, similar to what Stack Overflow does:
- replace spaces with dashes
- remove parentheses
- etc.
I'm thinking of removing the reserved characters defined by RFC 3986 (list taken from Wikipedia), but I don't know whether that would be enough. It would make links workable, but does anyone know what other characters are replaced here on Stack Overflow? I don't want to end up with percent-encoded sequences (%-s) in my URLs...
Current implementation
string result = Regex.Replace(value.Trim(), @"[!*'""`();:@&+=$,/\\?%#\[\]<>«»{}_]", "");
return Regex.Replace(result.Trim(), @"\s*[\-–—\s]\s*", "-");
My questions
- Which characters should I remove?
- Should I limit the maximum length of resulting string?
- Anyone know which rules are applied on titles here on SO?
Rather than looking for things to replace, keep only the unreserved characters: that list is so short that it makes for a nice, clear regex.
return Regex.Replace(value, @"[^A-Za-z0-9_\.~]+", "-");
(Note that I didn't include the dash in the list of allowed chars; that's so it gets gobbled up by the "1 or more" operator [+] so that multiple dashes (in the original or generated or a combination) are collapsed, as per Dominic Rodger's excellent point.)
You may also want to remove common words ("the", "an", "a", etc.), although doing so can slightly change the meaning of a sentence. You probably want to remove any trailing dashes and periods as well.
I also strongly recommend you do what SO and others do: include a unique identifier other than the title, and then only use that unique ID when processing the URL. So http://example.com/articles/1234567/is-the-pop-catholic (note the missing 'e') and http://example.com/articles/1234567/is-the-pope-catholic resolve to the same resource.
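A minimal sketch of the two ideas above, assuming a hypothetical `SlugSketch` class and an `/articles/{id}/{slug}` path layout like the example URLs (neither comes from a real library). `ToSlug` applies the whitelist regex and then trims dashes and periods from both ends (trimming the leading ones too is my assumption); `TryGetArticleId` reads only the numeric ID and ignores whatever slug follows it.

```csharp
using System.Text.RegularExpressions;

public static class SlugSketch
{
    // Collapse every run of characters outside the allowed set into one dash.
    // The dash itself is not in the allowed set, so existing dashes join the run.
    public static string ToSlug(string title)
    {
        string slug = Regex.Replace(title ?? "", @"[^A-Za-z0-9_\.~]+", "-");
        // Remove dashes and periods left at either end by stripped punctuation.
        return slug.Trim('-', '.');
    }

    // Hypothetical path layout: /articles/{id}/{slug}. Only the ID is used,
    // so a misspelled or outdated slug still resolves to the same resource.
    public static bool TryGetArticleId(string path, out long id)
    {
        id = 0;
        string[] parts = (path ?? "").Trim('/').Split('/');
        return parts.Length >= 2
            && parts[0] == "articles"
            && long.TryParse(parts[1], out id);
    }
}
```

For example, `SlugSketch.ToSlug("Is the Pope -- Catholic?")` yields `Is-the-Pope-Catholic`, and both `/articles/1234567/is-the-pop-catholic` and `/articles/1234567/is-the-pope-catholic` produce the ID 1234567.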
I would be doing:
string url = title;
url = Regex.Replace(url, @"^\W+|\W+$", "");
url = Regex.Replace(url, @"['""]", "");
url = Regex.Replace(url, @"_", "-");
url = Regex.Replace(url, @"\W+", "-");
Basically, this:
- strips non-word characters from the beginning and end of the title;
- removes single and double quotes (mainly to get rid of apostrophes in the middle of words);
- replaces underscores with hyphens (underscores are technically a word character along with digits and letters); and
- replaces all groups of non-word characters with a single hyphen.
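To see the four steps on a concrete title, here is the same sequence wrapped in a method. The class and method names (`TitleSlug.FromTitle`) are made up for illustration, and the quote-removal pattern is written as `['""]` so that it compiles as a verbatim string; treat it as a sketch rather than a finished implementation.

```csharp
using System.Text.RegularExpressions;

public static class TitleSlug
{
    public static string FromTitle(string title)
    {
        // 1. Strip non-word characters from both ends.
        string url = Regex.Replace(title, @"^\W+|\W+$", "");
        // 2. Remove single and double quotes (e.g. apostrophes inside words).
        url = Regex.Replace(url, @"['""]", "");
        // 3. Underscores count as word characters, so convert them explicitly.
        url = Regex.Replace(url, @"_", "-");
        // 4. Collapse each remaining run of non-word characters into one hyphen.
        url = Regex.Replace(url, @"\W+", "-");
        return url;
    }
}
```

With this, `TitleSlug.FromTitle("  \"Don't Panic\" (a guide)_v2!  ")` gives `Dont-Panic-a-guide-v2`. Note that `\W` in .NET is Unicode-aware by default, so accented letters survive these steps unless you pass `RegexOptions.ECMAScript`.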
Most "sluggifiers" (methods for converting to friendly-url type names) tend to do the following:
1. Strip everything except whitespace, dashes, underscores, and alphanumerics.
2. (Optional) Remove "common words" (the, a, an, of, et cetera).
3. Replace spaces and underscores with dashes.
4. (Optional) Convert to lowercase.
As far as I know, Stack Overflow's sluggifier does #1, #3, and #4, but not #2.
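A rough sketch of steps #1, #3, and #4 (skipping the common-word removal in #2), assuming ASCII-only titles. The `Sluggifier` class is hypothetical, and collapsing runs of separators into a single dash is my addition, not something the list above requires.

```csharp
using System.Text.RegularExpressions;

public static class Sluggifier
{
    public static string Sluggify(string title)
    {
        // #1: strip everything except whitespace, dashes, underscores, and ASCII alphanumerics.
        string slug = Regex.Replace(title ?? "", @"[^A-Za-z0-9\s_-]", "");
        // #3: replace spaces and underscores with dashes; runs of separators
        //     (including existing dashes) collapse into a single dash.
        slug = Regex.Replace(slug.Trim(), @"[\s_-]+", "-").Trim('-');
        // #4: convert to lowercase.
        return slug.ToLowerInvariant();
    }
}
```

For instance, `Sluggifier.Sluggify("How do I (really) use C#'s Regex?")` returns `how-do-i-really-use-cs-regex`.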
How about this:
string FriendlyURLTitle(string pTitle)
{
    pTitle = pTitle.Replace(" ", "-");
    pTitle = HttpUtility.UrlEncode(pTitle);
    // Drop every percent-encoded byte (e.g. %3f) that UrlEncode produced.
    return Regex.Replace(pTitle, @"%[0-9A-Fa-f]{2}", "");
}
This is how I currently slug words.
public static string Slug(this string value)
{
    // Null or empty input yields an empty slug.
    if (!string.IsNullOrEmpty(value))
    {
        var builder = new StringBuilder();
        var slug = value.Trim().ToLowerInvariant();
        foreach (var c in slug)
        {
            switch (c)
            {
                case ' ':
                    builder.Append("-");
                    break;
                case '&':
                    builder.Append("and");
                    break;
                default:
                    // Keep only ASCII digits and lowercase letters; drop everything else.
                    if ((c >= '0' && c <= '9') || (c >= 'a' && c <= 'z'))
                    {
                        builder.Append(c);
                    }
                    break;
            }
        }
        return builder.ToString();
    }
    return string.Empty;
}
I use this one...
public static string ToUrlFriendlyString(this string value)
{
    value = (value ?? "").Trim().ToLower();
    var url = new StringBuilder();
    foreach (char ch in value)
    {
        switch (ch)
        {
            case ' ':
                url.Append('-');
                break;
            default:
                // Keep letters, digits, and the listed punctuation; drop everything else.
                url.Append(Regex.Replace(ch.ToString(), @"[^A-Za-z0-9'()\*\+_~\:\/\?\-\.,;=#\[\]@!$&]", ""));
                break;
        }
    }
    return url.ToString();
}
This works for me
string output = Uri.UnescapeDataString(input);