Slugify and Character Transliteration in C#
I'm trying to translate the following slugify method from PHP to C#: http://snipplr.com/view/22741/slugify-a-string-in-php/
Edit: For the sake of convenience, here is the code from the link above:
/**
* Modifies a string to remove all non-ASCII characters and spaces.
*/
static public function slugify($text)
{
// replace non letter or digits by -
$text = preg_replace('~[^\\pL\d]+~u', '-', $text);
// trim
$text = trim($text, '-');
// transliterate
if (function_exists('iconv'))
{
$text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
}
// lowercase
$text = strtolower($text);
// remove unwanted characters
$text = preg_replace('~[^-\w]+~', '', $text);
if (empty($text))
{
return 'n-a';
}
return $text;
}
I had no problem coding the rest, except that I cannot find the C# equivalent of the following line of PHP code:
$text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
Edit:
The purpose of this is to transliterate non-ASCII text such as "Reformáció Genfi Emlékműve Előtt" into "reformacio-genfi-emlekmuve-elott".
I would also like to add that //TRANSLIT removes the apostrophes and that @jxac's solution doesn't address that. I'm not sure why, but by first encoding the string to Cyrillic and then decoding it as ASCII you get behavior similar to //TRANSLIT.
var str = "éåäöíØ";
var noApostrophes = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(str));
=> "eaaoiO"
There is a .NET library for transliteration on CodePlex called unidecode. It generally does the trick, using Unidecode tables ported from Python.
Conversion to string:
// Note: Encoding.ASCII replaces every character it cannot represent with '?'.
byte[] unicodeBytes = Encoding.Unicode.GetBytes(str);
byte[] asciiBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, unicodeBytes);
string asciiString = Encoding.ASCII.GetString(asciiBytes);
Conversion to bytes:
byte[] ascii = Encoding.ASCII.GetBytes(str);
@Thomas Levesque is right, the string will get encoded by the output stream...
To remove the diacritics (accent marks), you can use the String.Normalize function, as detailed here:
http://www.siao2.com/2007/05/14/2629747.aspx
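For reference, here is a minimal sketch of the String.Normalize idea. The class and method names (DiacriticsSketch, RemoveDiacritics) are made up for illustration and are not taken from the linked post. It decomposes the string and drops the combining marks:

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DiacriticsSketch
{
    // Hypothetical helper name; not part of the .NET framework.
    public static string RemoveDiacritics(string text)
    {
        // FormD splits each accented letter into a base letter
        // followed by one or more combining marks.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Skip the combining marks and keep everything else.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }

        // Recompose whatever is left.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Calling DiacriticsSketch.RemoveDiacritics("Reformáció Genfi Emlékműve Előtt") should give "Reformacio Genfi Emlekmuve Elott". A letter such as Ø has no decomposition, so it passes through unchanged.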
That should take care of most of the cases (where the glyph is really a base character plus an accent mark). For even more aggressive character matching (to handle cases like the Scandinavian slashed o [Ø], digraphs, and other exotic glyphs), there's the table approach:
http://www.codeproject.com/KB/cs/UnicodeNormalization.aspx
This includes around 1,000 symbol mappings in addition to the normalization.
(Note: all punctuation is removed by the regex replace in your example.)
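Putting the pieces together, an approximate port of the PHP method might look like the sketch below. It is not a drop-in replacement for iconv's //TRANSLIT: the names SlugSketch, Slugify and Extra are hypothetical, and the small Extra table is only an illustrative assumption standing in for a full mapping table like the one linked above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class SlugSketch
{
    // Illustrative entries for letters that have no Unicode decomposition.
    // This table is an assumption; a real one would be much larger.
    private static readonly Dictionary<char, string> Extra = new Dictionary<char, string>
    {
        { 'Ø', "O" }, { 'ø', "o" },
        { 'Æ', "AE" }, { 'æ', "ae" },
        { 'Ł', "L" }, { 'ł', "l" },
        { 'ß', "ss" }
    };

    public static string Slugify(string text)
    {
        // Replace runs of characters that are neither letters nor ASCII digits with '-'.
        text = Regex.Replace(text, @"[^\p{L}0-9]+", "-");

        // Trim leading and trailing dashes.
        text = text.Trim('-');

        // Transliterate: decompose, drop combining marks, apply the extra table.
        var sb = new StringBuilder(text.Length);
        foreach (char c in text.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
            {
                continue;
            }
            sb.Append(Extra.TryGetValue(c, out string mapped) ? mapped : c.ToString());
        }

        // Lowercase.
        text = sb.ToString().ToLowerInvariant();

        // Remove anything that is still not an ASCII letter, digit, underscore or dash.
        text = Regex.Replace(text, @"[^a-z0-9_-]+", "");

        return text.Length == 0 ? "n-a" : text;
    }
}
```

With this sketch, SlugSketch.Slugify("Reformáció Genfi Emlékműve Előtt") should return "reformacio-genfi-emlekmuve-elott". Any letter that neither decomposes nor appears in the table is dropped by the final regex, so the table needs to cover whatever alphabets you expect.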