Algorithm to convert Unicode to GSM characters
I need an algorithm (preferably in Python) to convert an arbitrary string to a string containing only characters from the GSM alphabet. I need this filter so that the string can be sent as the text of an SMS. If possible, the algorithm should also replace characters with their closest encodable equivalent. Examples:
>>> gsm_convert('© all rights reserved')
[copyright sign] all rights reserved
# or
C all rights reserved
>>> gsm_convert('––– long dashes –––')
--- long dashes ---
Python has some built-in functions for doing this, but they also convert the input string to ASCII, which is not correct: GSM handles several characters not found in ASCII.
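To make the problem concrete, here is a small sketch of the usual ASCII-folding idiom (Unicode normalization followed by an ASCII encode). It is only assumed to be the kind of built-in approach meant above, and `ascii_fold` is a name made up for this illustration. It strips accents, but it also changes or drops characters that GSM can represent, such as é and £.

```python
import unicodedata

def ascii_fold(text):
    # Decompose accented letters into base letter + combining mark,
    # then drop everything that is not plain ASCII.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# Both \u00e9 (e with acute) and \u00a3 (pound sign) are valid GSM characters,
# yet neither survives the round trip.
print(ascii_fold('caf\u00e9 costs \u00a35'))  # prints: cafe costs 5
```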
Having done this in Perl and PHP, I'd do it in two steps using regular expressions.
Start by importing regular expression support:
import re
Replace any characters you can with their closest match.
I'd suggest using a set of regular expressions. For example, replace "á" with "a" using the following:
message = 'abc\u00e9\u00e1'
message = re.sub('\u00e1', 'a', message)
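Running one `re.sub` per character gets repetitive as the list grows. A possible alternative, sketched here, is a single `str.translate` call driven by a mapping table. The table contents and the names `REPLACEMENTS` and `replace_closest` are illustrative assumptions, not a complete list of replacements.

```python
# Hypothetical, deliberately incomplete table of "closest match" replacements.
REPLACEMENTS = str.maketrans({
    '\u00e1': 'a',    # a with acute
    '\u00e2': 'a',    # a with circumflex
    '\u00ea': 'e',    # e with circumflex
    '\u2013': '-',    # en dash
    '\u2014': '-',    # em dash
    '\u2019': "'",    # right single quotation mark
    '\u00a9': 'C',    # copyright sign
    '\u0153': 'oe',   # oe ligature: one character may map to several
})

def replace_closest(text):
    # One pass over the string; characters not in the table are left alone.
    return text.translate(REPLACEMENTS)

print(replace_closest('\u2013\u2013\u2013 long dashes \u2013\u2013\u2013'))
# prints: --- long dashes ---
```

Characters that are absent from the table pass through unchanged, so the removal step is still needed afterwards.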
Remove any remaining characters that aren't in the GSM character set.
message = 'abc\u00e9\u00e1'
message = re.sub(r'[^\u0040\u00A3\u0024\u00A5\u00E8\u00E9\u00F9\u00EC\u00F2\u00C7\u000A\u00D8\u00F8\u000D\u00C5\u00E5\u0394\u005F\u03A6\u0393\u039B\u03A9\u03A0\u03A8\u03A3\u0398\u039E\u00C6\u00E6\u00DF\u00C9\u0020\u0021\u0022\u0023\u00A4\u0025\u0026\u0027\u0028\u0029\u002A\u002B\u002C\u002D\u002E\u002F\u0030\u0031\u0032\u0033\u0034\u0035\u0036\u0037\u0038\u0039\u003A\u003B\u003C\u003D\u003E\u003F\u00A1\u0041\u0042\u0043\u0044\u0045\u0046\u0047\u0048\u0049\u004A\u004B\u004C\u004D\u004E\u004F\u0050\u0051\u0052\u0053\u0054\u0055\u0056\u0057\u0058\u0059\u005A\u00C4\u00D6\u00D1\u00DC\u00A7\u00BF\u0061\u0062\u0063\u0064\u0065\u0066\u0067\u0068\u0069\u006A\u006B\u006C\u006D\u006E\u006F\u0070\u0071\u0072\u0073\u0074\u0075\u0076\u0077\u0078\u0079\u007A\u00E4\u00F6\u00F1\u00FC\u00E0\u20AC\u005B\u005C\u005D\u005E\u007B\u007C\u007D\u007E]', '', message)
print(message)
In this example it will print abcé, removing the á (\u00e1), which isn't part of the GSM character set.
It sounds like you need a codec. Googling turned up this: http://demo.sahanafoundation.org/gsoc2010/amishra/gsoc/modules/pygsm/gsmcodecs/ I have no idea whether it works; you'll have to find out for yourself.
The license for that code is at http://demo.sahanafoundation.org/gsoc2010/amishra/gsoc/modules/pygsm/LICENSE
EDIT: Hi, a contributor to pygsm here (if there's any doubt, call the number in the docstring test).
FYI: the Sahana code linked above seems to have moved to: http://eden.sahanafoundation.org/browser#modules/pygsm/
Also, this Sahana code was derived from https://github.com/developmentseed/slingshotSMS, which was derived from the original, standalone library https://github.com/adammck/pygsm/ ... whose license is located at https://raw.github.com/adammck/pygsm/master/LICENSE
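As a rough illustration of what the encoding half of such a codec does (this is not the pygsm code, and `GSM_BASIC`, `GSM_EXT` and `gsm_encode` are hypothetical names), a GSM encoder maps each character to its position in the GSM 03.38 default alphabet and sends extension-table characters as the escape value 0x1B followed by a second septet. The tables below were typed by hand for this sketch and should be checked against the specification before use.

```python
# Sketch only: GSM 03.38 default alphabet. The index of each character is
# its 7-bit value; '\x1b' marks the escape slot at position 0x1B.
GSM_BASIC = (
    '@\u00a3$\u00a5\u00e8\u00e9\u00f9\u00ec\u00f2\u00c7\n\u00d8\u00f8\r\u00c5\u00e5'
    '\u0394_\u03a6\u0393\u039b\u03a9\u03a0\u03a8\u03a3\u0398\u039e\x1b\u00c6\u00e6\u00df\u00c9'
    ' !"#\u00a4%&\'()*+,-./0123456789:;<=>?'
    '\u00a1ABCDEFGHIJKLMNOPQRSTUVWXYZ\u00c4\u00d6\u00d1\u00dc\u00a7'
    '\u00bfabcdefghijklmnopqrstuvwxyz\u00e4\u00f6\u00f1\u00fc\u00e0'
)
assert len(GSM_BASIC) == 128

# Extension table: each entry is sent as 0x1B followed by the value below.
GSM_EXT = {'\f': 0x0A, '^': 0x14, '{': 0x28, '}': 0x29, '\\': 0x2F,
           '[': 0x3C, '~': 0x3D, ']': 0x3E, '|': 0x40, '\u20ac': 0x65}

# Lookup from character to septet value, skipping the escape slot itself.
BASIC_INDEX = {ch: i for i, ch in enumerate(GSM_BASIC) if i != 0x1B}

def gsm_encode(text, errors='strict'):
    """Encode text as unpacked GSM septets, one septet per byte."""
    out = bytearray()
    for pos, ch in enumerate(text):
        if ch in BASIC_INDEX:
            out.append(BASIC_INDEX[ch])
        elif ch in GSM_EXT:
            out += bytes([0x1B, GSM_EXT[ch]])
        elif errors == 'replace':
            out.append(BASIC_INDEX['?'])
        elif errors == 'ignore':
            continue
        else:
            raise UnicodeEncodeError('gsm0338', text, pos, pos + 1,
                                     'character not in the GSM alphabet')
    return bytes(out)

print(gsm_encode('Hi @ \u20ac'))
```

Packing the septets into 7-bit form for the wire is a separate step that this sketch leaves out.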
The link in the first response looks like it might do the trick; FWIW, I have used the library linked from this post as a basis for doing something similar.
As you'll see, the author has created a codec suitable for encoding Greek, so this will just be a starting point.
You say you want to convert an "arbitrary" string to its "closest equivalent"; making it completely arbitrary may be difficult, as "closest" may have different meanings in different domains (what do you do with a Unicode snowman, for example?).
If you're just trying to deal with Latin or Latin-derived alphabets then "arbitrary" should be doable.
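For Latin-derived text, one way to approximate "closest" without hand-writing every case is Unicode decomposition: split a character into a base letter plus combining marks and keep the base letter if GSM can represent it. The sketch below works under that assumption. The names `GSM_CHARS` and `to_gsm` and the `'?'` fallback are hypothetical choices, and the character list was typed by hand from the GSM 03.38 tables, so verify it before relying on it.

```python
import unicodedata

# Default alphabet plus the printable extension-table characters.
GSM_CHARS = frozenset(
    '@\u00a3$\u00a5\u00e8\u00e9\u00f9\u00ec\u00f2\u00c7\n\u00d8\u00f8\r\u00c5\u00e5'
    '\u0394_\u03a6\u0393\u039b\u03a9\u03a0\u03a8\u03a3\u0398\u039e\u00c6\u00e6\u00df\u00c9'
    ' !"#\u00a4%&\'()*+,-./0123456789:;<=>?'
    '\u00a1ABCDEFGHIJKLMNOPQRSTUVWXYZ\u00c4\u00d6\u00d1\u00dc\u00a7'
    '\u00bfabcdefghijklmnopqrstuvwxyz\u00e4\u00f6\u00f1\u00fc\u00e0'
    '^{}\\[~]|\u20ac'
)

def to_gsm(text, fallback='?'):
    out = []
    for ch in text:
        if ch in GSM_CHARS:
            out.append(ch)  # already encodable, keep untouched
            continue
        # Decompose, drop combining marks, keep what GSM can represent.
        base = ''.join(c for c in unicodedata.normalize('NFKD', ch)
                       if not unicodedata.combining(c))
        if base and all(c in GSM_CHARS for c in base):
            out.append(base)
        else:
            out.append(fallback)
    return ''.join(out)

print(to_gsm('\u00e1\u00e9 \u2603'))  # prints: a\u00e9 ? (with the real e-acute)
```

Symbols such as the copyright sign or an en dash have no decomposition, so they hit the fallback here and would still need an explicit replacement table.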
Here is my C# code (for French text):
public static bool IsGsmString(string message)
{
    // https://messente.com/documentation/tools/sms-length-calculator
    // https://stackoverflow.com/questions/29541753/regex-only-checks-first-character-in-string-c-sharp/29541980#29541977
    //var strMap = new Regex(@"^[@£$¥èéùìòÇØøÅå_ÆæßÉ!""#%&'()*+,./\w:;<=>? ¡ÄÖÑܧ¿äöñüà^{}\[~\]|€-]*$");
    //return !strMap.IsMatch(message.Replace(Environment.NewLine, "")); // Strip line breaks because they are not included in the map
    foreach (char c in message)
    {
        if (!IsGsmChar(c))
            return false;
    }
    return true;
}
public static bool IsGsmChar(char c)
{
    // GSM 03.38 default alphabet. LF and CR are listed explicitly; the escape code (0x1B) is left out.
    string strGSMTable = "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑܧ¿abcdefghijklmnopqrstuvwxyzäöñüà";
    strGSMTable += "^{}\\[~]|€"; // Extension table characters
    return strGSMTable.IndexOf(c) >= 0;
}
public static string ReplaceNoneGsmChar(string message)
{
    var converted = "";
    foreach (char c in message)
    {
        if (IsGsmChar(c))
            converted += c;
        else
            converted += GsmReplacement(c);
    }
    return converted;
}
private static string GsmReplacement(char c)
{
    switch (c)
    {
        case 'â':
            return "a";
        case 'ê':
        case 'ë':
            return "e";
        case 'î':
        case 'ï':
            return "i";
        case 'ô':
            return "o";
        case 'û':
            return "u";
        case 'ÿ':
            return "y";
        case 'Â':
        case 'À':
            return "A";
        case 'È':
        case 'Ê':
        case 'Ë':
            return "E";
        case 'Î':
        case 'Ï':
        case 'Ì':
            return "I";
        case 'Ô':
            return "O";
        case 'Ù':
        case 'Û':
            return "U";
        case '’':
        case '`':
            return "'";
        case '«':
        case '»':
            return @"""";
        case 'µ':
            return "u";
        case '©':
            return "C";
        case 'œ':
            return "oe";
        default:
            return "_"; // no replacement available
    }
}