Is there a more elegant way to change Unicode to Ascii?

2023-02-05 19:48

I've seen this problem a lot: you have some obscure Unicode character that looks somewhat like a certain ASCII character and needs to be converted at run time for whatever reason.

In this case I am trying to export to CSV. Having already used a nasty fix for dash, emdash, endash and hbar, I have just received a new request for ' ` '. Aside from another nasty fix, is there a better way to do this?

Here's what I have at the moment...

        formattedString = formattedString.Replace(char.ConvertFromUtf32(8211), "-");
        formattedString = formattedString.Replace(char.ConvertFromUtf32(8212), "-");
        formattedString = formattedString.Replace(char.ConvertFromUtf32(8213), "-");

Any ideas?

It's a rather inelegant problem, so no method will really be deeply elegant.

Still, we can certainly improve things. Just which approach will work best will depend on the number of changes that need to be made (and the size of the string to change, though it's often best to assume this either is or could be quite large).

For a single replacement character, the approach you have used so far (calling .Replace) is the best one, though I would replace char.ConvertFromUtf32(8211) with "\u2013". The effect on performance is negligible, but it's more readable, since it's more usual to refer to that character in hexadecimal, as in U+2013, than in decimal notation (of course char.ConvertFromUtf32(0x2013) would have the same advantage there, but no advantage over just using the char notation). (One could also just put '–' straight into the code. That is more readable in some cases, but less so here, where it looks much the same as ‒, — or - to the reader.)

I'd also replace the string replace with the marginally faster character replace (in this case at least, where you are replacing a single char with a single char).

Taking this approach to your code it becomes:

formattedString = formattedString.Replace('\u2013', '-');
formattedString = formattedString.Replace('\u2014', '-');
formattedString = formattedString.Replace('\u2015', '-');

Even with as few as three replacements, this is likely to be less efficient than doing all such replacements in one pass. (I'm not going to test how long formattedString would need to be for this to hold; above a certain number of replacements, a single pass becomes more efficient even for strings of only a few characters.) One approach is:

StringBuilder sb = new StringBuilder(formattedString.Length); // the output is the same length as the input, so use that as the initial capacity
foreach(char c in formattedString)
  switch(c)
  {
    case '\u2013': case '\u2014': case '\u2015':
      sb.Append('-');
      break; // C# does not allow falling through to the next case
    default:
      sb.Append(c);
      break;
  }
formattedString = sb.ToString();

(Another possibility is to check if (int)c >= 0x2013 && (int)c <= 0x2015 but the reduction in number of branches is small, and irrelevant if most of the characters you look for aren't numerically close to each other).

Various variants are possible (e.g. if formattedString is going to be output to a stream at some point, it may be best to write each final character as it is obtained, rather than buffering again).
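As a rough illustration of that streaming variant, here is a minimal sketch. The class name DashWriter and the method name WriteWithAsciiDashes are hypothetical, and it assumes the destination is available as a TextWriter (for example a StreamWriter over the CSV file).

```csharp
using System.IO;

// Hypothetical helper, not part of the original answer.
public static class DashWriter
{
    // Writes 'input' to 'output', replacing U+2013..U+2015 with '-'
    // as each character is read, so no intermediate buffer is built.
    public static void WriteWithAsciiDashes(string input, TextWriter output)
    {
        foreach (char c in input)
        {
            switch (c)
            {
                case '\u2013':
                case '\u2014':
                case '\u2015':
                    output.Write('-');
                    break;
                default:
                    output.Write(c);
                    break;
            }
        }
    }
}
```

The writer does its own buffering, so writing one character at a time is not as costly as it might look.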

Note that this approach doesn't deal with multi-char strings in your search, but can with strings in your output, e.g. we could include:

case 'ß':
  sb.Append("ss");
  break;

Now, this is more efficient than the previous, but still becomes unwieldy after a certain number of replacement cases. It also involves many branches, which have their own performance issues.

Let's consider for a moment the opposite problem. Say you wanted to convert characters from a source that was only in the US-ASCII range. You would have only 128 possible characters so your approach could be:

char[] replacements = {/*list of replacement characters*/};
StringBuilder sb = new StringBuilder(formattedString.Length);
foreach(char c in formattedString)
  sb.Append(replacements[(int)c]);
formattedString = sb.ToString();

Now, this isn't practical with Unicode, which has over 109,000 assigned characters in a range going from 0 to 1114111. However, chances are that the characters you care about are not only far fewer than that (and if you really did care about that many cases, you'd want the approach given just above) but also confined to a relatively restricted range.

Consider also if you don't especially care about any surrogates (we'll come to those later). Well, most characters you just don't care about, so, let's consider this:

char[] unchanged = new char[128];
for(int i = 0; i != 128; ++i)
  unchanged[i] = (char)i;
char[] error = new string('\uFFFD', 128).ToCharArray();
// U+2013 sits at offset 0x13 (19 decimal) within the block that starts at U+2000
char[] block0 = (new string('\uFFFD', 19) + "---" + new string('\uFFFD', 106)).ToCharArray();

char[][] blocks = new char[8704][]; // 8704 * 128 = 0x110000, the whole Unicode range
for(int i = 1; i != 8704; ++i)
  blocks[i] = error;
blocks[0] = unchanged;
blocks[64] = block0; // 64 * 128 = 0x2000

/* the above need only happen once, so it could be done with static members of a helper class that are initialised in a static constructor*/

StringBuilder sb = new StringBuilder(formattedString.Length);
foreach(char c in formattedString)
{
  int cAsI = (int)c;
  sb.Append(blocks[cAsI / 128][cAsI % 128]);
}
string ret = sb.ToString();
if(ret.IndexOf('\uFFFD') != -1)
    throw new ArgumentException("Unconvertible character");
formattedString = ret;

Whether it's better to test for an unconvertible character in one go at the end (as above) or on each conversion depends on how likely such a character is to occur. It's obviously even better if you can be sure (due to knowledge of your data) that it won't occur, and can remove that check, but you have to be really sure.

The advantage here is that while we are using a look-up method, we are only taking up 384 characters' worth of memory to hold the look-up (and some more for the array overhead) rather than 109,000 characters' worth. The best size for the blocks within this varies according to your data, (that is, what replacements you want to make), but the assumption that there will be blocks that are identical to each other tends to hold.
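To make the "static members of a helper class" idea concrete, here is a minimal sketch. The names AsciiFolder and Fold are hypothetical. It differs from the snippet above in two labelled ways: it allocates only 512 blocks, on the assumption that the table is indexed by a single UTF-16 char (0 to 0xFFFF), and it checks for an unconvertible character on each conversion rather than once at the end.

```csharp
using System;
using System.Text;

// Hypothetical helper class; the name and layout are illustrative only.
public static class AsciiFolder
{
    private const int BlockSize = 128;

    // Assumption: indexed by one UTF-16 code unit, so 0x10000 / 128 = 512 blocks.
    private static readonly char[][] Blocks = new char[0x10000 / BlockSize][];

    static AsciiFolder()
    {
        // US-ASCII maps to itself.
        char[] unchanged = new char[BlockSize];
        for (int i = 0; i < BlockSize; ++i)
            unchanged[i] = (char)i;

        // One shared block for every range with nothing we can convert.
        char[] error = new string('\uFFFD', BlockSize).ToCharArray();

        // Block starting at U+2000: only U+2013..U+2015 map to '-'.
        char[] punctuation = (char[])error.Clone();
        for (int cp = 0x2013; cp <= 0x2015; ++cp)
            punctuation[cp % BlockSize] = '-';

        for (int i = 0; i < Blocks.Length; ++i)
            Blocks[i] = error;
        Blocks[0] = unchanged;
        Blocks[0x2000 / BlockSize] = punctuation;
    }

    // Returns the converted string, or throws on the first unconvertible character.
    public static string Fold(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            char mapped = Blocks[c / BlockSize][c % BlockSize];
            if (mapped == '\uFFFD')
                throw new ArgumentException("Unconvertible character", nameof(input));
            sb.Append(mapped);
        }
        return sb.ToString();
    }
}
```

With this in place a call site is just formattedString = AsciiFolder.Fold(formattedString); and the table is built once, the first time the class is used.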

Now, finally, what if you care about a character in the "astral planes" which are represented as surrogate pairs in the UTF-16 used internally in .NET, or if you care about replacing some multi-char strings in a particular way?

In this case, you are probably going to have to, at the very least, read a character or more ahead in your switch (if using the block method for most cases, you can use an unconvertible case to signal that such work is required). In such a case, it might well be worth converting to and then back from US-ASCII with System.Text.Encoding and a custom implementation of EncoderFallback and EncoderFallbackBuffer, and handling it there. This means that most of the conversion (the obvious cases) will be done for you, while your implementation only has to deal with the special cases.
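A minimal sketch of that fallback idea follows. The class names (DashEncoderFallback, DashFallbackBuffer, AsciiConverter) and the choice of '?' for anything unrecognised are assumptions made for illustration; the overridden members are the abstract ones that EncoderFallback and EncoderFallbackBuffer declare.

```csharp
using System.Text;

// Hypothetical fallback: dashes become "-", anything else unencodable becomes "?".
public sealed class DashEncoderFallback : EncoderFallback
{
    // Longest replacement this fallback can produce for one input character.
    public override int MaxCharCount => 1;

    public override EncoderFallbackBuffer CreateFallbackBuffer() => new DashFallbackBuffer();

    private sealed class DashFallbackBuffer : EncoderFallbackBuffer
    {
        private string replacement = "";
        private int position; // index of the next replacement char to hand out

        public override int Remaining => replacement.Length - position;

        // Called for a single char that US-ASCII cannot encode.
        public override bool Fallback(char charUnknown, int index)
        {
            replacement = (charUnknown >= '\u2013' && charUnknown <= '\u2015') ? "-" : "?";
            position = 0;
            return true;
        }

        // Called for a surrogate pair, i.e. a character outside the BMP.
        public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
        {
            replacement = "?";
            position = 0;
            return true;
        }

        public override char GetNextChar()
        {
            return position < replacement.Length ? replacement[position++] : '\0';
        }

        public override bool MovePrevious()
        {
            if (position == 0) return false;
            position--;
            return true;
        }

        public override void Reset()
        {
            replacement = "";
            position = 0;
        }
    }
}

public static class AsciiConverter
{
    // Round-trips through US-ASCII so the fallback only sees the special cases.
    public static string ToAscii(string input)
    {
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii", new DashEncoderFallback(), DecoderFallback.ExceptionFallback);
        return ascii.GetString(ascii.GetBytes(input));
    }
}
```

The replacement characters must themselves be encodable in US-ASCII, otherwise the encoder would need to fall back again on its own fallback output.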

You could maintain a lookup table that maps your problem characters to replacement characters. For efficiency you can work on a character array, to avoid the intermediate string churn that repeated calls to string.Replace would cause.

For example:

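// requires: using System.Collections.Generic; using System.Linq;
var lookup = new Dictionary<char, char>
{
    { '`',  '-' },
    { 'இ', '-' },
    //next pair, etc, etc
};

var input = "blah இ blah ` blah";

char r; // must be declared with its type: 'var' needs an initializer

var result = input.Select(c => lookup.TryGetValue(c, out r) ? r : c);

string output = new string(result.ToArray());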

Or if you want blanket treatment of non-ASCII range characters:

string output = new string(input.Select(c => c <= 127 ? c : '-').ToArray());

Unfortunately, given that you're doing a bunch of specific transforms within your data, you will likely need to do these via replacements.

That being said, you could make a few improvements.

If this is common, and the strings are long, storing these in a StringBuilder instead of a string would allow in-place replacements of the values, which could potentially improve things.
You could store the conversion characters, both from and to, in a Dictionary or other structure, and perform these operations in a simple loop.
You could load both the "from" and "to" character at runtime from a configuration file, instead of having to hard-code every transformation operation. Later, when more of these were requested, you wouldn't need to alter your code - it could be done via configuration.
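The last two suggestions could be combined roughly as follows. This is only a sketch: the ReplacementMap class and the "HEX=replacement" line format (for example 2013=-) are made up for illustration, and where the lines come from (a file, an app setting) is left open.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper; the "HEX=replacement" line format is an assumption.
public static class ReplacementMap
{
    // Parses lines such as "2013=-" or "00DF=ss" into a lookup table.
    public static Dictionary<char, string> Parse(IEnumerable<string> lines)
    {
        var map = new Dictionary<char, string>();
        foreach (string line in lines)
        {
            int eq = line.IndexOf('=');
            if (eq <= 0) continue; // skip blank or malformed lines

            int codePoint = int.Parse(line.Substring(0, eq),
                                      NumberStyles.HexNumber,
                                      CultureInfo.InvariantCulture);
            map[(char)codePoint] = line.Substring(eq + 1);
        }
        return map;
    }

    // Applies every mapping in a single pass over the input.
    public static string Apply(string input, Dictionary<char, string> map)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (map.TryGetValue(c, out string replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

The lines could come from something like File.ReadLines on a settings file (the file name would be your own choice), so a new request such as the backtick becomes one more line of configuration rather than a code change.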

If they are all replaced with the same string:

formattedString = string.Join("-", formattedString.Split('\u2013', '\u2014', '\u2015'));

Alternatively, loop over the characters to replace:

foreach (char c in "\u2013\u2014\u2015")
    formattedString = formattedString.Replace(c, '-');
