Hard to explain question. Downconvert/limit string to a certain charset without stripping
I've encountered this problem a few times, and now I finally decided to ask, hoping someone knows what I'm talking about.
What I wish to do is this form of character conversion:
ÆØÅ => AOA
ÉÈÊ => EEE
üÿï => uyi
So far, the closest I've come to search criteria I can type into Google is this:
- Something similar to base64/URLEncode
- A sound algorithm such as Metaphone or Soundex
This did not work as expected. The correlation between ÉÈÊ and EEE seemed to be no different from the one between ÆØÅ and EEE. So, held up against E, all six chars would've been converted to E, which wasn't the accuracy I was looking for.
- Conversion from the original encoding (e.g. ASCII) to a charset/encoding consisting of only alphanumerics
I'm not very confident about this approach, as the encoding would have to be able to recognize, say, E as an ancestor/nearest (alphanumeric) neighbour of È.
I feel like I'm saying a lot of words which are around the ballpark.
Does anyone understand what I'm trying to achieve, or know what this "method" I'm looking for is called?
Any ideas/thoughts are very much appreciated (and I do mean any),
- Mik
I suspect you'd have to consider a database of Unicode codepoints, mapping them to their nearest US-ASCII equivalent (where possible). I imagine it would be a relatively sparse map, since most Unicode codepoints don't have a US-ASCII equivalent.
Hopefully this answer has some keywords in it that help you search for what you want.
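For what it's worth, this kind of mapping is usually called transliteration (or accent/diacritic stripping), and part of that "database" already ships with Unicode itself in the form of decomposition data. Here is a rough Python sketch, not a definitive implementation: it uses the standard `unicodedata` module to split a character such as É into E plus a combining accent and then drops the accent. The `FALLBACK` table, the `to_ascii` name and the `replacement` parameter are my own assumptions for illustration. Letters like Æ and Ø have no Unicode decomposition, so they need hand-written entries, and which ASCII letter they map to is a matter of taste.

```python
import unicodedata

# Hypothetical hand-written entries for letters that Unicode does not
# decompose into "base letter + accent". Extend to suit your own data.
FALLBACK = {"Æ": "A", "æ": "a", "Ø": "O", "ø": "o"}


def to_ascii(text: str, replacement: str = "?") -> str:
    """Map each character to a nearby US-ASCII equivalent where one is known."""
    out = []
    for ch in text:
        if ch in FALLBACK:
            out.append(FALLBACK[ch])
            continue
        # NFKD splits e.g. "É" into "E" followed by a combining acute accent.
        for part in unicodedata.normalize("NFKD", ch):
            if unicodedata.combining(part):
                continue  # drop the accent mark itself
            # Keep plain ASCII; substitute a placeholder for anything else.
            out.append(part if ord(part) < 128 else replacement)
    return "".join(out)
```

With this sketch, `to_ascii("ÆØÅ")` gives `"AOA"`, `to_ascii("ÉÈÊ")` gives `"EEE"` and `to_ascii("üÿï")` gives `"uyi"`. Anything with no known equivalent becomes `?` rather than being silently stripped.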