Finding out the Unicode character name in .NET

Is there a way in .NET to find out what Unicode name a certain character has?

If not, is there a library that can do this?


It's easier than ever now, as there's a NuGet package named Unicode Information.

With this, you can just call:

UnicodeInfo.GetName(character)


Here's a solution you can implement immediately, like copy/paste/compile.

First, download the Unicode database (UCD) here: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

Next, add this code to your project to read the UCD and create a Dictionary for looking up the name of a .NET char value:

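// Requires: using System; using System.Collections.Generic; using System.Globalization; using System.IO; using System.Text;
string[] unicodedata = File.ReadAllLines( "UnicodeData.txt", Encoding.UTF8 );
Dictionary<char,string> charname_map = new Dictionary<char,string>( 65536 );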
for (int i = 0; i < unicodedata.Length; i++)
{
    string[] fields = unicodedata[i].Split( ';' );
    int char_code = int.Parse( fields[0], NumberStyles.HexNumber );
    string char_name = fields[1];
    if (char_code >= 0 && char_code <= 0xFFFF) //UTF-16 BMP code points only
    {
        bool is_range = char_name.EndsWith( ", First>" );
        if (is_range) //add all characters within a specified range
        {
            char_name = char_name.Replace( ", First", String.Empty ); //remove range indicator from name
            fields = unicodedata[++i].Split( ';' );
            int end_char_code = int.Parse( fields[0], NumberStyles.HexNumber );
            if (!fields[1].EndsWith( ", Last>" ))
                throw new Exception( "Expected end-of-range indicator." );
            for (int code_in_range = char_code; code_in_range <= end_char_code; code_in_range++)
                charname_map.Add( (char)code_in_range, char_name );
        }
        else
            charname_map.Add( (char)char_code, char_name );
    }
}

The UnicodeData.txt file is UTF-8 encoded and has one line per code point, except that large ranges are given as a pair of lines whose names end in ", First>" and ", Last>". Each line is a semicolon-separated list of fields: the first field is the Unicode code point in hexadecimal (with no prefix) and the second field is the character name. Information about the file format and the other fields can be found here: http://www.unicode.org/reports/tr44/#Format_Conventions

Once you use the above code to build a mapping of characters to character names, you just retrieve them from the map with something like this:

char c = 'Â';
string character_name;
if (!charname_map.TryGetValue( c, out character_name ))
    character_name = "<Character Name Missing>"; //character not found in map
//character_name should now contain "LATIN CAPITAL LETTER A WITH CIRCUMFLEX";

I suggest embedding the UnicodeData.txt file in your application resources and wrapping this code in a class that loads and parses the file once, in a static initializer. To make the calling code more readable, you could expose the lookup from that class as an extension method on char, named something like GetUnicodeName.

I've purposely restricted the values to the range 0 through 0xFFFF, because that's all a .NET UTF-16 char can hold. A .NET char doesn't actually represent a true "character" (code point) but a UTF-16 code unit, and some characters require two code units. Such a pair is called a surrogate pair, made up of a high surrogate and a low surrogate. Values above 0xFFFF (the largest value a 16-bit char can store) are outside the Basic Multilingual Plane (BMP) and take two chars to encode in UTF-16. With this implementation, the individual code units of a surrogate pair end up with names like "<Non Private Use High Surrogate>", "<Private Use High Surrogate>", and "<Low Surrogate>".
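Here is a minimal sketch of what such a wrapper class might look like. The names UnicodeNames and Parse are hypothetical, and reading UnicodeData.txt from the working directory is an assumption made to keep the example short; an embedded resource stream could be substituted. Like the code above it only covers the BMP, and it returns null for characters that have no entry.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

// Hypothetical wrapper class; the names UnicodeNames and Parse, and the
// file location, are illustrative assumptions and not part of any library.
public static class UnicodeNames
{
    // Loaded once, on the first lookup. Reading from the working directory
    // is an assumption; an embedded resource could be used instead.
    private static readonly Lazy<Dictionary<char, string>> Map =
        new Lazy<Dictionary<char, string>>(
            () => Parse(File.ReadLines("UnicodeData.txt")));

    // Builds a char -> name map from UnicodeData.txt lines (BMP only).
    public static Dictionary<char, string> Parse(IEnumerable<string> lines)
    {
        var map = new Dictionary<char, string>();
        int rangeStart = -1; // first code point of an open First/Last range
        foreach (string line in lines)
        {
            if (line.Length == 0) continue;
            string[] fields = line.Split(';');
            int code = int.Parse(fields[0], NumberStyles.HexNumber,
                                 CultureInfo.InvariantCulture);
            if (code > 0xFFFF) continue; // does not fit in a single char

            string name = fields[1];
            if (name.EndsWith(", First>", StringComparison.Ordinal))
            {
                rangeStart = code; // the next line closes the range
            }
            else if (rangeStart >= 0 &&
                     name.EndsWith(", Last>", StringComparison.Ordinal))
            {
                // Every code point in the range shares one generic name.
                string rangeName = name.Replace(", Last", string.Empty);
                for (int c = rangeStart; c <= code; c++)
                    map[(char)c] = rangeName;
                rangeStart = -1;
            }
            else
            {
                map[(char)code] = name;
            }
        }
        return map;
    }

    // Extension method on char; returns null when no name is known.
    public static string GetUnicodeName(this char c)
    {
        string name;
        return Map.Value.TryGetValue(c, out name) ? name : null;
    }
}
```

With this in place a lookup reads as 'Â'.GetUnicodeName(). Keeping Parse separate from the file access means the parsing logic can be exercised with a handful of sample lines, without needing the whole database.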


If you use Process Monitor to look at the files accessed by charmap.exe, you'll see that it opens a file named C:\Windows\system32\getuname.dll. This file contains the character names in its resources (actually the resources themselves are in a .mui file in a culture-specific subdirectory).

So all you have to do is get the names from this file, using the LoadString API. I wrote a helper class to do it:

using System;
using System.Runtime.InteropServices;
using System.Text;

public class Win32ResourceReader : IDisposable
{
    private IntPtr _hModule;

    public Win32ResourceReader(string filename)
    {
        _hModule = LoadLibraryEx(filename, IntPtr.Zero, LoadLibraryFlags.AsDataFile | LoadLibraryFlags.AsImageResource);
        if (_hModule == IntPtr.Zero)
            throw Marshal.GetExceptionForHR(Marshal.GetHRForLastWin32Error());
    }

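    public string GetString(uint id)
    {
        var buffer = new StringBuilder(1024);
        // LoadString returns the number of characters copied, or 0 on failure.
        // The last-error value is only meaningful when the call has failed,
        // so test the return value rather than GetLastWin32Error().
        int length = LoadString(_hModule, id, buffer, buffer.Capacity);
        if (length == 0)
            throw Marshal.GetExceptionForHR(Marshal.GetHRForLastWin32Error());
        return buffer.ToString();
    }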

    ~Win32ResourceReader()
    {
        Dispose(false);
    }

    public void Dispose()
    {
        Dispose(true);
        GC.SuppressFinalize(this);
    }

    protected virtual void Dispose(bool disposing)
    {
        if (_hModule != IntPtr.Zero)
            FreeLibrary(_hModule);
        _hModule = IntPtr.Zero;
    }

    [DllImport("user32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern int LoadString(IntPtr hInstance, uint uID, StringBuilder lpBuffer, int nBufferMax);

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern IntPtr LoadLibraryEx(string lpFileName, IntPtr hReservedNull, LoadLibraryFlags dwFlags);

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern bool FreeLibrary(IntPtr hModule);

    [Flags]
    enum LoadLibraryFlags : uint
    {
        AsDataFile = 0x00000002,
        AsImageResource = 0x00000020
    }
}

You can use it like this:

string path = @"C:\Windows\System32\getuname.dll";
using (var reader = new Win32ResourceReader(path))
{
    string name = reader.GetString(0xA9);
    Console.WriteLine(name); // Copyright Sign
}


It isn't a built-in feature in .NET. You can look a name up in Charmap.exe, which displays the code point name in the status bar. If you need this in your own program, you could compile the Unicode Character Database into your app.


I don't think there's anything built into .NET to identify this... But there is a Unicode character database.


As Hans Passant and MichaelBray said, .NET does not provide any built-in feature to get the Unicode name of characters.

You can use the Unicode Character Database, which is available at http://unicode.org/ucd. At the time of writing it contains complete information on all Unicode 5.2 characters (Annex #44).

Another alternative is the Character Map in Windows, which you can open via Start\All Programs\Accessories\System Tools\Character Map (Win+R => charmap).

You can also use Unicode Converter Tools, an open source tool at http://unicode.codeplex.com. It provides a user interface for looking up character information and is itself built on the Unicode UCD (Annex #44). The key point for you is that you can add this application's EnterpriseAppUnit dll to your own application and use the API it provides.

This assembly contains some static methods that take a 'char' and return the Name, HexCode, Decimal Code, etc.
