String StartsWith() issue with Danish text
Can anyone explain this behaviour?
var culture = new CultureInfo("da-DK");
Thread.CurrentThread.CurrentCulture = culture;
"daab".StartsWith("da"); //false
I know that it can be fixed by specifying StringComparison.InvariantCulture
. But I'm just confused by the behavior.
I also know that "aA" and "AA" are not considered the same in a Danish case-insensitive comparision, see http://msdn.microsoft.com/en-us/library/xk2wykcz.aspx. Which explains this
String.Compare("aA", "AA", new CultureInfo("da-DK"), CompareOptions.IgnoreCase) // -1 (not equal)
Is this linked to 开发者_如何学Gothe behavior of the first code snippet?
Here a test that illustrates the problem, daab og dåb (same word in old and modern language respectively) means baptism/christening.
public class can_handle_remnant_of_danish_language
{
[Fact]
public void daab_start_with_då()
{
var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
Assert.True("daab".StartsWith("då")); // Fails
}
[Fact]
public void daab_start_with_da()
{
var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
Assert.True("daab".StartsWith("da")); // Fails
}
[Fact]
public void daab_start_with_daa()
{
var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
Assert.True("daab".StartsWith("daa")); // Succeeds
}
[Fact]
public void dåb_start_with_daa()
{
var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
Assert.True("dåb".StartsWith("daa")); // Fails
}
[Fact]
public void dåb_start_with_da()
{
var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
Assert.True("dåb".StartsWith("da")); // Fails
}
[Fact]
public void dåb_start_with_då()
{
var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
Assert.True("dåb".StartsWith("då")); // Succeeds
}
}
All the above tests should be successfull with my understanding of the language, and im danish! I aint got no degree in grammar though. :-)
Seems like a bug to me.
Like Nappy said, its a feature of the danish language, where "aa" and "å" is still the same. Danish got another two letters, æ and ø, but I am not sure if they can be written using two letters as well.
I think in the second example "aA" is not changed while "AA" is changed to "Å". Just to confuse things even more, "Aa" is considered equal to "AA" and "aa" only when using case-insensitive comparing.
The modern spelling of "baptism" in Danish, namely dåb, is certainly not considered to start with da, for a Danophone. If daab is supposed to be an old-fashioned spelling of dåb, it is a bit philosophical whether it starts with da or not. But for (modern) collation purposes, it does not (alphabetically, such daab goes after disk, not before).
However, if your string is not supposed to represent natural language, but is instead some kind of technical code, like hexadecimal digits, surely you do not want to use any culture-specific rules. The solution here is not to use the invariant culture. The invariant culture has (English) rules itself!
Instead, you want to use ordinal comparison.
Ordinal comparison simply compares the strings char
by char
, without any assumptions of what sequences are "equivalent" in some sense. (Technical remark: Each char
is a UTF-16 code unit, not a "character". Ordinal comparison is ignorant of the rules of Unicode normalization.)
I think the confusion arises because, by default, some string
methods use a culture-aware comparison, and other string
methods use the ordinal comparison.
The following examples all use a culture-aware comparison:
"Straße".StartsWith("Strasse", StringComparison.CurrentCulture)
"Straße".Equals("Strasse", StringComparison.CurrentCulture)
"ne\u0301e".StartsWith("née", StringComparison.CurrentCulture)
"ne\u0301e".Equals("née", StringComparison.CurrentCulture)
"Straße".StartsWith("Strasse") // CurrentCulture is default for 'StartsWith'!
"ne\u0301e".StartsWith("née") // CurrentCulture is default for 'StartsWith'!
Each of the above may depend on the .NET version as well! (As an example, the first one gives true
if the current culture is the invariant culture and you are under .NET Framework 4.8; but it gives false
if the current culture is the invariant culture and you use .NET 6.)
But these examples use ordinal comparison:
"Straße".StartsWith("Strasse", StringComparison.Ordinal)
"Straße".Equals("Strasse", StringComparison.Ordinal)
"ne\u0301e".StartsWith("née", StringComparison.Ordinal)
"ne\u0301e".Equals("née", StringComparison.Ordinal)
"Straße".Equals("Strasse") // Ordinal is default for 'Equals'!
"ne\u0301e".Equals("née") // Ordinal is default for 'Equals'!
So remember to check what the default comparison is for the string
method you use, and specify the opposite one if needed. (Or always specify the comparison, even when redundant, if you prefer.)
精彩评论