How to not include line breaks when comparing two strings
I am comparing updates to two strings. I did a:
string1 != string2
and they turn out to be different. I put them in "Add Watch" and I see the only difference is that one has line breaks and the other doesn't:
string1 = "This is a test. \nThis is a test";
string2 = "This is a test. This is a test";
I basically want to do a comparison that doesn't include line breaks. So if a line break is the only difference, consider them equal.
A quick and dirty way, when performance isn't much of an issue:
string1.Replace("\n", "") != string2.Replace("\n", "")
I'd suggest a regex to reduce every run of spaces, tabs, \r and \n to a single space:
Regex.Replace(string1, @"\s+", " ") != Regex.Replace(string2, @"\s+", " ")
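A minimal sketch of the same idea wrapped in a helper, in case it is needed in more than one place. The names WhitespaceInsensitive, Normalize and AreEqual are made up for illustration, and the Trim() call is an extra assumption (that leading and trailing whitespace should be ignored as well), not part of the one-liner above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceInsensitive
{
    // Collapses each run of whitespace (space, tab, \r, \n, ...) to one space
    // and trims the ends, so leading/trailing whitespace is ignored as well.
    private static string Normalize(string s)
    {
        return Regex.Replace(s, @"\s+", " ").Trim();
    }

    public static bool AreEqual(string a, string b)
    {
        if (ReferenceEquals(a, b)) return true;   // same instance, or both null
        if (a == null || b == null) return false; // exactly one is null
        return string.Equals(Normalize(a), Normalize(b), StringComparison.Ordinal);
    }
}
```

With the two strings from the question, WhitespaceInsensitive.AreEqual(string1, string2) returns true. Note that "a\nb" and "ab" still compare as different, because the line break becomes a space rather than disappearing.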
Assuming:
- The sort of direct char-value-for-char-value comparison of != and == is what is wanted here, except for the matter of newlines.
- The strings are, or may be, large enough or compared often enough to make just replacing "\n" with an empty string too inefficient.
Then:
//requires: using System.Collections.Generic; using System.Linq;
public bool LinelessEquals(string x, string y)
{
    //deal with quickly handlable cases quickly.
    if(ReferenceEquals(x, y))//same instance
        return true; // - generally happens often in real code,
                     //and is a fast check, so always worth doing first.
    //We already know they aren't both null as
    //ReferenceEquals(null, null) returns true.
    if(x == null || y == null)
        return false;
    IEnumerator<char> eX = x.Where(c => c != '\n').GetEnumerator();
    IEnumerator<char> eY = y.Where(c => c != '\n').GetEnumerator();
    while(eX.MoveNext())
    {
        if(!eY.MoveNext()) //y is shorter
            return false;
        if(eX.Current != eY.Current)
            return false;
    }
    return !eY.MoveNext(); //check if y was longer.
}
This is defined as equality rather than inequality, so you could easily adapt it to be an implementation of IEqualityComparer<string>.Equals. Your question for a linebreak-less string1 != string2 becomes: !LinelessEquals(string1, string2)
Here's an equality comparer for strings that ignores certain characters, such as \r and \n.
This implementation doesn't allocate any heap memory during execution, helping its performance. It also avoids virtual calls through IEnumerable and IEnumerator.
public sealed class SelectiveStringComparer : IEqualityComparer<string>
{
    private readonly string _ignoreChars;

    public SelectiveStringComparer(string ignoreChars = "\r\n")
    {
        _ignoreChars = ignoreChars;
    }

    public bool Equals(string x, string y)
    {
        if (ReferenceEquals(x, y))
            return true;
        if (x == null || y == null)
            return false;
        var ix = 0;
        var iy = 0;
        while (true)
        {
            // skip ignored characters on both sides
            while (ix < x.Length && _ignoreChars.IndexOf(x[ix]) != -1)
                ix++;
            while (iy < y.Length && _ignoreChars.IndexOf(y[iy]) != -1)
                iy++;
            if (ix >= x.Length)
                return iy >= y.Length;
            if (iy >= y.Length)
                return false;
            if (x[ix] != y[iy])
                return false;
            ix++;
            iy++;
        }
    }

    public int GetHashCode(string obj)
    {
        throw new NotSupportedException();
    }
}
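Because GetHashCode throws here, the comparer can't be used as-is with hash-based collections such as Dictionary or HashSet. A rough sketch of a hash that stays consistent with Equals by skipping the same ignored characters is shown below. SelectiveHash and Compute are hypothetical names, and the FNV-1a-style mixing over UTF-16 code units is just one reasonable choice, not something the answer above prescribes.

```csharp
using System;

public static class SelectiveHash
{
    // Hashes only the characters that the comparer would actually compare,
    // so strings that differ only in ignored characters get the same hash.
    public static int Compute(string s, string ignoreChars = "\r\n")
    {
        if (s == null)
            return 0;
        unchecked
        {
            uint hash = 2166136261;          // FNV-1a 32-bit offset basis
            foreach (char c in s)
            {
                if (ignoreChars.IndexOf(c) != -1)
                    continue;                // same skip rule as Equals
                hash = (hash ^ c) * 16777619; // FNV 32-bit prime
            }
            return (int)hash;
        }
    }
}
```

The body of GetHashCode could then delegate to something like SelectiveHash.Compute(obj, _ignoreChars), assuming the same ignore list is used for both methods.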
A cleaner approach would be to use:
string1.Replace(Environment.NewLine, String.Empty) != string2.Replace(Environment.NewLine, String.Empty);
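One caveat worth checking: Environment.NewLine is "\r\n" on Windows and "\n" on Unix-like systems, so on Windows the call above leaves a lone "\n" (as in the question's string1) in place. A small sketch that strips both characters regardless of platform follows. NewlineStripper and Strip are names invented for this example.

```csharp
using System;

public static class NewlineStripper
{
    // Removes every carriage return and line feed, whichever
    // line-ending convention the text was written with.
    public static string Strip(string s)
    {
        return s.Replace("\r", string.Empty).Replace("\n", string.Empty);
    }
}
```

The comparison then becomes NewlineStripper.Strip(string1) != NewlineStripper.Strip(string2).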
This is a generalized and tested version of Jon Hanna's answer.
/// <summary>
/// Compares two character enumerables one character at a time, ignoring those specified.
/// </summary>
/// <param name="x"></param>
/// <param name="y"></param>
/// <param name="ignoreThese"> If not specified, the default is to ignore carriage return and line feed: {'\r', '\n'} </param>
/// <returns></returns>
public static bool EqualsIgnoreSome(this IEnumerable<char> x, IEnumerable<char> y, params char[] ignoreThese)
{
    // First deal with quickly handlable cases quickly:
    // Same instance - generally happens often in real code, and is a fast check, so always worth doing first.
    if (ReferenceEquals(x, y))
        return true;
    // We already know they aren't both null as ReferenceEquals(null, null) returns true.
    if (x == null || y == null)
        return false;
    // Default ignore is newlines:
    if (ignoreThese == null || ignoreThese.Length == 0)
        ignoreThese = new char[] { '\r', '\n' };
    // Filters by specifying enumerator.
    IEnumerator<char> eX = x.Where(c => !ignoreThese.Contains(c)).GetEnumerator();
    IEnumerator<char> eY = y.Where(c => !ignoreThese.Contains(c)).GetEnumerator();
    // Compares.
    while (eX.MoveNext())
    {
        if (!eY.MoveNext()) //y is shorter
            return false;
        if (eX.Current != eY.Current)
            return false;
    }
    return !eY.MoveNext(); //check if y was longer.
}
Can't you just strip out the line breaks before comparing the strings?
E.g.:
string1.Replace("\n", "") != string2.Replace("\n", "")
Here's a version in VB.NET based on Drew Noakes's answer:
Dim g_sIgnore As String = " " & vbNewLine & vbTab 'space, CR, LF and tab
Public Function StringCompareIgnoringWhitespace(s1 As String, s2 As String) As Boolean
    Dim i1 As Integer = 0
    Dim i2 As Integer = 0
    Dim s1l As Integer = s1.Length
    Dim s2l As Integer = s2.Length
    Do
        'skip ignored characters on both sides
        While i1 < s1l AndAlso g_sIgnore.IndexOf(s1(i1)) <> -1
            i1 += 1
        End While
        While i2 < s2l AndAlso g_sIgnore.IndexOf(s2(i2)) <> -1
            i2 += 1
        End While
        If i1 = s1l AndAlso i2 = s2l Then
            Return True
        Else
            If i1 < s1l AndAlso i2 < s2l AndAlso s1(i1) = s2(i2) Then
                i1 += 1
                i2 += 1
            Else
                Return False
            End If
        End If
    Loop
    Return False
End Function
I also tested it with:
Try
    Debug.Assert(Not StringCompareIgnoringWhitespace("a", "z"))
    Debug.Assert(Not StringCompareIgnoringWhitespace("aa", "zz"))
    Debug.Assert(StringCompareIgnoringWhitespace("", ""))
    Debug.Assert(StringCompareIgnoringWhitespace(" ", ""))
    Debug.Assert(StringCompareIgnoringWhitespace("", " "))
    Debug.Assert(StringCompareIgnoringWhitespace(" a", "a "))
    Debug.Assert(StringCompareIgnoringWhitespace(" aa", "aa "))
    Debug.Assert(StringCompareIgnoringWhitespace(" aa ", " aa "))
    Debug.Assert(StringCompareIgnoringWhitespace(" aa a", " aa a"))
    Debug.Assert(Not StringCompareIgnoringWhitespace("a", ""))
    Debug.Assert(Not StringCompareIgnoringWhitespace("", "a"))
    Debug.Assert(Not StringCompareIgnoringWhitespace("ccc", ""))
    Debug.Assert(Not StringCompareIgnoringWhitespace("", "ccc"))
Catch ex As Exception
    Console.WriteLine(ex.ToString)
End Try
I've run into this problem a number of times when writing unit tests that need to compare multi-line expected strings with the actual output strings.
For example, if I'm writing a method that outputs a multi-line string I care about what each line looks like, but I don't care about the particular newline character used on a Windows or Mac machine.
In my case I just want to assert that each line is equal in my unit tests and bail out if one of them isn't.
public static void AssertAreLinesEqual(string expected, string actual)
{
    using (var expectedReader = new StringReader(expected))
    using (var actualReader = new StringReader(actual))
    {
        while (true)
        {
            var expectedLine = expectedReader.ReadLine();
            var actualLine = actualReader.ReadLine();
            Assert.AreEqual(expectedLine, actualLine);
            if (expectedLine == null || actualLine == null)
                break;
        }
    }
}
Of course, you could also make the method a little more generic and have it return a bool instead.
public static bool AreLinesEqual(string expected, string actual)
{
    using (var expectedReader = new StringReader(expected))
    using (var actualReader = new StringReader(actual))
    {
        while (true)
        {
            var expectedLine = expectedReader.ReadLine();
            var actualLine = actualReader.ReadLine();
            if (expectedLine != actualLine)
                return false;
            if (expectedLine == null || actualLine == null)
                break;
        }
    }
    return true;
}
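Both methods above depend on how StringReader.ReadLine splits text: it treats "\n", "\r" and "\r\n" each as a single line terminator. A small sketch to make that visible is below. LineSplitter and ReadLines are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LineSplitter
{
    // Collects the lines exactly as StringReader.ReadLine reports them,
    // i.e. with the line terminators already removed.
    public static List<string> ReadLines(string text)
    {
        var lines = new List<string>();
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                lines.Add(line);
        }
        return lines;
    }
}
```

For instance, ReadLines("a\r\nb\nc\rd") yields the four lines a, b, c and d. One consequence to be aware of: "a" and "a\n" both yield the single line a, so a trailing line break at the very end of the text is ignored by this kind of comparison.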
What surprises me most is that there isn't a method like this included in any unit testing framework I've used.
I've had this issue with line endings in a unit test.
//compare files ignoring line ends
org.junit.Assert.assertEquals(
read.readPayload("myFile.xml")
.replace("\n", "")
.replace("\r", ""),
values.getFile()
.replace("\n", "")
.replace("\r", ""));
I usually do not like to make this kind of comparison (comparing the whole file), as a better approach would be validating the fields. But it answers the question here, as it removes line endings for most systems (the replace calls are the trick).
PS: read.readPayload reads a text file from the resources folder and puts it into a String, and values is a structure that contains a String with the raw content of a file (as String) in its attributes.
PS2: Performance was not considered, since it was just an ugly fix for a unit test.