How do I split up a search string to allow for quoted text?
I want to build a list of strings from the text of a search field. Anything enclosed in double quotes should be kept together as a single term; everything else should be split on spaces and commas.
Example:
sample' "string's are, more "text" making" 12.34,hello"pineapple sundays
Produces
sample'
string's are, more_ //underscore shown to display space
text
making
12.34
hello
pineapple
sundays
Edit: Here is my (somewhat) elegant solution, thanks for the help everyone!
Private Function GetSearchTerms(ByVal searchText As String) As String()
'Clean search string of unwanted characters'
searchText = System.Text.RegularExpressions.Regex.Replace(searchText, "[^a-zA-Z0-9""'.,= ]", "")
'Guarantees the first entry will not be an entry in quotes if the searchkeywords starts with double quotes'
Dim searches As String() = searchText.Replace("""", " "" ").Split("""")
Dim myWords As System.Collections.Generic.List(Of String) = New System.Collections.Generic.List(Of String)
Dim delimiters As String() = New String() {" ", ","}
For index As Integer = 0 To searches.Length - 1
'even is regular text, split up into individual search terms'
If (index Mod 2 = 0) Then
myWords.AddRange(searches(index).Split(delimiters, StringSplitOptions.RemoveEmptyEntries))
Else
'check for unclosed double quote, if so, split it up and add, space we added earlier will get split out'
If (searches.Length Mod 2 = 0 And index = searches.Length - 1) Then
myWords.AddRange(searches(index).Split(delimiters, StringSplitOptions.RemoveEmptyEntries))
Else
'2 double quotes found'
'remove the 2 spaces that we added earlier'
Dim myQuotedString As String = searches(index).Substring(1, searches(index).Length - 2)
If (myQuotedString.Length > 0) Then
myWords.Add(myQuotedString)
End If
End If
End If
Next
Return myWords.ToArray()
End Function
Oi, vb commenting is ugly, anyone know how to clean this up?
This is a more complex parsing problem than you fully appreciate. I suggest you look at the TextFieldParser class and the FileHelpers library: http://www.filehelpers.com/
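For reference, here is a minimal C# sketch of what using TextFieldParser for search terms might look like. The class and method names (SearchTermParser, ParseTerms) are made up for illustration, and the project is assumed to reference the Microsoft.VisualBasic assembly. Note that TextFieldParser expects well-formed quoting: it throws MalformedLineException on unbalanced quotes such as the sample in the question, so the caller would need to handle that case separately.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO; // assumes a reference to Microsoft.VisualBasic

public static class SearchTermParser // hypothetical helper, not from the question
{
    // Splits one line of search text on spaces and commas,
    // keeping text inside double quotes together as a single field.
    public static List<string> ParseTerms(string searchText)
    {
        var terms = new List<string>();
        using (var reader = new StringReader(searchText))
        using (var parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(" ", ",");
            parser.HasFieldsEnclosedInQuotes = true;

            // ReadFields returns null when there is no more data and
            // throws MalformedLineException if the quoting cannot be parsed.
            string[] fields = parser.ReadFields();
            if (fields != null)
            {
                foreach (string field in fields)
                {
                    // Consecutive delimiters yield empty fields; skip them.
                    if (!string.IsNullOrEmpty(field))
                    {
                        terms.Add(field);
                    }
                }
            }
        }
        return terms;
    }
}
```

Calling SearchTermParser.ParseTerms("alpha \"beta gamma\",delta") should give three terms, with "beta gamma" kept as one.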
This is not a complete solution, since it is missing a few validation checks, but it has everything you need.
My CharOccurs() finds every occurrence of '"' and stores the positions in a list, in order.
public static List<int> CharOccurs(string stringToSearch, char charToFind)
{
List<int> count = new List<int>();
int chr = 0;
while (chr != -1)
{
chr = stringToSearch.IndexOf(charToFind, chr);
if (chr != -1)
{
count.Add(chr);
chr++;
}
else
{
chr = -1;
}
}
return count;
}
The code below is mostly self-explanatory. I take the part of the string that lies within quotes and split it on the '"' character only. Then I take substrings of the text outside the quotes and split them on ",", space and '"' characters. Please add validation checks wherever needed to make it generic.
string input = "sample' \"string's are, more \"text\" making\" 12.34,hello\"pineapple sundays";
List<int> positions = CharOccurs(input, '\"');
string within_quotes, outside_quotes;
string[] arr_within_quotes;
List<string> output = new List<string>();
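// Take everything before the first quote (length positions[0], not positions[0]-1,
// which would drop a character and throw when the input starts with a quote).
output.AddRange(input.Substring(0, positions[0]).Split(new char[] { ' ', ',', '"' }, StringSplitOptions.RemoveEmptyEntries));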
if (positions.Count % 2 == 0)
{
within_quotes = input.Substring(positions[0]+1, positions[positions.Count - 1] - positions[0]-1);
arr_within_quotes = within_quotes.Split('"');
output.AddRange(arr_within_quotes);
output.AddRange(input.Substring(positions[positions.Count - 1] + 1).Split(new char[] { ' ', ',' }));
}
else
{
within_quotes = input.Substring(positions[0]+1, positions[positions.Count - 2] - positions[0]-1);
arr_within_quotes = within_quotes.Split('"');
output.AddRange(arr_within_quotes);
output.AddRange(input.Substring(positions[positions.Count - 2] + 1).Split(new char[] { ' ', ',', '"' }));
}
I wrote this ParseLine function a few months ago for VB.NET, and it may be of some use to you. It works out whether text qualifiers are present and splits accordingly. I'll try to convert it to C# in the next few minutes if you want me to.
You would have your line of text:
sample' "string's are, more "text" making" 12.34,hello"pineapple sundays
You would pass that as strLine, set strDataDelimiter = "," and set strTextQualifier = """"
Hope this helps you out.
Public Function ParseLine(ByVal strLine As String, Optional ByVal strDataDelimiter As String = "", Optional ByVal strTextQualifier As String = "", Optional ByVal strQualifierSplitter As Char = vbTab) As String()
Try
Dim strField As String = Nothing
Dim strNewLine As String = Nothing
Dim lngChrPos As Integer = 0
Dim bUseQualifier As Boolean = False
Dim bRemobedLastDel As Boolean = False
Dim bEmptyLast As Boolean = False ' Take into account where the line ends in a field delimiter, the ParseLine function should keep that empty field as well.
Dim strList As String()
'TEST,23479234,Just Right 950g,02/04/2006,1234,5678,9999,0000
'TEST,23479234,Just Right 950g,02/04/2006,1234,5678,9999,0000,
'TEST,23479234,Just Right 950g,02/04/2006,1234,,,0000,
'TEST,23479234,Just Right 950g,02/04/2006,1234,5678,9999,,
'TEST,23479234,"Just Right 950g, BO",02/04/2006,,5678,9999,,
'TEST,23479234,"Just Right"" 950g, BO",02/04/2006,,5678,9999,1111,
'TEST23479234 'Kellogg''s Just Right 950g' 02/04/2006 1234 5678 0000 9999
'TEST23479234 '' 02/04/2006 1234 5678 0000 9999
'use qualifier parsing only when a text qualifier was supplied
bUseQualifier = strTextQualifier.Length > 0
'split data based on options..
If bUseQualifier Then
'replace double qualifiers for ease of parsing..
'strLine = strLine.Replace(New String(strTextQualifier, 2), vbTab)
'loop and find each field..
Do Until strLine = Nothing
If strLine.Substring(0, 1) = strTextQualifier Then
'find closing qualifier
lngChrPos = strLine.IndexOf(strTextQualifier, 1)
'check for missing double qualifiers, unclosed qualifiers
Do Until (strLine.Length() - 1) = lngChrPos OrElse lngChrPos = -1 OrElse _
strLine.Substring(lngChrPos + 1, 1) = strDataDelimiter
lngChrPos = strLine.IndexOf(strTextQualifier, lngChrPos + 1)
Loop
'get field from line..
If lngChrPos = -1 Then
strField = strLine.Substring(1)
strLine = vbNullString
Else
strField = strLine.Substring(1, lngChrPos - 1)
If (strLine.Length() - 1) = lngChrPos Then
strLine = vbNullString
Else
strLine = strLine.Substring(lngChrPos + 2)
If strLine = "" Then
bEmptyLast = True
End If
End If
'strField = String.Format("{0}{1}{2}", strTextQualifier, strField, strTextQualifier)
End If
Else
'find next delimiter..
'lngChrPos = InStr(1, strLine, strDataDelimiter)
lngChrPos = strLine.IndexOf(strDataDelimiter)
'get field from line..
If lngChrPos = -1 Then
strField = strLine
strLine = vbNullString
Else
strField = strLine.Substring(0, lngChrPos)
strLine = strLine.Substring(lngChrPos + 1)
If strLine = "" Then
bEmptyLast = True
End If
End If
End If
' Now replace double qualifiers with a single qualifier in the "corrected" string
strField = strField.Replace(New String(strTextQualifier, 2), strTextQualifier)
'restore double qualifiers..
'strField = IIf(strField = vbNullChar, vbNullString, strField)
'strField = Replace$(strField, vbTab, strTextQualifier)
'strField = IIf(strField = vbTab, vbNullString, strField)
'strField = strField.Replace(vbTab, strTextQualifier)
'save field to array..
strNewLine = String.Format("{0}{1}{2}", strNewLine, strQualifierSplitter, strField)
Loop
If bEmptyLast = True Then
strNewLine = String.Format("{0}{1}", strNewLine, strQualifierSplitter)
End If
'trim off first nullchar..
strNewLine = strNewLine.Substring(1)
'split new line..
strList = strNewLine.Split(strQualifierSplitter)
Else
If strLine.Substring(strLine.Length - 1, 1) = strDataDelimiter Then
strLine = strLine.Substring(0)
End If
'no qualifier.. do a simply split..
strList = strLine.Split(strDataDelimiter)
End If
'return result..
Return strList
Catch ex As Exception
Throw New Exception(String.Format("Error Splitting Special String - {0}", ex.Message.ToString()))
End Try
End Function
If you want to display an underscore to indicate a space before the ", as you show in your question, you can use:
string[] splitString = t.Replace(" \"", "_\"").Split('"');
Regular expressions for this sort of thing get complicated fast as you start to add all sorts of exceptions.
Nonetheless, if only for the sake of interest and completeness:
(?<term>[a-zA-Z0-9'.=]+)|("(?<term>[^"]+)")
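A rough C# sketch of how that pattern could be applied, assuming you read the named group "term" from each match. The class and method names (RegexTermExtractor, Extract) are invented for this example. The Trim() call is an addition of mine: the quoted alternative captures the spaces next to the quotes, so without it you would get "string's are, more " with a trailing space.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexTermExtractor // hypothetical helper name
{
    // Same pattern as above; quotes are doubled inside the verbatim string.
    private static readonly Regex TermPattern =
        new Regex(@"(?<term>[a-zA-Z0-9'.=]+)|(""(?<term>[^""]+)"")");

    public static List<string> Extract(string searchText)
    {
        var terms = new List<string>();
        foreach (Match match in TermPattern.Matches(searchText))
        {
            // Both alternatives capture into the same named group.
            string term = match.Groups["term"].Value.Trim();
            if (term.Length > 0)
            {
                terms.Add(term);
            }
        }
        return terms;
    }
}
```

On the sample string from the question this yields sample', string's are, more, text, making, 12.34, hello, pineapple, sundays. The unclosed quote before pineapple is simply skipped, because the quoted alternative cannot match without a closing quote.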