How to find which delimiter was used during string split (VB.NET)
lets say I have a string that I want to split based on several characters, like "."
, "!"
, and "?"
. How do I figure out which one of those characters split my string so I can add that same character back on to the end of the split segments in question?
Dim linePunctuation as Integer = 0
Dim myS开发者_开发问答tring As String = "some text. with punctuation! in it?"
For i = 1 To Len(myString)
If Mid$(entireFile, i, 1) = "." Then linePunctuation += 1
Next
For i = 1 To Len(myString)
If Mid$(entireFile, i, 1) = "!" Then linePunctuation += 1
Next
For i = 1 To Len(myString)
If Mid$(entireFile, i, 1) = "?" Then linePunctuation += 1
Next
Dim delimiters(3) As Char
delimiters(0) = "."
delimiters(1) = "!"
delimiters(2) = "?"
currentLineSplit = myString.Split(delimiters)
Dim sentenceArray(linePunctuation) As String
Dim count As Integer = 0
While linePunctuation > 0
sentenceArray(count) = currentLineSplit(count)'Here I want to add what ever delimiter was used to make the split back onto the string before it is stored in the array.'
count += 1
linePunctuation -= 1
End While
If you add a capturing group to your regex like this:
SplitArray = Regex.Split(myString, "([.?!])")
Then the returned array contains both the text between the punctuation, and separate elements for each punctuation character. The Split()
function in .NET includes text matched by capturing groups in the returned array. If your regex has several capturing groups, all their matches are included in the array.
This splits your sample into:
some text
.
with punctuation
!
in it
?
You can then iterate over the array to get your "sentences" and your punctuation.
.Split() does not provide this information.
You will need to use a regular expression to accomplish what you are after, which I infer as the desire to split an English-ish paragraph into sentences by splitting on punctuation.
The simplest implementation would look like this.
var input = "some text. with punctuation! in it?";
string[] sentences = Regex.Split(input, @"\b(?<sentence>.*?[\.!?](?:\s|$))");
foreach (string sentence in sentences)
{
Console.WriteLine(sentence);
}
Results
some text. with punctuation! in it?
But you are going to find very quickly that language, as spoken/written by humans, does not follow simple rules most of the time.
Here it is in VB.NET for you:
Dim sentences As String() = Regex.Split(line, "\b(?<sentence>.*?[\.!?](?:\s|$))")
Once you've called Split
with all 3 characters, you've tossed that information away. You could do what you're trying to do by splitting yourself or by splitting on one punctuation mark at a time.
精彩评论