String chunking algorithm with natural language context
I have an arbitrarily large string of text from the user that needs to be split into chunks of up to 10k characters (the limit may be adjustable) and sent off to another system for processing.
- Chunks cannot be longer than 10k (or other arbitrary value)
- Text should be broken with natural language context in mind
  - split on punctuation when possible
  - split on spaces if no punctuation exists
  - break a word as a last resort
I'm trying not to re-invent the wheel with this. Any suggestions before I roll this from scratch?
Using C#.
This may not handle every case as you need, but it should get you on your way.
// Requires: using System.Collections.Generic;
public IList<string> ChunkifyText(string bigString, int maxSize, char[] punctuation)
{
    List<string> results = new List<string>();
    int startIndex = 0;
    while (startIndex < bigString.Length)
    {
        // If the remainder fits in one chunk, take it whole and stop.
        if (bigString.Length - startIndex <= maxSize)
        {
            results.Add(bigString.Substring(startIndex));
            break;
        }
        string chunk = bigString.Substring(startIndex, maxSize);
        // Prefer the last punctuation mark, then the last space.
        int endIndex = chunk.LastIndexOfAny(punctuation);
        if (endIndex < 0)
            endIndex = chunk.LastIndexOf(' ');
        // Last resort: cut at the size limit, possibly mid-word.
        if (endIndex < 0)
            endIndex = maxSize - 1;
        results.Add(chunk.Substring(0, endIndex + 1));
        startIndex += endIndex + 1;
    }
    return results;
}
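For comparison, here is a rough sketch of the same idea that searches the original string by index instead of copying each window into a temporary substring first. The names TextChunker, Chunk and Demo are made up for illustration, and it assumes "10k" means 10k UTF-16 chars (what string.Length counts). Treat it as a starting point, not a finished implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class; not part of any library.
public static class TextChunker
{
    public static List<string> Chunk(string text, int maxSize, char[] punctuation)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (maxSize < 1) throw new ArgumentOutOfRangeException(nameof(maxSize));

        var chunks = new List<string>();
        int start = 0;
        while (start < text.Length)
        {
            // The tail fits in one chunk, so no split is needed.
            if (text.Length - start <= maxSize)
            {
                chunks.Add(text.Substring(start));
                break;
            }

            // Search backward through the window [start, last].
            int last = start + maxSize - 1;
            int cut = text.LastIndexOfAny(punctuation, last, maxSize);
            if (cut < 0) cut = text.LastIndexOf(' ', last, maxSize);
            if (cut < 0) cut = last; // last resort: break mid-word

            // The split character stays at the end of the chunk.
            chunks.Add(text.Substring(start, cut - start + 1));
            start = cut + 1;
        }
        return chunks;
    }
}

public static class Demo
{
    public static void Main()
    {
        var chunks = TextChunker.Chunk("One two. Three four five", 10,
                                       new[] { '.', '!', '?' });
        foreach (string c in chunks)
            Console.WriteLine("[" + c + "]");
    }
}
```

Because the split character is kept at the end of each chunk and nothing is trimmed, concatenating the chunks gives back the original text. If the receiving system doesn't want leading or trailing spaces you'd have to trim them yourself, and a hard cut could still land in the middle of a surrogate pair.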
I'm sure this will end up being more difficult than you're expecting (as most natural language things do), but check out Sharp Natural Language Parser.
I'm currently using SharpNLP. It works pretty well, but there are always gotchas.
Let me know if this isn't what you're looking for.
Mark