String chunking algorithm with natural language context

I have an arbitrarily large string of text from the user that needs to be split into chunks of at most 10k characters (the limit may be adjustable) and sent off to another system for processing.

  • Chunks cannot be longer than 10k (or other arbitrary value)
  • Text should be broken with natural language context in mind
    • split on punctuation when possible
    • split on spaces if no punctuation exists
    • break a word as a last resort

I'm trying not to re-invent the wheel with this. Any suggestions before I roll this from scratch?

Using C#.


This may not handle every case as you need, but it should get you on your way.

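    // Requires: using System; using System.Collections.Generic;
    public IList<string> ChunkifyText(string bigString, int maxSize, char[] punctuation)
    {
        if (maxSize < 1)
            throw new ArgumentOutOfRangeException("maxSize");

        List<string> results = new List<string>();
        int startIndex = 0;

        while (startIndex < bigString.Length)
        {
            // The remaining text fits in one chunk: take it whole and stop.
            if (bigString.Length - startIndex <= maxSize)
            {
                results.Add(bigString.Substring(startIndex));
                break;
            }

            string chunk = bigString.Substring(startIndex, maxSize);

            // Prefer to end the chunk on the last punctuation mark.
            int endIndex = chunk.LastIndexOfAny(punctuation);

            // Otherwise end it on the last space.
            if (endIndex < 0)
                endIndex = chunk.LastIndexOf(' ');

            // Last resort: break mid-word at the size limit.
            if (endIndex < 0)
                endIndex = chunk.Length - 1;

            results.Add(chunk.Substring(0, endIndex + 1));

            startIndex += endIndex + 1;
        }

        return results;
    }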
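
If the chunks are handed to the other system one at a time, a lazy variant may be convenient. The following is only a minimal sketch under a few assumptions: the names `ChunkDemo` and `Chunkify`, the punctuation set, and the sample text are made up for illustration. It applies the same three fallbacks (punctuation, then space, then a hard break), but searches the original string in place instead of copying each window first.

```csharp
using System;
using System.Collections.Generic;

public static class ChunkDemo
{
    // Hypothetical helper: lazily yields chunks of at most maxSize characters.
    public static IEnumerable<string> Chunkify(string text, int maxSize, char[] punctuation)
    {
        if (text == null) throw new ArgumentNullException("text");
        if (maxSize < 1) throw new ArgumentOutOfRangeException("maxSize");

        int start = 0;
        while (start < text.Length)
        {
            if (text.Length - start <= maxSize)
            {
                // Everything that is left fits in one chunk.
                yield return text.Substring(start);
                yield break;
            }

            // Search backwards inside the window [start, start + maxSize).
            int last = start + maxSize - 1;
            int cut = text.LastIndexOfAny(punctuation, last, maxSize);
            if (cut < 0)
                cut = text.LastIndexOf(' ', last, maxSize);
            if (cut < 0)
                cut = last; // last resort: break inside a word

            yield return text.Substring(start, cut - start + 1);
            start = cut + 1;
        }
    }

    public static void Main()
    {
        // Sample input chosen to exercise all three fallbacks with a tiny limit.
        string sample = "One. Two three four, five sixseveneightnine";
        foreach (string chunk in Chunkify(sample, 12, new[] { '.', ',', '!', '?' }))
            Console.WriteLine("[" + chunk + "]");
    }
}
```

Whitespace is kept inside the chunks, so concatenating them gives back the original text; trim on the receiving side if that matters. Also note that `maxSize` counts `char` values (UTF-16 code units), so the hard break could in principle land in the middle of a surrogate pair.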


This will probably end up being more difficult than you're expecting (most natural language things are), but check out Sharp Natural Language Parser.

I'm currently using SharpNLP. It works pretty well, but there are always gotchas.

Let me know if this isn't what you're looking for.

Mark
