开发者

Create short human-readable string from longer string

I have a requirement to contract a string such as...

Would you consider becoming a robot? You would be provided with a free annual oil change."

...to something much shorter but yet still humanly identifiable (it will need to be found from a select list - my current solution has users entering an arbitrary title for the sole purpose of selection)

I would like to extract only the portion of the string which forms a question (if possible) and then somehow reduce it to something 开发者_JAVA百科like

WouldConsiderBecomingRobot

Are there any grammatical algorithms out there that might help me with this? I'm thinking there might be something that allows be to pick out just verbs and nouns.

As this is just to act as a key it doesn't have to be perfect; I'm not seeking to trivialise the inherant complexity of the english language.


Probably too simplistic, but I might be tempted to start with a list of "filler words":

var fillers = new[]{"you","I","am","the","a","are"};

Then extract everything before a questionmark (using regex, string mashing, whatever you fancy), yielding you "Would you consider becoming a robot".

Then go through the string extracting every word considered a filler.

var sentence = "Would you consider becoming a robot";
var newSentence = String.Join("",sentence.Split(" ").Where(w => !fillers.Contains(w)).ToArray());
// newSentence is "Wouldconsiderbecomingrobot".

Pascal casing each word would result in your desired string - i'll leave that as an excercise for the reader.


Create a popular social media website. When users want to join or post comments, prompt them to solve a captcha. The captcha will consist of matching your shortened versions of the long strings to their full versions. Your shortening algorithm will be based on a neural net or genetic algorithm which is trained from the capcha results.

You can also sell advertising on the website.


I ended up creating the following extension method which does work surprisingly well. Thanks to Joe Blow for his excellent and effective suggestions:

    public static string Contract(this string e, int maxLength)
    {
        if(e == null) return e;

        int questionMarkIndex = e.IndexOf('?');
        if (questionMarkIndex == -1)
            questionMarkIndex = e.Length - 1;

        int lastPeriodIndex = e.LastIndexOf('.', questionMarkIndex, 0);

        string question = e.Substring(lastPeriodIndex != -1 ? lastPeriodIndex : 0, questionMarkIndex + 1).Trim();

        var punctuation =
            new [] {",", ".", "!", ";", ":", "/", "...", "...,", "-,", "(", ")", "{", "}", "[", "]","'","\""};

        question = punctuation.Aggregate(question, (current, t) => current.Replace(t, ""));

        IDictionary<string, bool> words = question.Split(' ').ToDictionary(x => x, x => false);

        string mash = string.Empty;
        while (words.Any(x => !x.Value) && mash.Length < maxLength)
        {
            int maxWordLength = words.Where(x => !x.Value).Max(x => x.Key.Length);
            var pair = words.Where(x => !x.Value).Last(x => x.Key.Length == maxWordLength);
            words.Remove(pair);
            words.Add(new KeyValuePair<string, bool>(pair.Key, true));
            mash = string.Join("", words.Where(x => x.Value)
                                       .Select(x => x.Key.Capitalize())
                                       .ToArray()
                );
        }

        return mash;
    }

This contracts the following to 15 chars:

  • This does not have any prereqs - write an essay...: PrereqsWriteEssay
  • You've selected a car: YouveSelectedCar


I don't think there is any algorithm that can identify if each word of a string is a noun, adjective or whatever. The only solution would be to use a custom dictionary : just create a list of words that can't be identified as verbs or nouns (I, you, they, them, his, hers, of, a, the etc.).

Then you just have to keep all the words before the question mark that are not in the list.

It is just a workaround, and I you said, it is not perfect.

Hope this helps !


Welcome to the wonderful world of natural language processing. If you want to identify nouns and verbs, you will need a part of speech tagger.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜