
String splitting problem with multiword expressions

I have a series of strings like:

'i would like a blood orange'

I also have a list of strings like:

["blood orange", "loan shark"]

Operating on the string, I want the following list:

["i", "would", "like", "a", "blood orange"]

What is the best way to get the above list? I've been using re throughout my code, but I'm stumped with this issue.


This is a fairly straightforward generator implementation: split the string into words, group together words which form phrases, and yield the results.

(There may be a cleaner way to handle skip, but for some reason I'm drawing a blank.)

def split_with_phrases(sentence, phrase_list):
    words = sentence.split(" ")
    phrases = set(tuple(s.split(" ")) for s in phrase_list)
    # default=0 keeps this working when phrase_list is empty.
    max_phrase_length = max((len(p) for p in phrases), default=0)

    # Find a phrase within words starting at the specified index.  Return the
    # phrase as a tuple, or None if no phrase starts at that index.
    def find_phrase(start_idx):
        # Iterate backwards, so we'll always find longer phrases before shorter ones.
        # Otherwise, if we have a phrase set like "hello world" and "hello world two",
        # we'll never match the longer phrase because we'll always match the shorter
        # one first.
        for phrase_length in range(max_phrase_length, 0, -1):
            test_word = tuple(words[start_idx:start_idx+phrase_length])
            if test_word in phrases:
                return test_word
        return None

    skip = 0
    for idx in range(len(words)):
        if skip:
            # This word was returned as part of a previous phrase; skip it.
            skip -= 1
            continue

        phrase = find_phrase(idx)
        if phrase is not None:
            # The phrase covers this word plus len(phrase) - 1 following words.
            skip = len(phrase) - 1
            yield " ".join(phrase)
            continue

        yield words[idx]

print(list(split_with_phrases('i would like a blood orange',
    ["blood orange", "loan shark"])))
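One possible way to avoid the skip counter is to drive the loop with an explicit index and jump past a phrase once it has matched. This is only a sketch in Python 3; the name split_with_phrases_while is made up for illustration, and it returns a list instead of yielding.

```python
def split_with_phrases_while(sentence, phrase_list):
    """Variant without a skip counter (hypothetical name, not from the answer)."""
    words = sentence.split()
    phrases = {tuple(p.split()) for p in phrase_list}
    max_len = max((len(p) for p in phrases), default=0)

    result = []
    idx = 0
    while idx < len(words):
        # Try the longest candidate first so longer phrases win over shorter ones.
        for length in range(min(max_len, len(words) - idx), 0, -1):
            candidate = tuple(words[idx:idx + length])
            if candidate in phrases:
                result.append(" ".join(candidate))
                idx += length  # jump past the whole phrase
                break
        else:
            # No phrase starts at this index; emit the single word.
            result.append(words[idx])
            idx += 1
    return result


print(split_with_phrases_while("i would like a blood orange",
                               ["blood orange", "loan shark"]))
# ['i', 'would', 'like', 'a', 'blood orange']
```

Because the index advances by the phrase length, there is no separate state to keep in sync with the loop.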


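Ah, this is crude and ugly, but it looks like it works. You may want to clean it up and optimize it, but some of the ideas here might be useful. Note that it splits each string around the phrases rather than into individual words.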

list_to_split = ['i would like a blood orange', 'i would like a blood orange ttt blood orange']
input_list = ["blood orange", "loan shark"]

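for item in input_list:
    for str_lst in list_to_split:
        if item in str_lst:
            tmp = str_lst.split(item)
            lst = []
            for i, itm in enumerate(tmp):
                if itm != '':
                    lst.append(itm)
                # Put the phrase back between pieces, but not after the last one.
                if i < len(tmp) - 1:
                    lst.append(item)
            print(lst)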

output:

['i would like a ', 'blood orange']
['i would like a ', 'blood orange', ' ttt ', 'blood orange']
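To get from these chunks to the word list the question asks for, the non-phrase chunks still need to be split on whitespace. A minimal sketch of that extra step in Python 3, assuming the chunks look like the output above (chunks_to_words is a made-up helper name):

```python
def chunks_to_words(chunks, phrase_list):
    """Split every chunk that is not a known phrase into single words."""
    result = []
    for chunk in chunks:
        if chunk in phrase_list:
            # Keep multiword phrases intact.
            result.append(chunk)
        else:
            # str.split() with no argument also drops the stray spaces.
            result.extend(chunk.split())
    return result


chunks = ['i would like a ', 'blood orange', ' ttt ', 'blood orange']
print(chunks_to_words(chunks, ["blood orange", "loan shark"]))
# ['i', 'would', 'like', 'a', 'blood orange', 'ttt', 'blood orange']
```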


One quick and dirty, completely unoptimized approach is to replace each compound in the string with a version that uses a different separator (preferably one that does not occur anywhere else in your target string or compound words), then split on whitespace and swap the original separator back in. A more efficient approach would be to iterate through the string only once, matching the compound words where appropriate, but you may have to watch out for nested compounds and the like, depending on your list.


#!/usr/bin/python
import re

my_string = "i would like a blood orange"
compounds = ["blood orange", "loan shark"]
for compound in compounds:
    # Temporarily join each compound with "&" so it survives the whitespace split.
    my_string = my_string.replace(compound, compound.replace(" ", "&"))

my_segs = re.split(r"\s+", my_string)
# Restore the spaces inside the compounds.
my_segs = [seg.replace("&", " ") for seg in my_segs]
print(my_segs)

Edit: Glenn Maynard's solution is better.
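For what it's worth, the single-pass idea mentioned above could be sketched with re.findall: build one pattern that tries every compound first and falls back to a plain run of non-whitespace. This is only a sketch in Python 3; tokenize is a made-up name, and the \b boundaries assume the compounds start and end with word characters.

```python
import re


def tokenize(text, compounds):
    """Return words, keeping known compounds together (illustrative helper)."""
    # Longest compounds first, so a longer one is tried before any shorter
    # compound that is a prefix of it.
    ordered = sorted(compounds, key=len, reverse=True)
    # re.escape guards against regex metacharacters inside a compound.
    alternatives = [r"\b" + re.escape(c) + r"\b" for c in ordered]
    # Fallback: any single whitespace-delimited word.
    alternatives.append(r"\S+")
    pattern = re.compile("|".join(alternatives))
    return pattern.findall(text)


print(tokenize("i would like a blood orange", ["blood orange", "loan shark"]))
# ['i', 'would', 'like', 'a', 'blood orange']
```

Since the pattern has no capturing groups, findall returns the full matches in order, and no placeholder separator is needed.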
