String splitting problem with multiword expressions

2023-01-21 01:23 Author:

I have a series of strings like:

'i would like a blood orange'

I also have a list of strings like:

["blood orange", "loan shark"]

Operating on the string, I want the following list:

["i", "would", "like", "a", "blood orange"]

What is the best way to get the above list? I've been using re throughout my code, but I'm stumped with this issue.

This is a fairly straightforward generator implementation: split the string into words, group together words which form phrases, and yield the results.

(There may be a cleaner way to handle skip, but for some reason I'm drawing a blank.)

def split_with_phrases(sentence, phrase_list):
    words = sentence.split(" ")
    # Store each phrase as a tuple of words so that slices of `words` can be
    # looked up directly in the set.
    phrases = set(tuple(s.split(" ")) for s in phrase_list)
    max_phrase_length = max(len(p) for p in phrases) if phrases else 0

    # Find a phrase within words starting at the specified index.  Return the
    # phrase as a tuple, or None if no phrase starts at that index.
    def find_phrase(start_idx):
        # Iterate backwards, so we'll always find longer phrases before shorter ones.
        # Otherwise, if we have a phrase set like "hello world" and "hello world two",
        # we'll never match the longer phrase because we'll always match the shorter
        # one first.
        for phrase_length in range(max_phrase_length, 0, -1):
            candidate = tuple(words[start_idx:start_idx + phrase_length])
            if candidate in phrases:
                return candidate
        return None

    skip = 0
    for idx in range(len(words)):
        if skip:
            # This word was returned as part of a previous phrase; skip it.
            skip -= 1
            continue

        phrase = find_phrase(idx)
        if phrase is not None:
            # The phrase's first word is consumed right here, so only the
            # remaining len(phrase) - 1 words need to be skipped.
            skip = len(phrase) - 1
            yield " ".join(phrase)
            continue

        yield words[idx]

print(list(split_with_phrases('i would like a blood orange',
    ["blood orange", "loan shark"])))
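
One possibly cleaner way to handle skip is to drop the counter and advance an explicit index with a while loop. This is only a sketch, assuming Python 3 and whitespace-separated words; the name split_with_phrases_indexed is made up for illustration and is not part of the answer above.

```python
def split_with_phrases_indexed(sentence, phrase_list):
    words = sentence.split()
    # Each phrase becomes a tuple of words for direct set lookup.
    phrases = {tuple(p.split()) for p in phrase_list}
    max_len = max((len(p) for p in phrases), default=0)

    idx = 0
    while idx < len(words):
        # Try the longest possible phrase first.
        for length in range(max_len, 0, -1):
            candidate = tuple(words[idx:idx + length])
            if candidate in phrases:
                yield " ".join(candidate)
                # Jump past every word the phrase consumed.
                idx += len(candidate)
                break
        else:
            # No phrase starts here; emit the single word.
            yield words[idx]
            idx += 1


print(list(split_with_phrases_indexed('i would like a blood orange',
                                      ["blood orange", "loan shark"])))
# ['i', 'would', 'like', 'a', 'blood orange']
```

Because the index jumps by the length of the matched phrase, there is no separate counter that has to stay in sync with the loop.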

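Ah, this is crazy, crude and ugly, but it looks like it works on these inputs. You may want to clean it up and optimize it, but some of the ideas here might help. Note that it only separates the phrases from the surrounding text; the remaining pieces are not split into words.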

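list_to_split = ['i would like a blood orange', 'i would like a blood orange ttt blood orange']
input_list = ["blood orange", "loan shark"]

for item in input_list:
    for str_lst in list_to_split:
        if item in str_lst:
            # Splitting on the phrase removes it, so put it back between pieces.
            tmp = str_lst.split(item)
            lst = []
            for i, itm in enumerate(tmp):
                if itm != '':
                    lst.append(itm)
                # The phrase sits between consecutive pieces, not after the last one.
                if i < len(tmp) - 1:
                    lst.append(item)
            print(lst)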

output:

['i would like a ', 'blood orange']
['i would like a ', 'blood orange', ' ttt ', 'blood orange']
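
To get from this output to the word list the question asks for, the leftover pieces still have to be split on whitespace. A rough sketch of one way to do that, assuming Python 3 and that no phrase is an empty string; split_keeping_phrases is a hypothetical name, and the matching is plain substring matching, so "blood oranges" would also be cut after "blood orange".

```python
def split_keeping_phrases(text, phrases):
    # With no phrases left to protect, fall back to a plain whitespace split.
    if not phrases:
        return text.split()

    first, rest = phrases[0], phrases[1:]
    pieces = text.split(first)
    result = []
    for i, piece in enumerate(pieces):
        # Handle the remaining phrases inside each piece.
        result.extend(split_keeping_phrases(piece, rest))
        # Re-insert the phrase between consecutive pieces.
        if i < len(pieces) - 1:
            result.append(first)
    return result


print(split_keeping_phrases('i would like a blood orange ttt blood orange',
                            ["blood orange", "loan shark"]))
# ['i', 'would', 'like', 'a', 'blood orange', 'ttt', 'blood orange']
```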

One quick and dirty, completely unoptimized approach is to replace the spaces inside each compound with a different separator (preferably one that does not occur anywhere else in your target string or compound words), split on whitespace, and then swap the separator back to a space in each piece. A more efficient approach would be to iterate through the string only once, matching the compound words where appropriate, but you may have to watch out for nested compounds and the like, depending on your list.


#!/usr/bin/python
import re

my_string = "i would like a blood orange"
compounds = ["blood orange", "loan shark"]
# Protect each compound by swapping its spaces for a placeholder separator.
for compound in compounds:
    my_string = my_string.replace(compound, compound.replace(" ", "&"))

# Split on whitespace, then restore the spaces inside each compound.
my_segs = [seg.replace("&", " ") for seg in re.split(r"\s+", my_string)]
print(my_segs)
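
The single-pass idea mentioned above could be sketched with re, which the question already uses. This is an illustration under assumptions, not a tested production solution: it assumes Python 3, that phrases do not start or end with whitespace, and that words are separated by single spaces inside a phrase. The function name tokenize is hypothetical.

```python
import re


def tokenize(sentence, phrases):
    # Longest phrases first, so a nested compound such as "hello world"
    # does not shadow "hello world two".
    ordered = sorted(phrases, key=len, reverse=True)
    # (?!\S) requires the phrase to end at whitespace or end of string,
    # so "blood orange" does not match inside "blood oranges".
    alternatives = [re.escape(p) + r"(?!\S)" for p in ordered]
    # Fall back to any run of non-whitespace characters (a single word).
    alternatives.append(r"\S+")
    pattern = re.compile("|".join(alternatives))
    return pattern.findall(sentence)


print(tokenize("i would like a blood orange", ["blood orange", "loan shark"]))
# ['i', 'would', 'like', 'a', 'blood orange']
```

The alternation is tried left to right at each position, which is why the ordering of the phrases matters.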

Edit: Glenn Maynard's solution is better.
