Need help with splitting a string in Python
I am trying to tokenize a string using the pattern below.
>>> import re
>>> splitter = re.compile(r'((\w*)(\d*)\-\s?(\w*)(\d*)|(?x)\$?\d+(\.\d+)?(\,\d+)?|([A-Z]\.)+|(Mr)\.|(Sen)\.|(Miss)\.|.$|\w+|[^\w\s])')
>>> splitter.split("Hello! Hi, I am debating this predicament called life. Can you help me?")
I get the following output. Could someone point out what I need to correct? I'm confused by all of the None values. Also, if there is a better way to tokenize a string, I'd really appreciate the additional help.
['', 'Hello', None, None, None, None, None, None, None, None, None, None, '', '!', None, None, None, None, None, None, None, None, None, None, ' ', 'Hi', None, None, None, None, None, None, None, None, None, None, '', ',', None, None, None, None, None, None, None, None, None, None, ' ', 'I', None, None, None, None, None, None, None, None, None, None, ' ', 'am', None, None, None, None, None, None, None, None, None, None, ' ', 'debating', None, None, None, None, None, None, None, None, None, None, ' ', 'this', None, None, None, None, None, None, None, None, None, None, ' ', 'predicament', None, None, None, None, None, None, None, None, None, None, ' ', 'called', None, None, None, None, None, None, None, None, None, None, ' ', 'life', None, None, None, None, None, None, None, None, None, None, '', '.', None, None, None, None, None, None, None, None, None, None, ' ', 'Can', None, None, None, None, None, None, None, None, None, None, ' ', 'you', None, None, None, None, None, None, None, None, None, None, ' ', 'help', None, None, None, None, None, None, None, None, None, None, ' ', 'me', None, None, None, None, None, None, None, None, None, None, '', '?', None, None, None, None, None, None, None, None, None, None, '']
The output that I'd like is:
['Hello', '!', 'Hi', ',', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life', '.', 'Can', 'you', 'help', 'me', '?']
Thank you.
I recommend NLTK's tokenizers. Then you don't need to worry about tedious regular expressions yourself:
>>> import nltk
>>> nltk.word_tokenize("Hello! Hi, I am debating this predicament called life. Can you help me?")
['Hello', '!', 'Hi', ',', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life.', 'Can', 'you', 'help', 'me', '?']
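One caveat: depending on your NLTK version and installation, you may first need a one-time download of the tokenizer models before word_tokenize will run, e.g.:
>>> nltk.download('punkt')   # one-time download of the Punkt tokenizer data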
re.split rapidly runs out of puff when used as a tokeniser. Preferable is findall (or match in a loop) with a pattern of alternatives like this|that|another|more:
>>> s = "Hello! Hi, I am debating this predicament called life. Can you help me?"
>>> import re
>>> re.findall(r"\w+|\S", s)
['Hello', '!', 'Hi', ',', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life', '.', 'Can', 'you', 'help', 'me', '?']
>>>
This defines tokens as either one or more "word" characters, or a single character that's not whitespace. You may prefer [A-Za-z] or [A-Za-z0-9] or something else instead of \w (which allows underscores). You may even want something like r"[A-Za-z]+|[0-9]+|\S".
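For example (the sample string here is just made up to show the difference), the letters/digits variant keeps runs of letters and runs of digits as separate tokens:
>>> re.findall(r"[A-Za-z]+|[0-9]+|\S", "Call me at 555-1234, Mr. Smith!")
['Call', 'me', 'at', '555', '-', '1234', ',', 'Mr', '.', 'Smith', '!']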
If things like Sen., Mr. and Miss (what happened to Mrs and Ms?) are significant to you, your regex should not list them out; it should just define a token that ends in ".", and you should have a dictionary or set of probable abbreviations.
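A minimal sketch of that idea, assuming a small illustrative abbreviation set (not a complete list) and a made-up sample sentence:

import re

ABBREVIATIONS = {"Mr.", "Mrs.", "Ms.", "Sen.", "Dr."}   # illustrative only

def tokenize(text):
    tokens = []
    # word-plus-dot candidates first, then bare words, digits,
    # or any other single non-whitespace character
    for tok in re.findall(r"[A-Za-z]+\.|[A-Za-z]+|[0-9]+|\S", text):
        if tok.endswith(".") and tok not in ABBREVIATIONS:
            # not a known abbreviation: split the trailing "." off as punctuation
            tokens.extend([tok[:-1], "."])
        else:
            tokens.append(tok)
    return tokens

print(tokenize("Sen. Smith met Mr. Jones."))
# ['Sen.', 'Smith', 'met', 'Mr.', 'Jones', '.']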
Splitting text into sentences is complicated. You may like to look at the nltk package instead of trying to reinvent the wheel.
Update: if you need/want to distinguish between the types of tokens, you can get an index or a name like this without a (possibly long) chain of if/elif/elif/.../else:
>>> s = "Hello! Hi, I we 0 1 987?"
>>> pattern = r"([A-Za-z]+)|([0-9]+)|(\S)"
>>> list((m.lastindex, m.group()) for m in re.finditer(pattern, s))
[(1, 'Hello'), (3, '!'), (1, 'Hi'), (3, ','), (1, 'I'), (1, 'we'), (2, '0'), (2, '1'), (2, '987'), (3, '?')]
>>> pattern = r"(?P<word>[A-Za-z]+)|(?P<number>[0-9]+)|(?P<other>\S)"
>>> list((m.lastgroup, m.group()) for m in re.finditer(pattern, s))
[('word', 'Hello'), ('other', '!'), ('word', 'Hi'), ('other', ','), ('word', 'I'), ('word', 'we'), ('number', '0'), ('number', '1'), ('number', '987'), ('other', '?')]
>>>
I could be missing something, but I believe something like the following would work:
s = "Hello! Hi, I am debating this predicament called life. Can you help me?"
s.split(" ")
This assumes you want to split on spaces. You should get something along the lines of:
['Hello!', 'Hi,', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life.', 'Can', 'you', 'help', 'me?']
With this, if you needed a specific piece, you could loop through the list to get what you need.
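As a side note, calling s.split() with no argument splits on any run of whitespace (spaces, tabs, newlines), which avoids empty strings when words are separated by more than one space:
>>> "Hello!   Hi,\tthere".split()
['Hello!', 'Hi,', 'there']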
Hopefully this helps....
The reason you're getting all of those Nones is that you have lots of parenthesized groups in your regular expression, separated by |s. Every time your regular expression finds a match, it's only matching one of the alternatives given by the |s. The parenthesized groups in the other, unused alternatives get set to None. And re.split by definition reports the values of all parenthesized groups every time it gets a match, hence all the Nones in your result.
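You can see the same effect on a small scale with a made-up two-alternative pattern: for every separator, the group belonging to the alternative that didn't match comes back as None:
>>> re.split(r"([a-z]+)|([0-9]+)", "ab 12")
['', 'ab', None, ' ', None, '12', '']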
You could filter those out pretty easily (e.g. tokens = [t for t in tokens if t] or something similar), but I think split isn't really the tool you want for tokenizing; split is meant for just throwing away whitespace. If you really want to use regular expressions to tokenize something, here's a toy example of another method (I'm not going to even try to unpack that monster regex you're using... use the re.VERBOSE option for the love of Ned... but hopefully this toy example will give you the idea):
tokenpattern = re.compile(r"""
 (?P<words>[A-Za-z_]+)  # Things with just letters and underscores
|(?P<numbers>\d+)       # Things with just digits
|(?P<other>.+?)         # Anything else
""", re.VERBOSE)
The (?P<something>...) business lets you identify the type of token you're looking for by name in the code below:
for match in tokenpattern.finditer("99 bottles of beer"):
    if match.group('words'):
        # This token is a word
        word = match.group('words')
        #...
    elif match.group('numbers'):
        # This token is a number
        number = int(match.group('numbers'))
    else:
        # Anything else (punctuation, whitespace, ...)
        other = match.group('other')
Note that this is still a regex using a bunch of parenthesized groups separated by |s, so the same thing is going to happen as in your code: for each match, one group will be defined and the others will be set to None. This method checks for that explicitly.
Perhaps he didn't mean it as such, but John Machin's comment "str.split is NOT a place to get started" (as part of the exchange after Frank V's answer) came as a bit of a challenge. So ...
the_string = "Hello! Hi, I am debating this predicament called life. Can you help me?"
tokens = the_string.split()
punctuation = ['!', ',', '.', '?']
output_list = []
for token in tokens:
    if token[-1] in punctuation:
        output_list.append(token[:-1])
        output_list.append(token[-1])
    else:
        output_list.append(token)
print output_list
This seems to provide the requested output.
Granted, John's answer is simpler in terms of the number of lines of code. However, I have a couple of points to make in support of this sort of solution.
I don't completely agree with Jamie Zawinski's 'Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.' (Neither did he from what I've read.) My point in quoting this is that regular expressions can be a pain to get working if you're not accustomed to them.
Also, while it won't normally be an issue, the performance of the above solution was consistently better than the regex solution, when measured with timeit. The above solution (with the print statement removed) came in at about 8.9 seconds; John's regular expression solution came in at about 11.8 seconds. This involved 10 tries each of 1 million iterations on a quad core dual processor system running at 2.4 GHz.
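For reference, a comparison along those lines could be scripted with timeit roughly as follows (the statements and repetition counts here are illustrative, and the actual numbers will depend on the machine and Python version):

import timeit

setup = 'import re; s = "Hello! Hi, I am debating this predicament called life. Can you help me?"'

regex_stmt = r're.findall(r"\w+|\S", s)'

split_stmt = """
out = []
for token in s.split():
    if token[-1] in '!,.?':
        out.append(token[:-1])
        out.append(token[-1])
    else:
        out.append(token)
"""

# 10 repeats of 1,000,000 iterations each, as in the figures quoted above
print(min(timeit.repeat(regex_stmt, setup=setup, repeat=10, number=1000000)))
print(min(timeit.repeat(split_stmt, setup=setup, repeat=10, number=1000000)))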