
Return a list of words from a list of lines with a regexp

I'm running the following code on a list of strings to return a list of its words:

words = [re.split('\\s+', line) for line in lines]

However, I end up getting something like:

[['import', 're', ''], ['', ''], ['def', 'word_count(filename):', ''], ...]

As opposed to the desired:

['import', 're', '', '', '', 'def', 'word_count(filename):', '', ...]

How can I unpack the lists re.split('\\s+', line) produces in the above list comprehension? Naïvely, I tried using * but that doesn't work.

(I'm looking for a simple and Pythonic way of doing this; I was tempted to write a function, but I'm sure the language already accommodates this.)


>>> import re
>>> from itertools import chain
>>> lines = ["hello world", "second line", "third line"]
>>> words = chain(*[re.split(r'\s+', line) for line in lines])

This will give you an iterator that can be used for looping through all words:

>>> for word in words:
...    print(word)
... 
hello
world
second
line
third
line

Creating a list instead of an iterator is just a matter of wrapping the iterator in a list call:

>>> words = list(chain(*[re.split(r'\s+', line) for line in lines]))
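As a variation on the same idea, chain.from_iterable takes the iterable of lists directly, so the intermediate list and the * unpacking can be dropped. A minimal sketch, reusing the sample lines from above:

```python
import re
from itertools import chain

lines = ["hello world", "second line", "third line"]

# The generator expression hands each per-line list to chain.from_iterable,
# which yields the words one by one without building a list of lists first.
words = list(chain.from_iterable(re.split(r'\s+', line) for line in lines))
print(words)
```

This should behave the same as the chain(*[...]) version; it mainly matters when lines is large or is itself a lazy iterator.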


The reason you get a list of lists is that re.split() returns a list, which is then 'appended' to the list comprehension output.

It's unclear why you are splitting line by line (perhaps it's just a simplified example), but if you can get the full content (all lines) as a single string, you can just do

words = re.split(r'\s+', lines)

If lines is the result of:

open('filename').readlines()

use

open('filename').read()

instead.
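To illustrate the difference, here is a small sketch that uses a made-up string in place of the file contents (the sample text is an assumption, not from the question). Note that splitting the whole text with the regexp still leaves an empty string where the text ends in a newline, whereas str.split() with no arguments drops it:

```python
import re

# Hypothetical stand-in for open('filename').read()
text = "hello world\nsecond line\n"

# One split over the whole text gives a flat list straight away,
# but the trailing newline produces a final empty string.
words_re = re.split(r'\s+', text)

# str.split() with no arguments splits on runs of whitespace
# and discards leading/trailing empty strings.
words_plain = text.split()

print(words_re)
print(words_plain)
```

Which of the two is preferable depends on whether the empty strings are wanted in the result.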


You can always do this:

words = []
for line in lines:
    words.extend(re.split(r'\s+', line))

It's not nearly as elegant as a one-liner list comprehension, but it gets the job done.


Just stumbled across this old question, and I think I have a better solution. Normally if you want to nest a list comprehension ("append" each list), you think backwards (un-for-loop-like). This is not what you want:

>>> import re
>>> lines = ["hello world", "second line", "third line"]
>>> [[word for word in re.split(r'\s+', line)] for line in lines]
[['hello', 'world'], ['second', 'line'], ['third', 'line']]

However if you want to "extend" instead of "append" the lists you're generating, just leave out the extra set of square brackets and reverse your for-loops (putting them back in the "right" order).

>>> [word for line in lines for word in re.split(r'\s+', line)]
['hello', 'world', 'second', 'line', 'third', 'line']

This seems like a more Pythonic solution to me since it is based on list-processing logic rather than some random library function. Every programmer should know how to do this (especially ones trying to learn Lisp!)
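For what it's worth, the same comprehension also reproduces the output the question asked for when the lines keep their trailing newlines. A quick sketch, where the sample lines are an assumption reconstructed from the output shown in the question:

```python
import re

# Assumed input: lines as readlines() would return them, newlines included
lines = ["import re\n", "\n", "def word_count(filename):\n"]

# Outer loop over lines first, inner loop over each line's words second,
# written in the same order as the equivalent nested for-loops.
words = [word for line in lines for word in re.split(r'\s+', line)]
print(words)
```

The empty strings come from the newline at the end of each line, which is why they also show up in the question's desired result.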
