Return list of words from a list of lines with regexp

Posted 2022-12-26 09:11

I'm running the following code on a list of strings to return a list of its words:

words = [re.split('\\s+', line) for line in lines]

However, I end up getting something like:

[['import', 're', ''], ['', ''], ['def', 'word_count(filename):', ''], ...]

As opposed to the desired:

['import', 're', '', '', '', 'def', 'word_count(filename):', '', ...]

How can I unpack the lists re.split('\\s+', line) produces in the above list comprehension? Naïvely, I tried using * but that doesn't work.

(I'm looking for a simple and Pythonic way of doing this; I was tempted to write a function, but I'm sure the language already accommodates this.)

>>> import re
>>> from itertools import chain
>>> lines = ["hello world", "second line", "third line"]
>>> words = chain(*[re.split(r'\s+', line) for line in lines])

This will give you an iterator that can be used for looping through all words:

>>> for word in words:
...    print(word)
... 
hello
world
second
line
third
line

Creating a list instead of an iterator is just a matter of wrapping the iterator in a list call:

>>> words = list(chain(*[re.split(r'\s+', line) for line in lines]))
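As a variation on the same idea, itertools.chain.from_iterable accepts the iterable of lists directly, so the intermediate list built only for the * unpacking is not needed. A minimal sketch, reusing the sample lines from above:

```python
import re
from itertools import chain

lines = ["hello world", "second line", "third line"]

# chain.from_iterable pulls from the generator lazily, one split line at a time
words = list(chain.from_iterable(re.split(r'\s+', line) for line in lines))
print(words)
```

Leaving off the list() call keeps the result as an iterator, just as with chain(*...).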

The reason you get a list of lists is that re.split() returns a list, which is then 'appended' as a single element to the list comprehension's output.

It's unclear why you are splitting line by line (perhaps this is just a simplified example), but if you can get the full content (all lines) as a single string, you can just do

words = re.split(r'\s+', lines)

If lines is the product of:

open('filename').readlines()

use

open('filename').read()

instead.
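A minimal sketch of the read() approach follows. The temporary file and its contents are made up for illustration so the example is self-contained. It also shows a detail that explains the empty strings in the question: re.split(r'\s+', ...) leaves an empty string where the text starts or ends with whitespace, whereas str.split() with no arguments does not.

```python
import os
import re
import tempfile

# Hypothetical sample file, created only so the example can run on its own
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("import re\n\ndef word_count(filename):\n")
    path = tmp.name

# read() returns the whole file as one string; the with block closes the file
with open(path) as f:
    content = f.read()
os.remove(path)

regex_words = re.split(r'\s+', content)  # keeps a trailing '' after the final newline
plain_words = content.split()            # drops leading/trailing whitespace entirely

print(regex_words)
print(plain_words)
```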

You can always do this:

words = []
for line in lines:
    # extend() adds each word individually; append() would add the whole list
    words.extend(re.split(r'\s+', line))

It's not nearly as elegant as a one-liner list comprehension, but it gets the job done.

Just stumbled across this old question, and I think I have a better solution. Normally, if you want to nest list comprehensions ("append" each list), you think backwards compared with how you would write the for-loops. This is not what you want:

>>> import re
>>> lines = ["hello world", "second line", "third line"]
>>> [[word for word in re.split(r'\s+', line)] for line in lines]
[['hello', 'world'], ['second', 'line'], ['third', 'line']]

However, if you want to "extend" instead of "append" the lists you're generating, just leave out the extra set of square brackets and reverse your for-loops (putting them back in the "right" order).

>>> [word for line in lines for word in re.split(r'\s+', line)]
['hello', 'world', 'second', 'line', 'third', 'line']
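The same flat comprehension also accepts a trailing if clause, which could be used to drop the empty strings that re.split produces for blank lines and trailing newlines. This is only a sketch: the sample lines are made up to resemble the question's input, and since the question's desired output kept the empty strings, the filter is an optional extra.

```python
import re

lines = ["import re\n", "\n", "def word_count(filename):\n"]

# Outer loop first, inner loop second: same order as the equivalent for-loops
flat = [word for line in lines for word in re.split(r'\s+', line)]

# A trailing condition keeps only non-empty words
words = [word for line in lines for word in re.split(r'\s+', line) if word]

print(flat)
print(words)
```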

This seems like a more Pythonic solution to me since it is based on list-processing logic rather than some random library function. Every programmer should know how to do this (especially ones trying to learn Lisp!)

Tags: list-comprehension python python-3.x regex
