Return list of words from a list of lines with regexp
I'm running the following code on a list of strings to return a list of its words:
words = [re.split('\\s+', line) for line in lines]
However, I end up getting something like:
[['import', 're', ''], ['', ''], ['def', 'word_count(filename):', ''], ...]
As opposed to the desired:
['import', 're', '', '', '', 'def', 'word_count(filename):', '', ...]
How can I unpack the lists re.split('\\s+', line)
produces in the above list comprehension? Naïvely, I tried using *
but that doesn't work.
(I'm looking for a simple and Pythonic way of doing this; I was tempted to write a function, but I'm sure the language already accommodates this.)
>>> import re
>>> from itertools import chain
>>> lines = ["hello world", "second line", "third line"]
>>> words = chain(*[re.split(r'\s+', line) for line in lines])
This will give you an iterator that can be used for looping through all words:
>>> for word in words:
...     print(word)
...
hello
world
second
line
third
line
Creating a list instead of an iterator is just a matter of wrapping the iterator in a list
call:
>>> words = list(chain(*[re.split(r'\s+', line) for line in lines]))
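A possible variant of the same idea, as a minimal sketch: chain.from_iterable accepts a single iterable of iterables, so a generator expression can be passed in directly and the * unpacking (along with the intermediate list of lists) is not needed. The sample lines are the same ones used above.

```python
import re
from itertools import chain

lines = ["hello world", "second line", "third line"]

# from_iterable consumes the generator lazily, one line at a time,
# so no intermediate list of lists is built before flattening.
words = list(chain.from_iterable(re.split(r'\s+', line) for line in lines))

print(words)
# ['hello', 'world', 'second', 'line', 'third', 'line']
```

For a handful of lines the difference is negligible; it mostly matters when lines is large or is itself a lazy iterator such as an open file.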
You get a list of lists because re.split() returns a list, which is then 'appended' to the list comprehension's output.
It's unclear why you are splitting line by line (perhaps it's just a simplified example), but if you can get the full content (all lines) as a single string, you can just do
words = re.split(r'\s+', lines)
where lines is that string. If lines is currently the product of:
open('filename').readlines()
use
open('filename').read()
instead.
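Here is a small sketch of what splitting the whole content at once looks like. The text variable is a hypothetical stand-in for what open('filename').read() would return; it is not taken from the question's actual file. Note that a run of whitespace spanning a blank line counts as one separator, so the result has fewer empty strings than the per-line approach would produce.

```python
import re

# Hypothetical stand-in for the result of open('filename').read()
text = "import re\n\ndef word_count(filename):\n"

# One split over the whole content; the trailing newline leaves one empty string.
words = re.split(r'\s+', text)
print(words)
# ['import', 're', 'def', 'word_count(filename):', '']

# str.split() with no arguments also splits on runs of whitespace,
# but drops leading and trailing empty strings.
print(text.split())
# ['import', 're', 'def', 'word_count(filename):']
```

If the empty strings are not wanted at all, text.split() is probably the simpler choice and does not need the re module.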
You can always do this:
words = []
for line in lines:
    words.extend(re.split('\\s+', line))
It's not nearly as elegant as a one-liner list comprehension, but it gets the job done.
Just stumbled across this old question, and I think I have a better solution. Normally, when you nest one list comprehension inside another (so that each inner list is "appended" to the outer one), you write the loops backwards compared with ordinary for-loops, with the outer loop last. This is not what you want:
>>> import re
>>> lines = ["hello world", "second line", "third line"]
>>> [[word for word in re.split(r'\s+', line)] for line in lines]
[['hello', 'world'], ['second', 'line'], ['third', 'line']]
However, if you want to "extend" instead of "append" the lists you're generating, just leave out the extra set of square brackets and reverse your for-clauses (putting them back in the "right" order, outer loop first).
>>> [word for line in lines for word in re.split(r'\s+', line)]
['hello', 'world', 'second', 'line', 'third', 'line']
This seems like a more Pythonic solution to me, since it relies on plain list-comprehension logic rather than on a separate library function. Every programmer should know how to do this (especially ones trying to learn Lisp!)
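To make the clause order concrete, here is a short sketch that writes the flat comprehension next to the equivalent nested for-loops. The names flat and expanded are only illustrative; the sample lines are the ones used above.

```python
import re

lines = ["hello world", "second line", "third line"]

# Flat comprehension: the for-clauses read left to right, outer loop first.
flat = [word for line in lines for word in re.split(r'\s+', line)]

# The same clauses written as ordinary nested loops, in the same order.
expanded = []
for line in lines:
    for word in re.split(r'\s+', line):
        expanded.append(word)

assert flat == expanded
print(flat)
# ['hello', 'world', 'second', 'line', 'third', 'line']
```

Reading the comprehension as "the loops, top to bottom, with the appended expression moved to the front" is one way to remember the order.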