How to tokenize the sample string using Regular Expression in Python?

I am new to regular expressions. In addition to the pattern that matches the following string, please also point out references and/or sample web sites.

The data string

1.  First1 Last1 - 20 (Long Description) 
2.  First2 Last2 - 40 (Another Description)

I want to be able to extract tuples {First1,Last1,20} and {First2,Last2,40} from the above string.


This one seems OK: http://docs.python.org/howto/regex.html#regex-howto Just skim it and try some examples. Regexps are a little tricky (basically a small programming language) and take some time to learn, but they are very useful to know. Just experiment and take one step at a time.

(yes, I could just give you the answer, but fish, man, teach)

...

As requested, a solution that doesn't use split(): iterate over the lines, and check each line:

import re

p = re.compile(r'\d+\.\s+(\w+)\s+(\w+)\s+-\s+(\d+)')
m = p.match(the_line)
# m.group(1) is the first word
# m.group(2) is the second word
# m.group(3) is the first number after the last word
# (m.group(0) is the entire match; m is None if the line does not conform)

The regexp is: <some digits><a dot>
<some whitespace><alphanumeric characters, captured as group 1>
<some whitespace><alphanumeric characters, captured as group 2>
<some whitespace><a '-'><some whitespace><digits, captured as group 3>

It's a little strict, but that way you'll catch non-conforming lines.
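
To make the "iterate over the lines" step concrete, here is a minimal sketch built around the same pattern. The names LINE_RE, parse_lines and data are made up for illustration, and converting the number to int is my assumption about what you want; lines that don't match are simply skipped.

```python
import re

# Same strict pattern as above: digits, a dot, two words, a dash, a number.
LINE_RE = re.compile(r'\d+\.\s+(\w+)\s+(\w+)\s+-\s+(\d+)')

def parse_lines(text):
    """Return (first, last, number) tuples, skipping non-conforming lines."""
    results = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # the line does not follow the expected layout
        first, last, number = m.groups()
        results.append((first, last, int(number)))
    return results

data = (
    "1.  First1 Last1 - 20 (Long Description)\n"
    "2.  First2 Last2 - 40 (Another Description)"
)

print(parse_lines(data))
```

If you'd rather be told about bad lines than skip them silently, raise an exception in place of the continue.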


There is no need to use regex here:

>>> foo = "1.  First1 Last1 - 20 (Long Description)"
>>> foo.split(" ")
['1.', '', 'First1', 'Last1', '-', '20', '(Long', 'Description)']

You can now select the elements you like (they will be at the same indices as long as every line has the same layout).

In 2.7+ you can use itertools.compress to select the elements:

from itertools import compress
tuple(compress(foo.split(" "), [0,0,1,1,0,1]))
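
A variant of the same idea, as a rough sketch: calling split() with no argument collapses runs of whitespace, so the empty string disappears and the indices shift by one. The helper name pick_fields is made up for illustration, and it assumes that first and last names are each a single word.

```python
def pick_fields(line):
    """Pick first name, last name and number from one line by position."""
    # split() without arguments splits on any run of whitespace:
    # ['1.', 'First1', 'Last1', '-', '20', '(Long', 'Description)']
    parts = line.split()
    return parts[1], parts[2], int(parts[4])

foo = "1.  First1 Last1 - 20 (Long Description)"
print(pick_fields(foo))
```

A line with fewer fields raises IndexError, and a non-numeric fifth field raises ValueError, so malformed input fails loudly.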


Based on Harman's partial solution, I came up with this:

(?P<first>\w+)\s+(?P<last>\w+)[-\s]*(?P<number>\d[\d,]*)

Code and output:

>>> import re
>>> regex = re.compile(r"(?P<first>\w+)\s+(?P<last>\w+)[-\s]*(?P<number>\d[\d,]*)")
>>> regex.findall(string)
[(u'First1', u'Last1', u'20'), (u'First2', u'Last2', u'40')]
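
Since the pattern uses named groups, you can also read the fields by name. Here is a small sketch using finditer and groupdict; the variable names NAMED_RE, text and records are mine, and text is assumed to hold the two sample lines from the question.

```python
import re

# Same pattern as above, written as a raw string.
NAMED_RE = re.compile(r"(?P<first>\w+)\s+(?P<last>\w+)[-\s]*(?P<number>\d[\d,]*)")

text = (
    "1.  First1 Last1 - 20 (Long Description)\n"
    "2.  First2 Last2 - 40 (Another Description)"
)

# finditer yields one match object per record; groupdict maps group names to text.
records = [m.groupdict() for m in NAMED_RE.finditer(text)]

for rec in records:
    print(rec["first"], rec["last"], rec["number"])
```

Note that the number stays a string here; the [\d,]* part allows commas, so strip those before calling int() if you need an integer.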