How to tokenize the sample string using a regular expression in Python?
I am new to regular expressions. Besides the pattern that matches the following string, please also point me to references and/or sample web sites.
The data string:
1. First1 Last1 - 20 (Long Description)
2. First2 Last2 - 40 (Another Description)
I want to be able to extract tuples {First1,Last1,20} and {First2,Last2,40} from the above string.
This one seems OK: http://docs.python.org/howto/regex.html#regex-howto Just skim it and try some examples. Regexps are a little tricky (basically a small programming language) and take some time to learn, but they are very useful to know. Just experiment and take one step at a time.
(yes, I could just give you the answer, but fish, man, teach)
...
As requested, a solution that doesn't use split(): iterate over the lines, and check each line:
import re
p = re.compile(r'\d+\.\s+(\w+)\s+(\w+)\s+-\s+(\d+)')
m = p.match(the_line)
# m.group(1) will be the first word
# m.group(2) the second word
# m.group(3) will be the first number after the last word
# (m.group(0) is the whole match; m is None if the line doesn't match)
The regexp is: <some digits><a dot>
<some whitespace><alphanumeric characters, captured as group 1>
<some whitespace><alphanumeric characters, captured as group 2>
<some whitespace><a '-'><some whitespace><digits, captured as group 3>
It's a little strict, but that way you'll catch non-conforming lines.
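A minimal sketch of that line-by-line loop, assuming the data arrives as one multi-line string. The names LINE_RE and parse_lines are made up for illustration and are not part of the answer above. It collects one (first, last, number) tuple per conforming line and skips the rest:

```python
import re

# Same pattern as above: "<digits>. <word> <word> - <digits>"
LINE_RE = re.compile(r'\d+\.\s+(\w+)\s+(\w+)\s+-\s+(\d+)')

def parse_lines(text):
    """Return a list of (first, last, number) tuples, one per matching line."""
    results = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # non-conforming line: skipped here, could be logged instead
        first, last, number = m.groups()
        results.append((first, last, int(number)))
    return results

data = """1. First1 Last1 - 20 (Long Description)
2. First2 Last2 - 40 (Another Description)"""

print(parse_lines(data))  # [('First1', 'Last1', 20), ('First2', 'Last2', 40)]
```

Converting the number with int() is a choice made for this sketch; drop it if you want the raw strings.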
There is no need to use regex here:
foo = "1.  First1 Last1 - 20 (Long Description)"  # note: two spaces after "1."
foo.split(" ")
>>> ['1.', '', 'First1', 'Last1', '-', '20', '(Long', 'Description)']
You can now select the elements you like (they will always be at the same indices).
In Python 2.7+ you can use itertools.compress to select the elements:
from itertools import compress
tuple(compress(foo.split(" "), [0,0,1,1,0,1]))
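One caveat: split(" ") produces an empty string for every extra space, so the indices shift with the spacing. A hedged variant, assuming the first five whitespace-separated fields are always the number, first name, last name, dash and value: calling split() with no argument collapses runs of whitespace, so the selector list no longer depends on how many spaces follow the number. The helper name pick_fields is illustrative only.

```python
from itertools import compress

# 1 keeps the field, 0 drops it.
# Fields after split(): '1.', 'First1', 'Last1', '-', '20', '(Long', ...
SELECTORS = [0, 1, 1, 0, 1]

def pick_fields(line):
    """Select first name, last name and number from one line."""
    # compress() stops at the end of SELECTORS, so trailing fields are ignored
    return tuple(compress(line.split(), SELECTORS))

foo = "1.  First1 Last1 - 20 (Long Description)"
print(pick_fields(foo))  # ('First1', 'Last1', '20')
```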
Based on Harman's partial solution, I came up with this:
(?P<first>\w+)\s+(?P<last>\w+)[-\s]*(?P<number>\d[\d,]*)
Code and the output:
>>> regex = re.compile(r"(?P<first>\w+)\s+(?P<last>\w+)[-\s]*(?P<number>\d[\d,]*)")
>>> r = regex.search(string)
>>> regex.findall(string)
[(u'First1', u'Last1', u'20'), (u'First2', u'Last2', u'40')]
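Because the pattern uses named groups, the pieces can also be read by name rather than by position. A small sketch using finditer() and groupdict(), assuming the two sample lines from the question are joined by a newline; the variable names are illustrative:

```python
import re

regex = re.compile(r"(?P<first>\w+)\s+(?P<last>\w+)[-\s]*(?P<number>\d[\d,]*)")

string = ("1. First1 Last1 - 20 (Long Description)\n"
          "2. First2 Last2 - 40 (Another Description)")

# One dict per match, keyed by the group names in the pattern
records = [m.groupdict() for m in regex.finditer(string)]

for rec in records:
    print(rec["first"], rec["last"], rec["number"])
```

Note that this pattern is looser than a line-anchored one: it matches any two words followed by a number anywhere in the text, so it is worth checking against real data before relying on it.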