Parsing two-dimensional text
I need to parse text files where relevant information is often spread across multiple lines in a nonlinear way. An example:
1234
1 IN THE SUPERIOR COURT OF THE STATE OF SOME STATE
2 IN AND FOR THE COUNTY OF SOME COUNTY
3 UNLIMITED JURISDICTION
4 --o0o--
5
6 JOHN SMITH AND JILL SMITH, )
)
7 Plaintiffs, )
)
8 vs. ) No. 12345
)
9 ACME CO, et al., )
)
10 Defendants. )
___________________________________)
I need to pull out Plaintiff and Defendant identities.
These transcripts come in a very wide variety of formats, so I can't always count on those nice parentheses being there, or on the plaintiff and defendant information being neatly boxed off, e.g.:
1 SUPREME COURT OF THE STATE OF SOME OTHER STATE
COUNTY OF COUNTYVILLE
2 First Judicial District
Important Litigation
3 --------------------------------------------------X
THIS DOCUMENT APPLIES TO:
4
JOHN SMITH,
5 Plaintiff, Index No.
2000-123
6
DEPOSITION
7 - against - UNDER ORAL
EXAMINATION
8 OF
JOHN SMITH,
9 Volume I
10 ACME CO,
et al,
11 Defendants.
12 --------------------------------------------------X
The two constants are:
- "Plaintiff" will occur after the name of the plaintiff(s), but not necessarily on the same line.
- Plaintiffs' and defendants' names will be in upper case.
Any ideas?
I like Martin's answer.
Here's perhaps a more general approach using Python:
import re
# load file into memory
# (for large files, limit how much of the file gets loaded)
with open('paren.txt', 'r') as f:
    paren = f.read()  # example doc with parens
# match every run of one or more word characters (letters, digits, underscore)
# that is followed somewhere later by the word `Plaintiff`; this is intentionally general
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)', paren,
                             re.DOTALL|re.MULTILINE)
# join the list, separating by whitespace
str_of_matches = ' '.join(list_of_matches)
# split the string on runs of digits (the line numbers);
# \d+ keeps a multi-digit line number from producing an extra empty group
tokens = re.split(r'\d+', str_of_matches)
# in these examples the last word before `Plaintiff` is a line number, so the
# final group is empty and the plaintiffs are in the 2nd-to-last group
plaintiff = tokens[-2].strip()
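To see why the second-to-last group is the one to take, it may help to print the intermediate values on a small inline string. The `sample` below is a shortened, made-up stand-in for the caption page, not the real `paren.txt`. The split here uses `\d+`, so a multi-digit line number counts as a single separator.

```python
import re

# shortened stand-in for the caption page (hypothetical, not the real file)
sample = "5\n6 JOHN SMITH, )\n)\n7 Plaintiffs, )\n8 vs. ) No. 12345\n"

# every word that still has `Plaintiff` somewhere after it
words = re.findall(r'(\w+)(?=.*Plaintiff)', sample, re.DOTALL)
print(words)   # ['5', '6', 'JOHN', 'SMITH', '7']

# splitting on the line numbers leaves the name between the last two of them
tokens = re.split(r'\d+', ' '.join(words))
print(tokens)  # ['', ' ', ' JOHN SMITH ', '']

plaintiff = tokens[-2].strip()
print(plaintiff)  # JOHN SMITH
```

The final token is empty only because the last word before `Plaintiff` is a line number. If the name and `Plaintiff` sit on the same numbered line, the name lands in the last token instead, so the index would need adjusting.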
Tests:
with open('paren.txt', 'r') as f:
    paren = f.read()  # example doc with parens
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)', paren,
                             re.DOTALL|re.MULTILINE)
str_of_matches = ' '.join(list_of_matches)
tokens = re.split(r'\d+', str_of_matches)
plaintiff = tokens[-2].strip()
plaintiff
# prints 'JOHN SMITH AND JILL SMITH'
with open('no_paren.txt', 'r') as f:
    no_paren = f.read()  # example doc with no parens
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)', no_paren,
                             re.DOTALL|re.MULTILINE)
str_of_matches = ' '.join(list_of_matches)
tokens = re.split(r'\d+', str_of_matches)
plaintiff = tokens[-2].strip()
plaintiff
# prints 'JOHN SMITH'
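The question also asks for the defendants. One possible extension, offered as a sketch and not a tested solution, is to wrap the same steps in a function that takes the marker word as a parameter. The name `party_before` and the inline `sample` are hypothetical. The sketch assumes that the last occurrence of the marker in the loaded text is the one in the caption, that the marker sits on a numbered line after the name, and that the names contain no digits. It splits on `\d+` and not `\d`, because a two-digit line number such as 10 would otherwise produce an extra empty group and shift the index.

```python
import re

def party_before(text, marker):
    """Return the words between the last two line numbers that precede
    the final occurrence of `marker` (hypothetical helper)."""
    # every word that still has the marker somewhere after it
    pattern = r'(\w+)(?=.*' + re.escape(marker) + ')'
    words = re.findall(pattern, text, re.DOTALL)
    # split on runs of digits, i.e. the transcript line numbers
    tokens = re.split(r'\d+', ' '.join(words))
    # fewer than two tokens: marker missing or no line number before it
    return tokens[-2].strip() if len(tokens) >= 2 else None

sample = """\
6 JOHN SMITH AND JILL SMITH, )
)
7 Plaintiffs, )
)
8 vs. ) No. 12345
)
9 ACME CO, et al., )
)
10 Defendants. )
"""

print(party_before(sample, 'Plaintiff'))  # JOHN SMITH AND JILL SMITH
print(party_before(sample, 'Defendant'))  # ACME CO et al
```

Punctuation is dropped because only `\w+` runs are collected, which is why the defendant comes back as `ACME CO et al` without the commas and periods.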