Beginner Regex problem with whitespace and backtracking
I'm trying to extract data from a PDF which is in the form of a table with headings such as name, country, and various numeric fields.
I'm having problems because the names and countries vary in length. I'm also not sure how to get at the numbers, as whatever I try drops the first digit.
e.g.
Sean O'Hair United States 2.758 137.906 50 -7.525 0.000
Y.E. Yang Korea 2.734 153.128 56 -6.722 0.000
Bo Van Pelt United States 2.733 153.056 56 -4.895 0.000
Unlikely this is still a problem given how old it is, but it's listed as unanswered so for the benefit of anyone with a similar problem...
Here's a quick pattern that'll extract all matches into an array - it may or may not need to be made more flexible:
<cfset Matches = rematch( '\D+ \d\.\d{3} \d+\.\d{3} \d\d -\d\.\d{3} 0\.000' , Input ) />
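If the numbers vary more than the sample suggests (a non-negative fourth field, a nonzero last field, more digits), a looser pattern may be needed. Here's a rough sketch of one, written as a POSIX extended regex and run through grep -E rather than rematch. The header row in the sample input and the variable names are made up for illustration, and it assumes names and countries never contain digits.

```shell
# POSIX ERE has no \d or \D, so the digit classes are spelled out.
# One numeric field: optional minus sign, digits, optional decimal part.
num='-?[0-9]+(\.[0-9]+)?'
# A row: non-digit text (name and country), then five space-separated numeric fields.
row_pattern="^[^0-9]+( $num){5}\$"

# Sample rows from the question, plus a hypothetical header line that should be skipped.
sample="Name Country
Sean O'Hair United States 2.758 137.906 50 -7.525 0.000
Y.E. Yang Korea 2.734 153.128 56 -6.722 0.000
Bo Van Pelt United States 2.733 153.056 56 -4.895 0.000"

matches=$(printf '%s\n' "$sample" | grep -E "$row_pattern")
printf '%s\n' "$matches"
```

Only the three data rows should come back; the header line has no numeric fields, so it doesn't match.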
Then looping through those results, for each match you can separate the name+country from the numbers with:
<cfset NameAndCountry = trim(Left( CurMatch , refind('\d',CurMatch)-1 )) />
<cfset Numbers = Right( CurMatch , Len(CurMatch)-Len(NameAndCountry) ) />
Extracting the countries from the names is not simple - there aren't really any rules for which is which, so it needs a set of countries to loop through and check against, something like:
<cfloop index="CurCountry" array="#Countries#" >
    <cfif NameAndCountry.endsWith( CurCountry ) >
        <cfset Name = trim(Left( NameAndCountry , Len(NameAndCountry)-Len(CurCountry) )) />
        <cfbreak />
    </cfif>
</cfloop>
For the numbers, ListToArray with a space as the delimiter will split them into an array.
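For anyone doing this from the command line instead of ColdFusion, the same steps (cut the text off at the first number, then check the end of it against a list of known countries) might look roughly like the sketch below. The country list, the split_row function name and the "|" output separator are all hypothetical, and it assumes names and countries contain no digits.

```shell
# Hypothetical country list; a real one needs every country that appears in the data.
countries='United States
Korea'

# Prints "name|country|numbers" for one row of the table.
split_row() {
    row=$1
    # Cut from the first space that precedes a digit (or a minus sign and a digit).
    name_and_country=$(printf '%s\n' "$row" | sed -e 's/ -\{0,1\}[0-9].*$//')
    numbers=${row#"$name_and_country "}

    # Default when no listed country matches: keep the whole text as the name.
    name=$name_and_country
    country=
    while IFS= read -r candidate; do
        case $name_and_country in
            *" $candidate")
                country=$candidate
                name=${name_and_country%" $candidate"}
                break
                ;;
        esac
    done <<EOF
$countries
EOF
    printf '%s|%s|%s\n' "$name" "$country" "$numbers"
}

split_row "Sean O'Hair United States 2.758 137.906 50 -7.525 0.000"
split_row "Y.E. Yang Korea 2.734 153.128 56 -6.722 0.000"
```

The numbers come out as one space-separated string, so they can be split further with ordinary word splitting (for example set -- $numbers). A country missing from the list leaves the country field empty rather than guessing.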
If you pipe your example data through:
sed -e 's/^[^0-9]*//'
it'll strip all the non-number characters from the beginning. Does that help?
P.S. Splitting the name from the country would be tricky, since it looks like there's just a space between them, and there are also spaces in the middle of names and countries.
EDIT: Oops, that would remove a minus sign from the first number. Probably better to only remove words (sequences of non-digits followed by a space):
sed -e 's/^\([^0-9 ]* \)*//'
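To see the difference between the two commands, here's a small sketch that runs both on a made-up row whose first number is negative (none of the sample rows have that, so the row below is hypothetical):

```shell
# Hypothetical row: the first numeric field is negative.
row='Some Player Korea -1.234 153.128 56 -6.722 0.000'

# First attempt: drops every leading non-digit character, minus sign included.
first=$(printf '%s\n' "$row" | sed -e 's/^[^0-9]*//')

# Revised: drops only leading runs of non-digit, non-space characters that end in a space.
second=$(printf '%s\n' "$row" | sed -e 's/^\([^0-9 ]* \)*//')

printf '%s\n%s\n' "$first" "$second"
```

The first result starts with 1.234 and the second with -1.234. The revised command still assumes that no word in the name or country contains a digit.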