Beginner Regex problem with whitespace and backtracking

I'm trying to extract data from a PDF that is laid out as a table, with headings such as name, country, and various numeric fields.

I am having problems because the names and countries vary in length. I'm also not sure how to capture the numbers, as whatever I try misses out the first digit.

e.g.

Sean O'Hair United States 2.758 137.906 50 -7.525 0.000  
 Y.E. Yang Korea 2.734 153.128 56 -6.722 0.000  
 Bo Van Pelt United States 2.733 153.056 56 -4.895 0.000


Unlikely this is still a problem given how old it is, but it's listed as unanswered so for the benefit of anyone with a similar problem...

Here's a quick pattern that'll extract all matches into an array - it may or may not need to be made more flexible:

<cfset Matches = rematch( '\D+ \d\.\d{3} \d+\.\d{3} \d\d -\d\.\d{3} 0\.000' , Input ) />

Then looping through those results, for each match you can separate the name+country from the numbers with:

<cfset FirstDigit = refind( '\d' , CurMatch ) />
<cfset NameAndCountry = trim(Left( CurMatch , FirstDigit-1 )) />
<cfset Numbers = Right( CurMatch , Len(CurMatch)-FirstDigit+1 ) />

Extracting the countries from the names is not simple - there aren't really any rules for which is which, so it needs a set of countries to loop through and check against, something like:

<cfloop index="CurCountry" array="#Countries#" >
    <cfif NameAndCountry.endsWith( CurCountry ) >
        <cfset Name = trim(Left( NameAndCountry , Len(NameAndCountry)-Len(CurCountry) )) />
        <cfbreak />
    </cfif>
</cfloop>

For the numbers, using ListToArray with a space as the delimiter will separate them.
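
For anyone working outside ColdFusion, the same steps can be roughed out in plain shell. This is only a sketch: the `countries` list and the `split_row` function are made-up names for illustration, and it assumes that neither names nor countries contain digits.

```shell
# Hypothetical country list, one per line; a real list must cover every
# country that appears in the table.
countries='United States
Korea'

# Print "name|country|numbers" for a single table row.
split_row() {
    row=$1
    # Text before the first digit is name + country (assumes no digits in either).
    prefix=${row%%[0-9]*}
    # The rest of the row is the numeric fields; drop any trailing spaces.
    numbers=${row#"$prefix"}
    numbers=$(printf '%s\n' "$numbers" | sed -e 's/ *$//')
    # Trim surrounding spaces from the name + country part.
    prefix=$(printf '%s\n' "$prefix" | sed -e 's/^ *//' -e 's/ *$//')
    # Check each known country to see whether the text ends with it.
    printf '%s\n' "$countries" | while IFS= read -r country; do
        case $prefix in
            *" $country")
                printf '%s|%s|%s\n' "${prefix% "$country"}" "$country" "$numbers"
                break ;;
        esac
    done
}

split_row " Y.E. Yang Korea 2.734 153.128 56 -6.722 0.000  "
# prints: Y.E. Yang|Korea|2.734 153.128 56 -6.722 0.000
```

If one country name is a suffix of another, list the longer one first, otherwise the shorter name matches and the rest ends up in the player's name.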


If you pipe your example data through:

sed -e 's/^[^0-9]*//'

it'll strip all the non-number characters from the beginning. Does that help?

P.S. Splitting the name from the country would be tricky, since it looks like there's just a space between them, and there are also spaces in the middle of names and countries.

EDIT: Oops, that would remove a minus sign from the first number. Probably better to only remove words (sequences of non-digit, non-space characters followed by a space):

sed -e 's/^\([^0-9 ]* \)*//'
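
A quick way to check this is to run it over a couple of rows from the question plus one made-up row whose first number is negative. The `rows` variable, the `strip_words` wrapper and the "Some Player" row are only there for illustration.

```shell
# Two sample rows from the question, plus a hypothetical row that starts
# its numbers with a negative value.
rows="Sean O'Hair United States 2.758 137.906 50 -7.525 0.000
 Y.E. Yang Korea 2.734 153.128 56 -6.722 0.000
 Some Player Korea -1.234 10.000 5 -0.500 0.000"

# Remove leading words: runs of non-digit, non-space characters
# that are each followed by a space.
strip_words() {
    sed -e 's/^\([^0-9 ]* \)*//'
}

printf '%s\n' "$rows" | strip_words
# 2.758 137.906 50 -7.525 0.000
# 2.734 153.128 56 -6.722 0.000
# -1.234 10.000 5 -0.500 0.000
```

The minus sign survives because "-1.234" is not a digit-free word followed by a space, so the match stops just before it.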