Beginner Regex problem with whitespace and backtracking

Posted 2023-02-17 07:51

I'm trying to extract data from a PDF which is in the form of a table with headings such as name, country, and various numeric fields.

I am having problems where the names and countries are of different lengths. I'm also not sure how to get to the numbers, as whatever I try misses out the first digit.

e.g.

Sean O'Hair United States 2.758 137.906 50 -7.525 0.000  
 Y.E. Yang Korea 2.734 153.128 56 -6.722 0.000  
 Bo Van Pelt United States 2.733 153.056 56 -4.895 0.000

Unlikely this is still a problem given how old it is, but it's listed as unanswered so for the benefit of anyone with a similar problem...

Here's a quick pattern that'll extract all matches into an array - it may or may not need to be made more flexible:

<cfset Matches = rematch( '\D+ \d\.\d{3} \d+\.\d{3} \d\d -\d\.\d{3} 0\.000' , Input ) />

Then looping through those results, for each match you can separate the name+country from the numbers with:

<cfset NameAndCountry = trim(Left( CurMatch , refind('\d',CurMatch)-1 )) />
<cfset Numbers = Right( CurMatch , Len(CurMatch)-refind('\d',CurMatch)+1 ) />

Extracting the countries from the names is not simple - there aren't really any rules for which is which, so it needs a set of countries to loop through and check against, something like:

<cfloop index="CurCountry" array="#Countries#" >
    <cfif NameAndCountry.endsWith( CurCountry ) >
        <cfset Name = trim(Left( NameAndCountry , Len(NameAndCountry)-Len(CurCountry) )) />
        <cfbreak />
    </cfif>
</cfloop>
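
As a rough illustration of the same suffix check outside ColdFusion, here is a minimal shell sketch. The function name `split_name_country` and the three-country list are assumptions made for the example, not part of the original answer. A real list would need every country that appears in the data, with longer names placed before any shorter name they end with.

```shell
# Hypothetical helper: split "Name Country" by checking the end of the
# string against a known list of countries (one per line, so multi-word
# countries stay intact). The list below is illustrative only.
split_name_country() {
    name_and_country=$1
    while IFS= read -r country; do
        suffix=" $country"
        case $name_and_country in
            *"$suffix")
                # Remove the trailing " Country" to leave just the name.
                printf '%s|%s\n' "${name_and_country%"$suffix"}" "$country"
                return 0
                ;;
        esac
    done <<'EOF'
United States
Korea
South Africa
EOF
    # No listed country matched the end of the string.
    return 1
}

split_name_country "Bo Van Pelt United States"   # prints: Bo Van Pelt|United States
```

The function returns a non-zero status when no listed country matches, so unmatched rows can be collected and reviewed by hand.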

For the numbers, using ListToArray with space as delimiter can separate them.
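
For anyone doing this step in a shell instead, the same space-delimited split can be sketched with plain word splitting. The function name `split_numbers` is made up for this example, and it assumes the numeric part has already been separated from the name and country.

```shell
# Hypothetical helper: print each space-separated number on its own line.
split_numbers() {
    # Unquoted expansion is deliberate: the shell splits $1 on whitespace,
    # and "--" keeps a leading minus sign from being read as an option.
    set -- $1
    printf '%s\n' "$@"
}

split_numbers "2.758 137.906 50 -7.525 0.000"
```

Each number ends up on its own line, so a specific field can be picked out with something like `sed -n '4p'`.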

If you pipe your example data through:

sed -e 's/^[^0-9]*//'

it'll strip all the non-number characters from the beginning. Does that help?

P.S. Splitting the name from the country would be tricky, since it looks like there's just a space between them, and there are also spaces in the middle of names and countries.

EDIT: Oops, that would remove a minus sign from the first number. Probably better to only remove words (sequences of non-digits followed by a space):

sed -e 's/^\([^0-9 ]* \)*//'
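
To see the difference between the two expressions, here is a small sketch that runs both over two rows. The second row is invented for the demonstration (none of the sample rows starts its numbers with a negative value), and the wrapper names `strip_nondigits` and `strip_words` are only labels for this example.

```shell
# First row is from the question; the second is a made-up row whose
# first number is negative, to show where the two expressions differ.
rows="Sean O'Hair United States 2.758 137.906 50 -7.525 0.000
Example Player Korea -1.250 10.000 12 -3.000 0.000"

# Removes every leading non-digit character, including a minus sign.
strip_nondigits() { sed -e 's/^[^0-9]*//'; }

# Removes only leading "words": runs of non-digit, non-space characters
# that are followed by a space, so a leading "-" before a digit survives.
strip_words() { sed -e 's/^\([^0-9 ]* \)*//'; }

printf '%s\n' "$rows" | strip_nondigits
printf '%s\n' "$rows" | strip_words
```

With the first expression the invented row comes out starting with `1.250`, while the second keeps `-1.250`.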

Tags: cfml coldfusion coldfusion-9 regex
