Using regex to extract variables from a plain-text form letter?

2022-12-26 18:34 问答作者：

I'm looking for a good example of using Regular Expressions in PHP to "reverse engineer" a form letter (with a known format, of course) that has been pasted into a multiline textbox and sent to a script for processing.

So, for example, let's assume this is the original plain-text input (taken from a USDA press release):

WASHINGTON, April 5, 2010 - North American Bison Co-Op, a New Rockford,开发者_运维技巧 N.D., establishment is recalling approximately 25,000 pounds of whole beef heads containing tongues that may not have had the tonsils completely removed, which is not compliant with regulations that require the removal of tonsils from cattle of all ages, the U.S. Department of Agriculture's Food Safety and Inspection Service (FSIS) announced today.

For clarity, the fields that are variables are highlighted below:

[pr_city=]WASHINGTON, [pr_date=]April 5, 2010 - [corp_name=]North American Bison Co-Op, a [corp_city=]New Rockford, [corp_state=]N.D., establishment is recalling approximately [amount=]25,000 pounds of [product=]whole beef heads containing tongues that may not have had the tonsils completely removed, which is not compliant with regulations that require [reason=]the removal of tonsils from cattle of all ages, the U.S. Department of Agriculture's Food Safety and Inspection Service (FSIS) announced today.

How could I efficiently extract the contents of the

pr_city
pr_date
corp_name
corp_city
corp_state
amount
product
reason

fields from my example?

Any help would be appreciated, thanks.

Well, a regex that works on your example could look like this (line breaks introduced to keep this beast legible, need to be removed prior to use):

/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a 
(?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is 
recalling approximately (?P<amount>.*?) of (?P<product>.*?), 
which is not compliant with regulations that require (?P<reason>.*?), 
the U\.S\. Department of Agriculture\'s Food Safety and Inspection 
Service \(FSIS\) announced today\.$/

So, in PHP you could do

if (preg_match('/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a (?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is recalling approximately (?P<amount>.*?) of (?P<product>.*?), which is not compliant with regulations that require (?P<reason>.*?), the U\.S\. Department of Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\.$/', $subject, $regs)) {
    $prcity = $regs['pr_city'];
    $prdate = $regs['pr_date'];
    ... etc.
} else {
    $result = "";
}

This assumes a couple of things, for instance that there are no line breaks, and that the input is the entire string (and not a larger string from which this part has to be extracted from). I've tried to make assumptions about legal values that make some sense, but there is the very real chance that other inputs could break this. So some more test cases are probably needed.

If the surrounding text is constant, then something like this partial regex could do the trick:

preg_match('/^(.*?), (.*?)- (.*?), a (.*?), (.*?), establishment is recalling approximately (.*?), which is not compliant with regulations that require (.*?), the U.S. Department of Agriculture's Food Safety and Inspection Service (FSIS) announced today./', $text, $matches);

$matches[1] = 'WASHINGTON';
$matches[2] = 'April 5, 2010';
$matches[3] = ... etc...

If the surrounding text changes, then you're going to end up with a ton of false matches, no matches, etc... Essentially you'd need an AI to parse/understand PR releases.

Edit: Please disregard this crazy answer, as the other two are better. I should probably delete it, but I'm keeping it up for reference.

I have a crazy idea that just might work: build an XML string from the input by adding markups, then parse it. It might look something like this (completely untested) code:

preg_replace('([^,]*), ([^-]*)- ...etc...', '<pr_city>\1</pr_city><pr_date>\2</pr_date> ...etc...');

Parsing the XML afterwards is a needlessly complicated process that is best left to the PHP documentation: http://www.php.net/manual/en/function.xml-parse.php .

You could also consider converting it to JSON with this method, then using json_decode() to parse it. In any case, you have to think about what happens when " marks and > symbols appear in the input.

It might be easier to just match and remove one piece of the text at a time.

继续阅读：parsing php regex text-processing

Using regex to extract variables from a plain-text form letter?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？