Regex for extracting names of colleges, universities, and institutes?

I have a bunch of strings like this in a file:

M.S., Arizona University, Tucson, Az., 1957
B.A., American International College, Springfield, Mass., 1978
B.A., American University, Washington, D.C., 1985

and I'd like to extract Tufts University, American International College, American University, University of Massachusetts, etc., but not the high schools (it's probably safe to assume that anything containing "Academy" or "High School" is a high school). Any ideas?


Tested with preg_match_all in PHP; this works for the sample text you provided:

 /(?<=,)[\w\s]*(College|University|Institute)[^,\d]*(?=,|\d)/

It will need to be modified somewhat if your regex engine does not support lookaheads/lookbehinds.
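For anyone not on PHP, here is a rough port of the pattern above to Python's re module, run against the three sample lines from the question. The second pattern is my own guess at a lookaround-free variant (it consumes the delimiters and captures the name instead), so treat it as a sketch rather than a tested drop-in. The function and constant names are made up for illustration.

```python
import re

# Sample lines from the question.
SAMPLE = """\
M.S., Arizona University, Tucson, Az., 1957
B.A., American International College, Springfield, Mass., 1978
B.A., American University, Washington, D.C., 1985
"""

# The pattern from the answer, minus the PHP "/" delimiters.
WITH_LOOKAROUND = re.compile(
    r"(?<=,)[\w\s]*(College|University|Institute)[^,\d]*(?=,|\d)"
)

# Assumed lookaround-free variant: match the leading comma and the trailing
# comma/digit explicitly, and capture only the name in group 1.
WITHOUT_LOOKAROUND = re.compile(
    r",([\w\s]*(?:College|University|Institute)[^,\d]*)[,\d]"
)


def extract_with_lookaround(text):
    # group(0) is the whole match; the inner group only holds the keyword.
    return [m.group(0).strip() for m in WITH_LOOKAROUND.finditer(text)]


def extract_without_lookaround(text):
    # group(1) is the name without the surrounding delimiters.
    return [m.group(1).strip() for m in WITHOUT_LOOKAROUND.finditer(text)]


if __name__ == "__main__":
    print(extract_with_lookaround(SAMPLE))
    print(extract_without_lookaround(SAMPLE))
```

Both functions strip the leading space that the match picks up after the comma. Note that re.findall would return only the keyword for the first pattern, because it contains a capturing group, which is why finditer is used here.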


Update: I looked at your linked sample text and updated the regex accordingly:

 /([A-Z][^\s,.]+[.]?\s[(]?)*(College|University|Institute|Law School|School of|Academy)[^,\d]*(?=,|\d)/

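The first part matches a word that starts with a capital letter, followed by one or more characters that are not whitespace, a comma, or a period, and optionally a trailing ".". Then comes a whitespace character, then optionally a "(". This group is matched zero or more times.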

This should get all relevant words preceding the keywords.
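A quick way to try the updated pattern is the Python sketch below. It is a port to Python's re module, not the PHP that was actually tested, and the second and third sample lines are invented for illustration. The drop_high_schools helper is also hypothetical: it just applies the "Academy" / "High School" heuristic from the question as a post-filter, in case you want those excluded again.

```python
import re

# The updated pattern from the answer, split across lines for readability.
UPDATED = re.compile(
    r"([A-Z][^\s,.]+[.]?\s[(]?)*"
    r"(College|University|Institute|Law School|School of|Academy)"
    r"[^,\d]*(?=,|\d)"
)


def extract_names(line):
    # group(0) is the full name; the numbered groups hold only fragments.
    return [m.group(0).strip() for m in UPDATED.finditer(line)]


def drop_high_schools(names):
    # Heuristic from the question: "Academy" or "High School" means high school.
    return [n for n in names if "Academy" not in n and "High School" not in n]


LINES = [
    "B.A., American International College, Springfield, Mass., 1978",
    "Ph.D., University of Massachusetts, Amherst, Mass., 1990",  # made-up line
    "Diploma, Phillips Academy, Andover, Mass., 1970",           # made-up line
]

if __name__ == "__main__":
    for line in LINES:
        names = extract_names(line)
        print(names, "->", drop_high_schools(names))
```

The second line shows the zero-repetition case: the leading group matches nothing, and "University" itself starts the match, with the trailing part picking up " of Massachusetts".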
