Regex for extracting names of colleges, universities, and institutes?
I have a bunch of strings like this in a file:
M.S., Arizona University, Tucson, Az., 1957
B.A., American International College, Springfield, Mass., 1978
B.A., American University, Washington, D.C., 1985
and I'd like to extract Tufts University, American International College, American University, University of Massachusetts, etc., but not the high schools. (It's probably safe to assume that an entry containing "Academy" or "High School" is a high school.) Any ideas?
Tested with preg_match_all in PHP; this works for the sample text you provided:
/(?<=,)[\w\s]*(College|University|Institute)[^,\d]*(?=,|\d)/
Will need to be modified somewhat if your regex engine does not support lookaheads/lookbehinds.
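For anyone who wants to try the pattern outside PHP, here is a minimal sketch in Python, whose re module also accepts this single-character lookbehind. The function name extract_schools is made up for illustration, and the sample is just the three lines from the question. Note that the match starts right after the comma, so it carries a leading space that has to be trimmed.

```python
import re

# Same pattern as above, minus the PHP "/" delimiters.
PATTERN = re.compile(r"(?<=,)[\w\s]*(College|University|Institute)[^,\d]*(?=,|\d)")

# The three sample lines from the question.
SAMPLE = """M.S., Arizona University, Tucson, Az., 1957
B.A., American International College, Springfield, Mass., 1978
B.A., American University, Washington, D.C., 1985"""


def extract_schools(text):
    """Return every whole match, trimmed of surrounding whitespace.

    finditer + group(0) is used on purpose: re.findall would return only
    the captured keyword ("University", "College", ...), not the full name.
    """
    return [m.group(0).strip() for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    for name in extract_schools(SAMPLE):
        print(name)
```

In PHP the equivalent whole matches are in $matches[0] after calling preg_match_all.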
Update: I looked at your linked sample text and updated the regex accordingly:
/([A-Z][^\s,.]+[.]?\s[(]?)*(College|University|Institute|Law School|School of|Academy)[^,\d]*(?=,|\d)/
The first part matches a word starting with a capital letter, optionally followed by a ".", then a space, then optionally a "(". This group is matched zero or more times, which should pick up all the relevant words preceding the keywords.
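As a rough check of how the leading group behaves, here is a small Python sketch using the updated pattern. The last two sample lines (St. John's University and Harvard Law School) are invented for illustration and are not from the linked sample text; the name extract_institutions is likewise hypothetical.

```python
import re

# Updated pattern from above, minus the PHP "/" delimiters.
PATTERN = re.compile(
    r"([A-Z][^\s,.]+[.]?\s[(]?)*"
    r"(College|University|Institute|Law School|School of|Academy)"
    r"[^,\d]*(?=,|\d)"
)

# First three lines are from the question; the last two are made-up
# examples that exercise the optional "." and the "Law School" keyword.
SAMPLE = """M.S., Arizona University, Tucson, Az., 1957
B.A., American International College, Springfield, Mass., 1978
B.A., American University, Washington, D.C., 1985
B.A., St. John's University, Jamaica, N.Y., 1970
J.D., Harvard Law School, Cambridge, Mass., 1980"""


def extract_institutions(text):
    """Return the full text of each match (group 0), not the sub-groups."""
    return [m.group(0) for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    for name in extract_institutions(SAMPLE):
        print(name)
```

Because this version has no lookbehind, the matches start at the first capitalized word and need no trimming. Degree abbreviations such as "M.S." are skipped because the capital letter must be followed by at least one character that is not a space, comma, or period.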