Pulling out two separate words from a string using regular expressions?
I need to improve on a regular expression I'm using. Currently, here it is:
^[a-zA-Z\s/-]+
I'm using it to pull out medication names from a variety of formulation strings, for example:
- SULFAMETHOXAZOLE-TRIMETHOPRIM 200-40 MG/5ML PO SUSP
- AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE
- AMOXICILLIN TRIHYDRATE 125 mg ORAL TABLET, CHEWABLE
- AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE
- Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet
The resulting matches on these examples are:
- SULFAMETHOXAZOLE-TRIMETHOPRIM
- AMOX TR/POTASSIUM CLAVULANATE
- AMOXICILLIN TRIHYDRATE
- AMOX TR/POTASSIUM CLAVULANATE
- Amoxicillin
The first four are what I want, but on the fifth, I really need "Amoxicillin / Clavulanate".
How would I pull out patterns like "Amoxicillin / Clavulanate" (in the fifth row) while skipping patterns like "MG/5ML" (in the first row)?
Update
Thanks for the help, everyone. Here's a longer list of examples with more nuances of the data:
- Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet
- Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet
- Amoxicillin 10 MG/ML Oral Suspension
- Amoxil 10 MG/ML Oral Suspension
- AMOXICILLIN TRIHYDRATE 125 mg ORAL TABLET, CHEWABLE
- AMOXAPINE
- AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE
- AMOXICILLIN TRIHYDRATE 125 mg ORAL TABLET, CHEWABLE
- AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE
- AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE
- CARBATROL 200 MG PO CP12
- CARBATROL 200 MG PO CP12
- CARBATROL
- CARBAMAZEPINE 100 MG PO CHEW
- CEFDINIR 250 MG/5ML PO SUSR
- AMOXICILLIN 400 MG/5ML PO SUSR
- SULFAMETHOXAZOLE-TRIMETHOPRIM 200-40 MG/5ML PO SUSP
- DIAZEPAM 2 MG PO TABS
- DIAZEPAM
- PREDNISONE 20 MG PO TABS
- AUGMENTIN 250-62.5 MG/5ML PO SUSR
- ACETAMINOPHEN 325 MG/10.15ML PO SUSP
What I've done for now is this:
// requires: using System.Text.RegularExpressions;
private static string GetMedNameFromIncomingConceptString(string conceptAsString)
{
    // look for a match at the beginning of the string
    Match firstRegMatch = new Regex(@"^[a-zA-Z\s/-]+").Match(conceptAsString);
    if (firstRegMatch.Success)
    {
        // grab the matching part of the string
        string firstPart = conceptAsString.Substring(firstRegMatch.Index, firstRegMatch.Length);
        // look for an additional match following a slash (like Amox 1000 / Clav 50)
        Match secondRegMatch = new Regex(@"/\s[a-zA-Z\s/-]+").Match(conceptAsString, firstRegMatch.Length);
        if (secondRegMatch.Success)
            return firstPart + conceptAsString.Substring(secondRegMatch.Index, secondRegMatch.Length);
        else
            return firstPart;
    }
    else
    {
        // no leading name found: return the input unchanged
        return conceptAsString;
    }
}
It's pretty ugly, and I imagine it may fail when I run a lot more data through it, but it works for the larger set of cases I listed above.
When a slash is part of the dosage, is it always followed immediately by a digit? If so, this regex should do for you:
([A-Z]\D+)\d[^/]*(?:/\d[^/]*)*
It actively matches the dosage information as the others suggested, but captures only the medication name. Then you do a global replace with $1 to delete the dosage. Here's how I tested it in Java:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test
{
    public static void main(String[] args)
    {
        String[] data = {
            "SULFAMETHOXAZOLE-TRIMETHOPRIM 200-40 MG/5ML PO SUSP",
            "AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE",
            "AMOXICILLIN TRIHYDRATE 125 mg ORAL TABLET, CHEWABLE",
            "AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE",
            "Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet"
        };
        Pattern p = Pattern.compile("([A-Z]\\D+)\\d[^/]*(?:/\\d[^/]*)*");
        Matcher m = p.matcher("");
        for (String s : data)
        {
            // replace each name-plus-dosage match with just the captured name
            System.out.println(m.reset(s).replaceAll("$1"));
        }
    }
}
output:
SULFAMETHOXAZOLE-TRIMETHOPRIM
AMOX TR/POTASSIUM CLAVULANATE
AMOXICILLIN TRIHYDRATE
AMOX TR/POTASSIUM CLAVULANATE
Amoxicillin / Clavulanate
EDIT: Okay, it looks like the slash in the dosage is always followed by ML, which may be preceded by a number that may include a decimal point. Also, the dosage information may be missing entirely. This regex seems to yield the desired result for your expanded sample input:
([A-Z]\D+)(?:$|\d[^/]*(?:/[\d.]*ML[^/]*)*)
It should work in C#, too.
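To check the revised pattern against the longer list, here is a minimal Java sketch in the same style as the earlier test. The class name MedNameExtractor and the method extractName are made up for illustration. The trim() call is an addition, not part of the answer: group 1 includes the space before the dosage digits, so the raw result ends with a trailing space.

```java
import java.util.regex.Pattern;

// Hypothetical helper: the class and method names are not from the answer.
class MedNameExtractor {
    // The revised pattern from the EDIT, unchanged.
    private static final Pattern NAME_AND_DOSAGE =
        Pattern.compile("([A-Z]\\D+)(?:$|\\d[^/]*(?:/[\\d.]*ML[^/]*)*)");

    // Replaces every name-plus-dosage match with the captured name only.
    // trim() is an assumption added here to drop the trailing space
    // that group 1 captures before the dosage digits.
    static String extractName(String concept) {
        return NAME_AND_DOSAGE.matcher(concept).replaceAll("$1").trim();
    }

    public static void main(String[] args) {
        String[] data = {
            "Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet",
            "Amoxicillin 10 MG/ML Oral Suspension",
            "AMOXAPINE",
            "AUGMENTIN 250-62.5 MG/5ML PO SUSR",
            "ACETAMINOPHEN 325 MG/10.15ML PO SUSP"
        };
        for (String s : data) {
            System.out.println(extractName(s));
        }
    }
}
```

Note that the pattern is case-sensitive, so it relies on every name starting with an uppercase letter and on the unit after the slash being written as ML.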
What you're asking for can't be done, since any attempt to do so would result in also picking up "PO SUSP", "ORAL TABLET", etc. What I recommend you do is try to pick up both the compound and the dosage, then strip off the dosage after.
I think you would be better off removing words you know won't be part of the medication name, such as oral, numbers, etc. This should leave you with what you want.
Alternatively, if you have a database of medications, you can extract only words from that database, which should leave you with just the medications.
I realize these solutions don't use regular expressions, but I don't think they're up to the task you've set for them.
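As a rough illustration of the word-removal idea, here is a Java sketch that splits the string on whitespace and drops any token that contains a digit or consists only of known non-name words. The class StopWordFilter, the method stripKnownWords, and the word list itself are assumptions made up for this example; the list only covers the sample strings in the question and would need to grow with real data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: class name, method names, and word list are assumptions.
class StopWordFilter {
    // Assumed unit, route, and form words taken from the sample data only.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "mg", "ml", "po", "oral", "tablet", "tabs", "chewable", "chew",
        "susp", "susr", "suspension", "extended", "release"));

    // True when every slash-separated part of the token is a stop word,
    // so "MG/ML" is dropped while "TR/POTASSIUM" is kept.
    private static boolean isStopToken(String token) {
        for (String part : token.split("/")) {
            if (!STOP_WORDS.contains(part.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    static String stripKnownWords(String concept) {
        List<String> kept = new ArrayList<>();
        for (String raw : concept.trim().split("\\s+")) {
            String token = raw.replaceAll(",$", "");   // drop a trailing comma
            boolean hasDigit = token.matches(".*\\d.*");
            // keep a lone "/" so "A 1000 MG / B 62.5 MG" becomes "A / B"
            if (token.equals("/") || (!hasDigit && !isStopToken(token))) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }
}
```

With this word list, stripKnownWords("Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet") returns "Amoxicillin / Clavulanate". The weak point is the list: any form or route word that is missing from it ends up in the result.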
The problem with your regex is that it stops matching as soon as it encounters a digit. The assumption is that once you have a dosage, you're done. However, the fifth example counters that assumption.
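A small Java sketch makes the cut-off visible. Only the pattern comes from the question; the class FirstDigitDemo and the method leadingName are hypothetical names used for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern comes from the question.
class FirstDigitDemo {
    private static final Pattern LEADING_NAME = Pattern.compile("^[a-zA-Z\\s/-]+");

    // Returns the anchored match, or the whole input when nothing matches.
    static String leadingName(String concept) {
        Matcher m = LEADING_NAME.matcher(concept);
        return m.find() ? m.group() : concept;
    }

    public static void main(String[] args) {
        // The character class contains no digits, so the match ends right before "1000".
        String s = "Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet";
        System.out.println("[" + leadingName(s) + "]");
    }
}
```

The brackets in the printed line show that the match is "Amoxicillin " including the trailing space, and that nothing after the first digit is ever examined.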
If you think about using regexes, consider this: How would you go about explaining the rule for extracting medications to a regular Joe? Something like "Any and all strings containing letters or the characters / and -, except for the words mg, ml, oral, extended, release, tablet, chewable, po, susp." Sounds pretty difficult, considering it probably doesn't cover all cases.
If the examples are representative for your data, I do see a pattern. Assuming Perl:
/($compound+ $dosage)+ $usage/xi
where
$compound = qr/[a-z-] [\s\/]?/x;
$dosage = qr/(\/? [\d.-] \s (ml|mg))+/x; # add measurement units if needed
$usage = qr/.*/; # rest of string
Pretty hairy if you ask me, and I haven't tested it, only proven it correct. It would probably need some tweaking.
Edit: I see that you've added the tag .net, but the regexes would look similar in C#.
Looking at the new data, the easiest, and arguably the cleanest and most robust, way to do what you want is to first remove the usage (tablet, chewable, susp) and then remove the dosages.
// requires: using System.Text.RegularExpressions;
private static string GetMedNameFromIncomingConceptString(string conceptAsString) {
    // everything up to the last "mg" or "ml" that follows a space or a digit
    Regex compoundsAndDosages = new Regex(@".*[\s\d]m[gl]", RegexOptions.IgnoreCase);
    // a single dosage such as " 125 mg-" or "5ML"
    Regex onlyDosage = new Regex(@"\s?[\d.-]+\s?m[gl][\/-]?", RegexOptions.IgnoreCase);
    // keep compounds and dosage (= remove usage)
    Match cad = compoundsAndDosages.Match(conceptAsString);
    if (cad.Success) {
        // remove dosages (= keep compounds)
        return onlyDosage.Replace(cad.Value, "");
    } else {
        // no dosage found: the string is already just the name
        return conceptAsString;
    }
}