Regex to match on capital letter, digit or capital, lowercase, and digit
I'm working on an application that will calculate molecular weight, and I need to split a formula string into its individual elements, each with its optional count. I've been using a regex to do this, but I haven't quite gotten it to work. I need the regex to match on patterns like H2OCl4 and Na2H2O, breaking them up into matches like:
- H2
- O
- Cl4
- Na2
- H2
- O
The regex I've been working on is this:
([A-Z]\d*|[A-Z]*[a-z]\d*)
It's really close but it currently breaks the matches into this:
- H2
- O
- C
- l4
I need the Cl4 to be considered one match. Can anyone help me with the last part I'm missing here? I'm pretty new to regular expressions. Thanks.
I think what you want is "[A-Z][a-z]?\d*"
That is, a capital letter, followed by an optional small letter, followed by an optional string of digits.
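As a quick check of that pattern against the two formulas from the question, here is a minimal sketch. It uses Python's `re` module rather than the .NET classes, and the helper name `split_formula` is made up for illustration.

```python
import re

# Capital letter, optional lowercase letter, optional run of digits.
ELEMENT = re.compile(r"[A-Z][a-z]?\d*")

def split_formula(formula):
    """Return every element-plus-count token found in the formula (hypothetical helper)."""
    return ELEMENT.findall(formula)

print(split_formula("H2OCl4"))  # ['H2', 'O', 'Cl4']
print(split_formula("Na2H2O"))  # ['Na2', 'H2', 'O']
```

Because the lowercase letter is optional and comes after the mandatory capital, "Cl4" is consumed as a single match instead of being split into "C" and "l4".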
If you want to match 0, 1, or 2 lower-case letters, then you can write:
"[A-Z][a-z]{0,2}\d*"
Note, however, that both of these regular expressions assume that the input data is valid. Given bad data, they silently skip over whatever does not match. For example, if the input string is "H2ClxxzSO4", the second pattern is going to give you:
- H2
- Clx
- S
- O4
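If you want to detect bad data, you'll need to check the Index property of each returned Match object to ensure that it is equal to the index where the previous match ended (0 for the first match), and that the last match ends at the end of the string.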
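The index check can be sketched roughly as follows. This is Python, not .NET: `m.start()` and `m.end()` stand in for the match's Index and Index + Length, and the function name `tokenize_strict` and the choice to raise `ValueError` are assumptions made for the example.

```python
import re

# Capital letter, up to two lowercase letters, optional run of digits.
ELEMENT = re.compile(r"[A-Z][a-z]{0,2}\d*")

def tokenize_strict(formula):
    """Split the formula into tokens, raising ValueError if any text is skipped."""
    tokens = []
    expected = 0  # index at which the next match has to begin
    for m in ELEMENT.finditer(formula):
        if m.start() != expected:
            # The regex engine jumped over characters it could not match.
            skipped = formula[expected:m.start()]
            raise ValueError(f"unexpected text {skipped!r} at index {expected}")
        tokens.append(m.group())
        expected = m.end()
    if expected != len(formula):
        # Trailing characters after the last match were never consumed.
        raise ValueError(f"unexpected text {formula[expected:]!r} at index {expected}")
    return tokens
```

With this check, "H2OCl4" still yields H2, O, Cl4, while "H2ClxxzSO4" is rejected at the "xz" that the plain match loop would have skipped.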
Note that if you expect international characters in your input, such as letters with diacritic marks (ñ, é, è, ê, ë, etc.), then you should use the corresponding Unicode category. In your case, what you want is @"\p{Lu}\p{Ll}?\d*".