Greek characters, Regular Expressions, and C#
I'm building a CMS for a scientific journal that uses a lot of Greek characters. I need to validate a field so that it accepts a specific character set plus Greek characters. Here's what I have now:
[^a-zA-Z0-9-()/\s]
How do I get this to include Greek characters in addition to alphanumeric, '(', ')', '-', and '_'?
I'm using C#, by the way.
In .NET languages, you can use \p{IsGreekandCoptic} to match Greek characters. So the resulting regex is
[^a-zA-Z0-9-()/\s\p{IsGreekandCoptic}]
The characters matched by \p{IsGreekandCoptic} are shown in this chart: http://img203.imageshack.us/img203/3760/greekcoptic.png
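For the C# case in the question, here is a minimal sketch of how that negated class could be used to validate a field. The class name FieldValidator and the method IsValid are hypothetical names made up for illustration, and the hyphen is escaped here only so it cannot be misread as a range operator; otherwise the pattern is the one given above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class FieldValidator
{
    // Negated character class: matches any single character that is NOT allowed.
    // Allowed: ASCII letters, digits, '-', '(', ')', '/', whitespace,
    // and anything in the Greek and Coptic block (U+0370 to U+03FF).
    private static readonly Regex Disallowed =
        new Regex(@"[^a-zA-Z0-9\-()/\s\p{IsGreekandCoptic}]");

    // Returns true when the input contains no disallowed character.
    public static bool IsValid(string input)
    {
        return !Disallowed.IsMatch(input);
    }
}
```

Because the class is negated, a match means the field contains something it should not, so the validation result is the inverse of IsMatch. If '_' should also be accepted, as the question suggests, it would need to be added inside the brackets.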
If you're using a language that uses PCRE for regular expressions and UTF-8, /[\x{0374}-\x{03FF}]+/u should match Greek characters. Greek characters fall between U+0374 and U+03FF (source), and the u modifier tells PCRE to use Unicode. As commented below, /\p{Greek}+/u works as well with PCRE.
If you're using JavaScript, it uses \uXXXX instead of \x{XXXX}: /[\u0374-\u03FF]+/.
Also see this guide to Unicode Regular Expressions for more information.
For Java, from the Pattern javadoc:
\p{InGreek} A character in the Greek block (simple block)
Since this is my first response on SO, I can't downvote Daniel's answer on the JavaScript regex.
I know this is very late, but Daniel's answer is incorrect. It excludes the ancient characters below! This is important if you're working on a Bible app that researches words in ancient Greek!
This is the correct regex for finding Greek and Coptic in JS:
/[\u0370-\u03FF]+/gm
http://unicode.org/charts/PDF/U0370.pdf
Excerpt from chart:
0370 Ͱ GREEK CAPITAL LETTER HETA → 2C75 Ⱶ latin capital letter half h
0371 ͱ GREEK SMALL LETTER HETA → 2C76 ⱶ latin small letter half h
0372 Ͳ GREEK CAPITAL LETTER ARCHAIC SAMPI
0373 ͳ GREEK SMALL LETTER ARCHAIC SAMPI
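To make the difference between the two starting points concrete for the C# case in the question, here is a small sketch that puts both ranges side by side. The class and method names (GreekRanges, MatchesFrom0374, MatchesFrom0370) are hypothetical; .NET regexes accept the same \uXXXX escape as JavaScript.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical comparison helper for illustration only.
public static class GreekRanges
{
    // Range from the earlier answer: starts at U+0374.
    private static readonly Regex From0374 = new Regex(@"^[\u0374-\u03FF]+$");

    // Full Greek and Coptic block: starts at U+0370.
    private static readonly Regex From0370 = new Regex(@"^[\u0370-\u03FF]+$");

    // True when every character lies in U+0374 to U+03FF.
    public static bool MatchesFrom0374(string text) => From0374.IsMatch(text);

    // True when every character lies in U+0370 to U+03FF.
    public static bool MatchesFrom0370(string text) => From0370.IsMatch(text);
}
```

With these definitions, GreekRanges.MatchesFrom0374("\u0370") is false while GreekRanges.MatchesFrom0370("\u0370") is true, because HETA sits just below U+0374. Both methods accept an ordinary letter such as "\u03B1" (small alpha).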
EDIT: Craig points out that Daniel's regex is correct for the OP. While I can't find where the OP specifies which Greek text he's evaluating, I'll concede that my response is only valid for ancient texts.
While I'm editing this, I want to also point out that no regex here matches Greek characters with the kind of accenting that Perseus adds to their texts. So if you happen to install the Perseus Hopper (http://www.perseus.tufts.edu/hopper/), or use any of their public domain resources in an app, be careful with my regex.
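One possible way around this in C#, offered as a sketch and not as a tested fix for Perseus data: precomposed accented (polytonic) letters live in the Greek Extended block, U+1F00 to U+1FFF, which .NET exposes as \p{IsGreekExtended}. The class name PolytonicGreek and the method IsGreekWord are hypothetical, and the sketch assumes the text uses precomposed characters. Text that stores accents as separate combining marks would not be covered.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class PolytonicGreek
{
    // Greek and Coptic (U+0370 to U+03FF) plus Greek Extended (U+1F00 to U+1FFF).
    // Assumption: accented letters are precomposed, not base letter + combining mark.
    private static readonly Regex GreekWord =
        new Regex(@"^[\p{IsGreekandCoptic}\p{IsGreekExtended}]+$");

    // True when every character falls in one of the two Greek blocks.
    public static bool IsGreekWord(string text) => GreekWord.IsMatch(text);
}
```

For example, the word written as "\u1F00\u03C1\u03C7\u1FC7" (alpha with smooth breathing, rho, chi, eta with circumflex and iota subscript) is accepted by this pattern, while a class containing only \p{IsGreekandCoptic} would reject it because of its first and last characters.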