
Greek characters, Regular Expressions, and C#

I'm building a CMS for a scientific journal that uses a lot of Greek characters. I need to validate a field so that it accepts only a specific character set plus Greek characters. Here's what I have now:

[^a-zA-Z0-9-()/\s]

How do I get this to include Greek characters in addition to alphanumeric, '(', ')', '-', and '_'?

I'm using C#, by the way.


In .NET languages, you can use \p{IsGreekandCoptic} to match Greek characters. So the resulting regex is

[^a-zA-Z0-9-()/\s\p{IsGreekandCoptic}]

\p{IsGreekandCoptic} matches the characters in the Greek and Coptic block (U+0370 to U+03FF), shown in this chart: http://img203.imageshack.us/img203/3760/greekcoptic.png


If you're using a language that uses PCRE for regular expressions with UTF-8 input, /[\x{0374}-\x{03FF}]+/u should match Greek characters. Greek characters fall between U+0374 and U+03FF, and the u modifier tells PCRE to use Unicode. As noted in the comments, /\p{Greek}+/u works as well with PCRE.

If you're using JavaScript, write \uXXXX instead of \x{XXXX}: /[\u0374-\u03FF]+/.
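As a rough sketch of how that range could be folded into a whole-field check in JavaScript: the allowed set below (ASCII letters, digits, '(', ')', '-', '_', whitespace) is an assumption based on the question's description, and the names allowedField and isValidField are made up for illustration.

```javascript
// Hypothetical whitelist: ASCII letters, digits, '(', ')', '-', '_',
// whitespace, plus the Greek range U+0374..U+03FF from the answer above.
// The ^ and $ anchors make the pattern cover the whole field.
const allowedField = /^[a-zA-Z0-9()_\-\s\u0374-\u03FF]*$/;

// Returns true when every character of the text is in the allowed set.
// Note that an empty string also passes because of the * quantifier.
function isValidField(text) {
  return allowedField.test(text);
}

console.log(isValidField("\u03B2-carotene (trans)")); // true  ("β-carotene (trans)")
console.log(isValidField("\u03B1_1 + \u03B2"));       // false ('+' is not allowed)
```

This is the positive form of the check (match means valid). The negated class in the question works the other way round: a match means an invalid character was found.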

Also see this guide to Unicode Regular Expressions for more information.


For Java, from the Pattern javadoc:

\p{InGreek} A character in the Greek block (simple block)


As this is my first response on SO, I can't downvote Daniel's answer on the JavaScript regex.

I know this is very late, but Daniel's answer is incorrect. It excludes the ancient characters below! This is important if you're working on a Bible app that researches words in ancient Greek!

This is the correct regex for finding Greek and Coptic characters in JavaScript:

/[\u0370-\u03FF]+/gm 

http://unicode.org/charts/PDF/U0370.pdf

Excerpt from chart:

0370 Ͱ GREEK CAPITAL LETTER HETA → 2C75 Ⱶ  latin capital letter half h

0371 ͱ GREEK SMALL LETTER HETA → 2C76 ⱶ  latin small letter half h

0372 Ͳ GREEK CAPITAL LETTER ARCHAIC SAMPI

0373 ͳ GREEK SMALL LETTER ARCHAIC SAMPI
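The difference between the two ranges can be checked directly. This is only a small sketch; the variable names are illustrative, and the patterns are anchored so that each test covers the whole string.

```javascript
// Range starting at U+0374 versus the full block starting at U+0370.
const from0374 = /^[\u0374-\u03FF]+$/;
const from0370 = /^[\u0370-\u03FF]+$/;

const heta = "\u0370";                          // Ͱ GREEK CAPITAL LETTER HETA
const logos = "\u03BB\u03CC\u03B3\u03BF\u03C2"; // λόγος (monotonic, precomposed)

console.log(from0374.test(heta));  // false: U+0370 is below the range
console.log(from0370.test(heta));  // true
console.log(from0374.test(logos)); // true
console.log(from0370.test(logos)); // true
```

For everyday modern Greek both ranges behave the same; they differ only for the four code points U+0370 to U+0373 listed in the excerpt.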

EDIT: Craig points out that Daniel's regex is correct for the OP. While I can't find where the OP specifies which Greek text he's evaluating, I'll concede that my response is only valid for ancient texts.

While I'm editing this, I also want to point out that no regex here matches Greek characters with the kind of accenting that Perseus adds to their texts. So if you happen to install the Perseus Hopper (http://www.perseus.tufts.edu/hopper/), or use any of their public domain resources in an app, be careful with my regex.
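To illustrate the accenting caveat: precomposed polytonic letters live in the Greek Extended block (U+1F00 to U+1FFF), outside U+0370 to U+03FF. The sketch below assumes the text uses precomposed characters, or can be brought into that form with normalize("NFC"). Which form any particular source uses is not stated here, so treat the extended range as one possible approach and not a complete answer. The variable names are illustrative.

```javascript
// Basic block only, versus basic block plus Greek Extended (U+1F00..U+1FFF).
const basicGreek = /^[\u0370-\u03FF]+$/;
const withExtended = /^[\u0370-\u03FF\u1F00-\u1FFF]+$/;

// "ἐν" written with a precomposed epsilon-with-psili (U+1F10).
const precomposed = "\u1F10\u03BD";
// The same word written as epsilon + combining comma above (U+0313).
const decomposed = "\u03B5\u0313\u03BD";

console.log(basicGreek.test(precomposed));   // false: U+1F10 is outside the basic block
console.log(withExtended.test(precomposed)); // true
console.log(withExtended.test(decomposed));  // false: U+0313 is a combining mark

// Normalizing to NFC composes the pair into U+1F10 before matching.
console.log(withExtended.test(decomposed.normalize("NFC"))); // true
```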
