How to find non-alphabetic characters using Java

I am processing a text corpus. It contains characters from several different languages, plus symbols, numbers, etc.

-> All I need to do is skip symbols such as arrows, hearts, etc.

-> I must not damage any characters belonging to the different languages.

Any leads?

----UPDATE----

Character.isLetter('\unicode') works for most characters, but not all. I have checked it against my regional languages: it works for some characters but not for every one.

Thanks.


If I understand correctly, the characters you want to remove belong to a rather limited set. Why not just check for those? Unicode has a whole bunch of non-letter characters, but in your case the non-letter characters you actually encounter will probably be a small subset of what exists.
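
A minimal sketch of that idea, assuming the unwanted symbols really are few and known in advance. The class name KnownSymbolFilter and the three code points in the blocklist (an arrow and two hearts) are hypothetical placeholders, not taken from the question; Set.of needs Java 9 or later.

```java
import java.util.Set;

public class KnownSymbolFilter {
    // Hypothetical blocklist: U+2192 (rightwards arrow), U+2665 (heart suit),
    // U+2764 (heavy black heart). Replace with the symbols seen in your corpus.
    private static final Set<Integer> BLOCKED = Set.of(0x2192, 0x2665, 0x2764);

    // Returns the text with every blocked code point removed.
    // Iterating over code points (not chars) keeps supplementary characters intact.
    public static String filter(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> !BLOCKED.contains(cp))
            .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(filter("I \u2665 Java \u2192 OK"));
    }
}
```

Everything not on the blocklist passes through untouched, so no language's characters can be damaged by accident; the cost is that any symbol you did not anticipate stays in the text.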

Sounds like a job for regular expressions, if you ask me. Remove everything that's not a word character, digit or whitespace, and you've probably got it. Or create an array containing all characters you want filtered out (which in that case should be few and known).
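
A hedged sketch of the regex approach. Two assumptions go beyond what is said above: Java's \w matches only ASCII word characters by default, so the pattern uses Unicode categories instead (\p{L} letters, \p{N} numbers), and it also keeps \p{M} (combining marks), because vowel signs in many Indic scripts are marks rather than letters, which may be why Character.isLetter rejects some characters. The class name SymbolStripper is made up for the example.

```java
import java.util.regex.Pattern;

public class SymbolStripper {
    // Matches runs of anything that is NOT a letter, combining mark,
    // number, or (ASCII) whitespace. Note that punctuation is removed too.
    private static final Pattern UNWANTED =
            Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\s]+");

    // Returns the text with all unwanted characters deleted.
    public static String strip(String text) {
        return UNWANTED.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Heart and arrow are dropped; Latin and Devanagari text survive.
        System.out.println(strip("I \u2665 Java \u2192 \u0939\u093F\u0928\u094D\u0926\u0940 123"));
    }
}
```

If punctuation should survive, adding \p{P} to the character class is one option; symbols such as arrows and hearts fall under \p{S} and would still be removed.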


You could implement a Charset that contains only the characters you want. You can then provide a CharsetDecoder to decode the text and strip out the characters you want to skip.
