
Filtering information from large bodies of text

Is there a best practice, algorithm or software (open source with a permissive license required...) which can find information from bodies of text? I'm referring to:

  • find all email addresses in a text
  • find all mentions of cities
  • find all mentions of states
  • find all urls
  • find all mentions of telephone numbers
  • find all mentions of zipcodes ... with the ability to add more ...

I heard RapidMiner should be able to do text mining like this, but AGPL is not an acceptable license for my purpose.

Is there anything 'standard' to do this kind of analysis?


Read about Named Entity Recognition. You can try Apache OpenNLP or Apache UIMA, both of which have the, well, Apache license.


For entity types like these, you can use a rule-based NER tool such as gexp.
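To illustrate what a rule-based approach looks like, here is a minimal sketch using only Python's standard `re` module. It is not how gexp, OpenNLP, or UIMA work internally. The pattern names, the `extract_entities` and `gazetteer_pattern` helpers, and the tiny city/state lists are all hypothetical, and the patterns are deliberately simplified (US-style phone numbers and ZIP codes only).

```python
import re


def gazetteer_pattern(names):
    """Build a whole-word pattern from a list of known names (a gazetteer)."""
    # Longest names first so "New York City" wins over "New York".
    ordered = sorted(names, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b")


# Hypothetical, simplified rules; real-world formats vary far more than this.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "url": re.compile(r"https?://[^\s<>\"']+"),
    "us_phone": re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"),
    "us_zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
    # Cities and states have no fixed shape, so they need lookup lists.
    # These sample lists are placeholders for a real gazetteer.
    "city": gazetteer_pattern(["Springfield", "New York", "New York City"]),
    "us_state": gazetteer_pattern(["IL", "Illinois", "NY"]),
}


def extract_entities(text):
    """Return a dict mapping each entity label to the matches found in text."""
    found = {}
    for label, pattern in PATTERNS.items():
        # Strip sentence punctuation that greedy patterns (e.g. URLs) pick up.
        found[label] = [m.group(0).rstrip(".,;:!?") for m in pattern.finditer(text)]
    return found


if __name__ == "__main__":
    sample = (
        "Mail jane.doe@example.com or see https://example.com/about. "
        "Call 555-123-4567. Office: Springfield, IL 62704."
    )
    for label, matches in extract_entities(sample).items():
        print(label, matches)
```

Adding a new entity type means adding one more entry to `PATTERNS`, which covers the "ability to add more" requirement. Regular expressions suit fixed-format items (emails, URLs, phone numbers, ZIP codes); cities and states depend on the quality of the lookup list, and ambiguous names are where a statistical NER model tends to do better.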
