Unsupervised named entity recognition (NER) with a custom controlled vocabulary for crosslink suggestions in Java
I'm looking for a Java library that can do Named entity recognition (NER) with a custom controlled vocabulary, without needing labeled training data first. I searched some on SE, but most questions are rather unspecific.
Consider the following use-case:
- an editor is inputting articles in a CMS (about 500 words).
- the text may contain references (in plain text) to entities of a specific domain, e.g.:
- names of points of interest, like bars, restaurants, as well as neighborhoods, etc.
- a controlled vocabulary of these entities exists (about 5,000 entities).
- I imagine an entity to be a tuple in the vocabulary
- after finishing the text, the user should be able to save the document.
- This triggers a workflow that scans the text against the vocabulary by comparing it with each entity's name. A 100% match is not required: 97% on Jaro-Winkler or whatever may be enough (I'm not familiar with the algorithms NER uses). I need this threshold to be configurable.
- Hits are returned to the controller server-side, which in turn returns JSON to the client containing the matched entities. These are presented to the editor as suggested crosslinks.
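To make the scan step concrete, here is a minimal sketch in plain Java of what I have in mind, assuming a simple sliding-window comparison with Jaro-Winkler. The class name `CrosslinkSuggester`, its methods, and the sample vocabulary are all made up for illustration and do not come from any library. It only compares windows with the same number of tokens as the entity name, so it is a rough baseline rather than real NER.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not a library class): fuzzy dictionary lookup over a text.
final class CrosslinkSuggester {

    // Jaro similarity in [0, 1]; 1.0 means the strings are identical.
    static double jaro(String s, String t) {
        if (s.equals(t)) return 1.0;
        int sl = s.length(), tl = t.length();
        if (sl == 0 || tl == 0) return 0.0;
        int window = Math.max(0, Math.max(sl, tl) / 2 - 1);
        boolean[] sMatched = new boolean[sl], tMatched = new boolean[tl];
        int matches = 0;
        for (int i = 0; i < sl; i++) {
            int lo = Math.max(0, i - window), hi = Math.min(tl - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!tMatched[j] && s.charAt(i) == t.charAt(j)) {
                    sMatched[i] = true;
                    tMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order.
        int outOfOrder = 0, k = 0;
        for (int i = 0; i < sl; i++) {
            if (!sMatched[i]) continue;
            while (!tMatched[k]) k++;
            if (s.charAt(i) != t.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        return (m / sl + m / tl + (m - outOfOrder / 2.0) / m) / 3.0;
    }

    // Jaro-Winkler: boosts Jaro for a shared prefix of up to 4 chars (scaling factor 0.1).
    static double jaroWinkler(String s, String t) {
        double j = jaro(s, t);
        int max = Math.min(4, Math.min(s.length(), t.length()));
        int prefix = 0;
        while (prefix < max && s.charAt(prefix) == t.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    // Lower-case the input and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!tok.isEmpty()) tokens.add(tok);
        }
        return tokens;
    }

    // Returns entity name -> best score, for entities scoring at or above the threshold.
    static Map<String, Double> suggest(String text, List<String> vocabulary, double threshold) {
        List<String> words = tokenize(text);
        Map<String, Double> hits = new LinkedHashMap<>();
        for (String name : vocabulary) {
            List<String> nameTokens = tokenize(name);
            int n = nameTokens.size();
            if (n == 0) continue;
            String normalized = String.join(" ", nameTokens);
            double best = 0.0;
            // Slide a window of n tokens over the text and keep the best score.
            for (int i = 0; i + n <= words.size(); i++) {
                String candidate = String.join(" ", words.subList(i, i + n));
                best = Math.max(best, jaroWinkler(candidate, normalized));
            }
            if (best >= threshold) hits.put(name, best);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample data; "Parot" is a deliberate misspelling.
        List<String> vocabulary = Arrays.asList("Blue Parrot", "Jordaan", "Vondelpark");
        String text = "We had dinner at the Blue Parot in Jordaan yesterday.";
        System.out.println(suggest(text, vocabulary, 0.95));
    }
}
```

With about 500 words and 5,000 entities this is a few million string comparisons per save, which is probably acceptable server-side, but a dedicated dictionary chunker would likely scale and match better. The returned map could be serialized to JSON by whatever the controller already uses.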
Ideally, I'm looking for a project that uses NER to suggest crosslinks within a CMS environment that I can piggyback on. (I'm sure such plugins exist for WordPress, for example; I'm not so sure whether something similar exists in Java.)
All other, more general pointers to NER libraries that work with custom controlled vocabularies are welcome as well.
For people looking this up in the future:
"Approximate Dictionary-Based Chunking" see: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html
(URL edited.)
Unsure if these might be helpful: http://www-nlp.stanford.edu/software/CRF-NER.shtml http://cogcomp.cs.illinois.edu/page/software