Solr indexing HTML entities
I am indexing documents with Solr that were scraped from the web. The documents contain HTML entities (such as &pound; or &#163;). The documents mostly contain Central European characters. Is there a char filter for this task? I know about solr.MappingCharFilterFactory, but using it would mean that I have to define the mappings myself. I would be happier with a shared solution maintained by a community. Thanks for your help!
There is solr.HTMLStripCharFilterFactory, which converts HTML entities, but it also strips HTML tags.
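
For illustration, here is a minimal sketch of how that char filter could be wired into a field type in the schema. The field type name (text_html_entities), the tokenizer, and the lowercase filter are assumptions chosen for the example, not something from the question; adjust them to your own analysis chain.

```xml
<!-- Hypothetical field type; the name and the tokenizer/filter choices are assumptions. -->
<fieldType name="text_html_entities" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run before the tokenizer. This one decodes HTML entities
         (named or numeric) and, as noted above, also removes HTML tags. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Keep in mind that a char filter only affects the text that gets analyzed and indexed. The stored value of the field is still the original input, so the entities will still appear in what Solr returns unless you also decode them before sending the documents.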