How to properly configure Apache Tika for a few document types?
I've been using Tika for a while, and I know that one is supposed to use only the Tika facade with either the default or a custom TikaConfig, which represents the org/apache/tika/mime/tika-mimetypes.xml file.
My application doesn't allow any document types other than html, doc, docx, odt, txt, rtf, srt, sub, pdf, odf, odp, xls, ppt, and msg, and the default set of MediaTypes includes tons of others.
Are we supposed to modify tika-mimetypes.xml to remove the MimeTypes that we don't need? Then, as I understand it, Tika will create composite parsers and detectors only for these MimeTypes.
But what happens when it is given an unsupported type? Should I just catch TikaException or some SAXException and decline the file?
Also, how is one supposed to manually edit tika-mimetypes.xml? It has 1290 MimeTypes, most of them ridiculous third-party types. Why are they there?
If you only want to accept certain types, you'll still want the full mimetypes set. Otherwise, how else can you detect that the file someone has just given you is in fact an MP3, and not one of your approved formats? So keep the full mimetypes set for detection.
Once you've done the detection step and decided it's a valid mimetype, you could just pass the file on to the AutoDetectParser and be done with it. After all, you'd check the mimetype returned by the detector and bail out already if it isn't one you like.
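The check on the detected type can be sketched roughly as below. This is only an illustration: the class name MimeAllowList and its methods are made up for this example, and the set holds an illustrative subset of the formats from the question. It is plain Java with no Tika dependency, so the detected type is passed in as a string.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical allow-list gate applied to the MIME type reported by the detection step. */
final class MimeAllowList {

    // Illustrative subset of the formats from the question; extend as needed.
    private static final Set<String> ALLOWED = Set.of(
        "text/html",
        "text/plain",
        "application/rtf",
        "application/pdf",
        "application/msword",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "application/vnd.oasis.opendocument.text",
        "application/vnd.oasis.opendocument.presentation",
        "application/vnd.ms-excel",
        "application/vnd.ms-powerpoint",
        "application/vnd.ms-outlook");

    private MimeAllowList() {}

    /** Strips parameters such as "; charset=UTF-8" and lower-cases the type. */
    static String baseType(String detected) {
        int semicolon = detected.indexOf(';');
        String base = semicolon >= 0 ? detected.substring(0, semicolon) : detected;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns true only when the detected type is on the allow-list. */
    static boolean isAllowed(String detected) {
        return detected != null && ALLOWED.contains(baseType(detected));
    }
}
```

In a real application the string would come from the detection step, for example the Tika facade's detect method, and only files for which isAllowed returns true would be handed to the AutoDetectParser. The type strings Tika reports for subtitle formats such as srt and sub are best checked against the Tika version in use before adding them to the set.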
However, if you want an extra check, there are two ways to do it. One is to have a custom org.apache.tika.parser.Parser file, which lists only the parsers for the formats you want to have used. This is the config file that's used to decide which parsers to make available to the AutoDetectParser, so if, for example, you removed the MP3Parser from that list, then the auto detect parser would stop handling MP3.
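As a rough idea of what such a trimmed-down file could look like: the file is a service listing (META-INF/services/org.apache.tika.parser.Parser) with one parser class name per line. The class names below are the Tika 1.x ones as far as I know; some packages moved in later releases, so check them against the jar you actually use. How a custom copy takes precedence over the one bundled with the parsers jar depends on your packaging and is not shown here.

```
# META-INF/services/org.apache.tika.parser.Parser
# Only the parsers for the approved formats (Tika 1.x class names, assumed)
org.apache.tika.parser.html.HtmlParser
org.apache.tika.parser.microsoft.OfficeParser
org.apache.tika.parser.microsoft.ooxml.OOXMLParser
org.apache.tika.parser.odf.OpenDocumentParser
org.apache.tika.parser.pdf.PDFParser
org.apache.tika.parser.rtf.RTFParser
org.apache.tika.parser.txt.TXTParser
```

OfficeParser covers the legacy Microsoft formats (doc, xls, ppt, msg), OOXMLParser covers docx, and OpenDocumentParser covers odt, odp, and odf.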
The other way is just to have an explicit list of the parsers you wish to support. Then, rather than using the auto detect parser, simply iterate through all of them until you get to one that is able to work on the file, and directly call the parse method on that. This will give you the most control, but possibly with slightly more work.
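The iteration itself might look something like the sketch below. To keep it runnable without Tika on the classpath, DocParser is a hypothetical stand-in for Tika's Parser interface, and ExplicitParserList is likewise a made-up name. With real Tika parsers you would ask each one for getSupportedTypes(ParseContext) and then call parse(InputStream, ContentHandler, Metadata, ParseContext) on the first match.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical stand-in for a Tika parser: the types it handles plus a parse call. */
interface DocParser {
    Set<String> supportedTypes();
    String parse(byte[] content);
}

/** Tries an explicit, ordered list of parsers and uses the first that supports the type. */
final class ExplicitParserList {
    private final List<DocParser> parsers;

    ExplicitParserList(List<DocParser> parsers) {
        this.parsers = List.copyOf(parsers);
    }

    /** Returns the first parser in the list that declares support for the given type. */
    Optional<DocParser> findFor(String mimeType) {
        for (DocParser parser : parsers) {
            if (parser.supportedTypes().contains(mimeType)) {
                return Optional.of(parser);
            }
        }
        return Optional.empty();
    }

    /** Parses with the matching parser, or rejects the file if none is approved. */
    String parse(String mimeType, byte[] content) {
        DocParser parser = findFor(mimeType).orElseThrow(
            () -> new IllegalArgumentException("No approved parser for " + mimeType));
        return parser.parse(content);
    }
}
```

Because the list is the only source of parsers, anything not on it is rejected before any parsing code runs, which is where the extra control comes from.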