How can I crawl PDF files served over HTTP with Nutch 1.0?
I want to know how I can crawl PDF files that are served on the internet with Nutch 1.0 over the HTTP protocol.
I am able to do it on local file systems using the file:// protocol, but not over HTTP.
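Add this property to your nutch-site.xml file. The important part is parse-(html|text|pdf) in plugin.includes, which enables the PDF parser plugin, so that PDF files fetched over HTTP are parsed as well: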
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|text|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming the plugins to load. parse-pdf is included so that fetched PDF content is parsed.</description>
</property>
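If PDFs are fetched but still fail to parse, one thing worth checking is the HTTP content size limit. This is a suggestion that goes beyond the original answer: Nutch truncates downloaded content at `http.content.limit` bytes (65536 in the stock nutch-default.xml, as far as I recall), and a truncated PDF usually cannot be parsed. A minimal sketch of an override for nutch-site.xml, assuming the default is still in effect in your setup; the description text is my own wording:

```xml
<!-- Assumption: the default http.content.limit (65536 bytes) is cutting PDFs short. -->
<!-- A negative value disables truncation; a larger positive limit works as well. -->
<property>
<name>http.content.limit</name>
<value>-1</value>
<description>Maximum length in bytes of content downloaded over HTTP. A negative value means no truncation.</description>
</property>
```

Removing the limit entirely means very large files are downloaded in full, so a generous positive value may be the safer choice on a broad crawl.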