How to parse (text only) web sites while crawling
I can successfully run the crawl command via Cygwin on Windows XP, and I can also run web searches using Tomcat.
But I also want to save the parsed pages during the crawl.
So when I start crawling like this:
bin/nutch crawl urls -dir crawled -depth 3
I also want to save the parsed HTML pages to text files. I mean, during the crawl started with the command above, whenever Nutch fetches a page it should also automatically save the parsed (text-only) version of that page to a text file, and the file names could be the fetched URLs.
I really need help with this; it will be used in my university language-detection project.
Thanks.
The crawled pages are stored in the segments. You can access them by dumping the segment content:
nutch readseg -dump crawl/segments/20100104113507/ dump
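By default the dump includes everything in the segment (raw content, fetch data, and so on). Since you only want the parsed text, readseg accepts flags to skip the other parts; in Nutch 1.x these are -nocontent, -nofetch, -nogenerate, -noparse and -noparsedata (run bin/nutch readseg with no arguments to confirm the exact set for your version):

    bin/nutch readseg -dump crawl/segments/20100104113507/ dump \
        -nocontent -nofetch -nogenerate -noparse -noparsedata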
You will have to do this for each segment.
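Since a crawl of depth 3 produces several segments, a small shell loop saves doing it by hand. A minimal sketch, assuming the crawl/segments layout created by the crawl command above:

    #!/bin/sh
    # Dump the parsed text of every segment into its own output directory.
    for segment in crawl/segments/*; do
      bin/nutch readseg -dump "$segment" "dump/$(basename "$segment")" \
          -nocontent -nofetch -nogenerate -noparse -noparsedata
    done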
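Each dump ends up as a single plain-text file (dump/<segment>/dump) in which every record starts with a Recno:: line, followed by a URL:: line and the sections you kept, such as ParseText::. To get one text file per page named after its fetched URL, you can split that file yourself. The awk sketch below assumes that record layout, which can differ between Nutch versions, so inspect your own dump first and adjust the patterns if needed:

    #!/bin/sh
    # Split a readseg dump into one text file per page, named after the URL.
    # The Recno::/URL::/ParseText:: markers are assumptions about the dump
    # layout; verify them against your own dump file.
    mkdir -p pages
    awk '
      /^URL::/ {
          url = $2
          gsub(/[^A-Za-z0-9._-]/, "_", url)   # sanitize the URL into a safe file name
          out = "pages/" url ".txt"
          intext = 0
      }
      /^ParseText::/ { intext = 1; next }     # parsed-text section begins
      /^Recno::/ {                            # next record ends the section
          if (out != "") close(out)
          intext = 0
      }
      intext && out != "" { print > out }
    ' dump/20100104113507/dump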