
Solr index empty after nutch solrindex command

I'm using Nutch and Solr to index a file share.

I first issue: bin/nutch crawl urls

Which gives me:

solrUrl is not set, indexing will be skipped...
crawl started in: crawl-20110804191414
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2011-08-04 19:14:14
Injector: crawlDb: crawl-20110804191414/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-08-04 19:14:16, elapsed: 00:00:02
Generator: starting at 2011-08-04 19:14:16
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl-20110804191414/segments/20110804191418
Generator: finished at 2011-08-04 19:14:20, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-08-04 19:14:20
Fetcher: segment: crawl-20110804191414/segments/20110804191418
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
fetching file:///mnt/public/Personal/Reminder Building Security.htm
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-08-04 19:14:22, elapsed: 00:00:02
ParseSegment: starting at 2011-08-04 19:14:22
ParseSegment: segment: crawl-20110804191414/segments/20110804191418
ParseSegment: finished at 2011-08-04 19:14:23, elapsed: 00:00:01
CrawlDb update: starting at 2011-08-04 19:14:23
CrawlDb update: db: crawl-20110804191414/crawldb
CrawlDb update: segments: [crawl-20110804191414/segments/20110804191418]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-08-04 19:14:24, elapsed: 00:00:01
Generator: starting at 2011-08-04 19:14:24
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2011-08-04 19:14:25
LinkDb: linkdb: crawl-20110804191414/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/nutch/nutch-1.3/runtime/local/crawl-20110804191414/segments/20110804191418
LinkDb: finished at 2011-08-04 19:14:26, elapsed: 00:00:01
crawl finished: crawl-20110804191414
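The first line of that log, "solrUrl is not set, indexing will be skipped...", suggests the all-in-one crawl never pushed anything to Solr. If I remember the Nutch 1.3 usage correctly, the crawl command can take the Solr URL directly so indexing happens as part of the run (treat the exact flags as an assumption and verify them by running `bin/nutch crawl` with no arguments):

```shell
# Hypothetical invocation: pass the Solr URL so the crawl step
# indexes into Solr instead of skipping indexing. Depth/topN
# values are illustrative.
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 5 -topN 50
```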

Then I: bin/nutch solrindex http://localhost:8983/solr/ crawl-20110804191414/crawldb crawl-20110804191414/linkdb crawl-20110804191414/segments/*

Which gives me:

SolrIndexer: starting at 2011-08-04 19:17:07
SolrIndexer: finished at 2011-08-04 19:17:08, elapsed: 00:00:01

When I run a *:* query on Solr I get:

<response>
     <lst name="responseHeader">
          <int name="status">0</int>
          <int name="QTime">2</int>
          <lst name="params">
               <str name="indent">on</str>
               <str name="start">0</str>
               <str name="q">*:*</str>
               <str name="version">2.2</str>
               <str name="rows">10</str>
          </lst>
     </lst>
     <result name="response" numFound="0" start="0"/>
</response>

:(

Note that this worked fine when I tried to use protocol-http to crawl a website but does not work when I use protocol-file to crawl a file system.

---EDIT--- After trying this again today I noticed that files with spaces in their names were causing 404 errors, and a lot of the files on the share I'm indexing have spaces in them. However, the thumbs.db files were making it in OK, which tells me the problem is not what I thought it was.
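As a side note on the spaces, one hedged workaround is to percent-encode them in the seed list before injecting; whether Nutch's URL normalizer already handles this depends on configuration, so treat this as a sketch (the path below is just the one from the log):

```shell
# Percent-encode spaces so the file: URLs in the seed list are
# valid URIs before Nutch injects them.
printf 'file:///mnt/public/Personal/Reminder Building Security.htm\n' > /tmp/seed.txt
sed 's/ /%20/g' /tmp/seed.txt
# -> file:///mnt/public/Personal/Reminder%20Building%20Security.htm
```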


I've spent much of today retracing your steps. I eventually resorted to printf debugging in /opt/nutch/src/java/org/apache/nutch/indexer/IndexerMapReduce.java, which showed me that each URL I was trying to index was appearing twice, once starting with file:///var/www/Engineering/, as I'd originally specified, and once starting with file:/u/u60/Engineering/. On this system, /var/www/Engineering is a symlink to /u/u60/Engineering. Further, the /var/www/Engineering URLs were rejected because the parseText field wasn't supplied and the /u/u60/Engineering URLs were rejected because the fetchDatum field wasn't supplied. Specifying the original URLs in the /u/u60/Engineering form solved my problem. Hope that helps the next sap in this situation.
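To catch this kind of symlink aliasing up front, the seed paths can be canonicalized before writing the URL list; `readlink -f` resolves a link to the real path (the directories below are placeholders for illustration):

```shell
# Demonstrate resolving a symlinked directory to its canonical
# path, so the seed URL matches the path the file protocol
# plugin actually reports back.
mkdir -p /tmp/nutch-demo/real
ln -sfn /tmp/nutch-demo/real /tmp/nutch-demo/link
canonical=$(readlink -f /tmp/nutch-demo/link)
echo "file://$canonical/"   # use this canonical form in the seed list
```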


This is because Solr didn't get any data to index. It seems you have not executed the previous commands properly. Restart the whole process and then try the last command. You can copy the commands from here: https://wiki.apache.org/nutch/NutchTutorial or refer to my video on YouTube: https://www.youtube.com/watch?v=aEap3B3M-PU&t=449s
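For reference, the step-by-step form of that tutorial's crawl cycle looks roughly like this for Nutch 1.x; the `crawl/` directory layout and segment handling are illustrative, so check each subcommand against your version's `bin/nutch` usage output:

```shell
# Sketch of the manual Nutch 1.x cycle from the tutorial;
# directory names are placeholders.
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
segment=$(ls -d crawl/segments/* | tail -1)   # newest segment
bin/nutch fetch "$segment"
bin/nutch parse "$segment"
bin/nutch updatedb crawl/crawldb "$segment"
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
```

Running the phases individually also makes it obvious which step produced no data when the final index comes up empty.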
