
Solr index empty after nutch solrindex command

I'm using Nutch and Solr to index a file share.

I first issue: bin/nutch crawl urls

Which gives me:

solrUrl is not set, indexing will be skipped...
crawl started in: crawl-20110804191414
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2011-08-04 19:14:14
Injector: crawlDb: crawl-20110804191414/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-08-04 19:14:16, elapsed: 00:00:02
Generator: starting at 2011-08-04 19:14:16
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl-20110804191414/segments/20110804191418
Generator: finished at 2011-08-04 19:14:20, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-08-04 19:14:20
Fetcher: segment: crawl-20110804191414/segments/20110804191418
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
fetching file:///mnt/public/Personal/Reminder Building Security.htm
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-08-04 19:14:22, elapsed: 00:00:02
ParseSegment: starting at 2011-08-04 19:14:22
ParseSegment: segment: crawl-20110804191414/segments/20110804191418
ParseSegment: finished at 2011-08-04 19:14:23, elapsed: 00:00:01
CrawlDb update: starting at 2011-08-04 19:14:23
CrawlDb update: db: crawl-20110804191414/crawldb
CrawlDb update: segments: [crawl-20110804191414/segments/20110804191418]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-08-04 19:14:24, elapsed: 00:00:01
Generator: starting at 2011-08-04 19:14:24
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2011-08-04 19:14:25
LinkDb: linkdb: crawl-20110804191414/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/nutch/nutch-1.3/runtime/local/crawl-20110804191414/segments/20110804191418
LinkDb: finished at 2011-08-04 19:14:26, elapsed: 00:00:01
crawl finished: crawl-20110804191414
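The first line of that log, "solrUrl is not set, indexing will be skipped...", suggests the all-in-one crawl never pushed anything to Solr. If I remember the Nutch 1.3 usage correctly, the crawl command can take the Solr URL directly so indexing happens as part of the run (treat the exact flags as an assumption and verify them by running `bin/nutch crawl` with no arguments):

```shell
# Hypothetical invocation: pass the Solr URL so the crawl step
# indexes into Solr instead of skipping indexing. Depth/topN
# values are illustrative.
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 5 -topN 50
```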

Then I: bin/nutch solrindex http://localhost:8983/solr/ crawl-20110804191414/crawldb crawl-20110804191414/linkdb crawl-20110804191414/segments/*

Which gives me:

SolrIndexer: starting at 2011-08-04 19:17:07
SolrIndexer: finished at 2011-08-04 19:17:08, elapsed: 00:00:01

When I run a *:* query on Solr I get:

<response>
     <lst name="responseHeader">
          <int name="status">0</int>
          <int name="QTime">2</int>
          <lst name="params">
               <str name="indent">on</str>
               <str name="start">0</str>
               <str name="q">*:*</str>
               <str name="version">2.2</str>
               <str name="rows">10</str>
          </lst>
     </lst>
     <result name="response" numFound="0" start="0"/>
</response>

:(

Note that this worked fine when I tried to use protocol-http to crawl a website but does not work when I use protocol-file to crawl a file system.

---EDIT--- After trying this again today I noticed that files with spaces in their names were causing 404 errors, and a lot of the files on the share I'm indexing have spaces in them. However, the thumbs.db files were making it in OK, which tells me the problem is not what I thought it was.
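As a side note on the spaces, one hedged workaround is to percent-encode them in the seed list before injecting; whether Nutch's URL normalizer already handles this depends on configuration, so treat this as a sketch (the path below is just the one from the log):

```shell
# Percent-encode spaces so the file: URLs in the seed list are
# valid URIs before Nutch injects them.
printf 'file:///mnt/public/Personal/Reminder Building Security.htm\n' > /tmp/seed.txt
sed 's/ /%20/g' /tmp/seed.txt
# -> file:///mnt/public/Personal/Reminder%20Building%20Security.htm
```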


I've spent much of today retracing your steps. I eventually resorted to printf debugging in /opt/nutch/src/java/org/apache/nutch/indexer/IndexerMapReduce.java, which showed me that each URL I was trying to index was appearing twice, once starting with file:///var/www/Engineering/, as I'd originally specified, and once starting with file:/u/u60/Engineering/. On this system, /var/www/Engineering is a symlink to /u/u60/Engineering. Further, the /var/www/Engineering URLs were rejected because the parseText field wasn't supplied and the /u/u60/Engineering URLs were rejected because the fetchDatum field wasn't supplied. Specifying the original URLs in the /u/u60/Engineering form solved my problem. Hope that helps the next sap in this situation.
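To catch this kind of symlink aliasing up front, the seed paths can be canonicalized before writing the URL list; `readlink -f` resolves a link to the real path (the directories below are placeholders for illustration):

```shell
# Demonstrate resolving a symlinked directory to its canonical
# path, so the seed URL matches the path the file protocol
# plugin actually reports back.
mkdir -p /tmp/nutch-demo/real
ln -sfn /tmp/nutch-demo/real /tmp/nutch-demo/link
canonical=$(readlink -f /tmp/nutch-demo/link)
echo "file://$canonical/"   # use this canonical form in the seed list
```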


This is because Solr didn't get any data to index. It seems you have not executed the previous commands properly. Restart the whole process and then try the last command. You can copy the commands from here: https://wiki.apache.org/nutch/NutchTutorial or refer to my video on YouTube: https://www.youtube.com/watch?v=aEap3B3M-PU&t=449s
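For reference, the step-by-step form of that tutorial's crawl cycle looks roughly like this for Nutch 1.x; the `crawl/` directory layout and segment handling are illustrative, so check each subcommand against your version's `bin/nutch` usage output:

```shell
# Sketch of the manual Nutch 1.x cycle from the tutorial;
# directory names are placeholders.
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
segment=$(ls -d crawl/segments/* | tail -1)   # newest segment
bin/nutch fetch "$segment"
bin/nutch parse "$segment"
bin/nutch updatedb crawl/crawldb "$segment"
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
```

Running the phases individually also makes it obvious which step produced no data when the final index comes up empty.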
