
What is the difference between crawling, parsing, indexing, and searching, from a Python libraries perspective? [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center. Closed 11 years ago.

I am confused between these terms; they somehow all look the same to me. Can someone please explain the steps, the order in which they are performed, and which libraries can do the work?

I want to know, at each step, what the input is and what the output is, e.g.

Crawling
Input = URL
Output = ?

Indexing
Input = ?


I'll give you a general description, algorithmically; map it onto your Python libs yourself.

Crawling: starting from a set of URLs, the goal is to expand that set. The crawler follows outgoing links and tries to expand the graph as much as it can (until it covers the part of the web graph connected to the initial set of URLs, or until resources [usually time] run out). So:
input = set of URLs
output = a bigger set of URLs reachable from the input
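For concreteness, here is a minimal crawling sketch in plain Python. It is my illustration, not part of the original answer: a breadth-first expansion of the URL set using the third-party `requests` library and the standard library's `HTMLParser`. Politeness (robots.txt, rate limiting) and most error handling are omitted.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

import requests


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, max_pages=50):
    """Input: a set of URLs. Output: a bigger set of URLs reachable from it."""
    seen = set(seed_urls)
    queue = deque(seed_urls)
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue  # skip unreachable pages
        extractor = LinkExtractor()
        extractor.feed(response.text)
        for link in extractor.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen
```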

Indexing: using the data the crawlers gathered to "index" the files. An index is essentially a list that maps each term (usually a word) in the collection to the documents that the term appears in.
input: set of URLs (more precisely, the documents fetched from them)
output: index file/library.
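A toy inverted index can be built in a few lines. This is again only an illustration; real indexers also tokenize properly, stem terms, and store positions and frequencies. Here I assume the crawler's output has been fetched into a `{url: page_text}` dict:

```python
import re
from collections import defaultdict


def build_index(documents):
    """Input: {url: page_text}. Output: {term: set of URLs containing it}."""
    index = defaultdict(set)
    for url, text in documents.items():
        # Naive tokenization: lowercase runs of word characters.
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(url)
    return index
```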

Search: use the index to find documents relevant to a given query.
input: a query (string) and the index [usually an implicit argument, since it is part of the state]
output: documents relevant to the query (a document here is actually a web page that was crawled)
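Continuing the toy example above, a boolean-AND search over that index might look like this (real engines rank results rather than just intersecting sets):

```python
def search(index, query):
    """Input: a query string plus the index. Output: matching documents."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Intersect the posting sets of all query terms (boolean AND).
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Tying the three steps together (example.com is a placeholder):
# urls = crawl({"https://example.com/"})
# docs = {u: requests.get(u).text for u in urls}
# print(search(build_index(docs), "python crawling"))
```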

I encourage you to have a look at PyLucene, which does all of these things (and more!), and to read some more about Information Retrieval.


You should also check out Scrapy, a Python web crawling framework:

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

It crawls the sites periodically and extracts the data of interest, which you can specify using XPath selectors, and saves it to the database as a new version.
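A minimal spider sketch (the site URL and XPath expressions here are placeholders I made up, not part of the quoted description):

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    """Crawl a site and yield structured items selected with XPath."""

    name = "example"
    start_urls = ["https://example.com/"]  # placeholder seed URL

    def parse(self, response):
        # Extract the data of interest with XPath selectors.
        for title in response.xpath("//h2/text()").getall():
            yield {"title": title}
        # Follow links so Scrapy keeps crawling the site.
        for href in response.xpath("//a/@href").getall():
            yield response.follow(href, callback=self.parse)
```

Run it with `scrapy runspider example_spider.py -o items.json`.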
