What is the difference between crawling, parsing, indexing, and search, from a Python libraries perspective? [closed]
I am confused by these terms; they all look the same to me. Can someone please explain the order in which these steps are performed, and which libraries do the work at each one?
I want to know, for each step, what the input is and what the output is, e.g.:
Crawling
Input = URL
Output = ?
Indexing
Input = ?
I'll give you a general description, algorithmically; you can map each step onto your Python libraries.
Crawling: starts from a set of URLs, and its goal is to expand that set. The crawler follows outgoing links and tries to expand the graph as much as it can (until it covers the part of the web graph connected to the initial set of URLs, or until resources [usually time] run out).
so:
input = Set of URLs
output = a bigger set of URLs that are reachable from the input
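As a concrete sketch of that loop (not a production crawler; I'm assuming requests and BeautifulSoup here, and the fetch_limit parameter is just an illustrative resource bound):

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def crawl(seed_urls, fetch_limit=50):
        """Breadth-first crawl: expand the input set by following out-links."""
        to_visit = list(seed_urls)
        seen = set(seed_urls)
        while to_visit and len(seen) < fetch_limit:  # stop when resources expire
            url = to_visit.pop(0)
            try:
                html = requests.get(url, timeout=5).text
            except requests.RequestException:
                continue  # unreachable pages are simply skipped
            for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"])  # resolve relative links
                if link not in seen:
                    seen.add(link)
                    to_visit.append(link)
        return seen  # a bigger set of URLs reachable from the input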
Indexing: uses the data the crawlers gathered to "index" the files. An index is essentially a mapping from each term (usually a word) in the collection to the documents that term appears in.
input: the set of URLs (and the documents fetched from them)
output: an index file/structure.
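A toy inverted index over already-fetched pages might look like this (pure Python; the example URLs and the naive whitespace/word tokenizer are illustrative assumptions, not what Lucene actually does):

    import re
    from collections import defaultdict

    def build_index(documents):
        """Map each term to the set of document IDs (here: URLs) containing it."""
        index = defaultdict(set)
        for url, text in documents.items():
            for term in re.findall(r"\w+", text.lower()):
                index[term].add(url)
        return index

    docs = {
        "http://example.com/a": "Python web crawling basics",
        "http://example.com/b": "Indexing and search with Python",
    }
    index = build_index(docs)
    print(index["python"])  # {'http://example.com/a', 'http://example.com/b'}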
Search: uses the index to find documents relevant to a given query.
input: a query (string) and the index [the index is usually an implicit argument, since it's part of the engine's state]
output: documents relevant to the query (a "document" here is actually a web page that was crawled...)
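Searching is then a lookup plus an intersection of the posting sets; a minimal boolean-AND sketch over the index built in the previous snippet (no ranking, which a real engine like Lucene would add):

    def search(query, index):
        """Return documents containing every term of the query (boolean AND)."""
        terms = query.lower().split()
        if not terms:
            return set()
        results = set(index.get(terms[0], set()))
        for term in terms[1:]:
            results &= index.get(term, set())  # intersect posting sets
        return results

    # Using the index from the previous sketch:
    # search("python indexing", index)  -> {'http://example.com/b'}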
I encourage you to have a look at PyLucene, which does all of these things (and more!), and to read some more about Information Retrieval.
You should also check out Scrapy, a Python crawling framework:
Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
It crawls the sites periodically, extracts the data of interest (which you specify using XPath expressions), and saves it to the database as a new version.
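A minimal Scrapy spider along those lines (the start URL and XPath expressions are placeholders; run it with `scrapy runspider example_spider.py -o items.json`):

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["http://example.com"]  # placeholder seed URL

        def parse(self, response):
            # extract the data of interest via XPath
            for title in response.xpath("//h1/text()").getall():
                yield {"title": title}
            # follow out-links so the crawl expands
            for href in response.xpath("//a/@href").getall():
                yield response.follow(href, callback=self.parse)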