How to search for a particular type of web address?

2022-12-21 04:29

See these URLs:

http://en.wikipedia.org/wiki/1_(number)

http://en.wikipedia.org/wiki/10_(number)

http://en.wikipedia.org/wiki/100_(number)

http://en.wikipedia.org/wiki/10000_(number)

Is there some way to search for a list of all the pages of this format on the WWW?

I see two problems to solve.

The first one: there is no central directory of all the URLs in the world, and not even every site you know publishes a sitemap.

One idea is to check whether a search engine (Google or another) lets you search at the URL level instead of the content level. You could then generate search queries that return the list of pages matching your pattern and work from there.

The second one: some web services expose functions as resources, so the list of URLs matching a regex may be infinite.

You can use several checks to avoid this, for example a cap on link depth or on the number of URLs accepted per host.

By the way, you are facing the same problem as every search engine: making an inventory of the whole web. No one has fully solved this problem.
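As a rough illustration of such checks, here is a minimal sketch of a guard that refuses URLs once a depth limit or a per-host limit is reached. The class name `CrawlBudget` and the limits `max_depth` and `max_per_host` are hypothetical, chosen only for this example; they are not part of the original answer.

```python
from collections import Counter
from urllib.parse import urlsplit


class CrawlBudget:
    """Hypothetical guard that caps link depth and pages per host."""

    def __init__(self, max_depth=3, max_per_host=1000):
        self.max_depth = max_depth        # assumed limit, tune as needed
        self.max_per_host = max_per_host  # assumed limit, tune as needed
        self.seen = set()
        self.per_host = Counter()

    def allow(self, url, depth):
        """Return True and record the URL if it is still within budget."""
        host = urlsplit(url).netloc
        if depth > self.max_depth or url in self.seen:
            return False
        if self.per_host[host] >= self.max_per_host:
            return False
        self.seen.add(url)
        self.per_host[host] += 1
        return True
```

A crawler would call `allow(url, depth)` before queueing each link, so a service that generates endless URLs only costs a bounded number of requests.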

EDIT: a basic web crawler algorithm

take a list of seed sites
for each seed not yet visited
  fetch and parse the web page returned
  add each new link found in the page to the seed list
  index the page under its keywords in a db
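The loop above could be sketched in Python roughly as follows, using only the standard library. This is a minimal sketch under assumptions, not a production crawler: the names `LinkParser`, `fetch`, `crawl`, and `max_pages` are made up for the example, pages are assumed to decode as UTF-8, and the keyword-indexing step is left out. A real crawler would also honour robots.txt and rate limits.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page as text (assumes UTF-8, replaces bad bytes)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seeds, fetch_page=fetch, max_pages=100):
    """Breadth-first crawl from the seeds; returns visited URLs in order."""
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except (OSError, ValueError):
            continue  # unreachable or malformed URL: skip it
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop the #fragment part.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

The `seen` set is what keeps the crawl from looping over pages that link to each other, and `max_pages` bounds the total work.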

Usually: grep -E 'http://en\.wikipedia\.org/wiki/10*_\(number\)' list_of_urls (here 10* matches a 1 followed by any number of zeros).
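For readers who would rather filter the list in a script, the same pattern can be applied with Python's `re` module. This is a small sketch; the names `NUMBER_PAGE` and `matching_urls` are invented for the example. Unlike grep, which matches anywhere in a line, `fullmatch` requires the whole URL to fit the pattern.

```python
import re

# A 1 followed by any number of zeros: 1, 10, 100, 10000, ...
NUMBER_PAGE = re.compile(r"http://en\.wikipedia\.org/wiki/10*_\(number\)")


def matching_urls(urls):
    """Return the URLs that are entirely of the form .../10*_(number)."""
    return [u for u in urls if NUMBER_PAGE.fullmatch(u)]
```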

But if you want to know whether a website serves content at URLs of a given format, you have a few possibilities.

The site has a sitemap or page index from which you can grab list_of_urls and feed it to grep (for Wikipedia: http://en.wikipedia.org/wiki/Special:AllPages).
You build a list of candidate addresses yourself and try them. There is no standard way for an HTTP server to advertise all its pages.
Google's way: crawl the site by following its links until you have found all its public pages, then search in the list you have built.

Google also supports the allinurl: and site: search operators, which could help you too.
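The second possibility, building candidate addresses and trying them, might look like the sketch below for the URLs in the question. The helper names `candidate_urls` and `exists` and the User-Agent string are assumptions made for this example. Treating a final status of 200 as "the page exists" is also an assumption, since a site may answer 200 for a missing page or reject HEAD requests.

```python
from urllib.request import Request, urlopen


def candidate_urls(max_zeros):
    """Build .../1_(number), .../10_(number), ... up to max_zeros zeros."""
    return [
        f"http://en.wikipedia.org/wiki/{10 ** k}_(number)"
        for k in range(max_zeros + 1)
    ]


def exists(url):
    """Return True if the server finally answers the request with 200."""
    request = Request(
        url,
        method="HEAD",
        headers={"User-Agent": "url-probe-sketch/0.1"},  # hypothetical name
    )
    try:
        with urlopen(request, timeout=10) as response:
            return response.status == 200
    except OSError:  # covers URLError and HTTPError (404, 403, ...)
        return False
```

Usage would be along the lines of `[u for u in candidate_urls(4) if exists(u)]`, ideally with a pause between requests to stay polite to the server.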
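If you want to generate such searches from a script, a query using these operators only needs to be URL-encoded into a search link. The sketch below assumes the usual `https://www.google.com/search?q=...` address form; the function name `search_url` is made up for the example, and which results Google actually returns for a given operator is not something the code can guarantee.

```python
from urllib.parse import urlencode


def search_url(query):
    """Return a Google search link for the given query string."""
    return "https://www.google.com/search?" + urlencode({"q": query})


# Example queries built from the operators mentioned above.
by_site = search_url("site:en.wikipedia.org (number)")
by_url_words = search_url("allinurl:wiki number")
```

Opening these links in a browser is the simplest use; fetching results automatically may be restricted by the search engine's terms of service.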
