What's the principle behind web scanning software? [closed]
How can it scan all available pages automatically?
One way I can think of is to scan it recursively from the home page.
But that won't be able to reach the back-end CMS.
So how do those scanning tools work?
Stupid web crawler:
Start by creating an array to store links and put one seed URL in it yourself. Create a second, empty array to store visited URLs. Then run a program that does the following:
1. Read and remove the first URL from the link array; if it is already in the visited array, skip it and repeat this step
2. Download the web page at that URL
3. Parse the HTML for link tags and add every link found to the link array
4. Add the page's URL to the visited array
5. Go to step 1
If you assume that every page on the web is reachable by following some chain of links (possibly billions of them), then simply repeating steps 1 through 4 will eventually download the entire web. Since the web is not actually a fully connected graph, you have to restart the process from different seed URLs to eventually reach every page.
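A minimal sketch of those steps in Python, assuming pages are fetched with urllib and links are pulled out with the standard-library HTMLParser; the names crawl, LinkExtractor, and max_pages are illustrative, not part of any real scanner:

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href values of <a> tags found in a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=100):
        to_visit = deque([start_url])   # the "link array"
        visited = set()                 # the "visited URL array"

        while to_visit and len(visited) < max_pages:
            url = to_visit.popleft()    # step 1: read and remove the first link
            if url in visited:
                continue                # already seen, skip it
            try:
                html = urlopen(url).read().decode("utf-8", errors="replace")  # step 2: download
            except Exception:
                continue                # unreachable or non-HTML page, ignore it
            parser = LinkExtractor()
            parser.feed(html)           # step 3: parse HTML for link tags
            for link in parser.links:
                to_visit.append(urljoin(url, link))  # resolve relative links against the page URL
            visited.add(url)            # step 4: mark the page as visited

        return visited

    # Example: crawl("http://example.com") returns the set of pages it reached.

A real scanner adds things like robots.txt handling, rate limiting, and multiple seed URLs on top of this basic loop, but the underlying idea is the same.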