How to generate graphical sitemap of large website [closed]
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
I would like to generate a graphical sitemap for my website. There are two stages, as far as I can tell:
- crawl the website and analyse the link relationship to extract the tree structure
- generate a visually pleasing render of the tree
Does anyone have advice or experience with achieving this, or know of existing work I can build on (ideally in Python)?
I came across some nice CSS for rendering the tree, but it only works for 3 levels.
Thanks
The only automatic way to create a sitemap is to know the structure of your site and write a program which builds on that knowledge. Just crawling the links won't usually work because links can be between any pages so you get a graph (i.e. connections between nodes). There is no way to convert a graph into a tree in the general case.
So you must identify the structure of your tree yourself and then crawl the relevant pages to get the titles of the pages.
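One common way to "identify the structure yourself" is to treat the URL path hierarchy as the tree, since many sites mirror their logical structure in their paths. A minimal sketch (the `pages` list is made-up example data; a real version would feed in your crawled URLs):

```python
from urllib.parse import urlparse

def build_tree(urls):
    """Nest URLs into a tree keyed by path segments.

    This assumes the site's hierarchy mirrors its URL paths -- the
    'known structure' this answer relies on, which a raw link crawl
    won't give you for free.
    """
    tree = {}
    for url in urls:
        parts = [p for p in urlparse(url).path.split("/") if p]
        node = tree
        for part in parts:
            node = node.setdefault(part, {})
    return tree

pages = [
    "https://example.com/docs/intro",
    "https://example.com/docs/api/auth",
    "https://example.com/blog/2020/hello",
]
print(build_tree(pages))
# {'docs': {'intro': {}, 'api': {'auth': {}}}, 'blog': {'2020': {'hello': {}}}}
```

You would then fetch each page once to grab its `<title>` for the node labels; that part is omitted here since it is just an HTTP GET per URL.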
As for "but it only works for 3 levels": three levels is more than enough. If you try to render more levels, your sitemap will become unusable (too big, too wide). No one will want to download a 1 MB sitemap and then scroll through 100,000 links. If your site grows that big, you should implement some kind of search instead.
Here is a Python web crawler, which should make a good starting point. Your general strategy is this:
- take care never to follow outbound links, including links on the same domain but higher up than your starting point.
- as you spider the site, collect a hash of page URLs mapped to a list of all the internal URLs included in each page.
- take a pass over this list, assigning a token to each unique url.
- use your hash of {token => [tokens]} to generate a graphviz file that will lay out a graph for you
- convert the graphviz output into an imagemap where each node links to its corresponding webpage
The reason you need to do all this is, as leonm noted, that websites are graphs, not trees, and laying out a graph is a harder problem than you can solve with a simple piece of JavaScript and CSS. Graphviz is good at what it does.
Please see http://aaron.oirt.rutgers.edu/myapp/docs/W1100_2200.TreeView on how to format tree views. You can also probably modify the example application http://aaron.oirt.rutgers.edu/myapp/DirectoryTree/index to scrape your pages if they are organized as directories of HTML files.