How to identify Menus in various websites using BeautifulSoup?
I want to identify the div element which has the main menu in a website.
The approach I am thinking of:
- Parse HTML using Beautiful Soup
- Menus usually have the highest link density, i.e. the highest anchor tag count; alternatively, look for a ul in which every li contains a link
The above approach can fail because on various websites the footer element can also have a high link density (e.g. www.langoor.com).
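To make the link-density idea and its footer problem concrete, here is a minimal sketch. The function names `link_density` and `rank_by_link_density`, the `min_links` threshold, and the sample markup are all made up for illustration; "density" is assumed here to mean the share of an element's text that sits inside anchor tags.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: a header menu, an article body, and a footer.
SAMPLE_HTML = """
<div id="top"><ul>
  <li><a href="/">Home</a></li>
  <li><a href="/products">Products</a></li>
  <li><a href="/about">About</a></li>
</ul></div>
<div id="content"><p>Some long article text with a single <a href="/more">link</a> in it.</p></div>
<div id="bottom">
  <a href="/terms">Terms</a> <a href="/privacy">Privacy</a> <a href="/jobs">Jobs</a>
</div>
"""


def link_density(tag):
    """Share of the tag's text (whitespace-stripped) that lies inside <a> tags."""
    text_len = len(tag.get_text(strip=True))
    if text_len == 0:
        return 0.0
    link_len = sum(len(a.get_text(strip=True)) for a in tag.find_all("a"))
    return link_len / text_len


def rank_by_link_density(html, min_links=3):
    """Return (tag, link_count, density) for div/ul candidates, densest first.

    min_links is an assumed cut-off that drops elements with only a few links.
    """
    soup = BeautifulSoup(html, "html.parser")
    candidates = []
    for tag in soup.find_all(["div", "ul"]):
        n_links = len(tag.find_all("a"))
        if n_links >= min_links:
            candidates.append((tag, n_links, link_density(tag)))
    # Sort on the numeric fields only; Tag objects are not orderable.
    return sorted(candidates, key=lambda c: (c[2], c[1]), reverse=True)


if __name__ == "__main__":
    for tag, n_links, density in rank_by_link_density(SAMPLE_HTML):
        print(tag.name, tag.get("id"), n_links, round(density, 2))
```

On this sample the header and footer blocks score the same, which is exactly the ambiguity described above: link density alone narrows the candidates but does not pick the main menu.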
Another approach is to look for the keyword "menu" in the "id" or "class" attributes of the div elements. This looks like a very expensive approach, as we might end up searching for many words.
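For reference, a keyword search over "id" and "class" can be sketched as a single pass over the parsed tree, with all keywords folded into one compiled regular expression. The keyword list (`menu`, `nav`) and the helper names below are assumptions for illustration, not a vetted list.

```python
import re

from bs4 import BeautifulSoup

# Assumed keyword list; extend the alternation with whatever words matter.
MENU_HINT = re.compile(r"menu|nav", re.IGNORECASE)


def has_menu_hint(tag):
    """True for a div whose id or any class value contains a keyword."""
    if tag.name != "div":
        return False
    # With html.parser, "class" is a list of values and "id" is a string.
    values = [tag.get("id") or ""] + list(tag.get("class") or [])
    return any(MENU_HINT.search(value) for value in values)


def find_menu_divs(html):
    """Return every div whose id/class attributes hint at a menu."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all(has_menu_hint)


if __name__ == "__main__":
    page = (
        '<div id="main-menu"><a href="/">Home</a></div>'
        '<div class="top Navbar"><a href="/a">About</a></div>'
        '<div id="content">Body text</div>'
    )
    for div in find_menu_divs(page):
        print(div.get("id"), div.get("class"))
```

Each div is visited once regardless of how many keywords are in the pattern, so the cost grows with the size of the page rather than with the number of words.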
It would be great if you could help me look in the right direction to solve this problem. Thanks!
It's quite hard, because menus in HTML aren't standardized. Search the DOM tree for ul/li elements with keywords ("menu", etc.) in the first or second div (before the footer); these places are commonly used for menus. Or wait for HTML5 and the nav tag.
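A rough sketch of that heuristic follows, under stated assumptions: a `nav` element is preferred when the page has one; otherwise the first ul in document order whose direct li children all contain a link is taken, skipping anything nested in a footer. The names `in_footer` and `guess_main_menu` are hypothetical, and treating any ancestor with "footer" in its id or class as a footer is a guess about common markup.

```python
from bs4 import BeautifulSoup


def in_footer(tag):
    """True if any ancestor is a <footer> or mentions "footer" in id/class."""
    for parent in tag.parents:
        if parent.name == "footer":
            return True
        ident = " ".join([parent.get("id") or ""] + list(parent.get("class") or []))
        if "footer" in ident.lower():
            return True
    return False


def guess_main_menu(html):
    """Return the most likely menu element, or None if nothing qualifies."""
    soup = BeautifulSoup(html, "html.parser")
    nav = soup.find("nav")
    if nav is not None:
        return nav
    # find_all walks the tree in document order, so earlier lists win.
    for ul in soup.find_all("ul"):
        items = ul.find_all("li", recursive=False)
        all_linked = bool(items) and all(li.find("a") is not None for li in items)
        if all_linked and not in_footer(ul):
            return ul
    return None


if __name__ == "__main__":
    page = (
        '<div id="header"><ul><li><a href="/">Home</a></li>'
        '<li><a href="/about">About</a></li></ul></div>'
        '<div id="footer"><ul><li><a href="/terms">Terms</a></li></ul></div>'
    )
    menu = guess_main_menu(page)
    print(menu.name, menu.parent.get("id"))
```

This can be combined with the keyword check on id/class to break ties when several early lists qualify.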