
Representing a set of URLs in a list as a tree structure

I have a list of dicts that stores URLs. Each dict has just two fields, title and url. Example:

[
  {'title': 'Index Page', 'url': 'http://www.example.com/something/index.htm'}, 
  {'title': 'Other Page', 'url': 'http://www.example.com/something/other.htm'},
  {'title': 'About Page', 'url': 'http://www.example.com/thatthing/about.htm'},
  {'title': 'Detail Page', 'url': 'http://www.example.com/something/thisthing/detail.htm'},
]

However, I'd like to get a tree structure out of this list of dicts. I'm looking for something like this:

{ 'www.example.com': 
  [ 
    { 'something': 
      [ 
        { 'thisthing':
          [
            { 'title': 'Detail Page', 'url': 'detail.htm'}
          ]
        },
        [
          { 'title': 'Index Page', 'url': 'index.htm'},
          { 'title': 'Other Page', 'url': 'other.htm'}
        ]
      ]
    },
    { 'thatthing': 
      [ 
            { 'title': 'About Page', 'url': 'about.htm'}
      ]
    }
  ]
}

My first attempt at this would be a urlparse soup inside a bunch of for loops, and I'm confident that there's a better and faster way to do this.

I've seen people on SO work magic with list comprehensions, lambda functions, etc. I'm still in the process of figuring it out.

(For Django developers: I'll be using this in my Django application. I'm storing the URLs in a model called Page, which has two fields, name and title.)
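For reference, the splitting itself doesn't have to be hand-rolled string slicing: `urllib.parse.urlparse` separates the host from the path, after which the path can be cut into segments. A minimal sketch (the variable names here are just for illustration):

```python
from urllib.parse import urlparse

url = 'http://www.example.com/something/thisthing/detail.htm'
parts = urlparse(url)

# Host and path segments come out separately
host = parts.netloc                          # 'www.example.com'
segments = parts.path.strip('/').split('/')  # ['something', 'thisthing', 'detail.htm']

print(host, segments)
```

The directory levels are then `segments[:-1]` and the file name is `segments[-1]`, which is exactly the shape a tree builder needs.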


Third time is the charm... that is some nice structure you have there :). In your comment you mention that you "haven't been able to think of a better tree format to represent data like this"... this again made me take the liberty to (just slightly) alter the formatting of the output. In order to dynamically add subelements, a dictionary has to be created to house them. But for leaf nodes, this dictionary is never filled in. If desired, these can of course be removed by another loop, but it cannot happen during the iteration, because the empty dict has to be present for possible new nodes. The same goes for nodes that do not have a file in them: these will contain an empty list.

ll = [
  {'title': 'Index Page', 'url': 'http://www.example.com/something/index.htm'}, 
  {'title': 'Other Page', 'url': 'http://www.example.com/something/other.htm'},
  {'title': 'About Page', 'url': 'http://www.example.com/thatthing/about.htm'},
  {'title': 'Detail Page', 'url': 'http://www.example.com/something/thisthing/detail.htm'},
]

# First build a list of all url segments: final item is the title/url dict
paths = []
for item in ll:
    split = item['url'].split('/')
    paths.append(split[2:-1])
    paths[-1].append({'title': item['title'], 'url': split[-1]})

# Loop over these paths, building the format as we go along
root = {}
for path in paths:
    branch = root.setdefault(path[0], [{}, []])
    for step in path[1:-1]:
        branch = branch[0].setdefault(step, [{}, []])
    branch[1].append(path[-1])

# As for the cleanup: because of the alternating lists and
# dicts it is a bit more complex, but the following works:
def walker(coll):
    # Yield the direct children of a list or dict node
    if isinstance(coll, list):
        for item in coll:
            yield item
    if isinstance(coll, dict):
        for item in coll.values():
            yield item

def deleter(coll):
    # Iterate over a snapshot, so removing an item doesn't make the
    # walker skip its sibling; empty placeholders only ever occur
    # inside the branch lists, so coll.remove() is safe here
    for data in list(walker(coll)):
        if data == [] or data == {}:
            coll.remove(data)
        else:
            deleter(data)

deleter(root)

import pprint
pprint.pprint(root)

Output:

{'www.example.com':
    [
        {'something':
            [
                {'thisthing':
                    [
                        [
                            {'title': 'Detail Page', 'url': 'detail.htm'}
                        ]
                    ]
                },
                [
                    {'title': 'Index Page', 'url': 'index.htm'},
                    {'title': 'Other Page', 'url': 'other.htm'}
                ]
            ],
         'thatthing':
            [
                [
                    {'title': 'About Page', 'url': 'about.htm'}
                ]
            ]
        },
    ]
}
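As a sanity check on this structure (not part of the answer, just a sketch against the output shown above): a small recursive generator can flatten the tree back into full paths, treating dicts with a 'url' key as leaves, other dicts as directory mappings, and lists as containers.

```python
def walk(tree, prefix=''):
    """Yield a full '/'-separated path for every leaf in the tree."""
    if isinstance(tree, dict):
        if 'url' in tree:                      # leaf node: title/url dict
            yield prefix + '/' + tree['url']
        else:                                  # directory name -> branch mapping
            for name, branch in tree.items():
                yield from walk(branch, prefix + '/' + name)
    else:                                      # list of leaves and/or sub-dicts
        for item in tree:
            yield from walk(item, prefix)

tree = {'www.example.com': [
    {'something': [
        {'thisthing': [[{'title': 'Detail Page', 'url': 'detail.htm'}]]},
        [{'title': 'Index Page', 'url': 'index.htm'},
         {'title': 'Other Page', 'url': 'other.htm'}],
    ],
     'thatthing': [[{'title': 'About Page', 'url': 'about.htm'}]]},
]}

print(sorted(walk(tree)))
```

This recovers the four original URLs (minus the scheme), which confirms no page was lost in the restructuring.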


Here's my solution. It seems to work. A very different approach from Jro's:

import itertools
import pprint

pages = [
  {'title': 'Index Page', 'url': 'http://www.example.com/something/index.htm'},
  {'title': 'Other Page', 'url': 'http://www.example.com/something/other.htm'},
  {'title': 'About Page', 'url': 'http://www.example.com/thatthing/about.htm'},
  {'title': 'dtgtet Page', 'url': 'http://www.example.com/thatthing/'},
  {'title': 'Detail Page', 'url': 'http://www.example.com/something/thisthing/detail.htm'},
  {'title': 'Detail Page', 'url': 'http://www.example.com/something/thisthing/thisthing/detail.htm'},
]



def group_urls(url_set, depth=0):
    """
    Recursively group pre-split URLs into a tree, one path segment per level.
    """
    # groupby only merges *consecutive* equal keys, so sort at each level
    url_set = sorted(url_set, key=lambda x: x['url'][depth])

    tree = []

    # Leaves: URLs whose final segment sits at this depth
    leaves = [x for x in url_set if len(x['url']) - 1 == depth]
    for cluster, group in itertools.groupby(leaves, lambda x: x['url'][depth]):
        tree.append({cluster: list(group)})

    # Twigs: URLs that continue deeper; recurse one level down
    twigs = [x for x in url_set if len(x['url']) - 1 > depth]
    for cluster, group in itertools.groupby(twigs, lambda x: x['url'][depth]):
        tree.append({cluster: group_urls(list(group), depth + 1)})

    return tree

if __name__ == '__main__':
    for page in pages:
        # str.strip() removes a *set of characters*, not a prefix, so
        # split off the scheme instead of using .strip('http://')
        page['url'] = page['url'].split('://', 1)[1].split('/')

    pprint.pprint(group_urls(pages))

I can't seem to figure out why I need to sort at the beginning of every recursion. I bet if I worked around that, I could shave off another couple of seconds.
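For what it's worth, the sort is most likely unavoidable as written: `itertools.groupby` only merges *consecutive* items with equal keys, so unsorted input produces one group per run rather than one group per key. A quick demonstration:

```python
import itertools

data = ['a', 'b', 'a']

# Without sorting, the two 'a's aren't adjacent, so they land in separate groups
unsorted_groups = [(k, list(g)) for k, g in itertools.groupby(data)]
print(unsorted_groups)   # [('a', ['a']), ('b', ['b']), ('a', ['a'])]

# Sorting first makes equal keys adjacent, so each key gets exactly one group
sorted_groups = [(k, list(g)) for k, g in itertools.groupby(sorted(data))]
print(sorted_groups)     # [('a', ['a', 'a']), ('b', ['b'])]
```

An alternative that avoids sorting entirely would be to bucket the segments into a dict (e.g. with `collections.defaultdict(list)`) instead of using `groupby`.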
