开发者

Python linked list O(1) insert/remove

I am looking for a linked list and related algorithms implementation for Python. Everyone I ask just recommends using built in Python lists, but performance measurements indicate that list insertion and removal is a bottleneck for our application. It's trivial to implement a simple linked list, but I wonder if there is a mature library which includes some operations like sort, merge, splice, search, lower/upper bound, etc...

I know this is a dupe, but searching for python list on any search engine gives predictably poor results, with most people just saying that linked lists are unneeded in python (pfft!).

PS: I need to insert and remove from anywhere in the list, not just the ends.

OK, you asked for it: I need to maintain an ordered list of several hundred thousand entries. I will iterate over the list forwards (one by one), using a visitor on each entry, starting from the beginning or a position found by a binary search. When an entry matching a predicate is found it is removed from the list, and then, another binary search is performed on a subset of the list beginning from the removed entry's previous position, until a position determined statistically beforehand. Ignoring the error condition, the modified entry may be used to create another linked list which is spliced into the new position found through the second binary search. Iteration is continued from the position where the entry was removed. On occasion, several thousand contiguous ordered entries may be added to/removed from any place in the list. Sometimes several thousand non-contiguous entries must be searched for and removed incrementally.

python's list is unnacceptable as the cost of insertion/removal is prohibitive, and the minor gains in speed for the binary search are totally irrelevant to the total cost. Our in house tests confirm this.

if I have neglected any detail perhaps I can e-mail you a copy of my company's non-disclosure agreement and I can privately corresp开发者_如何学JAVAond with you on the matter. sarcasm.end().


Here's a blog post sharing your pain. It includes an implementation of a linked list and a performance comparison.


Perhaps blist would be better, though (from here)?

The use cases where the BList is slightly slower than Python's list are as follows (O(log n) vs. O(1)):

  1. A large list that never changes length.
  2. A large lists where inserts and deletes are only at the end of the list (LIFO).

With that disclaimer out of the way, here are some of the use cases where the BLists is dramatically faster than the built-in list:

  1. Insertion into or removal from a large list (O(log n) vs. O(n))
  2. Taking large slices of large lists (O(log n) vs O(n))
  3. Making shallow copies of large lists (O(1) vs. O(n))
  4. Changing large slices of large lists (O(log n + log k) vs. O(n + k))
  5. Multiplying a list to make a large, sparse list (O(log k) vs. O(kn))

Note that it's actually implemented as a B+ tree, allowing great performance for all those operations.


Python lists are O(1) for operations at the end of the list. If you'll be doing all your inserting in a semi-sequential fashion--by analogy to C, only keeping a single pointer into the middle of the list as a "cursor" of sorts--you could save yourself a lot of effort by just using two Python lists. One list for what's before the cursor, one for what's after; moving the cursor involves pulling the next item from one list and appending it to the other. That gives you arbitrary O(1) inserting at the cursor location with far less effort and wheel reinventing than making an entire new data structure, letting you reuse a lot of the existing list functions.

For the fully general case allowing multiple references into the list, though, you're probably stuck making a linked list of some sort.

Edit: You're not seriously thinking of actually doing a "binary search" on a linked list, are you? Binary search doesn't even make sense on an intrinsically sequential data structure...

Anyway, if you're okay with linear-time searching and your insertions will always preserve the list order without re-sorting, then a simple linked list might be all you need. If you do as much searching as iterating you should consider something with fast indexing, and if resorting might be required something like a tree would be better.


It's puzzling that everyone demands justification to need a linked list. Linked lists are one of the most elementary data structures for a reason: they have properties that the other major data structures lack, and if you need those properties, you need a linked list or one of its close relatives. If you don't understand why linked lists are an important data structure that can't always be replaced with a deque or a binary tree, you should never have passed your "intro to data structures" class.

Here's a quick implementation, supporting the usual stuff: constant-time insertion at any point given a reference to the node, splitting the list into two lists, and inserting a list into the middle of another list (splice). Generic Python interfaces are supported: push, pop, pushleft, popleft, extend, ordinary iteration, iteration over a slice (getiter).

I just wrote this, so it's doctested but not production tested; there are probably still bugs.

def _ref(obj):
    """
    weakref.ref has a bit of braindamage: you can't take a weakref to None.
    This is a major hassle and a needless limitation; work around it.
    """
    from weakref import ref
    if obj is None:
        class NullRef(object):
            def __call__(self): return None
        return NullRef()
    else:
        return ref(obj)

class _node(object):
    def __init__(self, obj):
        self.obj = obj
        self._next = None
        self._prev = _ref(None)
    def __repr__(self):
        return "node(%s)" % repr(self.obj)

    def __call__(self):
        return self.obj

    @property
    def next(self):
        return self._next

    @property
    def prev(self):
        return self._prev()

# Implementation note: all "_last" and "prev" links are weakrefs, to prevent circular references.
# This is important; if we don't do this, every list will be a big circular reference.  This would
# affect collection of the actual objects in the list, not just our node objects.
#
# This means that _node objects can't exist on their own; they must be part of a list, or nodes
# in the list will be collected.  We also have to pay attention to references when we move nodes
# from one list to another.
class llist(object):
    """
    Implements a doubly-linked list.
    """
    def __init__(self, init=None):
        self._first = None
        self._last = _ref(None)
        if init is not None:
            self.extend(init)

    def insert(self, item, node=None):
        """
        Insert item before node.  If node is None, insert at the end of the list.
        Return the node created for item.

        >>> l = llist()
        >>> a = l.insert(1)
        >>> b = l.insert(2)
        >>> d = l.insert(4)
        >>> l._check()
        [1, 2, 4]
        >>> c = l.insert(3, d)
        >>> l._check()
        [1, 2, 3, 4]
        """
        item = _node(item)

        if node is None:
            if self._last() is not None:
                self._last()._next = item
                item._prev = _ref(self._last())
            self._last = _ref(item)
            if self._first is None:
                self._first = item
        else:
            assert self._first is not None, "insertion node must be None when the list is empty"
            if node._prev() is not None:
                node._prev()._next = item
            item._prev = node._prev
            item._next = node
            node._prev = _ref(item)
            if node is self._first:
                self._first = item
        return item

    def remove(self, node):
        """
        >>> l = llist()
        >>> a = l.append(1)
        >>> b = l.append(2)
        >>> c = l.append(3)
        >>> d = l.append(4)
        >>> e = l.append(5)
        >>> l.remove(c) # Check removing from the middle
        3
        >>> l._check()
        [1, 2, 4, 5]
        >>> l.remove(a) # Check removing from the start
        1
        >>> l._check()
        [2, 4, 5]
        >>> l.remove(e) # Check removing from the end
        5
        >>> l._check()
        [2, 4]
        """
        if self._first is node:
            self._first = node._next
        if self._last() is node:
            self._last = node._prev
        if node._next is not None:
            node._next._prev = node._prev
        if node._prev() is not None:
            node._prev()._next = node._next
        node._next = None
        node._prev = _ref(None)
        return node.obj

    def __nonzero__(self):
        """
        A list is true if it has any elements.

        >>> l = llist()
        >>> bool(l)
        False
        >>> l = llist([1])
        >>> bool(l)
        True
        """
        return self._first is not None


    def __iter__(self):
        """
        >>> l = llist([1,2,3])
        >>> [i() for i in l]
        [1, 2, 3]
        """
        return self.getiter(self._first, self._last())

    def _check(self):
        if self._last() is None:
            assert self._last() is None
            return []
        node = self._first
        ret = []
        while node is not None:
            if node._next is None:
                assert node == self._last()
            if node._prev() is None:
                assert node == self._first
            if node._next is not None:
                assert node._next._prev() == node
            if node._prev() is not None:
                assert node._prev()._next == node
            ret.append(node.obj)
            node = node._next
        return ret

    def getiter(self, first, last):
        """
        Return an iterator over [first,last].

        >>> l = llist()
        >>> l.append(1)
        node(1)
        >>> start = l.append(2)
        >>> l.extend([3,4,5,6])
        >>> end = l.append(7)
        >>> l.extend([8,9])
        >>> [i() for i in l.getiter(start, end)]
        [2, 3, 4, 5, 6, 7]
        """
        class listiter(object):
            def __init__(self, first, last):
                self.node = first
                self.final_node = last

            def __iter__(self): return self
            def next(self):
                ret = self.node
                if ret is None:
                    raise StopIteration
                if ret is self.final_node:
                    self.node = None
                else:
                    self.node = self.node._next
                return ret
        return listiter(first, last)

    def append(self, item):
        """
        Add an item to the end of the list.

        >>> l = llist()
        >>> l.append(1)
        node(1)
        >>> l.append(2)
        node(2)
        >>> l._check()
        [1, 2]
        """
        return self.insert(item, None)

    def appendleft(self, item):
        """
        Add an item to the beginning of the list.

        >>> l = llist()
        >>> l.appendleft(1)
        node(1)
        >>> l.appendleft(2)
        node(2)
        >>> l._check()
        [2, 1]
        """
        return self.insert(item, self._first)

    def pop(self):
        """
        Remove an item from the end of the list and return it.

        >>> l = llist([1,2,3])
        >>> l.pop()
        3
        >>> l.pop()
        2
        >>> l.pop()
        1
        >>> l.pop()
        Traceback (most recent call last):
        ...
        IndexError: pop from empty llist
        """
        if self._last() is None:
            raise IndexError, "pop from empty llist"
        return self.remove(self._last())

    def popleft(self):
        """
        Remove an item from the beginning of the list and return it.

        >>> l = llist([1,2,3])
        >>> l.popleft()
        1
        >>> l.popleft()
        2
        >>> l.popleft()
        3
        >>> l.popleft()
        Traceback (most recent call last):
        ...
        IndexError: popleft from empty llist
        """
        if self._first is None:
            raise IndexError, "popleft from empty llist"
        return self.remove(self._first)

    def splice(self, source, node=None):
        """
        Splice the contents of source into this list before node; if node is None, insert at
        the end.  Empty source_list.  Return the first and last nodes that were moved.

        # Test inserting at the beginning.
        >>> l = llist()
        >>> a = l.append(1)
        >>> b = l.append(2)
        >>> c = l.append(3)
        >>> l2 = llist([4,5,6])
        >>> l.splice(l2, a)
        (node(4), node(6))
        >>> l._check()
        [4, 5, 6, 1, 2, 3]
        >>> l2._check()
        []

        # Test inserting in the middle.
        >>> l = llist()
        >>> a = l.append(1)
        >>> b = l.append(2)
        >>> c = l.append(3)
        >>> l2 = llist([4,5,6])
        >>> l.splice(l2, b)
        (node(4), node(6))
        >>> l._check()
        [1, 4, 5, 6, 2, 3]
        >>> l2._check()
        []

        # Test inserting at the end.
        >>> l = llist()
        >>> a = l.append(1)
        >>> b = l.append(2)
        >>> c = l.append(3)
        >>> l2 = llist([4,5,6])
        >>> l.splice(l2, None)
        (node(4), node(6))
        >>> l._check()
        [1, 2, 3, 4, 5, 6]
        >>> l2._check()
        []

        # Test inserting a list with a single item.
        >>> l = llist()
        >>> a = l.append(1)
        >>> b = l.append(2)
        >>> c = l.append(3)
        >>> l2 = llist([4])
        >>> l.splice(l2, b)
        (node(4), node(4))
        >>> l._check()
        [1, 4, 2, 3]
        >>> l2._check()
        []
        """
        if source._first is None:
            return
        first = source._first
        last = source._last()

        if node is None:
            if self._last() is not None:
                self._last()._next = source._first
            source._first._prev = self._last
            self._last = source._last
            if self._first is None:
                self._first = source._first
        else:
            source._first._prev = node._prev
            source._last()._next = node
            if node._prev() is not None:
                node._prev()._next = source._first
            node._prev = source._last
            if node is self._first:
                self._first = source._first
        source._first = None
        source._last = _ref(None)
        return first, last

    def split(self, start, end=None):
        """
        Remove all items between [node, end] and return them in a new list.  If end is None,
        remove until the end of the list.

        >>> l = llist()
        >>> a = l.append(1)
        >>> b = l.append(2)
        >>> c = l.append(3)
        >>> d = l.append(4)
        >>> e = l.append(5)
        >>> l._check()
        [1, 2, 3, 4, 5]
        >>> l2 = l.split(c, e)
        >>> l._check()
        [1, 2]
        >>> l2._check()
        [3, 4, 5]

        >>> l = llist()
        >>> a = l.append(1)
        >>> b = l.append(2)
        >>> c = l.append(3)
        >>> d = l.append(4)
        >>> e = l.append(5)
        >>> l2 = l.split(a, c)
        >>> l._check()
        [4, 5]
        >>> l2._check()
        [1, 2, 3]

        >>> l = llist()
        >>> a = l.append(1)
        >>> b = l.append(2)
        >>> c = l.append(3)
        >>> d = l.append(4)
        >>> e = l.append(5)
        >>> l2 = l.split(b, d)
        >>> l._check()
        [1, 5]
        >>> l2._check()
        [2, 3, 4]
        """
        if end is None:
            end = self._last()

        ret = llist()

        # First, move the region into the new list.  It's important to do this first, or
        # once we remove the nodes from the old list, they'll be held only by weakrefs and
        # nodes could end up being collected before we put it into the new one.
        ret._first = start
        ret._last = _ref(end)

        # Hook our own nodes back together.
        if start is self._first:
            self._first = end._next
        if end is self._last():
            self._last = start._prev

        if start._prev() is not None:
            start._prev()._next = end._next
        if end._next is not None:
            end._next._prev = start._prev
        start._prev = _ref(None)
        end._next = None

        return ret

    def extend(self, items):
        """
        >>> l = llist()
        >>> l.extend([1,2,3,4,5])
        >>> l._check()
        [1, 2, 3, 4, 5]
        """
        for item in items:
            self.append(item)

if __name__ == "__main__":
    import doctest
    doctest.testmod()


There's a single-linked list here (recipe 17.14 in the Python Cookbook 1st ed) but it's hardly "mature" or rich -- it's just doing a FIFO queue so it's pretty minimal.

This recipe is a very concise C implementation of (read-only) Lisp-like cons-boxes -- just car, cdr and cons; again, not a rich type, rather a minimal one (and to use it for mutable data, as opposed to pure functional approaches, you'd need to add setcar and setcdr at least). It may be a better starting point for you simply because cons-cells are so notoriously flexible and familiar.

Some of the operations you require will be likely best done by existing Python primitives. For example, for sorting, it's hard to see how rolling your own sort can beat the performance of Python's sorted(linkedlist) (as long of course as you make the linkedlist type a Python iterable so it plays well with the rest of the language and library;-), considering the power of the timsort algorithm implemented in the Python runtime.

More generally I suggest you carefully timeit things every step along the way to consider how much the C-coded approach is really buying you (when compared to the trivial C-coded one exemplified by the recipe in the printed Cookbook whose URL I give at the start of this answer) -- that will crucially depend on the size and nature of your application's lists, so you're the one best placed to organize these benchmarks of course.


Python's deque class is 0(1) for insertion and deletion at the beginning and end of the list.


" I will iterate over the list forwards (one by one), using a visitor on each entry, starting from the beginning or a position found by a binary search. When an entry matching a predicate is found it is removed from the list, and then, another binary search is performed on a subset of the list beginning from the removed entry's previous position"

It sounds like a linked list is absolutely the wrong data structure for this - doing a binary search will require random access to the list, which will mean repeatedly iterating through the elements. This is likely to be slower on a linked list than inserting and deleting items in a python list.

It sounds like the data structure you want is a skip list. Google throws up several implementations, but I cannot comment on their completeness or quality.

edit:

Another data structure that may be suitable is a threaded binary tree. this is like a regular binary tree but each leaf node points to next/previous subtree, so it can be iterated through as efficiently as a linked list. Implementing it in Python is left as an exercise for the reader (or Google).


For large data, keep a sorted list is a trick. Don't insert but append new items at the end and then sort it. Don't delet item but replace with a special value, sort them to the end and then pop out. For finding, a sorted list also has very quick performance with bisection method. As for small data, iterate old list, filter and build a new one, like list comprehensions method, is always the fast way.

For me, what is large data? it should be over 1000000 items...


Here's an idea, that would require a little coding, but may give you hugely better performance. It may or may not be suitable for your use case.

You can splice a new list into your list by replacing a single element. To insert the list [6, 7, 8] into [1, 2, 3, 4, 5] at index 2, you would end up with

[1, 2, [3, 6, 7, 8], 4, 5]

By not changing the length of the large (here 5 element) list, you'll not have the speed problems you're currently having.

You can 'delete' an element from the list in the same way, by replacing it with an empty list.

[1, 2, [], 4, 5]

To iterate over this mixed list is straightforward.

def IterateNestedList(xs):
    for x in xs:
        if isinstance(x, list):
            for y in IterateNestedList(x): yield y
        else: yield x


How about using any of data structures which provide sorted data access? Binary (AVL trees, AVL, Red-black), for instance? These guarantee O(log(N)) insertion complexity. Not O(1), but better than what you have.


I recently had the need for a circular and doubly linked list. Since I am very familiar with the Linux kernel's linked list. I wrote a copycat linked list in Python. It provides O(1) random insertion and deletion. It is much faster than Python's list when you are doing random insertion and deletion on a big list. The code is here: https://github.com/ningke/Pylnklist. I also wrote a little bit of introduction here: http://710003.blogspot.com/2012/06/copycat-linked-list-in-python.html


Looks like there is an external library and we are just a pip install llist away.

The documentation states:

All data types defined in this module support efficient O(1) insertion and removal of elements (except removal in sllist which is O(n)). Random access to elements using index is O(n).

In my opinion, the removal of elements from dllist should take O(1).

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜