
How is the split('\n') method in Python implemented?

This is a theoretical question, meant to help me understand a difference between Java and Python. To read the content of a file into an array in Java, you need to know the number of lines so that you can define the size of the array when you declare it. Because you cannot know that in advance, you need to apply some tricks to work around the problem.

In Python, though, lists can be of any size, so reading the content of a file into a list can be done either as:

lines = open('filename').read().split('\n')

or

lines = open('filename').readlines()

How does split('\n') work in this case? Is the Python implementation performing some kind of trick underneath as well (like doubling the size of an array when needed, etc.)?

Any information that sheds light on this would be much appreciated.


The implementation of str.split() internally calls list.append(), which in turn calls the internal function list_resize(). From a comment in the source code of this function:

This over-allocates proportional to the list size, making room for additional growth. The over-allocation is mild, but is enough to give linear-time amortized behavior over a long sequence of appends() in the presence of a poorly-performing system realloc().

The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
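
The quoted growth pattern can be reproduced with a short simulation. This is only a sketch: the formula below is the one I recall from list_resize() in CPython sources of that era (newer versions use a different rule), and the function name growth_pattern is made up for illustration.

```python
def growth_pattern(count):
    """Return the first `count` capacities seen while appending one item at a time.

    Assumption: mirrors the over-allocation rule of older CPython list_resize():
        new_allocated = (newsize >> 3) + (3 if newsize < 9 else 6) + newsize
    """
    capacities = [0]   # an empty list starts with no allocated slots
    allocated = 0
    size = 0
    while len(capacities) < count:
        size += 1                      # simulate one append
        if size > allocated:           # out of room: resize with over-allocation
            allocated = (size >> 3) + (3 if size < 9 else 6) + size
            capacities.append(allocated)
    return capacities


print(growth_pattern(10))
```

Each resize leaves spare slots, so most appends do not touch the allocator at all.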


If you're looking for the actual code implementing it, try this: http://svn.python.org/view/python/trunk/Objects/stringlib/split.h?view=markup

For the "basic" split start looking around line 148.

Short summary: the code scans the string for the given separator, then appends to the output list the substring between the previous match and the current one (or the start of the string for the first match), using "PyList_Append". At the end it appends the remainder of the string to the list.

There is also logic to allocate more space for the result list when it reaches its current maximum size, as well as separate functions for splitting on a single character versus a multi-character separator string (i.e. if you wanted to split on '/t' as two characters you could, via a separate function).
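
The summary above can be sketched in pure Python. This is not the C implementation, just an illustration of the same scan-and-append loop; split_sketch is a hypothetical name, and the whitespace-splitting case (no separator given) is deliberately left out.

```python
def split_sketch(text, sep, maxsplit=-1):
    """Illustrative re-implementation of str.split(sep, maxsplit) for a non-empty sep."""
    if not sep:
        raise ValueError("empty separator")
    parts = []
    start = 0
    while maxsplit != 0:               # a negative maxsplit means "no limit"
        idx = text.find(sep, start)    # look for the next separator
        if idx == -1:
            break
        parts.append(text[start:idx])  # piece between previous match and this one
        start = idx + len(sep)
        maxsplit -= 1
    parts.append(text[start:])         # remainder of the string
    return parts


print(split_sketch("line one\nline two\n", "\n"))
```

The result list grows through ordinary appends, so its size never has to be known up front.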


I think (though I haven't re-checked the code) that the split() method counts the number of newlines in the string and then just allocates a list of the correct size.

In any case, all Python lists over-allocate, so a long run of appends takes amortized linear time overall (that is, amortized constant time per append).
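
To see why over-allocation matters, here is a rough cost model. It is a sketch under simplifying assumptions: every resize is charged as if it copied all existing elements, and the proportional rule (grow by about one eighth plus a constant) only approximates what CPython does. The name copies_needed is hypothetical.

```python
def copies_needed(n_appends, grow):
    """Count element copies for n_appends, given a capacity growth rule."""
    capacity = 0
    size = 0
    copies = 0
    for _ in range(n_appends):
        if size == capacity:           # full: grow the buffer
            capacity = grow(capacity)
            copies += size             # worst case: every element is moved
        size += 1
    return copies


n = 10_000
exact_fit = copies_needed(n, lambda cap: cap + 1)                # grow by one slot
proportional = copies_needed(n, lambda cap: cap + (cap >> 3) + 6)  # mild over-allocation

print(exact_fit, proportional)
```

Growing by a single slot costs a number of copies quadratic in n, while proportional growth keeps the total within a constant multiple of n.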


You could check
1) http://svn.python.org/view/python/trunk/Objects/listobject.c?view=markup
2) http://svn.python.org/view/python/trunk/Include/listobject.h?view=markup

In short, Python's list plays the role that Vector plays in Java.


split([sep[, maxsplit]])

Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements). If maxsplit is not specified, then there is no limit on the number of splits (all possible splits are made). Consecutive delimiters are not grouped together and are deemed to delimit empty strings (for example, '1,,2'.split(',') returns ['1', '', '2']).

The sep argument may consist of multiple characters (for example, '1, 2, 3'.split(', ') returns ['1', '2', '3']).

Splitting an empty string with a specified separator returns [''].

docs.python.org
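
One consequence of these rules for the two approaches in the question: a trailing newline produces a final empty string with split('\n'), whereas readlines() keeps the newline on each line. A small sketch, using io.StringIO as a stand-in for a real file (the sample text is made up):

```python
import io

text = "first\nsecond\n"

# split('\n') treats the trailing newline as delimiting one more (empty) item.
via_split = io.StringIO(text).read().split("\n")

# readlines() keeps the line terminators and adds no extra item.
via_readlines = io.StringIO(text).readlines()

# splitlines() drops the terminators and adds no extra item.
via_splitlines = text.splitlines()

print(via_split)
print(via_readlines)
print(via_splitlines)
```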
