How are basic data types (strings and integers) implemented in Python and Perl
I've been wondering lately how various operations I perform on basic types like strings and integers work in terms of performance, and I figure I could get a much better idea of this if I knew how those basic types were implemented (i.e. I've heard stri开发者_高级运维ngs and integers are immutable in Python. Does that mean any operation that modifies one character in a string is O(n) because a completely new string has to be created? How about adding numbers?)
I'm curious about this in both Python and Perl, and felt silly asking basically the same question twice, so I'm just wrapping it into one.
If you can include some example operation costs with your answer, that would make it even more helpful.
In python, some_string[5] = 'a'
would be an error, but the closest equivalent operation, some_string = some_string[5:] + 'a' + some_string[6:]
would indeed be O(n). But that's not just true of immutable objects. The same is true for concatenating lists: [1,2,3] + [4,5,6]
generates a new list and is O(n).
Adding numbers creates a new value, but generally the resulting value is always the same size in memory, so it's O(1). Of course that only holds with small ints. Once you hit some threshold (20 digits on my machine), suddenly ints take up a variable amount of space. I don't know how that effects asymptotic performance.
However, I found that it doesn't seem to have even a significant effect even near log10(n) == 1000
:
>>> times = [timeit.timeit(stmt=stmt.format(10 ** i, 10 ** i), number=100) for i in range(1000)]
>>> sum(times) * 1.0 / len(times)
3.0851364135742186e-06
>>> times[-1]
3.0994415283203125e-06
For strings, the asymptotic performance hit is more immediately apparent:
>>> stmt = 's[:5] + "a" + s[6:]'
>>> setup = 's = "b" * {0}'
>>> times = [timeit.timeit(stmt=stmt, setup=setup.format(i), number=10) for i in range(100000)]
>>> sum(times) * 1.0 / len(times)
6.2434492111206052e-05
>>> times[-1]
0.0001220703125
The execution time for the last operation is well below the average. And the trend is pretty steady:
>>> for t in times[0:100000:10000]:
... print t
...
5.00679016113e-06
1.31130218506e-05
2.90870666504e-05
3.88622283936e-05
5.10215759277e-05
6.19888305664e-05
7.41481781006e-05
8.48770141602e-05
9.60826873779e-05
0.000108957290649
Still, operations like these on small strings are pretty cheap.
To expand on your other questions, indexed access is O(1) on both lists and strings.
>>> stmt = 'x = s[{0}] + s[{1}] + s[{2}]'
>>> setup = 's = "a" * {0}'
>>> times = [timeit.timeit(stmt=stmt.format(i / 2, i / 3, i / 4), setup=setup.format(i + 1), number=10) for i in range(1000000)]
>>> sum(times) * 1.0 / len(times)
3.6441037654876707e-06
>>> times[-1]
3.0994415283203125e-06
Likewise with lists:
>>> stmt = 'x = s[{0}] + s[{1}] + s[{2}]'
>>> setup = 's = ["a"] * {0}'
>>> times = [timeit.timeit(stmt=stmt.format(i / 2, i / 3, i / 4), setup=setup.format(i + 1), number=10) for i in range(100000)]
>>> sum(times) * 1.0 / len(times)
2.8617620468139648e-06
>>> times[-1]
1.9073486328125e-06
Slicing copies both strings and lists, and is therefore O(n) with n == len(slice)
. There is no "good" way to replace one letter of a string, although I want to emphasize that the "bad" way is good enough most of the time. If you want a "good" way, use a different data type; manipulate a list and join it when a string is required; or use a StringIO object. This page has some useful information on concatenating different built-in Python datatypes.
Finally, since you're really interested in the internals, I dug up the struct
declaration of PyStringObject
in stringobject.h
(from version 2.7; 3+ probably looks different). It's about what you'd expect -- a c string with some extra bells and whistles:
typedef struct {
PyObject_VAR_HEAD
(PyObject_VAR_HEAD
is a c preprocessor macro that expands to something like the below depending on rules explained here.)
Py_ssize_t ob_refcnt;
PyTypeObject *ob_type;
Py_ssize_t ob_size;
Continuing...
long ob_shash;
int ob_sstate;
char ob_sval[1];
/* Invariants:
* ob_sval contains space for 'ob_size+1' elements.
* ob_sval[ob_size] == 0.
* ob_shash is the hash of the string or -1 if not computed yet.
* ob_sstate != 0 iff the string object is in stringobject.c's
* 'interned' dictionary; in this case the two references
* from 'interned' to this object are *not counted* in ob_refcnt.
*/
} PyStringObject;
Lists have a similar structure -- c arrays with extra bells and whistles -- but aren't null terminated and generally have extra preallocated storage space.
Needless to say... much of this this applies only to cPython -- PyPy, IronPython, and Jython probably all look totally different!
Perl strings definitely are not immutable. Each string has a buffer, the initial offset of the string in the buffer, the length of buffer, and the amount of the buffer used. Additionally, for utf8 strings, the character length is cached when it needs to be calculated. At one point, there was some caching of additional character offset to byte offset information too, but I'm not certain that's still in place.
If the buffer needs to be increased, it reallocs it. Perl on many platforms knows the granularity of the system malloc, so it can allocate a, say, 14 byte buffer for a 11 byte string, knowing that that won't actually take any additional memory.
The initial offset allows O(1) removal of data from the beginning of the string.
精彩评论