How are basic data types (strings and integers) implemented in Python and Perl

2023-03-13 02:30 问答作者：

I've been wondering lately how various operations I perform on basic types like strings and integers work in terms of performance, and I figure I could get a much better idea of this if I knew how those basic types were implemented (i.e. I've heard stri开发者_高级运维ngs and integers are immutable in Python. Does that mean any operation that modifies one character in a string is O(n) because a completely new string has to be created? How about adding numbers?)

I'm curious about this in both Python and Perl, and felt silly asking basically the same question twice, so I'm just wrapping it into one.

If you can include some example operation costs with your answer, that would make it even more helpful.

In python, some_string[5] = 'a' would be an error, but the closest equivalent operation, some_string = some_string[5:] + 'a' + some_string[6:] would indeed be O(n). But that's not just true of immutable objects. The same is true for concatenating lists: [1,2,3] + [4,5,6] generates a new list and is O(n).

Adding numbers creates a new value, but generally the resulting value is always the same size in memory, so it's O(1). Of course that only holds with small ints. Once you hit some threshold (20 digits on my machine), suddenly ints take up a variable amount of space. I don't know how that effects asymptotic performance.

However, I found that it doesn't seem to have even a significant effect even near log10(n) == 1000:

>>> times = [timeit.timeit(stmt=stmt.format(10 ** i, 10 ** i), number=100) for i in range(1000)]
>>> sum(times) * 1.0 / len(times)
3.0851364135742186e-06
>>> times[-1]
3.0994415283203125e-06

For strings, the asymptotic performance hit is more immediately apparent:

>>> stmt = 's[:5] + "a" + s[6:]'
>>> setup = 's = "b" * {0}'
>>> times = [timeit.timeit(stmt=stmt, setup=setup.format(i), number=10) for i in range(100000)]
>>> sum(times) * 1.0 / len(times)
6.2434492111206052e-05
>>> times[-1]
0.0001220703125

The execution time for the last operation is well below the average. And the trend is pretty steady:

>>> for t in times[0:100000:10000]:
...     print t
... 
5.00679016113e-06
1.31130218506e-05
2.90870666504e-05
3.88622283936e-05
5.10215759277e-05
6.19888305664e-05
7.41481781006e-05
8.48770141602e-05
9.60826873779e-05
0.000108957290649

Still, operations like these on small strings are pretty cheap.

To expand on your other questions, indexed access is O(1) on both lists and strings.

>>> stmt = 'x = s[{0}] + s[{1}] + s[{2}]'
>>> setup = 's = "a" * {0}'
>>> times = [timeit.timeit(stmt=stmt.format(i / 2, i / 3, i / 4), setup=setup.format(i + 1), number=10) for i in range(1000000)]
>>> sum(times) * 1.0 / len(times)
3.6441037654876707e-06
>>> times[-1]
3.0994415283203125e-06

Likewise with lists:

>>> stmt = 'x = s[{0}] + s[{1}] + s[{2}]'
>>> setup = 's = ["a"] * {0}'
>>> times = [timeit.timeit(stmt=stmt.format(i / 2, i / 3, i / 4), setup=setup.format(i + 1), number=10) for i in range(100000)]
>>> sum(times) * 1.0 / len(times)
2.8617620468139648e-06
>>> times[-1]
1.9073486328125e-06

Slicing copies both strings and lists, and is therefore O(n) with n == len(slice). There is no "good" way to replace one letter of a string, although I want to emphasize that the "bad" way is good enough most of the time. If you want a "good" way, use a different data type; manipulate a list and join it when a string is required; or use a StringIO object. This page has some useful information on concatenating different built-in Python datatypes.

Finally, since you're really interested in the internals, I dug up the struct declaration of PyStringObject in stringobject.h (from version 2.7; 3+ probably looks different). It's about what you'd expect -- a c string with some extra bells and whistles:

typedef struct {
    PyObject_VAR_HEAD

(PyObject_VAR_HEAD is a c preprocessor macro that expands to something like the below depending on rules explained here.)

    Py_ssize_t ob_refcnt;
    PyTypeObject *ob_type;
    Py_ssize_t ob_size;

Continuing...

    long ob_shash;
    int ob_sstate;
    char ob_sval[1];

    /* Invariants:
     *     ob_sval contains space for 'ob_size+1' elements.
     *     ob_sval[ob_size] == 0.
     *     ob_shash is the hash of the string or -1 if not computed yet.
     *     ob_sstate != 0 iff the string object is in stringobject.c's
     *       'interned' dictionary; in this case the two references
     *       from 'interned' to this object are *not counted* in ob_refcnt.
     */
} PyStringObject;

Lists have a similar structure -- c arrays with extra bells and whistles -- but aren't null terminated and generally have extra preallocated storage space.

Needless to say... much of this this applies only to cPython -- PyPy, IronPython, and Jython probably all look totally different!

Perl strings definitely are not immutable. Each string has a buffer, the initial offset of the string in the buffer, the length of buffer, and the amount of the buffer used. Additionally, for utf8 strings, the character length is cached when it needs to be calculated. At one point, there was some caching of additional character offset to byte offset information too, but I'm not certain that's still in place.

If the buffer needs to be increased, it reallocs it. Perl on many platforms knows the granularity of the system malloc, so it can allocate a, say, 14 byte buffer for a 11 byte string, knowing that that won't actually take any additional memory.

The initial offset allows O(1) removal of data from the beginning of the string.

继续阅读：perl python

How are basic data types (strings and integers) implemented in Python and Perl

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？