Source Code Analysis of Python's Built-in str Type

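1 Unicode

The basic unit of computer storage is the byte, which consists of 8 bits. English needs only 26 letters plus a handful of symbols, so an English character fits in a single byte. Other languages (Chinese, Japanese, and Korean, for example) have so many characters that they need multi-byte encodings.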
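As computing spread, encodings for non-Latin scripts kept evolving, but two major limitations remained:
- No multilingual support: an encoding scheme designed for one language cannot be used for another
- No unified standard: Chinese alone has several encoding standards, such as GBK, GB2312, and GB18030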
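Because encodings were not unified, developers had to convert back and forth between them, and mistakes were inevitable. The Unicode standard was proposed to solve this. Unicode catalogs and encodes most of the world's writing systems so that computers can handle text in one uniform way. It currently contains more than 140,000 characters and supports multiple languages by design. (The "uni" in Unicode is the root meaning "unified".)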
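
2 Unicode in Python

2.1 Benefits of the unicode object

Since Python 3, the str object is represented internally as Unicode, which is why it is called the unicode object in the source code. The benefit is that the core logic of a program works with Unicode throughout. Decoding and encoding happen only at the input and output layers, which avoids most encoding problems.

[Figure: bytes are decoded at the input layer, the core logic works on Unicode, and text is encoded again at the output layer.]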
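
The decode-at-input, encode-at-output pattern can be sketched in a few lines. This is only an illustration: the helper names read_text and write_text and the sample data are made up for this example.

```python
# Hypothetical helpers: decode at the input boundary, encode at the output boundary.
def read_text(raw: bytes, encoding: str) -> str:
    """Input layer: turn encoded bytes into a Unicode str."""
    return raw.decode(encoding)


def write_text(text: str, encoding: str) -> bytes:
    """Output layer: turn a Unicode str back into encoded bytes."""
    return text.encode(encoding)


# Sample input that arrives GBK-encoded (assumed for the example).
raw_gbk = '中文'.encode('gbk')

# Core logic only ever sees str, regardless of the source encoding.
text = read_text(raw_gbk, 'gbk')
report = 'title: ' + text

# The output layer picks whatever encoding the consumer expects.
raw_utf8 = write_text(report, 'utf-8')
print(report, len(raw_gbk), len(raw_utf8))
```

The core logic in the middle never needs to know that the input was GBK or that the output is UTF-8.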
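
2.2 How Python optimizes Unicode

The problem: Unicode code points run up to U+10FFFF, so 2 bytes per character are not enough, and a fixed-width representation needs 4 bytes per character (3-byte units are generally not used). An English character takes only 1 byte in ASCII, so storing everything in 4-byte units would quadruple the cost of the most frequently used characters.

Let's first look at how the size of a str object varies with its content: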

    >>> import sys
    >>> sys.getsizeof('ab') - sys.getsizeof('a')
    1
    >>> sys.getsizeof('一二') - sys.getsizeof('一')
    2
    >>> sys.getsizeof('\U0001F600\U0001F600') - sys.getsizeof('\U0001F600')
    4

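This shows that Python optimizes unicode objects internally: it picks the underlying storage unit based on the text content.

The underlying storage of a unicode object falls into three kinds, according to the code point range of its characters:
- PyUnicode_1BYTE_KIND: all code points are between U+0000 and U+00FF
- PyUnicode_2BYTE_KIND: all code points are between U+0000 and U+FFFF, and at least one is greater than U+00FF
- PyUnicode_4BYTE_KIND: all code points are between U+0000 and U+10FFFF, and at least one is greater than U+FFFF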
The corresponding enum:

    enum PyUnicode_Kind {
    /* String contains only wstr byte characters.  This is only possible
       when the string was created with a legacy API and _PyUnicode_Ready()
       has not been called yet.  */
        PyUnicode_WCHAR_KIND = 0,
    /* Return values of the PyUnicode_KIND() macro: */
        PyUnicode_1BYTE_KIND = 1,
        PyUnicode_2BYTE_KIND = 2,
        PyUnicode_4BYTE_KIND = 4
    };

Each kind uses a different storage unit:

    /* Py_UCS4 and Py_UCS2 are typedefs for the respective
       unicode representations. */
    typedef uint32_t Py_UCS4;
    typedef uint16_t Py_UCS2;
    typedef uint8_t Py_UCS1;

The mapping is:

Kind                    Storage unit    Unit size (bytes)
PyUnicode_1BYTE_KIND    Py_UCS1         1
PyUnicode_2BYTE_KIND    Py_UCS2         2
PyUnicode_4BYTE_KIND    Py_UCS4         4
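
The classification above can be mirrored in pure Python and checked against what sys.getsizeof reports. This is a sketch, not CPython's code: storage_unit_size is a name invented here, and the comparison assumes a CPython interpreter, where growing a string by one character grows the object by one storage unit.

```python
import sys


def storage_unit_size(text: str) -> int:
    """Bytes per character, judged by the largest code point in the text."""
    maxchar = max(map(ord, text), default=0)
    if maxchar <= 0xFF:
        return 1   # PyUnicode_1BYTE_KIND
    if maxchar <= 0xFFFF:
        return 2   # PyUnicode_2BYTE_KIND
    return 4       # PyUnicode_4BYTE_KIND


# One sample character per range (chosen for this example).
samples = ['a', chr(0xE9), chr(0x4E00), chr(0x1F600)]

for ch in samples:
    # Size difference between 11 and 10 copies = cost of one more character.
    growth = sys.getsizeof(ch * 11) - sys.getsizeof(ch * 10)
    print(hex(ord(ch)), storage_unit_size(ch), growth)
```

Mixing ranges is governed by the largest code point, so storage_unit_size('a' + chr(0x4E00)) is 2 even though most of the text is ASCII.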
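
Because the internal storage layout depends on the kind of text, the kind must be kept as a common field of every unicode object. Python defines several flags as common fields. (Given my limited expertise, not all of them are covered in the rest of this article; feel free to dig into them on your own.)
- interned: whether the object is managed by the interned mechanism
- kind: the kind, which determines the size of the underlying storage unit
- compact: the memory allocation scheme, that is, whether the object and its text buffer are allocated separately
- ascii: whether the text is pure ASCII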
PyUnicode_New initializes a unicode object from the number of characters, size, and the largest character, maxchar. Its main job is to choose the most compact storage unit and underlying struct for the given maxchar. (The source is long, so it is not listed here; the table below summarizes it.)

                     maxchar < 128          128 <= maxchar < 256     256 <= maxchar < 65536   65536 <= maxchar <= MAX_UNICODE
kind                 PyUnicode_1BYTE_KIND   PyUnicode_1BYTE_KIND     PyUnicode_2BYTE_KIND     PyUnicode_4BYTE_KIND
ascii                1                      0                        0                        0
Unit size (bytes)    1                      1                        2                        4
Underlying struct    PyASCIIObject          PyCompactUnicodeObject   PyCompactUnicodeObject   PyCompactUnicodeObject
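
The decision table can be written out as a small Python function. This is a sketch of the selection logic only, not a port of PyUnicode_New: Layout and choose_layout are names made up for this example, and no memory is allocated.

```python
from typing import NamedTuple

MAX_UNICODE = 0x10FFFF


class Layout(NamedTuple):
    kind: int      # storage unit size in bytes: 1, 2 or 4
    ascii: int     # 1 if every character is below 128, else 0
    struct: str    # name of the underlying C struct from the table


def choose_layout(maxchar: int) -> Layout:
    """Pick the most compact layout for the given largest code point."""
    if maxchar < 0:
        raise ValueError('maxchar must not be negative')
    if maxchar < 128:
        return Layout(1, 1, 'PyASCIIObject')
    if maxchar < 256:
        return Layout(1, 0, 'PyCompactUnicodeObject')
    if maxchar < 65536:
        return Layout(2, 0, 'PyCompactUnicodeObject')
    if maxchar <= MAX_UNICODE:
        return Layout(4, 0, 'PyCompactUnicodeObject')
    raise ValueError('maxchar is outside the Unicode range')


for text in ['abc', 'caf' + chr(0xE9), chr(0x4E00), chr(0x1F600)]:
    print(repr(text), choose_layout(max(map(ord, text))))
```

The ascii column should agree with str.isascii(), which is a convenient way to check the first column of the table from Python.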

3 Underlying structs of the unicode object

3.1 PyASCIIObject

C source:

    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;          /* Number of code points in the string */
        Py_hash_t hash;             /* Hash value; -1 if not set */
        struct {
            unsigned int interned:2;
            unsigned int kind:3;
            unsigned int compact:1;
            unsigned int ascii:1;
            unsigned int ready:1;
            unsigned int :24;
        } state;
        wchar_t *wstr;              /* wchar_t representation (null-terminated) */
    } PyASCIIObject;

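Notes on the fields:
- length: length of the text
- hash: hash value of the text
- state: flag bits of the unicode object
- wstr: pointer to a cached wchar_t representation of the string, terminated by "\0". (I have seen two descriptions of where the ASCII text lives: pointed to by wstr, or placed immediately after the PyASCIIObject struct. The struct declares no field for the text, but wstr holds wchar_t units rather than 1-byte characters, and the CPython comment on PyCompactUnicodeObject says that for strings allocated through PyUnicode_New the data "immediately follow the structure". So the second description is the accurate one.)

[Figure: memory layout of PyASCIIObject.]

(Note the 4-byte hole after the state field. It comes from struct field alignment, whose main purpose is efficient memory access.)

The ASCII text sits right after the struct and ends with a terminating "\0". Take 'abc' and the empty string '' as examples:

[Figure: layout of the 'abc' and '' objects.]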
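
The layout of an ASCII string can be probed from Python with sys.getsizeof. This is a sketch under assumptions: it assumes CPython, and because the header size depends on the interpreter version and platform, it measures the header from the empty string instead of hard-coding a number. The function names are invented for this example.

```python
import sys


def ascii_header_size() -> int:
    """Size of the struct alone: the empty string still stores one NUL byte."""
    return sys.getsizeof('') - 1


def predicted_ascii_size(text: str) -> int:
    """Header, plus one byte per character, plus the terminating NUL."""
    if not text.isascii():
        raise ValueError('expected pure-ASCII text')
    return ascii_header_size() + len(text) + 1


for text in ['', 'abc', 'hello world', 'x' * 100]:
    print(repr(text[:12]), predicted_ascii_size(text), sys.getsizeof(text))
```

If the prediction matches the reported size, the whole string, text included, is accounted for by a single object of header + length + 1 bytes.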

3.2 PyCompactUnicodeObject

If the text is not pure ASCII, the unicode object is backed by the PyCompactUnicodeObject struct. The C source:

    /* Non-ASCII strings allocated through PyUnicode_New use the
       PyCompactUnicodeObject structure. state.compact is set, and the data
       immediately follow the structure. */
    typedef struct {
        PyASCIIObject _base;
        Py_ssize_t utf8_length;     /* Number of bytes in utf8, excluding the
                                     * terminating \0. */
        char *utf8;                 /* UTF-8 representation (null-terminated) */
        Py_ssize_t wstr_length;     /* Number of code points in wstr, possible
                                     * surrogates count as two code points. */
    } PyCompactUnicodeObject;

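PyCompactUnicodeObject adds 3 fields on top of PyASCIIObject:
- utf8_length: length in bytes of the UTF-8 encoding of the text, excluding the terminating \0
- utf8: the UTF-8 encoded form of the text, cached to avoid re-encoding
- wstr_length: the "length" of wstr. According to the comment in the source, this is the number of code points in wstr, where a surrogate pair counts as two. (I have not found a way to print it from Python; feel free to investigate.)

Note that PyASCIIObject does not store a UTF-8 form. ASCII text is already valid UTF-8, which is also why ASCII text is backed by PyASCIIObject.

[Figure: structure of PyCompactUnicodeObject.]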
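
The extra fields show up as a larger fixed header for non-ASCII strings. The sketch below assumes CPython and measures headers instead of hard-coding them, since the exact numbers vary by version and platform; header_size is a name made up for this example.

```python
import sys


def header_size(text: str, unit: int) -> int:
    """Object size minus the character buffer (len + 1 units, NUL included)."""
    return sys.getsizeof(text) - (len(text) + 1) * unit


ascii_header = header_size('a' * 10, 1)            # PyASCIIObject
latin1_header = header_size(chr(0xE9) * 10, 1)     # PyCompactUnicodeObject, 1-byte units
ucs2_header = header_size(chr(0x4E00) * 10, 2)     # PyCompactUnicodeObject, 2-byte units
ucs4_header = header_size(chr(0x1F600) * 10, 4)    # PyCompactUnicodeObject, 4-byte units

print(ascii_header, latin1_header, ucs2_header, ucs4_header)

# ASCII bytes are already valid UTF-8, so a separate cached copy is unnecessary.
print('abc'.encode('ascii') == 'abc'.encode('utf-8'))
```

The three non-ASCII headers should be equal to each other and larger than the ASCII one, which matches the struct definitions above.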

3.3 PyUnicodeObject

PyUnicodeObject is the concrete implementation of the str object in Python. The C source:

    /* Strings allocated through PyUnicode_FromUnicode(NULL, len) use the
       PyUnicodeObject structure. The actual string data is initially in the wstr
       block, and copied into the data block using _PyUnicode_Ready. */
    typedef struct {
        PyCompactUnicodeObject _base;
        union {
            void *any;
            Py_UCS1 *latin1;
            Py_UCS2 *ucs2;
            Py_UCS4 *ucs4;
        } data;                     /* Canonical, smallest-form Unicode buffer */
    } PyUnicodeObject;

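
3.4 Example

In day-to-day development, keep an eye on how concatenation changes a string's memory footprint. Appending a single character above U+FFFF forces every character in the string into a 4-byte unit:

    >>> import sys
    >>> text = 'a' * 1000
    >>> sys.getsizeof(text)
    1049
    >>> text += '\U0001F600'
    >>> sys.getsizeof(text)
    4080
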
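
The jump can be broken down into per-character cost. This sketch assumes CPython and uses U+1F600 as an arbitrary sample character above U+FFFF. It compares each string against a shorter string of the same kind so that the fixed header cancels out.

```python
import sys

text = 'a' * 1000
wide_char = chr(0x1F600)      # assumed sample character above U+FFFF
widened = text + wide_char

# 1000 ASCII characters on top of the empty string: 1 byte each.
ascii_growth = sys.getsizeof(text) - sys.getsizeof('')

# The same 1000 'a' characters on top of the lone wide character: 4 bytes each.
wide_growth = sys.getsizeof(widened) - sys.getsizeof(wide_char)

print(ascii_growth, wide_growth)
```

The 1000 'a' characters are the same in both strings. Only the presence of one wide character changes what each of them costs.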

4 The interned mechanism

If the interned flag of a str object is set, the Python virtual machine enables the interned mechanism for it.

The source is below. (There are many explanations of it online. I have not found a definitive one yet and will add it later, but reading the source already tells us a lot.)

    /* This dictionary holds all interned unicode strings.  Note that references
       to strings in this dictionary are *not* counted in the string's ob_refcnt.
       When the interned string reaches a refcnt of 0 the string deallocation
       function will delete the reference from this dictionary.

       Another way to look at this is that to say that the actual reference
       count of a string is:  s->ob_refcnt + (s->state ? 2 : 0)
    */
    static PyObject *interned = NULL;

    void
    PyUnicode_InternInPlace(PyObject **p)
    {
        PyObject *s = *p;
        PyObject *t;
    #ifdef Py_DEBUG
        assert(s != NULL);
        assert(_PyUnicode_CHECK(s));
    #else
        if (s == NULL || !PyUnicode_Check(s))
            return;
    #endif
        /* If it's a subclass, we don't really know what putting
           it in the interned dict might do. */
        if (!PyUnicode_CheckExact(s))
            return;
        if (PyUnicode_CHECK_INTERNED(s))
            return;
        if (interned == NULL) {
            interned = PyDict_New();
            if (interned == NULL) {
                PyErr_Clear(); /* Don't leave an exception */
                return;
            }
        }
        Py_ALLOW_RECURSION
        t = PyDict_SetDefault(interned, s, s);
        Py_END_ALLOW_RECURSION
        if (t == NULL) {
            PyErr_Clear();
            return;
        }
        if (t != s) {
            Py_INCREF(t);
            Py_SETREF(*p, t);
            return;
        }
        /* The two references in interned are not counted by refcnt.
           The deallocator will take care of this */
        Py_REFCNT(s) -= 2;
        _PyUnicode_STATE(s).interned = SSTATE_INTERNED_MORTAL;
    }

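As you can see, the function starts with some basic checks. Then look at the PyDict_SetDefault(interned, s, s) call and the Py_REFCNT(s) -= 2 statement. When s is added to the interned dict, it serves as both key and value. (I am not entirely sure why; presumably a dict is used so that a lookup with an equal string can return the canonical object.) The dict therefore holds two references to s (see the source of PyDict_SetDefault() for details). Py_REFCNT(s) -= 2 takes those two back out, in line with the comment that references held by the interned dict are not counted in ob_refcnt.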
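Consider the following scenario:

    >>> class User:
    ...     def __init__(self, name, age):
    ...         self.name = name
    ...         self.age = age
    ...
    >>> user = User('tom', 21)
    >>> user.__dict__
    {'name': 'tom', 'age': 21}

Object attributes are stored in a dict, which would mean that every User object has to keep its own str object 'name', wasting a lot of memory. Since str is immutable, Python turns strings that are likely to be duplicated into singletons. This is the interned mechanism. Concretely, Python maintains a global dict that holds every str object with the interned mechanism enabled. When such a string is needed later, the new object is created first, and if an equal string is already in the dict, the newly created object is discarded.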
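
The effect on attribute names can be observed directly. This is a sketch that assumes CPython; the second instance and the variable names are added here for illustration.

```python
class User:
    def __init__(self, name, age):
        self.name = name
        self.age = age


tom = User('tom', 21)
amy = User('amy', 30)

tom_keys = list(tom.__dict__)
amy_keys = list(amy.__dict__)

# Identity check, not equality: are the key objects literally the same str?
shared = all(a is b for a, b in zip(tom_keys, amy_keys))
print(tom_keys, amy_keys, shared)
```

Both instances refer to the same 'name' and 'age' objects instead of holding private copies.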
Example:

'abc' produced by different expressions ends up as the same object:

    >>> a = 'abc'
    >>> b = 'ab' + 'c'
    >>> id(a), id(b), a is b
    (2752416949872, 2752416949872, True)

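
As far as I know, CPython evaluates 'ab' + 'c' at compile time, so that example compares two constants. For strings that are really built at run time, sys.intern requests the same behavior explicitly. A minimal sketch, with variable names chosen for this example:

```python
import sys

literal = 'abc'

prefix = 'ab'
built = prefix + 'c'      # concatenated at run time from a variable

# Whether a run-time result is already the interned object is an
# implementation detail, so it is printed here rather than relied on.
print(built == literal, built is literal)

# sys.intern returns the canonical object for any equal string.
canonical = sys.intern(built)
print(canonical is sys.intern(literal))
```

After the sys.intern call, every equal string passed through sys.intern resolves to one object, so comparisons between them can use identity.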
That concludes this source code analysis of Python's built-in str type.
