Exploring the PHP Kernel: Variables (7) - The Not-So-Ordinary String

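Pfft, it's just a string. What is there to study?
Don't say that. Have you read "Ordinary World" (《平凡的世界》)? Even an ordinary string can have an extraordinary story. Consider:
(1) In C, what is the time complexity of computing a string's length with strlen? What about in PHP?
(2) In PHP, how do you handle multibyte strings? How good is PHP's Unicode support?
They are all strings, so why do C strings differ from those in C++, PHP, and Java?
"Data structures determine algorithms." That saying could not be more true.
So today we will pick apart the string structure in PHP and the implementation of some related string functions.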
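I. String basics
Strings are among the most frequently used data structures in PHP (the other common one is the array; see "Exploring the PHP Kernel: Variables (4) - Array Operations"). Because of PHP's nature and its typical use cases, much of our daily work is really string processing. For the same reason, PHP gives developers a rich set of string functions (about 100 by a rough count, which is considerable). So how are strings implemented in PHP, and how do they differ from C strings?
1. Forms of strings in PHP
There are four common ways to write a string in PHP:
(1) Double quotes
This is the most common form:
$str = "this is \0 a string";
A double-quoted string may also contain variables, control characters, and so on:
$str = "this is $name, aha.\n";
(2) Single quotes
Characters inside single quotes are treated as raw (apart from the escapes \' and \\), so variables and control characters are not interpreted:
$string = 'test';
$str = 'this is $string, aha\n';
echo $str;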
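(3) heredoc
heredoc suits longer strings and is more flexible for multi-line text. Like the double-quoted form, a heredoc may contain variables. A typical form is:
$string = 'test string';
$str = <<< ...
(4) nowdoc (supported since PHP 5.3)
nowdoc and heredoc are so similar that we can think of them as brothers. The opening identifier of a nowdoc is wrapped in single quotes and, like a single-quoted string, it does not interpret variables or formatting control characters: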
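The four forms can be compared side by side in a minimal sketch. The variable names and string contents here are invented for illustration and are not the article's original examples. The last lines also hint at the strlen question raised at the start: PHP strings are binary safe, so an embedded NUL byte counts toward the length instead of terminating the string.

```php
<?php
// Illustrative values only; $name and the heredoc/nowdoc bodies are made up.
$name = 'world';

$double = "hello $name\n";   // $name and \n are interpreted
$single = 'hello $name\n';   // kept literally: backslash and n are two characters

// heredoc behaves like a double-quoted string
$heredoc = <<<EOT
hello $name
EOT;

// nowdoc (quoted identifier) behaves like a single-quoted string
$nowdoc = <<<'EOT'
hello $name
EOT;

// Binary safety: the NUL byte is an ordinary byte of the string.
$withNul = "a\0b";

var_dump($double === "hello world\n");
var_dump($single, $heredoc, $nowdoc);
var_dump(strlen($withNul));
```

In C, the same bytes passed to strlen would be measured only up to the NUL.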
$s = <<< ...
... i [2] => am [3] => down [4] => and [5] => oh [6] => my [7] => soul [8] => so [9] => weary [10] => when [11] => troubles ......)
What is this feature good for? English word segmentation, for example. Remember the "word count" problem? str_word_count makes short work of finding the top-k most frequent words:
$s = file_get_contents('./word');
$a = array_count_values(str_word_count($s, 1));
arsort($a);
print_r($a);
/*
Array
(
    [i] => 10
    [me] => 7
    [raise] => 6
    [up] => 6
    [you] => 6
    [am] => 6
    [on] => 6
    [can] => 4
    [and] => 4
    [be] => 3
    [so] => 3
    ......
)
*/
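(3) $format = 2
With $format = 2, the function returns an associative array whose keys are the byte positions of the words in the string:
$a = str_word_count($s, 2);
print_r($a);
/*
Array
(
    [0] => when
    [5] => i
    [7] => am
    [10] => down
    [15] => and
    [20] => oh
    [23] => my
    [26] => soul
    [32] => so
    [35] => weary
    [41] => when
    [46] => troubles
    [55] => come
    ...
)
*/
Combined with other array functions, this enables more varied uses. For example, together with array_flip it gives the position of the last occurrence of each word (when keys collide during the flip, the later one wins):
$t = array_flip(str_word_count($s, 2));
print_r($t);
If array_unique is applied before array_flip, the result is the position of the first occurrence of each word (array_unique keeps the first of each set of duplicate values):
$t = array_flip(array_unique(str_word_count($s, 2)));
print_r($t);
/*
Array
(
    [when] => 0
    [i] => 5
    [am] => 7
    [down] => 10
    [and] => 15
    [oh] => 20
    [my] => 23
    [soul] => 26
    [so] => 32
    [weary] => 35
    [troubles] => 46
    [come] => 55
    [heart] => 67
    ...
)
*/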
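To make the first/last occurrence trick easy to verify by hand, here is a small self-contained sketch. The input sentence is made up for illustration so that the positions can be counted directly.

```php
<?php
// A short, made-up sentence in which "when", "i" and "am" each occur twice.
$s = "when i am down and when i am weary";

// $format = 2: keys are byte positions, values are the words.
$words = str_word_count($s, 2);

// Flipping directly: for repeated words the later position overwrites the earlier one.
$last = array_flip($words);

// array_unique keeps the first occurrence of each word, so flipping afterwards
// yields first positions.
$first = array_flip(array_unique($words));

print_r($first);
print_r($last);
```

For this input, "when" maps to 0 in $first and to 19 in $last.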
3. similar_text
Besides levenshtein(), this is the other function for computing the similarity of two strings:
int similar_text ( string $first , string $second [, float &$percent ] )
$t1 = "you raise me up, so i can stand on mountains";
$t2 = "you raise me up, to walk on stormy seas";
$percent = 0;
echo similar_text($t1, $t2, $percent) . PHP_EOL; // 26
echo $percent; // 62.650602409639
Usage aside, I was curious how string similarity is actually defined at the lower level.
The similar_text function is implemented in ext/standard/string.c. The key code is:
PHP_FUNCTION(similar_text)
{
    char *t1, *t2;
    zval **percent = NULL;
    int ac = ZEND_NUM_ARGS();
    int sim;
    int t1_len, t2_len;

    /* parse the arguments */
    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ss|Z", &t1, &t1_len, &t2, &t2_len, &percent) == FAILURE) {
        return;
    }

    /* set percent to double type */
    if (ac > 2) {
        convert_to_double_ex(percent);
    }

    /* t1_len == 0 && t2_len == 0 */
    if (t1_len + t2_len == 0) {
        if (ac > 2) {
            Z_DVAL_PP(percent) = 0;
        }
        RETURN_LONG(0);
    }

    /* count the matching characters */
    sim = php_similar_char(t1, t1_len, t2, t2_len);

    /* similarity percentage */
    if (ac > 2) {
        Z_DVAL_PP(percent) = sim * 200.0 / (t1_len + t2_len);
    }

    RETURN_LONG(sim);
}
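As the code shows, the number of matching characters is computed by php_similar_char, while the similarity percentage is defined by the formula:
percent = sim * 200 / (length of t1 + length of t2)
In the example above that is 26 * 200 / (44 + 39), which is about 62.65.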
The implementation of php_similar_char:
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
    int sum;
    int pos1 = 0, pos2 = 0, max;

    php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);
    if ((sum = max)) {
        if (pos1 && pos2) {
            sum += php_similar_char(txt1, pos1, txt2, pos2);
        }
        if ((pos1 + max < len1) && (pos2 + max < len2)) {
            sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
                                    txt2 + pos2 + max, len2 - pos2 - max);
        }
    }

    return sum;
}
This function does its counting by calling php_similar_str, which finds the longest common substring of txt1 and txt2 and reports its length through *max and its starting offsets through *pos1 and *pos2:
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int *pos2, int *max)
{
    char *p, *q;
    char *end1 = (char *) txt1 + len1;
    char *end2 = (char *) txt2 + len2;
    int l;

    *max = 0;
    /* find the longest common run */
    for (p = (char *) txt1; p < end1; p++) {
        for (q = (char *) txt2; q < end2; q++) {
            for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
            if (l > *max) {
                *max = l;
                *pos1 = p - txt1;
                *pos2 = q - txt2;
            }
        }
    }
}
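Once php_similar_str has found its match, each original string is split into three parts:
The first part lies to the left of the longest common run. It may contain matching runs, but none as long as the longest one.
The second part is the longest common run itself.
The third part lies to the right of the longest common run. Like the first part, it may contain matching runs, but none longer than the longest one. The matching lengths of the first and third parts therefore have to be computed recursively:
/* matches to the left of the longest run */
if (pos1 && pos2) {
    sum += php_similar_char(txt1, pos1, txt2, pos2);
}
/* matches to the right of the longest run */
if ((pos1 + max < len1) && (pos2 + max < len2)) {
    sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
                            txt2 + pos2 + max, len2 - pos2 - max);
}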
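The same divide-and-recurse idea can be tried out in userland. The following is a minimal sketch that ports the C logic quoted above to PHP. The function names sketch_similar_str and sketch_similar_char are hypothetical and are not part of PHP. It is meant for experimenting with the algorithm, not as a replacement for the built-in.

```php
<?php
// Hypothetical helper: returns [pos1, pos2, max] for the first longest common run,
// scanning $a from left to right and, for each position, $b from left to right.
function sketch_similar_str(string $a, string $b): array
{
    $len1 = strlen($a);
    $len2 = strlen($b);
    $pos1 = 0;
    $pos2 = 0;
    $max = 0;
    for ($p = 0; $p < $len1; $p++) {
        for ($q = 0; $q < $len2; $q++) {
            $l = 0;
            while ($p + $l < $len1 && $q + $l < $len2 && $a[$p + $l] === $b[$q + $l]) {
                $l++;
            }
            if ($l > $max) {   // strictly longer, so the earliest longest run wins
                $max = $l;
                $pos1 = $p;
                $pos2 = $q;
            }
        }
    }
    return [$pos1, $pos2, $max];
}

// Hypothetical helper: longest run, plus the recursive results for the parts
// to its left and to its right.
function sketch_similar_char(string $a, string $b): int
{
    [$pos1, $pos2, $max] = sketch_similar_str($a, $b);
    if ($max === 0) {
        return 0;
    }
    $sum = $max;
    if ($pos1 > 0 && $pos2 > 0) {
        $sum += sketch_similar_char(substr($a, 0, $pos1), substr($b, 0, $pos2));
    }
    if ($pos1 + $max < strlen($a) && $pos2 + $max < strlen($b)) {
        $sum += sketch_similar_char(substr($a, $pos1 + $max), substr($b, $pos2 + $max));
    }
    return $sum;
}

// "Wor" is the longest run (3); to its right, "ld" and "d" share one more character.
$sim = sketch_similar_char("World", "Word");
$percent = $sim * 200 / (strlen("World") + strlen("Word"));
echo $sim, PHP_EOL;
echo $percent, PHP_EOL;
```

Because the search always takes the first longest run in scan order, the result can depend on argument order, which is worth keeping in mind when using the score for ranking.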
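The matching process is shown in the figure below:
[Figure: the recursive matching process of php_similar_char]
For more on string functions, see the PHP online manual; they are not listed one by one here.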
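III. Multibyte strings
So far, every string and string function we have discussed has been single-byte. But the world is rich and varied: just as there are watermelons with red flesh and watermelons with yellow flesh, strings come in more than one kind. The Chinese characters we use every day, for instance, take two bytes each in GBK encoding. Multibyte strings are not limited to Chinese either; they cover Japanese, Korean, and the scripts of many other countries. That is why handling multibyte strings matters so much.
Characters and character sets are terms you inevitably keep running into when programming. If this area is not entirely clear to you, consider reading "Encoding Essentials 1: Character Encoding Basics - Characters and Character Sets" (《编码大事1字符编码基础-字符和字符集》) first.
Since Chinese is what we use most day to day, we take Chinese substring extraction as the example and focus on the problems of Chinese strings.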
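Extracting substrings from Chinese text
Cutting Chinese strings has always been a relatively troublesome problem, because:
(1) PHP's native substr function works on bytes, so it only handles single-byte strings properly and falls short on multibyte ones.
(2) PHP's mbstring extension needs server-side support. In practice, many development environments do not have mbstring enabled, which is a pity for those used to it.
(3) A more complicated issue: in UTF-8, although Chinese characters take 3 bytes, some special characters found in Chinese text (such as the middle dot ·) are actually encoded in two bytes. This undoubtedly makes cutting Chinese strings harder (after all, Chinese text can hardly avoid special characters altogether).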
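The byte counts mentioned in the list above are easy to check. The sketch below assumes the script file itself is saved as UTF-8. It also shows what goes wrong when substr cuts in the middle of a character: the result is no longer valid UTF-8, which is detected here with an empty pattern and the /u modifier.

```php
<?php
// Assumes this source file is saved as UTF-8.
$han = "中";   // a common Chinese character
$dot = "·";    // middle dot, U+00B7

echo strlen($han), PHP_EOL;   // 3 bytes
echo strlen($dot), PHP_EOL;   // 2 bytes
echo bin2hex($dot), PHP_EOL;  // c2b7

// Cutting by bytes splits the first character after two of its three bytes.
$cut = substr("中文", 0, 2);

// With /u, preg_match returns false when the subject is not valid UTF-8.
var_dump(preg_match('//u', "中文"));
var_dump(preg_match('//u', $cut));
```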
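Headache or not, we still have to roll our own library for cutting Chinese strings. The function should take a parameter list similar to substr and support both GBK and UTF-8 encoded Chinese. For efficiency, if the server already has the mbstring extension enabled, it should simply use mbstring's substring function.
API:
string cnsubstr(string $str, int $start, int $len, [$encode = 'gbk']); // note: $start and $len are counted in characters, not bytes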
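We take UTF-8 as the example to explain the approach to cutting Chinese text.
(1) Encoding ranges:
UTF-8 encoding ranges (UTF-8 was designed to use 1 to 6 bytes per character, but in practice only 1 to 4 bytes are used):
1 byte: 00 - 7F
2 bytes: C080 - DFBF
3 bytes: E08080 - EFBFBF
4 bytes: F0808080 - F7BFBFBF
From this, the number of bytes a character occupies can be determined from the range of its first byte:
$ord = ord($str[$i]);
$ord < 192: single byte, including control characters (values 128 to 191 are continuation bytes and never start a valid character)
192 <= $ord < 224: two bytes
224 <= $ord < 240: three bytes
Commonly used Chinese characters never take four bytes (only rare characters outside the Basic Multilingual Plane do), so the four-byte case is not handled here.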
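The first-byte rule can be written down as a small function. This is a minimal sketch: the name sketch_utf8_seq_len is hypothetical, and unlike the article's code it also reports four-byte sequences, on the assumption that the input may contain characters such as emoji. A stray continuation byte is treated as one byte so that a scanner can step over it.

```php
<?php
// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with a byte whose ordinal value is $ord.
function sketch_utf8_seq_len(int $ord): int
{
    if ($ord < 0xC0) {
        return 1;   // ASCII (below 0x80) or a stray continuation byte (0x80 to 0xBF)
    }
    if ($ord < 0xE0) {
        return 2;   // 110xxxxx
    }
    if ($ord < 0xF0) {
        return 3;   // 1110xxxx
    }
    return 4;       // 11110xxx
}

// Assumes this source file is saved as UTF-8.
foreach (["a", "·", "中", "\xF0\x9F\x98\x80"] as $ch) {
    echo bin2hex($ch), ' => ', sketch_utf8_seq_len(ord($ch[0])), PHP_EOL;
}
```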
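(2) The case where $start is negative
if ($start < 0) {
    $start += cnstrlen_utf8($str);
    if ($start < 0) {
        $start = 0;
    }
}
Most substring implementations found online do not handle $start < 0. Following the API design of PHP's substr, when $start < 0 the length of the string should be added to it (for multibyte strings, length means the number of characters).
Here cnstrlen_utf8 returns the number of characters in a UTF-8 encoded string:
function cnstrlen_utf8($str)
{
    $len = 0;
    $i = 0;
    $slen = strlen($str);   // length in bytes
    while ($i < $slen) {
        $ord = ord($str[$i]);
        if ($ord < 192) {
            $i++;           // single byte
        } elseif ($ord < 224) {
            $i += 2;        // two bytes
        } else {
            $i += 3;        // three bytes; four-byte sequences are not handled
        }
        $len++;
    }
    return $len;
}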
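As a cross-check for the character count, there is an alternative technique that does not depend on the width table at all: in valid UTF-8, every character has exactly one byte that is not a continuation byte (10xxxxxx), so counting those bytes gives the number of characters. The sketch below uses the hypothetical name sketch_utf8_strlen, assumes the input is valid UTF-8, and assumes the script file is saved as UTF-8.

```php
<?php
// Hypothetical helper: count the bytes that are NOT continuation bytes.
function sketch_utf8_strlen(string $str): int
{
    $count = 0;
    $n = strlen($str);
    for ($i = 0; $i < $n; $i++) {
        // Continuation bytes have the bit pattern 10xxxxxx.
        if ((ord($str[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

$sample = "这是abc·";
echo strlen($sample), PHP_EOL;              // bytes: 3 + 3 + 3 + 2 = 11
echo sketch_utf8_strlen($sample), PHP_EOL;  // characters: 6
```

This variant also counts four-byte characters correctly, since it never needs to know the width of a sequence.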
The UTF-8 substring algorithm is therefore:
function cnsubstr_utf8($str, $start, $len)
{
    $slen = strlen($str);          // length in bytes
    $clen = cnstrlen_utf8($str);   // length in characters

    if ($start < 0) {
        $start += $clen;
        if ($start < 0) {
            $start = 0;
        }
    }
    if ($len < 0) {
        /* a negative $len drops that many characters from the end */
        $len += $clen - $start;
        if ($len < 0) {
            $len = 0;
        }
    }

    $i = 0;
    $count = 0;
    /* advance to the starting position */
    while ($i < $slen && $count < $start) {
        $ord = ord($str[$i]);
        if ($ord < 192) {
            $i++;
        } elseif ($ord < 224) {
            $i += 2;
        } else {
            $i += 3;
        }
        $count++;
    }

    $count = 0;
    $substr = '';
    /* take $len characters */
    while ($i < $slen && $count < $len) {
        $ord = ord($str[$i]);
        if ($ord < 192) {
            $width = 1;
        } elseif ($ord < 224) {
            $width = 2;
        } else {
            $width = 3;   // four-byte sequences are not handled
        }
        $substr .= substr($str, $i, $width);
        $i += $width;
        $count++;
    }
    return $substr;
}
The final cnsubstr() can then be designed as follows (there is still plenty of room for optimization):
function cnsubstr($str, $start, $len, $encode = 'gbk')
{
    if (extension_loaded('mbstring')) {
        // Uncomment to delegate to mbstring when it is available:
        // echo 'use mbstring';
        // return mb_substr($str, $start, $len, $encode);
    }
    $enc = strtolower($encode);
    switch ($enc) {
        case 'gbk':
        case 'gb2312':
            return cnsubstr_gbk($str, $start, $len);
        case 'utf-8':
        case 'utf8':
            return cnsubstr_utf8($str, $start, $len);
        default:
            // do some warning or trigger error
    }
}
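The GBK branch calls cnsubstr_gbk, which is not listed in this article. For orientation only, here is one way such a function could look. It is a sketch under stated assumptions and not the author's implementation: the names sketch_gbk_chars and sketch_cnsubstr_gbk are hypothetical, the input is assumed to be well-formed GBK, and a byte in the range 0x81 to 0xFE is taken to start a two-byte character.

```php
<?php
// Hypothetical helper: split a GBK byte string into characters.
function sketch_gbk_chars(string $str): array
{
    $chars = [];
    $n = strlen($str);
    $i = 0;
    while ($i < $n) {
        $b = ord($str[$i]);
        // A GBK lead byte (0x81 to 0xFE) is followed by one trail byte.
        $width = ($b >= 0x81 && $b <= 0xFE && $i + 1 < $n) ? 2 : 1;
        $chars[] = substr($str, $i, $width);
        $i += $width;
    }
    return $chars;
}

// Hypothetical helper: array_slice provides substr-like handling of negative offsets.
function sketch_cnsubstr_gbk(string $str, int $start, int $len): string
{
    return implode('', array_slice(sketch_gbk_chars($str), $start, $len));
}

// "\xD6\xD0\xCE\xC4" is the GBK encoding of the two characters 中文.
$gbk = "\xD6\xD0\xCE\xC4ab";
echo bin2hex(sketch_cnsubstr_gbk($gbk, 1, 2)), PHP_EOL;   // second character plus "a"
```

Splitting first and slicing afterwards costs an extra array, but it keeps the offset arithmetic in one well-tested built-in.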
A quick test:
$str = "这是中文的字符串string,还有abs·";
for ($i = 0; $i < 10; $i++) {
    echo cnsubstr($str, $i, 3, 'utf8') . PHP_EOL;
}
Finally, here is the msubstr function provided in the ThinkPHP Extend library (a substr built on regular expressions):
function msubstr($str, $start=0, $length, $charset="utf-8", $suffix=true) {
    if(function_exists("mb_substr"))
        $slice = mb_substr($str, $start, $length, $charset);
    elseif(function_exists('iconv_substr')) {
        $slice = iconv_substr($str,$start,$length,$charset);
        if(false === $slice) {
            $slice = '';
        }
    }else{
        $re['utf-8']  = "/[\x01-\x7f]|[\xc2-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xff][\x80-\xbf]{3}/";
        $re['gb2312'] = "/[\x01-\x7f]|[\xb0-\xf7][\xa0-\xfe]/";
        $re['gbk']    = "/[\x01-\x7f]|[\x81-\xfe][\x40-\xfe]/";
        $re['big5']   = "/[\x01-\x7f]|[\x81-\xfe]([\x40-\x7e]|[\xa1-\xfe])/";
        preg_match_all($re[$charset], $str, $match);
        $slice = join("",array_slice($match[0], $start, $length));
    }
    return $suffix ? $slice.'...' : $slice;
}
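For UTF-8 specifically, the regex idea can be expressed more compactly with PCRE's /u modifier, which makes the engine itself split on character boundaries. The sketch below is a variation on the technique, not ThinkPHP's code: the name sketch_u_substr is hypothetical, it handles UTF-8 only, and it assumes the script file is saved as UTF-8.

```php
<?php
// Hypothetical helper: character-based substr for UTF-8 using preg_split with /u.
function sketch_u_substr(string $str, int $start, ?int $len = null): string
{
    // An empty pattern with /u splits between every pair of UTF-8 characters.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return '';   // the input was not valid UTF-8
    }
    // array_slice treats negative $start and $len much like substr does.
    return implode('', array_slice($chars, $start, $len));
}

echo sketch_u_substr("这是中文的字符串string", 0, 3), PHP_EOL;
echo sketch_u_substr("abs·x", 3, 1), PHP_EOL;
echo sketch_u_substr("中文字符", -2), PHP_EOL;
```

Like msubstr's fallback branch, this builds an array of all characters first, so for very long strings a byte-scanning loop or mb_substr is likely to be cheaper.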
For reasons of length, further questions are not covered here. As always, if you spot any problems, please point them out.