Implementing a Sensitive-Word Filter in PHP

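I put together a sensitive-word filter in my spare time over the weekend. Here is a record of how it works.
Sensitive words come from two sources: regulatory restrictions on online content in China, and our own need to filter out things like personal attacks or advertising. Word lists are easy to find with a Google search; there are plenty of them.
Filtering with a simple loop of str_replace calls is inefficient: the text is scanned once for every word, so the cost keeps growing as the word list grows. Plain replacement also cannot catch words that are not exact matches. This is where a trie (dictionary tree) comes in. A plain trie takes up a fair amount of space; a double-array trie or a ternary search tree can save some of it while keeping performance. But a sensitive-word list is rarely large, and a few thousand or even ten thousand-plus words are no real burden, so this implementation simply builds a plain trie first and then matches character by character.
There is not much code, so I will paste it here.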
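To make the trie idea concrete, here is a minimal sketch of building one as nested PHP arrays. The function names `splitChars` and `buildTrie`, the use of `preg_split` with the `u` modifier, and the `'end'` marker key are assumptions for this illustration, and the input is assumed to be valid UTF-8 on PHP 7 or later.

```php
<?php
// Split a UTF-8 string into an array of characters.
// Returns an empty array if the input is not valid UTF-8.
function splitChars(string $s): array
{
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    return $chars === false ? array() : $chars;
}

// Insert every word into a nested-array trie.
// The key 'end' marks a node where a complete word stops.
function buildTrie(array $words): array
{
    $trie = array();
    foreach ($words as $word) {
        $node = &$trie;
        foreach (splitChars($word) as $ch) {
            if (!isset($node[$ch])) {
                $node[$ch] = array();
            }
            $node = &$node[$ch]; // descend one level
        }
        $node['end'] = true;
        unset($node); // drop the reference before the next word
    }
    return $trie;
}

print_r(buildTrie(array('敏感', '敏感词')));
```

Both words share the path 敏 → 感. The node for 感 carries an `end` marker and also a child 词, which carries its own `end` marker, so common prefixes are stored only once.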
    ...
            $this->dict = array();
            $this->dictpath = $dictpath;
            $this->initdict();
        }

        private function initdict()
        {
            $handle = fopen($this->dictpath, 'r');
            if (!$handle) {
                throw new RuntimeException('open dictionary file error.');
            }
            while (!feof($handle)) {
                $word = trim(fgets($handle, 128));
                if (empty($word)) {
                    continue;
                }
                $uword = $this->unicodesplit($word);
                $pdict = &$this->dict;
                $count = count($uword);
                for ($i = 0; $i ...

    ... $this->dict[$ustr[$i]];
                    $matchindexes = array();
                    for ($j = $i + 1, $d = 0; $d ...

    ... >= 2) {
                        if ((ord($str[$i + 1]) & 0xc0) == 0x80) {
                            $uc = substr($str, $i, 2);
                            $ret[] = $uc;
                            $i += 1;
                        }
                    }
                } else {
                    $ret[] = $str[$i];
                }
            }
            return $ret;
        }
    }
Usage
    ... ->filter('这是一个敏感词', 10);
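As a rough illustration of the approach described in this post (build a trie, then match character by character while tolerating a few unrelated characters in between), here is a minimal self-contained sketch. It is not the original class. The class name `SensitiveWordFilter`, the constructor taking an array of words instead of a dictionary file path, the `preg_split`-based splitting, and the exact meaning given to the second argument of `filter` are all assumptions made for this example (PHP 7 or later, valid UTF-8 input).

```php
<?php
// Hypothetical sketch; names and details are assumptions, not the original code.
class SensitiveWordFilter
{
    private $dict = array();

    public function __construct(array $words)
    {
        foreach ($words as $word) {
            $node = &$this->dict;
            foreach ($this->split($word) as $ch) {
                if (!isset($node[$ch])) {
                    $node[$ch] = array();
                }
                $node = &$node[$ch];
            }
            $node['end'] = true; // marks the end of a complete word
            unset($node);
        }
    }

    // Replace the characters of matched words with '*'.
    // Up to $maxDistance - 1 unrelated characters may sit between two
    // consecutive characters of a word and the word is still matched.
    public function filter(string $str, int $maxDistance = 5): string
    {
        $maxDistance = max(1, $maxDistance);
        $chars = $this->split($str);
        $count = count($chars);
        for ($i = 0; $i < $count; $i++) {
            if (!isset($this->dict[$chars[$i]])) {
                continue;
            }
            $node = $this->dict[$chars[$i]];
            $matched = array($i);                          // indexes matched so far
            $hit = isset($node['end']) ? $matched : array(); // last complete word
            for ($j = $i + 1, $d = 0; $d < $maxDistance && $j < $count; $j++, $d++) {
                if (isset($node[$chars[$j]])) {
                    $node = $node[$chars[$j]];
                    $matched[] = $j;
                    $d = -1; // reset the gap counter (becomes 0 after $d++)
                    if (isset($node['end'])) {
                        $hit = $matched;
                    }
                }
            }
            foreach ($hit as $k) {
                $chars[$k] = '*';
            }
        }
        return implode('', $chars);
    }

    // Split a UTF-8 string into characters; empty array on invalid input.
    private function split(string $s): array
    {
        $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
        return $chars === false ? array() : $chars;
    }
}

$filter = new SensitiveWordFilter(array('敏感词'));
echo $filter->filter('这是一个敏感词', 10), "\n"; // 这是一个***
echo $filter->filter('敏x感y词', 3), "\n";        // *x*y*
```

With a distance of 1 only contiguous matches count, so `filter('敏x感y词', 1)` would return the input unchanged. A larger distance catches more disguised words at the price of more false positives.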
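I have not benchmarked this in any detail, but it is enough for ordinary use. The work is mostly CPU-bound. As for the word list, the built trie can be JSON-encoded and stored in Redis or Memcached, then fetched and decoded the next time it is needed.
When PHP is used for web requests rather than running as a daemon, the data structure cannot stay resident in memory between requests. In that respect C, C++, Java and similar languages may be a better fit, and if performance requirements are strict, the filter can be written as a service in another language. PHP does have Swoole as an option here, but personally I am not very optimistic about it.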
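The caching idea can be sketched as follows. To keep the example runnable without any extension, a small in-memory class stands in for Redis or Memcached. `ArrayCache` and `loadTrie` are hypothetical names for this illustration only, and a real setup would swap in the get and set calls of whichever client is actually used (PHP 7.1 or later assumed).

```php
<?php
// Hypothetical stand-in for Redis/Memcached: stores strings by key.
class ArrayCache
{
    private $data = array();

    public function get(string $key)
    {
        return $this->data[$key] ?? null; // null means "not cached"
    }

    public function set(string $key, string $value): void
    {
        $this->data[$key] = $value;
    }
}

// Return the cached trie if present; otherwise build it once,
// store it as JSON, and return it.
function loadTrie(ArrayCache $cache, string $key, callable $build): array
{
    $json = $cache->get($key);
    if ($json !== null) {
        $trie = json_decode($json, true); // true: decode objects as arrays
        if (is_array($trie)) {
            return $trie;
        }
    }
    $trie = $build();
    $json = json_encode($trie);
    if ($json !== false) {
        $cache->set($key, $json);
    }
    return $trie;
}

$cache = new ArrayCache();
$build = function () {
    // In practice this would read the dictionary file and build the trie.
    return array('敏' => array('感' => array('词' => array('end' => true))));
};
$trie = loadTrie($cache, 'sensitive:trie', $build); // built and stored
$trie = loadTrie($cache, 'sensitive:trie', $build); // decoded from the cache
```

Decoding still costs CPU time on every request, so this mainly saves re-reading and re-splitting the dictionary file.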
