Implementing a Code Line Counter in Python (Final Part)

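This article takes cplinecounter, the C/Python code line counter built in the earlier articles of this series, optimizes it by rewriting the core algorithm behind a C extension interface, and compares it with line counters commonly found online. In the author's tests, cplinecounter was both more accurate and faster than the other tools. In a benchmark on roughly ten million lines of code, cplinecounter running under CPython and PyPy was 14.5 and 29 times faster, respectively, than the foreign tool cloc 1.64, and 1.8 and 3.6 times faster, respectively, than the Chinese tool sourcecounter 3.4.
Runtime and test environment
The code examples in this article were run and tested on Windows. Platform details: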
>>> import sys, platform
>>> print '%s %s, python %s' %(platform.system(), platform.release(), platform.python_version())
Windows XP, python 2.7.11
>>> sys.version
'2.7.11 (v2.7.11:6d1b6a68f775, Dec  5 2015, 20:32:19) [MSC v.1500 32 bit (Intel)]'
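Note that syntax differs between Python versions, so some of the code examples need minor changes to run on older Python versions.
1. Implementation and optimization
To avoid fragmentation, this section gives the complete implementation. Some variable and function definitions differ slightly from those in the earlier articles of this series, so read them carefully.
1.1 Implementation
First, define two lists that hold the counting results: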
import os, sys
rawcountinfo = [0, 0, 0, 0, 0]
detailcountinfo = []
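rawcountinfo holds the coarse totals: its elements are, in order, the total number of file lines, code lines, comment lines and blank lines, followed by the number of files. detailcountinfo holds the detailed statistics: the line counts and file name of each file, plus the sums over all files.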
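As a small illustration of these two layouts, the sketch below fills both lists by hand. The item shape [filename, [filelines, codelines, commentlines, blanklines, commentpercent]] is an assumption inferred from how reportcounterinfo() indexes each item, and the values are taken from two rows of the sample report shown later in this article.

```python
# Assumed item layout, inferred from the indexing in reportcounterinfo():
# [filename, [filelines, codelines, commentlines, blanklines, commentpercent]]
detailcountinfo = [
    ['hard.c', [6, 3, 4, 0, 0.57]],
    ['line.c', [33, 19, 15, 4, 0.44]],
]

# rawcountinfo layout from the text:
# [filelines, codelines, commentlines, blanklines, number of files]
rawcountinfo = [0, 0, 0, 0, 0]
for name, counts in detailcountinfo:
    for i in range(4):
        rawcountinfo[i] += counts[i]
    rawcountinfo[4] += 1

print(rawcountinfo)  # [39, 22, 19, 4, 2]
```

Code, comment and blank lines may add up to more than the file lines, because a single line can count as both code and comment.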
The implementation follows. To avoid pasting one large block of code, it is presented function by function with brief descriptions.
def calclinesch(line, isblockcomment):
    linetype, linelen = 0, len(line)
    if not linelen:
        return linetype
    line = line + '\n'  # append one character so that ichar+1 cannot run past the end
    ichar, islinecomment = 0, False
    while ichar
[... source text missing from here up to the end of the sort-order setup ...]
        print >>sys.stderr, 'unsupported sort order(%s)!' %sortarg
        return
    isreverse = sortarg[0]=='r'  # False: ascending; True: descending
    sort_order = (keyfunc, isreverse)

def reportcounterinfo(israwreport=True, stream=sys.stdout):
    # comment ratio = comment lines / (comment lines + effective code lines)
    print >>stream, 'filelines  codelines  commentlines  blanklines  commentpercent  %s'\
        %(not israwreport and 'filename' or '')
    if israwreport:
        # six values are passed, so the format needs a sixth field for the file count;
        # the label '(%d files)' is a placeholder, the original wording is not in the source
        print >>stream, '%-11d%-11d%-14d%-12d%-16.2f(%d files)' %(rawcountinfo[0],\
            rawcountinfo[1], rawcountinfo[2], rawcountinfo[3], \
            safediv(rawcountinfo[2], rawcountinfo[2]+rawcountinfo[1]), rawcountinfo[4])
        return
    total = [0, 0, 0, 0]
    # Sort detailcountinfo. By default, sort ascending by the first element
    # (the file name) to make the output easier to read.
    detailcountinfo.sort(key=sort_order[0], reverse=sort_order[1])
    for item in detailcountinfo:
        print >>stream, '%-11d%-11d%-14d%-12d%-16.2f%s' %(item[1][0], item[1][1], item[1][2], \
            item[1][3], item[1][4], item[0])
        total[0] += item[1][0]; total[1] += item[1][1]
        total[2] += item[1][2]; total[3] += item[1][3]
    print >>stream, '-' * 90  # print 90 hyphens
    print >>stream, '%-11d%-11d%-14d%-12d%-16.2f(%d files)' \
        %(total[0], total[1], total[2], total[3], \
        safediv(total[2], total[2]+total[1]), len(detailcountinfo))
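reportcounterinfo() prints the report. Before the detailed report is printed, its rows are sorted by the specified sort order. In addition, the term for blank lines has changed from emptylines to blanklines: the former means a line that contains nothing but the line terminator, the latter a line that contains only whitespace (spaces, tabs, the line terminator, and so on).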
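The comment ratio and the empty/blank distinction can be sketched as follows. safediv() is called by reportcounterinfo(), but its definition is not part of this text, so the version here is an assumption (return 0.0 for a zero divisor); comment_percent(), is_empty() and is_blank() are hypothetical helper names used only for illustration.

```python
def safediv(dividend, divisor):
    """Assumed behavior: float division that yields 0.0 when the divisor is 0."""
    if divisor:
        return float(dividend) / divisor
    return 0.0

def comment_percent(codelines, commentlines):
    # comment ratio = comment lines / (comment lines + code lines)
    return safediv(commentlines, commentlines + codelines)

def is_empty(line):
    # "empty": nothing besides the line terminator
    return line.rstrip('\r\n') == ''

def is_blank(line):
    # "blank": only whitespace (spaces, tabs, line terminator)
    return line.strip() == ''

# 259 code lines and 66 comment lines: the totals row of the sample report
print('%.2f' % comment_percent(259, 66))
# a line of spaces and a tab is blank but not empty
print('%s %s' % (is_empty('  \t\n'), is_blank('  \t\n')))
```

With the totals from the sample report later in this article, the ratio is 66 / (66 + 259), which rounds to the 0.20 shown there.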
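To support counting several directories and/or files at once, parsetargetlist() parses a mixed list of directories and files and sorts its elements into a file list and a directory list: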
def parsetargetlist(targetlist):
    filelist, dirlist = [], []
    if targetlist == []:
        targetlist.append(os.getcwd())
    for item in targetlist:
        if os.path.isfile(item):
            filelist.append(os.path.abspath(item))
        elif os.path.isdir(item):
            dirlist.append(os.path.abspath(item))
        else:
            print >>sys.stderr, "'%s' is neither a file nor a directory!" %item
    return [filelist, dirlist]
linecounter() counts based on the file and directory lists:
def countdir(dirlist, iskeep=False, israwreport=True, isshortname=False):
    for dir in dirlist:
        if iskeep:
            for file in os.listdir(dir):
                countfilelines(os.path.join(dir, file), israwreport, isshortname)
        else:
            for root, dirs, files in os.walk(dir):
                for file in files:
                    countfilelines(os.path.join(root, file), israwreport, isshortname)

def countfile(filelist, israwreport=True, isshortname=False):
    for file in filelist:
        countfilelines(file, israwreport, isshortname)

def linecounter(iskeep=False, israwreport=True, isshortname=False, targetlist=[]):
    filelist, dirlist = parsetargetlist(targetlist)
    if filelist != []:
        countfile(filelist, israwreport, isshortname)
    if dirlist != []:
        countdir(dirlist, iskeep, israwreport, isshortname)
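The two traversal modes in countdir() differ only in whether subdirectories are visited. The sketch below shows the difference on a throwaway directory tree. list_files() is a hypothetical helper, and its os.path.isfile() check is an addition: countdir() hands every os.listdir() entry to countfilelines() as-is.

```python
import os
import shutil
import tempfile

def list_files(topdir, iskeep=False):
    """Collect file paths in the order of visits that countdir() would make."""
    found = []
    if iskeep:
        # top level only (isfile() check added here, not in countdir())
        for name in os.listdir(topdir):
            path = os.path.join(topdir, name)
            if os.path.isfile(path):
                found.append(path)
    else:
        # recurse into every subdirectory
        for root, dirs, files in os.walk(topdir):
            for name in files:
                found.append(os.path.join(root, name))
    return sorted(found)

# throwaway tree: top/a.c and top/subdir/b.c
top = tempfile.mkdtemp()
try:
    os.mkdir(os.path.join(top, 'subdir'))
    for rel in ('a.c', os.path.join('subdir', 'b.c')):
        open(os.path.join(top, rel), 'w').close()
    shallow = [os.path.relpath(p, top) for p in list_files(top, iskeep=True)]
    deep = [os.path.relpath(p, top) for p in list_files(top)]
finally:
    shutil.rmtree(top)

print(shallow)  # only a.c
print(deep)     # a.c and subdir/b.c (path separator depends on the platform)
```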
Next, add command-line parsing:
import argparse

def parsecmdargs(argv=sys.argv):
    parser = argparse.ArgumentParser(usage='%(prog)s [options] target',
        description='count lines in code files.')
    parser.add_argument('target', nargs='*',
        help='space-separated list of directories and/or files')
    parser.add_argument('-k', '--keep', action='store_true',
        help='do not walk down subdirectories')
    parser.add_argument('-d', '--detail', action='store_true',
        help='report counting result in detail')
    parser.add_argument('-b', '--basename', action='store_true',
        help='do not show file\'s full path')
##    sortwords = ['0', '1', '2', '3', '4', '5', 'file', 'code', 'cmmt', 'blan', 'ctpr', 'name']
##    parser.add_argument('-s', '--sort',
##        choices=[x+y for x in ['','r'] for y in sortwords],
##        help='sort order: {0,1,2,3,4,5} or {file,code,cmmt,blan,ctpr,name},' \
##             "prefix 'r' means sorting in reverse order")
    parser.add_argument('-s', '--sort',
        help='sort order: {0,1,2,3,4,5} or {file,code,cmmt,blan,ctpr,name}, ' \
             "prefix 'r' means sorting in reverse order")
    parser.add_argument('-o', '--out',
        help='save counting result in out')
    parser.add_argument('-c', '--cache', action='store_true',
        help='use cache to count faster(unreliable when files are modified)')
    parser.add_argument('-v', '--version', action='version',
        version='%(prog)s 3.0 by xywang')

    args = parser.parse_args()
    return (args.keep, args.detail, args.basename, args.sort, args.out, args.cache, args.target)
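Note the -s option added in parsecmdargs(). It selects the sort order of the output, and an r prefix selects descending instead of ascending order. For example, -s 0 or -s file sorts the output by file lines in ascending order, while -s r0 or -s rfile sorts it by file lines in descending order.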
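The code that turns the -s value into a sort key is only partly present in this text, so the following is a sketch under assumptions rather than the article's own function. parse_sort_arg() and sort_key() are hypothetical names; the mapping of 0 to 5 onto file, code, cmmt, blan, ctpr and name follows the help string, and the default (ascending by file name) follows the comment in reportcounterinfo().

```python
SORT_WORDS = ['file', 'code', 'cmmt', 'blan', 'ctpr', 'name']

def parse_sort_arg(sortarg):
    """Map -s text such as 'code' or 'r1' to (column index, reverse flag).

    Hypothetical helper; returns None for an unsupported value.
    """
    if not sortarg:
        return (5, False)            # default: ascending by file name
    isreverse = sortarg[0] == 'r'    # none of the sort words starts with 'r'
    word = sortarg[1:] if isreverse else sortarg
    if word in SORT_WORDS:
        return (SORT_WORDS.index(word), isreverse)
    if word in ('0', '1', '2', '3', '4', '5'):
        return (int(word), isreverse)
    return None

def sort_key(column):
    # items are assumed to look like [filename, [file, code, cmmt, blank, percent]]
    if column == 5:
        return lambda item: item[0]
    return lambda item: item[1][column]

rows = [['a.c', [6, 3, 4, 0, 0.57]], ['b.c', [44, 34, 3, 7, 0.08]]]
column, isreverse = parse_sort_arg('rcode')
rows.sort(key=sort_key(column), reverse=isreverse)
print([row[0] for row in rows])  # ['b.c', 'a.c']: descending by code lines
```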
The -c cache option is most useful when only the sort order changes between runs. To support it, the report is persisted with the json module:
cache_file = 'counter.dump'
cache_dumper, cache_gen = None, None

from json import dump, JSONDecoder

def counterdump(data):
    global cache_dumper
    if cache_dumper == None:
        cache_dumper = open(cache_file, 'w')
    dump(data, cache_dumper)

def parsejson(jsondata):
    endpos = 0
    while True:
        jsondata = jsondata[endpos:].lstrip()
        try:
            pyobj, endpos = JSONDecoder().raw_decode(jsondata)
            yield pyobj
        except ValueError:
            break

def counterload():
    global cache_gen
    if cache_gen == None:
        cache_gen = parsejson(open(cache_file, 'r').read())
    try:
        return next(cache_gen)
    except StopIteration, e:
        return []

def shouldusecache(keep, detail, basename, cache, target):
    if not cache:  # caching was not requested
        return False
    try:
        (_keep, _detail, _basename, _target) = counterload()
    except (IOError, EOFError, ValueError):  # cache file missing, empty or invalid
        return False
    if keep == _keep and detail == _detail and basename == _basename \
       and sorted(target) == sorted(_target):
        return True
    else:
        return False
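The cache file therefore holds several JSON documents written back to back, which json.load() cannot read in one call. JSONDecoder.raw_decode() returns both the decoded value and the index where it ended, which is what makes the stream readable piece by piece. The sketch below is a variant of parsejson(), not the article's code: iter_json_documents() is a hypothetical name, and it passes a start index to raw_decode() instead of re-slicing the string.

```python
from json import dumps, JSONDecoder

def iter_json_documents(text):
    """Yield each value from a string of back-to-back JSON documents."""
    decoder = JSONDecoder()
    pos = 0
    while True:
        # raw_decode() does not reliably skip leading whitespace, so do it here
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# two dumps written one after the other, the way counterdump() writes them
cache_text = dumps([False, True, False, ['lctest']]) + dumps([[397, 259, 66, 83, 6], []])
docs = list(iter_json_documents(cache_text))
print(len(docs))
```

Unlike parsejson(), this variant lets a ValueError from malformed input propagate to the caller instead of silently ending the iteration.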
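Note that JSON persistence raises character-encoding issues. For example, when a source file name contains GBK-encoded Chinese characters, the name should be converted to unicode with unicode(os.path.basename(filepath), 'gbk') before it is written to detailcountinfo; otherwise dump raises an error. Fortunately, only source files used for testing are likely to have Chinese characters in their names, so encoding can usually be ignored.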
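The encoding problem can be reproduced without any files. In the sketch below the GBK byte string is an illustrative stand-in for a file name as Python 2 would return it on a GBK system. bytes.decode('gbk') is used because it behaves the same on Python 2 and 3 and has the same effect as unicode(name, 'gbk').

```python
import json

# u'\u6d4b\u8bd5.c' is a Chinese file name; its GBK bytes imitate what
# os.path.basename() would return under Python 2 on a GBK system (assumption)
raw_name = u'\u6d4b\u8bd5.c'.encode('gbk')

try:
    json.dumps([raw_name])
    dumped_raw = True
except (UnicodeDecodeError, TypeError):
    # Python 2 tries to decode the bytes as UTF-8 and fails;
    # Python 3 refuses to serialize bytes at all
    dumped_raw = False

text_name = raw_name.decode('gbk')   # same effect as unicode(raw_name, 'gbk')
round_trip = json.loads(json.dumps([text_name]))[0]
print('%s %s' % (dumped_raw, round_trip == text_name))
```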
The functions above can now be called to count the code and print the report:
def main():
    global gisstdout, rawcountinfo, detailcountinfo
    (keep, detail, basename, sort, out, cache, target) = parsecmdargs()
    stream = sys.stdout if not out else open(out, 'w')
    setsortarg(sort); loadcextlib()
    cacheused = shouldusecache(keep, detail, basename, cache, target)
    if cacheused:
        try:
            (rawcountinfo, detailcountinfo) = counterload()
        except (EOFError, ValueError), e:  # unlikely to happen
            print >>sys.stderr, 'unexpected cache corruption(%s), try counting directly.'%e
            linecounter(keep, not detail, basename, target)
    else:
        linecounter(keep, not detail, basename, target)

    reportcounterinfo(not detail, stream)
    counterdump((keep, detail, basename, target))
    counterdump((rawcountinfo, detailcountinfo))
To measure how fast the line counter runs, timing code can be added as well:
if __name__ == '__main__':
    from time import clock
    starttime = clock()
    main()
    endtime = clock()
    print >>sys.stderr, 'time elasped: %.2f sec.' %(endtime-starttime)
time.clock() is used here to measure elapsed time, which avoids the overhead of cProfile.
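time.clock() was removed in Python 3.8, so on newer interpreters the same measurement needs a different timer. A minimal sketch using timeit.default_timer, which exists on both Python 2.7 and 3; timed() is a hypothetical helper, and summing a range stands in for the call to main():

```python
from timeit import default_timer

def timed(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = default_timer()
    result = func(*args, **kwargs)
    return result, default_timer() - start

# sum() stands in for main() here
result, elapsed = timed(sum, range(100000))
print('time elapsed: %.2f sec.' % elapsed)
```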
1.2 Optimization
Apart from len(), calclinesch() and calclinespy() use no Python library functions, so they are easy to rewrite in C. The initial C version looked like this:
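/* the header names are missing in the source; these two cover the calls used below */
#include <stdlib.h>  /* calloc */
#include <string.h>  /* strlen, memmove */
#define true 1
#define false 0

unsigned int calclinesch(char *line, unsigned char isblockcomment[2]) {
    unsigned int linetype = 0;
    unsigned int linelen = strlen(line);
    if(!linelen) return linetype;
    char *expandline = calloc(linelen + 1/*\n*/, 1);
    if(NULL == expandline) return linetype;
    memmove(expandline, line, linelen);
    expandline[linelen] = '\n';  //append one character so that ichar+1 cannot run past the end
    unsigned int ichar = 0;
    unsigned char islinecomment = false;
    while(ichar
[... source text missing from here up to the cffi version check ...]
>>>> import cffi
>>>> cffi.__version__
'1.6.0'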
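The C routine above takes a char * and a two-byte state array. How the article loads its DLL is not part of this text, so the following is only a generic ctypes sketch of declaring a C signature and preparing such arguments. The C runtime's strlen() stands in for the article's function, and StateArray is a hypothetical name.

```python
import ctypes
import ctypes.util
import os

# stand-in library: the C runtime (the article's DLL is not assumed to exist here)
if os.name == 'nt':
    libc = ctypes.cdll.msvcrt
else:
    libc = ctypes.CDLL(ctypes.util.find_library('c'))

# declare the C signature: size_t strlen(const char *)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t
length = libc.strlen(b'int a; /* comment */')

# a two-element array matching `unsigned char isblockcomment[2]`;
# a C callee receiving it could read and modify the two bytes in place
StateArray = ctypes.c_ubyte * 2
isblockcomment = StateArray(0, 0)

print('%d %s' % (length, list(isblockcomment)))
```

Declaring argtypes and restype up front lets ctypes check and convert arguments instead of guessing, which matters once a function returns anything other than a C int.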
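To run cplinecounter on a host without Python installed, first convert the CPython version of the code to an exe and compress it, then distribute it together with the compressed DLL. Users put both files in one directory and add that directory to the PATH environment variable, after which cplinecounter can be run from the Windows command prompt. For example: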
d:\pytest>cplinecounter -d lctest -s code
filelines  codelines  commentlines  blanklines  commentpercent  filename
6          3          4             0           0.57            d:\pytest\lctest\hard.c
27         7          15            5           0.68            d:\pytest\lctest\file27_code7_cmmt15_blank5.py
33         19         15            4           0.44            d:\pytest\lctest\line.c
44         34         3             7           0.08            d:\pytest\lctest\test.c
44         34         3             7           0.08            d:\pytest\lctest\subdir\test.c
243        162        26            60          0.14            d:\pytest\lctest\subdir\clinecounter.py
------------------------------------------------------------------------------------------
397        259        66            83          0.20
time elasped: 0.04 sec.
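2. Accuracy and performance evaluation
To check the accuracy and performance of cplinecounter, the author downloaded several common line counters: cloc 1.64 (10.9 MB), linecount 3.7 (451 KB), sourcecounter 3.4 (8.34 MB) and sourcecount_1.0 (644 KB).
Accuracy is tested first. With line.c as the target, the output of these tools is shown in the table below (- means the tool does not report that item directly):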
Manual inspection confirms that the results from cplinecounter are correct. The counts from linecount and sourcecounter are also fairly reliable.
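Next, 82 source files are counted. The output of the tools is shown in the table below:
Total file lines and blank lines follow simple rules and are hard to get wrong. The two tools that agree most closely on these items, cplinecounter and linecount, are therefore taken as the baseline. For code lines and comment lines, cplinecounter and sourcecounter produce the same results. Given this agreement, there is good reason to consider cplinecounter the most accurate.
Finally, performance is tested. On the author's Windows XP host (Pentium G630, 2.7 GHz, 2 GB RAM), 5857 C source files totalling nearly ten million lines were counted. The performance of the tools is shown in the table below. The table shows only the totals, but per-file line counts were still computed. Note that for this test linecount needs its option for including same-named files in directory counts checked, and cloc needs the --skip-uniqueness and --by-file options.
The performance of cplinecounter depends on how it is run: counting took as little as 29 seconds and as much as 281 seconds. Note that cloc counted only 5733 files.
The bar chart below shows the performance of the tools:
In the chart, opt-c means cplinecounter was run with the -c option; cpython2.7+ctypes(o) means cplinecounter with the old DLL run under CPython 2.7; pypy5.1+cffi1.6(n) means cplinecounter with the new DLL run under PyPy 5.1; and so on.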
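Because cplinecounter is not purely CPU-bound, optimizing the algorithm inside the DLL did not improve performance noticeably (compare the old and new DLLs). By contrast, PyPy has a built-in JIT (just-in-time) compiler that can greatly speed up a Python script as a whole, with results that can rival C. Performance figures are also affected by the target code, CPU architecture, warm-up, caching, background programs and other factors, so results for other tools or combinations may differ somewhat from the author's data.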
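Overall, cplinecounter is the fastest and its results are reliable, and it is small as well (exe 1.3 MB, DLL 11 KB). sourcecounter is fairly reliable and fairly fast, and has built-in project management information. cloc has a large error in the number of files counted and linecount has a large error in code lines, and both are slow. However, cloc is highly configurable and can be compiled from source to reduce its size. sourcecount is the slowest, and its results are not very reliable.
Readers familiar with parallel computing in Python can also modify the cplinecounter source to add multiprocessing and keep all cores of a multi-core processor busy, or try multithreading to improve I/O performance. Below is part of the line_profiler output for countfilelines():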
e:\pytest>kernprof -l -v cplinecounter.py source -d > out.txt
140872     93736      32106         16938       0.26
Wrote profile results to cplinecounter.py.lprof
Timer unit: 2.79365e-07 s

Total time: 5.81981 s
File: cplinecounter.py
Function: countfilelines at line 143

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   143                                           @profile
   144                                           def countfilelines(filepath, israwreport=True, isshortname=False):
   ...       ...          ...      ...      ...  ...
   162        82      7083200  86380.5     34.0      with open(filepath, 'r') as file:
   163    140954      1851877     13.1      8.9          for line in file:
   164    140872      6437774     45.7     30.9              linetype = calclines(filetype, line.strip(), isblockcomment)
   165    140872      1761864     12.5      8.5              linecountinfo[0] += 1
   166    140872      1662583     11.8      8.0              if linetype == 0: linecountinfo[3] += 1
   167    123934      1499176     12.1      7.2              elif linetype == 1: linecountinfo[1] += 1
   168     32106       406931     12.7      2.0              elif linetype == 2: linecountinfo[2] += 1
   169      1908        27634     14.5      0.1              elif linetype == 3: linecountinfo[1] += 1; linecountinfo[2] += 1
   ...       ...          ...      ...      ...  ...
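line_profiler can be installed with pip install line_profiler. After adding the @profile decorator to the function under study, run the kernprof command to get the time spent on each line of the decorated function. The -l option requests line-by-line profiling, and -v prints the timing information to the screen after the run. Lines with large hits (number of executions) or time (execution time) values have the most room for optimization.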
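One practical detail: the name profile only exists when the script is started through kernprof -l, so a decorated script fails with NameError when run directly. A common workaround, sketched here with a hypothetical count_nonblank() function, is to define a no-op fallback:

```python
try:
    profile            # provided by `kernprof -l` at run time
except NameError:
    def profile(func):
        """No-op stand-in so the script also runs without kernprof."""
        return func

@profile
def count_nonblank(lines):
    # count the lines that contain something other than whitespace
    total = 0
    for line in lines:
        if line.strip():
            total += 1
    return total

print(count_nonblank(['int a;\n', '\n', '  \n', '/* c */\n']))  # 2
```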
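The line_profiler output shows that this function leans toward being CPU-bound: lines 164 to 169 account for 56.7% of its time. However, once directory traversal and similar operations are taken into account, the program as a whole is quite possibly I/O-bound. Whether multiprocessing or multithreading gives the better speedup therefore still has to be verified by testing. The simplest option is to move lines 162 to 169 (reading the file and counting lines) into C. The remaining parts are either I/O-bound or rely on Python libraries, so rewriting them in C would take twice the effort for half the gain.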
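As a starting point for such an experiment, the sketch below counts lines per file with a worker pool. It uses multiprocessing.dummy.Pool, the thread-based pool that shares the Pool API. count_lines() and count_all() are hypothetical names, and the plain line count is a stand-in for the per-file work of countfilelines(). The workers return their results instead of updating global lists, because globals would not be shared between processes.

```python
import os
import tempfile
from multiprocessing.dummy import Pool   # thread-based pool with the Pool API

def count_lines(filepath):
    """Return (path, number of lines); a stand-in for the per-file counting."""
    with open(filepath, 'rb') as f:
        return filepath, sum(1 for _ in f)

def count_all(paths, workers=4):
    pool = Pool(workers)
    try:
        # map() distributes the files over the workers and keeps the input order
        return dict(pool.map(count_lines, paths))
    finally:
        pool.close()
        pool.join()

# demo on two temporary files with 3 lines and 1 line
paths = []
for text in (b'a\nb\nc\n', b'x\n'):
    fd, path = tempfile.mkstemp(suffix='.c')
    os.write(fd, text)
    os.close(fd)
    paths.append(path)
try:
    counts = count_all(paths)
finally:
    for path in paths:
        os.remove(path)

print(sorted(counts.values()))  # [1, 3]
```

Changing the import to from multiprocessing import Pool switches to worker processes. In that case the calling code should sit under if __name__ == '__main__': so that it also works on Windows. Under CPython, threads mainly help while waiting for I/O, whereas processes can use several cores for the counting itself.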
Finally, if all that is needed is a line count, the following shell commands can be used on Linux or Mac:
find ./codedir -name '*.c' -or -name '*.h' | xargs cat | grep -v '^$' | wc -l  # total number of lines, excluding empty lines
find ./codedir -name '*.c' -or -name '*.h' | xargs wc -l  # line count of each file, plus the total
