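My office had collected a large number of questionnaires in Word format, and my manager needed the information from those forms. I put all the questionnaires into one folder and wrote a small Python program that prints out the required fields.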
    # coding: utf-8
    import os
    import win32com
    from win32com.client import Dispatch, constants
    from docx import Document


    def parse_doc(f):
        """Read a .doc file and print the name, industry, and other fields."""
        doc = w.Documents.Open(FileName=f)
        t = doc.Tables[0]  # pick the cells according to the table layout in the file
        name = t.Rows[0].Cells[1].Range.Text
        situation = t.Rows[0].Cells[5].Range.Text
        people = t.Rows[1].Cells[1].Range.Text
        title = t.Rows[1].Cells[3].Range.Text
        print('%s %s %s %s' % (name, situation, people, title))
        doc.Close()


    def parse_docx(f):
        """Read a .docx file and print the name, industry, and other fields."""
        d = Document(f)
        t = d.tables[0]
        name = t.cell(0, 1).text
        situation = t.cell(0, 8).text
        people = t.cell(1, 2).text
        title = t.cell(1, 8).text
        print('%s %s %s %s' % (name, situation, people, title))


    if __name__ == '__main__':
        w = win32com.client.Dispatch('Word.Application')
        # Walk through the files
        path = r'h:\work\aaa'  # Windows folder path
        doc_files = os.listdir(path)
        for doc in doc_files:
            ext = os.path.splitext(doc)[1]
            if ext == '.docx':
                try:
                    parse_docx(path + '\\' + doc)
                except Exception as e:
                    print(e)
            elif ext == '.doc':
                try:
                    parse_doc(path + '\\' + doc)
                except Exception as e:
                    print(e)
        w.Quit()  # close the Word instance started above
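One detail worth knowing when reading table cells through COM: Word ends the `Range.Text` of a table cell with an end-of-cell marker (a carriage return followed by `\x07`), so the printed values may carry trailing control characters. A minimal sketch of a cleanup step follows; `clean_cell_text` is a hypothetical helper name, not part of the script above.

```python
def clean_cell_text(text):
    """Strip Word's end-of-cell marker and surrounding whitespace from cell text."""
    # Word terminates each table cell's Range.Text with '\r\x07'.
    return text.rstrip('\r\x07').strip()


if __name__ == '__main__':
    # Simulated value, as it might come back from t.Rows[0].Cells[1].Range.Text
    print(repr(clean_cell_text('Zhang San\r\x07')))
```

In `parse_doc`, each field could be passed through this helper before printing.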
Download and install win32com (it ships with the pywin32 package).
    from win32com import client as wc

    word = wc.Dispatch('Word.Application')
    doc = word.Documents.Open('c:/test')
    doc.SaveAs('c:/test.text', 2)  # 2 = wdFormatText
    doc.Close()
    word.Quit()
A text file produced this way cannot be read from Python with an ordinary 'r' mode open. To make it readable in 'r' mode, save it like this instead:
    doc.SaveAs('c:/test', 4)  # 4 = wdFormatDOSText
Note: once the call completes, Word appends the .txt suffix automatically, even though no suffix was given.
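If the exported file is read on Python 3, or on a machine whose default encoding differs from the one Word used, passing the encoding explicitly avoids decode errors. This is a minimal sketch, assuming the export was written in the GBK code page (typical on Chinese-language Windows, but an assumption here); `read_exported_text` is a hypothetical helper.

```python
import io


def read_exported_text(path, encoding='gbk'):
    """Read a text file exported by Word, decoding it with an explicit encoding."""
    # errors='replace' keeps going if a few bytes do not fit the assumed encoding.
    with io.open(path, 'r', encoding=encoding, errors='replace') as fh:
        return fh.read()
```

If the text comes back garbled, the `encoding` argument is the first thing to adjust.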
On Windows XP, the file should then be opened with:

    open(r'c:\text', 'r')

The format codes that SaveAs accepts are:

    wdFormatDocument = 0
    wdFormatDocument97 = 0
    wdFormatDocumentDefault = 16
    wdFormatDOSText = 4
    wdFormatDOSTextLineBreaks = 5
    wdFormatEncodedText = 7
    wdFormatFilteredHTML = 10
    wdFormatFlatXML = 19
    wdFormatFlatXMLMacroEnabled = 20
    wdFormatFlatXMLTemplate = 21
    wdFormatFlatXMLTemplateMacroEnabled = 22
    wdFormatHTML = 8
    wdFormatPDF = 17
    wdFormatRTF = 6
    wdFormatTemplate = 1
    wdFormatTemplate97 = 1
    wdFormatText = 2
    wdFormatTextLineBreaks = 3
    wdFormatUnicodeText = 7
    wdFormatWebArchive = 9
    wdFormatXML = 11
    wdFormatXMLDocument = 12
    wdFormatXMLDocumentMacroEnabled = 13
    wdFormatXMLTemplate = 14
    wdFormatXMLTemplateMacroEnabled = 15
    wdFormatXPS = 18
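The names map fairly literally onto the corresponding file formats, although Office 2003 may not support all of them. For converting Word files to HTML there are two choices, wdFormatHTML and wdFormatFilteredHTML (codes 8 and 10). The difference is that with wdFormatHTML, OLE objects such as equations in the Word file are stored as WMF, whereas with wdFormatFilteredHTML the equation images are stored as GIF. On inspection, the HTML produced by wdFormatFilteredHTML is also noticeably cleaner than that produced by wdFormatHTML.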
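To avoid scattering bare numbers such as 2, 4, or 10 through a script, the codes can be given names. The sketch below uses a few values from the list above; the extension mapping and the `save_format_for` helper are illustrative assumptions, not part of the Word API.

```python
import os

# A few WdSaveFormat values copied from the list above.
WD_FORMAT_TEXT = 2
WD_FORMAT_DOS_TEXT = 4
WD_FORMAT_RTF = 6
WD_FORMAT_HTML = 8
WD_FORMAT_FILTERED_HTML = 10
WD_FORMAT_PDF = 17

# Hypothetical mapping from a target file extension to a format code.
_FORMAT_BY_EXT = {
    '.txt': WD_FORMAT_DOS_TEXT,
    '.rtf': WD_FORMAT_RTF,
    '.html': WD_FORMAT_FILTERED_HTML,  # cleaner output than WD_FORMAT_HTML
    '.pdf': WD_FORMAT_PDF,
}


def save_format_for(target_path):
    """Return the SaveAs format code for target_path, based on its extension."""
    ext = os.path.splitext(target_path)[1].lower()
    if ext not in _FORMAT_BY_EXT:
        raise ValueError('no format code known for %r' % ext)
    return _FORMAT_BY_EXT[ext]
```

With a COM document object in hand, this could be used as `doc.SaveAs(target, save_format_for(target))`.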
Of course, you can call the Office API through COM from any language, PHP for example.
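    import re
    from win32com import client as wc

    word = wc.Dispatch('Word.Application')
    doc = word.Documents.Open(r'c:/test1.doc')
    doc.SaveAs('c:/test1.text', 4)  # 4 = wdFormatDOSText
    doc.Close()

    strings = open(r'c:\test1.text', 'r').read()

    # An answer letter (A-D, either case) inside parentheses, padded with
    # ordinary whitespace or with \xa1 bytes (\xa1\xa1 is the full-width
    # space in GBK; this assumes read() returns a byte string, as in Python 2).
    pattern = r'\(\s*[A-Da-d]\s*\)|\(\xa1*[A-Da-d]\xa1*\)'
    result = re.findall(pattern, strings)
    chan = re.sub(pattern, '()', strings)

    # Questions with the answers blanked out
    question = open(r'c:\question', 'a+')
    question.write(chan)
    question.close()

    # Numbered list of the extracted answers
    answer = open(r'c:\answeronly', 'a+')
    for i, a in enumerate(result):
        m = re.search('[A-Da-d]', a)
        answer.write(str(i + 1) + ' ' + m.group() + '\n')
    answer.close()

For answers wrapped in full-width parentheses:

    # Use the byte escapes (\xa3\xa8 and \xa3\xa9 are the GBK full-width
    # parentheses) rather than literal parentheses, which easily cause ambiguity.
    chan = re.sub(r'\xa3\xa8\s*[A-Da-d]\s*\xa3\xa9', '()', strings)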
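The same idea can be tried on an in-memory string, without Word or any files. This is a minimal sketch, assuming each answer is a single letter A to D inside ASCII or full-width parentheses; `split_answers` is a hypothetical helper, and it works on decoded text (Python 3 `str`) rather than on the raw GBK bytes used above.

```python
import re

# One answer letter inside ASCII "()" or full-width parentheses (U+FF08, U+FF09),
# with optional whitespace. On Python 3 str patterns, \s also matches the
# full-width space U+3000.
ANSWER_RE = re.compile(r'[(\uff08]\s*([A-Da-d])\s*[)\uff09]')


def split_answers(text):
    """Return (text with the answers blanked out, list of answer letters in order)."""
    answers = ANSWER_RE.findall(text)
    blanked = ANSWER_RE.sub('()', text)
    return blanked, answers


if __name__ == '__main__':
    blanked, answers = split_answers('1. 2+2=? ( B )\n2. Sky color? (c)')
    print(blanked)
    for i, letter in enumerate(answers):
        print(str(i + 1) + ' ' + letter)
```

Working on decoded text keeps the pattern readable, at the cost of having to know the file's encoding when reading it.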
