How to extract information from Word files with Python: text analysis and extraction in 3 minutes

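My office had collected a large number of survey forms in Word format, and my manager needed the information they contained. I put all the forms into one folder and wrote a small Python program that prints the required fields. The program reads the table in each Word document and extracts the information from it: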
    # coding: utf-8
    import os

    from win32com.client import Dispatch
    from docx import Document


    def parse_doc(f):
        """Read a .doc file through Word and print name, situation, people and title."""
        doc = w.Documents.Open(FileName=f)
        try:
            t = doc.Tables[0]  # pick the cells according to the table layout in the file
            name = t.Rows[0].Cells[1].Range.Text
            situation = t.Rows[0].Cells[5].Range.Text
            people = t.Rows[1].Cells[1].Range.Text
            title = t.Rows[1].Cells[3].Range.Text
            print(name, situation, people, title)
        finally:
            doc.Close()  # close the document even if a cell is missing


    def parse_docx(f):
        """Read a .docx file with python-docx and print name, situation, people and title."""
        d = Document(f)
        t = d.tables[0]
        name = t.cell(0, 1).text
        situation = t.cell(0, 8).text
        people = t.cell(1, 2).text
        title = t.cell(1, 8).text
        print(name, situation, people, title)


    if __name__ == "__main__":
        w = Dispatch('Word.Application')
        # Walk through the files in the folder
        path = r"h:\work\aaa"  # Windows file path
        for doc in os.listdir(path):
            full_path = os.path.join(path, doc)
            ext = os.path.splitext(doc)[1]
            try:
                if ext == '.docx':
                    parse_docx(full_path)
                elif ext == '.doc':
                    parse_doc(full_path)
            except Exception as e:
                print(e)
        w.Quit()  # shut down the Word instance started above
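One detail worth handling when cells are read through the Word COM interface: the `Range.Text` of a table cell ends with an end-of-cell marker (a carriage return followed by the BEL character, `\r\x07`), so the printed values carry trailing control characters. Below is a minimal sketch of a cleanup helper. The name `clean_cell_text` is hypothetical and is not part of the original script.

```python
def clean_cell_text(raw):
    """Strip Word's end-of-cell marker and blank lines from cell text."""
    # Word terminates the text of a table cell with '\r\x07';
    # paragraphs inside the same cell are separated by '\r'.
    text = raw.replace("\x07", "")
    lines = [line.strip() for line in text.split("\r")]
    return "\n".join(line for line in lines if line)


print(clean_cell_text("Alice\r\x07"))  # Alice
```

In `parse_doc` this could be applied to each value, for example `name = clean_cell_text(t.Rows[0].Cells[1].Range.Text)`. Text returned by python-docx in `parse_docx` does not carry this marker.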
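Download and install win32com (it ships as part of the pywin32 package). A Word document can then be saved as a text file:

    from win32com import client as wc

    word = wc.Dispatch('Word.Application')
    doc = word.Documents.Open('c:/test')
    doc.SaveAs('c:/test.text', 2)  # 2 = wdFormatText
    doc.Close()
    word.Quit()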
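A text file produced this way cannot be read by Python with an ordinary open(..., 'r'). To get a file that can be read in 'r' mode, write the call like this instead:

    doc.SaveAs('c:/test', 4)  # 4 = wdFormatDOSText

Note: once the call has run, the .txt suffix is added to the file name automatically, even though no suffix was given.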
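The original does not say why the plain `'r'` read fails; the most likely cause is the encoding of the exported file. As an alternative to changing the save format, the file can be read as bytes and decoded explicitly. The sketch below is written under that assumption. The helper name `read_text` and the fallback list (`utf-8`, `gbk`) are illustrative choices; which encoding Word actually writes depends on the save format and the system locale.

```python
import codecs


def read_text(path, fallbacks=("utf-8", "gbk")):
    """Read a text file exported from Word, guessing its encoding."""
    with open(path, "rb") as fh:
        data = fh.read()
    # A byte-order mark identifies Unicode text unambiguously.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")
    if data.startswith(codecs.BOM_UTF8):
        return data.decode("utf-8-sig")
    # No BOM: try the candidate encodings in order (assumed list).
    for enc in fallbacks:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode %r with %r" % (path, fallbacks))
```

The order of the fallbacks matters: a strict codec such as `utf-8` should come before a permissive one, since the first encoding that decodes without error wins.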
On Windows XP, the file should then be opened like this:

    open(r'c:\text','r')

The second argument of SaveAs is a WdSaveFormat value. The available constants are:

    wdFormatDocument = 0
    wdFormatDocument97 = 0
    wdFormatDocumentDefault = 16
    wdFormatDOSText = 4
    wdFormatDOSTextLineBreaks = 5
    wdFormatEncodedText = 7
    wdFormatFilteredHTML = 10
    wdFormatFlatXML = 19
    wdFormatFlatXMLMacroEnabled = 20
    wdFormatFlatXMLTemplate = 21
    wdFormatFlatXMLTemplateMacroEnabled = 22
    wdFormatHTML = 8
    wdFormatPDF = 17
    wdFormatRTF = 6
    wdFormatTemplate = 1
    wdFormatTemplate97 = 1
    wdFormatText = 2
    wdFormatTextLineBreaks = 3
    wdFormatUnicodeText = 7
    wdFormatWebArchive = 9
    wdFormatXML = 11
    wdFormatXMLDocument = 12
    wdFormatXMLDocumentMacroEnabled = 13
    wdFormatXMLTemplate = 14
    wdFormatXMLTemplateMacroEnabled = 15
    wdFormatXPS = 18
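Each constant name indicates the file format it stands for. Office 2003 may not support all of these formats. For converting a Word file to HTML there are two choices, wdFormatHTML and wdFormatFilteredHTML (values 8 and 10). The difference is that with wdFormatHTML, formulas and other OLE objects in the document are stored as WMF images, while with wdFormatFilteredHTML the formula images are stored as GIF. Judging by eye, the HTML generated with wdFormatFilteredHTML is also noticeably cleaner than the output of wdFormatHTML.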
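To avoid bare numbers such as 2 or 4 in `SaveAs` calls, the constants can be kept in a small lookup table. The sketch below uses a subset of the values listed above. The extension-to-format mapping and the function name `save_format_for` are assumptions made for illustration, not something Word or win32com defines.

```python
import os
from enum import IntEnum


class WdSaveFormat(IntEnum):
    """Subset of Word's WdSaveFormat values used in this article."""
    wdFormatText = 2
    wdFormatDOSText = 4
    wdFormatRTF = 6
    wdFormatHTML = 8
    wdFormatFilteredHTML = 10
    wdFormatPDF = 17


# Hypothetical mapping from target extension to save format.
_BY_EXTENSION = {
    ".txt": WdSaveFormat.wdFormatDOSText,
    ".rtf": WdSaveFormat.wdFormatRTF,
    ".html": WdSaveFormat.wdFormatFilteredHTML,  # the cleaner HTML variant
    ".pdf": WdSaveFormat.wdFormatPDF,
}


def save_format_for(target_path):
    """Return the numeric save format for the extension of target_path."""
    ext = os.path.splitext(target_path)[1].lower()
    if ext not in _BY_EXTENSION:
        raise ValueError("no save format configured for %r" % ext)
    return int(_BY_EXTENSION[ext])


print(save_format_for("c:/test.html"))  # 10
```

With a document opened through win32com, the call would then read `doc.SaveAs(target, save_format_for(target))`.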