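Most apps return their data as JSON, or as a blob of encrypted data. This post uses the 超级课程表 (Super Timetable) app as an example and scrapes the topics its users post.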
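1. Capturing the app's packets
For the detailed method, see this post: "fiddler如何抓取手机app数据包" (how to capture mobile app packets with Fiddler).
The capture gives the login URL for 超级课程表: http://120.55.151.61/v2/studentskip/logincheckv4.action
Form:
The form contains the username and password, both already encrypted, plus some device information. Posting it unchanged is enough.
The headers are also required. At first I left them out and got a login error, so the request has to carry the header information.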
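Since the login form is an ordinary URL-encoded body, it can also be assembled from a dict, with Content-Length derived from the encoded bytes rather than copied from the capture. The following is only a minimal sketch in Python 3 (the article's own code targets Python 2). `build_login_request` is a hypothetical helper name, and the field values are placeholders, not the captured ones.

```python
from urllib.parse import urlencode
from urllib.request import Request

LOGIN_URL = 'http://120.55.151.61/v2/studentskip/logincheckv4.action'


def build_login_request(fields, url=LOGIN_URL):
    """Return a POST Request with a form body and matching headers."""
    body = urlencode(fields).encode('utf-8')
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8',
        # User-Agent string reproduced as it appears in the article.
        'User-Agent': 'dalvik/1.6.0 (linux; u; android 4.1.1; m040 build/jro03h)',
        # Computed from the body, so it stays correct when a field changes.
        'Content-Length': str(len(body)),
    }
    return Request(url, data=body, headers=headers)


# Placeholder values (assumptions); substitute the ones from your own capture.
request = build_login_request({
    'platform': '1',
    'account': '<encrypted account>',
    'password': '<encrypted password>',
})
print(request.get_method(), len(request.data))
```

Nothing is sent here; the request object would still have to be opened through a cookie-aware opener, as in the login code.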
2. Logging in
Login code:
import urllib2
from cookielib import CookieJar

loginurl = 'http://120.55.151.61/v2/studentskip/logincheckv4.action'
headers = {
    'content-type': 'application/x-www-form-urlencoded; charset=utf-8',
    'user-agent': 'dalvik/1.6.0 (linux; u; android 4.1.1; m040 build/jro03h)',
    'host': '120.55.151.61',
    'connection': 'keep-alive',
    'accept-encoding': 'gzip',
    'content-length': '207',
}
# Form body copied from the captured login request
logindata = 'phonebrand=meizu&platform=1&devicecode=868033014919494&account=fcf030e1f2f6341c1c93be5bbc422a3d&phoneversion=16&password=a55b48bb75c79200379d82a18c5f47d6&channel=mxmarket&phonemodel=m040&versionnumber=7.2.1&'
# The cookie jar keeps the session cookies for later requests
cookiejar = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))
req = urllib2.Request(loginurl, logindata, headers)
loginresult = opener.open(req).read()
print loginresult
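On success, the server returns JSON data containing the account information.
It matches the data returned during the packet capture, which confirms that the login succeeded.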
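Because the request advertises `accept-encoding: gzip`, the server is allowed to compress its reply, in which case `read()` would return gzip bytes rather than JSON text. Below is a hedged Python 3 sketch that handles both cases and then applies the same success test the full script uses later (a non-empty `data` field). `parse_response` and `login_ok` are hypothetical helper names, and the sample payloads are made up.

```python
import gzip
import json


def parse_response(raw):
    """Decode a response body to a dict, gunzipping it first if needed."""
    if raw[:2] == b'\x1f\x8b':  # gzip magic number
        raw = gzip.decompress(raw)
    return json.loads(raw.decode('utf-8'))


def login_ok(raw):
    """Treat the login as successful when the JSON carries a 'data' field."""
    return bool(parse_response(raw).get('data'))


# Made-up payloads for illustration only; real replies hold account details.
sample = json.dumps({'data': {'nickname': 'example'}, 'status': 1}).encode('utf-8')
print(login_ok(sample))                 # plain body
print(login_ok(gzip.compress(sample)))  # gzip-compressed body
print(login_ok(b'{"status": 0}'))       # reply without a 'data' field
```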
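3. Fetching the data
Use the same method to get the topic URL and its POST parameters.
The approach is the same as simulating a login to a website. See: "python爬虫模拟登录带验证码网站" (simulating login to a CAPTCHA-protected site with a Python crawler).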
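The topic request is paged with a timestamp field: the captured form sends `timestamp=0` for the first page, and each reply carries a `timestamplong` value that is posted back to get the next page. A minimal Python 3 sketch of that cursor loop follows, with the network call replaced by a hypothetical `fetch_page` callable. The key names are taken from the article's code, and the canned pages are invented for the demonstration.

```python
def iter_topics(fetch_page, max_pages=5):
    """Yield topics page by page, following the timestamp cursor.

    fetch_page(timestamp) must return the decoded JSON dict for one page.
    """
    timestamp = 0  # 0 requests the first page, as in the captured form
    for _ in range(max_pages):
        page = fetch_page(timestamp)
        if page.get('status') != 1:
            break  # the server reported a failure, so stop paging
        data = page['data']
        for message in data['messagebos']:
            if message.get('content'):  # skip entries without text
                yield message
        timestamp = data['timestamplong']  # cursor for the next page


# Stand-in for the network call (assumption): two canned pages, then a failure.
def fake_fetch(timestamp):
    pages = {
        0: {'status': 1, 'data': {'timestamplong': 100, 'messagebos': [
            {'content': 'first', 'schoolname': 'A'}, {'schoolname': 'B'}]}},
        100: {'status': 1, 'data': {'timestamplong': 200, 'messagebos': [
            {'content': 'second', 'schoolname': 'C'}]}},
    }
    return pages.get(timestamp, {'status': 0})


topics = [m['content'] for m in iter_topics(fake_fetch)]
print(topics)
```

The `max_pages` cap is an addition for safety in this sketch; a loop also avoids the depth limit that a self-calling loader would eventually hit.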
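The final code is below. It fetches the first page and then keeps loading more, the way pull-to-load-more does in the app, so it can go on loading topics indefinitely.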
#!/usr/local/bin/python2.7
# -*- coding: utf8 -*-
# 超级课程表 topic scraper
import urllib2
from cookielib import CookieJar
import json


def fetch_data(json_data):
    ''' Parse the JSON data '''
    data = json_data['data']
    timestamplong = data['timestamplong']
    messagebo = data['messagebos']
    topiclist = []
    for each in messagebo:
        topicdict = {}
        # Only keep messages that actually have text content
        if each.get('content', False):
            topicdict['content'] = each['content']
            topicdict['schoolname'] = each['schoolname']
            topicdict['messageid'] = each['messageid']
            topicdict['gender'] = each['studentbo']['gender']
            topicdict['time'] = each['issuetime']
            print each['schoolname'], each['content']
            topiclist.append(topicdict)
    return timestamplong, topiclist


def load(timestamp, headers, url):
    ''' Load more '''
    # Loop rather than recurse, so paging is not cut off by Python's recursion limit
    while True:
        loaddata = 'timestamp=%s&phonebrand=meizu&platform=1&gendertype=-1&topicid=19&phoneversion=16&selecttype=3&channel=mxmarket&phonemodel=m040&versionnumber=7.2.1&' % timestamp
        # The body length changes with the timestamp, so compute it
        headers['content-length'] = str(len(loaddata))
        req = urllib2.Request(url, loaddata, headers)
        loadresult = opener.open(req).read()
        loadjson = json.loads(loadresult)
        if loadjson.get('status', False) != 1:
            print 'load fail'
            print loadresult
            return False
        print 'load successful!'
        # The returned timestamp is the cursor for the next page
        timestamp, topiclist = fetch_data(loadjson)


loginurl = 'http://120.55.151.61/v2/studentskip/logincheckv4.action'
topicurl = 'http://120.55.151.61/v2/treehole/message/getmessagebytopicidv3.action'
headers = {
    'content-type': 'application/x-www-form-urlencoded; charset=utf-8',
    'user-agent': 'dalvik/1.6.0 (linux; u; android 4.1.1; m040 build/jro03h)',
    'host': '120.55.151.61',
    'connection': 'keep-alive',
    'accept-encoding': 'gzip',
    'content-length': '207',
}

# --- Login ---
logindata = 'phonebrand=meizu&platform=1&devicecode=868033014919494&account=fcf030e1f2f6341c1c93be5bbc422a3d&phoneversion=16&password=a55b48bb75c79200379d82a18c5f47d6&channel=mxmarket&phonemodel=m040&versionnumber=7.2.1&'
cookiejar = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))
req = urllib2.Request(loginurl, logindata, headers)
loginresult = opener.open(req).read()
# A successful login returns the account information under 'data'
loginstatus = json.loads(loginresult).get('data', False)
if loginstatus:
    print 'login successful!'
else:
    print 'login fail'
    print loginresult

# --- Fetch topics ---
topicdata = 'timestamp=0&phonebrand=meizu&platform=1&gendertype=-1&topicid=19&phoneversion=16&selecttype=3&channel=mxmarket&phonemodel=m040&versionnumber=7.2.1&'
headers['content-length'] = '147'
topicrequest = urllib2.Request(topicurl, topicdata, headers)
topichtml = opener.open(topicrequest).read()
topicjson = json.loads(topichtml)
topicstatus = topicjson.get('status', False)
print topicjson
if topicstatus == 1:
    print 'fetch topic success!'
    timestamp, topiclist = fetch_data(topicjson)
    load(timestamp, headers, topicurl)
Result: