Using the Python Crawler Framework Scrapy to Scrape the Douban Movie Top 250

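Installing Scrapy
Before installing Scrapy, make sure Python is already installed (at the time of writing, Scrapy supports Python 2.5, 2.6 and 2.7). The official documentation describes three installation methods; I used easy_install. First download the Windows version of setuptools (download: http://pypi.python.org/pypi/setuptools) and click Next all the way through the installer.
Once setuptools is installed, open cmd and run the following command: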
easy_install -U Scrapy
You can also install with pip instead (pip: http://pypi.python.org/pypi/pip).
The pip command for installing Scrapy is:
pip install scrapy
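If Visual Studio 2008 or Visual Studio 2010 was already installed on your machine, everything should go smoothly and Scrapy is now installed. If you see the error "unable to find vcvarsall.bat", some extra work is needed. You can either install Visual Studio and retry, or work around it as follows:
First install MinGW (download: http://sourceforge.net/projects/mingw/files/). In the bin folder of the MinGW installation directory, find mingw32-make.exe, make a copy of it and rename the copy to make.exe.
Add the MinGW path to the PATH environment variable. For example, I installed MinGW to d:\mingw\, so I added d:\mingw\bin to PATH.
Open a command prompt and change to the directory of the source code you want to install.
Run the command setup.py install build --compiler=mingw32 to install it.
If you see the error "'xslt-config' is not recognized as an internal or external command, operable program or batch file.", the usual cause is that lxml failed to install. Download an exe installer from http://pypi.python.org/simple/lxml/ and run it.
Now we can get to the main topic.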
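Creating a project
Let's use a crawler to collect the movie information from the Douban Movie Top 250. Before starting, we create a new Scrapy project. I am on Windows 7, so in cmd I change to the directory where I want to keep the code and run: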
d:\web\python>scrapy startproject doubanmoive
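This command creates a new directory named doubanmoive under the current directory, with the following structure:
d:\web\python\doubanmoive>tree /f
Folder PATH listing for volume Data
Volume serial number is 00000200 34EC:9CB9
D:.
│  scrapy.cfg
│
└─doubanmoive
    │  items.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    └─spiders
            __init__.py
The main files are:
doubanmoive/items.py: defines the fields to scrape, similar to an entity class.
doubanmoive/pipelines.py: the project pipeline file, which processes the data scraped by the spider.
doubanmoive/settings.py: the project settings file.
doubanmoive/spiders: the directory that holds the spiders.
Defining the Item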
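An Item is the container that holds the scraped data, much like an entity class in Java. Open doubanmoive/items.py and you will see the following code, created by default.
from scrapy.item import Item, Field

class DoubanmoiveItem(Item):
    pass
We only need to add the fields we want to scrape to the DoubanmoiveItem class, for example name = Field(). The finished code for our needs looks like this.
from scrapy.item import Item, Field

class DoubanmoiveItem(Item):
    name = Field()            # movie title
    year = Field()            # release year
    score = Field()           # Douban rating
    director = Field()        # director
    classification = Field()  # genre
    actor = Field()           # actors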
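Writing the spider
The spider is the core class of the whole project. In it we define what to crawl (domain, URLs) and the crawl rules. The tutorial in the official Scrapy documentation is based on BaseSpider, but BaseSpider can only crawl a given list of URLs and cannot expand outward from a start URL. Besides BaseSpider, there are other spider classes that can be inherited from directly, such as scrapy.contrib.spiders.CrawlSpider.
Create a new file moive_spider.py in the doubanmoive/spiders directory and fill in the following code.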
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from doubanmoive.items import DoubanmoiveItem

class MoiveSpider(CrawlSpider):
    name = "doubanmoive"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["http://movie.douban.com/top250"]
    rules = [
        Rule(SgmlLinkExtractor(allow=(r'http://movie.douban.com/top250\?start=\d+.*'))),
        Rule(SgmlLinkExtractor(allow=(r'http://movie.douban.com/subject/\d+')), callback="parse_item"),
    ]

    def parse_item(self, response):
        sel = Selector(response)
        item = DoubanmoiveItem()
        item['name'] = sel.xpath('//*[@id="content"]/h1/span[1]/text()').extract()
        item['year'] = sel.xpath('//*[@id="content"]/h1/span[2]/text()').re(r'\((\d+)\)')
        item['score'] = sel.xpath('//*[@id="interest_sectl"]/div/p[1]/strong/text()').extract()
        item['director'] = sel.xpath('//*[@id="info"]/span[1]/a/text()').extract()
        item['classification'] = sel.xpath('//span[@property="v:genre"]/text()').extract()
        item['actor'] = sel.xpath('//*[@id="info"]/span[3]/a[1]/text()').extract()
        return item
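Code notes: MoiveSpider inherits from Scrapy's CrawlSpider. The meaning of name, allowed_domains and start_urls is clear from their names. rules is a little more involved: it defines which URLs to crawl, and every link that matches an allow regular expression is added to the scheduler. Looking at the pagination URL of the Douban Movie Top 250, http://movie.douban.com/top250?start=25&filter=&type= , gives the following rule:
Rule(SgmlLinkExtractor(allow=(r'http://movie.douban.com/top250\?start=\d+.*'))),
The pages we actually want to scrape are the detail pages of each movie. For example, the link for The Shawshank Redemption is http://movie.douban.com/subject/1292052/ , where only the number after subject changes, which leads to the regular expression in the code below. Because we want to scrape the content behind this kind of link, we add the callback attribute so that the response is handed to the parse_item function.
Rule(SgmlLinkExtractor(allow=(r'http://movie.douban.com/subject/\d+')), callback="parse_item"),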
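To sanity-check the two patterns before running a crawl, they can be tried against a few sample URLs with the standard re module. This is only a sketch: the classify helper and the third, non-matching URL are made up for illustration, and it assumes the link extractor searches each URL for the pattern rather than requiring a full match.

```python
import re

# The two allow patterns from the rules above.
LIST_PAGE = re.compile(r'http://movie.douban.com/top250\?start=\d+.*')
DETAIL_PAGE = re.compile(r'http://movie.douban.com/subject/\d+')

def classify(url):
    """Hypothetical helper: report which rule, if any, a URL falls under."""
    if LIST_PAGE.search(url):
        return "follow"        # list page: links are followed, no callback
    if DETAIL_PAGE.search(url):
        return "parse_item"    # detail page: response goes to parse_item
    return "ignored"           # matches neither rule

samples = [
    "http://movie.douban.com/top250?start=25&filter=&type=",
    "http://movie.douban.com/subject/1292052/",
    "http://movie.douban.com/review/best/",  # made-up non-matching URL
]
for url in samples:
    print(classify(url), url)
```

Note that the unescaped dots in the patterns match any character, which is harmless here but slightly looser than the literal host name.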
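The logic in parse_item is very simple: it takes the source of a page whose link matched the rule, extracts content according to a few rules, assigns it to the item and returns the item to the Item Pipeline. Getting the content of most tags does not require complex regular expressions; we can use XPath. XPath is a language for finding information in XML documents, but it also works on HTML. The table below lists common expressions.
Expression    Description
nodename      Selects the child elements of the current node that have this name.
/             Selects from the root node.
//            Selects matching nodes anywhere in the document below the current node, regardless of their position.
.             Selects the current node.
..            Selects the parent of the current node.
@             Selects attributes.
For example, //*[@id="content"]/h1/span[1]/text() returns the text of the first span under the h1 under any element whose id is content. The XPath of a piece of content can be obtained with Chrome Developer Tools (F12): right-click the content you want to scrape and choose Inspect Element, the developer tools open at the bottom with that element located, then right-click the element and choose Copy XPath.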
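The expression can be tried offline on a small document. The sketch below uses the standard library's xml.etree.ElementTree rather than Scrapy's Selector; ElementTree supports only a subset of XPath (there is no text() step, so the text is read from the .text attribute). The HTML fragment is a hand-written stand-in that mimics the structure the expression expects, not a copy of the real page.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written fragment, assumed to resemble the detail page layout.
SAMPLE = """<html><body>
<div id="content">
  <h1><span>The Shawshank Redemption</span> <span>(1994)</span></h1>
</div>
</body></html>"""

root = ET.fromstring(SAMPLE)

# Counterpart of //*[@id="content"]/h1/span[1]/text()
name = root.find(".//*[@id='content']/h1/span[1]").text

# Counterpart of span[2]/text() followed by .re(r'\((\d+)\)')
year_text = root.find(".//*[@id='content']/h1/span[2]").text
year = re.search(r'\((\d+)\)', year_text).group(1)

print(name, year)
```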
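Storing the data
After the spider has collected the data we need to store it in a database. As mentioned earlier, this is the job of the item pipeline, whose typical tasks are:
Cleaning HTML data
Validating the parsed data (checking that the item contains the required fields)
Checking for duplicates (and dropping them)
Storing the parsed data in a database
Because the scraped data comes in many shapes, some of which are awkward to store in a relational database, I wrote a MongoDB pipeline after finishing the MySQL version.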
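The validation and de-duplication steps can be sketched without any database. The class below is a hypothetical illustration, not one of the pipelines used in this article: DropItem is a local stand-in for scrapy.exceptions.DropItem so the snippet runs on its own, the list of required fields is an assumption, and so is the choice of the movie name as the duplicate key. The sample values are made up.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


class ValidateAndDedupPipeline(object):
    REQUIRED = ("name", "year", "score")  # assumed mandatory fields

    def __init__(self):
        self.seen = set()  # names of movies already processed

    def process_item(self, item, spider):
        # Validation: every required field must be present and non-empty.
        for field in self.REQUIRED:
            if not item.get(field):
                raise DropItem("missing %s" % field)
        # De-duplication: drop an item whose name was seen before.
        key = item["name"][0]
        if key in self.seen:
            raise DropItem("duplicate %s" % key)
        self.seen.add(key)
        return item


pipeline = ValidateAndDedupPipeline()
movie = {"name": ["Sample Movie"], "year": ["2000"], "score": ["8.0"]}
print(pipeline.process_item(movie, spider=None))
```

Passing the same item a second time raises DropItem, which is how Scrapy is told to discard an item instead of handing it to the next pipeline.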
MySQL version:
# -*- coding: utf-8 -*-
from scrapy import log
from twisted.enterprise import adbapi
from scrapy.http import Request
import MySQLdb
import MySQLdb.cursors

class DoubanmoivePipeline(object):
    def __init__(self):
        # Asynchronous connection pool provided by Twisted
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
            db='python',
            user='root',
            passwd='root',
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8',
            use_unicode=False
        )

    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def _conditional_insert(self, tx, item):
        # Look the movie up by name so it is not stored twice
        tx.execute("select * from doubanmoive where m_name= %s", (item['name'][0],))
        result = tx.fetchone()
        log.msg(result, level=log.DEBUG)
        print result
        if result:
            log.msg("Item already stored in db:%s" % item, level=log.DEBUG)
        else:
            classification = actor = ''
            lenclassification = len(item['classification'])
            lenactor = len(item['actor'])
            for n in xrange(lenclassification):
                classification += item['classification'][n]
                if n
MongoDB version:
# -*- coding: utf-8 -*-
import pymongo
from scrapy.exceptions import DropItem
from scrapy.conf import settings
from scrapy import log

class MongoDBPipeline(object):
    # Connect to the MongoDB database
    def __init__(self):
        connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Remove invalid data: drop the item if any field is empty.
        # Iterating over an item yields field names, so check item[field].
        valid = True
        for field in item:
            if not item[field]:
                valid = False
                raise DropItem("Missing %s from %s" % (field, item['name']))
        if valid:
            # Insert data into the database
            new_moive = [{
                "name": item['name'][0],
                "year": item['year'][0],
                "score": item['score'][0],
                "director": item['director'],
                "classification": item['classification'],
                "actor": item['actor']
            }]
            self.collection.insert(new_moive)
            log.msg("Item wrote to MongoDB database %s/%s" %
                    (settings['MONGODB_DB'], settings['MONGODB_COLLECTION']),
                    level=log.DEBUG, spider=spider)
        return item
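As you can see, the basic flow is the same in both, but one inconvenience with MySQL is that list-type data has to be converted into a single string with a separator, whereas MongoDB can store list, dict and other types of data directly.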
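The separator conversion needed for MySQL can be done with str.join. This is a small sketch; the '/' separator and the sample genre values are assumptions chosen for illustration.

```python
# A list-valued field as the spider would return it (sample values).
classification = ["Crime", "Drama"]

# Flatten the list into one string so it fits a single text column.
stored = "/".join(classification)
print(stored)

# Split it again when reading the row back.
restored = stored.split("/")
print(restored)
```

This only round-trips cleanly if the separator never appears inside a value, which is one more reason the document-style storage in MongoDB is convenient here.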
Settings file
Before running the crawler, some configuration needs to be added to settings.py.
BOT_NAME = 'doubanmoive'
SPIDER_MODULES = ['doubanmoive.spiders']
NEWSPIDER_MODULE = 'doubanmoive.spiders'
ITEM_PIPELINES = {
    'doubanmoive.mongo_pipelines.MongoDBPipeline': 300,
    'doubanmoive.pipelines.DoubanmoivePipeline': 400,
}
LOG_LEVEL = 'DEBUG'
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
COOKIES_ENABLED = True
MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'python'
MONGODB_COLLECTION = 'test'
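ITEM_PIPELINES registers the MySQL and MongoDB pipelines. The number after each one sets its execution order, in the range 0 to 1000, with lower numbers running first. The settings in the middle, such as DOWNLOAD_DELAY, add a randomized delay, a browser user agent and so on, to keep the crawler from being banned by Douban. The last block is the MongoDB configuration; the MySQL settings could be written the same way.
That completes the crawler for Douban movies. Run scrapy crawl doubanmoive on the command line and let the spider start crawling!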
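As a rough illustration of these two settings: pipelines run in ascending order of their number, and, according to the Scrapy documentation, RANDOMIZE_DOWNLOAD_DELAY makes each wait a random value between 0.5 and 1.5 times DOWNLOAD_DELAY. The sketch below only mimics that arithmetic in plain Python; it does not call Scrapy, and the variable names order and delay are illustrative.

```python
import random

ITEM_PIPELINES = {
    'doubanmoive.mongo_pipelines.MongoDBPipeline': 300,
    'doubanmoive.pipelines.DoubanmoivePipeline': 400,
}
DOWNLOAD_DELAY = 2

# Lower numbers run first, so sort the pipeline paths by their value.
order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(order)

# One simulated wait in the documented 0.5x to 1.5x range (1 to 3 seconds here).
delay = random.uniform(0.5 * DOWNLOAD_DELAY, 1.5 * DOWNLOAD_DELAY)
print("wait %.2f seconds" % delay)
```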
