How to Scrape a Website's Sitemap with Scrapy in Python

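This article shows, by example, how to use Scrapy in Python to crawl the pages listed in a website's sitemap. It is shared here for reference. The details are as follows: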
import re

# Note: these import paths follow the older Scrapy API used in the original example
from scrapy.spider import BaseSpider
from scrapy.utils.response import body_or_str
from scrapy.http import Request
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector


class PageItem(Item):
    # Placeholder item; replace with the fields you actually need
    divText = Field()


class SitemapSpider(BaseSpider):
    name = "SitemapSpider"
    start_urls = ["http://www.domain.com/sitemap.xml"]

    def parse(self, response):
        # Extract the text of every <loc> element from the sitemap
        nodename = 'loc'
        text = body_or_str(response)
        r = re.compile(r"(<%s[\s>])(.*?)(</%s>)" % (nodename, nodename), re.DOTALL)
        for match in r.finditer(text):
            url = match.group(2).strip()
            yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)

        # Mock item
        blah = PageItem()

        # Do all your page parsing and select the elements you want
        blah['divText'] = hxs.select('//div/text()').extract()[0]
        yield blah
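The URL-extraction step in parse can be tried on its own, without Scrapy. The following is a minimal sketch that uses only the standard library. The function name extract_loc_urls and the sample sitemap text are made up for illustration and are not part of the original example. The regular expression is assumed to be the same one the spider compiles.

```python
import re


def extract_loc_urls(sitemap_text, nodename="loc"):
    """Return the stripped text of every <nodename>...</nodename> element.

    Hypothetical helper for illustration; it mirrors the regex-based
    extraction done in the spider's parse method.
    """
    pattern = re.compile(
        r"(<%s[\s>])(.*?)(</%s>)" % (nodename, nodename), re.DOTALL
    )
    # group(2) is the text between the opening and closing tags
    return [m.group(2).strip() for m in pattern.finditer(sitemap_text)]


# Made-up sitemap used only to exercise the helper
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.domain.com/</loc></url>
  <url>
    <loc>
      http://www.domain.com/about
    </loc>
  </url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_loc_urls(SAMPLE_SITEMAP):
        print(url)
```

Because re.DOTALL lets the dot match newlines, a <loc> element that spans several lines is still captured, and strip() removes the surrounding whitespace. A regular expression is a shortcut here rather than a full XML parser, so it may not be robust for unusual sitemaps (for example, ones that wrap URLs in CDATA sections).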
I hope this article is helpful for your Python programming.
