How to Batch-Download Images with a Java Crawler

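Crawling approach
Fetching images like these is essentially just file downloading (HttpClient). Since we want more than a single image, there is also a page-parsing step (jsoup).
jsoup: parses the HTML pages and extracts the image links.
HttpClient: requests the image links and saves the images to disk.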
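Steps
Start by analyzing the home page. It has several main categories (not all of them are covered here, but these are enough, since this is only for learning the technique). The goal is to fetch the images under each category.
Now for the site structure, which I will keep simple. The rough structure is as follows, using one category tab as the example: a category tab page contains many title pages, and each title page contains many image pages (the few dozen images belonging to that title).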
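The three-level structure maps directly onto URL patterns. The sketch below is only an illustration: the class name UrlPlan and its methods are hypothetical, the "page/" and "/" suffixes are the ones the crawler code in this article appends, and the title URL in main is a made-up placeholder.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the three-level URL layout: category -> title pages -> image pages. */
final class UrlPlan {

    /** Listing page number `page` of a category; categoryUrl is expected to end with "/". */
    static String categoryPage(String categoryUrl, int page) {
        return categoryUrl + "page/" + page;
    }

    /** Image page number `index` of a title; titleUrl is expected to have no trailing "/". */
    static String imagePage(String titleUrl, int index) {
        return titleUrl + "/" + index;
    }

    /** All listing-page URLs of one category, pages 1..pages. */
    static List<String> categoryPages(String categoryUrl, int pages) {
        List<String> urls = new ArrayList<>();
        for (int i = 1; i <= pages; i++) {
            urls.add(categoryPage(categoryUrl, i));
        }
        return urls;
    }

    public static void main(String[] args) {
        categoryPages("https://www.mzitu.com/xinggan/", 3).forEach(System.out::println);
        // Hypothetical title URL, used only to show the pattern.
        System.out.println(imagePage("https://www.mzitu.com/12345", 2));
    }
}
```

Keeping the URL construction in one place makes it easy to see how many requests a given page count will generate before the crawl starts.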
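Code
Add the project dependencies through their Maven coordinates, or download the corresponding jars and add them to the project directly.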
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.6</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
Entity class Picture and utility class HeaderUtil
Entity class: wraps the attributes in one object, which makes them easier to pass around.
package com.picture;

/** An image to download: the file name to save it under and its URL. */
public class Picture {
    private String title;
    private String url;

    public Picture(String title, String url) {
        this.title = title;
        this.url = url;
    }

    public String getTitle() {
        return this.title;
    }

    public String getUrl() {
        return this.url;
    }
}
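Utility class: rotates the User-Agent (UA) on every request. (I don't know whether it really helps; I am using my own single IP, so it probably makes little difference.)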
package com.picture;

/** Pool of User-Agent strings to rotate through. */
public class HeaderUtil {
    public static String[] headers = {
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14",
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Win64; x64; Trident/6.0)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11",
        "Opera/9.25 (Windows NT 5.1; U; en)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)",
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12",
        "Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9",
        "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7",
        "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0"
    };
}
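Picking a random entry from such a list can be wrapped in a small helper. This is a minimal sketch, not part of the original project: UserAgentPicker and pick are hypothetical names, and the two sample strings merely stand in for the full list.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Sketch: choose a random User-Agent string from a fixed pool. */
final class UserAgentPicker {

    // Two sample entries; in practice this would be the full pool.
    static final String[] AGENTS = {
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0",
        "Opera/9.25 (Windows NT 5.1; U; en)"
    };

    /** Returns one element of agents, chosen uniformly at random. */
    static String pick(String[] agents) {
        if (agents == null || agents.length == 0) {
            throw new IllegalArgumentException("no user agents configured");
        }
        return agents[ThreadLocalRandom.current().nextInt(agents.length)];
    }

    public static void main(String[] args) {
        System.out.println(pick(AGENTS));
    }
}
```

ThreadLocalRandom avoids creating a new Random for every request and stays safe if the crawler ever goes back to multiple threads.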
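Downloader class
Multithreading is simply too fast. With only one IP and no proxy IPs to use (I don't know much about them either), a multithreaded crawler gets its IP banned very quickly.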
package com.picture;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Random;

import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;

import com.m3u8.HttpClientUtil;

/** Downloads a single picture into the given directory. */
public class SinglePictureDownloader {
    private String referer;
    private CloseableHttpClient httpClient;
    private Picture picture;
    private String filePath;

    public SinglePictureDownloader(Picture picture, String referer, String filePath) {
        this.httpClient = HttpClientUtil.getHttpClient();
        this.picture = picture;
        this.referer = referer;
        this.filePath = filePath;
    }

    public void download() {
        HttpGet get = new HttpGet(picture.getUrl());
        Random rand = new Random();
        // Set the request headers: a random User-Agent plus the Referer
        get.setHeader("User-Agent", HeaderUtil.headers[rand.nextInt(HeaderUtil.headers.length)]);
        get.setHeader("Referer", referer);
        System.out.println(referer);
        HttpEntity entity = null;
        try (CloseableHttpResponse response = httpClient.execute(get)) {
            int statusCode = response.getStatusLine().getStatusCode();
            if (statusCode == 200) {
                entity = response.getEntity();
                if (entity != null) {
                    File picFile = new File(filePath, picture.getTitle());
                    // Stream the response body straight into the file
                    try (OutputStream out = new BufferedOutputStream(new FileOutputStream(picFile))) {
                        entity.writeTo(out);
                        System.out.println("Download finished: " + picFile.getAbsolutePath());
                    }
                }
            }
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                // Release the entity. I am not entirely sure how HttpClient resources should be closed.
                EntityUtils.consume(entity);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
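This is the utility class that hands out the HttpClient, to avoid the cost of creating connections over and over. (Since I crawl with a single thread, it is not much use here; one HttpClient would be enough. I started out with multiple threads, but got banned after only a few images, so I switched to a single-threaded crawler and the connection pool simply stayed.)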
package com.m3u8;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

/** Builds HttpClient instances that share one connection pool. */
public class HttpClientUtil {
    private static final int TIME_OUT = 10 * 1000; // milliseconds
    private static PoolingHttpClientConnectionManager pcm; // HttpClient connection pool manager
    private static RequestConfig requestConfig;

    static {
        requestConfig = RequestConfig.custom()
                .setConnectionRequestTimeout(TIME_OUT)
                .setConnectTimeout(TIME_OUT)
                .setSocketTimeout(TIME_OUT)
                .build();
        pcm = new PoolingHttpClientConnectionManager();
        pcm.setMaxTotal(50);
        pcm.setDefaultMaxPerRoute(10); // Probably not needed here.
    }

    public static CloseableHttpClient getHttpClient() {
        return HttpClients.custom()
                .setConnectionManager(pcm)
                .setDefaultRequestConfig(requestConfig)
                .build();
    }
}
The most important class: the page parser PictureSpider
package com.picture;

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import com.m3u8.HttpClientUtil;

/**
 * Starts from the category tabs at the top, then crawls the listing pages
 * of each category and the image pages of each title.
 */
public class PictureSpider {
    private CloseableHttpClient httpClient;
    private String referer;
    private String rootPath;
    private String filePath;

    public PictureSpider() {
        httpClient = HttpClientUtil.getHttpClient();
    }

    /**
     * Starts the crawl.
     *
     * Begins with the first entry of the crawl queue and processes each URL in turn.
     *
     * Paging: 10 listing pages are crawled.
     * Each URL is one category, and each category gets its own folder.
     */
    public void start(List<String> urlList) {
        urlList.forEach(url -> {
            this.referer = url;
            // Directory name taken from the category part of the URL
            String dirName = url.substring(22, url.length() - 1);
            // Category directory. The path is hard-coded; change it to your own.
            File path = new File("d:/dragonfile/dbc/mzt/", dirName);
            if (!path.exists()) {
                path.mkdirs();
            }
            // Set unconditionally so that an already existing directory is reused
            rootPath = path.toString();
            for (int i = 1; i <= 10; i++) { // Page through the listing; a few pages are enough
                this.page(url + "page/" + i);
            }
        });
    }

    /**
     * Fetches one listing page and follows the title links on it.
     */
    public void page(String url) {
        System.out.println("url: " + url);
        String html = this.getHtml(url); // Fetch the page
        Map<String, String> picMap = this.extractTitleUrl(html); // Extract title -> URL
        if (picMap == null) {
            return;
        }
        // Fetch the image pages behind each title
        this.getPictureHtml(picMap);
    }

    private String getHtml(String url) {
        String html = null;
        HttpGet get = new HttpGet(url);
        get.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36");
        get.setHeader("Referer", url);
        try (CloseableHttpResponse response = httpClient.execute(get)) {
            int statusCode = response.getStatusLine().getStatusCode();
            if (statusCode == 200) {
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    // toString reads the whole body and closes the content stream
                    html = EntityUtils.toString(entity, "utf-8");
                }
            } else {
                System.out.println(statusCode);
            }
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return html;
    }

    private Map<String, String> extractTitleUrl(String html) {
        if (html == null) {
            return null;
        }
        Document doc = Jsoup.parse(html);
        Elements pictures = doc.select("ul#pins > li");
        // For some reason I could not select a[0] directly; I don't know this area well,
        // so I take one extra step for now.
        Elements pictureA = pictures.stream()
                .map(pic -> pic.getElementsByTag("a").first())
                .collect(Collectors.toCollection(Elements::new));
        // Key: the alt text of the thumbnail (the title); value: the link to the title page
        return pictureA.stream().collect(Collectors.toMap(
                pic -> pic.getElementsByTag("img").first().attr("alt"),
                pic -> pic.attr("href")));
    }

    /**
     * Follows each title link and pages through it again to get the image links.
     */
    private void getPictureHtml(Map<String, String> picMap) {
        // Enter each title page and download page by page.
        picMap.forEach((title, url) -> {
            // Each series of images gets its own folder.
            File dir = new File(rootPath, title.trim());
            if (!dir.exists()) {
                dir.mkdir();
            }
            // Folder of the current series; set even if it already existed
            filePath = dir.toString();
            for (int i = 1; i <= 60; i++) {
                String html = this.getHtml(url + "/" + i);
                if (html == null) {
                    // A series usually has fewer images than that.
                    // If the page comes back null, stop downloading this series.
                    return;
                }
                Picture picture = this.extractPictureUrl(html);
                System.out.println("Starting download");
                // Multithreading is too fast (fast is not always good), so crawl in a single thread
                SinglePictureDownloader downloader = new SinglePictureDownloader(picture, referer, filePath);
                downloader.download();
                try {
                    // Don't crawl too fast. This is only for learning; don't disrupt someone else's service.
                    Thread.sleep(1500);
                    System.out.println("One image done, resting for 1.5 seconds.");
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        });
    }

    /**
     * Extracts the title and link of the image on one image page.
     */
    private Picture extractPictureUrl(String html) {
        Document doc = Jsoup.parse(html);
        // Use the heading as the file name
        String title = doc.getElementsByTag("h3")
                .first()
                .text();
        // Image link (src attribute of the img tag)
        String url = doc.getElementsByAttributeValue("class", "main-image")
                .first()
                .getElementsByTag("img")
                .attr("src");
        // Append the file extension of the image
        title = title + url.substring(url.lastIndexOf("."));
        return new Picture(title, url);
    }
}
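The directory name above depends on the fixed offset 22, which is the length of "https://www.mzitu.com/" and breaks for any other host. As a hedged alternative, the sketch below derives the category name and the file extension from the URL path with java.net.URI. PathNames and both method names are hypothetical, and the example.com URLs are placeholders.

```java
import java.net.URI;

/** Sketch: derive folder and file-name parts from a URL without fixed offsets. */
final class PathNames {

    /** Last non-empty path segment, e.g. "https://host/xinggan/" -> "xinggan"; "" if there is none. */
    static String categoryName(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return "";
        }
        String[] parts = path.split("/");
        for (int i = parts.length - 1; i >= 0; i--) {
            if (!parts[i].isEmpty()) {
                return parts[i];
            }
        }
        return "";
    }

    /** Extension of the last path segment including the dot, or "" if it has none. */
    static String extension(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return "";
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        // Only a dot after the last slash belongs to the file name
        return dot > slash ? path.substring(dot) : "";
    }

    public static void main(String[] args) {
        System.out.println(categoryName("https://www.mzitu.com/xinggan/"));
        System.out.println(extension("https://example.com/img/01.jpg?size=large"));
    }
}
```

Because only the path is inspected, a query string after the file name does not end up in the extension. Note that URI.create throws IllegalArgumentException for malformed input such as a URL containing spaces.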
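Launcher class Bootstrap
There is a crawl queue here, but in the end I did not even finish the first entry, because I miscalculated and underestimated the volume by two orders of magnitude. The program itself works correctly, though.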
package com.picture;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Crawler launcher class
 */
public class Bootstrap {
    public static void main(String[] args) {
        // Anti-crawling measures: UA and Referer checks, which are easy to get around.
        // Referer: https://www.mzitu.com
        // Use an array as the crawl queue
        String[] urls = new String[] {
            "https://www.mzitu.com/xinggan/",
            "https://www.mzitu.com/zipai/"
        };
        // Fill the initial queue and start the crawler
        List<String> urlList = new ArrayList<>(Arrays.asList(urls));
        PictureSpider spider = new PictureSpider();
        spider.start(urlList);
    }
}
Crawl results
Notes
There is a miscalculation here, in the following code:
for (int i = 1; i <= 10; i++) { // Page through the listing; a few pages are enough
    this.page(url + "page/" + i);
}
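The upper bound for i is too large, because I got the numbers wrong. Downloading with these settings gives a total of 4 * 10 * (30-5) * 60 = 60000 images. (Each listing page contains 30 title pages, about 5 of which are ads.) I originally thought there were only a few hundred images! This is an estimate, but the real number will not be far off (not by an order of magnitude). That is why, after running it for a while, I found it had only downloaded images from the first queue entry. As a program for learning how to write a crawler, it still does its job well.
This program is only meant for learning. I set the interval between image downloads to 1.5 seconds and the program is single-threaded, so it is slow. That does not matter as long as the program works correctly; I doubt anyone would really wait until every image has been downloaded.
That would take quite a while: 60000 * 1.5 s = 90000 s = 25 h. This is only a rough estimate that ignores the rest of the program's running time, which is negligible by comparison.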
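The estimate is easy to recompute before starting a crawl. With the factors quoted above (4, 10, 30 minus 5, and 60) and a 1.5 s pause per image, the product comes to 60,000 images and 25 hours of pure waiting time. The sketch below is only an illustration: CrawlEstimate and its method names are hypothetical, and reading the factor 4 as the number of categories is an assumption.

```java
import java.time.Duration;

/** Sketch: estimate how many images a crawl touches and how long the pauses add up to. */
final class CrawlEstimate {

    /** Product of the four factors; long arithmetic avoids int overflow for larger crawls. */
    static long totalImages(int categories, int pagesPerCategory, int titlesPerPage, int imagesPerTitle) {
        return (long) categories * pagesPerCategory * titlesPerPage * imagesPerTitle;
    }

    /** Total time spent sleeping between downloads. */
    static Duration totalDelay(long images, long delayMillis) {
        return Duration.ofMillis(images * delayMillis);
    }

    public static void main(String[] args) {
        // 4 is assumed to be the number of categories; 30 - 5 removes the ad entries per page.
        long images = totalImages(4, 10, 30 - 5, 60);
        Duration delay = totalDelay(images, 1500);
        // prints "60000 images, about 25 hours"
        System.out.println(images + " images, about " + delay.toHours() + " hours");
    }
}
```

Running a calculation like this with the intended page counts first makes it obvious when a loop bound is one or two orders of magnitude too generous.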
