How to Build a Simple Image Crawler with Java IO Streams and Networking

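A simple application of Java IO streams and networking
I recently came across the URL class and used it to write a small crawler in Java. It turned out to be fun, so I am sharing it here. The program fetches images from Baidu Images by keyword, which is the same thing we do when we search Baidu Images in a browser; writing the crawler is simply a way to practice the techniques involved. (This program is for learning purposes only and has no other use!)
Java IO streams and the URL class
Java IO streams
Java's IO streams are the foundation of input and output and make reading and writing data convenient. Java abstracts the different input/output sources (keyboard, files, network connections, and so on) as "streams", which lets a Java program access all of them in the same way.
Because IO streams already abstract over the various input and output sources, we can handle all of them with largely the same code: each one only needs to be treated as an input or output stream. This is the benefit of programming against an abstraction.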
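To make the "same code for different sources" point concrete, here is a minimal sketch. The class name StreamSourceDemo and the helper readAll are made up for this illustration and are not part of the crawler; the sketch only assumes that both sources can be opened as an InputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class StreamSourceDemo {
    // Reads every byte from any InputStream, whatever the underlying source is.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello stream".getBytes(StandardCharsets.UTF_8);

        // Source 1: an in-memory byte array.
        try (InputStream in = new ByteArrayInputStream(data)) {
            System.out.println(new String(readAll(in), StandardCharsets.UTF_8));
        }

        // Source 2: a temporary file holding the same bytes.
        Path tmp = Files.createTempFile("stream-demo", ".txt");
        try {
            Files.write(tmp, data);
            try (InputStream in = Files.newInputStream(tmp)) {
                System.out.println(new String(readAll(in), StandardCharsets.UTF_8));
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

readAll never needs to know whether the bytes come from memory or from disk; a network connection's input stream could be passed in exactly the same way.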
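The URL class
URI and URL
Before looking at what a URL is, a few words about URIs.
**A URI (Uniform Resource Identifier)** is a string that identifies a resource using a particular syntax. The resource may be a file on a server or anything else. A URI consists of a scheme and a scheme-specific part, separated by a colon:
scheme:scheme-specific-part
The scheme-specific part has no fixed syntax, but many schemes use a hierarchical form such as:
//authority/path?query
**A URL (Uniform Resource Locator)** is a kind of URI: besides identifying a resource, it also gives the resource a specific network location that a client can use to retrieve a representation of it.
Note: URLs and URIs are not the same thing. A generic URI can tell you what a resource is, but not where it is or how to get it.
Java has a class for each: java.net.URI (which only identifies a resource) and java.net.URL (which can both identify a resource and retrieve it).
The network location in a URL usually includes the protocol used to access the server (ftp, http, and so on), the server's host name or IP address, and the path of the file on that server. A typical URL looks like https://www.baidu.com/. It refers to an HTML file on Baidu's server (the Baidu search home page), which can be reached over HTTPS even though no HTML file name appears at the end of the URL. With Tomcat the URL usually looks like http://127.0.0.1:8080/foods/index.html; the two forms work the same way.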
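The parts of a URL described above can be inspected with java.net.URI. The following is a small sketch; the class name UriPartsDemo, the helper describe, and the query string ?id=1 are invented for the example.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

final class UriPartsDemo {
    // Formats scheme, host, port, path and query of an absolute hierarchical URI.
    // getPort() returns -1 and getQuery() returns null when the part is absent.
    static String describe(String text) {
        URI uri = URI.create(text);
        return uri.getScheme() + " | " + uri.getHost() + " | " + uri.getPort()
                + " | " + uri.getPath() + " | " + uri.getQuery();
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(describe("http://127.0.0.1:8080/foods/index.html?id=1"));
        System.out.println(describe("https://www.baidu.com/"));

        // A URI that names a network location can be converted to a URL.
        URL url = URI.create("https://www.baidu.com/").toURL();
        System.out.println(url.getProtocol() + " " + url.getHost() + " " + url.getDefaultPort());
    }
}
```

The second URI has no explicit port, so getPort() reports -1, while URL.getDefaultPort() falls back to the protocol's default port.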
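That is enough background for now. These are only the basic concepts; if you are interested, the relevant books cover them in much more detail.
The URL class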
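java.net.URL is an abstract representation of a Uniform Resource Locator. Instead of relying on inheritance to configure instances for different kinds of URL, it uses the Strategy design pattern: the protocol handlers are the strategies, and the URL class forms the context through which a strategy is selected.
It is worth mentioning that Java's IO streams also use a design pattern, the Decorator pattern, as in the following line (path stands for the path name of the target file):
    DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(new File(path))));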
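To see the decorator idea at work without touching the file system, here is a minimal sketch that swaps the FileOutputStream for an in-memory ByteArrayOutputStream. The class DecoratorDemo and its two methods are hypothetical names chosen for the illustration.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class DecoratorDemo {
    // Each wrapper adds one capability: buffering, then typed writes.
    static byte[] encode(int id, String name) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(sink))) {
            dos.writeInt(id);
            dos.writeUTF(name);
        } // closing the outermost stream flushes the buffer into the sink
        return sink.toByteArray();
    }

    // The input side is decorated the same way, in the same order.
    static String decode(byte[] bytes) throws IOException {
        try (DataInputStream dis = new DataInputStream(
                new BufferedInputStream(new ByteArrayInputStream(bytes)))) {
            int id = dis.readInt();
            String name = dis.readUTF();
            return id + ":" + name;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(decode(encode(7, "cat")));
    }
}
```

Only the innermost stream knows where the bytes go; replacing it with a FileOutputStream would not change any of the wrapping code.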
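The URL class has many constructors. As this was my first time using it, I picked the simplest form. (When you are just starting out there is no need to know all of them; use it first and fill in the knowledge gradually.)
public URL(String url) throws MalformedURLException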
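As a small illustration of the checked MalformedURLException, the sketch below tests whether a string can be turned into a URL. UrlCheckDemo and isWellFormed are hypothetical names. Note that on JDK 20 and later this constructor is deprecated in favor of URI.create(...).toURL(); it still works, hence the suppressed warning.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class UrlCheckDemo {
    // Returns true when the URL constructor accepts the string.
    // The constructor rejects strings without a protocol and
    // strings whose protocol has no registered handler.
    @SuppressWarnings("deprecation")
    static boolean isWellFormed(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("https://www.baidu.com/"));
        System.out.println(isWellFormed("www.baidu.com"));
    }
}
```

The constructor only parses the string; no network connection is opened until openConnection() and one of the connection's methods are called.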
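The Java crawler
Talk is cheap, show me the code!
The material above is only basic background; if you already know it, you can skip straight to this part.
Basic structure of the project: the package dragon contains four classes, Client, DataProcessUtil, DownloadUtil, and Window.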
Client
    package dragon;

    import java.io.File;
    import java.io.IOException;

    public class Client {
        public static final String downloadFilePath = "d:\\dragondatafile\\cat";

        public static void main(String[] args) throws IOException {
            // Create the download directory if it does not exist yet.
            File dir = new File(Client.downloadFilePath);
            if (!dir.exists()) {
                dir.mkdirs();
            }
            // Open the download window.
            new Window("Dragon");
        }
    }
DataProcessUtil
    package dragon;

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;

    public class DataProcessUtil {
        // Fetch the HTML behind the given link.
        public static String getData(String link) throws IOException {
            URL url = new URL(link);
            URLConnection connection = url.openConnection();
            StringBuilder strBuilder = new StringBuilder();
            try (BufferedInputStream bis = new BufferedInputStream(connection.getInputStream())) {
                int hasRead = 0;
                byte[] b = new byte[1024];
                // Each chunk is decoded on its own with the platform default charset,
                // so a multi-byte character split across two chunks may be garbled.
                // The image links extracted below are plain ASCII and are not affected.
                while ((hasRead = bis.read(b)) != -1) {
                    strBuilder.append(new String(b, 0, hasRead));
                }
            }
            return strBuilder.toString();
        }

        // Extract the image links from the HTML.
        public static List<String> getLinkList(String str) {
            String regx = "\"objURL\":\"(.*?)\",";
            Pattern p = Pattern.compile(regx);
            Matcher m = p.matcher(str);
            List<String> strs = new LinkedList<>();
            while (m.find()) {
                strs.add(m.group(0));
            }
            // Use the Stream API to strip the leading "objURL":" (10 characters)
            // and the trailing ", (2 characters) from every match.
            return strs.stream()
                    .map(s -> s.substring(10, s.length() - 2))
                    .collect(Collectors.toList());
        }
    }
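Notes:
The HTML of the page is fetched and processed, and a regular expression extracts the image links from it.
(The regular expression is based on other people's implementations; writing it requires analyzing the content of the HTML.)
    String regx = "\"objURL\":\"(.*?)\",";
    // Use the Stream API to process the matches and return them.
    return strs.stream()
            .map(s -> s.substring(10, s.length() - 2))
            .collect(Collectors.toList());
The Stream API added in Java 8 walks through the matches, strips the surrounding text from each one, and collects all the image URLs into a list that is returned.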
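The extraction step can be tried in isolation on a hand-written string. The sketch below is a variation, not the article's exact code: it reads the capture group with group(1) instead of trimming group(0) with substring. The class ObjUrlDemo and the sample input are made up for the illustration, and the real page content may look different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ObjUrlDemo {
    // Same pattern as above; the parentheses capture only the link itself.
    private static final Pattern OBJ_URL = Pattern.compile("\"objURL\":\"(.*?)\",");

    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = OBJ_URL.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // group(1) is the text matched by (.*?)
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical fragment shaped like the data embedded in the page.
        String sample = "{\"objURL\":\"http://example.com/a.jpg\",\"fromURL\":\"x\"},"
                + "{\"objURL\":\"http://example.com/b.png\",\"width\":1}";
        System.out.println(extract(sample));
    }
}
```

Using the capture group avoids the hard-coded offsets 10 and 2, which would silently break if the pattern were ever changed.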
DownloadUtil
    package dragon;

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.net.URL;
    import java.util.Date;
    import java.util.List;
    import java.util.Random;

    public class DownloadUtil {
        public static void download(List<String> strs) {
            strs.stream().forEach(u -> {
                try {
                    URL url = new URL(u);
                    String contentType = url.openConnection().getContentType();
                    if (contentType != null && contentType.contains("image/")) {
                        // Pick the file extension from the content type.
                        String fileType = null;
                        if (contentType.contains("jpeg")) {
                            fileType = ".jpeg";
                        } else if (contentType.contains("jpg")) {
                            fileType = ".jpg";
                        } else {
                            fileType = ".png";
                        }
                        // GIF images do not seem to display correctly.
                        // File name: current time in milliseconds + random number + extension.
                        Random rand = new Random(System.currentTimeMillis());
                        String fileName = new Date().getTime() + rand.nextInt(10000) + fileType;
                        Runnable r = () -> {
                            int flag = 0;
                            File imageFile = new File(Client.downloadFilePath, fileName);
                            try (BufferedInputStream bis = new BufferedInputStream(url.openConnection().getInputStream());
                                 BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(imageFile))) {
                                int hasRead = 0;
                                byte[] b = new byte[1024];
                                while ((hasRead = bis.read(b)) != -1) {
                                    bos.write(b, 0, hasRead);
                                }
                            } catch (IOException e) {
                                System.out.println("Download failed!");
                                // Delete the failed download; otherwise only part of the image displays.
                                if (imageFile.exists()) {
                                    boolean b = imageFile.delete();
                                    System.out.println("Download failed, image deleted: " + b);
                                }
                                flag = 1;
                                e.printStackTrace();
                            }
                            if (flag == 0) System.out.println("Download finished: " + fileName);
                        };
                        Thread t = new Thread(r);
                        t.start(); // start the download thread
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                    System.out.println("Bad link!");
                }
            });
        }
    }
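Note: there is a problem here. Downloads are affected by network conditions and sometimes fail. If a download fails after it has already started, the image may come out corrupted, for example with some strange colors. I added handling for failed downloads, but it did not seem to have any effect.
Checking the file size alone does not solve the corrupted-image problem; there is one more case to consider. The solution is described in detail at the end.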
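One detail of the DownloadUtil code above is that every content type other than JPEG is saved with a .png extension, which may be part of why GIF files appear broken. A more explicit mapping could look like the sketch below; ImageTypeDemo and extensionFor are hypothetical names, and the list of types is only a sample.

```java
import java.util.Locale;

final class ImageTypeDemo {
    // Maps an image Content-Type to a file extension.
    // Returns null when the value is missing or not one of the listed image types.
    static String extensionFor(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Drop parameters such as "; charset=UTF-8" and normalise the case.
        int semi = contentType.indexOf(';');
        String type = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim().toLowerCase(Locale.ROOT);
        switch (type) {
            case "image/jpeg": return ".jpg";
            case "image/png":  return ".png";
            case "image/gif":  return ".gif";
            case "image/webp": return ".webp";
            case "image/bmp":  return ".bmp";
            default:           return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(extensionFor("image/gif"));
        System.out.println(extensionFor("text/html; charset=UTF-8"));
    }
}
```

A caller can skip the download when the result is null instead of guessing an extension.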
Window
    package dragon;

    import java.awt.FlowLayout;
    import java.io.IOException;
    import java.util.List;

    import javax.swing.Box;
    import javax.swing.JButton;
    import javax.swing.JFrame;
    import javax.swing.JLabel;
    import javax.swing.JOptionPane;
    import javax.swing.JTextField;

    public class Window extends JFrame {
        /** Auto-generated serial version UID. */
        private static final long serialVersionUID = 7809323808831342296L;

        private JLabel label_keyword, label_page;
        private JTextField textField, textPage;
        private JButton download;

        public Window(String name) {
            super(name);
            this.init();
            // Set the layout.
            this.setLayout(new FlowLayout());
            this.setBounds(400, 400, 250, 150);
            this.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            this.setVisible(true);
        }

        private void init() {
            label_keyword = new JLabel("Keyword");
            label_page = new JLabel("Pages");
            textField = new JTextField(10);
            textPage = new JTextField(10);
            download = new JButton("Download");

            download.addActionListener(e -> {
                String keyword = textField.getText().trim();
                String page = textPage.getText().trim();
                int download_page = 0;
                if (keyword.length() == 0 || page.length() == 0) {
                    JOptionPane.showMessageDialog(null, "Keyword and page count must not be empty!", "Warning", JOptionPane.WARNING_MESSAGE);
                    return;
                }
                try {
                    // Parse the page count; a regular expression could be used for stricter validation.
                    download_page = Integer.parseInt(page);
                } catch (NumberFormatException exp) {
                    JOptionPane.showMessageDialog(null, "Page count must be a number!", "Warning", JOptionPane.WARNING_MESSAGE);
                    return;
                }
                String link = null;
                for (int i = 1; i <= download_page; i++) {
                    // Download the images page by page.
                    link = "http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=" + keyword + "&pn=" + i * 20;
                    this.download(link);
                }
            });

            Box boxH1 = Box.createHorizontalBox();
            boxH1.add(label_keyword);
            boxH1.add(Box.createHorizontalStrut(10));
            boxH1.add(textField);

            Box boxH2 = Box.createHorizontalBox();
            boxH2.add(label_page);
            boxH2.add(Box.createHorizontalStrut(23));
            boxH2.add(textPage);

            Box boxH3 = Box.createHorizontalBox();
            boxH3.add(download);

            Box boxV = Box.createVerticalBox();
            boxV.add(boxH1);
            boxV.add(Box.createVerticalStrut(10));
            boxV.add(boxH2);
            boxV.add(Box.createVerticalStrut(10));
            boxV.add(boxH3);
            this.add(boxV);
        }

        private void download(String link) {
            try {
                String str = DataProcessUtil.getData(link);
                List<String> links = DataProcessUtil.getLinkList(str);
                // Download on a separate thread so the UI is not blocked.
                Thread t = new Thread(() -> {
                    DownloadUtil.download(links);
                });
                t.start();
            } catch (IOException e1) {
                e1.printStackTrace();
                JOptionPane.showMessageDialog(null, "Nothing found!", "Warning", JOptionPane.WARNING_MESSAGE);
            }
        }
    }
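Note:
Do not click the download button again before the images have finished downloading, otherwise an error is reported, because a thread cannot be started a second time.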
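The rule that a thread cannot be started twice is easy to check on its own. The sketch below is independent of the crawler code; ThreadRestartDemo and secondStartFails are hypothetical names.

```java
final class ThreadRestartDemo {
    // Returns true if calling start() a second time on the same Thread is rejected.
    static boolean secondStartFails() {
        Thread t = new Thread(() -> { /* nothing to do */ });
        t.start();
        try {
            t.start(); // a Thread object may be started at most once
            return false;
        } catch (IllegalThreadStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("second start rejected: " + secondStartFails());

        // Running the same task again needs a new Thread object each time.
        Runnable task = () -> System.out.println("task ran");
        new Thread(task).start();
        new Thread(task).start();
    }
}
```

A Runnable can be reused freely; it is only the Thread object that is single-use.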
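Running result
How it works
Here is a rough diagram of the process; please bear with it:
Explanation: first, the Baidu Images URL is used to fetch the content of that page (the HTML). What we normally see in a browser is the page after the browser has rendered it; a crawler gets the raw HTML, which you can also view in the browser by pressing F12. The image links are all inside the HTML, so we need to pull them out. This is where **regular expressions** come in: a regular expression can extract the information we need (the URL, or address, of each resource). Fetching the HTML and fetching an image are in fact the same process.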
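The search URL itself is built by plain string concatenation in the Window class. The sketch below shows the same construction with one addition that is not in the original code: the keyword is passed through URLEncoder so that spaces, ampersands, and non-ASCII characters cannot break the query string. SearchLinkDemo and buildLink are hypothetical names; the base address and the page * 20 offset are taken from the Window code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class SearchLinkDemo {
    static final String BASE = "http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8";

    // Builds the link for one result page; page counts from 1 as in the Window class.
    static String buildLink(String keyword, int page) {
        try {
            String encoded = URLEncoder.encode(keyword, "UTF-8");
            return BASE + "&word=" + encoded + "&pn=" + page * 20;
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildLink("cute cat", 1));
        System.out.println(buildLink("a&b", 2));
    }
}
```

Without the encoding step, a keyword such as a&b would be read by the server as the keyword a followed by a separate parameter named b.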
A closer look at the step that fetches the HTML:
    // Fetch the HTML behind the given link.
    public static String getData(String link) throws IOException {
        URL url = new URL(link);
        URLConnection connection = url.openConnection();
        StringBuilder strBuilder = new StringBuilder();
        try (BufferedInputStream bis = new BufferedInputStream(connection.getInputStream())) {
            int hasRead = 0;
            byte[] b = new byte[1024];
            while ((hasRead = bis.read(b)) != -1) {
                strBuilder.append(new String(b, 0, hasRead));
            }
        }
        return strBuilder.toString();
    }
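A URL object is created from the link parameter, a URLConnection is obtained with URLConnection connection = url.openConnection();, and the connection's getInputStream() method then provides the input stream. That completes fetching the resource. I emphasize "resource" because downloading an image follows exactly the same process.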
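Because the three steps (URL, openConnection(), getInputStream()) do not depend on what the resource is, they can be exercised without any network access by pointing a file: URL at a temporary file. This is a sketch under that assumption; FetchDemo and copy are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;

final class FetchDemo {
    // Copies the resource behind a URL to the given stream and returns the byte count.
    static long copy(URL url, OutputStream out) throws IOException {
        URLConnection connection = url.openConnection();
        long total = 0;
        try (InputStream in = new BufferedInputStream(connection.getInputStream())) {
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("fetch-demo", ".bin");
        try {
            Files.write(tmp, new byte[] {1, 2, 3, 4, 5});
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            // A file: URL goes through the same openConnection()/getInputStream() path.
            long n = copy(tmp.toUri().toURL(), out);
            System.out.println(n + " bytes copied");
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Passing an http: URL and a FileOutputStream to the same method would turn it into an image download.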
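Summary
Although this little program does very little, it touches on quite a few topics (Java IO streams, networking, the Stream API, threads), which makes it good practice for beginners. None of these topics is hard on its own, but combining them is fun.
Appendix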
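The strange colors can be traced to the saved file's size differing from the size of the original image. Explaining why a size mismatch shows up this way probably takes some knowledge of graphics and is beyond the scope of this article. To find out which images were bad, I compared the size of the saved file with the size of the image on the network and deleted the ones that had not downloaded properly. Some files, however, could not be deleted. The material I found mostly said that the file was in use by another process, which should not have been the case here. After some checking, the culprit turned out to be multithreading, together with the fact that the code runs very fast: the file name was the current time in milliseconds plus a random number from a generator seeded with the current time, and with multiple threads the chance of a duplicate is surprisingly high.
So the cause is this: when the size was found to be wrong and the program tried to delete the file, the file was still in use by another thread and could not be deleted. (Names repeat when two threads start in the same millisecond: the seeds are equal, so the random numbers are equal too. Some images are then written into the same file as another image, which corrupts it.)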
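To summarize:
There are two kinds of corrupted images:
1. Network problems prevent the image from downloading completely. This cannot be fixed; the file can only be deleted.
2. Duplicate file names cause the data of several images to be written into one file. This is a program bug and can be avoided.
Solutions:
For the first case, simply delete the bad image.
For the second case, duplicate names must be avoided, so I redesigned the naming scheme:
the file name is now the current time in milliseconds plus a random UUID (which, from what I read, works well). Note: UUIDs have one drawback, which is that the names get long.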
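The collision described above can be reproduced deterministically by fixing the timestamp. In the sketch below, FileNameDemo and its two methods are hypothetical names, and the .jpg extension is only a placeholder; seededName mirrors the original scheme and uuidName mirrors the revised one.

```java
import java.util.Random;
import java.util.UUID;

final class FileNameDemo {
    // Original scheme: the random generator is seeded with the timestamp itself,
    // so two calls in the same millisecond draw the same "random" number.
    static String seededName(long millis) {
        Random rand = new Random(millis);
        return (millis + rand.nextInt(10000)) + ".jpg";
    }

    // Revised scheme: the UUID does not depend on the timestamp.
    static String uuidName(long millis) {
        return millis + UUID.randomUUID().toString() + ".jpg";
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(seededName(now).equals(seededName(now))); // always the same name
        System.out.println(uuidName(now).equals(uuidName(now)));
    }
}
```

Seeding a generator with the current time adds no randomness beyond the time value, which is why the first scheme collides whenever two threads share a millisecond.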
Revised source file:
    package dragon;

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.List;
    import java.util.UUID;

    public class DownloadUtil {
        public static void download(List<String> strs) {
            strs.stream().forEach(u -> {
                try {
                    URL url = new URL(u);
                    URLConnection urlConnection = url.openConnection();
                    String contentType = urlConnection.getContentType();
                    // Size of the remote resource, or -1 if the server does not report it.
                    long size = urlConnection.getContentLengthLong();
                    if (contentType != null && contentType.contains("image/")) {
                        // Pick the file extension from the content type.
                        String fileType = null;
                        if (contentType.contains("jpeg")) {
                            fileType = ".jpeg";
                        } else if (contentType.contains("jpg")) {
                            fileType = ".jpg";
                        } else {
                            fileType = ".png";
                        }
                        // GIF images do not seem to display correctly.
                        // File name: current timestamp + random UUID + extension.
                        String fileName = System.currentTimeMillis() + UUID.randomUUID().toString() + fileType;
                        // Download on a separate thread.
                        Runnable r = () -> {
                            File imageFile = new File(Client.downloadFilePath, fileName);
                            boolean failed = false;
                            try (BufferedInputStream bis = new BufferedInputStream(urlConnection.getInputStream());
                                 BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(imageFile))) {
                                int hasRead = 0;
                                byte[] b = new byte[1024];
                                while ((hasRead = bis.read(b)) != -1) {
                                    bos.write(b, 0, hasRead);
                                }
                            } catch (IOException e) {
                                failed = true;
                                System.out.println("Download failed!");
                                e.printStackTrace();
                            }
                            // Delete images whose download failed. The length is read before
                            // deleting, and the size check is skipped when the size is unknown (-1).
                            long actual = imageFile.length();
                            if (failed || (size != -1 && actual != size)) {
                                boolean result = imageFile.delete();
                                System.out.println(actual + " " + size + " " + fileName + " deleted: " + result);
                            } else {
                                System.out.println("Download finished: " + fileName);
                            }
                        };
                        Thread t = new Thread(r);
                        t.start(); // start the download thread
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                    System.out.println("Bad link!");
                }
            });
        }
    }
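Screenshot of a run
With this change, images broken by network errors are deleted right away, and the error caused by the code has been fixed.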
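A different way to keep broken files out of the download directory, which the article does not use, is to write each download to a temporary file first and rename it only after the size check passes. The following is a sketch of that alternative using java.nio.file; SafeSaveDemo and save are hypothetical names, and an expected size of -1 is taken to mean "unknown".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class SafeSaveDemo {
    // Saves the stream to target only if the expected number of bytes arrived.
    // Returns false, leaving no file at target, when the sizes differ.
    static boolean save(InputStream in, Path target, long expected) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "download", ".part");
        try {
            // REPLACE_EXISTING is needed because createTempFile already created the file.
            long written = Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            if (expected != -1 && written != expected) {
                return false;
            }
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
            return true;
        } finally {
            // Removes the partial file on failure; does nothing after a successful move.
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("safe-save-demo");
        byte[] image = {1, 2, 3};
        System.out.println(save(new ByteArrayInputStream(image), dir.resolve("ok.bin"), 3));
        System.out.println(save(new ByteArrayInputStream(image), dir.resolve("bad.bin"), 10));
    }
}
```

With this approach a file under its final name is always complete, so nothing ever has to be deleted from the download directory afterwards.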
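Note: some images still cannot be displayed. This may be because the site does not allow them to be crawled, so what gets downloaded is not a usable image; others are in a format that is not supported.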
