倒排索引构建算法BSBI和SPIMI

算法介绍在信息搜索领域，构建索引一直是是一种非常有效的方式，但是当搜索引擎面对的是海量数据的时候，你如果要从茫茫人海的数据中去找出数据，显然这不是一个很好的办法。于是倒排索引这个概念就被提了出来。再说倒排索引概念之前，先要理解一下，一般的
算法介绍在信息搜索领域，构建索引一直是是一种非常有效的方式，但是当搜索引擎面对的是海量数据的时候，你如果要从茫茫人海的数据中去找出数据，显然这不是一个很好的办法。于是倒排索引这个概念就被提了出来。再说倒排索引概念之前，先要理解一下，一般的索引检索信息的方式。比如原始的数据源假设都是以文档的形式被分开，文档1拥有一段内容，文档2也富含一段内容，文档3同样如此。然后给定一个关键词，要搜索出与此关键词相关的文档，自然而然我们联想到的办法就是一个个文档的内容去比较，判断是否含有此关键词，如果含有则返回这个文档的索引地址，如果不是接着用后面的文档去比，这就有点类似于字符串的匹配类似。很显然，当数据量非常巨大的时候，这种方式并不适用。原来的这种方式可以理解为是索引-->关键词，而倒排索引的形式则是关键词--->索引位置，也就是说，给出一个关键词信息，我能立马根据倒排索引的信息得出他的位置。当然，这里说的是倒排索引最后要达到的效果，至于是用什么方式实现，就不止一种了，本文所述的就是其中比较出名的bsbi和spimi算法。
算法的原理这里首先给出一个具体的实例来了解一般的构造过程，先避开具体的实现方式，给定下面一组词句。
doc1：mike spoken english frequently at home.and he can write english every day.
doc2:：mike plays football very well.
首先我们必须知道，我们需要的是一些关键的信息，诸如一些修饰词等等都需要省略，动词的时态变化等都需要还原，如果代词指的是同个人也能够省略，于是上面的句子可以简化成
doc1：mike spoken english home.write english.
doc2：mike play football.
下面进行索引的倒排构建，因为mike出现在文档1和文档2 中，所以mike:{1, 2}后面的词的构造同样的道理。最后的关系就会构成词对应于索引位置的映射关系。理解了这个过程之后呢，可以介绍一下本文主要要说的bsbi(基于磁盘的外部排序构建索引)和spimi(内存单遍扫描构建索引)算法了，一般来说，后者比前者常用。
bsbi此算法的主要步骤如下：
1、将文档中的词进行id的映射，这里可以用hash的方法去构造
2、将文档分割成大小相等的部分。
3、将每部分按照词id对上文档id的方式进行排序
4、将每部分排序好后的结果进行合并，最后写出到磁盘中。
5、然后递归的执行，直到文档内容全部完成这一系列操作。
这里有一张示意图:
在算法的过程中会用到读缓冲区和写缓冲区，至于期间的大小多少如何配置都是看个人的，我在后面的代码实现中也有进行设置。至于其中的排序算法的选择，一般建议使用效果比较好的快速排序算法，但是我在后面为了方便，直接用了自己更熟悉的冒泡排序算法，这个也看个人。
spimi接下来说说spimi算法，就是内存单遍扫描算法，这个算法与上面的算法一上来就有直接不同的特点就是他无须做id的转换，还是采用了词对索引的直接关联。还有1个比较大的特点是他不经过排序，直接按照先后顺序构建索引，算法的主要步骤如下：
1、对每个块构造一个独立的倒排索引。
2、最后将所有独立的倒排索引进行合并就ok了。
本人为了方便就把这个算法的实现简洁化了，直接在内存中完成所有的构建工作。望读者稍加注意。spimi相对比较的简单，这里就不给出截图了。
算法的代码实现首先是文档的输入数据，采用了2个一样的文档，我也是实在想不出有更好的测试数据了
doc1.txt:
mike studyed english hardly yesterdayhe got the 100 at the last examhe thinks english is very interesting
doc2.txt:
mike studyed english hardly yesterdayhe got the 100 at the last examhe thinks english is very interesting
下面是文档信息预处理类pretreattool.java:
package invertedindex;import java.io.bufferedreader;import java.io.file;import java.io.filenotfoundexception;import java.io.fileoutputstream;import java.io.filereader;import java.io.ioexception;import java.io.printstream;import java.util.arraylist;import java.util.regex.matcher;import java.util.regex.pattern;/** * 文档预处理工具类 * * @author lyq * */public class pretreattool { // 一些无具体意义的过滤词 public static string[] filter_words = new string[] { at, at, the, the, is, very }; // 批量文档的文件地址 private arraylist docfilepaths; // 输出的有效词的存放路径 private arraylist effectwordpaths; public pretreattool(arraylist docfilepaths) { this.docfilepaths = docfilepaths; } /** * 获取文档有效词文件路径 * * @return */ public arraylist getefwpaths() { return this.effectwordpaths; } /** * 从文件中读取数据 * * @param filepath * 单个文件 */ private arraylist readdatafile(string filepath) { file file = new file(filepath); arraylist dataarray = new arraylist(); arraylist words = new arraylist(); try { bufferedreader in = new bufferedreader(new filereader(file)); string str; string[] temparray; while ((str = in.readline()) != null) { temparray = str.split( ); dataarray.add(temparray); } in.close(); } catch (ioexception e) { e.getstacktrace(); } // 将每行词做拆分加入到总列表容器中 for (string[] array : dataarray) { for (string word : array) { words.add(word); } } return words; } /** * 对文档内容词汇进行预处理 */ public void pretreatwords() { string baseoutputpath = ; int endpos = 0; arraylist tempwords = null; effectwordpaths = new arraylist(); for (string filepath : docfilepaths) { tempwords = readdatafile(filepath); filterwords(tempwords, true); // 重新组装出新的输出路径 endpos = filepath.lastindexof(.); baseoutputpath = filepath.substring(0, endpos); writeoutoperation(tempwords, baseoutputpath + -efword.txt); effectwordpaths.add(baseoutputpath + -efword.txt); } } /** * * 对文档中的词语进行过滤操作 * * @param words * 待处理文档词语 * @param canrepeated * 有效词是否可以重复 */ private void filterwords(arraylist words, boolean canrepeated) { boolean isfilterword; // 做形容词匹配 pattern adjpattern; // 做动词时态的匹配 pattern formerpattern; // 数字匹配 pattern numberpattern; matcher adjmatcher; matcher formermatcher; matcher numbermatcher; arraylist deletewords = new arraylist(); adjpattern = pattern.compile(.*(ly$|ful$|ing$)); formerpattern = pattern.compile(.*ed$); numberpattern = pattern.compile([0-9]+(.[0-9]+)?); string w; for (int i = 0; i < words.size(); i++) { w = words.get(i); isfilterword = false; for (string fw : filter_words) { if (fw.equals(w)) { deletewords.add(w); isfilterword = true; break; } } if (isfilterword) { continue; } adjmatcher = adjpattern.matcher(w); formermatcher = formerpattern.matcher(w); numbermatcher = numberpattern.matcher(w); // 将词语统一小写字母化 w = w.tolowercase(); // 如果是形容词,副词形式的或是纯数字的词，则进行过滤 if (adjmatcher.matches() || numbermatcher.matches()) { deletewords.add(w); } else if (formermatcher.matches()) { // 如果是ed结尾表明是动词的在时态方面的变化，进行变化，转为原有动词的形式，截去最末尾2个额外添加的后缀词 w = w.substring(0, w.length() - 2); } words.set(i, w); } // 进行无效词的过滤 words.removeall(deletewords); deletewords.clear(); string s1; string s2; // 进行词语的去重 for (int i = 0; i < words.size() - 1; i++) { s1 = words.get(i); for (int j = i + 1; j < words.size(); j++) { s2 = words.get(j); // 找到存在相同的词了，就挑出循环 if (s1.equals(s2)) { deletewords.add(s1); break; } } } // 删除多余重复的词语 words.removeall(deletewords); words.addall(deletewords); } /** * 将数据写出到磁盘文件操作，如果文件已经存在，则在文件尾部进行内容追加 * * @param buffer * 当前写缓冲中的数据 * @param filepath * 输出地址 */ private void writeoutoperation(arraylist buffer, string filepath) { stringbuilder strbuilder = new stringbuilder(); // 将缓冲中的数据组成字符写入到文件中 for (string word : buffer) { strbuilder.append(word); strbuilder.append(\n); } try { file file = new file(filepath); printstream ps = new printstream(new fileoutputstream(file)); ps.print(strbuilder.tostring());// 往文件里写入字符串 } catch (filenotfoundexception e) { // todo auto-generated catch block e.printstacktrace(); } }}
文档类document.java:
package invertedindex;import java.util.arraylist;/** * 文档类 * @author lyq * */public class document { //文档的唯一标识 int docid; //文档的文件地址 string filepath; //文档中的有效词 arraylist effectwords; public document(arraylist effectwords, string filepath){ this.effectwords = effectwords; this.filepath = filepath; } public document(arraylist effectwords, string filepath, int docid){ this(effectwords, filepath); this.docid = docid; }}
bsbi算法工具类bsbitool.java:
package invertedindex;import java.io.bufferedreader;import java.io.file;import java.io.filenotfoundexception;import java.io.fileoutputstream;import java.io.filereader;import java.io.ioexception;import java.io.printstream;import java.util.arraylist;import java.util.hashmap;import java.util.map;/** * bsbi基于磁盘的外部排序算法 * * @author lyq * */public class bsbitool { // 文档唯一标识id public static int doc_id = 0; // 读缓冲区的大小 private int readbuffersize; // 写缓冲区的大小 private int writebuffersize; // 读入的文档的有效词文件地址 private arraylist effectivewordfiles; // 倒排索引输出文件地址 private string outputfilepath; // 读缓冲 1 private string[][] readbuffer1; // 读缓冲2 private string[][] readbuffer2; // 写缓冲区 private string[][] writebuffer; // 有效词与hashcode的映射 private map code2word; public bsbitool(arraylist effectivewordfiles, int readbuffersize, int writebuffersize) { this.effectivewordfiles = effectivewordfiles; this.readbuffersize = readbuffersize; this.writebuffersize = writebuffersize; initbuffers(); } /** * 初始化缓冲区的设置 */ private void initbuffers() { readbuffer1 = new string[readbuffersize][2]; readbuffer2 = new string[readbuffersize][2]; writebuffer = new string[writebuffersize][2]; } /** * 从文件中读取有效词并进行编码替换 * * @param filepath * 返回文档 */ private document readeffectwords(string filepath) { long hashcode = 0; string w; document document; code2word = new hashmap(); arraylist words; words = readdatafile(filepath); for (int i = 0; i < words.size(); i++) { w = words.get(i); hashcode = bkdrhash(w); hashcode = hashcode % 10000; // 将有效词的hashcode取模值作为对应的代表 code2word.put(hashcode + , w); w = hashcode + ; words.set(i, w); } document = new document(words, filepath, doc_id); doc_id++; return document; } /** * 将字符做哈希值的转换 * * @param str * 待转换字符 * @return */ private long bkdrhash(string str) { int seed = 31; /* 31 131 1313 13131 131313 etc.. */ long hash = 0; int i = 0; for (i = 0; i < str.length(); i++) { hash = (hash * seed) + (str.charat(i)); } return hash; } /** * 根据输入的有效词输出倒排索引文件 */ public void outputinvertedfiles() { int index = 0; string basefilepath = ; outputfilepath = ; document doc; arraylist temppaths; arraylist inverteddata1; arraylist inverteddata2; temppaths = new arraylist(); for (string filepath : effectivewordfiles) { doc = readeffectwords(filepath); writeoutfile(doc); index = doc.filepath.lastindexof(.); basefilepath = doc.filepath.substring(0, index); writeoutoperation(writebuffer, basefilepath + -temp.txt); temppaths.add(basefilepath + -temp.txt); } outputfilepath = basefilepath + -bsbi-inverted.txt; // 将中间产生的倒排索引数据进行总的合并并输出到一个文件中 for (int i = 1; i < temppaths.size(); i++) { if (i == 1) { inverteddata1 = readinvertedfile(temppaths.get(0)); } else { inverteddata1 = readinvertedfile(outputfilepath); } inverteddata2 = readinvertedfile(temppaths.get(i)); mergeinverteddata(inverteddata1, inverteddata2, false, outputfilepath); writeoutoperation(writebuffer, outputfilepath, false); } } /** * 将文档的最终的倒排索引结果写出到文件 * * @param doc * 待处理文档 */ private void writeoutfile(document doc) { // 在读缓冲区中是否需要再排序 boolean ifsort = true; int index = 0; string basefilepath; string[] temp; arraylist tempwords = (arraylist) doc.effectwords .clone(); arraylist inverteddata1; arraylist inverteddata2; inverteddata1 = new arraylist(); inverteddata2 = new arraylist(); // 将文档的数据平均拆分成2份，用于读入后面的2个缓冲区中 for (int i = 0; i < tempwords.size() / 2; i++) { temp = new string[2]; temp[0] = tempwords.get(i); temp[1] = doc.docid + ; inverteddata1.add(temp); temp = new string[2]; temp[0] = tempwords.get(i + tempwords.size() / 2); temp[1] = doc.docid + ; inverteddata2.add(temp); } // 如果是奇数个，则将最后一个补入 if (tempwords.size() % 2 == 1) { temp = new string[2]; temp[0] = tempwords.get(tempwords.size() - 1); temp[1] = doc.docid + ; inverteddata2.add(temp); } index = doc.filepath.lastindexof(.); basefilepath = doc.filepath.substring(0, index); mergeinverteddata(inverteddata1, inverteddata2, ifsort, basefilepath + -temp.txt); } /** * 合并读缓冲区数据写到写缓冲区中，用到了归并排序算法 * * @param outputpath * 写缓冲区的写出的路径 */ private void mergewordbuffers(string outputpath) { int i = 0; int j = 0; int num1 = 0; int num2 = 0; // 写缓冲区下标 int writeindex = 0; while (readbuffer1[i][0] != null && readbuffer2[j][0] != null) { num1 = integer.parseint(readbuffer1[i][0]); num2 = integer.parseint(readbuffer2[j][0]); // 如果缓冲1小，则优先存缓冲1到写缓冲区中 if (num1 0 && inverteddata2.size() > 0) { readbuffer1[rindex1][0] = inverteddata1.get(0)[0]; readbuffer1[rindex1][1] = inverteddata1.get(0)[1]; readbuffer2[rindex2][0] = inverteddata2.get(0)[0]; readbuffer2[rindex2][1] = inverteddata2.get(0)[1]; inverteddata1.remove(0); inverteddata2.remove(0); rindex1++; rindex2++; if (rindex1 == readbuffersize) { if (ifsort) { wordbuffersort(readbuffer1); wordbuffersort(readbuffer2); } mergewordbuffers(outputpath); initbuffers(); } } if (ifsort) { wordbuffersort(readbuffer1); wordbuffersort(readbuffer2); } mergewordbuffers(outputpath); readbuffer1 = new string[readbuffersize][2]; readbuffer2 = new string[readbuffersize][2]; if (inverteddata1.size() == 0 && inverteddata2.size() > 0) { readremaindatatorb(inverteddata2, outputpath); } else if (inverteddata1.size() > 0 && inverteddata2.size() == 0) { readremaindatatorb(inverteddata1, outputpath); } } /** * 剩余的有效词数据读入读缓冲区 * * @param remaindata * 剩余数据 * @param outputpath * 输出文件路径 */ private void readremaindatatorb(arraylist remaindata, string outputpath) { int rindex = 0; while (remaindata.size() > 0) { readbuffer1[rindex][0] = remaindata.get(0)[0]; readbuffer1[rindex][1] = remaindata.get(0)[1]; remaindata.remove(0); rindex++; // 读缓冲区写满，进行写入到写缓冲区中 if (readbuffer1[readbuffersize - 1][0] != null) { wordbuffersort(readbuffer1); writeremainreadbuffer(readbuffer1, 0, outputpath); initbuffers(); } } wordbuffersort(readbuffer1); writeremainreadbuffer(readbuffer1, 0, outputpath); } /** * 缓冲区数据进行排序 * * @param buffer * 缓冲空间 */【本文来自鸿网互联 (http://www.68idc.cn)】 private void wordbuffersort(string[][] buffer) { string[] temp; int k = 0; long num1 = 0; long num2 = 0; for (int i = 0; i < buffer.length - 1; i++) { // 缓冲区可能没填满 if (buffer[i][0] == null) { continue; } k = i; for (int j = i + 1; j < buffer.length; j++) { // 缓冲区可能没填满 if (buffer[j][0] == null) { continue; } // 获取2个缓冲区小块的起始编号值 num1 = long.parselong(buffer[k][0]); num2 = long.parselong(buffer[j][0]); if (num2 < num1) { k = j; } } if (k != i) { temp = buffer[k]; buffer[k] = buffer[i]; buffer[i] = temp; } } } /** * 从文件中读取倒排索引数据 * * @param filepath * 单个文件 */ private arraylist readinvertedfile(string filepath) { file file = new file(filepath); arraylist dataarray = new arraylist(); try { bufferedreader in = new bufferedreader(new filereader(file)); string str; string[] temparray; while ((str = in.readline()) != null) { temparray = str.split( ); dataarray.add(temparray); } in.close(); } catch (ioexception e) { e.getstacktrace(); } return dataarray; } /** * 从文件中读取数据 * * @param filepath * 单个文件 */ private arraylist readdatafile(string filepath) { file file = new file(filepath); arraylist dataarray = new arraylist(); arraylist words = new arraylist(); try { bufferedreader in = new bufferedreader(new filereader(file)); string str; string[] temparray; while ((str = in.readline()) != null) { temparray = str.split( ); dataarray.add(temparray); } in.close(); } catch (ioexception e) { e.getstacktrace(); } // 将每行词做拆分加入到总列表容器中 for (string[] array : dataarray) { for (string word : array) { if (!word.equals()) { words.add(word); } } } return words; }}
spimi算法工具类spimitool.java:
package invertedindex;import java.io.bufferedreader;import java.io.file;import java.io.filenotfoundexception;import java.io.fileoutputstream;import java.io.filereader;import java.io.ioexception;import java.io.printstream;import java.util.arraylist;/** * spimi内存式单边扫描构建算法 * @author lyq * */public class spimitool { //倒排索引输出文件地址 private string outputfilepath; // 读入的文档的有效词文件地址 private arraylist effectivewordfiles; // 内存缓冲区，不够还能够在增加空间 private arraylist buffers; public spimitool(arraylist effectivewordfiles){ this.effectivewordfiles = effectivewordfiles; } /** * 从文件中读取数据 * * @param filepath * 单个文件 */ private arraylist readdatafile(string filepath) { file file = new file(filepath); arraylist dataarray = new arraylist(); arraylist words = new arraylist(); try { bufferedreader in = new bufferedreader(new filereader(file)); string str; string[] temparray; while ((str = in.readline()) != null) { temparray = str.split( ); dataarray.add(temparray); } in.close(); } catch (ioexception e) { e.getstacktrace(); } // 将每行词做拆分加入到总列表容器中 for (string[] array : dataarray) { for (string word : array) { words.add(word); } } return words; } /** * 根据已有的文档数据进行倒排索引文件的构建 * @param docs * 文档集合 */ private void writeinvertedindex(arraylist docs){ arraylist datas; string[] recorddata; buffers = new arraylist(); for(document tempdoc: docs){ datas = tempdoc.effectwords; for(string word: datas){ recorddata = new string[2]; recorddata[0] = word; recorddata[1] = tempdoc.docid + ; addrecordtobuffer(recorddata); } } //最后将数据写出到磁盘中 writeoutoperation(buffers, outputfilepath); } /** * 将新读入的数据记录读入到内存缓冲中，如果存在则加入到倒排记录表中 * @param inserteddata * 待插入的数据 */ private void addrecordtobuffer(string[] inserteddata){ boolean iscontained = false; string wordname; wordname = inserteddata[0]; for(string[] array: buffers){ if(array[0].equals(wordname)){ iscontained = true; //添加倒排索引记录，以：隔开 array[1] += : + inserteddata[1]; break; } } //如果没有包含，则说明是新的数据,直接添加 if(!iscontained){ buffers.add(inserteddata); } } /** * 将数据写出到磁盘文件操作，如果文件已经存在，则在文件尾部进行内容追加 * @param buffer * 当前写缓冲中的数据 * @param filepath * 输出地址 */ private void writeoutoperation(arraylist buffer, string filepath) { stringbuilder strbuilder = new stringbuilder(); //将缓冲中的数据组成字符写入到文件中 for(string[] array: buffer){ strbuilder.append(array[0]); strbuilder.append( ); strbuilder.append(array[1]); strbuilder.append(\n); } try { file file = new file(filepath); printstream ps = new printstream(new fileoutputstream(file)); ps.println(strbuilder.tostring());// 往文件里写入字符串 } catch (filenotfoundexception e) { // todo auto-generated catch block e.printstacktrace(); } } /** * 构造倒排索引文件 */ public void createinvertedindexfile(){ int docid = 1; string basefilepath; string filename; string p; int index1 = 0; int index2 = 0; document tempdoc; arraylist words; arraylist docs; outputfilepath = spimi; docs = new arraylist(); p = effectivewordfiles.get(0); //提取文件名称 index1 = p.lastindexof(\\); basefilepath = p.substring(0, index1+1); outputfilepath = basefilepath + spimi; for(string path: effectivewordfiles){ //获取文档有效词 words = readdatafile(path); tempdoc = new document(words, path, docid); docid++; docs.add(tempdoc); //提取文件名称 index1 = path.lastindexof(\\); index2 = path.lastindexof(.); filename = path.substring(index1+1, index2); outputfilepath += - + filename; } outputfilepath += .txt; //根据文档数据进行倒排索引文件的创建 writeinvertedindex(docs); }}
算法测试类client.java:
package invertedindex;import java.util.arraylist;/** * 倒排索引测试类 * @author lyq * */public class client { public static void main(string[] args){ //读写缓冲区的大小 int readbuffersize; int writebuffersize; string basefilepath; pretreattool pretool; //bsbi基于磁盘的外部排序算法 bsbitool btool; //spimi内存式单边扫描构建算法 spimitool stool; //有效词文件路径 arraylist efwfilepaths; arraylist docfilepaths; readbuffersize = 10; writebuffersize = 20; basefilepath = c:\\users\\lyq\\desktop\\icon\\; docfilepaths = new arraylist(); docfilepaths.add(basefilepath + doc1.txt); docfilepaths.add(basefilepath + doc2.txt); //文档预处理工具类 pretool = new pretreattool(docfilepaths); pretool.pretreatwords(); //预处理完获取有效词文件路径 efwfilepaths = pretool.getefwpaths(); btool = new bsbitool(efwfilepaths, readbuffersize, writebuffersize); btool.outputinvertedfiles(); stool = new spimitool(efwfilepaths); stool.createinvertedindexfile(); }}
算法的输出：
为了模拟出真实性，算法的输出都是以文件的形式。
首先是预处理类处理之后的有效词文件doc1-efword.txt和doc2-efword.txt:
mikestudyyesterdaygotlastexamthinksenglishhe
可以看见，一些修饰词什么的已经被我过滤掉了。
下面是bsbi算法生成的中间文件，就是映射成编码的文件，也许你看了这些数值真实表示的是什么词语：
1426 01542 02540 03056 03325 04326 04897 06329 07327 0
还有文档2的临时文件：
1426 11542 12540 13056 13325 14326 14897 16329 17327 1
将这2个文档的信息进行合并最终输出的倒排索引文件为：
yesterday 0:1mike 0:1got 0:1english 0:1he 0:1last 0:1thinks 0:1study 0:1exam 0:1
同样的spimi算法输出的结果：
mike 1:2study 1:2yesterday 1:2got 1:2last 1:2exam 1:2thinks 1:2english 1:2he 1:2
算法小结
我在实现算法的过程中无疑低估了此算法的难度，尤其是bsbi的实现，因为中间读写缓冲区在做数据操作的时候，各种情况需要判断，诸如写缓冲区满了的时候要刷出到磁盘上，读缓冲区满的时候要通过归并排序移入读缓冲区中，这里面的判断实在过多，加上之前早期没有想到这个问题，导致算法可读性不是很好，就索性把缓冲区设大，先走通这个流程，所以这个算法大家还是以理解为主，就不要拿来实际运用了，同样对于spimi算法一样的道理，算法实现在这里帮助大家更好的理解吧，还有很多不足的地方。还有1点是文档内容预处理的时候，我只是象征性的进行过滤，真实的信息过滤实现复杂程度远远超过我所写的，这里包括了修饰词，时态词的变化，副词等等，这些有时还需要语义挖掘的一些知识来解决，大家意会即可。

倒排索引构建算法BSBI和SPIMI

推荐信息