Faiss: Facebook's Open-Source Vector Search Library

Contents

Preface
Installation
Uncompressed vector search
- Brute-force search
- Cluster-based search
Compressed vector search

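Preface

Faiss is a library for large-scale vector similarity search, open-sourced by Facebook. Similarity is measured by L2 (Euclidean) distance or inner product. The core is written in C++, most of the built-in algorithms support GPU acceleration, and it offers both C++ and Python APIs, with the Python API working directly on NumPy arrays. When the dataset is small, the raw vectors can all be loaded into memory. When it is very large, for example billions of vectors, the vectors can be compressed so that they fit in memory, at the cost of some search accuracy.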
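
To get a feel for why compression matters at the billion scale, here is a rough back-of-the-envelope sketch. It assumes float32 storage for raw vectors and borrows d = 64 and m = 8 one-byte codes per vector from the examples later in this article. The one-billion figure is illustrative, and the estimate ignores vector ids, codebooks and other index overhead.

```python
def flat_bytes(n, d, bytes_per_component=4):
    """Bytes needed to store n uncompressed float32 vectors of dimension d."""
    return n * d * bytes_per_component


def pq_code_bytes(n, m, nbits=8):
    """Bytes needed for n compressed codes: m sub-codes of nbits bits each."""
    return n * m * nbits // 8


n = 1_000_000_000   # one billion vectors (illustrative)
d = 64              # dimension used throughout this article
m = 8               # sub-codes per vector, as in the IVFPQ example

raw = flat_bytes(n, d)
compressed = pq_code_bytes(n, m)
print(f"raw float32: {raw / 1e9:.0f} GB")
print(f"PQ codes:    {compressed / 1e9:.0f} GB")
print(f"ratio:       {raw // compressed}x")
```

Under these assumptions the raw vectors would not fit in the memory of a typical single machine, while the compressed codes plausibly would.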

Installation

My machine has no GPU, so everything in this article runs on the CPU.

 # CPU version
 conda install faiss-cpu -c pytorch

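Uncompressed vector search

All code in this article is written in Python, and most of it comes from the official Faiss tutorial.
First, set the parameters and generate the vector sets: a database xb and a set of queries xq.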

import numpy as np
d = 64                           # dimension
nb = 100000                      # database size
nq = 10000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.

Brute-force search

Every index needs the vector dimension passed in explicitly.

# Import faiss and build an L2 index
import faiss                   # make faiss available
index = faiss.IndexFlatL2(d)   # build the index
print(index.is_trained)
index.add(xb)                  # add vectors to the index
print(index.ntotal)

Search

k = 4                          # we want to see 4 nearest neighbors
D, I = index.search(xb[:5], k) # sanity check
print(I)
print(D)
D, I = index.search(xq, k)     # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])                  # neighbors of the 5 last queries

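The search() function takes two arguments: the set of query vectors and the number of nearest neighbors to find. It returns the distances to the k nearest neighbors (D) and their indices in the database (I).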
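
To make clear what D and I contain, here is a minimal NumPy sketch of an exhaustive L2 search. It is not Faiss's implementation, only an illustration of the same idea. The helper name brute_force_l2 and the small random dataset are made up for this example. Like IndexFlatL2, it reports squared distances.

```python
import numpy as np


def brute_force_l2(xb, xq, k):
    """Return (D, I): squared L2 distances and indices of the k nearest rows of xb."""
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, evaluated for all pairs at once
    q_norms = (xq ** 2).sum(axis=1, keepdims=True)
    b_norms = (xb ** 2).sum(axis=1)
    dist = q_norms - 2.0 * (xq @ xb.T) + b_norms
    np.maximum(dist, 0.0, out=dist)          # clamp tiny negatives from rounding
    I = np.argsort(dist, axis=1)[:, :k]      # k smallest distances per query
    D = np.take_along_axis(dist, I, axis=1)
    return D, I


rng = np.random.default_rng(1234)
xb = rng.random((1000, 16)).astype("float32")

# Sanity check: query with the first 5 database vectors themselves
D, I = brute_force_l2(xb, xb[:5], k=4)
print(I)
print(D)
```

In the sanity check each query is itself a database vector, so the first column of I should be the query's own index and the first column of D should be close to zero. The same holds for the sanity check in the Faiss code above.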

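Cluster-based search

Apart from brute-force (flat) search, most index types must be trained explicitly. Once training is done, vectors can be added to the index and searches can be run. Cluster-based search is simple and intuitive: the vectors are first clustered into nlist clusters, which gives one centroid per cluster. For a k-nearest-neighbor query, the nprobe centroids closest to the query vector are found, and only the vectors in those clusters are searched. This leaves two parameters, nlist and nprobe. For a fixed nlist, a larger nprobe gives more accurate results but slower queries. nprobe defaults to 1.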

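The two similarity metrics used here are faiss.METRIC_L2 and faiss.METRIC_INNER_PRODUCT: Euclidean distance and vector inner product. Inner product equals cosine similarity only when the vectors are normalized to unit length. IndexIVFFlat uses METRIC_L2 when no metric is passed.
Besides comparing the query with database vectors, this method also has to compute distances to the cluster centroids. That is the job of the quantizer in the code below.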

nlist = 100
k = 4
quantizer = faiss.IndexFlatL2(d)  # the other index
index = faiss.IndexIVFFlat(quantizer, d, nlist)
# Pass the metric explicitly (METRIC_L2 is also the default)
# index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
assert not index.is_trained
index.train(xb)
assert index.is_trained
index.add(xb)                  # add may be a bit slower as well
D, I = index.search(xq, k)     # actual search
print(I[-5:])                  # neighbors of the 5 last queries
index.nprobe = 10              # default nprobe is 1, try a few more
D, I = index.search(xq, k)
print(I[-5:])                  # neighbors of the 5 last queries
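
To make the roles of nlist and nprobe concrete, here is a small NumPy sketch of the inverted-file idea. It is a toy version under simplifying assumptions (a few plain k-means iterations, tiny random data), not how Faiss implements IndexIVFFlat. The helper names train_centroids and ivf_search are made up for this example.

```python
import numpy as np


def sq_dists(a, b):
    """Pairwise squared L2 distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)


def train_centroids(x, nlist, n_iter=10, seed=0):
    """A few Lloyd (k-means) iterations; returns nlist centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), nlist, replace=False)].copy()
    for _ in range(n_iter):
        assign = sq_dists(x, centroids).argmin(axis=1)
        for c in range(nlist):
            members = x[assign == c]
            if len(members):              # keep the old centroid if a cell is empty
                centroids[c] = members.mean(axis=0)
    return centroids


def ivf_search(xq, xb, centroids, lists, k, nprobe):
    """Search only the nprobe cells whose centroids are closest to each query."""
    probe = np.argsort(sq_dists(xq, centroids), axis=1)[:, :nprobe]
    I = np.full((len(xq), k), -1, dtype=np.int64)   # -1 marks "no result"
    for qi, q in enumerate(xq):
        cand = np.concatenate([lists[c] for c in probe[qi]])
        d = ((xb[cand] - q) ** 2).sum(axis=1)
        order = np.argsort(d)[:k]
        I[qi, :len(order)] = cand[order]
    return I


rng = np.random.default_rng(1234)
xb = rng.random((2000, 8)).astype("float32")
xq = rng.random((50, 8)).astype("float32")
nlist, k = 16, 4

# "train" learns the centroids, "add" fills one inverted list per centroid
centroids = train_centroids(xb, nlist)
assign = sq_dists(xb, centroids).argmin(axis=1)
lists = [np.flatnonzero(assign == c) for c in range(nlist)]

# Compare against exhaustive search for several nprobe values
exact = np.argsort(sq_dists(xq, xb), axis=1)[:, :k]
for nprobe in (1, 4, nlist):
    I = ivf_search(xq, xb, centroids, lists, k, nprobe)
    recall = np.mean([len(set(a) & set(b)) / k for a, b in zip(I, exact)])
    print(f"nprobe={nprobe:2d}  recall@{k}={recall:.2f}")
```

With nprobe equal to nlist every cell is visited, so the search degenerates to brute force and should match the exact result. Smaller values trade recall for speed, which is the same trade-off seen when nprobe is raised from 1 to 10 in the Faiss code above.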

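Compressed vector search

Uncompressed search requires every vector to be held in memory. When the data does not fit, the vectors have to be compressed. Distances between compressed vectors are only approximate, so results are less accurate than with the previous method. Both the compression method and its strength are configurable.
Apart from the compression, everything works as in uncompressed cluster-based search.

# Code example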
nlist = 100
m = 8                             # number of subquantizers
k = 4
quantizer = faiss.IndexFlatL2(d)  # this remains the same
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
                                    # 8 specifies that each sub-vector is encoded as 8 bits
index.train(xb)
index.add(xb)
D, I = index.search(xb[:5], k) # sanity check
print(I)
print(D)
index.nprobe = 10              # make comparable with experiment above
D, I = index.search(xq, k)     # search
print(I[-5:])
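
The m and 8-bit arguments above control product quantization: each vector is cut into m sub-vectors, and each sub-vector is replaced by the index of its nearest centroid in a codebook of 2^8 = 256 entries. The NumPy sketch below illustrates that encoding step only. It is a simplified toy (plain k-means, no inverted lists, smaller d and m than above to keep it fast), not Faiss's implementation, and the helper names are made up for this example.

```python
import numpy as np


def kmeans(x, k, n_iter=5, seed=0):
    """A few Lloyd iterations; returns k centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), k, replace=False)].copy()
    for _ in range(n_iter):
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for c in range(k):
            members = x[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids


def pq_train(x, m, nbits):
    """Learn one codebook of 2**nbits centroids per sub-vector slice."""
    dsub = x.shape[1] // m
    return [kmeans(x[:, j * dsub:(j + 1) * dsub], 2 ** nbits, seed=j) for j in range(m)]


def pq_encode(x, codebooks):
    """Replace each sub-vector by the index of its nearest centroid (needs nbits <= 8)."""
    m = len(codebooks)
    dsub = x.shape[1] // m
    codes = np.empty((len(x), m), dtype=np.uint8)
    for j, cb in enumerate(codebooks):
        sub = x[:, j * dsub:(j + 1) * dsub]
        codes[:, j] = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return codes


def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the selected centroids."""
    return np.hstack([cb[codes[:, j]] for j, cb in enumerate(codebooks)])


rng = np.random.default_rng(1234)
x = rng.random((4000, 16)).astype("float32")
codebooks = pq_train(x, m=4, nbits=8)
codes = pq_encode(x, codebooks)
approx = pq_decode(codes, codebooks)

err = float(((x - approx) ** 2).sum(axis=1).mean())
print("bytes per vector, raw:", x[0].nbytes)
print("bytes per vector, PQ :", codes[0].nbytes)
print("mean squared reconstruction error:", err)
```

The reconstruction is lossy, which is why distances computed on compressed vectors are approximate and why the sanity check in the Faiss code above may not report a distance of exactly zero for a vector queried against itself.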

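One thing to note: when using cosine similarity for search, all vectors (database and queries) must be normalized to unit length, so that the inner product equals the cosine similarity.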
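
The following NumPy sketch checks that claim numerically on random data. The helper l2_normalize is written here for illustration. Faiss also provides faiss.normalize_L2, which normalizes a float32 array in place, and the normalized vectors would then go into an inner-product index.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Return a copy of x with each row scaled to unit L2 norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)   # eps guards against all-zero rows


rng = np.random.default_rng(1234)
xb = rng.random((1000, 64)).astype("float32")
xq = rng.random((5, 64)).astype("float32")

# Cosine similarity computed directly from its definition
q_norm = np.linalg.norm(xq, axis=1)[:, None]
b_norm = np.linalg.norm(xb, axis=1)[None, :]
cosine = (xq @ xb.T) / (q_norm * b_norm)

# Inner product of the normalized vectors gives the same numbers
ip_normalized = l2_normalize(xq) @ l2_normalize(xb).T
print(np.allclose(cosine, ip_normalized, atol=1e-5))

# On the raw vectors, the best match by inner product may differ from the best by cosine
print((np.argmax(xq @ xb.T, axis=1) == np.argmax(cosine, axis=1)).tolist())
```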

Original article: http://outofmemory.cn/zaji/5480795.html
