A Detailed Guide to ANN Nearest-Neighbor Search in Python with the Faiss Library

Posted in Python on August 03, 2020

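Nearest-neighbor search over embeddings is one of the most important recall (candidate retrieval) methods in today's recommender systems. Approaches such as item2vec, matrix factorization, and two-tower DNNs all produce trained user embeddings and item embeddings, and these embeddings can be used very flexibly: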

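Given a user embedding, search for nearby item embeddings to recommend items the user may be interested in
Given a user embedding, search for nearby user embeddings to recommend users the user may be interested in
Given an item embedding, search for nearby item embeddings to recommend items related to that item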
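
To make these three cases concrete, here is a minimal brute-force sketch in NumPy. The names `brute_force_knn`, `item_embeddings`, and `user_embedding`, as well as the random data, are made up for illustration. It shows the plain computation that a search library is meant to speed up, not Faiss itself.

```python
import numpy as np

def brute_force_knn(queries, database, k):
    """Return (squared L2 distances, row positions) of the k nearest database rows per query."""
    # Pairwise squared L2 distances, shape (n_queries, n_database)
    diff = queries[:, None, :] - database[None, :, :]
    dist = np.einsum("qnd,qnd->qn", diff, diff)
    # Sort every row by distance and keep the k closest positions
    order = np.argsort(dist, axis=1)[:, :k]
    return np.take_along_axis(dist, order, axis=1), order

rng = np.random.default_rng(0)
item_embeddings = rng.random((1000, 10), dtype=np.float32)  # hypothetical item vectors
user_embedding = rng.random((1, 10), dtype=np.float32)      # hypothetical query vector

distances, positions = brute_force_knn(user_embedding, item_embeddings, k=5)
print(positions.shape)  # (1, 5)
```

Swapping which matrix plays the role of query and which plays the role of database gives the user-to-user and item-to-item variants. The cost grows linearly with the number of stored vectors, which is why this approach gets slow on large collections.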

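There is an engineering problem, however: once the number of user and item embeddings grows large enough, nearest-neighbor search over them becomes very slow. Precomputing the results offline and storing them in a fast cache such as Redis works, but the results are then far from real time. Ideally the search would run online and return within a few tens of milliseconds.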

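Faiss is a library for similarity search and clustering open-sourced by the Facebook AI team. It provides efficient similarity search and clustering for dense vectors, supports search over billions of vectors, and is one of the most mature approximate nearest-neighbor search libraries currently available.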

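Next, using code from a Jupyter notebook, I will walk through a simple Faiss workflow, covering:

Reading the trained embedding data
Building a Faiss index and adding the embeddings to be searched
Taking a target embedding and searching to get a list of IDs
Looking up the movie titles by ID and returning the results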

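Running fast nearest-neighbor search over already-trained embeddings is an engineering problem. Facebook's Faiss library can build several kinds of embedding indexes that support fast nearest-neighbor search for a target embedding, which is enough to meet the needs of online serving.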

Installation command:

conda install -c pytorch faiss-cpu

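A few lessons from using Faiss, summarized up front:

1. To use your own IDs, wrap faiss.IndexFlatL2 in faiss.IndexIDMap

2. All embedding data must be converted to np.float32, both the embeddings stored in the index and the embeddings used as queries

3. The ids must be converted to int64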
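
Tips 2 and 3 can be folded into one small helper. This is only a sketch: the function name `to_faiss_inputs` is hypothetical, and the C-contiguous layout is added as an extra precaution on top of the dtype conversions mentioned above.

```python
import numpy as np

def to_faiss_inputs(vectors, ids):
    """Coerce vectors to a C-contiguous float32 matrix and ids to an int64 vector."""
    vectors = np.ascontiguousarray(vectors, dtype=np.float32)
    ids = np.ascontiguousarray(ids, dtype=np.int64)
    if vectors.ndim != 2:
        raise ValueError("vectors must have shape (n, dimension)")
    if ids.shape != (vectors.shape[0],):
        raise ValueError("need exactly one id per vector")
    return vectors, ids

# Plain Python floats become float64 by default, so the conversion matters
vectors, ids = to_faiss_inputs([[0.1, 0.2], [0.3, 0.4]], [10, 20])
print(vectors.dtype, ids.dtype)  # float32 int64
```

Running the same helper on both the indexed embeddings and the query embedding keeps the two sides consistent.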

1. Prepare the data

import pandas as pd
import numpy as np

df = pd.read_csv("./datas/movielens_sparkals_item_embedding.csv")
df.head()

	id	features
0	10	[0.25866490602493286, 0.3560594320297241, 0.15…
1	20	[0.12449632585048676, -0.29282501339912415, -0…
2	30	[0.9557555317878723, 0.6764761805534363, 0.114…
3	40	[0.3184879720211029, 0.6365472078323364, 0.596…
4	50	[0.45523127913475037, 0.34402626752853394, -0….

Build ids

ids = df["id"].values.astype(np.int64)
type(ids), ids.shape
(numpy.ndarray, (3706,))
ids.dtype
dtype('int64')
ids_size = ids.shape[0]
ids_size
3706

Build datas

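import json

# Each "features" cell is a JSON-style list string; parse it into a list of floats
datas = []
for x in df["features"]:
    datas.append(json.loads(x))
# Faiss expects float32, so convert the parsed lists explicitly
datas = np.array(datas).astype(np.float32)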
datas.dtype
dtype('float32')
datas.shape
(3706, 10)
datas[0]
array([ 0.2586649 , 0.35605943, 0.15589039, -0.7067125 , -0.07414215,
 -0.62500805, -0.0573845 , 0.4533663 , 0.26074877, -0.60799956],
 dtype=float32)
# dimensionality of each embedding
dimension = datas.shape[1]
dimension
10
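
The parsing step can be tried without the CSV file. The sketch below uses two made-up JSON strings as a stand-in for the `features` column (the name `feature_strings` is hypothetical) and builds the float32 matrix in one pass.

```python
import json
import numpy as np

# Hypothetical stand-in for df["features"]: one JSON list string per row
feature_strings = ["[0.25, 0.5, -0.75]", "[1.0, 0.0, 2.5]"]

# Parse every string and request float32 directly when building the array
matrix = np.array([json.loads(s) for s in feature_strings], dtype=np.float32)
print(matrix.shape, matrix.dtype)  # (2, 3) float32
```

The second axis of the resulting array is the embedding dimension that the index constructor needs.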

2. Build the index

import faiss
index = faiss.IndexFlatL2(dimension)
index2 = faiss.IndexIDMap(index)
ids.dtype
dtype('int64')
index2.add_with_ids(datas, ids)
index.ntotal
3706

4. Search for the nearest-neighbor ID list

df_user = pd.read_csv("./datas/movielens_sparkals_user_embedding.csv")
df_user.head()

	id	features
0	10	[0.5974288582801819, 0.17486965656280518, 0.04…
1	20	[1.3099910020828247, 0.5037978291511536, 0.260…
2	30	[-1.1886241436004639, -0.13511677086353302, 0….
3	40	[1.0809299945831299, 1.0048035383224487, 0.986…
4	50	[0.42388680577278137, 0.5294889807701111, -0.6…

user_embedding = np.array(json.loads(df_user[df_user["id"] == 10]["features"].iloc[0]))
user_embedding = np.expand_dims(user_embedding, axis=0).astype(np.float32)
user_embedding
array([[ 0.59742886, 0.17486966, 0.04345559, -1.3193961 , 0.5313592 ,
 -0.6052168 , -0.19088413, 1.5307966 , 0.09310367, -2.7573566 ]],
 dtype=float32)
user_embedding.shape
(1, 10)
user_embedding.dtype
dtype('float32')
topk = 30
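# Search through index2 (the IndexIDMap wrapper) so that I holds the ids passed to add_with_ids;
# searching the inner index directly would return row positions instead
D, I = index2.search(user_embedding, topk)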
I.shape
(1, 30)
I
array([[3380, 2900, 1953, 121, 3285, 999, 617, 747, 2351, 601, 2347,
 42, 2383, 538, 1774, 980, 2165, 3049, 2664, 367, 3289, 2866,
 2452, 547, 1072, 2055, 3660, 3343, 3390, 3590]])
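
One detail worth checking in the search step: a flat index on its own reports neighbors by their insertion position (0, 1, 2, ...), and it is the `IndexIDMap` wrapper that translates those positions into the ids given to `add_with_ids`. The NumPy sketch below mimics that translation with made-up ids and positions. It illustrates the idea only and is not Faiss's internal code.

```python
import numpy as np

# Hypothetical ids in insertion order, as they would be passed to add_with_ids
ids = np.array([10, 20, 30, 40, 50], dtype=np.int64)

# Hypothetical row positions, as a plain flat index would report them for one query
positions = np.array([[3, 0, 4]])

# Looking the positions up in the id array recovers the caller's own ids
neighbor_ids = ids[positions]
print(neighbor_ids)  # [[40 10 50]]
```

If the numbers returned by a search never exceed the number of indexed vectors, that is a hint that they may be positions rather than ids.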

5. Look up movie information by movie ID

target_ids = pd.Series(I[0], name="MovieID")
target_ids.head()
0 3380
1 2900
2 1953
3 121
4 3285
Name: MovieID, dtype: int64
df_movie = pd.read_csv("./datas/ml-1m/movies.dat",
  sep="::", header=None, engine="python",
  names = "MovieID::Title::Genres".split("::"))
df_movie.head()

	MovieID	Title	Genres
0	1	Toy Story (1995)	Animation|Children's|Comedy
1	2	Jumanji (1995)	Adventure|Children's|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama
4	5	Father of the Bride Part II (1995)	Comedy

df_result = pd.merge(target_ids, df_movie)
df_result.head()

	MovieID	Title	Genres
0	3380	Railroaded! (1947)	Film-Noir
1	2900	Monkey Shines (1988)	Horror|Sci-Fi
2	1953	French Connection, The (1971)	Action|Crime|Drama|Thriller
3	121	Boys of St. Vincent, The (1993)	Drama
4	3285	Beach, The (2000)	Adventure|Drama
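
When the ranking matters downstream, it can help to carry the distances along and make the join order explicit. The following is a small sketch with made-up neighbor ids and distances (the frames `neighbors` and `movies` are hypothetical). It uses a left merge so that every neighbor stays in ranked order even if its title is missing from the movie table.

```python
import pandas as pd

# Hypothetical search output for one query: ids and distances, nearest first
neighbors = pd.DataFrame({"MovieID": [3, 1, 2], "distance": [0.12, 0.30, 0.45]})

# Hypothetical slice of the movie table
movies = pd.DataFrame({
    "MovieID": [1, 2, 3],
    "Title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})

# how="left" keeps the rows of `neighbors` in their original (ranked) order
result = neighbors.merge(movies, on="MovieID", how="left")
print(result["Title"].tolist())
```

The same pattern applies to the real search output by building `neighbors` from `I[0]` and `D[0]`.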

- Author -

蚂蚁学Python
