Chinese Embedding & Reranker Model Selection

Author: ninehills
Labels: blog
Created: 2023-12-28T04:58:44Z
Link and comments: https://github.com/ninehills/blog/issues/111

Selection recommendations:

Embedding models

Focuses on optimizing retrieval capability.

Includes a few small fine-tuning tips.

URL: https://huggingface.co/infgrad/stella-large-zh-v2 Blog post: https://zhuanlan.zhihu.com/p/655322183

Fine-tuned from the piccolo model; supports a sequence length of 1024. The blog post records some of the training ideas.

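Author: Beijing Academy of Artificial Intelligence (BAAI) URL: https://huggingface.co/BAAI/bge-large-zh-v1.5 Paper: https://arxiv.org/pdf/2309.07597.pdf GitHub: https://github.com/FlagOpen/FlagEmbedding

The model with the most openly available information; fine-tuning example code is also provided. Its authors also maintain the C-MTEB leaderboard.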

Author: MokaAI URL: https://huggingface.co/moka-ai/m3e-large GitHub: https://github.com/wangyuxinwhy/uniem

An early effort, and something of a pioneer in general-purpose Chinese embedding models, datasets, and evaluation.

URL: https://huggingface.co/intfloat/multilingual-e5-large Paper: https://arxiv.org/pdf/2212.03533.pdf

Multilingual support.

Supports a sequence length of 8192, but very little information is available.
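As a rough illustration of how any of the embedding models above would be used for retrieval, the sketch below ranks documents by cosine similarity to a query vector. The `embed` helper is an assumption, not something from the original post: it presumes the `sentence-transformers` package is installed and that the chosen model (here `BAAI/bge-large-zh-v1.5`) loads through it. The ranking logic itself is plain Python and model-agnostic.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          k: int = 3) -> List[int]:
    """Return the indices of the k documents most similar to the query."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:k]


def embed(texts: List[str],
          model_name: str = "BAAI/bge-large-zh-v1.5") -> List[List[float]]:
    """Hypothetical helper: encode texts with sentence-transformers.

    Assumes the package is installed and the model can be downloaded.
    Reloads the model on every call, so cache it in real use.
    """
    from sentence_transformers import SentenceTransformer  # lazy import
    model = SentenceTransformer(model_name)
    return model.encode(texts, normalize_embeddings=True).tolist()
```

With real models, something like `top_k(embed([query])[0], embed(docs))` would return the candidate indices to pass on to a reranker.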

Reranker models

Author: Beijing Academy of Artificial Intelligence (BAAI) URL: https://huggingface.co/BAAI/bge-reranker-large GitHub: https://github.com/FlagOpen/FlagEmbedding

Based on the xlm-roberta model.

Very little information is available. Also based on the xlm-roberta model.
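A reranker scores each (query, passage) pair jointly and reorders the candidates returned by the embedding stage. The sketch below keeps the reranking step independent of any particular model by taking a scoring function as a parameter. The `bge_scorer` factory is an assumption for illustration: it presumes the `FlagEmbedding` package is installed and uses its `FlagReranker.compute_score` call. The names `rerank`, `Scorer`, and `bge_scorer` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer maps (query, passage) pairs to relevance scores, higher is better.
Scorer = Callable[[List[Tuple[str, str]]], List[float]]


def rerank(query: str,
           candidates: Sequence[str],
           scorer: Scorer,
           top_n: int = 3) -> List[str]:
    """Reorder candidates by scorer output and keep the best top_n."""
    if not candidates:
        return []
    pairs = [(query, passage) for passage in candidates]
    scores = scorer(pairs)
    order = sorted(range(len(candidates)),
                   key=lambda i: scores[i],
                   reverse=True)
    return [candidates[i] for i in order[:top_n]]


def bge_scorer(model_name: str = "BAAI/bge-reranker-large") -> Scorer:
    """Hypothetical factory: wrap FlagEmbedding's FlagReranker as a Scorer.

    Assumes FlagEmbedding is installed and the model can be downloaded.
    """
    from FlagEmbedding import FlagReranker  # lazy import
    reranker = FlagReranker(model_name)

    def score(pairs: List[Tuple[str, str]]) -> List[float]:
        scores = reranker.compute_score([list(p) for p in pairs])
        # Some versions return a bare float for a single pair.
        return scores if isinstance(scores, list) else [scores]

    return score
```

Because the scorer is injected, the same `rerank` function can be exercised with a toy scorer in tests and with `bge_scorer()` in a real pipeline.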

We only care about the Rerank and Retrieval evaluations; for results, see mteb.
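For readers who want to sanity-check such numbers on their own data, here is a minimal sketch of two metrics commonly used for these task types: nDCG@k for retrieval and average precision (averaged over queries to give MAP) for reranking. That the leaderboard reports exactly these as its main scores is an assumption here and should be checked against the leaderboard itself. The function names are hypothetical, and the nDCG variant uses linear gain.

```python
import math
from typing import Sequence


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """nDCG@k for one query.

    `relevances` holds the graded labels of the results in ranked order.
    The ideal ranking is built from these same labels, so this assumes
    every relevant document appears somewhere in the list.
    """
    def dcg(rels: Sequence[float]) -> float:
        # rank is 0-based, so the discount is log2(rank + 2)
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))

    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0


def average_precision(relevances: Sequence[int]) -> float:
    """Average precision for one query with binary labels in ranked order."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0
```

Averaging `average_precision` over all queries gives MAP; averaging `ndcg_at_k` over all queries gives the dataset-level nDCG@k.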