[NLP] Passage Retrieval: Scaling Up

iamzieun 2023. 6. 7. 20:42

Overview

This post summarizes two methods for applying Passage Retrieval to large real-world document corpora, compression and pruning, along with FAISS, a library in which both can be tried out.

1. Passage Retrieval and Similarity Search

Similarity Search
- brute-force(exhaustive) search
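  - Finds the most similar passage by computing the similarity between the query and every passage
  - MIPS (Maximum Inner Product Search)
    - Given a query vector q, finds the passage vector v that is most similar to the query (i.e., has the largest inner product with q), and returns the corresponding passage
  ⇒ Brute-force search requires an embedding for every passage and a similarity computation between the query and every passage, so its cost grows linearly with the corpus size and it consumes a lot of resources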
- Compression
- Pruning
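
As a point of reference for the approximate methods below, brute-force MIPS can be sketched in a few lines of NumPy. This is only a minimal illustration: the function name `brute_force_mips` and the toy vectors are made up for this example and do not come from any library.

```python
import numpy as np

def brute_force_mips(query, passages, k=1):
    """Return the indices and scores of the k passages with the largest inner product."""
    scores = passages @ query          # one inner product per passage, shape (N,)
    top_k = np.argsort(-scores)[:k]    # passage indices in descending score order
    return top_k, scores[top_k]

# Toy data (hypothetical): three 2-dimensional passage vectors and one query.
passages = np.array([[1.0, 0.0], [0.0, 3.0], [2.0, 2.0]], dtype=np.float32)
query = np.array([1.0, 1.0], dtype=np.float32)

top_idx, top_scores = brute_force_mips(query, passages, k=2)
print(top_idx, top_scores)
```

Every query touches all N passage vectors, which is exactly the cost that compression and pruning try to reduce.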

Tradeoffs of similarity search
- Accuracy vs Search Speed
  - The relationship between search time and recall (e.g., recall@1)
    - A more accurate search takes more time

- Memory
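  - As the corpus grows, the search space grows, which makes search harder
  - As the corpus grows, more memory is needed to store it
  - With sparse embeddings the dimensionality is very large (typically on the order of the vocabulary size), so both problems become more severe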
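
To make the two axes of the tradeoff concrete, the sketch below estimates the memory needed to store dense embeddings and computes recall@1 of an approximate search against exact results. The corpus size (1,000,000 passages), the dimension (768), and the example ID lists are assumptions chosen only for illustration, and both helper functions are hypothetical.

```python
import numpy as np

def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Memory needed to store num_vectors dense vectors of the given dimension."""
    return num_vectors * dim * bytes_per_value

def recall_at_1(approx_top1, exact_top1):
    """Fraction of queries whose approximate top-1 result equals the exact top-1 result."""
    approx_top1 = np.asarray(approx_top1)
    exact_top1 = np.asarray(exact_top1)
    return float(np.mean(approx_top1 == exact_top1))

# Assumed corpus: 1,000,000 passages, 768-dimensional float32 embeddings.
size_gb = index_size_bytes(1_000_000, 768) / 1e9
print(f"{size_gb:.3f} GB")

# Assumed results for four queries; the approximate search misses the third one.
recall = recall_at_1([3, 7, 1, 4], [3, 7, 2, 4])
print(recall)
```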

2. Approximating Similarity Search

Compression - Scalar Quantization (SQ)
- compression
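  - Compresses vectors to reduce the storage size of each vector
  - The stronger the compression, the lower the memory usage and the greater the information loss
- Scalar Quantization (SQ)
  - Compresses each 4-byte floating-point value into a 1-byte (8-bit) unsigned integer, reducing each vector to a quarter of its original size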
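
A minimal sketch of scalar quantization, assuming a simple per-dimension min/max scheme: each dimension's value range is mapped linearly onto the 256 levels of a `uint8`. The helper names (`sq_train`, `sq_encode`, `sq_decode`) and the random data are hypothetical, and real libraries may choose the ranges differently.

```python
import numpy as np

def sq_train(vectors):
    """Learn the per-dimension value range used for quantization."""
    return vectors.min(axis=0), vectors.max(axis=0)

def sq_encode(vectors, vmin, vmax):
    """Map each float32 value to one of 256 levels, stored as uint8."""
    scale = np.where(vmax > vmin, vmax - vmin, 1.0)   # avoid division by zero
    codes = np.round((vectors - vmin) / scale * 255.0)
    return np.clip(codes, 0, 255).astype(np.uint8)

def sq_decode(codes, vmin, vmax):
    """Approximately reconstruct the original float values from the codes."""
    return vmin + codes.astype(np.float32) / 255.0 * (vmax - vmin)

rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 64)).astype(np.float32)  # stand-in embeddings

vmin, vmax = sq_train(vectors)
codes = sq_encode(vectors, vmin, vmax)
restored = sq_decode(codes, vmin, vmax)

max_error = float(np.abs(vectors - restored).max())
print(vectors.nbytes, codes.nbytes, max_error)
```

The codes take a quarter of the memory, and the reconstruction error is bounded by the quantization step, which is the information loss mentioned above.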

Pruning - Inverted File (IVF)
- pruning
  - Improves search speed by shrinking the search space, so that only a subset of the dataset is searched rather than the whole dataset
- searching with clustering and Inverted File
  1. clustering: partition the entire vector space into k clusters (e.g., k-means clustering)
  2. Inverted File (IVF): a structure that maps each cluster's centroid ID to the vectors belonging to that cluster
  → Find the centroid vector(s) closest to the given query vector, then search only among the vectors in the inverted lists of those clusters
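
The two steps above can be sketched in plain NumPy as follows. This is an illustrative toy, not how any particular library implements IVF: it assumes squared L2 distance for both clustering and ranking, uses a very basic k-means, and all function names and the `nprobe` parameter (number of clusters to visit) are chosen for this example.

```python
import numpy as np

def sq_dists(a, b):
    """Squared L2 distances between rows of a (n, d) and rows of b (m, d)."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def kmeans(vectors, k, iters=10, seed=0):
    """Very basic k-means: returns k centroids."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = sq_dists(vectors, centroids).argmin(axis=1)
        for c in range(k):
            members = vectors[assign == c]
            if len(members) > 0:            # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids

def build_ivf(vectors, centroids):
    """Inverted file: centroid id -> ids of the vectors assigned to that cluster."""
    assign = sq_dists(vectors, centroids).argmin(axis=1)
    return {c: np.flatnonzero(assign == c) for c in range(len(centroids))}

def ivf_search(query, vectors, centroids, inverted_lists, nprobe=1, k=1):
    """Search only the inverted lists of the nprobe centroids closest to the query."""
    centroid_d = sq_dists(query[None, :], centroids)[0]
    probed = np.argsort(centroid_d)[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in probed])
    d = sq_dists(query[None, :], vectors[candidates])[0]
    return candidates[np.argsort(d)[:k]]

rng = np.random.default_rng(0)
vectors = rng.standard_normal((500, 16)).astype(np.float32)  # stand-in passage embeddings
centroids = kmeans(vectors, k=8)
inverted_lists = build_ivf(vectors, centroids)

query = vectors[42]                          # query with a passage that is in the index
result = ivf_search(query, vectors, centroids, inverted_lists, nprobe=1, k=1)
print(result)
```

With `nprobe=1` only one cluster is scanned, which is fast but can miss the true nearest neighbor when it lies in a neighboring cluster. Raising `nprobe` trades speed for recall.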

3. Introduction to FAISS

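FAISS
- A library for similarity search and clustering of dense vectors

Passage Retrieval with FAISS
1. Train index and map vectors
  - An index that uses IVF or quantization (e.g., PQ, product quantization) must be trained before vectors can be added, since the cluster centroids and quantization parameters are learned from the data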
2. Search based on FAISS index
