# Full-Text Search on PDFs in Postgres

> Clean Markdown view of GeekNews topic #17535. Use the original source for factual precision when an external source URL is present.

## Metadata

- GeekNews HTML: [https://news.hada.io/topic?id=17535](https://news.hada.io/topic?id=17535)
- GeekNews Markdown: [https://news.hada.io/topic/17535.md](https://news.hada.io/topic/17535.md)
- Type: news
- Author: [xguru](https://news.hada.io/@xguru)
- Published: 2024-11-01T11:03:02+09:00
- Updated: 2024-11-01T11:03:02+09:00
- Original source: [tselai.com](https://tselai.com/full-text-search-pdf-postgres)
- Points: 18
- Comments: 1

## Summary

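pgPDF is a Postgres extension that lets you read PDF files from SQL. The PDF content is stored in a table as both text and raw bytes, and a tsvector is kept alongside it for efficient search. A tsvector represents a document in a form optimized for text search, and full-text search (FTS) queries are matched against it rather than against the raw text. Creating a GIN index on the tsvector column improves search performance further.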

## Topic Body

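- pgPDF is a Postgres extension (a wrapper around poppler) that reads PDF files from SQL  
  `SELECT pdf_read_file('/path/file.pdf') → text`  
- How the data is stored  
  - The content of each PDF is stored in a table as both text (txt) and binary (bytes)  
  - A tsvector is also stored for each PDF. A tsvector represents a document in a form optimized for text search  
  - Building a tsvector is expensive, but it only has to be done once per document, so it is best kept in a generated column  
  - FTS queries run against the tsvector, not against the txt column  
- Running FTS queries  
  - FTS generally uses the `tsvector @@ tsquery` operator  
  - The tsquery defines the matching filter applied to the tsvector  
  - Besides `to_tsquery`, other functions build a tsquery from different input styles: `plainto_tsquery`, `phraseto_tsquery`, `websearch_to_tsquery`  
  - `SELECT name FROM pdfs WHERE tsvec_en @@ to_tsquery('english', 'Postgres & Sharding');`  
- Creating a GIN index on the tsvector column improves performance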
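
The storage layout, index, and query styles listed above could be put together roughly as follows. This is a minimal sketch, not the original article's schema: the column types, the index name `pdfs_tsvec_en_idx`, and the two sample rows are assumptions made for illustration. It uses only core Postgres features, so it does not need pgPDF to run. The sample text stands in for what would be extracted from real PDFs.

```sql
-- Hypothetical schema; column names follow the txt / bytes / tsvec_en naming used above.
CREATE TABLE pdfs (
    name     text PRIMARY KEY,   -- file name (assumed to be unique)
    bytes    bytea,              -- raw PDF bytes (left NULL in this sketch)
    txt      text,               -- extracted text
    -- Computed on insert and recomputed whenever txt changes
    -- (generated columns need Postgres 12 or later).
    tsvec_en tsvector GENERATED ALWAYS AS (to_tsvector('english', txt)) STORED
);

-- GIN index that lets the planner avoid scanning every row on larger tables.
CREATE INDEX pdfs_tsvec_en_idx ON pdfs USING GIN (tsvec_en);

-- Made-up sample rows standing in for text extracted from PDFs.
INSERT INTO pdfs (name, txt) VALUES
    ('sharding.pdf', 'Postgres sharding splits one table across several nodes'),
    ('fts.pdf',      'Full text search in Postgres relies on tsvector and tsquery');

-- Query from the list above: both terms must be present.
SELECT name FROM pdfs
WHERE tsvec_en @@ to_tsquery('english', 'Postgres & Sharding');

-- Plain words, implicitly ANDed together.
SELECT name FROM pdfs
WHERE tsvec_en @@ plainto_tsquery('english', 'postgres sharding');

-- Words must appear next to each other, in this order.
SELECT name FROM pdfs
WHERE tsvec_en @@ phraseto_tsquery('english', 'full text search');

-- Search-engine style input: a quoted phrase plus an excluded word (Postgres 11 or later).
SELECT name FROM pdfs
WHERE tsvec_en @@ websearch_to_tsquery('english', '"full text search" -sharding');
```

Because `tsvec_en` is a stored generated column, none of these queries has to call `to_tsvector` at search time. Each one only builds a small tsquery and matches it against the precomputed, indexed vectors.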

## Comments


### Comment 30638

- Author: cosine20
- Created: 2024-11-01T17:01:54+09:00
- Points: 1

Oh.....