# Understanding GPT Tokenizers

> Clean Markdown view of GeekNews topic #9379. Use the original source for factual precision when an external source URL is present.

## Metadata

- GeekNews HTML: [https://news.hada.io/topic?id=9379](https://news.hada.io/topic?id=9379)
- GeekNews Markdown: [https://news.hada.io/topic/9379.md](https://news.hada.io/topic/9379.md)
- Type: news
- Author: [xguru](https://news.hada.io/@xguru)
- Published: 2023-06-12T10:57:14+09:00
- Updated: 2023-06-12T10:57:14+09:00
- Original source: [simonwillison.net](https://simonwillison.net/2023/Jun/8/gpt-tokenizers/)
- Points: 15
- Comments: 0

## Topic Body

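- LLMs such as GPT, LLaMA, and PaLM operate on tokens
- They convert input text into a sequence of tokens (integers) and predict which token comes next
- OpenAI has published its own Tokenizer, but the author released a separate version as an Observable notebook (an educational tool based on GPT-2)
  - It supports text-to-token and token-to-text conversion, plus search over the full token table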
- > The dog eats the apples  
  > El perro come las manzanas  
  > 片仮名  
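- The post explains tokenization using the tokens produced for the text above
  - "The" and "the" are different tokens
  - Many words have a token that includes the leading space, which makes encoding whole sentences much more efficient
  - Non-English words are tokenized inefficiently, so they split into more tokens per word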
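
The observations above can be reproduced with a small self-contained sketch. This is not the GPT-2 tokenizer: the vocabulary, the token IDs, and the `BYTE_OFFSET` fallback are all hypothetical, and greedy longest-match stands in for the real byte-pair-encoding merge rules. It only illustrates why case and leading spaces produce distinct tokens and why text outside the vocabulary costs more tokens.

```python
from typing import List

# Hypothetical toy vocabulary (token string -> integer ID).
# These IDs are invented for illustration and are NOT real GPT-2 IDs.
VOCAB = {
    "The": 0, " the": 1, " dog": 2, " eats": 3, " apples": 4,
    "El": 5, " per": 6, "ro": 7, " come": 8, " las": 9,
    " man": 10, "zan": 11, "as": 12,
}

# Assumption: IDs at or above BYTE_OFFSET represent one raw UTF-8 byte each.
BYTE_OFFSET = 1000

ID_TO_BYTES = {i: s.encode("utf-8") for s, i in VOCAB.items()}
BYTES_TO_ID = {b: i for i, b in ID_TO_BYTES.items()}
MAX_LEN = max(len(b) for b in BYTES_TO_ID)


def encode(text: str) -> List[int]:
    """Text-to-token: greedy longest match, with a per-byte fallback."""
    data = text.encode("utf-8")
    ids = []
    pos = 0
    while pos < len(data):
        # Try the longest candidate first, then shrink.
        for length in range(min(MAX_LEN, len(data) - pos), 0, -1):
            piece = data[pos:pos + length]
            if piece in BYTES_TO_ID:
                ids.append(BYTES_TO_ID[piece])
                pos += length
                break
        else:
            # Nothing in the vocabulary matched: emit one token per byte.
            ids.append(BYTE_OFFSET + data[pos])
            pos += 1
    return ids


def decode(ids: List[int]) -> str:
    """Token-to-text: concatenate each token's bytes and decode as UTF-8."""
    out = bytearray()
    for i in ids:
        if i >= BYTE_OFFSET:
            out.append(i - BYTE_OFFSET)
        else:
            out += ID_TO_BYTES[i]
    return out.decode("utf-8")


if __name__ == "__main__":
    for sample in ["The dog eats the apples",
                   "El perro come las manzanas",
                   "片仮名"]:
        tokens = encode(sample)
        print(len(tokens), tokens, repr(decode(tokens)))
```

With this toy vocabulary, the English sentence encodes to 5 tokens (one per word, each later word carrying its leading space), the Spanish sentence to 8, and the three Japanese characters to 9 single-byte tokens. The real tokenizer's counts for the non-English samples may differ, but the trend is the same.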

## Comments


_No public comments on this page._