# LLaMA: INT8 Edition

> Clean Markdown view of GeekNews topic #8662. Use the original source for factual precision when an external source URL is present.

## Metadata

- GeekNews HTML: [https://news.hada.io/topic?id=8662](https://news.hada.io/topic?id=8662)
- GeekNews Markdown: [https://news.hada.io/topic/8662.md](https://news.hada.io/topic/8662.md)
- Type: news
- Author: [xguru](https://news.hada.io/@xguru)
- Published: 2023-03-10T11:02:01+09:00
- Updated: 2023-03-10T11:02:01+09:00
- Original source: [github.com/tloen](https://github.com/tloen/llama-int8)
- Points: 8
- Comments: 0

## Topic Body

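- A fork of Meta's LLaMA code that lets LLaMA-13B run within 24 GiB of RAM
  - In other words, a single RTX 4090/3090 is enough to run it
- In theory, LLaMA-65B could run on a single 80GB A100
- Changes
  - Removed the parallelism constructs
  - Weights are quantized on the host machine
  - Weights are loaded incrementally to avoid memory problems
  - Uses `bitsandbytes` and `tqdm`
  - Adds a repetition penalty setting (default 1.15)
- On an Ubuntu machine with an RTX 4090 and 64GB of RAM, loading and quantizing the model takes about 25 seconds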
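
The memory figures above can be sanity-checked with some rough arithmetic. The sketch below is not taken from the fork; it assumes the nominal parameter counts (13e9 and 65e9), counts only the weights at 2 bytes each for fp16 and 1 byte each for INT8, and ignores activations, the attention cache, and any layers left unquantized. The helper name `weight_bytes` is hypothetical, and reading the 24 GiB figure as a budget for weights alone is also an assumption.

```python
GIB = 2**30   # bytes in one gibibyte (as in "24 GiB")
GB = 10**9    # bytes in one gigabyte (as in "80GB A100")


def weight_bytes(n_params: float, bytes_per_param: int) -> int:
    """Approximate memory needed to hold the weights alone."""
    return int(n_params * bytes_per_param)


# Assumption: the "13B" / "65B" labels are taken at face value.
MODELS = {"LLaMA-13B": 13e9, "LLaMA-65B": 65e9}

for name, n_params in MODELS.items():
    fp16 = weight_bytes(n_params, 2)  # 16-bit floats: 2 bytes per weight
    int8 = weight_bytes(n_params, 1)  # INT8: 1 byte per weight
    print(f"{name}: fp16 ~{fp16 / GIB:.1f} GiB, int8 ~{int8 / GIB:.1f} GiB")

# Compare against the budgets mentioned in the post.
fits_13b_fp16 = weight_bytes(13e9, 2) <= 24 * GIB
fits_13b_int8 = weight_bytes(13e9, 1) <= 24 * GIB
fits_65b_int8 = weight_bytes(65e9, 1) <= 80 * GB
print(fits_13b_fp16, fits_13b_int8, fits_65b_int8)
```

Under these assumptions, the fp16 weights of the 13B model slightly exceed 24 GiB, while the INT8 weights take roughly half that, which is consistent with the claim that a single 24 GiB card suffices. The same estimate puts the INT8 weights of the 65B model below 80GB, leaving limited headroom for everything else.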

## Comments


_No public comments on this page._