# FlexGen - Running LLMs like ChatGPT on a single GPU

> Clean Markdown view of GeekNews topic #8539. Use the original source for factual precision when an external source URL is present.

## Metadata

- GeekNews HTML: [https://news.hada.io/topic?id=8539](https://news.hada.io/topic?id=8539)
- GeekNews Markdown: [https://news.hada.io/topic/8539.md](https://news.hada.io/topic/8539.md)
- Type: news
- Author: [xguru](https://news.hada.io/@xguru)
- Published: 2023-02-22T10:16:02+09:00
- Updated: 2023-02-22T10:16:02+09:00
- Original source: [github.com/FMInference](https://github.com/FMInference/FlexGen)
- Points: 14
- Comments: 0

## Topic Body

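- A high-performance generation engine for running LLMs on memory-limited GPUs such as a 16GB T4 or a 24GB RTX3090
- Offloading that is up to about 100x faster than other offloading-based systems makes it possible to run a 175B model on a single GPU
- Compresses both the parameters and the attention cache as far as possible, down to 4 bits with almost no loss of accuracy
- A distributed parallel runtime makes it easy to scale out when more GPUs are added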
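
To make the 4-bit compression point concrete, here is a minimal sketch of the storage arithmetic and of a group-wise min-max quantizer. It is an illustration only, not FlexGen's actual code: the function names, the group size of 64, and the min-max scheme are assumptions chosen for the example, and the byte count ignores the small per-group metadata (minimum and scale) that a real implementation also has to store.

```python
import random


def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Bytes needed to store num_params values at the given bit width."""
    return num_params * bits_per_param // 8


def quantize_group(values, bits=4):
    """Min-max quantize one group of floats to integer codes in [0, 2**bits - 1]."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1  # 15 distinct steps for 4 bits
    scale = (hi - lo) / levels if hi > lo else 1.0  # avoid dividing by zero
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize_group(codes, lo, scale):
    """Map integer codes back to approximate float values."""
    return [lo + c * scale for c in codes]


def roundtrip(values, group_size=64, bits=4):
    """Quantize and dequantize a flat list, one group at a time.

    group_size=64 is an assumption made for this sketch.
    """
    restored = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        codes, lo, scale = quantize_group(group, bits)
        restored.extend(dequantize_group(codes, lo, scale))
    return restored


# Storage for 175B parameters at 16 bits versus 4 bits, in GB (10**9 bytes).
params = 175 * 10**9
print(weight_bytes(params, 16) / 1e9)  # 350.0
print(weight_bytes(params, 4) / 1e9)   # 87.5

# Round-trip error on synthetic weights drawn uniformly from [-1, 1].
rng = random.Random(0)
weights = [rng.uniform(-1.0, 1.0) for _ in range(256)]
approx = roundtrip(weights)
print(max(abs(w - a) for w, a in zip(weights, approx)))
```

Even at 4 bits, the weights of a 175B model (about 87.5 GB) exceed the memory of a 16GB or 24GB GPU, which is why compression is paired with offloading in the summary above. The sketch shows only the mechanics of quantization; it says nothing about the effect on model accuracy.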

## Comments


_No public comments on this page._