# llm.c now supports multi-GPU training and is ~7% faster than PyTorch

> Clean Markdown view of GeekNews topic #14658. Use the original source for factual precision when an external source URL is present.

## Metadata

- GeekNews HTML: [https://news.hada.io/topic?id=14658](https://news.hada.io/topic?id=14658)
- GeekNews Markdown: [https://news.hada.io/topic/14658.md](https://news.hada.io/topic/14658.md)
- Type: news
- Author: [xguru](https://news.hada.io/@xguru)
- Published: 2024-05-06T09:49:01+09:00
- Updated: 2024-05-06T09:49:01+09:00
- Original source: [twitter.com/karpathy](https://twitter.com/karpathy/status/1786461447654125625)
- Points: 12
- Comments: 1

## Topic Body

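- Simple LLM training code written by Andrej Karpathy in pure C/CUDA
- Now runs multi-GPU training in bfloat16 with Flash Attention
- Implemented in ~3,000 lines of C/CUDA, and overall up to about 7% faster than PyTorch
- Work done so far
  - Mixed-precision training (bfloat16)
  - Many kernel optimizations, including a FusedClassifier that (unlike current torch.compile) does not materialize the normalized logits
  - Flash Attention (straight from cuDNN)
  - A Packed128 data structure that forces the A100 to use 128-bit load (LDG.128) and store (STS.128) instructions
- Multi-GPU training is now possible as well
  - First version of multi-GPU training using MPI+NCCL
  - Profiling of the full training run in NVIDIA Nsight Compute
  - PR merging stage 1 of ZeRO (optimizer state sharding)
- The goal is to build a reliable, clean, tested, minimal, hardened, and sufficiently optimized LLM stack that reproduces the GPT-2 miniseries at every model size, from 124M to 1.6B, directly in C/CUDA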
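
The bfloat16 format used for mixed-precision training keeps float32's sign bit and all 8 exponent bits, but only the top 7 mantissa bits. As a rough CPU-side illustration (this is not code from llm.c; the names `bf16_t`, `float_to_bf16`, and `bf16_to_float` are hypothetical), the conversion with round-to-nearest-even might be sketched in plain C like this:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical storage type: the upper 16 bits of an IEEE-754 float32. */
typedef uint16_t bf16_t;

/* float32 -> bfloat16 with round-to-nearest-even. */
bf16_t float_to_bf16(float x) {
    assert(sizeof(float) == sizeof(uint32_t));
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);            /* bit copy without aliasing issues */
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
        /* NaN: truncate, and set a mantissa bit so it stays a NaN */
        return (bf16_t)((bits >> 16) | 0x0040u);
    }
    uint32_t lsb = (bits >> 16) & 1u;          /* lowest bit that survives */
    bits += 0x7FFFu + lsb;                     /* ties go to the even neighbor */
    return (bf16_t)(bits >> 16);
}

/* bfloat16 -> float32 is exact: pad the low 16 bits with zeros. */
float bf16_to_float(bf16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Because the exponent width matches float32, the dynamic range is unchanged and only precision is lost. For example, `1.00390625f` (1 + 2^-8) lies exactly halfway between two bfloat16 values and rounds to `0x3F80`, which is 1.0.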
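
ZeRO stage 1 shards the optimizer state so that each GPU stores and updates only its own slice. The single-process simulation below sketches that idea under loud assumptions: ranks are simulated with a loop instead of MPI+NCCL, SGD with momentum stands in for the real optimizer to keep the example short, and `zero1_step` and all of its parameters are hypothetical names rather than anything from llm.c.

```c
#include <assert.h>
#include <stddef.h>

/* One simulated ZeRO stage 1 step.
 *   params   : n floats, replicated on every rank (one shared copy here)
 *   grads    : world_size * n floats, row r holds rank r's local gradients
 *   momentum : n floats in total, but rank r only touches its own shard
 * Assumes n is divisible by world_size. */
void zero1_step(float *params, const float *grads, float *momentum,
                size_t n, size_t world_size, float lr, float beta) {
    assert(world_size > 0 && n % world_size == 0);
    size_t shard = n / world_size;

    for (size_t rank = 0; rank < world_size; rank++) {
        size_t begin = rank * shard;
        for (size_t i = begin; i < begin + shard; i++) {
            /* Stand-in for reduce-scatter: this rank receives the
             * averaged gradient only for indices in its shard. */
            float g = 0.0f;
            for (size_t r = 0; r < world_size; r++) {
                g += grads[r * n + i];
            }
            g /= (float)world_size;

            /* Optimizer state for index i exists only on `rank`. */
            momentum[i] = beta * momentum[i] + g;
            params[i] -= lr * momentum[i];
        }
    }
    /* Stand-in for all-gather: nothing to do, since params is one shared
     * array here. On real hardware each rank would broadcast its shard. */
}
```

In this sketch each simulated rank holds `n / world_size` momentum values instead of `n`, which is where the memory saving comes from. Parameters and gradients remain fully replicated at stage 1.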

## Comments


### Comment 24979

- Author: xguru
- Created: 2024-05-06T09:50:03+09:00
- Points: 1

[llm.c - Training LLMs in raw C/CUDA](https://news.hada.io/topic?id=14228)