# Speed and Python, Pick Two: How CUDA Graphs Enable Fast Python Code for Deep Learning

> Clean Markdown view of GeekNews topic #10813. Use the original source for factual precision when an external source URL is present.

## Metadata

- GeekNews HTML: [https://news.hada.io/topic?id=10813](https://news.hada.io/topic?id=10813)
- GeekNews Markdown: [https://news.hada.io/topic/10813.md](https://news.hada.io/topic/10813.md)
- Type: news
- Author: [ninebow](https://news.hada.io/@ninebow)
- Published: 2023-09-10T23:46:48+09:00
- Updated: 2023-09-10T23:46:48+09:00
- Original source: [discuss.pytorch.kr](https://discuss.pytorch.kr/t/cuda-speed-python-pick-two-how-cuda-graphs-enable-fast-python-code-for-deep-learning/2441?utm_source=geeknews)
- Points: 15
- Comments: 0

## Topic Body

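GPU speeds have grown explosively over the past few years, and the way deep learning workloads are optimized is changing along with them. PyTorch has been adding optimization features such as `torch.compile()`, but for some workloads, including LLMs, that support is still being improved.

While waiting for `torch.compile()` to improve, I came across an article that introduces CUDA Graphs, an optimization you can apply right away, and walks through applying it, so I translated it. (⚠️ Note: the end of the article includes some promotional content for Fireworks.ai, the LLM inference platform company that wrote the original.)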
  
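The article introduces CUDA Graphs in the following order:

- An introduction to CPU/GPU overlap, the existing optimization approach
- Where CPU overhead arises
- Techniques for reducing CPU overhead, including CUDA Graphs
- A case study applying CUDA Graphs to the LLaMA2-7B model
- The performance gains from CUDA Graphs
- Appendix: issues with `torch.compile()` as of PyTorch 2.0.1 and how to work around them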

## Comments


_No public comments on this page._