# FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

> Clean Markdown view of GeekNews topic #9892. Use the original source for factual precision when an external source URL is present.

## Metadata

- GeekNews HTML: [https://news.hada.io/topic?id=9892](https://news.hada.io/topic?id=9892)
- GeekNews Markdown: [https://news.hada.io/topic/9892.md](https://news.hada.io/topic/9892.md)
- Type: news
- Author: [xguru](https://news.hada.io/@xguru)
- Published: 2023-07-20T10:31:01+09:00
- Updated: 2023-07-20T10:31:01+09:00
- Original source: [crfm.stanford.edu](https://crfm.stanford.edu/2023/07/17/flash2.html)
- Points: 9
- Comments: 0

## Topic Body

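- Language models with much longer context windows have appeared, such as GPT-4 (32k), MPT (65k), and Claude (100k).
- Extending a Transformer's context length is hard because the attention layer's runtime and memory requirements grow quadratically with sequence length.
- FlashAttention, released last year, reduces memory usage and speeds up attention, and has been adopted widely.
- It was already 2-4x faster than optimized baselines at release, but there is still room for improvement: it remains slower than optimized matrix multiplication (GEMM) and reaches only 25-40% of the theoretical maximum FLOPs/s (up to 124 TFLOPs/s on an A100 GPU).
- FlashAttention-2 is 2x faster than the previous version and reaches up to 230 TFLOPs/s on an A100 GPU.
- When training GPT-style language models, it reaches up to 225 TFLOPs/s (72% model FLOPs utilization).
- The algorithm was adjusted to reduce non-matmul FLOPs.
- Parallelism is improved, and the way work is partitioned within each thread block has changed.
- The maximum supported head dimension is raised from 128 to 256.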
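
The figures in the list can be sanity-checked with a little arithmetic. The sketch below assumes an A100 dense FP16/BF16 peak of 312 TFLOPs/s (this number is not stated in the summary) and uses hypothetical helper names `utilization` and `score_matrix_gib`. It also shows how the memory for a single fully materialized attention score matrix grows with the square of the sequence length.

```python
# Back-of-the-envelope checks for the throughput and memory figures.
# Assumption (not from the summary): A100 dense FP16/BF16 peak is 312 TFLOPs/s.
A100_PEAK_TFLOPS = 312.0


def utilization(achieved_tflops: float, peak_tflops: float = A100_PEAK_TFLOPS) -> float:
    """Fraction of the assumed hardware peak that a measured throughput reaches."""
    return achieved_tflops / peak_tflops


def score_matrix_gib(seq_len: int, bytes_per_elem: int = 2) -> float:
    """GiB needed to store one seq_len x seq_len score matrix (one head, one sequence)."""
    return seq_len * seq_len * bytes_per_elem / 2**30


if __name__ == "__main__":
    for label, tflops in [
        ("FlashAttention, best case", 124.0),
        ("FlashAttention-2, attention kernel", 230.0),
        ("FlashAttention-2, GPT-style training", 225.0),
    ]:
        print(f"{label}: {utilization(tflops):.1%} of assumed peak")

    # Illustrative sequence lengths; doubling the length quadruples the memory.
    for n in (32_768, 65_536, 100_000):
        print(f"seq_len={n}: {score_matrix_gib(n):.1f} GiB per head in FP16")
```

Under that assumed peak, 225 / 312 is roughly 0.72, which agrees with the 72% utilization quoted for training, and 124 / 312 sits at the top of the 25-40% range given for the original FlashAttention. A 32,768-token score matrix in FP16 takes 2 GiB per head, which is why avoiding materializing it matters.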

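The summary does not describe the algorithm itself, so the following is only a minimal pure-Python sketch of the general tiling idea behind FlashAttention-style kernels: keys and values are processed block by block while a running maximum and running sum keep the softmax exact, so the full score matrix is never stored. The function names and `block_size` parameter are hypothetical, and dividing by the running sum only once at the end is meant in the spirit of reducing non-matmul work, not as a reproduction of the actual CUDA kernel.

```python
import math


def naive_attention(Q, K, V):
    """Reference softmax attention that builds every score for each query at once."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_size=4):
    """Same result, but keys/values are consumed in blocks with an online softmax."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        running_max = -math.inf   # largest score seen so far
        running_sum = 0.0         # sum of exp(score - running_max) so far
        acc = [0.0] * dv          # unnormalized weighted sum of value rows
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new maximum
            # (exp(-inf) is 0.0, so the first block needs no special case).
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
                   for j, a in enumerate(acc)]
            running_max = new_max
        # Normalize once per query instead of after every block.
        out.append([a / running_sum for a in acc])
    return out
```

Calling `tiled_attention(Q, K, V, block_size=b)` with lists of equal-length float vectors should match `naive_attention(Q, K, V)` up to floating-point rounding for any positive `b`, including block sizes that do not divide the number of keys.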
## Comments


_No public comments on this page._