# How Was ChatGPT Trained? - RLHF

> Clean Markdown view of GeekNews topic #8424. Use the original source for factual precision when an external source URL is present.

## Metadata

- GeekNews HTML: [https://news.hada.io/topic?id=8424](https://news.hada.io/topic?id=8424)
- GeekNews Markdown: [https://news.hada.io/topic/8424.md](https://news.hada.io/topic/8424.md)
- Type: news
- Author: [xguru](https://news.hada.io/@xguru)
- Published: 2023-02-08T10:42:16+09:00
- Updated: 2023-02-08T10:42:16+09:00
- Original source: [littlefoxdiary.tistory.com](https://littlefoxdiary.tistory.com/111)
- Points: 15
- Comments: 0

## Topic Body

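- Human preference scores are the most suitable metric for judging how "good" a model's generated output is
- RLHF (Reinforcement Learning from Human Feedback) uses people's evaluations of model outputs as the measure of generated-text quality and, going a step further, designs a loss that reflects that feedback to optimize the model
- RLHF: Step by Step
  - #1 Training a Language Model (pre-training)
  - #2 Collecting data for a Reward Model and training it
  - #3 Fine-tuning the Language Model with Reinforcement Learning
- RLHF: things to consider
  - Current limitations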
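
Steps #2 and #3 above can be made concrete with a small sketch. The summary does not spell out the actual losses, so everything here is an assumption based on commonly described RLHF formulations: a pairwise preference loss for the reward model, and a reward that is penalized when the fine-tuned policy drifts away from the pre-trained reference model. The function names and the `beta` value are hypothetical and chosen only for illustration.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Step #2 (assumed form): loss for one human comparison.

    Computes -log(sigmoid(reward_chosen - reward_rejected)), which is small
    when the reward model scores the human-preferred response higher.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_reward(
    reward: float,
    logp_policy: float,
    logp_reference: float,
    beta: float = 0.1,  # hypothetical penalty weight
) -> float:
    """Step #3 (assumed form): reward signal used during RL fine-tuning.

    Subtracts a penalty proportional to how much more likely the sampled
    text is under the fine-tuned policy than under the reference model.
    """
    return reward - beta * (logp_policy - logp_reference)


if __name__ == "__main__":
    # The reward model agrees with the human label: low loss.
    print(pairwise_preference_loss(2.0, 0.0))
    # The reward model disagrees with the human label: high loss.
    print(pairwise_preference_loss(0.0, 2.0))
    # The policy has drifted from the reference: reward is reduced.
    print(kl_penalized_reward(1.0, logp_policy=-0.5, logp_reference=-1.0))
```

In a real system the scalar rewards would come from a neural reward model and the penalized reward would feed a policy-gradient algorithm; the sketch only shows how human preferences could turn into a trainable signal.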
