# VALL-E - A language model for speech synthesis from Microsoft

> Clean Markdown view of GeekNews topic #8217. Use the original source for factual precision when an external source URL is present.

## Metadata

- GeekNews HTML: [https://news.hada.io/topic?id=8217](https://news.hada.io/topic?id=8217)
- GeekNews Markdown: [https://news.hada.io/topic/8217.md](https://news.hada.io/topic/8217.md)
- Type: news
- Author: [xguru](https://news.hada.io/@xguru)
- Published: 2023-01-10T10:24:40+09:00
- Updated: 2023-01-10T10:24:40+09:00
- Original source: [valle-demo.github.io](https://valle-demo.github.io/)
- Points: 17
- Comments: 3

## Topic Body

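- A Transformer-based text-to-speech (TTS) model
- Can synthesize speech in any voice from just a 3-second sample of that voice
- Sounds far more natural and closer to the target speaker than recent zero-shot TTS systems, and also preserves the speaker's emotion and acoustic environment
- Earlier pipelines went phoneme → mel-spectrogram → waveform, whereas VALL-E goes phoneme → discrete code → waveform
- Can be combined with a range of speech synthesis applications and with AI models such as GPT-3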
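
To make the "discrete code → waveform" step in the list above more concrete, here is a minimal sketch of what it means to represent audio as a sequence of integer codes. It is a toy stand-in, not VALL-E's method: the fixed codebook, the `FRAME` and `NUM_CODES` values, and the helper names `build_codebook`, `encode`, and `decode` are all assumptions made for illustration, and a real system would use a learned audio codec rather than this hand-made lookup table. The phoneme → code step (the part a language model would predict) is not modelled here.

```python
import math

FRAME = 4       # samples per frame (toy value, an assumption)
NUM_CODES = 8   # codebook size (toy value, an assumption)


def build_codebook():
    """Hypothetical codebook: entry k is a constant frame whose level
    is evenly spaced in [-1, 1]. A real codec learns its entries."""
    return [[-1.0 + 2.0 * k / (NUM_CODES - 1)] * FRAME
            for k in range(NUM_CODES)]


def encode(waveform, codebook):
    """Waveform -> integer codes: pick the nearest codebook entry
    (squared Euclidean distance) for each non-overlapping frame."""
    codes = []
    for start in range(0, len(waveform) - FRAME + 1, FRAME):
        frame = waveform[start:start + FRAME]
        dists = [sum((a - b) ** 2 for a, b in zip(frame, entry))
                 for entry in codebook]
        codes.append(dists.index(min(dists)))
    return codes


def decode(codes, codebook):
    """Integer codes -> waveform: concatenate the looked-up frames."""
    out = []
    for code in codes:
        out.extend(codebook[code])
    return out


codebook = build_codebook()
# One period of a sine wave, 32 samples long, as the input "audio".
wave = [math.sin(2 * math.pi * n / 32) for n in range(32)]
codes = encode(wave, codebook)   # a short list of small integers
recon = decode(codes, codebook)  # a coarse, staircase-like waveform
```

The point of the sketch is only that `codes` is a plain sequence of small integers, which is the kind of token sequence a language model can be trained to predict; the reconstruction quality of this toy codebook is deliberately crude.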

## Comments


### Comment 14113

- Author: openmind
- Created: 2023-01-10T18:04:52+09:00
- Points: 1

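It seems advances in machine learning have lowered the barrier to entry for TTS as well. If you look through open-source repositories, you can even record your own voice and build a homemade TTS for it.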

### Comment 14082

- Author: jjpark78
- Created: 2023-01-10T10:40:18+09:00
- Points: 2

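So a voice waveform can no longer identify an individual the way a fingerprint does. -_-;

I think I once heard that some wiretapping setups use a specific person's voiceprint on large-scale servers, so that the system reacts to particular keywords spoken in that voice...

If synthesis is this good, that kind of system is finished...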

### Comment 14080

- Author: xguru
- Created: 2023-01-10T10:24:47+09:00
- Points: 1

- [The rise of giant AI models and what zero-shot means](http://cloudinsight.net/ai/%EA%B1%B0%EB%8C%80-%EB%AA%A8%EB%8D%B8%EC%9D%98-%EB%B0%9C%EC%A0%84%EA%B3%BC-zero-shot%EC%9D%98-%EC%9D%98%EB%AF%B8/)  
- [Mel spectrogram explained](https://judy-son.tistory.com/6)