# Lossless Compression of English Short Messages

> Clean Markdown view of GeekNews topic #84. Use the original source for factual precision when an external source URL is present.

## Metadata

- GeekNews HTML: [https://news.hada.io/topic?id=84](https://news.hada.io/topic?id=84)
- GeekNews Markdown: [https://news.hada.io/topic/84.md](https://news.hada.io/topic/84.md)
- Type: news
- Author: [lifthrasiir](https://news.hada.io/@lifthrasiir)
- Published: 2019-07-16T15:10:31+09:00
- Updated: 2019-07-16T15:10:31+09:00
- Original source: [textsynth.org](http://textsynth.org/sms.html)
- Points: 3
- Comments: 1

## Topic Body

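Fabrice Bellard's name has been coming up a lot lately. His previous project was a lossless compression algorithm built on neural networks (see https://bellard.org/nncp/). With GPT-2 (https://openai.com/blog/better-language-models/) having just been released, this page grew out of the idea: what if you swapped that neural network for GPT-2 and ran the compression algorithm? It compresses short English text to about 15% of its original size, using only 1.2 bits per character, which approaches the estimated information entropy of English text (0.6 to 1.3 bits per character). As the URL suggests, the intent seems to be for the compressed messages to be sent over SMS.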
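The arithmetic behind those figures can be checked with a minimal sketch. It assumes the uncompressed baseline is 8 bits per character (which is what makes 15% come out to 1.2 bits), and it uses the standard fact that an ideal entropy coder spends about `-log2(p)` bits on a symbol the model predicted with probability `p`. The function names and the example probabilities are hypothetical and are not taken from the original page.

```python
import math

# Assumption: the uncompressed text uses 8 bits per character.
RAW_BITS_PER_CHAR = 8


def bits_per_char(compression_ratio, raw_bits=RAW_BITS_PER_CHAR):
    """Convert a compressed/original size ratio into bits per character."""
    return compression_ratio * raw_bits


def ideal_code_length_bits(probabilities):
    """Total bits an ideal entropy coder would spend, given the probability
    the model assigned to each symbol that actually occurred."""
    return sum(-math.log2(p) for p in probabilities)


if __name__ == "__main__":
    # 15% of 8 bits is roughly 1.2 bits per character.
    print(round(bits_per_char(0.15), 3))

    # Hypothetical per-character probabilities from some language model:
    # the better the predictions, the fewer bits are needed.
    example_probs = [0.5, 0.25, 0.9, 0.8]
    total = ideal_code_length_bits(example_probs)
    print(total / len(example_probs))  # average bits per character
```

The better the language model predicts the next character, the closer the average lands to the entropy estimate quoted above, which is why swapping in a stronger model can help.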

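* This is not the first compression algorithm to use neural networks. The top-tier compression algorithms, led by PAQ, all use statistical methods, and neural networks are not uncommon among them. Context mixing (https://en.wikipedia.org/wiki/Context_mixing), which underlies them, is itself an application of neural networks, and there is already precedent for the LSTM that Bellard used (https://github.com/byronknoll/lstm-compress). Bellard's contribution is closer to performance optimization.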
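
To illustrate why context mixing counts as a neural network application, here is a simplified sketch of logistic mixing in the style of PAQ-family compressors: several models each predict the probability that the next bit is 1, and a single-layer network combines them with weights trained online. The class name, learning rate, and initial weights are assumptions chosen for illustration. This is not the code of PAQ, NNCP, or lstm-compress.

```python
import math


def stretch(p):
    """Map a probability to the logistic domain: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def squash(x):
    """Inverse of stretch: the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def clamp(p, eps=1e-6):
    """Keep probabilities away from 0 and 1 so stretch() stays finite."""
    return min(max(p, eps), 1.0 - eps)


class LogisticMixer:
    """Single-layer mixer: combines per-model bit predictions (hypothetical sketch)."""

    def __init__(self, n_inputs, learning_rate=0.02):
        # Assumption: start with equal weights.
        self.weights = [1.0 / n_inputs] * n_inputs
        self.learning_rate = learning_rate
        self.inputs = [0.0] * n_inputs

    def predict(self, probs):
        """Return the mixed probability that the next bit is 1."""
        self.inputs = [stretch(clamp(p)) for p in probs]
        dot = sum(w * x for w, x in zip(self.weights, self.inputs))
        return clamp(squash(dot))

    def update(self, mixed_p, bit):
        """Gradient step on coding cost: reward models that predicted the bit."""
        err = bit - mixed_p
        self.weights = [
            w + self.learning_rate * err * x
            for w, x in zip(self.weights, self.inputs)
        ]


if __name__ == "__main__":
    mixer = LogisticMixer(n_inputs=2)
    # Model A keeps predicting 1 with p=0.9, model B with p=0.1; the data is all 1s.
    for _ in range(100):
        p = mixer.predict([0.9, 0.1])
        mixer.update(p, bit=1)
    print(mixer.weights, mixer.predict([0.9, 0.1]))
```

After training on a stream the first model predicts well, the mixer shifts weight toward that model, so the combined prediction improves and an entropy coder driven by it would spend fewer bits.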

## Comments


### Comment 82

- Author: iolothebard
- Created: 2019-07-16T16:52:00+09:00
- Points: 1

So it uses the Unicode CJK and Hangul ranges...

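It brings back the nightmare of the 2-byte Johab/Wansung encoding days, when extended ASCII characters showed up as Hangul or Hanja... (yes, I'm showing my age)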