A summary of the paper Apple published on MM1, its multimodal LLM.

(discuss.pytorch.kr)

Apple has published its research results on MM1, a multimodal LLM. (The model code and weights have not been released, and it seems unlikely that they will be.)

The findings on the image encoder, the VL connector, the datasets, and the training method seem worth a look for anyone who trains or fine-tunes models themselves, so I am sharing the notes I put together with ChatGPT.

Encoder lesson: Image resolution has the highest impact, followed by model size and training data composition.

VL connector lesson: The number of visual tokens and the image resolution matter most, while the type of VL connector has little effect.
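To make the link between image resolution and visual token count concrete, here is a minimal sketch. It assumes a plain ViT-style encoder that splits a square image into non-overlapping patches and emits one token per patch, with no CLS token and no pooling in the connector. The function name `num_visual_tokens` and the patch size of 14 are illustrative choices, not details taken from the paper.

```python
def num_visual_tokens(image_size: int, patch_size: int) -> int:
    """Count patch tokens for a square image under a plain ViT patchify step.

    Assumption: one token per non-overlapping patch, no CLS token,
    and no token reduction afterwards.
    """
    if image_size <= 0 or patch_size <= 0:
        raise ValueError("image_size and patch_size must be positive")
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


if __name__ == "__main__":
    # Token count grows quadratically with the side length of the image.
    for size in (224, 336, 448):
        print(size, num_visual_tokens(size, patch_size=14))
```

Under these assumptions, doubling the side length from 224 to 448 quadruples the token count (256 to 1024), which is why resolution and token budget tend to be discussed together.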

Data lesson 1: Interleaved data is instrumental for few-shot and text-only performance, while captioning data lifts zero-shot performance.

Data lesson 2: Text-only data helps with few-shot and text-only performance.

Data lesson 3: A careful mixture of image and text data can yield optimal multimodal performance while retaining strong text performance.
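One simple way to realize a data mixture like this is to pick the source of each training example by weighted random sampling. The sketch below is only an illustration: the source names, the weights in `example_mixture`, and the helper `sample_sources` are hypothetical placeholders, not the ratios or the pipeline used for MM1.

```python
import random
from collections import Counter
from typing import Dict, List


def sample_sources(weights: Dict[str, float], n: int, seed: int = 0) -> List[str]:
    """Draw n data-source names with probability proportional to their weights."""
    if not weights or any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-empty and non-negative")
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


# Hypothetical mixture for illustration only; tune these for your own setup.
example_mixture = {"caption": 0.4, "interleaved": 0.4, "text_only": 0.2}

if __name__ == "__main__":
    draws = sample_sources(example_mixture, n=10_000)
    counts = Counter(draws)
    for name in example_mixture:
        print(name, counts[name] / len(draws))
```

In practice, each drawn name would select which dataset the next example or batch is read from, so changing the weights changes the mixture without touching the rest of the training loop.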

Data lesson 4: Synthetic data helps with few-shot learning.