Thanks for the great work. The paper mentions using MMC4 during pre-training. For the supervised fine-tuning stage, I'm curious whether the authors have considered the MIMIC-IT dataset from Otter, which is an interleaved image-text dataset. Could adding interleaved image-text data during supervised fine-tuning further boost overall performance?