To break the plateau, the authors implement a two-stage Reinforcement Learning (RL) process [11].
The paper demonstrates that MSRL significantly outperforms pure SFT models by optimizing for both textual structure and visual fidelity, effectively surpassing the performance limit reached at 2.8M SFT samples [11, 25]. MSRL Stage Max Dataset Size 2.8 million samples [11, 22] 33k curated samples [11] GPU Requirement 16 H800 GPUs [11] 24 H800 GPUs [11] Training Goal Min. Negative Log-Likelihood [22] Hybrid Text-Visual Reward [11] Outcome Performance Plateaus [22] Breaks SFT Performance Limit [11]
: Uses 22k data pairs focusing on textual accuracy (
2.8m Gmail.txt -
To break the plateau, the authors implement a two-stage Reinforcement Learning (RL) process [11].
The paper demonstrates that MSRL significantly outperforms pure SFT models by optimizing for both textual structure and visual fidelity, effectively surpassing the performance limit reached at 2.8M SFT samples [11, 25]. MSRL Stage Max Dataset Size 2.8 million samples [11, 22] 33k curated samples [11] GPU Requirement 16 H800 GPUs [11] 24 H800 GPUs [11] Training Goal Min. Negative Log-Likelihood [22] Hybrid Text-Visual Reward [11] Outcome Performance Plateaus [22] Breaks SFT Performance Limit [11]
: Uses 22k data pairs focusing on textual accuracy (
We use cookies
We use cookies and other tracking technologies to improve your browsing experience on our website, to show you personalized content and targeted ads, to analyze our website traffic, and to understand where our visitors are coming from.
Read more about our privacy policy