Tweet Overview
X/Twitter post from @iScienceLuvr, published on May 7, 2025 at 10:02. The post contains 1 image.
Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

Introduces two openly licensed datasets:
1. SwallowCode (≈16.1 billion tokens) refines Python snippets from The-Stack-v2
2. SwallowMath (≈2.3 billion tokens) enhances Finemath-4+ by removing boilerplate, restoring context, and reformatting solutions into concise, step-by-step explanations







