Tanishq Mathew Abraham, Ph.D. · @iScienceLuvr

X/Twitter post from @iScienceLuvr, published on 7 May 2025 at 10:02 AM. This post contains 1 image.

Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

Introduces two openly licensed datasets:
1. SwallowCode (≈16.1 billion tokens) refines Python snippets from The-Stack-v2
2. SwallowMath (≈2.3 billion tokens) enhances Finemath-4+ by removing boilerplate, restoring context, and reformatting solutions into concise, step-by-step explanations