Tweet Overview
X/Twitter post by @iScienceLuvr, posted May 7, 2025 at 10:02. Includes 1 image.
Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

Introduces two openly licensed datasets:
1. SwallowCode (≈16.1 billion tokens): refines Python snippets from The-Stack-v2
2. SwallowMath (≈2.3 billion tokens): enhances Finemath-4+ by removing boilerplate, restoring context, and reformatting solutions into concise, step-by-step explanations
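To make the "removing boilerplate" step concrete, here is a minimal, rule-based sketch of that kind of cleanup. It is a hypothetical stand-in: the actual SwallowCode/SwallowMath pipelines rewrite data with an LLM, and the function name, patterns, and sample snippet below are all illustrative assumptions, not the paper's implementation.

```python
import re

# Hypothetical boilerplate patterns; the real pipeline is LLM-based,
# this only illustrates the idea of stripping non-content lines.
BOILERPLATE_PATTERNS = [
    re.compile(r"^#!"),                                  # shebang lines
    re.compile(r"^#\s*-\*-\s*coding:", re.IGNORECASE),   # encoding declarations
    re.compile(r"^#\s*(copyright|license)", re.IGNORECASE),  # license headers
]

def strip_boilerplate(snippet: str) -> str:
    """Drop boilerplate comment lines from a code snippet (illustrative)."""
    kept = []
    for line in snippet.splitlines():
        if any(p.match(line.strip()) for p in BOILERPLATE_PATTERNS):
            continue
        kept.append(line)
    # Collapse leading blank lines left behind by the removal.
    while kept and not kept[0].strip():
        kept.pop(0)
    return "\n".join(kept)

raw = """#!/usr/bin/env python
# Copyright 2024 Example Corp.

def add(a, b):
    return a + b
"""

print(strip_boilerplate(raw))
```

A rule-based filter like this only covers the most mechanical part of the cleanup; restoring missing context and rewriting solutions into step-by-step explanations are exactly the parts where the datasets rely on model-driven rewriting rather than regular expressions.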







