AutoMathext: A 200GB mathematical text dataset

The dataset contains data from different sources, such as arXiv’s scientific papers, programming code fragments, and web page data. The data has been specifically filtered and processed to adapt to various application scenarios such as mathematical reasoning, reasoning training, and fine-tuning.

Supports tasks such as text generation and question and answer, and is especially suitable for developing and testing models that can understand and generate math-related content.

Main features:

Task type: Focus on text generation and question and answer tasks, suitable for developing and testing models involving mathematical reasoning and reasoning capabilities.
Language support: Currently only English is supported and is suitable for scenarios that require a large amount of English training data.
Data level: The data level is between 1 billion and 10 billion, providing rich resources for large-scale model training.
Diversified subsets: Contains subsets of data from different sources and under different filtering conditions, such as arXiv’s scientific papers and programming code snippets, as well as web data, which are suitable for a variety of different training and testing needs.
Domain labeling: Dataset labeling covers mathematical reasoning, reasoning, fine-tuning, etc., which helps accurately select data that meets specific task requirements.
Data set download:https://huggingface.co/datasets/math-ai/AutoMathText

They also have a collective dataset of 2 million math questions and answers: StackMathQA
It is full of mathematical questions and answers. It can allow AI to better learn how to solve mathematical problems.
Simply put, it is a large problem set specifically to train AI to solve mathematical problems.
For example, there will be questions explaining why the volume formula of the sphere is 4/3πr³. It can help researchers train AI to solve more complex mathematical problems.

StackMathQA dataset:https://huggingface.co/datasets/math-ai/StackMathQA