Source: @kuchaev
Nemotron-4 340B, a family of models optimized for use with NVIDIA NeMo and NVIDIA TensorRT-LLM, includes state-of-the-art instruction models and reward models, as well as datasets for generative AI training.
NVIDIA today announced Nemotron-4 340B, a family of open models that developers can use to generate synthetic data for training large language models (LLMs) for commercial applications across industries including healthcare, finance, manufacturing, and retail.
High-quality training data is critical to the performance, accuracy, and response quality of custom LLMs, but robust datasets can be expensive and difficult to obtain.
Through a uniquely permissive open model license, Nemotron-4 340B gives developers a free, scalable way to generate synthetic data for building powerful LLMs.
The Nemotron-4 340B family includes a base model, an instruct model, and a reward model that together form a pipeline for generating and refining training data for LLMs. The models are optimized to work with NVIDIA NeMo, an open-source framework for end-to-end model training, including data curation, customization, and evaluation. They are also optimized for inference with the open-source NVIDIA TensorRT-LLM library.
Nemotron-4 340B is available now for download from Hugging Face. Developers will soon be able to access the models at http://ai.nvidia.com, where they will be packaged as NVIDIA NIM microservices with a standard application programming interface that can be deployed anywhere.
Generating synthetic data with Nemotron
In situations where access to large, diverse labeled datasets is limited, LLMs can help developers generate synthetic training data.
The Nemotron-4 340B Instruct model creates diverse synthetic data that mimics the characteristics of real-world data, helping to improve data quality and thereby the performance and robustness of custom LLMs across a range of domains.
Then, to further improve the quality of the AI-generated data, developers can use the Nemotron-4 340B Reward model to filter for high-quality responses. The reward model scores responses on five attributes: helpfulness, correctness, coherence, complexity, and verbosity. It currently ranks first on the Hugging Face RewardBench leaderboard, created by AI2, which evaluates the capabilities, safety, and pitfalls of reward models.
In this synthetic data generation pipeline, (1) the Nemotron-4 340B Instruct model first generates synthetic text output. Then, (2) the Nemotron-4 340B Reward model evaluates the generated text, providing feedback that guides iterative refinement and ensures the synthetic data is accurate, relevant, and aligned with specific requirements. Researchers can also create their own instruct or reward models by customizing the Nemotron-4 340B Base model using their proprietary data combined with the included HelpSteer2 dataset.
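The two-step generate-then-score loop above can be sketched in a few lines. This is a minimal illustration, not the actual pipeline: `instruct_model` and `reward_model` are hypothetical stand-ins for calls to deployed Nemotron-4 340B endpoints, and the threshold value is arbitrary; only the five attribute names come from the article.

```python
# Sketch of a generate -> score -> filter loop for synthetic data.
# instruct_model and reward_model are hypothetical stand-ins for the
# Nemotron-4 340B Instruct and Reward model endpoints.

def instruct_model(prompt: str) -> str:
    # Stand-in: would call the Nemotron-4 340B Instruct model.
    return f"response to: {prompt}"

def reward_model(prompt: str, response: str) -> dict:
    # Stand-in: would return the five reward attributes described above.
    return {"helpfulness": 3.2, "correctness": 3.5, "coherence": 3.8,
            "complexity": 1.9, "verbosity": 1.4}

def generate_synthetic_pairs(prompts, min_helpfulness=3.0):
    """Keep only (prompt, response) pairs the reward model rates highly."""
    kept = []
    for prompt in prompts:
        response = instruct_model(prompt)
        scores = reward_model(prompt, response)
        # Filter on reward scores; a real pipeline might combine attributes.
        if scores["helpfulness"] >= min_helpfulness:
            kept.append((prompt, response, scores))
    return kept

pairs = generate_synthetic_pairs(["Explain tensor parallelism."])
print(len(pairs))  # prints 1: the single pair passed the quality filter
```

Low-scoring responses would be regenerated or discarded, which is the iterative refinement the feedback loop enables.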
Fine-tuning with NeMo, optimizing inference with TensorRT-LLM
Using the open-source NVIDIA NeMo and NVIDIA TensorRT-LLM, developers can optimize the efficiency of their instruct and reward models for generating synthetic data and scoring responses.
All Nemotron-4 340B models are optimized with TensorRT-LLM to take advantage of tensor parallelism, a type of model parallelism in which individual weight matrices are split across multiple GPUs and servers, enabling efficient inference at scale.
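The idea behind tensor parallelism can be shown with a toy NumPy sketch, assuming a column-wise split of one linear layer across two devices; real systems such as TensorRT-LLM shard weights across GPUs and gather partial results with collective communication, which plain arrays stand in for here.

```python
import numpy as np

# Toy sketch of tensor parallelism for one linear layer y = x @ W.
# W is split column-wise across two "devices"; each device computes a
# slice of the output, and the slices are concatenated (an all-gather).

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # batch of 4 activations, hidden size 8
W = rng.standard_normal((8, 6))   # full weight matrix

# Split W into two column shards, one per device.
W0, W1 = np.split(W, 2, axis=1)

# Each device multiplies the same input by its own shard.
y0 = x @ W0
y1 = x @ W1

# Concatenating the partial outputs recovers the full result.
y_parallel = np.concatenate([y0, y1], axis=1)
y_reference = x @ W

print(np.allclose(y_parallel, y_reference))  # True
```

Because each device holds only a fraction of the weights, models far larger than a single GPU's memory can still run efficiently.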
The Nemotron-4 340B Base model, trained on 9 trillion tokens, can be customized using the NeMo framework to adapt to specific use cases or domains. This fine-tuning process benefits from extensive pretraining data and yields more accurate outputs for specific downstream tasks.
The NeMo framework offers a variety of customization methods, including supervised fine-tuning and parameter-efficient fine-tuning (PEFT) methods such as low-rank adaptation (LoRA).
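To illustrate why LoRA is parameter-efficient, here is a minimal NumPy sketch of the underlying idea, under the assumption of a single linear layer with illustrative dimensions (not tied to Nemotron-4 340B or the NeMo API): the frozen weight matrix is left untouched, and only a low-rank update B·A is trained.

```python
import numpy as np

# Sketch of low-rank adaptation (LoRA): keep the pretrained weight W
# frozen and train only a low-rank update B @ A, with rank r much
# smaller than the layer dimensions. Sizes here are illustrative.

d_in, d_out, r = 1024, 1024, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, init 0

def forward(x):
    # Adapted layer: because B starts at zero, the low-rank term is
    # initially zero and the model behaves exactly like the pretrained one.
    return W @ x + B @ (A @ x)

full_params = W.size
lora_params = A.size + B.size
print(f"trainable params: {lora_params} vs {full_params} "
      f"({100 * lora_params / full_params:.2f}%)")
```

Only A and B receive gradient updates during fine-tuning, so the number of trainable parameters drops from d_out·d_in to r·(d_in + d_out), here under 2% of the full matrix.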
To improve model quality, developers can align their models using NeMo Aligner and datasets annotated by the Nemotron-4 340B Reward model. Alignment is a key step in training LLMs: a model's behavior is fine-tuned with algorithms such as reinforcement learning from human feedback (RLHF) to ensure its outputs are safe, accurate, contextually appropriate, and consistent with its intended goals.
Enterprises seeking enterprise-grade support and production security can also access NeMo and TensorRT-LLM through the cloud-native NVIDIA AI Enterprise software platform, which provides accelerated, efficient runtimes for generative AI foundation models.
Evaluating model safety and getting started
The Nemotron-4 340B Instruct model underwent extensive safety evaluation, including adversarial testing, and performed well across a wide range of risk indicators. Users should still carefully evaluate the model's outputs to ensure the generated synthetic data is suitable, safe, and accurate for their use case.
For more information on model safety and safety evaluation, see the model card.
For more details, see the original announcement and the links below.
Download the Nemotron-4 340B models from Hugging Face: https://huggingface.co/collections/nvidia/nemotron-4-340b-666b7ebaf1b3867caf2f1911
Read the research papers on the models (https://research.nvidia.com/publication/2024-06_nemotron-4-340b) and the dataset (https://arxiv.org/abs/2406.08673).