This article is primarily a translation of the original post; see the links below for the original text.
RefuelAI recently launched two new large language models: RefuelLLM-2 and RefuelLLM-2-small.
RefuelLLM-2 and RefuelLLM-2-small are language models designed specifically for data annotation, cleaning, and enrichment tasks.
Purpose: RefuelLLM-2 is intended mainly for automated data annotation, data cleaning, and data enrichment. These are foundational tasks in processing and analyzing large-scale datasets, especially in scenarios where unstructured data must be converted into a structured format.
Main functions:
High-performance data annotation: the model can automatically identify and label key information in data, such as classifying records or parsing specific attributes.
Data cleaning: automatically detects and corrects errors or inconsistencies in data, such as spelling mistakes and formatting issues.
Data enrichment: automatically fills in missing information or adds context based on existing data, increasing the data's value and usability.
High accuracy: RefuelLLM-2 (83.82%) outperformed all state-of-the-art LLMs, including GPT-4-Turbo (80.88%), Claude-3-Opus (79.19%), and Gemini-1.5-Pro (74.59%) in benchmark tests on approximately 30 data annotation tasks.
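As a minimal sketch of the annotation workflow described above (all names here are hypothetical, and `call_llm` is a stub standing in for any hosted chat-completion API; this is not Refuel's actual implementation), a classification-style labeling task can be framed as a prompt-and-parse loop:

```python
# Hypothetical sketch of LLM-based data annotation.
# `call_llm` is a deterministic stub; a real system would call a hosted model.

def call_llm(prompt: str) -> str:
    return "positive"

def annotate(texts, labels):
    """Label each text with one of the allowed labels via an LLM prompt."""
    prompt_template = (
        "Classify the following text into one of {labels}.\n"
        "Text: {text}\nLabel:"
    )
    results = []
    for text in texts:
        prompt = prompt_template.format(labels=labels, text=text)
        answer = call_llm(prompt).strip().lower()
        # Fall back to None if the model answers outside the label set.
        results.append(answer if answer in labels else None)
    return results

print(annotate(["Great product!"], ["positive", "negative"]))  # → ['positive']
```

Constraining and validating the output against a fixed label set is what makes the result directly usable as structured data.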
Results
Benchmark datasets
Compared to the previously launched Refuel LLM, we added 10 datasets to the benchmark:
Long-context datasets: datasets such as QuALITY and NaturalQuestions were added specifically to assess quality on tasks with long input contexts.
Private evaluation datasets: due to concerns about data contamination, many researchers and practitioners have recently highlighted the limitations of evaluating LLMs only on public datasets (see [1], [2], [3]). To test how well LLMs generalize and perform on real-world data labeling and enrichment tasks, we also added non-public datasets to our benchmark.
We used Autolabel, our open-source library for LLM-powered data labeling, to run all the experiments in this report.
Output quality
Output quality measures the agreement between the output generated by the LLM and the provided ground-truth labels.
RefuelLLM-2 (83.82%) outperforms all current state-of-the-art LLMs on data labeling and enrichment, including GPT-4-Turbo (80.88%), Claude-3-Opus (79.19%), and Gemini-1.5-Pro (74.59%).
RefuelLLM-2-small (79.67%) outperforms LLMs of comparable size/inference cost, including Claude-3-Sonnet (70.99%), Claude-3-Haiku (69.23%), and GPT-3.5-Turbo (68.13%).
We see a significant improvement in quality over the base LLMs each model started from (Mixtral-8x7B and Llama3-8B, respectively).
Long-context datasets
As mentioned in the Benchmarks section, we include some datasets specifically used to evaluate LLM performance over long input contexts.
RefuelLLM-2 is trained from the Mixtral-8x7B base model, which itself supports a maximum input context length of 32K tokens. RefuelLLM-2-small is trained from the Llama3-8B base model, which supports a maximum input context length of 8K tokens.
On both input buckets (<4K and ≥4K input contexts), RefuelLLM-2 performs better than all other LLMs. As expected, all LLMs show significant performance degradation on long-context inputs.
Non-public datasets
As described in the Benchmarks section, we evaluated all LLMs on a collection of non-public datasets covering areas such as recruitment, financial services, STEM, and e-commerce. These datasets were not used in any training or validation split of the RefuelLLM-2 model series. While including them in the benchmark limits reproducibility, we believe evaluating LLMs on non-public, task-specific datasets is crucial to understanding their reliability and quality in real-world environments.
The superior quality of RefuelLLM-2 is evident in the performance comparisons shown above. In addition, for both models, the improved quality on these held-out datasets relative to their respective base LLMs is a good indication of their generalization capabilities.
Domain-specific datasets
To further understand the reliability and quality of the models in real-world environments, we also report LLM output quality on datasets from specific industries/problem areas.
We observe that across various vertical domains, RefuelLLM-2 is competitive with or superior to current state-of-the-art LLMs such as GPT-4-Turbo and Claude-3-Opus in output quality, while being less than 1/10 their size.
Confidence score quality
Drawing on confidence-calibration research, we use the average token-level generation probability of the label as a heuristic to estimate the confidence of LLM outputs. To benchmark the quality of these confidence scores, we use AUROC, an aggregate score that measures the classifier's ability to distinguish the positive class ("LLM output is correct") from the negative class ("LLM output is incorrect") across all score thresholds.
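AUROC has a convenient rank interpretation: it is the probability that a randomly chosen correct output receives a higher confidence score than a randomly chosen incorrect one (ties counted as 0.5). A minimal stdlib-only sketch (function name is illustrative; production code would typically use a library implementation):

```python
def auroc(scores, labels):
    # labels: 1 = "LLM output is correct", 0 = "LLM output is incorrect".
    # AUROC = P(score of a correct output > score of an incorrect output),
    # with ties counted as 0.5.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A scorer that ranks every correct output above every incorrect one
# achieves the maximum AUROC of 1.0.
print(auroc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # → 1.0
```

An uninformative scorer (identical scores for both classes) yields 0.5, which is the baseline against which calibrated confidence scores are judged.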
We observed that the confidence scores of RefuelLLM-2 and RefuelLLM-2-small outputs are much better calibrated than those of GPT-4 and Llama-3-70B. Previous work in this area has shown that RLHF-based post-training of LLMs can severely compromise logprob calibration: the RLHF training process can cause large spikes in the KL divergence between the model's output distribution and the original pre-trained distribution. This can push the model far from its original "world priors", compromising its ability to estimate probabilities accurately. Note that the Claude and Google models do not support returning token-level log probabilities, so no scores are reported for them.
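The confidence heuristic mentioned above can be sketched in a few lines: average the token-level log-probabilities of the generated label and exponentiate back to a probability (the function name is illustrative, not Refuel's actual code):

```python
import math

def sequence_confidence(token_logprobs):
    # Heuristic from the text: mean per-token log-probability of the
    # generated label, exponentiated to give a value in (0, 1].
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Token logprobs as returned by APIs that expose them (e.g. one per
# generated token); less negative values mean higher confidence.
print(round(sequence_confidence([-0.1, -0.2, -0.3]), 4))  # → 0.8187
```

Using the mean (rather than the sum) keeps the score comparable across labels of different token lengths.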
Training and hyperparameters
We train the model in two stages. The first stage makes the model good at data labeling and enrichment tasks, while the second stage improves performance on longer-context examples. Both stages were trained on an 8xH100 80GB GPU cluster.
Phase 1 - This is where most of the model's instruction tuning happens. The maximum row length used for training is 4096 tokens. We train the model for 21k steps with a batch size of 32, using a cosine learning-rate scheduler with an initial learning rate of 1e-5, decaying to 10% of its initial value.
Phase 2 - In this phase, we add longer-context inputs to the training set and continue training. We train the model for an additional 5k steps, with a batch size of 16 and 2 gradient-accumulation steps. We found the model more sensitive to the learning rate in this stage, and used a cosine learning-rate scheduler with an initial learning rate of 2e-6, decaying to 10% of its value.
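The schedule described for both phases (cosine decay from an initial learning rate down to 10% of it) can be sketched as follows. The function name and the warmup-free, step-indexed form are assumptions for illustration, not the exact training code:

```python
import math

def cosine_lr(step, total_steps, lr_init, lr_final_frac=0.1):
    # Cosine decay from lr_init down to lr_final_frac * lr_init over
    # total_steps, matching the schedule described for both phases.
    lr_min = lr_init * lr_final_frac
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_init - lr_min) * cos

# Phase 1: 21k steps starting at 1e-5, ending at 1e-6.
print(cosine_lr(0, 21000, 1e-5))      # initial learning rate
print(cosine_lr(21000, 21000, 1e-5))  # final learning rate (10% of initial)
```

Frameworks such as PyTorch provide equivalent schedulers (e.g. cosine annealing with a minimum learning rate); this sketch only makes the decay curve explicit.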
Datasets
Although the distribution of examples differs between the two phases, both are sampled from the same pool of more than 2,750 unique tasks. Our training mix mainly includes:
Manually annotated datasets, such as the Flan, Task Source, and Aya collections
Synthetic datasets, such as OpenOrca, OpenHermes, and WizardLM
Proprietary data sets developed or licensed by Refuel
The final instruction-tuning dataset (after deduplication, sampling, and cleanup) comprises approximately 4B tokens across the two phases. We also use sequence packing, combining multiple sequences into a single training row, to improve training throughput.
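Sequence packing as mentioned above can be sketched with a greedy strategy (this illustrative version does not split sequences longer than `max_len` and ignores attention-mask bookkeeping, which a real training pipeline would need):

```python
def pack_sequences(sequences, max_len):
    # Greedily concatenate token sequences into rows of at most max_len
    # tokens, reducing padding waste and raising training throughput.
    rows, current = [], []
    for seq in sequences:
        if current and len(current) + len(seq) > max_len:
            rows.append(current)
            current = []
        current = current + seq
    if current:
        rows.append(current)
    return rows

packed = pack_sequences([[1] * 3, [2] * 2, [3] * 4], max_len=6)
print([len(r) for r in packed])  # → [5, 4]
```

Without packing, the same three sequences padded to length 6 would waste 9 of 18 token slots; packing fills rows more densely at the cost of needing per-sequence attention masks.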
Visit https://labs.refuel.ai/playground, an interactive playground for testing our models against other LLMs.
Sign up for Refuel Cloud to access the models and fine-tuning support: https://www.refuel.ai/get-started
We are open-sourcing RefuelLLM-2-small (aka Llama-3-Refueled) under the CC BY-NC 4.0 license. Model weights can be found on Hugging Face: https://huggingface.co/refuelai/Llama-3-Refueled
If you want to learn more, click the links below the video.
Thank you for watching. If you enjoyed this video, please like and subscribe. Thanks!
Detailed introduction:https://www.refuel.ai/blog-posts/announcing-refuel-llm-2
Playground:https://labs.refuel.ai/playground
Model download:https://huggingface.co/refuelai/Llama-3-Refueled
YouTube: