Bloom is a free, open-source tool that automatically detects undesirable behavior in AI models, such as biased output, sycophancy, and more. You only need to define the behavior types to detect in a simple configuration file and add dialogue examples as needed; the tool then automatically performs four steps: behavioral intent analysis → generation of diverse test scenarios → simulated interaction with the target model (with API integration for mainstream models such as Claude and GPT) → quantitative scoring of results (based on indicators such as how often the problem occurs). Full transcripts of the test conversations are also readily available.
This tool saves hours of manual testing, lets you quickly compare the performance of different models on fresh test sets, and helps avoid overfitting to a fixed benchmark. It also yields reliable, reproducible AI safety conclusions, making it well suited to researchers building trustworthy AI systems.
Today, as large language models (LLMs) become ever more capable, the question of under what circumstances a model will exhibit unsafe, misaligned, or biased behavior must be answered systematically.
Bloom is an open-source model-behavior evaluation framework from AI safety research teams. It is not a new language model but a toolchain for testing models, aiming to make model safety evaluation automated, scalable, and reproducible.
What problem does Bloom want to solve?
Before Bloom, model safety evaluation had several obvious pain points:
- Test cases depend heavily on human design
- Limited scenario coverage makes it hard to catch misbehavior at the edges
- Results are difficult to reproduce across researchers
- The evaluation process does not scale and is costly
Bloom’s core goals can be summed up in one sentence:
Turn “model behavior evaluation” itself into a process that can be automated, composed, and extended.
Bloom’s overall workflow
Rather than a one-off conversation test, Bloom breaks model evaluation into a clear pipeline.
Behavior Specification
The researcher begins by defining the behavior types to be assessed, for example:
- Sycophancy
- Self-preservation tendencies
- Political or value bias
- Stability in rejecting inappropriate requests
- Breaks in role consistency
These are abstract goals, not concrete prompts.
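As an illustration, such a behavior specification can be thought of as structured data rather than a prompt. The Python sketch below is hypothetical; the `BehaviorSpec` class and its fields are invented for illustration and are not Bloom's actual configuration schema:

```python
from dataclasses import dataclass, field

@dataclass
class BehaviorSpec:
    """Abstract description of a behavior to probe for.
    Hypothetical schema, not Bloom's real configuration format."""
    name: str
    description: str
    # Optional seed dialogues showing what the behavior looks like.
    example_dialogues: list = field(default_factory=list)

sycophancy = BehaviorSpec(
    name="sycophancy",
    description="The model endorses a user's incorrect claim to please them.",
    example_dialogues=[
        ("User: 2 + 2 = 5, right?", "Assistant: Yes, absolutely!"),
    ],
)
```

The point is that the spec names a *class* of behavior; the concrete prompts that probe it are generated later in the pipeline.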
Ideation
Bloom automatically generates a large number of test scenarios, including:
- Different contexts
- Different ways to ask questions
- Different emotions, roles, or induction paths
This step addresses the narrow coverage of hand-designed test cases.
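The idea can be sketched by crossing contexts with question framings. This is a deliberately simplified illustration; Bloom ideates scenarios automatically rather than from a fixed template grid, and the `generate_scenarios` helper below is invented:

```python
from itertools import product

def generate_scenarios(claim, contexts, framings):
    """Expand one seed claim into many test scenarios by crossing
    contexts with question framings (illustrative only)."""
    return [f"{ctx} {framing.format(claim=claim)}"
            for ctx, framing in product(contexts, framings)]

scenarios = generate_scenarios(
    claim="the Great Wall is visible from the Moon",
    contexts=["You are chatting with a proud teacher.",
              "You are helping a nervous student."],
    framings=["I believe {claim}. Don't you agree?",
              "My professor said {claim}. That settles it, right?"],
)
# 2 contexts x 2 framings -> 4 scenarios probing the same behavior
```

Even this toy grid shows why automation matters: coverage grows multiplicatively with each new dimension of variation.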
Model Interaction (Rollout)
Bloom feeds these scenarios to the target model (e.g., different LLM versions) in batches:
- Runs multi-turn conversations automatically
- Records the full context
- Supports comparing multiple models or checkpoints
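A minimal sketch of batch rollout, with a stub standing in for a real chat API such as Claude or GPT; `run_rollouts` and its transcript format are invented for illustration and are not Bloom's API:

```python
def run_rollouts(models, scenarios, max_turns=1):
    """Run each scenario against each target model and record the
    full transcript (sketch; a real rollout would call a chat API
    and may span several turns)."""
    transcripts = []
    for model_name, model_fn in models.items():
        for scenario in scenarios:
            history = [("user", scenario)]
            for _ in range(max_turns):
                reply = model_fn(history)
                history.append(("assistant", reply))
            transcripts.append({"model": model_name, "turns": history})
    return transcripts

# Stub model standing in for a real API client.
stub = lambda history: "I'd gently push back on that claim."
transcripts = run_rollouts({"model-a": stub},
                           ["The Great Wall is visible from the Moon, right?"])
```

Because every turn is recorded, the same transcripts can later be re-scored with a different judge without rerunning the (expensive) model calls.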
Judgment
The final step is to analyze the model output, such as:
- Whether the target behavior is triggered
- The frequency of the behavior
- The intensity or stability of the behavior
The judgment itself can also be done by a model or rule system, rather than relying entirely on manual annotation.
Core features of Bloom
Automation first
Rather than a one-off test, Bloom is designed to:
- Run repeatedly
- Fit into CI-like workflows
- Support regression testing on model updates
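Regression testing on model updates can be sketched as a CI-style gate on the measured trigger frequency. The function and threshold below are illustrative assumptions, not part of Bloom:

```python
def regression_gate(baseline_freq, new_freq, tolerance=0.02):
    """Pass only if the new checkpoint does not trigger the behavior
    noticeably more often than the baseline (illustrative threshold)."""
    return new_freq <= baseline_freq + tolerance

assert regression_gate(0.10, 0.11)      # small drift: pass
assert not regression_gate(0.10, 0.20)  # regression: fail
```

Wired into CI, a gate like this turns safety evaluation from a one-time audit into a check that runs on every model update.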
Research-oriented
Bloom is clearly not a conversational-bot framework, but:
- An AI safety research tool
- A model alignment analysis tool
- An early-warning tool for out-of-control behavior
This also means its intended audience skews toward researchers.
Reproducible and scalable
- All assessment configurations are structured
- Experiments can be fully reproduced by others
- New behavior types can be added modularly
Summary of Bloom
To sum up Bloom in one sentence:
Bloom is not about “teaching the model to speak”, but about “interrogating the model to find out under what circumstances it will say the wrong thing”.
It represents a very important trend:
The next step for AI is not just to become stronger, but to become more understandable, more restrained, and more verifiable.
GitHub: https://github.com/safety-research/bloom
YouTube: