BioMaster is a sophisticated, multi-agent framework that leverages large language models (LLMs) and dynamic knowledge retrieval to automate and streamline complex bioinformatics workflows. Designed specifically to tackle the challenges of modern bioinformatics, BioMaster improves accuracy, efficiency, reproducibility, and scalability across diverse omics data types, including RNA-seq, ChIP-seq, single-cell analysis, spatial transcriptomics, and Hi-C data processing.
- 2025-05-15: Updated the code to support Ollama.
- 2025-04-30: Updated the code to support `config.yaml` for running the examples.
✨ Fully Automated Bioinformatics Pipelines
- Seamlessly automates data preprocessing, alignment, variant calling, and comprehensive downstream analysis.

🤖 Role-Based Multi-Agent System
- Specialized agents (Plan, Task, Debug, and Check Agents) collaboratively handle task decomposition, execution, validation, and error recovery.

🔍 Dynamic Retrieval-Augmented Generation (RAG)
- Dynamically retrieves and integrates domain-specific knowledge, allowing BioMaster to adapt rapidly to emerging bioinformatics tools and specialized workflows.

🔄 Advanced Error Handling & Recovery
- Robust error detection and automated debugging mechanisms minimize propagation of errors across workflow steps, ensuring reliability and reproducibility.

🧠 Optimized Memory Management
- Efficiently manages memory, enabling stable and consistent performance even in complex, long-running workflows.

⚙️ Extensible & Customizable
- Supports easy integration of custom bioinformatics tools, scripts, and workflows, empowering researchers to extend BioMaster according to their specific analysis needs.

🖥️ Interactive UI
- User-friendly graphical interface allows users without extensive computational expertise to effortlessly manage, execute, and monitor bioinformatics workflows.
BioMaster autonomously handles a diverse range of bioinformatics analyses across multiple omics modalities:
- DEG analysis (Differentially Expressed Genes)
- DEG analysis (WGS-based)
- Fusion gene detection
- APA analysis (Alternative Polyadenylation)
- RNA editing
- Splicing analysis
- Expression quantification
- Novel transcript identification
- Functional enrichment
- Circular RNA identification
- Peak calling
- Motif discovery
- Functional enrichment
- DEG analysis
- Marker gene identification
- Cell clustering
- Top marker gene identification
- Neighborhood enrichment
- Cell type annotation
- Spatially Variable Gene (SVG) detection
- Clustering
- Ligand-Receptor interactions
- Mapping & sorting conversion
- Pair parsing & cleaning
- Contact matrix generation
- DNA methylation identification
- De novo assembly
- Alignment
- Quality control
- Host removal
- Transcript quantification analysis
- Isoform quantification (RNA-seq)
- microRNA prediction
- microRNA quantification
- DNA methylation (Bisulfite-Seq)
- DNase-seq hypersensitive site identification
- PAS (Polyadenylation Site) identification (3' end-seq)
- Protein-RNA cross-links identification
- Ribo-seq analysis (RBP-bound enriched genes)
- Metagenomic analysis and composition plotting
- TSS identification (CAGE-seq)
- Protein expression quantification
- Isoform quantification for PacBio RNA-seq
- Translated ORFs identification (Ribo-seq)
You can find the Read the Docs documentation in the `docs` folder.
You can install BioMaster using the following steps:

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/BioMaster.git
  cd BioMaster
  ```

- Install the required dependencies:

  ```bash
  conda create -n agent python=3.12
  # other Python versions can also work; 3.10-3.12 is suggested
  conda activate agent
  pip install -r requirements.txt
  ```

- Download the data and move it to `data/`:
  Google Drive link: https://drive.google.com/drive/folders/1vA3WIAVXVo4RZSqXKsItEZHVaBgIIv_E?usp=sharing

PS: Direct installation has been tested on Linux; Windows and macOS have not been tested, so it is uncertain whether any issues might arise there.
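As a quick sanity check after installation (a hypothetical check, not part of the official setup), you can confirm the environment activates and resolves the LangChain dependency mentioned in the acknowledgments:

```bash
conda activate agent
# should print the installed version without an ImportError
python -c "import langchain; print(langchain.__version__)"
```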
BioMaster uses two types of Retrieval-Augmented Generation (RAG) systems:
- PLAN RAG: Used during the planning phase.
- EXECUTE RAG: Used during the execution phase.
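For intuition, the sketch below shows how such a knowledge base could be embedded and queried with the `chromadb` library that backs the `./chroma_db` folder. This is a conceptual illustration, not BioMaster's actual retrieval code; the collection name `plan_knowledge` is hypothetical.

```python
import json
import chromadb

# Conceptual sketch: embed Plan_Knowledge.json entries and retrieve the
# workflow most relevant to a task goal (uses Chroma's default embedder;
# BioMaster itself is configured with an API embedding model).
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("plan_knowledge")  # hypothetical name

with open("./doc/Plan_Knowledge.json") as f:
    entries = json.load(f)  # assuming a JSON array of {content, metadata} objects

collection.add(
    ids=[str(i) for i in range(len(entries))],
    documents=[e["content"] for e in entries],
    metadatas=[e["metadata"] for e in entries],
)

# PLAN-phase style query: find the workflow matching the user's goal.
hits = collection.query(query_texts=["ChIP-seq peak calling with IgG control"], n_results=1)
print(hits["documents"][0][0][:200])
```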
If you want to add a new analysis workflow, you need to update the PLAN RAG.
- Collect the analysis workflow, e.g., ChIP-seq data analysis.
- Write the workflow based on the following format:
  - You can use LLMs (e.g., ChatGPT) to help you draft the workflow content.
  - If new scripts, tools, or functions are required, they can also be referenced in the PLAN RAG.
  - The PLAN RAG focuses on the steps, required inputs, and expected outputs, not detailed usage.
  - When describing input/output, it is strongly recommended to mention data formats, especially if the workflow is custom, uncommon, or newly developed. For example:
    Input Required: sample1.fastq.gz, sample2.fastq.gz
    Expected Output: sample.sam
Fully ChIP-seq Peak Calling with IgG Control Workflow:
Step 1: Quality Control – Conduct quality checks on raw sequencing data to assess data quality.
Input Required: Raw FASTQ files.
Expected Output: Cleaned and quality-checked FASTQ files.
Tools Used: FastQC, Trimmomatic, Cutadapt.
Step 2: Alignment – Align reads to the reference genome.
Input Required: Cleaned FASTQ files and the reference genome.
Expected Output: Sorted BAM file.
Tools Used: BWA-MEM, Bowtie2, STAR.
Step 3: SAM/BAM Conversion & Processing – Convert SAM to BAM, sort, and remove PCR duplicates.
Input Required: SAM file.
Expected Output: De-duplicated BAM file.
Tools Used: SAMtools, Picard.
Step 4: Signal Track Generation – Generate BigWig files for visualization.
Input Required: De-duplicated BAM file.
Expected Output: BigWig signal track file.
Tools Used: deeptools, bedGraphToBigWig.
Step 5: Peak Calling – Identify enriched genomic regions using IgG as a control.
Input Required: De-duplicated BAM file and IgG control BAM file.
Expected Output: NarrowPeak file.
Tools Used: MACS3.
- Add to the PLAN RAG:
  - Edit `./doc/Plan_Knowledge.json` and use the following JSON format:

    ```json
    {
      "content": "Fully ChIP-seq Peak Calling with IgG Control Workflow: Step 1: Quality Control – Conduct quality checks on raw sequencing data to assess data quality. Input Required: Raw FASTQ files. Expected Output: Cleaned and quality-checked FASTQ files. Tools Used: FastQC, Trimmomatic, Cutadapt. Step 2: Alignment – Align reads to the reference genome. Input Required: Cleaned FASTQ files and the reference genome. Expected Output: Sorted BAM file. Tools Used: BWA-MEM, Bowtie2, STAR. Step 3: SAM/BAM Conversion & Processing – Convert SAM to BAM, sort, and remove PCR duplicates. Input Required: SAM file. Expected Output: De-duplicated BAM file. Tools Used: SAMtools, Picard. Step 4: Signal Track Generation – Generate BigWig files for visualization. Input Required: De-duplicated BAM file. Expected Output: BigWig signal track file. Tools Used: deeptools, bedGraphToBigWig. Step 5: Peak Calling – Identify enriched genomic regions using IgG as a control. Input Required: De-duplicated BAM file and IgG control BAM file. Expected Output: NarrowPeak file. Tools Used: MACS3.",
      "metadata": { "source": "workflow", "page": 16 }
    }
    ```
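If you maintain many workflows, a small helper like the following can append a new entry safely. This is a hypothetical helper, assuming `Plan_Knowledge.json` holds a JSON array of entry objects as in the format above:

```python
import json
from pathlib import Path

# Hypothetical helper: append a workflow entry to the PLAN RAG knowledge file.
# Assumes ./doc/Plan_Knowledge.json contains a JSON array of entry objects.
def add_plan_entry(content: str, source: str = "workflow", page: int = 0,
                   path: str = "./doc/Plan_Knowledge.json") -> None:
    p = Path(path)
    entries = json.loads(p.read_text()) if p.exists() else []
    entries.append({"content": content, "metadata": {"source": source, "page": page}})
    p.write_text(json.dumps(entries, indent=2, ensure_ascii=False))
```

Remember to delete `./chroma_db` afterwards so the embeddings are rebuilt (see the notes below).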
- Collect the tools, scripts, functions, etc.
  - Scripts: It is recommended to store all scripts in `./scripts/`.
  - Tools: If a tool is newly introduced, please install it in advance and ensure it is accessible.
  - Functions: It is suggested to place them in `./scripts/functions.py`.
- Document the usage of these tools, scripts, and functions.
  - Focus on organizing usage examples. Provide example commands and parameter explanations; the more detailed, the better. For example:

    ```bash
    samtools view -S -b ./output/001/aligned_reads.sam > ./output/001/aligned_reads.bam
    ```

  - If a tool is already installed and difficult to set up, note in the knowledge base that no additional installation is required.
  - If it is a script or function, specify where it is stored and how to call it. For instance, to use `run-cooler.sh` in a Hi-C task, write: `bash ./scripts/run-cooler.sh ...` (a hypothetical example of a documented helper function is shown after the JSON example below).
  - You can use an LLM to help you write and organize content for the EXECUTE RAG.
- Knowledge entry recommendations:
  - Add the tool name in the `source` field to help the Plan Agent locate the correct tool.
  - If a script or function is specific to one workflow, append a note like: "run-sort-bam.sh is only used in the Hi-C workflow."
- Add entries to the EXECUTE RAG in `./doc/Task_Knowledge.json`. Example format:

  ```json
  {
    "content": "2. run-sort-bam.sh:\nData-type-independent, generic bam sorting module\nInput : any unsorted bam file (.bam)\nOutput : a bam file sorted by coordinate (.sorted.bam) and its index (.sorted.bam.bai).\nUsage\nRun the following in the container.\nrun-sort-bam.sh <input_bam> <output_prefix>\n# input_bam : any bam file to be sorted\n# output_prefix : prefix of the output bam file.\n\nSet parameters according to the example: Suppose the input files are ./output/GM12878_bwa_1.bam and ./output/GM12878_bwa_2.bam, and the target is ./output/GM12878_bwa_sorted.bam with prefix ./output/GM12878_bwa_sorted. Generate the following sample script:\nbash ./scripts/run-sort-bam.sh ./output/GM12878_bwa_1.bam ./output/GM12878_bwa_sorted\n\nbash ./scripts/run-sort-bam.sh ./output/GM12878_bwa_2.bam ./output/GM12878_bwa_sorted\n\nYou can install the tool, but do not do any additional operations. Please follow the example to generate rather than copy and paste completely, especially for folder names, file names, etc.",
    "metadata": {
      "source": "run-sort-bam.sh",
      "page": 6
    }
  }
  ```
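For functions, the same pattern applies. Below is a hypothetical example of a small helper that could live in `./scripts/functions.py`; its name and behavior are illustrative only. The point is that the knowledge entry should state where the function lives and how to call it:

```python
# ./scripts/functions.py (hypothetical example helper)
import gzip

def fastq_read_count(path: str) -> int:
    """Count reads in a FASTQ file (optionally gzipped): 4 lines per read."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        return sum(1 for _ in fh) // 4
```

A matching EXECUTE RAG entry would then state that the function is defined in `./scripts/functions.py` and show an example call.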
Notes:

- To delete or update existing knowledge, modify the corresponding entries in `./doc/Task_Knowledge.json` and `./doc/Plan_Knowledge.json`. After that, delete the `./chroma_db` folder, which stores the embedding vector database of the current knowledge (see the snippet after this list).
- If you notice that certain knowledge is not being used in the PLAN or EXECUTE phase, consider refining or expanding the knowledge or goals. You can achieve this by either:
  - adding relevant information to the knowledge files, or
  - making the task goal more specific.
- Do not use all available knowledge by default. It is recommended to selectively use only the knowledge relevant to your task, to ensure efficiency and relevance.
- Keep the knowledge concise and high quality. BioMaster is designed to handle most tasks out-of-the-box and does not require additional installation steps (e.g., installing R packages via `sudo`).
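For example, after editing either knowledge file (paths from the notes above):

```bash
# Remove the cached embedding database; it is rebuilt from the JSON
# knowledge files on the next run.
rm -rf ./chroma_db
python run.py
```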
The example code is located in the `./examples/` folder.

- Add your API key and base URL
  - Open `run.py` or `examples/file.py` and insert your OpenAI API key and base URL.
- Move the example script
  - Move `examples/file.py` into the root directory of BioMaster: `mv examples/file.py ./BioMaster/`
- Download the data
  - Download the required dataset using the Google Drive link provided in `README.md`, and place the data in the `./data/` directory.
- Set the task ID
  - In `run.py` or `file.py`, set a unique `id` for your task.
  - This `id` can be any string, but must not be duplicated.
- Run the example
  - Use the following commands to execute:

    ```bash
    conda activate agent
    bash run.sh
    ```
```yaml
# BioMaster settings
# Supported main models:
#   o3-mini, o1, gpt-4o, o3-mini-2025-01-31, o1-2024-12-17
#   claude-3-7-sonnet-thinking, claude-3-7-sonnet-20250219, claude-3-5-sonnet-20241022
#   DeepSeek-V3, DeepSeek-R1
#   Qwen/QwQ-32B
#   LLAMA3-70B
# All other models can be tried, but a stronger model is suggested as the main model;
# the best model tested at present is o3-mini-2025-01-31.
# Supported tool models:
#   All LLMs can be tried; you can choose smaller models here.
# Supported embedding models:
#   text-embedding-004,
#   text-embedding-3-large, text-embedding-3-small,
#   text-embedding-ada-002,
#   BAAI/bge-m3
# Suggested base URLs:
#   https://api.bltcy.ai/v1
#   https://gpt-api.hkust-gz.edu.cn/v1
#   https://dashscope.aliyuncs.com/compatible-mode/v1
#   https://api.openai.com/v1
#   https://api.siliconflow.cn/v1
#   https://sg.uiuiapi.com/v1
api:
  main:
    key: ''
    base_url: 'https://api.siliconflow.cn/v1/'
  embedding:
    key: ''
    base_url: 'https://api.siliconflow.cn/v1/'

# Ollama settings
ollama:
  enabled: true
  base_url: 'http://localhost:11434'

# model settings
models:
  main: "deepseek-ai/DeepSeek-V3"
  tool: "deepseek-ai/DeepSeek-V3"
  embedding: "BAAI/bge-m3"

# BioMaster settings
biomaster:
  executor: true
  id: '005'
  generate_plan: true
  use_ollama: false

# data list and goal
data:
  files:
    - './data/rnaseq_1.fastq.gz: RNA-Seq read 1 data (left read)'
    - './data/rnaseq_2.fastq.gz: RNA-Seq read 2 data (right read)'
    - './data/minigenome.fa: small genome sequence consisting of ~750 genes.'
  goal: 'please do WGS/WES data analysis Somatic SNV+indel calling.'
```
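For reference, this is how such a `config.yaml` could be read with standard PyYAML; a minimal sketch, not BioMaster's actual loading code:

```python
import yaml

# Read the settings shown above (field names from the example config).
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["models"]["main"])   # e.g. "deepseek-ai/DeepSeek-V3"
print(cfg["biomaster"]["id"])  # task id, e.g. '005'
print(cfg["data"]["goal"])     # the analysis goal string
```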
To run with a local LLM via Ollama:

- Start the Ollama server:

  ```bash
  ollama serve
  ```

- Download the model:

  ```bash
  ollama run llama3:70b
  ```

- Set the following in `config.yaml`:
```yaml
ollama:
  enabled: true
  base_url: 'http://localhost:11434'

# model settings
models:
  main: "llama3:70b"
  tool: "llama3:70b"
  embedding: "bge-m3"

# BioMaster settings
biomaster:
  executor: true
  id: '005'
  generate_plan: true
  use_ollama: true
```

Note: If you use a local LLM, it is suggested to choose a model with more than 30B parameters.
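To confirm the Ollama server is reachable and the model was downloaded, you can query Ollama's tag-listing endpoint:

```bash
# Lists locally available models; llama3:70b should appear after the download.
curl http://localhost:11434/api/tags
```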
BioMaster stores all output files in the `./output/` directory.

- `./output/{id}_PLAN.json`
  Contains the full execution plan. BioMaster will follow this step-by-step.
- `./output/{id}_Step_{step_number}.sh`
  The shell script generated for a specific step.
  Example: `001_Step_1.sh` is the script for the first step of task `id="001"`.
- `./output/{id}_DEBUG_Input_{step_number}.json`
  Internal input for script execution. Can usually be ignored.
- `./output/{id}_DEBUG_Output_{step_number}.json`
  Contains execution output and status for a specific step:
  - `"shell"`: If the step succeeded, this is usually empty. If the step failed, this contains a new shell command generated by the Debug Agent to fix the issue.
  - `"analyze"`: Analysis summary of the step's output.
  - `"output_filename"`: Name of the output file produced in this step.
  - `"stats"`: Indicates whether the step succeeded (`true`) or failed (`false`). If `false`, the Debug Agent will attempt to fix the error and regenerate the command.
- `./output/{id}/`
  All generated output files for this task will be stored in this folder.
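As an illustration (the task ID and step number are hypothetical; field names as listed above), a step's status can be inspected directly:

```python
import json

# Inspect step 1 of task "001" using the DEBUG_Output naming pattern.
with open("./output/001_DEBUG_Output_1.json") as f:
    dbg = json.load(f)

print("succeeded:", dbg["stats"])
print("analysis:", dbg["analyze"])
print("output file:", dbg["output_filename"])
```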
- Stop the running task.
- In `config.yaml`, change `generate_plan: true` to `generate_plan: false`.
  This will prevent BioMaster from generating a new plan.
- Manually edit the plan in `./output/{id}_PLAN.json`.
- Run the script again:

  ```bash
  python run.py
  ```
- Stop the running task.
- If you want to roll back previous steps (Step 1–N), either:
  - set `"stats": false` in the corresponding `DEBUG_Output` file, `./output/{id}_DEBUG_Output_{step_number}.json`, or
  - simply delete the `DEBUG_Output` file to re-trigger execution.
- If you want to modify the current step:
  - Edit the corresponding shell script: `./output/{id}_Step_{step_number}.sh`
  - Do not delete the related `DEBUG_Output` JSON file; BioMaster will reuse it and execute the updated script.
- Run the script again:

  ```bash
  python run.py
  ```
Example: Suppose task 001 produced a result, `./output/001/example.h5ad`, and you want to use this result for visualization:

- Stop the running task.
- Modify the goal in `config.yaml`:

  ```yaml
  goal: 'I want to visualize the result, this result is ./output/001/example.h5ad, which is a h5ad file, single cell data which is after normalization and quality control.'
  ```

- Modify the input data in `config.yaml`:

  ```yaml
  data:
    files:
      - './output/001/example.h5ad: a h5ad file, single cell data which is after normalization and quality control.'
  ```

- Modify the task ID in `config.yaml`:

  ```yaml
  id: '002'
  ```

- Run the script again:

  ```bash
  python run.py
  ```
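Before handing the intermediate file to the new task, you can optionally sanity-check it (assuming the `anndata` package is installed; this check is not part of BioMaster itself):

```python
import anndata as ad

# Quick look at the result of task 001 before reusing it in task 002.
adata = ad.read_h5ad("./output/001/example.h5ad")
print(adata)  # summarizes n_obs x n_vars and stored annotations
```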
Run the following commands to launch the BioMaster UI:

```bash
conda activate agent
python runv.py
```

Once started, open the following URL in your browser: http://127.0.0.1:7860/
The UI looks like this:
Make sure your analysis workflows and tools are already added to the corresponding RAGs.
Set your Base URL and API Key in the designated fields in the UI.
Provide the following inputs:
- Task ID: A unique identifier for this task.
- Input Data Path: Path to your data.
- Goal: A description of the analysis you want BioMaster to perform.
Click the "Generate Plan" button to allow BioMaster to generate a task execution plan.
After the plan is ready, click the "Execute Plan" button to start the automated execution process.
If you need to interrupt execution, click the "Stop PLAN" button.
Click the "Load and Show" button to:
- Load and review results of a previous task, or
- Display outputs from the current task.
- `agents/`: Contains agent classes for task management and execution.
- `scripts/`: Some example scripts.
- `output/`: Output directory where results and logs are saved.
- `doc/`: Stores documentation files for the workflows.
- `data/`: Usually used to store input data files.
If you use BioMaster in your work, please cite the following paper:
```bibtex
@article{su2025biomaster,
  title={BioMaster: Multi-agent System for Automated Bioinformatics Analysis Workflow},
  author={Su, Houcheng and Long, Weicai and Zhang, Yanlin},
  journal={bioRxiv},
  pages={2025--01},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}
```

You can also join the BioMaster community:
- Discord:
- WeChat:
This project is licensed under the following terms:
- Code: Licensed under the MIT License. See LICENSE for details.
- Data and Documentation: Licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
- This project uses the langchain library for integration with OpenAI and other tools.
- Thanks to all contributors and the open-source community for making BioMaster possible!