Magistral Small: A Guide With a Demo Project on vLLM and Ollama

Latest recommended article published on 2025-09-07 19:29:44

吴脑的键客

License: CC 4.0 BY-SA

Article link: https://blog.csdn.net/weixin_41446370/article/details/148683785

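Learn how to set up and run Mistral's Magistral Small model with Ollama and vLLM, and build a demo project that debugs flawed logic.

Mistral has released its first reasoning model, Magistral, in two versions: Magistral Small (open weights) and Magistral Medium (proprietary).

This blog focuses on Magistral Small, an open-weight reasoning model designed for tasks that need structured logic, multilingual understanding, and traceable explanations. Paired with a high-throughput inference engine such as vLLM or an easy-to-use tool such as Ollama, it works well for debugging logic flaws and other reasoning tasks.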

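This tutorial walks through how to:

Run the Magistral Small (24B) model with vLLM and Ollama
Build a demo project that debugs logic through transparent, step-by-step reasoning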

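What Is Mistral Magistral?

Magistral is Mistral AI's first dedicated reasoning model, built for step-by-step logical reasoning, multilingual accuracy, and traceable output. It was released in two variants:

Magistral Small (24B): an open-weight model released under the Apache 2.0 license, suitable for local deployment.
Magistral Medium: a more capable enterprise model, available through Mistral's Le Chat, SageMaker, and other enterprise cloud platforms.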

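Magistral Small, the open model we focus on, supports a 128K context window, although 40K is recommended for stable performance. It was trained with supervised fine-tuning on reasoning traces from Magistral Medium, combined with reinforcement learning.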

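How to Set Up and Run Magistral Small Locally With Ollama

In this section, we run local inference with Mistral's Magistral model using Ollama. Note that the quantized model takes about 14GB, so it fits on a single RTX 4090 or a MacBook with 32GB of RAM. This demo was run on a MacBook Pro with an M3 chip.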
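As a rough sanity check on the 14GB figure, weight storage scales with parameter count times bits per weight. The sketch below is back-of-the-envelope arithmetic only: the function name is hypothetical, the 4.8 bits-per-weight average for a mixed 4-bit quantization is an assumption, and KV cache and runtime overhead are ignored.

```python
def approx_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Estimate weight storage in decimal gigabytes (weights only, no KV cache)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


print(approx_weight_gb(24, 4))    # pure 4-bit weights
print(approx_weight_gb(24, 4.8))  # assumed average for a mixed 4-bit quantization
print(approx_weight_gb(24, 16))   # unquantized 16-bit weights
```

Under these assumptions, the quantized weights land close to the size Ollama reports, while 16-bit weights need roughly 48GB, which is consistent with the 80GB GPU used in the vLLM section.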

Step 1: Pull the Model via Ollama

Download Ollama for macOS, Windows, or Linux from: https://ollama.com/download

Follow the installer's instructions, then verify the installation by running the following command in a terminal:

ollama --version

Next, pull the Magistral model by running:

ollama pull magistral

This downloads the Magistral model to your local machine. Note: the model is about 14GB, so the download takes some time.
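To confirm from Python that the pull succeeded, one option is to ask the local Ollama server which models it has. This is a minimal sketch, assuming Ollama is listening on its default address and that its /api/tags endpoint returns a "models" list with "name" fields; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # default local Ollama address


def has_model(tags_payload: dict, name: str) -> bool:
    """Return True if a listed model is named `name` or `name:<tag>`."""
    for entry in tags_payload.get("models", []):
        model_name = entry.get("name", "")
        if model_name == name or model_name.startswith(name + ":"):
            return True
    return False


def fetch_tags(url: str = OLLAMA_TAGS_URL) -> dict:
    """Fetch the list of locally available models from a running Ollama server."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        print("magistral available:", has_model(fetch_tags(), "magistral"))
    except OSError as exc:
        print("Could not reach Ollama:", exc)
```

Keeping the parsing in has_model() separate from the network call makes the check easy to exercise without a running server.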

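Step 2: Install Dependencies

We start by installing the required dependencies. The code in the next steps imports gradio and requests, so both are installed here.

pip install ollama
pip install requests gradio

With the dependencies installed, we are ready to run inference.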

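Step 3: Create a Structured Prompt Template

Now we set up a prompt template that guides the model's thinking, following the structure described in the original Magistral paper.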

import json

import gradio as gr
import requests


def build_prompt(flawed_logic):
    """Wrap the user's flawed argument in Magistral's system and instruction tags."""
    return f"""<s>[SYSTEM_PROMPT]
A user will ask you to solve a task. You should first draft your thinking process (inner monologue) until you have derived the final answer. Afterwards, write a self-contained summary of your thoughts.
Your thinking process must follow the template below:
<think>
Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and detailed as needed until you're confident.
</think>
Do not mention that you're debugging — just present your thought process and conclusion naturally.
[/SYSTEM_PROMPT][INST]
Here is a flawed solution. Can you debug it and correct it step by step?
\"\"\"{flawed_logic}\"\"\"
[/INST]
"""

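The function above returns a formatted prompt that instructs Magistral to:

Think step by step inside <think>...</think> tags
Give a clear conclusion after its internal reasoning
Explain naturally, without mentioning that it is debugging

This structure matters particularly for models such as Magistral that were trained with tool-augmented prompts. The same system prompt structure also applies to math and coding problems.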
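To illustrate reusing the structure for other problem types, the sketch below swaps only the instruction line while keeping the same layout as build_prompt(). The function name, the task keys, and the "math" and "code" wordings are hypothetical; only the "logic" wording comes from the tutorial, and the system text is shortened for brevity.

```python
# Hypothetical task wordings; only the "logic" entry comes from the tutorial.
TASK_INSTRUCTIONS = {
    "logic": "Here is a flawed solution. Can you debug it and correct it step by step?",
    "math": "Here is a math problem. Can you solve it step by step?",
    "code": "Here is a piece of code with a bug. Can you find and fix it step by step?",
}

SYSTEM_TEXT = (
    "A user will ask you to solve a task. You should first draft your thinking "
    "process (inner monologue) until you have derived the final answer. "
    "Afterwards, write a self-contained summary of your thoughts.\n"
    "Your thinking process must follow the template below:\n"
    "<think>\n"
    "Your thoughts or/and draft, like working through an exercise on scratch paper.\n"
    "</think>"
)


def build_task_prompt(task: str, body: str) -> str:
    """Wrap `body` in the same system/instruction layout as build_prompt()."""
    instruction = TASK_INSTRUCTIONS[task]  # raises KeyError for unknown tasks
    return (
        f"<s>[SYSTEM_PROMPT]\n{SYSTEM_TEXT}\n[/SYSTEM_PROMPT][INST]\n"
        f"{instruction}\n\"\"\"{body}\"\"\"\n[/INST]\n"
    )


print(build_task_prompt("math", "What is the sum of the first 10 odd numbers?"))
```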

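Step 4: Stream Inference and Build the Gradio Interface

This step reads Magistral's output as a stream from Ollama's local API. Because the focus is on debugging logic flaws through traceable, step-by-step reasoning, users need to see how the model reaches its conclusion. The explanation is then shown in a simple Gradio interface.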

def call_ollama_stream(flawed_logic):
    """Send the prompt to the local Ollama server and collect the streamed reply."""
    prompt = build_prompt(flawed_logic)
    response_text = ""
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "magistral", "prompt": prompt, "stream": True},
        stream=True,
    ) as r:
        r.raise_for_status()  # surface HTTP errors instead of parsing an error body
        # Ollama streams one JSON object per line; "response" holds the new text.
        for line in r.iter_lines():
            if line:
                content = json.loads(line).get("response", "")
                response_text += content
    return response_text


with gr.Blocks(theme=gr.themes.Base()) as demo:
    gr.Markdown("## Chain-of-Logic Debugger (Magistral + Ollama)")
    gr.Markdown("Paste a flawed logical argument or math proof, and Magistral will debug it with step-by-step reasoning.")
    with gr.Row():
        input_box = gr.Textbox(lines=8, label="Flawed Logic / Proof")
        output_box = gr.Textbox(lines=15, label="Debugged Explanation")
    debug_button = gr.Button("Run Debugger")
    debug_button.click(fn=call_ollama_stream, inputs=input_box, outputs=output_box)

demo.launch(debug=True, share=True)

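Here is an overview of what happens:

We first use the reusable build_prompt() function to wrap the user's input in a structured prompt that guides the model with <think> reasoning tags.
When the user submits a flawed proof or logical statement, call_ollama_stream() sends the prompt in a streaming POST request to Ollama's HTTP API at localhost:11434.
The function reads the streamed response line by line with the response object's iter_lines() method. For each line received, it extracts the response field from the JSON payload and appends it to a running text buffer.
Once every streamed line has been collected, the complete model response is returned and displayed in the Gradio interface.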
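Because call_ollama_stream() returns only after the stream ends, the output box updates once. If live updates are wanted, one possible variation is a generator that yields the text accumulated so far. The sketch below shows just the line-parsing part on simulated data; the function name is hypothetical, and the chunk shape (a "response" field plus a "done" flag on the last chunk) is assumed to match what Ollama streams from /api/generate.

```python
import json
from typing import Iterable, Iterator


def accumulate_stream(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text generated so far after each non-empty JSON line."""
    text = ""
    for line in lines:
        if not line:
            continue  # skip keep-alive blank lines
        payload = json.loads(line)
        text += payload.get("response", "")
        yield text
        if payload.get("done"):
            break  # the final chunk is marked with "done": true


# Simulated chunks in the shape streamed by the generate endpoint.
fake_lines = [
    b'{"response": "Cancelling ", "done": false}',
    b"",
    b'{"response": "x - y divides by zero.", "done": false}',
    b'{"response": "", "done": true}',
]
for partial in accumulate_stream(fake_lines):
    print(partial)
```

Inside call_ollama_stream(), replacing the loop and the final return with `yield from accumulate_stream(r.iter_lines())` would turn the handler into a generator, which Gradio treats as a streaming output.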

Here is the input I tried:

Assume x = y. Then, x² = xy. Subtracting both sides gives x² - y² = xy - y². So, (x+y)(x−y) = y(x−y). Cancelling x−y gives x+y = y. But since x = y, this means 2y = y → 2 = 1.
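For reference, the flaw a correct diagnosis should identify can be written out directly. Under the proof's own assumption, the factor being cancelled is zero:

```latex
\begin{align*}
x = y \;&\Rightarrow\; x - y = 0, \\
(x+y)(x-y) = y(x-y) \;&\Longleftrightarrow\; (x+y)\cdot 0 = y \cdot 0 \;\Longleftrightarrow\; 0 = 0 .
\end{align*}
```

The step from this identity to x + y = y divides both sides by x − y = 0, which is undefined, so the conclusion 2 = 1 does not follow.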

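In my tests on an M3 MacBook Pro, the model handled simple logic chains and math proofs quite well. On deeper reasoning tasks or longer chains of thought, it occasionally missed edge cases, which is to be expected of a 24B-parameter open model. This approach suits lightweight reasoning demos or on-device chain-of-thought applications that do not depend on a cloud API.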

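Running Magistral Small Efficiently With vLLM

This section shows how to configure a powerful GPU instance on RunPod, deploy Mistral's Magistral model with vLLM, and expose an OpenAI-compatible API for both local and remote inference.

Environment: A100 SXM GPU with 80GB VRAM

Install vLLM and its dependencies, either from a terminal or from a Jupyter Notebook.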

pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install gradio openai

Also make sure you are using mistral_common >= 1.6.0 by running:

python -c "import mistral_common; print(mistral_common.__version__)"
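The command above only prints the version, so the comparison against 1.6.0 is left to the reader. A minimal sketch of automating it follows; the helper names are hypothetical, and the parser is deliberately simple: it compares numeric components only and ignores pre-release suffixes such as "rc1".

```python
def version_tuple(version: str) -> tuple:
    """Turn '1.6.0' into (1, 6, 0), ignoring non-numeric suffixes."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)  # pad short versions such as "1.6"
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "1.6.0") -> bool:
    """Compare versions numerically, so that 1.10.0 counts as newer than 1.6.0."""
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    try:
        import mistral_common
        ok = meets_minimum(mistral_common.__version__)
        print(mistral_common.__version__, "ok" if ok else "too old")
    except ImportError:
        print("mistral_common is not installed")
```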

Serving the Model

Now we start the model server. From a terminal, run the following command:

vllm serve mistralai/Magistral-Small-2506 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice

Keep this terminal running: the command launches Magistral Small with vLLM and serves it through a fast, OpenAI-compatible API endpoint (http://localhost:8000/v1). Each flag is explained below:

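Flag	Description
`mistralai/Magistral-Small-2506`	The Hugging Face model identifier. vLLM downloads the model automatically if it is not already present.
`--tokenizer-mode mistral`	Makes vLLM tokenize input with Mistral's own tokenizer logic.
`--config-format mistral`	Indicates that the model configuration uses Mistral's custom format rather than the Hugging Face default.
`--load-format mistral`	Loads the model weights using the layout Mistral expects (important for compatibility).
`--tool-call-parser mistral`	Parses tool calls according to Mistral's tool-call format.
`--enable-auto-tool-choice`	Lets the model decide on its own when to emit a tool call. It is optional for this demo, but useful for models trained for tool-based reasoning.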
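Before building the app, it can help to confirm that the server is up and serving the expected model. The sketch below is one way to do that, assuming the server exposes the OpenAI-style /v1/models route at the address used in this tutorial; the helper names are hypothetical.

```python
import json
import urllib.request


def model_ids(models_payload: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [item["id"] for item in models_payload.get("data", []) if "id" in item]


def fetch_model_ids(base_url: str = "http://localhost:8000/v1") -> list:
    """Query a running server; return [] if it cannot be reached or parsed."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            return model_ids(json.load(resp))
    except (OSError, ValueError):
        return []


if __name__ == "__main__":
    print(fetch_model_ids())
```

An empty list usually means the server is still loading weights or is listening on a different port.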

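Debugging Flawed Logic With Magistral and vLLM

We will build a demo in which Magistral is asked to debug flawed logic or a flawed math proof. The model outputs a detailed inner monologue wrapped in <think> tags, followed by a final summary.

Initialize the OpenAI Client and System Prompt

First, set up the imports and initialize the OpenAI client in a Jupyter Notebook. Then configure the Magistral system prompt as recommended in the original Magistral paper.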

import re
import time

import gradio as gr
from openai import OpenAI

# vLLM does not check the API key by default, so a placeholder string is enough.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Plain-text system prompt: the chat endpoint applies Mistral's chat template,
# so control tokens such as <s>[SYSTEM_PROMPT] are not written by hand here.
SYSTEM_PROMPT = """A user will ask you to solve a task. You should first draft your thinking process (inner monologue) until you have derived the final answer. Afterwards, write a self-contained summary of your thoughts.
<think>
Your thoughts or draft, like working through an exercise on scratch paper.
</think>
Here, provide a concise summary that reflects your reasoning and presents a clear final answer to the user.
Problem:
"""

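SYSTEM_PROMPT defines the structured response format the model should follow:

The model first generates a <think> reasoning trace (similar to an inner monologue),
then writes the final summary after the </think> tag.

Stream the Model Output and Allow Interruption

Next, we handle the model's streamed output, setting temperature, top_p, and max_tokens as recommended in the original Magistral blog.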

# Streaming logic with stop control
def debug_faulty_logic_stream(faulty_proof, stop_signal):
    stop_signal["stop"] = False  # reset the flag at the start of every run
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Here is a flawed logic or math proof. Can you debug it step-by-step?\n\n{faulty_proof}"}
    ]
    try:
        response = client.chat.completions.create(
            model="mistralai/Magistral-Small-2506",
            messages=messages,
            stream=True,
            temperature=0.7,
            top_p=0.95,
            max_tokens=2048
        )
        buffer = ""
        for chunk in response:
            if stop_signal.get("stop"):
                break
            delta = chunk.choices[0].delta
            if hasattr(delta, "content") and delta.content:
                buffer += delta.content
                # Hide the reasoning block, including one that is still open,
                # so that only the summary reaches the output box.
                filtered = re.sub(r"<think>.*?(?:</think>|\Z)", "", buffer, flags=re.DOTALL).strip()
                yield filtered
            time.sleep(0.02)
    except Exception as e:
        yield f"Error: {str(e)}"


# Set stop flag when stop button is clicked
def stop_streaming(stop_signal):
    stop_signal["stop"] = True
    # gr.Textbox.update was removed in Gradio 4; gr.update is the generic form.
    return gr.update(value="Stopped.")

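The snippet above handles real-time streaming of the model's token-by-token output. stop_signal lets the user interrupt the stream by clicking the "Stop" button.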
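The interruption mechanism is a shared mutable flag that the generator checks before each chunk. The sketch below isolates that pattern without Gradio or a model; the function name and the sample chunks are hypothetical.

```python
def stream_with_stop(chunks, stop_signal):
    """Yield accumulated text until the shared flag is set."""
    buffer = ""
    for chunk in chunks:
        if stop_signal.get("stop"):
            break
        buffer += chunk
        yield buffer


stop_signal = {"stop": False}
stream = stream_with_stop(["Step 1. ", "Step 2. ", "Step 3."], stop_signal)
print(next(stream))         # the first chunk arrives
stop_signal["stop"] = True  # what the Stop button's handler does
print(list(stream))         # the generator ends without yielding more
```

The flag is only checked between chunks, so a stop request takes effect when the next chunk arrives rather than instantly.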

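The buffer accumulates everything the model produces, but a regular expression strips the <think> block so that only the summary is displayed. If an error occurs (for example, a network problem), the error message is returned instead.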
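If both the reasoning and the summary should be visible, for example in two separate boxes, the buffer can be split rather than filtered. The helper below is a hypothetical sketch of that alternative; it handles a response whose <think> block is still open, but a tag that has only partly arrived (such as "<thi") is briefly treated as summary text.

```python
def split_reasoning(buffer: str):
    """Return (thinking, summary) for a complete or still-streaming response."""
    open_tag, close_tag = "<think>", "</think>"
    start = buffer.find(open_tag)
    if start == -1:
        return "", buffer.strip()  # no reasoning block at all
    before = buffer[:start]
    rest = buffer[start + len(open_tag):]
    end = rest.find(close_tag)
    if end == -1:
        return rest.strip(), before.strip()  # the model is still thinking
    thinking = rest[:end]
    summary = before + rest[end + len(close_tag):]
    return thinking.strip(), summary.strip()


print(split_reasoning("<think>x - y is zero"))
print(split_reasoning("<think>x - y is zero</think>The proof divides by zero."))
```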

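Build the Gradio Interface

Finally, we put together a simple Gradio app that lets users enter their flawed logic or proof and submit it to the model for reasoning.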

with gr.Blocks() as demo:
    gr.Markdown("## Chain-of-Logic Debugger (Streaming via Magistral + vLLM)")

    input_box = gr.Textbox(
        label="Paste Your Faulty Logic or Proof",
        lines=8,
        placeholder="e.g., Assume x = y, then x² = xy..."
    )
    output_box = gr.Textbox(label="Corrected Reasoning (Streaming Output)")

    submit_btn = gr.Button("Submit")
    stop_btn = gr.Button("Stop")

    stop_flag = gr.State({"stop": False})

    submit_btn.click(
        fn=debug_faulty_logic_stream,
        inputs=[input_box, stop_flag], 
        outputs=output_box
    )

    stop_btn.click(
        fn=stop_streaming,
        inputs=stop_flag,
        outputs=output_box
    )

if __name__ == "__main__":
    demo.launch(share=True, inbrowser=True, debug=True)

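The code above creates a simple Gradio interface with:

A text input box where users paste the flawed logic;
An output box that updates in real time as tokens stream in;
A Submit button that starts the debugger and a Stop button that interrupts it.

The interface uses gr.State to track whether the user wants to interrupt the stream. The launch() method then runs the app locally and opens it in the browser. Here is the test input I tried: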

Assume x = y. Then, x² = xy. Subtracting both sides gives x² - y² = xy - y². So, (x+y)(x−y) = y(x−y). Cancelling x−y gives x+y = y. But since x = y, this means 2y = y → 2 = 1.

You can switch to the terminal running the vLLM server to watch the logs for KV cache usage and to see the hit rate rise as the model returns output.

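Compared with Ollama, vLLM was noticeably faster and more stable during inference. Streaming was smooth and the output was mostly well structured. However, the model occasionally repeated points inside its <think> block, possibly because autoregressive decoding was run without any repetition penalty in the sampling settings.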
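If the repetition is a problem, one thing to try is adding a penalty to the sampling settings. The sketch below only builds the keyword arguments: frequency_penalty is a standard OpenAI chat parameter, while passing repetition_penalty through extra_body assumes the vLLM server accepts it as an extra sampling parameter. The function name and any penalty values are hypothetical, and whether they reduce the repetition seen here is untested.

```python
def sampling_kwargs(repetition_penalty=None, frequency_penalty=None):
    """Build sampling arguments for client.chat.completions.create()."""
    kwargs = {"temperature": 0.7, "top_p": 0.95, "max_tokens": 2048}
    if frequency_penalty is not None:
        kwargs["frequency_penalty"] = frequency_penalty  # standard OpenAI parameter
    if repetition_penalty is not None:
        # Assumption: the server reads this as an extra sampling parameter.
        kwargs["extra_body"] = {"repetition_penalty": repetition_penalty}
    return kwargs


print(sampling_kwargs(repetition_penalty=1.05))
```

The result can be unpacked into the existing call, for example `client.chat.completions.create(model=..., messages=messages, stream=True, **sampling_kwargs(repetition_penalty=1.05))`.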

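Conclusion

In this tutorial, we used Magistral Small, Mistral's reasoning-first large language model, to build a logic debugger step by step. We ran the model locally through Ollama for quick on-device testing, and used vLLM for high-throughput GPU inference behind an OpenAI-compatible API. We also tested the model's reasoning through a Gradio app. Whether you are debugging flawed logic or building reasoning-focused AI tools, Magistral Small is a strong option.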