Preface
I. What is Xinference?
Xorbits Inference (Xinference) is an open-source platform that simplifies running and integrating a wide range of AI models. With Xinference, you can run inference with any open-source LLM, embedding model, or multimodal model, either in the cloud or on-premises, and build powerful AI applications on top.
II. Quick Start
1. Installation
Create a new conda environment:
conda create -n xinf python=3.10
conda activate xinf
Install the Transformers engine:
pip install "xinference[transformers]"
Install the vLLM engine:
pip install "xinference[vllm]"
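A quick way to confirm the install succeeded (a minimal sketch using only the standard library):
import importlib.metadata

# Prints the installed xinference version; raises PackageNotFoundError
# if the package is missing from the active environment.
print(importlib.metadata.version("xinference"))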
2. Start the service from the command line
# XINFERENCE_MODEL_SRC=modelscope makes models download from ModelScope by default
# XINFERENCE_HOME sets the default cache directory for model files
XINFERENCE_MODEL_SRC=modelscope XINFERENCE_HOME=/home/PM xinference-local --host 0.0.0.0 --port 9997
Open http://127.0.0.1:9997 in a browser.
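You can also verify the server is reachable from Python before continuing (a minimal sketch; list_models reports the models currently launched, which is empty right after startup):
from xinference.client import Client

client = Client("http://127.0.0.1:9997")
# Returns the models currently running on this server; empty at first.
print(client.list_models())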
3. Register a custom model
Create model.json with the following content:
{
    "version": 1,
    "context_length": 4096,
    "model_name": "llama-3-sqlcoder-8b",
    "model_lang": [
        "en"
    ],
    "model_ability": [
        "generate"
    ],
    "model_family": "llama-3",
    "model_specs": [
        {
            "model_format": "pytorch",
            "model_size_in_billions": 8,
            "quantizations": [
                "4-bit",
                "8-bit",
                "none"
            ],
            "model_id": "llama-3-sqlcoder-8b",
            "model_uri": "/home/PM/llama-3-sqlcoder-8b"
        }
    ]
}
Parameter notes:
context_length  # maximum context length (prompt plus generation), in tokens
model_family    # the model's architecture family
model_ability   # what the model can do ("generate" here; chat models declare "chat")
model_uri       # absolute path to the local model files
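Before registering, it can save a round trip to check that model.json parses and carries the fields used above (a small sketch; the key list mirrors this example, not an exhaustive schema):
import json

# Load and sanity-check model.json before handing it to the client.
with open("model.json") as fd:
    spec = json.load(fd)  # raises ValueError on malformed JSON

for key in ("model_name", "model_family", "model_ability", "model_specs"):
    assert key in spec, f"missing required field: {key}"
print("model.json looks OK:", spec["model_name"])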
Create CustomModel.py with the following content:
import json
from xinference.client import Client

with open('model.json') as fd:
    model = fd.read()

# replace with real xinference endpoint
endpoint = 'http://0.0.0.0:9997'
client = Client(endpoint)
client.register_model(model_type="LLM", model=model, persist=False)
# Run it:
python CustomModel.py
# Or register via the command line instead:
xinference register --model-type LLM --file model.json --persist
List the built-in and custom models:
xinference registrations --model-type LLM
The row with Is-built-in set to False is the model we just registered; a sketch for removing a registration follows the listing.
(xinf) root@DESKTOP-SUTT5JT:/home/PM# xinference registrations --model-type LLM
Type Name Language Ability Is-built-in
------ --------------------------- ------------------------------------------------------------ ------------------ -------------
LLM aquila2 ['zh'] ['generate'] True
LLM aquila2-chat ['zh'] ['chat'] True
LLM aquila2-chat-16k ['zh'] ['chat'] True
LLM baichuan ['en', 'zh'] ['generate'] True
LLM baichuan-2 ['en', 'zh'] ['generate'] True
LLM baichuan-2-chat ['en', 'zh'] ['chat'] True
LLM baichuan-chat ['en', 'zh'] ['chat'] True
LLM c4ai-command-r-v01 ['en', 'fr', 'de', 'es', 'it', 'pt', 'ja', 'ko', 'zh', 'ar'] ['chat'] True
LLM chatglm ['en', 'zh'] ['chat'] True
LLM chatglm2 ['en', 'zh'] ['chat'] True
LLM chatglm2-32k ['en', 'zh'] ['chat'] True
LLM chatglm3 ['en', 'zh'] ['chat', 'tools'] True
LLM chatglm3-128k ['en', 'zh'] ['chat'] True
LLM chatglm3-32k ['en', 'zh'] ['chat'] True
LLM code-llama ['en'] ['generate'] True
LLM code-llama-instruct ['en'] ['chat'] True
LLM code-llama-python ['en'] ['generate'] True
LLM cogvlm2 ['en', 'zh'] ['chat', 'vision'] True
LLM csg-wukong-chat-v0.1 ['en'] ['chat'] True
LLM deepseek ['en', 'zh'] ['generate'] True
LLM deepseek-chat ['en', 'zh'] ['chat'] True
LLM deepseek-coder ['en', 'zh'] ['generate'] True
LLM deepseek-coder-instruct ['en', 'zh'] ['chat'] True
LLM deepseek-vl-chat ['en', 'zh'] ['chat', 'vision'] True
LLM falcon ['en'] ['generate'] True
LLM falcon-instruct ['en'] ['chat'] True
LLM gemma-2-it ['en'] ['chat'] True
LLM gemma-it ['en'] ['chat'] True
LLM glaive-coder ['en'] ['chat'] True
LLM glm-4v ['en', 'zh'] ['chat', 'vision'] True
LLM glm4-chat ['en', 'zh'] ['chat', 'tools'] True
LLM    glm4-chat-1m                ['en', 'zh']                                                 ['chat', 'tools']  True
LLM internlm-20b ['en', 'zh'] ['generate'] True
LLM internlm-7b ['en', 'zh'] ['generate'] True
LLM internlm-chat-20b ['en', 'zh'] ['chat'] True
LLM internlm-chat-7b ['en', 'zh'] ['chat'] True
LLM internlm2-chat ['en', 'zh'] ['chat'] True
LLM internlm2.5-chat ['en', 'zh'] ['chat'] True
LLM internlm2.5-chat-1m ['en', 'zh'] ['chat'] True
LLM internvl-chat ['en', 'zh'] ['chat', 'vision'] True
LLM llama-2 ['en'] ['generate'] True
LLM llama-2-chat ['en'] ['chat'] True
LLM llama-3 ['en'] ['generate'] True
LLM llama-3-instruct ['en'] ['chat'] True
LLM llama-3-sqlcoder-8b ['en'] ['generate'] False
LLM llama-3.1 ['en', 'de', 'fr', 'it', 'pt', 'hi', 'es', 'th'] ['generate'] True
LLM llama-3.1-instruct ['en', 'de', 'fr', 'it', 'pt', 'hi', 'es', 'th'] ['chat'] True
(xinf) root@DESKTOP-SUTT5JT:/home/PM#
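To remove the custom registration, for example before re-registering with an edited model.json, the client exposes unregister_model (a minimal sketch):
from xinference.client import Client

client = Client("http://127.0.0.1:9997")
# Remove the custom model registered earlier; the CLI equivalent is
# `xinference unregister --model-type LLM --model-name llama-3-sqlcoder-8b`.
client.unregister_model(model_type="LLM", model_name="llama-3-sqlcoder-8b")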
4. Run the model
xinference launch --model-name llama-3-sqlcoder-8b --model-format pytorch --model-engine transformers
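The same launch can be done from Python (a minimal sketch; launch_model returns the UID under which the model is served):
from xinference.client import Client

client = Client("http://127.0.0.1:9997")
# Mirrors the CLI flags above; the returned uid addresses the running model.
uid = client.launch_model(
    model_name="llama-3-sqlcoder-8b",
    model_engine="transformers",
    model_format="pytorch",
)
print(uid)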
Test it with the Chat API:
curl -X 'POST' \
  'http://127.0.0.1:9997/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama-3-sqlcoder-8b",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is the largest animal?"
      }
    ]
  }'
The request fails:
{"detail":"[address=0.0.0.0:42545, pid=65026] Model model_format='pytorch' model_size_in_billions=8 quantizations=['4-bit', '8-bit', 'none'] model_id='llama-3-sqlcoder-8b' model_hub='huggingface' model_uri='/home/PM/llama-3-sqlcoder-8b' model_revision=None is not for chat."}
The model was registered with model_ability ["generate"], so the Chat API refuses it ("is not for chat"). Use the Generate API instead:
curl -X 'POST' \
  'http://127.0.0.1:9997/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama-3-sqlcoder-8b",
    "prompt": "Generate a SQL query to answer this question: `{Query the top 10 order records}`",
    "max_tokens": 4096,
    "temperature": 0,
    "top_p": 1
  }'
The result is a long, seemingly endless stream of generated text: a generate-only base model has no chat template or stop token, so at temperature 0 it keeps producing tokens until it exhausts max_tokens (note the OpenAI-compatible endpoint takes max_tokens, not max_length). Bounding the output explicitly helps, as in the sketch below.
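One way to bound the generation (a sketch assuming the openai Python package and no auth configured on the server; the stop list is just an illustrative choice for SQL output):
import openai

# Point the OpenAI client at the local Xinference server.
client = openai.OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not-needed")

resp = client.completions.create(
    model="llama-3-sqlcoder-8b",
    prompt="Generate a SQL query to answer this question: `Query the top 10 order records`",
    max_tokens=256,   # hard cap on generated tokens
    temperature=0,
    stop=[";"],       # assumption: stop at the end of the first SQL statement
)
print(resp.choices[0].text)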