Sophon LLM_api_server development
1. Introduction
The LLM_api_server routine is an OpenAI-API-compatible LLM service built on the BM1684X, currently supporting ChatGLM3, Qwen, Qwen1.5, and Qwen2.
1.1 Features
- Supports BM1684X (PCIe, SoC) and BM1688 (SoC).
- Supports calls through the openai Python library.
- Supports calls through the web (HTTP) interface.
1.2 Project Directory
LLM_api_server
├── models
│   └── BM1684X
│       ├── chatglm3-6b_int4.bmodel            # chatglm3-6b model for BM1684X
│       └── qwen2-7b_int4_seq512_1dev.bmodel   # qwen2-7b model for BM1684X
├── python
│   ├── utils                                  # utility library
│   ├── api_server.py                          # service startup program
│   ├── config.yaml                            # service configuration file
│   ├── request.py                             # request example program
│   └── requirements.txt                       # Python dependencies
└── scripts
    ├── download_model.sh                      # model download script
    └── download_tokenizer.sh                  # tokenizer download script
2. Operation steps
1. Prepare data and models
1.1 Clone the official Sophgo sophon-demo project (or copy and upload LLM_api_server to /data on the box)
git clone https://github.com/sophgo/sophon-demo.git
cd sophon-demo/application/LLM_api_server
cd /data/LLM_api_server ##If only LLM_api_server was uploaded, just enter this directory instead
1.2 Install unzip and the other required packages. If they are already installed, skip this step. On non-Ubuntu systems, install them with yum or another package manager as appropriate.
sudo apt-get update
sudo apt-get install pybind11-dev
pip3 install sentencepiece transformers==4.30.2
pip3 install gradio==3.39.0 mdtex2html==1.2.0 dfss
sudo apt install unzip
chmod -R +x scripts/
./scripts/download_tokenizer.sh ##Download the tokenizer
./scripts/download_model.sh ##Download the model file
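After the scripts finish, you can sanity-check the downloads from the LLM_api_server directory. A minimal sketch (the paths below come from the project directory and config.yaml in this document; adjust them if you downloaded a different model):

# Hypothetical sanity check: verify the expected files exist after download.
import os

expected = [
    "models/BM1684X/qwen2-7b_int4_seq512_1dev.bmodel",  # model file
    "python/utils/qwen/token_config",                   # tokenizer directory
]
for path in expected:
    print(("OK     " if os.path.exists(path) else "MISSING"), path)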
2. Python routine
2.1 Environmental Preparation
pip3 install -r python/requirements.txt
##The sophon-sail version required by this routine is relatively new, so a usable sophon-sail whl package is provided here. In a SoC environment, it can be downloaded with the following command
python3 -m dfss --url=open@sophgo.com:sophon-demo/Qwen/sophon_arm-3.8.0-py3-none-any.whl
pip3 install sophon_arm-3.8.0-py3-none-any.whl ##Install sophon-sail
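To verify the installation, you can run a one-line import check (sophon.sail is the Python import name used by sophon-sail):

python3 -c "import sophon.sail; print('sophon-sail imported successfully')"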
2.2 Start the service
Parameter Description
api_server.py reads its parameters from the config.yaml configuration file.
The content of config.yaml is as follows:
models:                                                             # model list
  - name: qwen                                                      # model name; qwen / chatglm3 are available
    bmodel_path: ../models/BM1684X/qwen2-7b_int4_seq512_1dev.bmodel # model path, modify it according to the actual situation
    token_path: ./utils/qwen/token_config                           # tokenizer path
    dev_id: 0                                                       # TPU device id
port: 18080                                                         # service port
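For reference, a minimal sketch of how a config in this format can be loaded (illustrative only, assuming PyYAML is installed; api_server.py's actual parsing may differ):

# Illustrative only: load config.yaml and print the configured models.
# Assumes PyYAML (pip3 install pyyaml); not necessarily how api_server.py parses it.
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

print("service port:", cfg["port"])
for m in cfg["models"]:
    print(f'model "{m["name"]}": bmodel={m["bmodel_path"]}, dev_id={m["dev_id"]}')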
How to use
cd python ##Switch the working directory
python3 api_server.py --config ./config.yaml
3. Service Call
1. Calling with the openai library
python3 request.py ##To ask the model different questions, modify the "content" fields of the messages in request.py
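For reference, a minimal sketch of such a call with the openai 1.x client (the IP is the example address used later in this document, and the placeholder API key assumes the server does not validate keys; request.py remains the authoritative example):

# Illustrative sketch of calling the service with the openai 1.x client.
# Assumptions: the server is reachable at this example IP and does not
# validate the API key, so any placeholder string works.
from openai import OpenAI

client = OpenAI(base_url="http://172.26.13.98:18080/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "你好"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()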
2. Calling through the HTTP interface
The interface details are in request.py and can be adapted as needed (IP address, etc.).
Interface URL: ip:port/v1/chat/completions, for example: 172.26.13.98:18080/v1/chat/completions

Interface parameters (JSON format):
{
    "model": "qwen",
    "messages": [
        {"role": "user", "content": "你好"}
    ],
    "stream": true
}
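For reference, a minimal sketch of posting this payload with the Python requests library (the IP is the example above; the OpenAI-style SSE "data: ..." framing of the streamed reply is an assumption):

# Illustrative sketch: POST the JSON payload above and print the streamed reply.
# Assumes the server streams OpenAI-style SSE lines ("data: {...}" / "data: [DONE]").
import json
import requests

url = "http://172.26.13.98:18080/v1/chat/completions"
payload = {
    "model": "qwen",
    "messages": [{"role": "user", "content": "你好"}],
    "stream": True,
}

with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        delta = json.loads(data)["choices"][0].get("delta", {}).get("content")
        if delta:
            print(delta, end="", flush=True)
print()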
You can use Postman to test the interface.
Download address: https://www.postman.com/downloads/
Usage Examples
Note: the IP address in the request must be the IP of the box.
