Sophon LLM_api_server development

1. Introduction

The LLM_api_server routine is an OpenAI-API-compatible LLM service built on BM1684X, currently supporting ChatGLM3, Qwen, Qwen1.5, and Qwen2.

1. Features

  • Supports BM1684X (PCIe, SoC) and BM1688 (SoC).
  • Supports calls via the openai library.
  • Supports calls via the web/HTTP interface.

2. Project Directory

    LLM_api_server
    ├── models
    │   └── BM1684X
    │       ├── chatglm3-6b_int4.bmodel              # BM1684X chatglm3-6b model
    │       └── qwen2-7b_int4_seq512_1dev.bmodel     # BM1684X qwen2-7b model
    ├── python
    │   ├── utils                         # utility library
    │   ├── api_server.py                 # service launcher
    │   ├── config.yaml                   # service configuration file
    │   ├── request.py                    # example request program
    │   └── requirements.txt              # Python dependencies
    └── scripts
        ├── download_model.sh       # model download script
        └── download_tokenizer.sh   # tokenizer download script

2. Operation steps

1. Prepare data and models

1.1 Clone the official Sophgo sophon-demo project (or upload LLM_api_server to /data on the box)

    git clone https://github.com/sophgo/sophon-demo.git
    cd sophon-demo/application/LLM_api_server
    cd /data/LLM_api_server  ##If you only uploaded LLM_api_server, just enter this directory instead.

1.2 Install unzip and other dependencies. If they are already installed, skip this step. On non-Ubuntu systems, use yum or another package manager as appropriate.

    sudo apt-get update
    sudo apt-get install pybind11-dev
    pip3 install sentencepiece transformers==4.30.2
    pip3 install gradio==3.39.0 mdtex2html==1.2.0 dfss
    sudo apt install unzip
    chmod -R +x scripts/
    ./scripts/download_tokenizer.sh  ##Download the tokenizer
    ./scripts/download_model.sh  ##Download the model file

2. Python routine

2.1 Environment Preparation

    pip3 install -r python/requirements.txt

    ##This routine requires a relatively new sophon-sail, so a usable sophon-sail whl package is provided here;
    ##in a SoC environment, it can be downloaded with the following command.
    python3 -m dfss --url=open@sophgo.com:sophon-demo/Qwen/sophon_arm-3.8.0-py3-none-any.whl
    pip3 install sophon_arm-3.8.0-py3-none-any.whl  ##Install sophon-sail
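
To confirm that sophon-sail installed correctly, you can run a quick import check (a minimal sketch; sophon-sail exposes its Python bindings as the sophon.sail module):

    python3 -c "import sophon.sail as sail; print('sophon-sail OK')"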

2.2 Start the service

Parameter Description

api_server.py reads its parameters from the config.yaml configuration file.

The content of config.yaml is as follows:

    models:  # Model list
      - name: qwen   # Model name; qwen / chatglm3 are available
        bmodel_path: ../models/BM1684X/qwen2-7b_int4_seq512_1dev.bmodel  # Model path, modify according to your setup
        token_path: ./utils/qwen/token_config  # Tokenizer path
        dev_id: 0  # TPU ID

    port: 18080  # Service port
How to use

    cd python  ##Switch to the working directory
    python3 api_server.py --config ./config.yaml

3. Service Call

1. Calling with the OpenAI library

    python3 request.py  ##To ask the model different questions, change the "content" field of messages in request.py.
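
For reference, below is a minimal sketch of calling the service directly with the openai Python library rather than through request.py (it assumes openai>=1.0 and uses the example box IP and port from this page; the api_key value is a placeholder, since the local service does not validate it):

    # openai_client_example.py -- hypothetical sketch, not part of the routine
    from openai import OpenAI

    # Point the client at the box instead of api.openai.com
    client = OpenAI(base_url="http://172.26.13.98:18080/v1", api_key="not-needed")

    response = client.chat.completions.create(
        model="qwen",  # must match the model name in config.yaml
        messages=[{"role": "user", "content": "你好"}],
        stream=True,   # stream the reply chunk by chunk
    )

    # Print streamed chunks as they arrive
    for chunk in response:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)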

2. Calling via the HTTP interface

The interface details are defined in request.py and can be modified as needed (IP address, etc.).

Interface URL: ip:port/v1/chat/completions, for example: 172.26.13.98:18080/v1/chat/completions


Interface parameters (JSON format)

    {
        "model": "qwen",
        "messages": [
            {"role": "user", "content": "你好"}
        ],
        "stream": true
    }
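
For a raw HTTP call without the openai library, a minimal sketch using the requests library is shown below (it reuses the example IP and payload above; with "stream": true the server returns the reply as incremental OpenAI-style "data: ..." lines, which are printed as received):

    # http_request_example.py -- hypothetical sketch of a raw HTTP call
    import requests

    url = "http://172.26.13.98:18080/v1/chat/completions"  # ip:port of the box
    payload = {
        "model": "qwen",
        "messages": [{"role": "user", "content": "你好"}],
        "stream": True,
    }

    # stream=True makes requests yield the response incrementally
    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if line:
                print(line)  # raw "data: {...}" chunks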

You can use Postman to test the interface.

Download address: https://www.postman.com/downloads/

Usage Examples

Note: the IP address should be the IP of the box.
