Deploy Llama3 Example
1. Compile the model
Refer to stage 1 of LLM-TPU-main: compile and convert the bmodel file in an x86 environment, then transfer it to the board.
A pre-compiled bmodel is also available in the resource downloads.
Also download the official Sophgo (Suanneng) TPU demo package, LLM-TPU-main.zip.
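For reference, compilation on the x86 host goes through the compile script shipped in the LLM-TPU repository. A minimal sketch, assuming the repository's Llama3 demo layout; the exact flag names are assumptions and may differ between versions, so check the script's usage first:
cd LLM-TPU-main/models/Llama3/compile ##Enter the compile directory on the x86 host (path assumed from the repository layout)
./compile.sh --mode int4 --name llama3-8b --seq_length 1024 ##Assumed flags; should produce llama3-8b_int4_1dev_1024.bmodel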
Warning
Transfer the files to the /data directory at the root of the board's filesystem. After logging in over SSH with MobaXterm, you can drag and drop the files directly via its built-in SFTP panel.
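If you prefer the command line to MobaXterm, plain scp does the same job (a sketch; replace the user name and board IP with your own; linaro is a common default on Sophgo SoC images):
scp LLM-TPU-main.zip llama3-8b_int4_1dev_1024.bmodel linaro@<board-ip>:/data/ ##Copy both files to /data on the board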

2. Compile executable files
Tips
Make sure the board's network can connect to the Internet. The following steps are performed on the board.
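A quick connectivity check before installing anything (any reachable host will do):
ping -c 3 www.sophgo.com ##Three replies with 0% packet loss means DNS and the network link are both working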
- Install the required dependencies first (a quick sanity check follows these commands):
sudo apt-get update ##Update the package lists
sudo apt-get install -y pybind11-dev ##Install the pybind11 development headers
pip3 install transformers ##Install the transformers Python package (this may take a while depending on network speed)
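Before building, verify the dependencies are in place (cmake is needed for the build below and is usually preinstalled on the stock image; install it if not):
python3 -c "import transformers; print(transformers.__version__)" ##Should print a version with no import error
cmake --version ##If this fails, install it with: sudo apt-get install -y cmake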
- The build steps are performed in the directory where the demo and bmodel were just transferred (a check on the build output follows these commands):
sudo -i ##Switch to the root user
cd /data ##Enter the /data directory
unzip LLM-TPU-main.zip ## Unzip the LLM-TPU-main.zip file
mv llama3-8b_int4_1dev_1024.bmodel /data/LLM-TPU-main/models/Llama3/python_demo ##Move the bmodel to the corresponding demo directory
cd /data/LLM-TPU-main/models/Llama3/python_demo ##Enter the Llama3 demo directory
mkdir build && cd build ##Create a compilation directory and enter it
cmake .. ##Generate the Makefile using cmake
make ##Compile the project
cp *chat* .. ##Copy the compiled chat module to the demo directory
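The make step produces a Python extension module (chat.cpython-*.so) that pipeline.py imports, which is what the cp command copies up one level. A quick check that it landed where expected:
ls ../*chat* ##Should list the compiled module, e.g. chat.cpython-38-aarch64-linux-gnu.so (exact name varies by Python version)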
- Run:
cd /data/LLM-TPU-main/models/Llama3/python_demo ##Enter the Llama3 demo directory
python3 pipeline.py --model_path ./llama3-8b_int4_1dev_1024.bmodel --tokenizer_path ../token_config/ --devid 0 ##Run the demo
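If loading hangs or fails at the device step, first confirm the TPU is visible. bm-smi is Sophgo's device monitoring tool (assuming it is installed and in PATH on your image):
bm-smi ##Shows TPU status, memory usage, and utilization; device 0 should be listed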
Expected output:
root@bm1684:/data/LLM-TPU-main/models/Llama3/python_demo# python3 pipeline.py --model_path ./llama3-8b_int4_1dev_1024.bmodel --tokenizer_path ../token_config/ --devid 0
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Load ../token_config/ ...
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Device [ 0 ] loading ....
[BMRT][bmcpu_setup:498] INFO:cpu_lib 'libcpuop.so' is loaded.
[BMRT][bmcpu_setup:521] INFO:Not able to open libcustomcpuop.so
bmcpu init: skip cpu_user_defined
open usercpu.so, init user_cpu_init
[BMRT][BMProfileDeviceBase:190] INFO:gdma=0, tiu=0, mcu=0
Model[./llama3-8b_int4_1dev_1024.bmodel] loading ....
[BMRT][load_bmodel:1939] INFO:Loading bmodel from [./llama3-8b_int4_1dev_1024.bmodel]. Thanks for your patience...
[BMRT][load_bmodel:1704] INFO:Bmodel loaded, version 2.2+v1.8.beta.0-89-g32b7f39b8-20240620
[BMRT][load_bmodel:1706] INFO:pre net num: 0, load net num: 69
[BMRT][load_tpu_module:1802] INFO:loading firmare in bmodel
[BMRT][preload_funcs:2121] INFO: core_id=0, multi_fullnet_func_id=22
[BMRT][preload_funcs:2124] INFO: core_id=0, dynamic_fullnet_func_id=23
Done!
=================================================================
1. If you want to quit, please enter one of [q, quit, exit]
2. To create a new chat session, please enter one of [clear, new]
=================================================================
Question: hello
Answer: Hello! How can I help you?
FTL: 1.690 s
TPS: 7.194 token/s
Question: who are you?
Answer: I am Llama3, an AI assistant developed by IntellectNexus. How can I assist you?
FTL: 1.607 s
TPS: 7.213 token/s
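FTL here is the first-token latency (the wait from submitting the prompt to the first generated token) and TPS is the decode throughput. As a rough arithmetic example: at about 7.2 token/s, a 100-token answer takes roughly 100 / 7.2 ≈ 14 s of decoding on top of the ~1.6 s FTL.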