Qwen2.5-VL Image and Video Recognition Demo
1. Introduction
Qwen2.5-VL is a multimodal large language model (MLLM) released by Alibaba's Tongyi Qianwen (Qwen) team. It is part of the Qwen2.5 series and supports multimodal interaction between vision and language.
1. Features
- Dynamic resolution processing: adaptively adjusts image resolution to balance accuracy and computational efficiency (a resizing sketch follows this list).
- Cross-modal alignment: aligns the visual and linguistic feature spaces through contrastive learning to improve cross-modal interaction.
- Low-resource adaptation: supports lightweight deployment options such as quantization and LoRA fine-tuning.
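To make the dynamic-resolution idea concrete, the sketch below mimics the kind of resizing performed by vision_process.py / qwen_vl_utils: both sides are rounded to multiples of a 28-pixel grid (14-pixel patches with 2x2 token merging) and the total pixel count is clamped to a budget. The factor and pixel budgets here are illustrative assumptions, not necessarily the exact values used by the shipped code.

import math

def smart_resize_sketch(height, width, factor=28,
                        min_pixels=56 * 56, max_pixels=1280 * 28 * 28):
    # Illustrative logic: round both sides to the 28-pixel grid.
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too many pixels: scale down while roughly preserving the aspect ratio.
        scale = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / scale / factor) * factor
        w = math.floor(width / scale / factor) * factor
    elif h * w < min_pixels:
        # Too few pixels: scale up while roughly preserving the aspect ratio.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w

print(smart_resize_sketch(1080, 1920))  # a 1080p frame mapped onto the 28-pixel grid

The resized_height/resized_width and max_side fields that appear in the run commands later in this document expose this resolution trade-off per input.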
2. Project Directory
Qwen2-5-VL
├── datasets
│ ├──images # By default, there is an image named panda.jpg
│ └──videos # By default, there is a video named carvana_video.mp4
├── models
│ └── BM1684X
│ └── qwen2.5-vl-3b_bm1684x_w4bf16_seq2048.bmodel # BM1684X qwen2.5-vl-3b model
├── python
│ ├── __pycache__
│ ├── configs # Configuration files
│ ├── qwen2_5_vl.py # Startup program
│ ├── README.md # Instruction document
│ ├── vision_process.py # Visual data preprocessing file
│ └── requirements.txt # Python dependencies
├── scripts
│ ├── compile.sh
│ ├── datasets.zip
│ ├── download_bm1684x_bmodel.sh # Download script for the BM1684X bmodel
│ ├── download_bm1688_bmodel.sh # Download script for the BM1688 bmodel
│ └── download_datasets.sh # Dataset download script
└── tools # Toolkit
2. Operation steps
1. Prepare Python environment, data and model
1.1 First, upgrade the Python version to 3.10
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.10 python3.10-dev
# The steps below create a virtual environment (without pip). Activate this virtual environment again before each future run.
cd /data
# Create a virtual environment (without pip)
python3.10 -m venv --without-pip myenv
# Enter the virtual environment
source myenv/bin/activate
# Manually install pip
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py
rm get-pip.py
# Install dependent libraries
pip3 install torchvision pillow qwen_vl_utils transformers --upgrade
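As a quick optional sanity check (illustrative only, assuming the packages above installed without errors), verify that the key libraries import inside the virtual environment:
python3 -c "import torchvision, PIL, transformers, qwen_vl_utils; print('Python environment OK')"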
1.2 Clone the official Sophgo Qwen2_5-VL project (or copy and upload Qwen2_5-VL to /data on the device)
git clone https://github.com/sophgo/sophon-demo.git
cd sophon-demo/sample/Qwen2_5-VL
cd /data/Qwen2_5-VL ## If the project was uploaded directly to /data instead of cloned, you only need to work within this directory.
1.3 Prepare the operating environment
No memory modification is needed in PCIe mode; the following applies to SoC mode. For BM1684X series devices (such as SE7/SM7), prepare the environment as follows to meet the Qwen2.5-VL runtime requirements. First, make sure the V24.04.01 SDK is in use; the SDK version can be checked with the bm_version command. If an upgrade is needed, the v24.04.01 SDK can be obtained from sophgo.com; the flash package is located at sophon-img-xxx/sdcard.tgz, and the corresponding product manual describes the flashing procedure.
After confirming the SDK version, modify the device memory in the BM1684X SoC environment with the following commands:
cd /data/
mkdir memedit && cd memedit
wget -nd https://sophon-file.sophon.cn/sophon-prod-s3/drive/23/09/11/13/DeviceMemoryModificationKit.tgz
tar xvf DeviceMemoryModificationKit.tgz
cd DeviceMemoryModificationKit
tar xvf memory_edit_{vx.x}.tar.xz #vx.x is the version number
cd memory_edit
./memory_edit.sh -p #This command will print the current memory layout information.
./memory_edit.sh -c -npu 7615 -vpu 2048 -vpp 2048 #If it is on the 1688 platform, please modify it to: ./memory_edit.sh -c -npu 10240 -vpu 0 -vpp 3072
sudo cp /data/memedit/DeviceMemoryModificationKit/memory_edit/emmcboot.itb /boot/emmcboot.itb && sync
sudo reboot
1.4 Install unzip and prepare the model and test dataset
sudo apt install unzip
cd /data/Qwen2_5-VL
chmod -R +x scripts/
./scripts/download_bm1684x_bmodel.sh ## Download the model file
./scripts/download_datasets.sh ## Download the dataset
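If the download scripts completed successfully, the model and sample data should sit at the paths used later in this guide. An optional check (illustrative snippet; the paths are taken from the project directory layout above):
python3 - <<'EOF'
# Confirm that the downloaded bmodel and sample data are where later steps expect them.
from pathlib import Path
for p in ("models/BM1684X/qwen2.5-vl-3b_bm1684x_w4bf16_seq2048.bmodel",
          "datasets/images/panda.jpg",
          "datasets/videos/carvana_video.mp4"):
    print(p, "OK" if Path(p).exists() else "MISSING")
EOF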
2. Python routine
2.1 Environment Preparation
# In addition to the packages from step 1.1, install the following libraries
cd /data/Qwen2_5-VL/python
pip3 install dfss -i https://pypi.tuna.tsinghua.edu.cn/simple --upgrade
pip3 install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
# You need to install SILK2.Tools.logger
python3 -m dfss --url=open@sophgo.com:tools/silk2/silk2.tools.logger-1.0.2-py3-none-any.whl
pip3 install silk2.tools.logger-1.0.2-py3-none-any.whl --force-reinstall
rm -f silk2.tools.logger-1.0.2-py3-none-any.whl
# This routine depends on sophon-sail. You can directly install sophon-sail by executing the following commands:
pip3 install dfss --upgrade
python3 -m dfss --install sail
# Download the runtime configuration files by executing the following commands:
python3 -m dfss --url=open@sophgo.com:sophon-demo/Qwen2_5_VL/configs.zip
unzip configs.zip
rm configs.zip
2.2 Start the test
Parameter Description
You can adjust the test by modifying qwen2_5_vl.py.
The default bmodel path in line 896 needs to be changed to: ../models/BM1684X/qwen2.5-vl-3b_bm1684x_w4bf16_seq2048.bmodel
How to use
# Video recognition
python3 qwen2_5_vl.py --vision_inputs="[{\"type\":\"video_url\",\"video_url\":{\"url\": \"../datasets/videos/carvana_video.mp4\"},\"resized_height\":420,\"resized_width\":630,\"nframes\":2}]"
# Image recognition
python3 qwen2_5_vl.py --vision_inputs="[{\"type\":\"image_url\",\"image_url\":{\"url\": \"../datasets/images/panda.jpg\"}, \"max_side\":420}]"
# Image and video recognition at the same time
python3 qwen2_5_vl.py --vision_inputs="[{\"type\":\"video_url\",\"video_url\":{\"url\": \"../datasets/videos/carvana_video.mp4\"},\"resized_height\":420,\"resized_width\":630,\"nframes\":2},{\"type\":\"image_url\",\"image_url\":{\"url\": \"../datasets/images/panda.jpg\"}, \"max_side\":840}]"
# Pure text conversation
python3 qwen2_5_vl.py --vision_inputs=""
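The --vision_inputs value is a JSON list, so the escaped strings above can also be generated from plain Python structures. A small helper (illustrative only; the field names mirror the example commands above) prints a ready-to-run command without manual escaping:

import json, shlex

# Build the --vision_inputs argument from Python dictionaries instead of
# hand-escaped JSON. Field names are copied from the example commands above.
vision_inputs = [
    {"type": "video_url",
     "video_url": {"url": "../datasets/videos/carvana_video.mp4"},
     "resized_height": 420, "resized_width": 630, "nframes": 2},
    {"type": "image_url",
     "image_url": {"url": "../datasets/images/panda.jpg"},
     "max_side": 420},
]
print("python3 qwen2_5_vl.py --vision_inputs=" + shlex.quote(json.dumps(vision_inputs)))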
Effects
You can add your own pictures or videos to the datasets folder for further testing.