# CLIP Inference with AMD Ryzen AI
This repository contains a Python script for running CLIP (Contrastive Language-Image Pre-training) model inference on the CIFAR-100 dataset using the AMD Ryzen AI NPU or the CPU. The script targets Ryzen AI (RAI) 1.5 and demonstrates the zero-shot image classification capabilities of the CLIP model.
## Installation Instructions
A working RAI 1.5 environment is required. Please follow the [Ryzen AI Installation Guide](https://ryzenai.docs.amd.com/en/latest/inst.html) to prepare your environment.
1. Activate your conda environment:
```bash
conda activate ryzen-ai-1.5.0
```
2. Unzip both cache directories (one for the text model and one for the vision model) into the same directory as the inference script.
3. Install the required Python packages:
```bash
pip install -r requirements.txt
```
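Optionally, you can confirm that the Ryzen AI execution provider is visible to ONNX Runtime before running the script (a quick sanity check, not part of the original instructions):
```python
import onnxruntime as ort

# On a correctly configured RAI 1.5 environment this list should include
# 'VitisAIExecutionProvider' in addition to 'CPUExecutionProvider'.
print(ort.get_available_providers())
```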
## Required Files
Ensure the following files are present in the same directory as `clip_inference.py`:
### ONNX Model Files
- `clip_text_model.onnx` - ONNX text encoder model
- `clip_vision_model.onnx` - ONNX vision encoder model
### Configuration Files (for NPU execution)
- `vitisai_config.json` - VitisAI configuration
### Model Cache Directories
- `clip_text_model_cache/` - Cached text model artifacts
- `clip_vision_model_cache/` - Cached vision model artifacts
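Before running inference, a short check like the following (an illustrative sketch, not part of `clip_inference.py`) can confirm that everything listed above is in place:
```python
from pathlib import Path

# File and directory names taken from the list above.
required = [
    "clip_text_model.onnx",
    "clip_vision_model.onnx",
    "vitisai_config.json",
    "clip_text_model_cache",
    "clip_vision_model_cache",
]

missing = [name for name in required if not Path(name).exists()]
print("Missing:", missing if missing else "none - all required files are present")
```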
## Cache Directory Structure
The cache directories contain pre-compiled model artifacts and optimization files for improved performance.
They eliminate the need for on-device model compilation, which can be time-consuming.
Because CLIP uses two models, two cache directories are provided as zip files, one per model.
### Cache Directory Descriptions
- **Root Level Files**: Contain compilation metadata, graph analysis, and performance summaries
- **`cache/`**: Hash-based cache storage for model artifacts
- **`vaiml_par_0/`**: Contains compiled model artifacts, MLIR representations, and native libraries
- **`vaiml_partition_fe.flexml/`**: Contains optimized ONNX models and visualization files
**Note**: These cache directories are automatically generated during the first NPU compilation and significantly reduce subsequent startup times.
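For reference, a session that reuses these caches on the NPU might be created roughly as follows. This is a hedged sketch: the provider option names (`config_file`, `cacheDir`, `cacheKey`) follow the Vitis AI execution provider documentation, and the specific values are assumptions rather than the script's actual code.
```python
import onnxruntime as ort

def make_session(model_path: str, use_npu: bool) -> ort.InferenceSession:
    """Hypothetical helper illustrating NPU vs. CPU session creation."""
    if use_npu:
        return ort.InferenceSession(
            model_path,
            providers=["VitisAIExecutionProvider"],
            provider_options=[{
                "config_file": "vitisai_config.json",              # VitisAI compilation config
                "cacheDir": ".",                                    # where compiled artifacts live
                "cacheKey": model_path.replace(".onnx", "_cache"),  # e.g. clip_text_model_cache
            }],
        )
    return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
```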
## Usage
### Command Line Interface
```bash
python clip_inference.py [-h] (--npu | --cpu) [--num_images NUM_IMAGES]
```
### Arguments
**Required (mutually exclusive):**
- `--cpu`: Run inference on CPU using CPUExecutionProvider
- `--npu`: Run inference on NPU using VitisAIExecutionProvider
**Optional:**
- `--num_images`: Number of images to process from the CIFAR-100 test set (default: 50, max: 10,000)
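The command-line interface behaves as if defined roughly like the sketch below (the actual parser in `clip_inference.py` may differ in details):
```python
import argparse

parser = argparse.ArgumentParser(description="CLIP zero-shot inference on CIFAR-100")
device = parser.add_mutually_exclusive_group(required=True)
device.add_argument("--npu", action="store_true", help="run on the NPU (VitisAIExecutionProvider)")
device.add_argument("--cpu", action="store_true", help="run on the CPU (CPUExecutionProvider)")
parser.add_argument("--num_images", type=int, default=50,
                    help="number of CIFAR-100 test images to process (max 10,000)")
args = parser.parse_args()
```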
### Examples
1. **CPU inference with default settings (50 images):**
```bash
python clip_inference.py --cpu
```
2. **NPU inference with 100 images:**
```bash
python clip_inference.py --npu --num_images 100
```
3. **NPU inference on the complete test dataset:**
```bash
python clip_inference.py --npu --num_images 10000
```
## How It Works
### Model Architecture
- **Text Encoder**: Processes text descriptions ("a photo of a {class_name}")
- **Vision Encoder**: Processes CIFAR-100 images (32x32 RGB)
- **Classification**: Computes similarity between image and text embeddings
### Inference Pipeline
1. **Text Processing**: Pre-compute text features for all 100 CIFAR-100 class labels
2. **Image Processing**: Process each image through the vision encoder
3. **Classification**: Compute cosine similarity between image and text features
4. **Prediction**: Select the class with the highest similarity score
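In code, the pipeline can be sketched as below. The session objects, tokenized prompts, and tensor names (`input_ids`, `pixel_values`) are illustrative assumptions; the script's actual variable and input names may differ.
```python
import numpy as np

# text_session / vision_session: onnxruntime.InferenceSession objects for the two ONNX models.
# tokenized_prompts: tokenized versions of "a photo of a {class_name}" for all 100 classes.

# 1. Text processing: pre-compute and normalize text features once.
text_features = np.stack([
    text_session.run(None, {"input_ids": tokens})[0].squeeze()
    for tokens in tokenized_prompts
])
text_features /= np.linalg.norm(text_features, axis=-1, keepdims=True)

# 2-4. For each image: encode, compute cosine similarity, pick the best class.
correct = 0
for pixel_values, label in test_images:
    image_features = vision_session.run(None, {"pixel_values": pixel_values})[0].squeeze()
    image_features /= np.linalg.norm(image_features)
    similarity = text_features @ image_features  # cosine similarity (both sides normalized)
    prediction = int(np.argmax(similarity))
    correct += int(prediction == label)
```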
### Performance Optimization
- **NPU Acceleration**: Leverages AMD Ryzen AI NPU for faster inference
- **Caching**: Uses pre-compiled model caches for reduced startup time
## Output Metrics
The script reports the following performance metrics:
- **Text Latency**: Average time per text inference (ms)
- **Text Throughput**: Text inferences per second (inf/s)
- **Vision Latency**: Average time per image inference (ms)
- **Vision Throughput**: Image inferences per second (inf/s)
- **Classification Accuracy**: Percentage of correctly classified images
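For reference, these metrics reduce to simple ratios over the collected timings (a simplified sketch, not the script's exact code):
```python
# latencies: per-inference durations in seconds, e.g. collected with time.perf_counter()
avg_latency_ms = 1000 * sum(latencies) / len(latencies)  # average latency (ms)
throughput = len(latencies) / sum(latencies)             # inferences per second (inf/s)
accuracy = 100.0 * num_correct / num_images               # classification accuracy (%)
```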
### Example Output
**NPU Execution (50 images):**
```
Compilation Done
Session on NPU
Processing images...
Image inference: 100%|███████████████████████████████████████████████████████| 50/50 [00:03<00:00, 13.45it/s]
Results:
Text latency: 26.65 ms
Text throughput: 37.52 inf/s
Vision latency: 73.46 ms
Vision throughput: 13.61 inf/s
Classification accuracy: 77.55%
```