OpenVINO™ Execution Provider

Accelerate ONNX models on Intel CPUs, GPUs with Intel OpenVINO™ Execution Provider. Please refer to this page for details on the Intel hardware supported.

Contents

Install

Pre-built packages and Docker images are published for OpenVINO™ Execution Provider for ONNX Runtime by Intel for each release.

Requirements

ONNX Runtime OpenVINO™ Execution Provider is compatible with three lastest releases of OpenVINO™.

ONNX Runtime OpenVINO™ Notes
1.17.1 2023.3 Details
1.17.1 2023.2 Details
1.16.0 2023.1 Details
1.15.0 2023.0 Details
1.14.0 2022.3 Details

Build

For build instructions, please see the BUILD page .

Usage

Set OpenVINO™ Environment for Python

Please download onnxruntime-openvino python packages from PyPi.org:

pip install onnxruntime-openvino
  • Windows

    To enable OpenVINO™ Execution Provider with ONNX Runtime on Windows it is must to set up the OpenVINO™ Environment Variables using the full installer package of OpenVINO™. Initialize the OpenVINO™ environment by running the setupvars script as shown below. This is a required step:

        C:\ <openvino_install_directory>\setupvars.bat
    
  • Linux

    OpenVINO™ Execution Provider with Onnx Runtime on Linux, installed from PyPi.org comes with prebuilt OpenVINO™ libs and supports flag CXX11_ABI=0. So there is no need to install OpenVINO™ separately.

    But if there is need to enable CX11_ABI=1 flag of OpenVINO, build Onnx Runtime python wheel packages from source. For build instructions, please see the BUILD page . OpenVINO™ Execution Provider wheels on Linux built from source will not have prebuilt OpenVINO™ libs so we must set the OpenVINO™ Environment Variable using the full installer package of OpenVINO™:

    ```
    $ source <openvino_install_directory>/setupvars.sh
    

Set OpenVINO™ Environment for C++

For Running C++/C# ORT Samples with the OpenVINO™ Execution Provider it is must to set up the OpenVINO™ Environment Variables using the full installer package of OpenVINO™. Initialize the OpenVINO™ environment by running the setupvars script as shown below. This is a required step:

  • For Windows run:
     C:\ <openvino_install_directory>\setupvars.bat
    
  • For Linux run:
     $ source <openvino_install_directory>/setupvars.sh
    

    Note: If you are using a dockerfile to use OpenVINO™ Execution Provider, sourcing OpenVINO™ won’t be possible within the dockerfile. You would have to explicitly set the LD_LIBRARY_PATH to point to OpenVINO™ libraries location. Refer our dockerfile .

Set OpenVINO™ Environment for C#

To use csharp api for openvino execution provider create a custom nuget package. Follow the instructions here to install prerequisites for nuget creation. Once prerequisites are installed follow the instructions to build openvino execution provider and add an extra flag --build_nuget to create nuget packages. Two nuget packages will be created Microsoft.ML.OnnxRuntime.Managed and Microsoft.ML.OnnxRuntime.Openvino.

Features

OpenCL queue throttling for GPU devices

Enables OpenCL queue throttling for GPU devices. Reduces CPU utilization when using GPUs with OpenVINO EP.

Model caching

OpenVINO™ supports model caching .

From OpenVINO™ 2023.1 version, model caching feature is supported on CPU, GPU along with kernel caching on iGPU, dGPU.

This feature enables users to save and load the blob file directly on to the hardware device target and perform inference with improved Inference Latency.

Kernel Caching on iGPU and dGPU:

This feature also allows user to save kernel caching as cl_cache files for models with dynamic input shapes. These cl_cache files can be loaded directly onto the iGPU/dGPU hardware device target and inferencing can be performed.

Enabling Model Caching via Runtime options using c++/python API’s.

This flow can be enabled by setting the runtime config option ‘cache_dir’ specifying the path to dump and load the blobs (CPU, iGPU, dGPU) or cl_cache(iGPU, dGPU) while using the c++/python API’S.

Refer to Configuration Options for more information about using these runtime options.

Support for INT8 Quantized models

Int8 models are supported on CPU and GPU.

Support for Weights saved in external files

OpenVINO™ Execution Provider now supports ONNX models that store weights in external files. It is especially useful for models larger than 2GB because of protobuf limitations.

See the OpenVINO™ ONNX Support documentation .

Converting and Saving an ONNX Model to External Data: Use the ONNX API’s. documentation .

Example:

import onnx
onnx_model = onnx.load("model.onnx") # Your model in memory as ModelProto
onnx.save_model(onnx_model, 'saved_model.onnx', save_as_external_data=True, all_tensors_to_one_file=True, location='data/weights_data', size_threshold=1024, convert_attribute=False)

Note:

  1. In the above script, model.onnx is loaded and then gets saved into a file called ‘saved_model.onnx’ which won’t have the weights but this new onnx model now will have the relative path to where the weights file is located. The weights file ‘weights_data’ will now contain the weights of the model and the weights from the original model gets saved at /data/weights_data.

  2. Now, you can use this ‘saved_model.onnx’ file to infer using your sample. But remember, the weights file location can’t be changed. The weights have to be present at /data/weights_data

  3. Install the latest ONNX Python package using pip to run these ONNX Python API’s successfully.

Support for IO Buffer Optimization

To enable IO Buffer Optimization we have to set OPENCL_LIBS, OPENCL_INCS environment variables before build. For IO Buffer Optimization, the model must be fully supported on OpenVINO™ and we must provide in the remote context cl_context void pointer as C++ Configuration Option. We can provide cl::Buffer address as Input using GPU Memory Allocator for input and output.

Example:

//Set up a remote context
cl::Context _context;
.....
// Set the context through openvino options
std::unordered_map<std::string, std::string> ov_options;
ov_options[context] = std::to_string((unsigned long long)(void *) _context.get());
.....
//Define the Memory area
Ort::MemoryInfo info_gpu("OpenVINO_GPU", OrtAllocatorType::OrtDeviceAllocator, 0, OrtMemTypeDefault);
//Create a shared buffer , fill in with data
cl::Buffer shared_buffer(_context, CL_MEM_READ_WRITE, imgSize, NULL, &err);
//Cast it to void*, and wrap it as device pointer for Ort::Value
void *shared_buffer_void = static_cast<void *>(&shared_buffer);
Ort::Value inputTensors = Ort::Value::CreateTensor(
        info_gpu, shared_buffer_void, imgSize, inputDims.data(),
        inputDims.size(), ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT);

Multi-threading for OpenVINO™ Execution Provider

OpenVINO™ Execution Provider for ONNX Runtime enables thread-safe deep learning inference

Multi streams for OpenVINO™ Execution Provider

OpenVINO™ Execution Provider for ONNX Runtime allows multiple stream execution for difference performance requirements part of API 2.0

Auto-Device Execution for OpenVINO EP

Use AUTO:<device 1><device 2>.. as the device name to delegate selection of an actual accelerator to OpenVINO™. Auto-device internally recognizes and selects devices from CPU, integrated GPU and discrete Intel GPUs (when available) depending on the device capabilities and the characteristic of CNN models, for example, precisions. Then Auto-device assigns inference requests to the selected device.

From the application point of view, this is just another device that handles all accelerators in full system.

For more information on Auto-Device plugin of OpenVINO™, please refer to the Intel OpenVINO™ Auto Device Plugin .

Heterogeneous Execution for OpenVINO™ Execution Provider

The heterogeneous execution enables computing for inference on one network on several devices. Purposes to execute networks in heterogeneous mode:

  • To utilize accelerator’s power and calculate the heaviest parts of the network on the accelerator and execute unsupported layers on fallback devices like the CPU to utilize all available hardware more efficiently during one inference.

For more information on Heterogeneous plugin of OpenVINO™, please refer to the Intel OpenVINO™ Heterogeneous Plugin .

Multi-Device Execution for OpenVINO EP

Multi-Device plugin automatically assigns inference requests to available computational devices to execute the requests in parallel. Potential gains are as follows:

  • Improved throughput that multiple devices can deliver (compared to single-device execution)
  • More consistent performance, since the devices can now share the inference burden (so that if one device is becoming too busy, another device can take more of the load)

For more information on Multi-Device plugin of OpenVINO™, please refer to the Intel OpenVINO™ Multi Device Plugin .

Configuration Options

OpenVINO™ Execution Provider can be configured with certain options at runtime that control the behavior of the EP. These options can be set as key-value pairs as below:-

Python API

Key-Value pairs for config options can be set using InferenceSession API as follow:-

session = onnxruntime.InferenceSession(<path_to_model_file>, providers=['OpenVINOExecutionProvider'], provider_options=[{Key1 : Value1, Key2 : Value2, ...}])

Note that the releases from (ORT 1.10) will require explicitly setting the providers parameter if you want to use execution providers other than the default CPU provider (as opposed to the current behavior of providers getting set/registered by default based on the build flags) when instantiating InferenceSession.

C/C++ API 2.0

The session configuration options are passed to SessionOptionsAppendExecutionProvider API as shown in an example below for GPU device type:

std::unordered_map<std::string, std::string> options;
options[device_type] = "GPU_FP32";
options[device_id] = "";
options[num_of_threads] = "8";
options[num_streams] = "8";
options[cache_dir] = "";
options[context] = "0x123456ff";
options[enable_opencl_throttling] = "false";
session_options.AppendExecutionProvider("OpenVINO", options);

C/C++ Legacy API

The session configuration options are passed to SessionOptionsAppendExecutionProvider_OpenVINO() API as shown in an example below for GPU device type:

OrtOpenVINOProviderOptions options;
options.device_type = "GPU_FP32";
options.device_id = "";
options.num_of_threads = 8;
options.cache_dir = "";
options.context = 0x123456ff;
options.enable_opencl_throttling = false;