ONNX
🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.
Introduction
Cortex.onnx is a C++ inference library for Windows that relies on onnxruntime-genai and uses DirectML for hardware acceleration. DirectML is a high-performance, DirectX 12-based machine learning library that provides GPU acceleration across a wide range of hardware and drivers, including AMD, Intel, NVIDIA, and Qualcomm GPUs. Cortex.onnx integrates onnxruntime-genai for inference tasks and occasionally contributes changes upstream.
The currently supported precision and executor combinations are:
- FP32 CPU
- FP32 CUDA
- FP16 CUDA
- FP16 DML
- INT4 CPU
- INT4 CUDA
- INT4 DML
Usage
`cortex engines onnx init`
The command will check, download, and install these dependencies for Windows:
- engine.dll
- D3D12Core.dll
- DirectML.dll
- onnxruntime.rel.dll
- onnxruntime-genai.dll
- MSBuild libraries:
  - msvcp140.dll
  - vcruntime140.dll
  - vcruntime140_1.dll
To include `onnx` in your own server implementation, follow the steps here.
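For illustration only, here is a minimal sketch of how a host server on Windows could load the engine dynamically and obtain the engine interface. The factory symbol name (`get_engine`) and the header path are assumptions for this sketch, not a documented contract.

```cpp
// Hypothetical host-side loading of the onnx engine on Windows.
// The exported factory symbol ("get_engine") is an assumption for this sketch.
#include <windows.h>

#include <iostream>

#include "cortex-common/enginei.h"  // engine interface (see Architecture below)

typedef EngineI* (*EngineFactory)();

int main() {
  HMODULE lib = LoadLibraryA("engine.dll");
  if (!lib) {
    std::cerr << "Failed to load engine.dll\n";
    return 1;
  }

  auto create_engine =
      reinterpret_cast<EngineFactory>(GetProcAddress(lib, "get_engine"));
  if (!create_engine) {
    std::cerr << "Engine factory symbol not found\n";
    FreeLibrary(lib);
    return 1;
  }

  EngineI* engine = create_engine();
  // engine->LoadModel(...), engine->HandleChatCompletion(...), etc.

  FreeLibrary(lib);
  return 0;
}
```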
Get ONNX Models
You can download precompiled ONNX models from the Cortex Hub on Hugging Face. These models include configurations, tokenizers, and dependencies tailored for optimal performance with the `onnx` engine.
Interface
`onnx` has the following interfaces:
- `HandleChatCompletion`: Processes chat completion tasks.
  `void HandleChatCompletion(std::shared_ptr<Json::Value> json_body, std::function<void(Json::Value&&, Json::Value&&)>&& callback);`
- `LoadModel`: Loads a model based on the specifications.
  `void LoadModel(std::shared_ptr<Json::Value> json_body, std::function<void(Json::Value&&, Json::Value&&)>&& callback);`
- `UnloadModel`: Unloads a model as specified.
  `void UnloadModel(std::shared_ptr<Json::Value> json_body, std::function<void(Json::Value&&, Json::Value&&)>&& callback);`
- `GetModelStatus`: Retrieves the status of a model.
  `void GetModelStatus(std::shared_ptr<Json::Value> json_body, std::function<void(Json::Value&&, Json::Value&&)>&& callback);`
All the interfaces above take the following parameters:

| Parameter | Description |
| --- | --- |
| `json_body` | The request content, in JSON format. |
| `callback` | A function that handles the response. |
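As an illustration of these parameters, the sketch below calls `LoadModel` with a JSON body and a lambda callback. The engine interface type (`EngineI`, named after `enginei.h`) and the JSON field names are placeholders for this example, not the documented request schema.

```cpp
// Hypothetical LoadModel call; JSON field names are placeholders.
#include <json/json.h>

#include <iostream>
#include <memory>

#include "cortex-common/enginei.h"

void LoadExampleModel(EngineI* engine) {
  auto body = std::make_shared<Json::Value>();
  (*body)["model"] = "example-model";        // placeholder model id
  (*body)["model_path"] = "C:/models/onnx";  // placeholder path

  engine->LoadModel(body, [](Json::Value&& status, Json::Value&& result) {
    // The callback receives two JSON values from the engine
    // (treated here as status and result; the exact order is assumed).
    std::cout << result.toStyledString() << std::endl;
  });
}
```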
Architecture
Main Components
These are the main components that interact to provide an API for inference tasks using the `onnxruntime-genai` library:
- `cortex-cpp`: Handles API requests and responses.
- `enginei`: The engine interface for inference.
- `onnx`: Makes the engine's APIs accessible through the engine interface so that other components can use its features easily.
- `onnx_engine`: Exposes APIs for inference. It loads and unloads models and simplifies API calls to `onnxruntime_genai`.
- `onnxruntime_genai`: A submodule from the `onnxruntime_genai` repository that provides the core inference functionality.
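Based on the four methods listed in the Interface section, the abstract interface in `enginei.h` might look roughly like the sketch below. Treat it as illustrative rather than the exact header.

```cpp
// Illustrative sketch of the engine interface (base/cortex-common/enginei.h).
#pragma once

#include <functional>
#include <memory>

#include <json/value.h>

class EngineI {
 public:
  virtual ~EngineI() = default;

  // Every method takes a JSON request body and a callback that is invoked
  // with two JSON values produced by the engine.
  virtual void HandleChatCompletion(
      std::shared_ptr<Json::Value> json_body,
      std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

  virtual void LoadModel(
      std::shared_ptr<Json::Value> json_body,
      std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

  virtual void UnloadModel(
      std::shared_ptr<Json::Value> json_body,
      std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

  virtual void GetModelStatus(
      std::shared_ptr<Json::Value> json_body,
      std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;
};
```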
Communication Protocols
Load a Model
The diagram above illustrates the interaction between three components: `cortex-js`, `cortex-cpp`, and `onnx` when using the `onnx` engine in Cortex:
- HTTP Request from cortex-js to cortex-cpp: `cortex-js` sends an HTTP request to `cortex-cpp` to load a model.
- Engine Loading in cortex-cpp: Upon receiving the HTTP request, `cortex-cpp` initiates the loading of the engine.
- Model Loading from cortex-cpp to onnx: `cortex-cpp` then requests `onnx` to load the model.
- Model Preparation in onnx: `onnx` performs the following tasks (see the sketch after this list):
  - Create Tokenizer: Initializes a tokenizer for the model.
  - Create ONNX Model: Sets up the ONNX model for inference.
  - Cache Chat Template: Caches the chat template for future use.
- Callback from onnx to cortex-cpp: Once the model is loaded and ready, `onnx` sends a callback to `cortex-cpp` to indicate the completion of the model loading process.
- HTTP Response from cortex-cpp to cortex-js: `cortex-cpp` sends an HTTP response back to `cortex-js`, indicating that the model has been successfully loaded and is ready for use.
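The model-preparation step can be pictured with the onnxruntime-genai C++ API as in the sketch below. The header name and exact API calls vary between onnxruntime-genai releases, so this is an approximation of the Create Tokenizer, Create ONNX Model, and Cache Chat Template tasks rather than the engine's actual code.

```cpp
// Sketch of model preparation inside the onnx engine, using the
// onnxruntime-genai C++ API (names may differ between versions).
#include <memory>
#include <string>

#include "ort_genai.h"

struct LoadedModel {
  std::unique_ptr<OgaModel> model;
  std::unique_ptr<OgaTokenizer> tokenizer;
  std::string chat_template;  // cached for later chat completions
};

LoadedModel PrepareModel(const std::string& model_dir,
                         const std::string& chat_template) {
  LoadedModel m;
  m.model = OgaModel::Create(model_dir.c_str());  // Create ONNX Model
  m.tokenizer = OgaTokenizer::Create(*m.model);   // Create Tokenizer
  m.chat_template = chat_template;                // Cache Chat Template
  return m;
}
```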
Stream Inference
The diagram above illustrates the interaction between three components: `cortex-js`, `cortex-cpp`, and `onnx` when using the `onnx` engine to call the chat completions endpoint with the stream inference option:
- HTTP Request from cortex-js to cortex-cpp: `cortex-js` sends an HTTP request to `cortex-cpp` for chat completion.
- Request Chat Completion from cortex-cpp to onnx: `cortex-cpp` forwards the request to `onnx` to process the chat completion.
- Chat Processing in onnx: `onnx` performs the following tasks:
  - Apply Chat Template: Applies the chat template.
  - Encode: Encodes the input data.
  - Set Search Options: Configures search options for inference.
  - Create Generator: Creates a generator for token generation.
- Token Generation in onnx: `onnx` executes the following steps in a loop to generate the response (see the sketch after this list):
  - Compute Logits: Computes the logits.
  - Generate Next Token: Generates the next token.
  - Decode New Token: Decodes the newly generated token.
- Callback from onnx to cortex-cpp: Once a token is generated, `onnx` sends a callback to `cortex-cpp`.
- HTTP Stream Response from cortex-cpp to cortex-js: `cortex-cpp` streams the response back to `cortex-js` as the tokens are generated.
- Wait for Done in cortex-js: `cortex-js` waits until the entire response has been received and the process is complete.
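The token-generation loop above roughly corresponds to the following sketch using the onnxruntime-genai C++ API. Method names such as `ComputeLogits` and `GenerateNextToken` have changed between releases, so treat this as illustrative rather than the engine's actual implementation.

```cpp
// Illustrative streaming generation loop (onnxruntime-genai C++ API;
// exact method names may differ between versions).
#include <cstdint>
#include <functional>
#include <string>

#include "ort_genai.h"

void StreamCompletion(const OgaModel& model, const OgaTokenizer& tokenizer,
                      const std::string& templated_prompt,
                      const std::function<void(const std::string&)>& on_token) {
  // Encode: tokenize the prompt produced by the chat template.
  auto sequences = OgaSequences::Create();
  tokenizer.Encode(templated_prompt.c_str(), *sequences);

  // Set Search Options and Create Generator.
  auto params = OgaGeneratorParams::Create(model);
  params->SetSearchOption("max_length", 1024);
  params->SetInputSequences(*sequences);
  auto generator = OgaGenerator::Create(model, *params);

  // Decode tokens incrementally and hand each one back to the caller,
  // mirroring the per-token callback to cortex-cpp.
  auto stream = OgaTokenizerStream::Create(tokenizer);
  while (!generator->IsDone()) {
    generator->ComputeLogits();      // Compute Logits
    generator->GenerateNextToken();  // Generate Next Token

    const int32_t* seq = generator->GetSequenceData(0);
    size_t len = generator->GetSequenceCount(0);
    on_token(stream->Decode(seq[len - 1]));  // Decode New Token
  }
}
```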
Non-stream Inference
The diagram above illustrates the interaction between three components: `cortex-js`, `cortex-cpp`, and `onnx` when using the `onnx` engine to call the chat completions endpoint with the non-stream inference option:
- HTTP Request from cortex-js to cortex-cpp: `cortex-js` sends an HTTP request to `cortex-cpp` for chat completion.
- Request Chat Completion from cortex-cpp to onnx: `cortex-cpp` forwards the request to `onnx` to process the chat completion.
- Chat Processing in onnx: `onnx` performs the following tasks:
  - Apply Chat Template: Applies the chat template.
  - Encode: Encodes the input data.
  - Set Search Options: Configures search options for inference.
  - Create Generator: Creates a generator to process the request.
- Output Generation in onnx: `onnx` executes the following steps to generate the response:
  - Generate Output: Generates the output based on the processed data.
  - Decode Output: Decodes the generated output.
- Callback from onnx to cortex-cpp: Once the output is generated and ready, `onnx` sends a callback to `cortex-cpp` to indicate the completion of the chat completion process (see the caller-side sketch after this list).
- HTTP Response from cortex-cpp to cortex-js: `cortex-cpp` sends an HTTP response back to `cortex-js`, providing the generated output.
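From the caller's side, the only difference from the streaming path is that the callback fires once with the complete output. The sketch below issues a non-stream chat completion through the engine interface; the JSON fields follow the OpenAI-compatible request format but are placeholders rather than the documented schema.

```cpp
// Hypothetical non-stream chat completion call; field names are placeholders.
#include <json/json.h>

#include <iostream>
#include <memory>

#include "cortex-common/enginei.h"

void NonStreamCompletion(EngineI* engine) {
  auto body = std::make_shared<Json::Value>();
  (*body)["model"] = "example-model";  // placeholder model id
  (*body)["stream"] = false;           // request the non-stream path

  Json::Value message;
  message["role"] = "user";
  message["content"] = "Hello!";
  (*body)["messages"].append(message);

  engine->HandleChatCompletion(
      body, [](Json::Value&& status, Json::Value&& result) {
        // With stream disabled, this callback fires once with the full output.
        std::cout << result.toStyledString() << std::endl;
      });
}
```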
Code Structure

    .
    ├── base                          # Engine interface definition
    │   └── cortex-common             # Common interfaces used for all engines
    │       └── enginei.h             # Defines abstract classes and interface methods for engines
    ├── examples                      # Server example to integrate engine
    │   └── server.cc                 # Example server demonstrating engine integration
    ├── onnxruntime-genai
    │   └── (files from upstream onnxruntime-genai)
    ├── src                           # Source implementation for onnx engine
    │   ├── chat_completion_request.h # OpenAI compatible request handling
    │   ├── onnx_engine.h             # Implements model loading and inference for the onnx engine
    │   └── onnx_engine.cc
    └── third-party                   # Dependencies of the onnx project
        └── (list of third-party dependencies)