ONNX is an open source machine learning (ML) framework that provides interoperability across various frameworks, operating systems, and hardware platforms. ONNX Runtime is the runtime engine used for model inference and training with ONNX.
AWS Graviton3 processors are optimized for ML workloads and include support for bfloat16, Scalable Vector Extension (SVE), and Matrix Multiplication (MMLA) instructions. The bfloat16-accelerated SGEMM kernel and the int8 MMLA-accelerated quantized GEMM (QGEMM) kernel in ONNX Runtime improve fp32 inference performance by up to 65% and int8 quantized inference performance by up to 30% on several natural language processing (NLP) models on AWS Graviton3. Starting with version 1.17.0, ONNX Runtime supports these optimized kernels.
In this post, we show how to run ONNX Runtime inference on AWS Graviton3-based EC2 instances and how to configure them to use the optimized GEMM kernels. We also demonstrate the resulting speedups through benchmark tests.
Optimized GEMM kernels
ONNX Runtime supports the Microsoft Linear Algebra Subroutines (MLAS) backend as the default execution provider (EP) for deep learning operators. AWS Graviton3-based EC2 instances (c7g, m7g, r7g, c7gn, and Hpc7g instances) support the bfloat16 format and MMLA instructions for accelerating deep learning operators. These instructions improve SIMD hardware utilization and reduce end-to-end inference latency by up to 1.65x compared to kernels based on Armv8 DOT product instructions.
The AWS team implemented MLAS kernels for bfloat16 fast math and int8 quantized general matrix multiplication (GEMM) using the BFMMLA, SMMLA, and UMMLA instructions, which provide higher matrix multiplication throughput than DOT instructions. The bfloat16 support enables efficient deployment of models trained with bfloat16, fp32, and automatic mixed precision (AMP) without requiring quantization. As shown in the following figures, the optimized GEMM kernels are integrated into the ONNX Runtime CPU EP as MLAS kernels.
The first figure shows the ONNX Runtime software stack, highlighting (in orange) the components optimized for inference performance improvement on the AWS Graviton3 platform.
The next figure illustrates the ONNX Runtime EP flow, highlighting (in orange) the components optimized for inference performance improvement on the AWS Graviton3 platform.
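Because the optimized kernels sit behind the default CPU EP, no execution provider change is needed in application code. The following minimal Python sketch (the model path is a placeholder, not from this post) confirms that a session is running on the CPU EP, where the optimized MLAS kernels are selected automatically on Graviton3:

import onnxruntime as ort

# List the execution providers available in this onnxruntime build.
print(ort.get_available_providers())

# "model.onnx" is a placeholder path for an exported ONNX model.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Confirm the session is using the CPU EP.
print(session.get_providers())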
Enable the optimizations
These optimizations are part of the ONNX Runtime 1.17.0 release and are available starting with the onnxruntime-1.17.0 Python wheel and conda-1.17.0 packages. The optimized int8 kernels are enabled by default and are selected automatically on AWS Graviton3 processors. The bfloat16 fast math kernels, on the other hand, are not enabled by default and require the following ONNX Runtime session options to enable them:
The kernels are enabled through a session configuration entry, which can be set from both C++ and Python applications.
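A minimal Python sketch follows; it assumes the 1.17.0 configuration key mlas.enable_gemm_fastmath_arm64_bfloat16 (check onnxruntime_session_options_config_keys.h in your installed ONNX Runtime version for the exact key) and uses a placeholder model path:

import onnxruntime as ort

# Enable the bfloat16 fast math GEMM kernels (disabled by default).
# The key name below is taken from the ONNX Runtime 1.17.0 session options
# config keys; verify it against your installed version.
sess_options = ort.SessionOptions()
sess_options.add_session_config_entry("mlas.enable_gemm_fastmath_arm64_bfloat16", "1")

# "model.onnx" is a placeholder for an exported fp32 model.
session = ort.InferenceSession("model.onnx", sess_options, providers=["CPUExecutionProvider"])

In C++, the same configuration entry can be added to the SessionOptions object with AddConfigEntry before the session is created.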
Benchmark results
We first measured the inference throughput, in queries per second, for the fp32 model without any of the optimizations (using ONNX Runtime 1.16.0); this baseline is marked as 1.0 with a red dotted line in the following graph. We then compared the improvement from the bfloat16 fast math kernels in ONNX Runtime 1.17.1 for the same fp32 model inference and plotted the normalized results. You can see that for the BERT, RoBERTa, and GPT-2 models, the throughput improves by up to 65%. Similar improvements were observed for inference latency.
Similar to the preceding fp32 inference comparison, we first measured the inference throughput, in queries per second, for the int8 quantized model without any of the optimizations (using ONNX Runtime 1.16.0); this baseline is marked as 1.0 with a red dotted line in the following graph. We then compared the improvement from the optimized MMLA kernels in ONNX Runtime 1.17.1 for the same model inference and plotted the normalized results. You can see that for the BERT, RoBERTa, and GPT-2 models, the throughput improves by up to 30%. Similar improvements were observed for inference latency.
Benchmark setup
We used an AWS Graviton3-based c7g.4xl EC2 instance and an Ubuntu 22.04-based AMI to demonstrate the performance improvements of the optimized GEMM kernels in ONNX Runtime. The instance and AMI details are listed in the following snippet:
The ONNX Runtime repository provides inference benchmarking scripts for transformer-based language models. The scripts support a wide range of models, frameworks, and formats. We picked PyTorch-based BERT, RoBERTa, and GPT models to cover common language tasks such as text classification, sentiment analysis, and masked word prediction. These models cover both encoder and decoder transformer architectures.
The following code lists the steps to run fp32 model inference with the ONNX Runtime benchmarking script in bfloat16 fast math mode and int8 quantized mode. The script downloads the model, exports it to ONNX format, quantizes it into int8 for int8 inference, and runs inference for different sequence lengths and batch sizes. On successful completion, the script prints the inference throughput in queries per second (QPS) and latency in milliseconds, along with the system configuration. For more details, refer to the ONNX Runtime benchmarking script.
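The original command listing is not reproduced here. As a rough illustration of the measurement the script performs, the following minimal Python sketch quantizes an exported model to int8 with ONNX Runtime's dynamic quantization API and then times inference, reporting throughput in QPS and average latency. The model paths, input names, and shapes are placeholders for a BERT-style export, not values from this post:

from time import perf_counter
import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic

# Placeholder paths for an fp32 model already exported to ONNX and its int8 copy.
fp32_model = "model_fp32.onnx"
int8_model = "model_int8.onnx"

# Dynamic quantization of the weights to int8 (the benchmarking script performs
# an equivalent quantization step before int8 inference).
quantize_dynamic(fp32_model, int8_model, weight_type=QuantType.QInt8)

session = ort.InferenceSession(int8_model, providers=["CPUExecutionProvider"])

# BERT-style inputs; the input names and vocabulary size are assumptions for illustration.
batch_size, seq_len, num_queries = 1, 128, 100
inputs = {
    "input_ids": np.random.randint(0, 30522, (batch_size, seq_len), dtype=np.int64),
    "attention_mask": np.ones((batch_size, seq_len), dtype=np.int64),
}

start = perf_counter()
for _ in range(num_queries):
    session.run(None, inputs)
elapsed = perf_counter() - start

print(f"Throughput: {num_queries * batch_size / elapsed:.2f} QPS")
print(f"Average latency: {elapsed / num_queries * 1000.0:.2f} ms")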
Conclusion
In this post, we discussed how to run ONNX Runtime inference on AWS Graviton3-based EC2 instances and how to configure the instances to use the optimized GEMM kernels. We also demonstrated the resulting speedups. We hope you will give it a try!
If you find use cases where you do not observe similar performance gains on AWS Graviton, please let us know by opening an issue on the AWS Graviton Technical Guide GitHub.
About the author
Sunita Nadampalli is a Software Development Manager at AWS. She leads Graviton software performance optimizations for machine learning and HPC workloads. She is passionate about open source software development and delivering high-performance and sustainable software solutions with Arm SoCs.