ONNX is an open source machine learning (ML) framework that provides interoperability across various frameworks, operating systems, and hardware platforms. ONNX Runtime is the runtime engine used for model inference and training with ONNX.
AWS Graviton3 processors are optimized for ML workloads and include support for bfloat16, Scalable Vector Extension (SVE), and Matrix Multiplication (MMLA) instructions. The bfloat16-accelerated SGEMM kernel and the int8 MMLA-accelerated quantized GEMM (QGEMM) kernel in ONNX Runtime improve fp32 inference performance by up to 65% and int8 quantized inference performance by up to 30% on several natural language processing (NLP) models on AWS Graviton3. Starting with version 1.17.0, ONNX Runtime supports these optimized kernels.
In this post, we show how to run ONNX Runtime inference on AWS Graviton3-based EC2 instances and how to configure them to use the optimized GEMM kernels. We also demonstrate the resulting speedups through benchmark tests.
Optimized GEMM kernels
ONNX Runtime supports the Microsoft Linear Algebra Subroutines (MLAS) backend as the default execution provider (EP) for deep learning operators. AWS Graviton3-based EC2 instances (c7g, m7g, r7g, c7gn, and Hpc7g instances) support the bfloat16 format and MMLA instructions for accelerating deep learning operators. These instructions improve SIMD hardware utilization and reduce end-to-end inference latency by up to 1.65x compared to kernels based on Armv8 DOT product instructions.
The AWS team implemented MLAS kernels for bfloat16 fast math and int8 quantized general matrix multiplication (GEMM) using the BFMMLA, SMMLA, and UMMLA instructions, which provide higher matrix multiplication throughput than DOT instructions. The bfloat16 support enables efficient deployment of models trained with bfloat16, fp32, and automatic mixed precision (AMP) without requiring quantization. As shown in the following figures, the optimized GEMM kernels are integrated into the ONNX Runtime CPU EP as MLAS kernels.
The first figure shows the ONNX Runtime software stack, highlighting (in orange) the components optimized for inference performance improvement on the AWS Graviton3 platform.
The next figure illustrates the ONNX Runtime EP flow, highlighting (in orange) the components optimized for inference performance improvement on the AWS Graviton3 platform.
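Because the optimized kernels sit behind the default CPU EP, no execution provider change is needed in application code. The following minimal Python sketch (the model path is a placeholder, not from this post) confirms that a session is running on the CPU EP, where the optimized MLAS kernels are selected automatically on Graviton3:

import onnxruntime as ort

# List the execution providers available in this onnxruntime build.
print(ort.get_available_providers())

# "model.onnx" is a placeholder path for an exported ONNX model.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Confirm the session is using the CPU EP.
print(session.get_providers())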
Enable the optimizations
These optimizations are part of the ONNX Runtime 1.17.0 release and are available starting with the onnxruntime-1.17.0 Python wheel and conda-1.17.0 packages. The optimized int8 kernels are enabled by default and are selected automatically on AWS Graviton3 processors. The bfloat16 fast math kernels, on the other hand, are not enabled by default and require the following ONNX Runtime session options to enable them:
The kernels are enabled through a session configuration entry, which can be set from both C++ and Python applications.
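A minimal Python sketch follows; it assumes the 1.17.0 configuration key mlas.enable_gemm_fastmath_arm64_bfloat16 (check onnxruntime_session_options_config_keys.h in your installed ONNX Runtime version for the exact key) and uses a placeholder model path:

import onnxruntime as ort

# Enable the bfloat16 fast math GEMM kernels (disabled by default).
# The key name below is taken from the ONNX Runtime 1.17.0 session options
# config keys; verify it against your installed version.
sess_options = ort.SessionOptions()
sess_options.add_session_config_entry("mlas.enable_gemm_fastmath_arm64_bfloat16", "1")

# "model.onnx" is a placeholder for an exported fp32 model.
session = ort.InferenceSession("model.onnx", sess_options, providers=["CPUExecutionProvider"])

In C++, the same configuration entry can be added to the SessionOptions object with AddConfigEntry before the session is created.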
Benchmark results
We first measured the inference throughput, in queries per second, for the fp32 model without any of the optimizations (using ONNX Runtime 1.16.0); this baseline is marked as 1.0 with a red dotted line in the following graph. We then compared the improvement from the bfloat16 fast math kernels in ONNX Runtime 1.17.1 for the same fp32 model inference and plotted the normalized results. You can see that for the BERT, RoBERTa, and GPT-2 models, the throughput improves by up to 65%. Similar improvements were observed for inference latency.
Similar to the preceding fp32 inference comparison, we first measured the inference throughput, in queries per second, for the int8 quantized model without any of the optimizations (using ONNX Runtime 1.16.0); this baseline is marked as 1.0 with a red dotted line in the following graph. We then compared the improvement from the optimized MMLA kernels in ONNX Runtime 1.17.1 for the same model inference and plotted the normalized results. You can see that for the BERT, RoBERTa, and GPT-2 models, the throughput improves by up to 30%. Similar improvements were observed for inference latency.
Benchmark setup
We used an AWS Graviton3-based c7g.4xl EC2 instance and an Ubuntu 22.04-based AMI to demonstrate the performance improvements of the optimized GEMM kernels in ONNX Runtime. The instance and AMI details are listed in the following snippet:
The ONNX Runtime repository provides inference benchmarking scripts for transformer-based language models. The scripts support a wide range of models, frameworks, and formats. We picked PyTorch-based BERT, RoBERTa, and GPT models to cover common language tasks such as text classification, sentiment analysis, and masked word prediction. These models cover both encoder and decoder transformer architectures.
The following code lists the steps to run fp32 model inference with the ONNX Runtime benchmarking script in bfloat16 fast math mode and int8 quantized mode. The script downloads the model, exports it to ONNX format, quantizes it into int8 for int8 inference, and runs inference for different sequence lengths and batch sizes. On successful completion, the script prints the inference throughput in queries per second (QPS) and latency in milliseconds, along with the system configuration. For more details, refer to the ONNX Runtime benchmarking script.
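The original command listing is not reproduced here. As a rough illustration of the measurement the script performs, the following minimal Python sketch quantizes an exported model to int8 with ONNX Runtime's dynamic quantization API and then times inference, reporting throughput in QPS and average latency. The model paths, input names, and shapes are placeholders for a BERT-style export, not values from this post:

from time import perf_counter
import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic

# Placeholder paths for an fp32 model already exported to ONNX and its int8 copy.
fp32_model = "model_fp32.onnx"
int8_model = "model_int8.onnx"

# Dynamic quantization of the weights to int8 (the benchmarking script performs
# an equivalent quantization step before int8 inference).
quantize_dynamic(fp32_model, int8_model, weight_type=QuantType.QInt8)

session = ort.InferenceSession(int8_model, providers=["CPUExecutionProvider"])

# BERT-style inputs; the input names and vocabulary size are assumptions for illustration.
batch_size, seq_len, num_queries = 1, 128, 100
inputs = {
    "input_ids": np.random.randint(0, 30522, (batch_size, seq_len), dtype=np.int64),
    "attention_mask": np.ones((batch_size, seq_len), dtype=np.int64),
}

start = perf_counter()
for _ in range(num_queries):
    session.run(None, inputs)
elapsed = perf_counter() - start

print(f"Throughput: {num_queries * batch_size / elapsed:.2f} QPS")
print(f"Average latency: {elapsed / num_queries * 1000.0:.2f} ms")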
Conclusion
In this post, we discussed how to run ONNX Runtime inference on AWS Graviton3-based EC2 instances and how to configure the instances to use the optimized GEMM kernels. We also demonstrated the resulting speedups. We hope you will give it a try!
If you find use cases where you do not observe similar performance gains on AWS Graviton, please let us know by opening an issue on the AWS Graviton Technical Guide GitHub.
About the author
Sunita Nadampalli is a Software Development Manager at AWS. She leads Graviton software performance optimizations for machine learning and HPC workloads. She is passionate about open source software development and delivering high-performance and sustainable software solutions with Arm SoCs.