AIML 003: Achieving efficient, productive and good quality End-to-End AI Pipelines | Slide 11 | Vrushabh Sanghavi | Performance boost from optimized Intel software vs stock software on the same Xeon hardware. Overall, a 15X speedup is observed from the software acceleration | Performance acceleration observed with Modin (up to 27X) and hyperparameter optimization with SigOpt (3.6X) for the PLAsTiCC application. PLAsTiCC is an open data challenge on Kaggle to classify objects in the sky that vary in brightness. It uses simulated astronomical time-series data resembling observations from the Large Synoptic Survey Telescope being set up in northern Chile. The challenge is to determine the probability that each object belongs to one of 14 classes of astronomical filters. | |
AIML 003: Achieving efficient, productive and good quality End-to-End AI Pipelines | Slide 13 | Vrushabh Sanghavi | Performance boost from a coherent optimization strategy that uses optimized Intel software, tuned parameters, INT8 quantization and multi-instance data parallel execution. Overall, a 3.4X speedup is achieved using these optimizations | Observed for the Document Analysis inference application measured on the IMDb dataset | |
AIML 003: Achieving efficient, productive and good quality End-to-End AI Pipelines | Slide 14 | Vrushabh Sanghavi | Over 3.7x performance gain with AMX on Intel Sapphire Rapids | Observed for the Document Analysis Fine-Tuning application measured on the IMDb dataset | |
AIML004: Accelerate Transformer Training and Inference with Hugging Face, Habana Gaudi Accelerators, and Intel® Xeon® Platform | Slide 4 | Julien Simon, chief evangelist, Hugging Face | 2022: Transformers are Eating Deep Learning | "Transformers are emerging as a general-purpose architecture for ML" https://www.stateof.ai/ RNN and CNN usage down, Transformers usage up https://www.kaggle.com/kaggle-survey-2021 | |
AIML004: Accelerate Transformer Training and Inference with Hugging Face, Habana Gaudi Accelerators, and Intel® Xeon® Platform | Slide 5 | Julien Simon, chief evangelist, Hugging Face | Hugging Face: One of the Fastest Growing Open Source Projects | star-history.com | Jul-05 |
AIML004: Accelerate Transformer Training and Inference with Hugging Face, Habana Gaudi Accelerators, and Intel® Xeon® Platform | Slide 15 | Julien Simon, chief evangelist, Hugging Face | Up to 40% better price performance than latest GPU-based instances | See Amazon press announcement for more information: https://press.aboutamazon.com/news-releases/news-release-details/aws-announces-general-availability-amazon-ec2-dl1-instances EC2 instance pricing published here: https://aws.amazon.com/ec2/pricing/on-demand/ | |
AIML004: Accelerate Transformer Training and Inference with Hugging Face, Habana Gaudi Accelerators, and Intel® Xeon® Platform | Slide 16 | Julien Simon, chief evangelist, Hugging Face | Gaudi2 outperformed Nvidia A100 on MLPerf benchmark for ResNet and BERT | https://mlcommons.org/en/training-normal-20/ | May-22 |
AIML004: Accelerate Transformer Training and Inference with Hugging Face, Habana Gaudi Accelerators, and Intel® Xeon® Platform | Slide 20 | Julien Simon, chief evangelist, Hugging Face | Near-Linear Scaling from 1 to 8 HPUs | dl1.24xlarge, Amazon EC2; 8x Habana Gaudi1, $13.11/hour (us-east-1, on-demand); Ubuntu 20.08, Habana Deep Learning Base AMI, Habana PyTorch container 1.11.0:1.5.0 - 610, measured by Hugging Face | Sep-22 |
AIML004: Accelerate Transformer Training and Inference with Hugging Face, Habana Gaudi Accelerators, and Intel® Xeon® Platform | Slide 25 | Julien Simon, chief evangelist, Hugging Face | Up to 3x Speedup with Negligible Accuracy Drop | Configuration: Test by Intel as of 07/30/2021. 2-node, 2x Intel® Xeon® Platinum 8380 Processor, 40 cores, HT On, Turbo ON, Total Memory 256 GB (16 slots/ 16GB/ 3200 MHz), BIOS: SE5C6200.86B.0022.D64.2105220049(0xd0002b1), Ubuntu 20.04.1 LTS, gcc 9.3.0 compiler, Transformer-Based Models, Deep Learning Framework: PyTorch 1.12, https://download.pytorch.org/whl/cpu/torch-1.12.0+cpu-cp38-cp38-linux_x86_64.whl, BS=1, Public Data, 10 instances/1 sockets, Datatype: FP32/INT8 | Jul-21 |
AIML005: BigDL 2.0: Seamless Scaling of AI Pipelines from Laptops to Distributed Cluster | Slide 12 | Guoqiong Song, Jiao Wang | Up to ~5.8x Training Speedup and ~9.6x Inference Speedup using BigDL-Nano | CVPR 2022 Open Access Repository (thecvf.com) | |
AIML005: BigDL 2.0: Seamless Scaling of AI Pipelines from Laptops to Distributed Cluster | Slide 23 | Guoqiong Song, Jiao Wang | 3x reduction in inference time; 30-50% increase in training throughput | https://networkbuilders.intel.com/solutionslibrary/sk-telecom-intel-build-ai-pipeline-to-improve-network-quality |
AIML006: Accelerate End-to-End AI and Data Science Pipelines with Intel® Optimized Libraries for Python | Slide 7 | Rachel Oberman | Starting on the left in the data preprocessing section, data scientists can expect 1x to 100x faster Pandas workloads by using Intel Distribution of Modin, with a 38x performance increase seen on a workload using 2020 US Census data. With Scikit-Learn, users see up to 100x speedup in model training and inference by using Intel Extension for Scikit-Learn, and with this simple 2-line code change, workloads can be up to 10x faster than on Nvidia GPUs and up to 5x faster than on AMD CPUs. With PyTorch, users can see up to 1.4x faster DLRM training and up to 2.8x faster DLRM inference by using Intel optimizations and more efficient instruction sets. With TensorFlow, users can see up to 2.8x faster quantized inference with Intel optimizations and more efficient instruction sets. | https://www.intel.com/content/www/us/en/developer/articles/technical/blazing-fast-python-data-science-ai-performance.html#gs.c7d2kv | Oct-23-2020 (Scikit-Learn claim), Feb-3-2020 (PyTorch claim), Oct-16-2020 (Modin claim), Oct-26-2020 (TensorFlow claim) |
AIML006: Accelerate End-to-End AI and Data Science Pipelines with Intel® Optimized Libraries for Python | Slide 17 | Rachel Oberman | We observe a considerable speedup for Modin vs stock Pandas for various operations: CSV reading 9x, Query 1 1.8x, Query 2 17x, Query 3 7x, Query 4 6.5x | https://www.codeproject.com/Articles/5330204/Scale-Your-Pandas-Workflow-with-Modin | 10/5/2021 |
AIML006: Accelerate End-to-End AI and Data Science Pipelines with Intel® Optimized Libraries for Python | Slide 20 | Rachel Oberman | With Intel Extension for Scikit-Learn, users can see up to 300x speedup in model training and up to 4000x speedup in inference for some algorithms | https://medium.com/intel-analytics-software/save-time-and-money-with-intel-extension-for-scikit-learn-33627425ae4 | June-8-2021 |
AIML006: Accelerate End-to-End AI and Data Science Pipelines with Intel® Optimized Libraries for Python | Slide 23 | Rachel Oberman | We see a speedup across the board for the higgs1m, letters, airline, mortgage and MSRank datasets from the initial XGBoost version to subsequent XGBoost versions, after we upstreamed the Intel optimizations into XGBoost. | https://www.intel.com/content/www/us/en/developer/articles/technical/improve-performance-xgboost-lightgbm-inference.html#gs.bnk85q | Nov-10-2020 |
AIML011 : Ease of use in leveraging default Intel optimizations for TensorFlow | slide 5 | Sachin Muradi, Om Thakkar, Tsai Louie | AI performance on cloud instances with TensorFlow and oneDNN | Configuration: For Table 1 (Batch mode): Hardware configuration: AWS instance type: C6i.12xlarge, 48 vCPUs, 1 socket, 2 threads per core, Intel® Xeon® Platinum 8375C CPU @ 2.90GHz, memory: 96 GB, OS: Ubuntu 20.04.2, kernel: 5.11.0-1019-aws x86_64. Testing done by Intel on May 19, 2021. Baseline performance: TensorFlow version 2.8. Improved performance: TensorFlow version 2.9. For Table 2 (Online mode): Hardware configuration: AWS instance type: C6i.2xlarge, 8 vCPUs, 1 socket, 2 threads per core, Intel® Xeon® Platinum 8375C CPU @ 2.90GHz, memory: 16 GB, OS: Ubuntu 20.04.2, kernel: 5.11.0-1019-aws x86_64. Testing done by Intel on May 19, 2021. Baseline performance: TensorFlow version 2.8. Improved performance: TensorFlow version 2.9. | May-19-2021 |
AIML011 : Ease of use in leveraging default Intel optimizations for TensorFlow | slide 12 | Sachin Muradi, Om Thakkar, Tsai Louie | Sapphire Rapids speedup | Presented at Intel Innovation 2021 (https://www.youtube.com/watch?v=38wrDHEQZuM , https://edc.intel.com/content/www/us/en/products/performance/benchmarks/innovation-event-claims/) and also at the oneAPI Developer Summit 2022 (https://www.oneapi.io/event-sessions/tbd-session-1/) | |
AIML011 : Ease of use in leveraging default Intel optimizations for TensorFlow | slide 13 | Sachin Muradi, Om Thakkar, Tsai Louie | TensorFlow with oneDNN enabled on Windows on Alder Lake | https://medium.com/intel-analytics-software/accelerate-ai-model-performance-on-the-alder-lake-platform-a5c24ae3f522 | May-13-2022 |
CLI001 Intelligent Collaboration on the Web | Slide 8,13 | Rijubrata Bhaumik | Relative Power Consumption. | https://apps.powerapps.com/play/95b66612-ae95-4105-9972-13577ac9aa05?tenantId=46c98d88-e344-4ed4-8496-4ed7712e255d&source=portal&screenColor=rgba(0, 176, 240, 1)&initscreen=viewitem&item=1854 |
CLI002 Progressive Web Applications, Optimized for Intel XPU Architecture | Slide 11 | Moh Haghighat | 100+ mW power savings achieved for video call on Windows w/ 12th Gen Intel Core | BASELINE: Tested by Intel as of 01/12/2022. Intel ADL-P i7-1255 2/8/2 15W TDP. DDR5 16GB. OS: Win11 21H2 (22000.376). Chromium version: M105 self-built on 01/12/2022 with patches https://chromium-review.googlesource.com/c/chromium/src/+/3754088 and https://chromium-review.googlesource.com/c/chromium/src/+/3329026 applied. Tool used: Intel SoCWatch. Application tested: Microsoft Teams on Chromium browser at 720p30fps, two participants at full screen mode. Camera model: Logitech C920. Hardware acceleration for H.264 encoding/decoding enabled. Chromium command line specified: "--disable-features=WebRtcThreadsUseResourceEfficientType" OPTIMIZED: baseline configuration with Chromium command line specified: "--enable-features=WebRtcThreadsUseResourceEfficientType" | |
CLI002 Progressive Web Applications, Optimized for Intel XPU Architecture | Slide 12 | Moh Haghighat | 10% SoC power savings for video playback on Windows | BASELINE: Tested by Intel as of 07/20/2022. Intel ADL-P i7-1255 2/8/2 15W TDP. DDR5 16GB. OS: Win11 21H2 (22000.376). Chromium version: M105 self-built on 07/20/2022 with patch https://chromium-review.googlesource.com/c/chromium/src/+/3737284 applied. Tool used: Intel SoCWatch. Test clip: Tears of Steel - AVC 1080p 24FPS 10Mbps. Chromium command line specified: "--disable-features=UseBatchDecoderBufferForMediaEngine". OPTIMIZED: baseline configuration with Chromium command line specified: "--enable-features=UseBatchDecoderBufferForMediaEngine,MediaFoundationClearPlayback" | |
CLI003 How Intel's Performance Hybrid Architecture Is Defining the Future of Multitasking for Developers, Gamers and Creators (RPL Launch Session) | Slide 4 | Mark Subotnick | Delivering up to 15% ST and 41% MT Performance 13th Gen Intel® Core™ Desktop Processor | By SPECrate2017_int_base 1 copy and n copy estimates based on measurements on Intel internal reference platforms, comparing i9-13900K to i9-12900K. See the 13th Gen Intel Core Desktop Processor Appendix for additional details. Results may vary. Performance hybrid architecture: Not available on certain 13th Gen Intel Core processors. |
CLI003 How Intel's Performance Hybrid Architecture Is Defining the Future of Multitasking for Developers, Gamers and Creators (RPL Launch Session) | Slide 7 | Mark Subotnick | Leadership in Gaming Performance - Gen on Gen | For all workload and configuration see www.intel.com/PerformanceIndex. Results may vary. Go to Processors, Intel® Core™, and Desktop. | |
CLI003 How Intel's Performance Hybrid Architecture Is Defining the Future of Multitasking for Developers, Gamers and Creators (RPL Launch Session) | Slide 10 | Mark Subotnick | Thermal Velocity | ⨥ On 13th Gen Core i9 125W and 65W SKUs. Max Turbo Frequency refers to the maximum single-core processor frequency that can be achieved with Intel® Turbo Boost Technology. See www.intel.com/technology/turboboost/ for more information. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No product or component can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com. |
CLI003 How Intel's Performance Hybrid Architecture Is Defining the Future of Multitasking for Developers, Gamers and Creators (RPL Launch Session) | Slide 12 | Mark Subotnick | 13th Gen Desktop Overclocking | Altering clock frequency or voltage may damage or reduce the useful life of the processor and other system components, and may reduce system stability and performance. Product warranties may not apply if the processor is operated beyond its specifications. Check with the manufacturers of system and components for additional details. Overclocking results will vary based on system configuration, board power delivery, cooling capability, component or module capabilities, risk tolerance, tuning configuration, unit to unit variations, and other factors. |
CLI003 How Intel's Performance Hybrid Architecture Is Defining the Future of Multitasking for Developers, Gamers and Creators (RPL Launch Session) | Slide 14 | Mark Subotnick | 13th Gen Intel Core Desktop Processors | For all workload and configuration see www.intel.com/PerformanceIndex. Results may vary. Go to Processors, Intel® Core™, and Desktop. | |
CLI003 How Intel's Performance Hybrid Architecture Is Defining the Future of Multitasking for Developers, Gamers and Creators (RPL Launch Session) | Slide 17 | Mark Subotnick | Leadership Gaming Frame Consistency | As measured by benchmark mode score and/or fps measurements of 13th Gen Intel Core i9-13900K with internal reference board and DDR5 5600 MT/s DRAM; and AMD Ryzen 9 5950X with Asus ROG Crosshair Hero 8 board and DDR4 3200MT/s DRAM. The Configurations for all systems include Windows 11 Pro, 1920x1080 Resolution - High Quality Graphics Preset with EVGA RTX 3090 GPU. |
CLI003 How Intel's Performance Hybrid Architecture Is Defining the Future of Multitasking for Developers, Gamers and Creators (RPL Launch Session) | Slide 20 | Mark Subotnick | Leap in Performance for Content Creation | For all workload and configuration see www.intel.com/PerformanceIndex. Results may vary. Go to Processors, Intel® Core™, and Desktop. | |
CLI007 Learn How Intel's Latest Processors for Workstations Increase Productivity of Visualization Workflows of Dassault Systèmes Visualization Solutions | Slide 34 | Partner speaker (Dassault) | Scalability results of Intel Xeon 8168 | Performance results are based on testing by Intel® as of September 12, 2022 and may not reflect all publicly available updates. Date testing performed: September 12, 2022. System configuration: 8168: CPU 2x Intel® Xeon® Platinum 8168+, Memory 12x 16GB DDR4-2666, GPU NVIDIA GeForce GTX 1050 Ti, Storage 480GB Intel® SSD DC S3500 Series, OS Windows 10 21H2 build 19044.1889. Relevant testing/workload setup details: 2x8168: Application settings: python.exe stellarscript\examples\benchmark.py benchmark_WFCPU.yml output.png. OS settings: Power & Sleep - Screen - When plugged in, turn off after: Never; Power & Sleep - Sleep - When plugged in, turn off after: Never. Link HERE for testing details. |
CLI007 Learn How Intel's Latest Processors for Workstations Increase Productivity of Visualization Workflows of Dassault Systèmes Visualization Solutions | Slide 35 | Partner speaker (Dassault) | Scalability results of Intel Xeon 8168 and Gen13 (13900K) | Performance results are based on testing by Intel® as of September 12, 2022 and may not reflect all publicly available updates. Date testing performed: September 12, 2022. System configuration: 8168: CPU 2x Intel® Xeon® Platinum 8168+, Memory 12x 16GB DDR4-2666, GPU NVIDIA GeForce GTX 1050 Ti, Storage 480GB Intel® SSD DC S3500 Series, OS Windows 10 21H2 build 19044.1889. Relevant testing/workload setup details: 2x8168: Application settings: python.exe stellarscript\examples\benchmark.py benchmark_WFCPU.yml output.png. OS settings: Power & Sleep - Screen - When plugged in, turn off after: Never; Power & Sleep - Sleep - When plugged in, turn off after: Never. Link HERE for testing details. | 44904 |
CLI007 Learn How Intel's Latest Processors for Workstations Increase Productivity of Visualization Workflows of Dassault Systèmes Visualization Solutions | Slide 36 | Partner speaker (Dassault) | Performance advantages of Gen13 over Gen12 and Gen11 based on Dassault's proprietary benchmarks | Performance results are based on testing by Intel® as of September 12, 2022 and may not reflect all publicly available updates. Date testing performed: September 09, 2022. System configuration: i9-11900K: CPU Intel® Core™ i9-11900K, Memory 4x 8GB DDR5-3200, GPU NVIDIA GeForce RTX 3070, Storage 1TB Samsung SSD 980 PRO, OS Windows 11 22H2 build 22622.436. i9-12900K: CPU Intel® Core™ i9-12900K, Memory 2x 16GB DDR5-4800, GPU NVIDIA GeForce GTX 1650, Storage 1TB Samsung SSD 980 PRO, OS Windows 11 21H2 build 22000.527. i9-13900K: CPU Intel® Core™ i9-13900K, Memory 2x 16GB DDR5-4800, GPU NVIDIA GeForce RTX 3080, Storage 500GB WD_BLACK SN850 NVMe SSD, OS Windows 11 21H2 build 22000.978. Relevant testing/workload setup details (i9-11900K, i9-12900K, i9-13900K): Application settings: python.exe stellarscript\examples\benchmark.py benchmark_WFCPU.yml output.png. OS settings: Power & Sleep - Screen - When plugged in, turn off after: Never; Power & Sleep - Sleep - When plugged in, turn off after: Never |
CLI009 Learn How To Run A Model on Intel® Movidius™ VPU Using OpenVINO™ | Slide 24 | Gideon Damaryam | We do not make any claims per se, but our lab will have participants generate performance numbers for FP32 and INT8 themselves and compare them on the same device. Our material will also show sample results | The INT8 version of the same FP32 model vastly increases performance | |
CLI011 | | Rajshree Chabukswar | The fastest Performance-cores, with an industry-leading 5.8 GHz | Based on its Max Turbo Frequency of 5.8GHz, which is the highest for any Desktop processor. Additional details at intel.com/PerformanceIndex. | |
CLI011 | | Rajshree Chabukswar | Delivering up to 15% single threaded and 41% multi-threaded performance improvement | By SPECrate2017_int_base 1 copy and n copy estimates based on measurements on Intel internal reference platforms, comparing i9-13900K to i9-12900K. See the 13th Gen Intel Core Desktop Processor Appendix for additional details. Results may vary. Performance hybrid architecture: Not available on certain 13th Gen Intel Core processors. | |
CLI011 | | Rajshree Chabukswar | In fact, we conducted an internal analysis across the top 150 developer workloads and use cases and found that the added 8 E cores can provide up to a 40% performance increase gen-on-gen (depending on app scalability). | Based on internal data and analysis. | |
CLI011 | | Rajshree Chabukswar | This is an example of one of CyberLink's key use cases, Sky Replacement effect in PowerDirector. We worked with them to analyze application hotspots using Intel VTune, carefully identifying optimization areas and then using Intel Compiler with the right compiler flags. All this work provided anywhere from 10% to 5X improvement, depending on the scene and complexity of the Sky mask. | Based on internal data and analysis. | |
CLI011 | | Rajshree Chabukswar | In the case shown here, for an MT application, we have the potential to increase the multi-threaded gains from 60% to almost 80%. | Based on internal data and analysis. | |
DCC001 What's New with Intel Infrastructure Processing Units | Slides 18/19/20 | Nick Tausanovitch | CMCC/Intel: up to 5.5x improvement in bandwidth and packet forwarding rate; up to 5.5x IOPS increase in storage performance. Baidu/Intel: supports up to 1024 device hot-plug/unplug | PCRs 08-24-2022-08 and 09-10-2022-01 | Aug-22 |
DCC003 Optimized Microservices workloads with 4th Gen Intel® Xeon processor, Intel® IPU, Intel® Ethernet and Computational Storage | Slide #24 | Brad Burres | Graph showing performance benefit obtained by moving the load balancer from 4th Gen Xeon to the Intel IPU E2000 (Mount Evans) | In backup slide as well as available on the showcase floor | |
DCC003 - Optimized Microservices workloads with 4th Gen Intel® Xeon Scalable Processors, Intel® Infrastructure Processing Units (Intel® IPUs), Intel® Ethernet, and Computational Storage | Slide 8 | Suzi, Brad, Mrittika, Anil, Michael | Up to 85% fewer cores vs AMD EPYC for ~65K CPS SLA | 1-node, 2x pre-production 4th Gen Intel® Xeon® Scalable Processor (60 cores) with integrated Intel® Quick Assist Accelerator (Intel® QAT), on pre-production Intel® platform and software with DDR5 memory total 1024GB (16x64 GB), microcode 0xf000380, HT On, Turbo Off, SNC Off, Ubuntu 22.04.1 LTS, 5.15.0-47-generic, 1x 1.92TB Intel SSDSC2KG01, 1x Intel® Ethernet Network Adapter E810-2CQDA2, 2x100GbE, QAT engine v0.6.14, QAT v20.l.0.9.1, NGINX 1.20.1, OpenSSL 1.1.1l, IPP crypto v2021_5, IPSec v1.1 , TLS 1.3 AES_128_GCM_SHA256, ECDHE-X25519-RSA2K, tested by Intel September 2022. 1-node, 2x AMD EPYC 7763 processor (64 core) on GIGABYTE R282-Z92 with 1024GB DDR4 memory (16x64 GB), microcode 0xa001144, SMT On, Boost Off, NPS=1, Ubuntu 22.04.1 LTS, 5.15.0-47-generic, 1x 1.92TB Intel SSDSC2KG01, 1x Intel® Ethernet Network Adapter E810-2CQDA2, 2x100GbE, NGINX 1.20.1, OpenSSL 1.1.1l, AES_128_GCM_SHA256, ECDHE-X25519-RSA2K, tested by Intel September 2022. | 9/19/2022 |
DCC008: Develop Like a Rockstar With 4th Gen Intel Xeon Scalable Processors | Slide 11 | Ronak Singhal | For NGINX webserver with RSA2K public key crypto, 4th Gen Intel Xeon SP with built-in acceleration with Intel QAT delivers high client density while saving 6 CPU cores | Intel QAT for NGINX Webserver: Estimated performance comparing 4th Gen Intel Xeon Scalable processor configuration with Intel® QAT enabled, versus same processor without QAT offload, with software optimizations. Configuration: 1-node, 2-sockets (1 socket tested, SPR-E3) 4th Gen Intel Xeon Scalable processor, 52C, 300W TDP with 1, 2 and 4 Intel QAT active devices (with 8 cores/16 threads). Pre-production platform with 512GB (16x32GB 4800MT/s [4800MT/s]) total memory, HT on, Turbo off, internal pre-production BIOS 0x890000a0, Ubuntu 22.04 LTS, 5.15.0-27-generic; Workload: Async NGINX 0.4.7; NGINX TLS 1.3 Webserver with ECDHE-X25519-RSA2K algorithm. Software Configuration: GCC 11.2.0, libraries: OpenSSL 1.1.1o, QAT engine v0.6.12, Intel IPsec MB v1.2, IPP Crypto ippcp_2021.5. Test by Intel as of 6/30/2022. | 9/28/2021 |
DCC008: Develop Like a Rockstar With 4th Gen Intel Xeon Scalable Processors | Slide 14 | Ronak Singhal | Intel AMX delivers a 4.8x speedup on top of oneDNN and quantization optimizations, for a 30x total improvement compared to the TensorFlow baseline on 4th Gen Intel Xeon SP | Intel AMX for SSD-ResNet34: Baseline Configuration: 1-node, 2x 3rd Gen Intel Xeon Platinum 8380 with 512 GB (16 slots/ 32GB/ 3200) total DDR4 memory, microcode 0x8d9522d4, HT on, Turbo on, Ubuntu 20.04.2 LTS (docker), 5.4.0-77-generic, TensorFlow v2.5.0 w/o oneDNN, TensorFlow v2.6.0 w/ oneDNN, test by Intel on 09/28/2021. New Configuration: 1-node, 2x Next Gen Intel Xeon Scalable processor (codenamed Sapphire Rapids, > 40 cores) on Intel pre-production platform with 512 GB DDR memory (8(1DPC)/64GB/4800 MT/s), HT on, Turbo on, CentOS Linux 8.4, internal pre-production BIOS and software running SSD-ResNet34 BS=1 using TensorFlow 2.6 with Intel internal optimization, test by Intel on 09/28/2021. | 4/25/2022 |
DCC008: Develop Like a Rockstar With 4th Gen Intel Xeon Scalable Processors | Slide 17 | Ronak Singhal | 4th Gen Intel Xeon SP with Intel IAA delivers 2X the operations/sec throughput on data compression for RocksDB (read only) versus Zstd software on CPU cores only | Intel IAA for RocksDB: Estimated performance comparing 4th Gen Intel® Xeon® Scalable processor configuration with Intel® IAA enabled, versus same processor running software on CPU cores without IAA offload. Configuration: 1-node, 2 sockets (1 socket tested) 4th Gen Intel Xeon Scalable processor (56-cores, 4x IAA devices) pre-production platform with 512GB (16x32GB <OUT OF SPEC> 4800MT/s [4800MT/s]) total memory, HT on, Turbo on, internal pre-production BIOS 0x8e000260, CentOS Linux 8.4.2105, 5.15.0-spr.bkc.pc.3.21.0.x86_64, internal RocksDB v7.1 with pluggable compression support (db_bench, read only). Software Configuration GCC 8.5.0, ZSTD v1.4.4, p99 latency used. Results depend on block size and database entry size. Read-Only results (Relative ops/s vs data size - 16kB block, 32B value) and Read-Write results (Relative ops/s vs data size - 16kB block, 32B value). Tradeoff up to 16% compressed data size. Test by Intel as of 4/25/2022. | 2/28/2022 |
DCC008: Develop Like a Rockstar With 4th Gen Intel Xeon Scalable Processors | Slide 18 | Ronak Singhal | For the ClickHouse database, 4th Gen Intel Xeon SP with Intel IAA delivers up to 40% higher query throughput with better compression and saves up to 20% memory bandwidth versus LZ4 software running on CPU cores only | Intel IAA for ClickHouse: Baseline Configuration: Test by Intel as of 02/28/2022. 1-node, 2x Intel® Xeon® <SKU, processor>, 48 cores, HT On, Turbo On, Total Memory 1024 GB (16 slots/ 64 GB/ 4800 MHz [run @ 4800 MHz]), BIOS: EGSDCRB1.86B.0072.D01.2201101353, ucode version: 0x8e0001a0, OS Version: CentOS Stream 8, kernel version: 5.12.0-09_15_spr.22.x86_64+server, compiler version: CLANG 14, ClickHouse 21.12/SSB, Normalized score=1, Compression Ratio: 1.84. New Configuration: Test by Intel as of 02/28/2022. 1-node, 2x Intel® Xeon® <SKU, processor>, 48 cores, HT On, Turbo On, Total Memory 1024 GB (16 slots/ 64 GB/ 4800 MHz [run @ 4800 MHz]), BIOS: EGSDCRB1.86B.0072.D01.2201101353, ucode version: 0x8e0001a0, OS Version: CentOS Stream 8, kernel version: 5.12.0-09_15_spr.22.x86_64+server, compiler version: CLANG 14, ClickHouse 21.12/SSB, score=1.06 (QPS), Compression Ratio: 2.62 | |
DCC013 - Intel AI Optimizations through the lens of sustainability | Slide 20,21 | Alex Sin | Claim 1: Optimizations delivered through Intel Extension for PyTorch, Intel OpenMP, PyTorch JIT tracing, channels-last memory layout and TCMalloc provide up to 3.25 times better inference latency than the stock PyTorch baseline. Claim 2: With the above optimizations, the inference application instance consumes up to 3.06 times less energy per image than the stock PyTorch baseline | BASELINE: Tested by Intel as of 09/15/2022. AWS c6i.metal instance, 2 socket Intel® Xeon® Platinum 8375C CPU @ 2.90GHz, 32 cores per socket, HT On, Turbo On, OS Ubuntu 22.04 LTS, Kernel 5.15.0-1017-aws, Microcode 0xd000363, Total Memory 256GB, Framework: PyTorch 1.12, Torchvision, Compiler: GCC 11.2.0, Dummy Test Data: 10k Images 3x224x224, Model Topology: ResNet50_v1.5, Model Weights: Pretrained Imagenet, Workload type: Model inference, Inference Config: 4 instance, 16 cpu cores / instance, Benchmark metric: latency in ms/image, Optimizations applied: None. Energy measurement: Metric - joules/image, kWh/image, Tool used - Linux Turbostat, Emissions Factor (Avoided Energy US average from AVERT EPA) - 0.709 kg-CO2/kWh. OPTIMIZED: Tested by Intel as of 09/15/2022. AWS c6i.metal instance, 2 socket Intel® Xeon® Platinum 8375C CPU @ 2.90GHz, 32 cores per socket, HT On, Turbo On, OS Ubuntu 22.04 LTS, Kernel 5.15.0-1017-aws, Microcode 0xd000363, Total Memory 256GB, Framework: PyTorch 1.12, Torchvision, Intel Extension for PyTorch 1.12, Compiler: GCC 11.2.0, Dummy Test Data: 10k Images 3x224x224, Model Topology: ResNet50_v1.5, Model Weights: Pretrained Imagenet, Workload type: Model inference, Inference Config: 4 instance, 16 cpu cores / instance, Benchmark metric: latency in ms/image, Optimizations applied: KMP_AFFINITY=granularity=fine,compact,1,0 OMP_NUM_THREADS=16 LD_PRELOAD=Intel_OpenMP:TCMalloc, Intel Extension for PyTorch FP32 optimizations, PyTorch TorchScript JIT Trace and Freeze. Energy measurement: Metric - joules/image, kWh/image, Tool used - Linux Turbostat, Emissions Factor (Avoided Energy US average from AVERT EPA) - 0.709 kg-CO2/kWh | |
4th Gen Intel® Xeon® Scalable Processor - Competitive Workload Testing | | | 4th Gen Intel® Xeon® Scalable Processors with Intel® IAA are delivering up to 1.9x higher performance with up to 47% reduced latency on RocksDB compared to AMD EPYC 7763. | 1-node, 2x pre-production 4th Gen Intel Xeon Scalable Processor (60 cores) with integrated Intel In-Memory Analytics Accelerator (Intel IAA), on pre-production Intel platform and software, HT On, Turbo On, Total Memory 1024GB (16x64GB DDR5 4800), microcode 0xf000380, 1x 1.92TB INTEL SSDSC2KG01, Ubuntu 22.04.1 LTS, 5.18.12-051812-generic, QPL v0.1.21,accel-config-v3.4.6.4, ZSTD v1.5.2, RocksDB v6.4.6 (db_bench), tested by Intel September 2022. 1-node, 2x AMD* EPYC 7763 64 core Processor on GIGABYTE R282-Z92 platform, SMT On, Boost On, NPS=1, Total Memory 1024GB (16x64GB DDR4 3200), microcode 0xa001144, 1x 1.92TB INTEL SSDSC2KG01, Ubuntu 22.04.1 LTS, 5.18.12-051812-generic, ZSTD v1.5.2, RocksDB v6.4.6 (db_bench), tested by Intel September 2022. Measurement: RocksDB db_bench 80/20 RW Corpus Data Set | Test by Intel in September 2022 |
4th Gen Intel® Xeon® Scalable Processor - Competitive Workload Testing | | | 4th Gen Intel® Xeon® Scalable Processors with Intel® IAA are delivering up to 59% more queries per second, and up to 29% reduced compression rate on ClickHouse DB vs AMD EPYC 7763. | 1-node, 2x pre-production 4th Gen Intel Xeon Scalable processor (60 cores) with integrated Intel In-Memory Analytics Accelerator (Intel IAA), on pre-production Intel platform and software, HT On, Turbo On, SNC off, Total Memory 1024GB (16x64GB DDR5 4800), microcode 0xf000380, 1x 1.92TB INTEL SSDSC2KG01, Ubuntu 22.04.1 LTS, 5.18.12-051812-generic, QPL v0.1.21, accel-config-v3.4.6.4, gcc 11.2, Clickhouse 21.12, Star Schema Benchmark, tested by Intel September 2022. 1-node, 2x AMD EPYC 7763 64 core Processor on GIGABYTE R282-Z92 platform, SMT On, Boost On, NPS=1, Total Memory 1024GB (16x64GB DDR4 3200), microcode 0xa001144, 1x 1.92TB INTEL SSDSC2KG01, Ubuntu 22.04.1 LTS, 5.18.12-051812-generic, gcc 11.2, Clickhouse 21.12, Star Schema Benchmark, tested by Intel September 2022. Measurement: Clickhouse Queries per second (Q4.1 query). Compress Rate (data compressed/uncompressed)*100 | Test by Intel in September 2022 |
4th Gen Intel® Xeon® Scalable Processor - Competitive Workload Testing | | | 4th Gen Intel® Xeon® Scalable Processors with Intel® QAT can match AMD EPYC 7763 connections per second (CPS) on OpenSSL Crypto with 83% fewer cores utilized for a 65K CPS fixed performance target. | QAT Configuration HW/SW: 1-node, 2x pre-production 4th Gen Intel Xeon Scalable Processor (60 cores) with integrated Intel QuickAssist Accelerator (Intel QAT), on pre-production Intel platform and software with DDR5 memory total 1024GB (16x64 GB), microcode 0xf000380, HT On, Turbo Off, SNC Off, Ubuntu 22.04.1 LTS, 5.15.0-47-generic, 1x 1.92TB Intel SSDSC2KG01, 1x Intel® Ethernet Network Adapter E810-2CQDA2, 2x100GbE, QAT engine v0.6.14, QAT v20.l.0.9.1, NGINX 1.20.1, OpenSSL 1.1.1l, IPP crypto v2021_5, IPSec v1.1, TLS 1.3 AES_128_GCM_SHA256, ECDHE-X25519-RSA2K, tested by Intel September 2022. OOB Configurations: 1-node, 2x pre-production 4th Gen Intel Xeon Scalable Processor (60 cores), on pre-production Intel platform and software with DDR5 memory total 1024GB (16x64 GB), microcode 0xf000380, HT On, Turbo Off, SNC Off, Ubuntu 22.04.1 LTS, 5.15.0-47-generic, 1x 1.92TB Intel SSDSC2KG01, 1x Intel® Ethernet Network Adapter E810-2CQDA2, 2x100GbE, NGINX 1.20.1, OpenSSL 1.1.1l, TLS 1.3 AES_128_GCM_SHA256, ECDHE-X25519-RSA2K, tested by Intel September 2022. 1-node, 2x AMD EPYC 7763 processor (64 core) on GIGABYTE R282-Z92 with 1024GB DDR4 memory (16x64 GB), microcode 0xa001144, SMT On, Boost Off, NPS=1, Ubuntu 22.04.1 LTS, 5.15.0-47-generic, 1x 1.92TB Intel SSDSC2KG01, 1x Intel® Ethernet Network Adapter E810-2CQDA2, 2x100GbE, NGINX 1.20.1, OpenSSL 1.1.1l, AES_128_GCM_SHA256, ECDHE-X25519-RSA2K, tested by Intel September 2022. Measurement: OpenSSL crypto NGINX TLS 1.3 AES_128_GCM_SHA256, ECDHE-X25519-RSA2K; 65K CPS SLA Handshake only | Test by Intel in September 2022 |
4th Gen Intel® Xeon® Scalable Processor - Competitive Workload Testing | | | 4th Gen Intel® Xeon® Scalable Processors with Intel® QAT outperforms AMD EPYC 7763 for Gbps on QATzip while utilizing 96% fewer cores. | 1-node, 2x pre-production 4th Gen Intel® Xeon Scalable Processor (60 core) with integrated Intel QuickAssist Accelerator (Intel QAT), on pre-production Intel platform and software with DDR5 memory Total 1024GB (16x64 GB), microcode 0xf000380, HT On, Turbo Off, SNC Off, Ubuntu 22.04.1 LTS, 5.15.0-47-generic, 1x 1.92TB Intel® SSDSC2KG01, QAT v20.l.0.9.1 , QATzip v1.0.9 , ISA-L v2.3.0, tested by Intel September 2022. 1-node, 2x AMD EPYC 7763 processor (64 core) on GIGABYTE R282-Z92 with 1024GB DDR4 memory (16x64 GB), microcode 0xa001144, SMT On, Boost Off, NPS=1, Ubuntu 22.04.1 LTS, 5.15.0-47-generic, 1x 1.92TB Intel® SSDSC2KG01, QAT v1.7.l.4.16, QATzip v1.0.9 , ISA-L v2.3.0, tested by Intel September 2022. Measurement: QAT zip Level 1 Compression | Test by Intel in September 2022 |
4th Gen Intel® Xeon® Scalable Processor - Competitive Workload Testing | | | 4th Gen Intel® Xeon® Scalable Processors with Intel® QAT can outperform AMD EPYC 7763 on Gbps for IPSec while utilizing 66% fewer cores with a 200 Gbps fixed performance target. | 1-node, 2x pre-production 4th Gen Intel Xeon Scalable processor (60 core) with integrated Intel QuickAssist Accelerator (Intel QAT), on pre-production Intel ® platform and software with 1024GB DDR5 memory (16x64 GB), microcode 0xf000380, HT On, Turbo Off, SNC Off, Ubuntu 22.04.1 LTS, 5.15.0-47-generic , 1x 1.92TB Intel SSDSC2KG01, 1x Intel Ethernet Network Adapter E810-2CQDA2, 2x100GbE, QAT v20.l.0.9.1, DPDK v21.11, IPsec v1.1, VPP 22.02, nasm v2.14.02, AES 128 GCM, VAES instructions, tested by Intel September 2022. 1-node, 2x AMD EPYC 7763 processor (64 core) on Supermicro AS-2124US-TNRP with 1024GB DDR4 memory (16x64 GB), microcode 0xa01173, SMT On, Boost Off, NPS=2, Ubuntu 22.04.1 LTS, 5.15.0-47-generic, 1x 1.92TB Intel SSDSC2KG01, 1x Intel Ethernet Network Adapter E810-2CQDA2, 2x100GbE, DPDK v21.11, IPsec v1.1, VPP 22.02, nasm v2.14.02, AES 128 GCM, tested by Intel September 2022. Measurement: IPSec encrypt VPP IPSec AES_128_GCM; 200 Gbps SLA Target | Test by Intel in September 2022 |
4th Gen Intel® Xeon® Scalable Processor - Competitive Workload Testing | | | 4th Gen Intel® Xeon® Scalable Processors with Intel® DSA deliver up to 2.5x higher performance with up to 60% reduced latency for SPDK NVMe TCP simulating large file read request compared to AMD EPYC 7763. | 1-node, 2x pre-production 4th Gen Intel Xeon Scalable processor (60 core) with integrated Intel Data Streaming Accelerator (Intel DSA), on pre-production Intel platform and software with 1024GB DDR5 memory (16x64 GB), microcode 0xf000380, HT On, Turbo On, SNC Off, Ubuntu 22.04.1 LTS, 5.15.0-47-generic, 1x 1.92TB Intel® SSDSC2KG01, 4x 1.92TB Samsung PM1733, 1x Intel® Ethernet Network Adapter E810-2CQDA2, 2x100GbE, FIO v3.30, SPDK 22.05, tested by Intel September 2022. 1-node, 2x AMD EPYC 7763 processor (64 core) on Supermicro AS-2124US-TNRP with 1024GB DDR4 memory (16x64 GB), microcode 0xa01173, SMT On, Boost On, NPS=2, Ubuntu 22.04.1 LTS, 5.15.0-47-generic, 1x 1.92TB Intel® SSDSC2KG01, 4x 1.92TB Samsung PM1733, 1x Intel® Ethernet Network Adapter E810-2CQDA2, 2x100GbE, FIO v3.30, SPDK 22.05, tested by Intel September 2022. Measurement: SPDK NVMe TCP; 128k QD64 Seq Read | Test by Intel in September 2022 |
4th Gen Intel® Xeon® Scalable Processor - Competitive Workload Testing | | | 4th Gen Intel® Xeon® Scalable Processors with Intel® DSA deliver up to 1.9x higher performance with up to 49% reduced latency for SPDK NVMe TCP simulating database random read request compared to AMD EPYC 7763. | 1-node, 2x pre-production 4th Gen Intel Xeon Scalable processor (60 core) with integrated Intel Data Streaming Accelerator (Intel DSA), on pre-production Intel platform and software with 1024GB DDR5 memory (16x64 GB), microcode 0xf000380, HT On, Turbo On, SNC Off, Ubuntu 22.04.1 LTS, 5.15.0-47-generic, 1x 1.92TB Intel® SSDSC2KG01, 4x 1.92TB Samsung PM1733, 1x Intel® Ethernet Network Adapter E810-2CQDA2, 2x100GbE, FIO v3.30, SPDK 22.05, tested by Intel September 2022. 1-node, 2x AMD EPYC 7763 processor (64 core) on Supermicro AS-2124US-TNRP with 1024GB DDR4 memory (16x64 GB), microcode 0xa01173, SMT On, Boost On, NPS=2, Ubuntu 22.04.1 LTS, 5.15.0-47-generic, 1x 1.92TB Intel® SSDSC2KG01, 4x 1.92TB Samsung PM1733, 1x Intel® Ethernet Network Adapter E810-2CQDA2, 2x100GbE, FIO v3.30, SPDK 22.05, tested by Intel September 2022. Measurement: SPDK NVMe TCP; 16k QD256 Random Read | Test by Intel in September 2022 |
4th Gen Intel® Xeon® Scalable Processor - Competitive Workload Testing | | | 4th Gen Intel® Xeon® Scalable Processors with Intel® AMX deliver up to 4.7x higher performance with up to 79% reduced latency for ResNet50 v1.5 TensorFlow Real Time Image Classification compared to AMD EPYC 7763. | 1-node, 2x pre-production 4th Gen Intel® Xeon® Scalable processor (60 core) with Intel® Advanced Matrix Extensions (Intel AMX), on pre-production Intel® platform and software with 1024GB DDR5 memory (16x64 GB), microcode 0xf000380, HT On, Turbo On, SNC Off, CentOS Stream 8, 5.19.8-1.el8.elrepo.x86_64, 1x 1.92T Intel® SSDSC2KG01, TF 2.9.1, AI Model=Resnet 50 v1_5, best scores achieved using BS1=1 core/instance , BS16=5 cores/instance, using physical cores, tested by Intel September 2022. 1-node, 2x AMD EPYC 7763 processor (64 core) on GIGABYTE R282-Z92 with 1024GB DDR4 memory (16x64 GB), microcode 0xa001144, SMT On, Boost On, NPS=1, Ubuntu 20.04.5 LTS, 5.4.0-125-generic, 1x 1.92TB INTEL_SSDSC2KG01, TF 2.9, ZenDNN=v3.3 (Ubuntu 20.04 required for ZenDNN v3.3) , AI Model=Resnet 50 v1_5, best scores achieved using BS1=2 cores/instance , BS16=8 cores/instance for INT8, BS16=4 cores/instance for fp32, using cores and threads, tested by Intel September 2022. Measurement: ResNet50 v 1.5; TensorFlow real time image classification BS=1 | Test by Intel in September 2022 |
4th Gen Intel® Xeon® Scalable Processor - Competitive Workload Testing | | | 4th Gen Intel® Xeon® Scalable Processors with Intel® AMX deliver up to 6.2x higher performance for ResNet50 v1.5 TensorFlow Batch Image Classification compared to AMD EPYC 7763. | 1-node, 2x pre-production 4th Gen Intel® Xeon® Scalable processor (60 core) with Intel® Advanced Matrix Extensions (Intel AMX), on pre-production Intel® platform and software with 1024GB DDR5 memory (16x64 GB), microcode 0xf000380, HT On, Turbo On, SNC Off, CentOS Stream 8, 5.19.8-1.el8.elrepo.x86_64, 1x 1.92T Intel® SSDSC2KG01, TF 2.9.1, AI Model=Resnet 50 v1_5, best scores achieved using BS1=1 core/instance , BS16=5 cores/instance, using physical cores, tested by Intel September 2022. 1-node, 2x AMD EPYC 7763 processor (64 core) on GIGABYTE R282-Z92 with 1024GB DDR4 memory (16x64 GB), microcode 0xa001144, SMT On, Boost On, NPS=1, Ubuntu 20.04.5 LTS, 5.4.0-125-generic, 1x 1.92TB INTEL_SSDSC2KG01, TF 2.9, ZenDNN=v3.3 (Ubuntu 20.04 required for ZenDNN v3.3) , AI Model=Resnet 50 v1_5, best scores achieved using BS1=2 cores/instance , BS16=8 cores/instance for INT8, BS16=4 cores/instance for fp32, using cores and threads, tested by Intel September 2022. Measurement: ResNet50 v 1.5; TensorFlow batch image classification BS=16 | Test by Intel in September 2022 |
NEC003 Providing an Optimal Network Stack | Slide #12 | Edwin Verplanke/Keith Wiles | Graphs showing Performance | In backup Slide #25 shows the configuration. | |
NEC003 Providing an Optimal Network Stack | OVS-DPDK (Slide 6) | Edwin Verplanke/Keith Wiles | Chart showing performance | Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 88 On-line CPU(s) list: 0-87 Thread(s) per core: 2 Core(s) per socket: 22 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 85 Model name: Intel® Xeon® Gold 6152 CPU @ 2.10GHz Stepping: 4 CPU MHz: 3163.438 BogoMIPS: 4200.00 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 1024K L3 cache: 30976K NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86 NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke flush_l1d | #### |
NEC004 Designing Cloud Native Enterprise Network and Security Edge Platforms on Intel Architecture | Slide 21 | Brian Will & Tarun Viswanathan | NGINX TLS Handshake performance for ECDHE-X25519-RSA2K is >3x higher using Intel crypto instruction optimisations when compared to default openssl for the same core count on Intel Xeon Platinum 8470N (SPR E3). Intel QAT on SPR further increases the performance by another 1.6x when compared to Intel crypto instructions. | Backup slides have the detailed system configuration, workload and raw performance scaling (slides 33-38) | 06/22/2022 |
NEC004 Designing Cloud Native Enterprise Network and Security Edge Platforms on Intel Architecture | Slide 22 | Brian Will & Tarun Viswanathan | Snort with Hyperscan performs 2.4 times better than unmodified Snort using ac-bnfa | Backup slides have the detailed system configuration (slides 39-41) | 09/16/2022 |
NEC006 Capturing, Managing and Analyzing Real Time Data Where it Matters | Slide 8 | Rita H. Wouhaybi/Sam Kaira | We are able to ingest 1250 FPS in our video pipeline on Comet Lake and EII 3.0 | In backup slides we have details of the system setup (slides 22-23) | #### |
OAC005 Crack the challenges of serverless | slide 19, 20, 22, 23 | Cathy Zhang | The cold start latency is reduced by more than 50% using the snapshot-based way of creating the application instance. The application container image size is reduced by more than 50%. The new scaling approach more than doubles or triples the scaling speed. The higher the scaling concurrency requirement, the greater the performance gain from the new scaling approach | | ##### |
OAC010: Driving toward standard Web APIs for XPU Accelerated AI | Slide 8 | Bryan Bernhart | Up to 56% speed-up on MobileNet in the TensorFlow.js Model Benchmark from dynamically converting 128-bit Wasm SIMD instructions into 256-bit IA instructions. | | Aug-03-2022 |
OAC010: Driving toward standard Web APIs for XPU Accelerated AI | Slide 23 | Bryan Bernhart | A new Web API called WebNN (Neural Network) can be used as an execution backend by Web machine learning (ML) frameworks, such as TensorFlow-Lite Web and ONNXRuntime Web, and delivers near-native inference performance for various computer vision models on CPU and GPU. Compared with MobileNet V2 inference on CPU through legacy WebAssembly, it delivers a ~9.6x speedup vs Wasm SIMD and a ~3.1x speedup vs the fastest Wasm configuration (SIMD + Multithreads) for client AI use cases such as image classification. | | Aug-03-2022 |
OAC012 | Slide 31 | Andrew Richards | Nbody is a well-known algorithm for simulating a fictional galaxy. From the gif there you can see how it simulates the movement of fictional stars. This is the formula that is used: it calculates the force each star in our fictional galaxy experiences, which is just the sum of all the other interactions in the galaxy. It's intentionally simple, for example it doesn't use any shared memory, and computation scales as O(N²), but it could be made a much bigger problem. In terms of rendering this uses OpenGL, which is in a separate translation unit. For this simple project we ran the original CUDA code and then the SYCL code that we migrated on the same Nvidia GPU. In this example the performance of the kernel code was quite comparable. I would emphasise that this can vary a lot depending on your own project, but this is an example of a comparison. | Machine used: Lenovo ThinkBook 16p RTX3060 Laptop (AMD Ryzen 7 5800H Processors, Nvidia GeForce RTX 3060) with Ubuntu 20.04, DPC++ built from source and CUDA 16. Compiler flags used are available in the GitHub project build scripts at https://github.com/codeplaysoftware/cuda-to-sycl-nbody. | Sep-22 |