


Nvidia Unveils Pascal Tesla P100 With Over 20 TFLOPS Of FP16 Performance - Powered By GP100 GPU With 15 Billion Transistors & 16GB Of HBM2

Nvidia has just unveiled its fastest GPU yet here at GTC 2016, a brand new graphics chip based on the company's next generation Pascal architecture. The GP100 is NVIDIA's most advanced GPU to date, powering the company's next generation compute monster, the Tesla P100.

Nvidia claims that GP100 is the largest FinFET GPU that has ever been fabricated, measuring roughly 600mm² and packing over 15 billion transistors. The Tesla P100 features a slightly cut-back GP100 GPU and delivers 5.3 teraflops of double precision compute, 10.6 TFLOPS of single precision compute and 21.2 TFLOPS of half precision FP16 compute. Keeping this massive GPU fed is 4MB of L2 cache and a whopping 14MB worth of register files.
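A quick sanity check of those figures from the chip's own specifications, assuming the standard two floating point operations per CUDA core per clock (one fused multiply-add):

    FP32: 3584 cores × 2 FLOP/clock × 1.48 GHz ≈ 10.6 TFLOPS
    FP64: 1792 cores × 2 FLOP/clock × 1.48 GHz ≈ 5.3 TFLOPS
    FP16: 2 × the FP32 rate ≈ 21.2 TFLOPS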

The entire Tesla P100 package comprises several chips, not just the GPU, that collectively add up to over 150 billion transistors, and it features 16GB of stacked HBM2 VRAM for a total of 720GB/s of bandwidth. Nvidia's CEO & Co-Founder Jen-Hsun Huang confirmed that this behemoth of a graphics card is already in volume production, with samples already delivered to customers, which will begin announcing their products in Q4 and shipping them in Q1 2017.

Pascal GP100 Architecture & Specs

Nvidia Press Release

Five Architectural Breakthroughs
The Tesla P100 delivers its unprecedented performance, scalability and programming efficiency based on five breakthroughs:

  • NVIDIA Pascal architecture for exponential performance leap -- A Pascal-based Tesla P100 solution delivers over a 12x increase in neural network training performance compared with a previous-generation NVIDIA Maxwell™-based solution.

  • NVIDIA NVLink for maximum application scalability -- The NVIDIA NVLink™ high-speed GPU interconnect scales applications across multiple GPUs, delivering a 5x acceleration in bandwidth compared to today's best-in-class solution. Up to eight Tesla P100 GPUs can be interconnected with NVLink to maximize application performance in a single node, and IBM has implemented NVLink on its POWER8 CPUs for fast CPU-to-GPU communication.

  • 16nm FinFET for unprecedented energy efficiency -- With 15.3 billion transistors built on 16 nanometer FinFET fabrication technology, the Pascal GPU is the world's largest FinFET chip ever built. It is engineered to deliver the fastest performance and best energy efficiency for workloads with near-infinite computing needs.

  • CoWoS with HBM2 for big data workloads -- The Pascal architecture unifies processor and data into a single package to deliver unprecedented compute efficiency. An innovative approach to memory design, Chip on Wafer on Substrate (CoWoS) with HBM2, provides a 3x boost in memory bandwidth performance, or 720GB/sec, compared to the Maxwell architecture.

  • New AI algorithms for peak performance -- New half-precision instructions deliver more than 21 teraflops of peak performance for deep learning.

The GP100 GPU is comprised of 3840 CUDA cores, 240 texture units and a 4096-bit memory interface. The 3840 CUDA cores are arranged in six Graphics Processing Clusters, or GPCs for short, each of which has 10 Pascal Streaming Multiprocessors. As mentioned earlier in the article, the Tesla P100 features a cut-down GP100 GPU. This cut-back version has 3584 CUDA cores and 224 texture mapping units.

Pascal Tesla P100 GPU Board

Each Pascal streaming multiprocessor includes 64 FP32 CUDA cores, half that of Maxwell. Within each Pascal streaming multiprocessor there are two 32-CUDA-core partitions, two dispatch units, a warp scheduler and a fairly large instruction buffer, matching that of Maxwell.

Pascal GP100

The massive GP100 GPU has significantly more Pascal streaming multiprocessors, or CUDA core blocks, and each of these has access to a register file that's the same size as the one in Maxwell's 128-CUDA-core SMM. This means that each Pascal CUDA core has access to twice the register file capacity, so in turn we should expect even more performance out of each Pascal CUDA core compared to Maxwell.
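The numbers in the specification table further down bear this out: 56 SMs × 256 KB of register file per SM = 14,336 KB, or roughly 14MB in total, which works out to about 4 KB of register file per FP32 CUDA core on GP100 versus about 2 KB per core on Maxwell's GM200 (6,144 KB shared across 3,072 cores).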

NVIDIA GP100 Block Diagram

Nvidia Press Release

Tesla P100 Specifications
Specifications of the Tesla P100 GPU accelerator include:

  • 5.3 teraflops double-precision performance, 10.6 teraflops single-precision performance and 21.2 teraflops half-precision performance with NVIDIA GPU BOOST™ technology

  • 160GB/sec bi-directional interconnect bandwidth with NVIDIA NVLink

  • 16GB of CoWoS HBM2 stacked memory

  • 720GB/sec memory bandwidth with CoWoS HBM2 stacked memory

  • Enhanced programmability with page migration engine and unified memory (a short CUDA sketch of this programming model follows this list)

  • ECC protection for increased reliability

  • Server-optimized for highest data center throughput and reliability
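On the unified memory and page migration point above, here is a minimal CUDA sketch of what that programming model looks like. It is an illustrative example rather than code from Nvidia's materials: a single cudaMallocManaged allocation is visible to both the CPU and the GPU, and pages migrate on demand to whichever processor touches them.

    // Minimal unified memory sketch (illustrative, not from Nvidia's announcement).
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(float *data, int n, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;                // pages migrate to the GPU on first touch
    }

    int main() {
        const int n = 1 << 20;
        float *data;
        cudaMallocManaged(&data, n * sizeof(float)); // one allocation, visible to CPU and GPU

        for (int i = 0; i < n; ++i) data[i] = 1.0f;  // initialized on the CPU

        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
        cudaDeviceSynchronize();                     // pages migrate back on the CPU access below

        printf("data[0] = %f\n", data[0]);           // prints 2.000000
        cudaFree(data);
        return 0;
    }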

Tesla P100 Boosts To Almost 1.5GHz

Perhaps one of the most exciting, yet somewhat predictable, revelations about the GP100 Pascal flagship GPU is that it can reach clocks even higher than Maxwell. Despite Nvidia opting for very conservative clock speeds on its professional GPUs, like the Tesla and Quadro products, the P100 actually has a base clock speed of 1328MHz and a boost clock speed of 1480MHz. And that's before considering that GPU Boost 2.0 actually allows these cards to operate at even higher clock speeds than the nominal boost clock.

We're looking at actual frequencies of upwards of 1500MHz on the GeForce equivalent of the P100, which is inevitably going to launch as the next GTX Titan. This means boost clocks of upwards of 1600MHz on factory overclocked models, and perhaps 2GHz+ manual overclocks. This should be extremely exciting news for all GeForce fans.

Tesla Products            | Tesla K40        | Tesla M40        | Tesla P100
GPU                       | GK110 (Kepler)   | GM200 (Maxwell)  | GP100 (Pascal)
SMs                       | 15               | 24               | 56
TPCs                      | 15               | 24               | 28
FP32 CUDA Cores / SM      | 192              | 128              | 64
FP32 CUDA Cores / GPU     | 2880             | 3072             | 3584
FP64 CUDA Cores / SM      | 64               | 4                | 32
FP64 CUDA Cores / GPU     | 960              | 96               | 1792
Base Clock                | 745 MHz          | 948 MHz          | 1328 MHz
GPU Boost Clock           | 810/875 MHz      | 1114 MHz         | 1480 MHz
Compute Performance FP32  | 5.04 TFLOPS      | 6.82 TFLOPS      | 10.6 TFLOPS
Compute Performance FP64  | 1.68 TFLOPS      | 0.21 TFLOPS      | 5.3 TFLOPS
Texture Units             | 240              | 192              | 224
Memory Interface          | 384-bit GDDR5    | 384-bit GDDR5    | 4096-bit HBM2
Memory Size               | Up to 12 GB      | Up to 24 GB      | 16 GB
L2 Cache Size             | 1536 KB          | 3072 KB          | 4096 KB
Register File Size / SM   | 256 KB           | 256 KB           | 256 KB
Register File Size / GPU  | 3840 KB          | 6144 KB          | 14336 KB
TDP                       | 235 Watts        | 250 Watts        | 300 Watts
Transistors               | 7.1 billion      | 8 billion        | 15.3 billion
GPU Die Size              | 551 mm²          | 601 mm²          | 610 mm²
Manufacturing Process     | 28-nm            | 28-nm            | 16-nm

Nvidia Pascal - 2X Perf/Watt With 16nm FinFET, Stacked Memory (HBM2), NV-Link And Mixed Precision Compute

There are four hallmark technologies for the Pascal generation of GPUs, namely HBM2, mixed precision compute, NV-Link and the smaller, more power efficient TSMC 16nm FinFET manufacturing process. Each is very important in its own right, and as such we're going to break down each of these four separately.

Pascal To Be Nvidia's First Graphics Architecture To Feature High Bandwidth Memory (HBM)

Stacked memory will debut on the green side with Pascal, HBM Gen2 more precisely, the second generation of the high bandwidth JEDEC memory standard co-developed by SK Hynix and AMD. The new memory will enable memory bandwidth to exceed 1 Terabyte/s, which is 3X the bandwidth of the Titan X. The new memory standard will also allow for a huge increase in memory capacities, 2.7X the memory capacity of Maxwell to be precise, which indicates that the new Pascal flagship will feature 32GB of video memory, a mind-bogglingly huge number.
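Some rough arithmetic on what that means at the pin level, assuming the same 4096-bit interface the P100 already uses: 720 GB/s × 8 ÷ 4096 ≈ 1.4 Gb/s per pin today, while 1 TB/s × 8 ÷ 4096 ≈ 2.0 Gb/s per pin, which is the top of the HBM2 specification.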

We've already seen AMD take advantage of HBM memory technology with its Fiji XT GPU last year, which features 512GB/s of memory bandwidth, twice that of the GTX 980. AMD also announced last month at its Capsaicin event that it will be bringing HBM2 to its next generation Vega architecture, succeeding the 14nm FinFET Polaris architecture launching this summer with GDDR5 memory.


Pascal Is Nvidia's First Graphics Architecture To Deliver Half Precision Compute FP16 At Double The Rate Of Full Precision FP32

One of the more significant features revealed for Pascal is the addition of FP16 compute support, otherwise known as mixed precision compute or half precision compute. In this mode the accuracy of the result of any computation is significantly lower than with the standard FP32 method, which is required by all major graphics programming interfaces in games and has been for more than a decade. This includes DirectX 12, 11, 10 and DX9 Shader Model 3.0, which debuted almost a decade ago. This makes mixed precision mode unusable for any modern gaming application.

However, due to its very attractive power efficiency advantages over FP32 and FP64, it can be used in scenarios where a high degree of computational precision isn't necessary, which makes mixed precision computing especially useful on power limited mobile devices. Nvidia's Maxwell GPU architecture, featured in the GTX 900 series of GPUs, is limited to FP32 operations, which in turn means that FP16 and FP32 operations are processed at the same rate by the GPU. Adding the mixed precision capability in Pascal means that the architecture will now be able to process FP16 operations twice as quickly as FP32 operations. And as mentioned above, this can be of great benefit in power limited, light compute scenarios.
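In CUDA terms, the doubled FP16 rate is exposed through packed half2 arithmetic, where one instruction operates on two 16-bit values at once. The kernel below is a minimal sketch of the idea, not code from Nvidia's announcement, and it assumes a GPU with native FP16 arithmetic (compiled with, for example, -arch=sm_60):

    // Packed FP16 AXPY sketch: each __hfma2 processes two half-precision elements.
    #include <cuda_fp16.h>

    __global__ void axpy_fp16(const __half2 *x, __half2 *y, __half2 a, int n2) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n2) {
            y[i] = __hfma2(a, x[i], y[i]);   // a * x + y on a pair of FP16 values
        }
    }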

16nm FinFET Manufacturing Process Technology

TSMC's new 16nm FinFET process promises to be significantly more power efficient than planar 28nm. It also promises to bring about a considerable improvement in transistor density, which would enable Nvidia to build faster, significantly more complex and more power efficient GPUs.

TSMC's 16FF+ (FinFET Plus) technology can provide above 65 percent higher speed, around 2 times the density, or 70 percent less power than its 28HPM technology. Compared with 20SoC technology, 16FF+ provides an extra 40% higher speed and 60% power saving. By leveraging the experience of 20SoC technology, TSMC 16FF+ shares the same metal backend process in order to quickly improve yield and demonstrate process maturity for time-to-market value.

Nvidia's Proprietary High-Speed Platform Atomics Interconnect For Servers And Supercomputers - NV-Link

Pascal will also be the first Nvidia GPU to feature the company's new NV-Link technology, which Nvidia states is 5 to 12 times faster than PCIe 3.0.

The technology targets GPU accelerated servers where cross-chip communication is extremely bandwidth limited and a major system bottleneck. Nvidia states that NV-Link will be up to 5 to 12 times faster than traditional PCIe 3.0, making it a major step forward in platform atomics. Earlier this year Nvidia announced that IBM will be integrating this new interconnect into its upcoming PowerPC server CPUs. NVLink debuts with Nvidia's Pascal in 2016 before it makes its way to Volta in 2018.
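From the CUDA programming side, multi-GPU peer-to-peer access is the usual way this extra bandwidth gets exercised. The sketch below is an illustrative example rather than code from the article; it assumes two GPUs in one node, with the peer copy taking the NVLink path when the hardware provides it and falling back to PCIe otherwise.

    // Peer-to-peer copy sketch between two GPUs (illustrative; assumes devices 0 and 1 exist).
    #include <cuda_runtime.h>

    int main() {
        int can01 = 0, can10 = 0;
        cudaDeviceCanAccessPeer(&can01, 0, 1);       // can device 0 access device 1's memory?
        cudaDeviceCanAccessPeer(&can10, 1, 0);

        size_t bytes = 256u << 20;                   // 256 MB test buffer
        float *buf0 = nullptr, *buf1 = nullptr;

        cudaSetDevice(0);
        cudaMalloc(&buf0, bytes);
        if (can01) cudaDeviceEnablePeerAccess(1, 0); // let device 0 map device 1's memory

        cudaSetDevice(1);
        cudaMalloc(&buf1, bytes);
        if (can10) cudaDeviceEnablePeerAccess(0, 0);

        // Direct GPU-to-GPU copy; with peer access enabled it avoids staging through host memory.
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
        cudaDeviceSynchronize();

        cudaFree(buf1);
        cudaSetDevice(0);
        cudaFree(buf0);
        return 0;
    }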

NVLink is an energy-efficient, high-bandwidth communications channel that uses up to three times less energy to move data on the node at speeds 5-12 times those of conventional PCIe Gen3 x16. First available in the NVIDIA Pascal GPU architecture, NVLink enables fast communication between the CPU and the GPU, or between multiple GPUs.
Figure 3: NVLink is a key building block in the compute node of Summit and Sierra supercomputers.

Volta GPU featuring NVLink and stacked memory: NVLink GPU high-speed interconnect at 80-200 GB/s; 3D stacked memory with 4x higher bandwidth (~1 TB/s), 3x larger capacity and 4x better energy efficiency per bit.

NVLink is a fundamental technology in Summit's and Sierra's server node architecture, enabling IBM POWER CPUs and NVIDIA GPUs to access each other's memory quickly and seamlessly. From a programmer's perspective, NVLink erases the visible distinction between data attached to the CPU and data attached to the GPU by "merging" the memory systems of the CPU and the GPU with a high-speed interconnect. Because both CPU and GPU have their own memory controllers, the underlying memory systems can be optimized differently (the GPU's for bandwidth, the CPU's for latency) while still presenting as a unified memory system to both processors. NVLink offers two distinct benefits for HPC customers. First, it delivers improved application performance, simply by virtue of greatly increased bandwidth between elements of the node. Second, NVLink with Unified Memory technology allows developers to write code much more seamlessly and still achieve high performance. via NVIDIA News


Pascal brings many new improvements to the table, both in terms of hardware and software. However, the focus is crystal clear and is 100% about pushing power efficiency and compute performance higher than ever before. The plethora of new updates to the architecture and the ecosystem underline this focus.

Pascal will be the company's first graphics architecture to use next generation stacked memory technology, HBM2. It will also be the first ever to feature a brand new, built from the ground up, high-speed proprietary interconnect, NV-Link. Mixed precision support is also going to play a major role in introducing a step function improvement in perf/watt in mobile applications.

GPU Family            | AMD Vega           | NVIDIA Pascal
Flagship GPU          | Vega 10            | GP102
GPU Process           | 14nm FinFET        | 16nm FinFET
GPU Transistors       | Up to 18 Billion   | 12 Billion
Memory                | Up to 16 GB HBM2   | 12GB GDDR5X
Bandwidth             | 512 GB/s           | 480 GB/s
Graphics Architecture | Vega (NCU)         | Pascal
Predecessor           | Fiji (Fury Series) | GM200 (900 Series)

Source: https://wccftech.com/nvidia-pascal-gpu-gtc-2016/
