AI Startup Etched Unveils Transformer ASIC Claiming 20x Speed-up Over NVIDIA H100 (2024)


Tuesday, June 25th 2024

AleksandarK

Tuesday, 13:30 (37 Comments)

A new startup emerged from stealth mode today to power the next generation of generative AI. Etched is a company that makes an application-specific integrated circuit (ASIC) to process "Transformers." The transformer is a deep learning model architecture developed by Google and is now the powerhouse behind models like OpenAI's GPT-4o in ChatGPT, Anthropic's Claude, Google's Gemini, and Meta's Llama family. Etched set out to create an ASIC that processes only transformer models, resulting in a chip called Sohu. The claim is that Sohu outperforms NVIDIA's latest and greatest by an entire order of magnitude. Where a server with eight NVIDIA H100 GPUs pushes Llama-3 70B at 25,000 tokens per second, and a server with eight of the latest B200 "Blackwell" GPUs pushes 43,000 tokens per second, an eight-chip Sohu server is claimed to output 500,000 tokens per second.
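The speed-up figures follow directly from the quoted throughput numbers. A minimal sketch of that arithmetic, assuming only the three vendor-claimed figures above (the dictionary keys and the `speedup` helper are illustrative names, not anything published by Etched):

```python
# Throughput figures as quoted in the article (tokens per second, Llama-3 70B,
# eight-accelerator server). These are vendor claims, not measurements.
claimed_tokens_per_s = {
    "8x H100": 25_000,
    "8x B200": 43_000,
    "8x Sohu": 500_000,
}


def speedup(baseline: str, candidate: str = "8x Sohu") -> float:
    """Return how many times faster `candidate` is than `baseline`."""
    return claimed_tokens_per_s[candidate] / claimed_tokens_per_s[baseline]


for name in ("8x H100", "8x B200"):
    print(f"Sohu vs {name}: {speedup(name):.1f}x")
```

With these inputs the ratio against the H100 server is exactly 20, and the ratio against the B200 server comes out somewhat above 10.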

Why is this important? Not only does the ASIC outperform Hopper by 20x and Blackwell by more than 10x, but it also serves so many tokens per second that it enables an entirely new class of AI applications requiring real-time output. The Sohu architecture is said to be so efficient that 90% of its FLOPS can be used, while traditional GPUs reach a 30-40% FLOP utilization rate. That low utilization translates into inefficiency and wasted power, which Etched hopes to solve by building an accelerator dedicated to running transformers (the "T" in GPT) at massive scale. Given that frontier model development costs more than one billion US dollars, and hardware costs are measured in tens of billions of US dollars, having an accelerator dedicated to a specific application can help advance AI faster. AI researchers often say that "scale is all you need" (echoing the legendary "attention is all you need" paper), and Etched wants to build on that.
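To put the utilization claim in perspective, the sketch below compares useful throughput per unit of peak FLOPS. It assumes, purely for illustration, that both devices have the same peak FLOPS, which the article does not state; `utilization_advantage` is a hypothetical helper name.

```python
def utilization_advantage(asic_util: float, gpu_util: float) -> float:
    """Ratio of useful FLOPS, assuming (hypothetically) equal peak FLOPS."""
    return asic_util / gpu_util


# Utilization figures quoted in the article.
SOHU_UTIL = 0.90
GPU_UTIL_RANGE = (0.30, 0.40)

best_case = utilization_advantage(SOHU_UTIL, GPU_UTIL_RANGE[0])   # GPU at 30%
worst_case = utilization_advantage(SOHU_UTIL, GPU_UTIL_RANGE[1])  # GPU at 40%
print(f"Utilization alone: {worst_case:.2f}x to {best_case:.2f}x")

# Whatever part of the claimed 20x over H100 is not explained by utilization
# would have to come from elsewhere (for example, more raw compute per chip).
CLAIMED_SPEEDUP = 20.0
print(f"Remaining factor: {CLAIMED_SPEEDUP / best_case:.1f}x "
      f"to {CLAIMED_SPEEDUP / worst_case:.1f}x")
```

Under that equal-peak assumption, utilization by itself accounts for only a small multiple, so most of the claimed gap would need another explanation.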

However, there are some doubts going forward. While it is generally believed that transformers are the "future" of AI development, having an ASIC solves the problem until the operations change. For example, this is reminiscent of the crypto mining craze, which brought a few cycles of crypto ASIC miners that are now worthless pieces of sand. Ethereum is a case in point: miners were built to mine ETH under proof of work, and now that ETH has transitioned to proof of stake, ETH mining ASICs are worthless.

Nonetheless, Etched wants the success formula to be simple: run transformer-based models on the Sohu ASIC with an open-source software ecosystem and scale it to massive sizes. While details are scarce, we know that the ASIC is paired with 144 GB of HBM3E memory and that the chip is manufactured on TSMC's 4 nm process. This is said to enable AI models with 100 trillion parameters, more than 55x bigger than GPT-4's reported 1.8 trillion parameter design.
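A rough sense of scale for these numbers can be sketched as follows. The parameter counts and the 144 GB figure come from the text; the one-byte-per-parameter weight format is an assumption made here for illustration only, and the estimate counts weights alone (no KV cache, activations, or replication).

```python
import math

CLAIMED_PARAMS = 100 * 10**12   # 100 trillion parameters (from the article)
GPT4_PARAMS = 18 * 10**11       # 1.8 trillion parameters (figure quoted in the article)
HBM_PER_CHIP_GB = 144           # HBM3E capacity per Sohu chip (from the article)

size_ratio = CLAIMED_PARAMS / GPT4_PARAMS
print(f"Model size ratio: {size_ratio:.1f}x")


def chips_to_hold_weights(params: int, bytes_per_param: int = 1,
                          hbm_gb: int = HBM_PER_CHIP_GB) -> int:
    """Minimum chips whose combined HBM can hold the weights.

    bytes_per_param=1 is an assumed 8-bit weight format, not a published spec.
    """
    total_bytes = params * bytes_per_param
    return math.ceil(total_bytes / (hbm_gb * 10**9))


print("Chips needed for weights alone:", chips_to_hold_weights(CLAIMED_PARAMS))
```

Under these assumptions a model of that size would span several hundred chips, so the 100 trillion figure is best read as a statement about scaling across many servers rather than about a single eight-chip box.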

Source: Etched



Space Lynx

Astronaut

RIP Nvidia?

P4-630

Space Lynx said: RIP Nvidia?

That would be something...

dgianstefani

TPU Proofreader

Space Lynx said: RIP Nvidia?

Doubt.jpg

While it is generally believed that transformers are the "future" of AI development, having an ASIC solves the problem until the operations change.

Denver

"The Sohu architecture is so efficient that 90% of the FLOPS can be used, while traditional GPUs boast a 30-40% FLOP utilization rate."

In some cases even less than that. Eventually, the limitations of silicon might compel companies to rethink their GPUs completely, but it's hard to say for sure.

john_

GPUs can be used for many things, no matter how inefficient they might look. I think this is the reason why Intel's Gaudi isn't having as much success. While in the short term buying specialized hardware looks like the smart move from any perspective, that hardware could end up as a huge and highly expensive pile of garbage if things change somewhat, as mentioned in the article. With GPUs, you adapt them or just put them to work on other computational tasks.

64K

It comes down to $$$ at the end of the day. Can this do the same thing as Nvidia GPUs for the same money, or preferably less? They are the relative unknown, and businesses are more comfortable sticking with the known, which gives Nvidia the edge.

Assimilator

I wonder how long before somebody creates an "AI" company called "Grift".

Steevo

Denver said: "The Sohu architecture is so efficient that 90% of the FLOPS can be used, while traditional GPUs boast a 30-40% FLOP utilization rate." In some cases even less than that. Eventually, the limitations of silicon might compel companies to rethink their GPUs completely, but it's hard to say for sure.

Just look at how AMD screwed the 7900XTX in benchmarking: the dual issue doesn't work unless it's explicitly in the code, meaning that while it performs great with game-ready drivers, generic benchmarks or unaware software suffer at half the performance in many situations. GPU hardware is slowly converging on one standard, like x86-64 or ARM. Pretty soon it's going to be like ARM hardware: you check the boxes for your application, and silicon or whatever substrate is shipped to you.

Assimilator said: I wonder how long before somebody creates an "AI" company called "Grift".

The milk maids are a milking.......

ADB1979

Assimilator said: I wonder how long before somebody creates an "AI" company called "Grift".

Call the company GAI: Grift A I. Tagline: The A I does the grift for you, it's the GAI way.

#10

Vya Domus

Steevo said: Just look at how AMD screwed the 7900XTX in benchmarking: the dual issue doesn't work unless it's explicitly in the code

Nvidia architectures struggle with utilization just as much, AD102 has 30% more FP32 units than Navi31 but is only 20% faster in raster. In fact every architecture does, CPUs included.

#11

Broken Processor

I wonder if ASICs coming out, or soon to be coming out, are the reason for the Nvidia stock drop. Say what you want about tech investors, but they are a very tech-savvy, clued-in bunch, and maybe more ASIC products are on the way. Or maybe my tinfoil hat needs to be pressed, folded and recycled.

#12

Broken Processor said: I wonder if ASICs coming out, or soon to be coming out, are the reason for the Nvidia stock drop. Say what you want about tech investors, but they are a very tech-savvy, clued-in bunch, and maybe more ASIC products are on the way. Or maybe my tinfoil hat needs to be pressed, folded and recycled.

I sold all my NVDA stock once it hit 1100 or 110. It was a nice run. I still own AMD stock; I bought that right after Zen 1 was released. But I didn't buy very much of it.

#13

dragontamer5788

There's a ton of architectures that are better than NVidia or GPUs in general for AI.

The fundamental fact is that NVidia GPUs are doing FP16 4x4 matrix multiplications as their basis. You can gain significantly more efficiency by going to an 8x8 or 16x16 matrix (or go TPU and go to a full 256x256 matrix). The matrices in these "Deep Learning AI" models are all huge, so processing bigger and bigger matrices at a time leads to more efficiency in power, area, etc.

The issue is that the 4x4 matrix multiplication was chosen because it fits in a GPU register space. It's basically the best a general-purpose GPU can do on NVidia's architecture, for various reasons. I'd expect that if a few more registers (or 64-way CDNA cores from AMD) were used, then maybe 8x4 or maybe 8x8 sizes could be possible, but even AMD is doing 4x4 matrix sizes on their GPUs. So 4x4 is it.

Anyone can just take a bigger fundamental matrix, write software that efficiently splits up the work as 8x8 or 16x16 (or Google TPU it to 256x256 splits) and get far better efficiency. It's not a secret, and such "systolic arrays" are cake to do from an FPGA perspective. The issue is that these bigger architectures are "not GPUs" anymore, and will be useless outside of AI. And furthermore, you have even more competitors (Google TPU in particular) who you actually should be gunning for.

No one is buying NVidia GPUs to lead in AI. They're buying GPUs so that they have something else to do if the AI bubble pops. It's a hedged bet. If you go 100% AI with your ASIC chip (like Google or this "Etched" company), you're absolutely going to get eff'd when the AI bubble pops, as all those chips suddenly become worthless. The NVidia GPUs will lose valuation, but there's still other compute projects you can do with them afterwards.
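The tile-size argument in the comment above can be illustrated with a small counting model. The sketch below is plain Python, not a description of how NVIDIA tensor cores or Google's TPU are actually built: it splits a square matrix multiplication into `tile` x `tile` blocks and counts how many operand elements each choice of tile size has to fetch. The function name, the counters, and the 16x16 test matrices are all made up for illustration.

```python
def tiled_matmul(a, b, tile):
    """Multiply square matrices (lists of lists) one tile at a time.

    Returns (product, tile_ops, elements_loaded), where elements_loaded
    counts operand elements fetched if every tile multiply reloads both
    of its input tiles (a deliberately simple cost model).
    """
    n = len(a)
    assert n % tile == 0, "matrix size must be a multiple of the tile size"
    c = [[0] * n for _ in range(n)]
    tile_ops = 0
    elements_loaded = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # One tile-sized multiply-accumulate: the unit of work a
                # fixed-size matrix engine would perform.
                tile_ops += 1
                elements_loaded += 2 * tile * tile
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        acc = 0
                        for k in range(k0, k0 + tile):
                            acc += a[i][k] * b[k][j]
                        c[i][j] += acc
    return c, tile_ops, elements_loaded


N = 16
A = [[i + j for j in range(N)] for i in range(N)]
B = [[i - j for j in range(N)] for i in range(N)]

results = {t: tiled_matmul(A, B, t) for t in (4, 8, 16)}
for t, (_, ops, loaded) in results.items():
    macs_per_load = N ** 3 / loaded
    print(f"tile {t:2d}: {ops:3d} tile ops, {loaded} elements loaded, "
          f"{macs_per_load:.0f} multiply-adds per element loaded")
```

In this model the total number of multiply-adds is the same for every tile size, but the number of operand fetches halves each time the tile edge doubles. That is one way to read the claim that larger fundamental matrices are more efficient, at the cost of hardware that is only useful for large matrix products.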

#14

Marcus L

john_ said: GPUs can be used for many things, no matter how inefficient they might look. I think this is the reason why Intel's Gaudi isn't having as much success. While in the short term buying specialized hardware looks like the smart move from any perspective, that hardware could end up as a huge and highly expensive pile of garbage if things change somewhat, as mentioned in the article. With GPUs, you adapt them or just put them to work on other computational tasks.

When it gets to that point with an ASIC, you are probably many GPU generations ahead anyway. So do you keep buying GPUs at extortionate prices, or buy an ASIC and replace it when it becomes obsolete? People are talking like GPUs don't become obsolete and turn into e-waste... they indeed do, when performance, efficiency, instruction sets, APIs, etc. are behind the latest generation.

#15

Evrsr

dragontamer5788 said: There's a ton of architectures that are better than NVidia or GPUs in general for AI.
The fundamental fact is that NVidia GPUs are doing FP16 4x4 matrix multiplications as their basis. You can gain significantly more efficiency by going to an 8x8 or 16x16 matrix (or go TPU and go to a full 256x256 matrix). The matrices in these "Deep Learning AI" models are all huge, so processing bigger and bigger matrices at a time leads to more efficiency in power, area, etc.
The issue is that the 4x4 matrix multiplication was chosen because it fits in a GPU register space. It's basically the best a general-purpose GPU can do on NVidia's architecture, for various reasons. I'd expect that if a few more registers (or 64-way CDNA cores from AMD) were used, then maybe 8x4 or maybe 8x8 sizes could be possible, but even AMD is doing 4x4 matrix sizes on their GPUs. So 4x4 is it.
Anyone can just take a bigger fundamental matrix, write software that efficiently splits up the work as 8x8 or 16x16 (or Google TPU it to 256x256 splits) and get far better efficiency. It's not a secret, and such "systolic arrays" are cake to do from an FPGA perspective. The issue is that these bigger architectures are "not GPUs" anymore, and will be useless outside of AI. And furthermore, you have even more competitors (Google TPU in particular) who you actually should be gunning for.
No one is buying NVidia GPUs to lead in AI. They're buying GPUs so that they have something else to do if the AI bubble pops. It's a hedged bet. If you go 100% AI with your ASIC chip (like Google or this "Etched" company), you're absolutely going to get eff'd when the AI bubble pops, as all those chips suddenly become worthless. The NVidia GPUs will lose valuation, but there's still other compute projects you can do with them afterwards.

Nobody is hedging with ASICs that cost more than $30000 which are only worth that price at AI. Which is why AMD is #1 on Top 500 and Intel is #2.

You are not doing really much else with A100s and later, because they are not V100s. They are very, very much specialized for AI calculations and economically worthless for anything else.

#16

dragontamer5788

Evrsr said: Nobody is hedging with ASICs that cost more than $30000 which are only worth that price at AI. Which is why AMD is #1 on Top 500 and Intel is #2.
You are not doing really much else with A100s and later, because they are not V100s. They are very, very much specialized for AI calculations and economically worthless for anything else.

A100 and H100 are still better at FP64 and FP32 than their predecessors. They're outrageously expensive because of the AI chips, but the overall GPU-performance (aka: traditional FP64 physics modeling performance) is still outstanding.

As such, the H100 is still a hedge. If AI collapses tomorrow, I'd rather have an H100 than an "Etched" AI ASIC. Despite being a hedge, the H100 is still the market leader in practical performance, thanks to all the software optimizations in CUDA (even if the low-level 4x4 matrix multiplication routines it is built around are much smaller and less efficient than large 8x8 or 16x16 sized competitors).

#17

Vya Domus

dragontamer5788 said: As such, the H100 is still a hedge.

I really doubt it. Things like MI300 look to be much faster in general-purpose compute and likely a lot cheaper. If demand for ML drops off a cliff, you don't want these on your hands; it will take ages until you reach ROI.

Nvidia really doesn't treat these as anything more than ML accelerators despite them still being "GPUs" technically, they have far inferior FP64/FP16 performance compared to MI300 for example.

#18

trsttte

john_ said: GPUs can be used for many things, no matter how inefficient they might look. I think this is the reason why Intel's Gaudi isn't having as much success. While in the short term buying specialized hardware looks like the smart move from any perspective, that hardware could end up as a huge and highly expensive pile of garbage if things change somewhat, as mentioned in the article. With GPUs, you adapt them or just put them to work on other computational tasks.

Let's put things a different way: everything is an ASIC, the A (application) can just be more or less generic. A GPU is an ASIC designed for a wide range of applications, this startup thingy is designed for a very specific set of instructions, and a TPU or NPU is not as generic as a GPU but also not as constrained as what's typically referred to as an ASIC, like this thingy. Right now everyone is using Nvidia GPUs because the software stack is very robust and things are still developing too quickly to become tied to a specific instruction set.

That will eventually change.

Broken Processor said: I wonder if ASICs coming out, or soon to be coming out, are the reason for the Nvidia stock drop. Say what you want about tech investors, but they are a very tech-savvy, clued-in bunch, and maybe more ASIC products are on the way. Or maybe my tinfoil hat needs to be pressed, folded and recycled.

Nah, they're just as dumb as anyone else, otherwise Nvidia wouldn't be the most valuable company in the world right now.

dragontamer5788 said: Anyone can just take a bigger fundamental matrix, write software that efficiently splits up the work as 8x8 or 16x16 (or Google TPU it to 256x256 splits) and get far better efficiency. It's not a secret, and such "systolic arrays" are cake to do from an FPGA perspective. The issue is that these bigger architectures are "not GPUs" anymore, and will be useless outside of AI. And furthermore, you have even more competitors (Google TPU in particular) who you actually should be gunning for.

How does Intel's XMX architecture fare with that?

dragontamer5788 said: No one is buying NVidia GPUs to lead in AI. They're buying GPUs so that they have something else to do if the AI bubble pops. It's a hedged bet. If you go 100% AI with your ASIC chip (like Google or this "Etched" company), you're absolutely going to get eff'd when the AI bubble pops, as all those chips suddenly become worthless. The NVidia GPUs will lose valuation, but there's still other compute projects you can do with them afterwards.

I don't think it's about hedging their bets. I think it's just a case of what's available and easy to start with, because of all the work Nvidia has already put into a robust software stack.

#19

Prima.Vera

Are any of those claimed results verified by somebody?
I can claim the sea and the sun, but without actual proof and 3rd party confirmation, I'm just dust in the wind...

#20

Minus Infinity

Assimilator said: I wonder how long before somebody creates an "AI" company called "Grift".

It has, but it's spelled Grok. fElon Musk has all bases covered.

#21

dragontamer5788

Prima.Vera said: Are any of those claimed results verified by somebody?
I can claim the sea and the sun, but without actual proof and 3rd party confirmation, I'm just dust in the wind...

The benefits, and downsides, of a textbook systolic array architecture are well known and well studied.

If you know how memory is going to move, then you can hardwire the data movements to occur. A hardwired data movement is just that: a wire. It's not even a transistor... a dumb wire is the cheapest thing in cost, power and has instantaneous performance.

The problem with hardwired data movements is that they're hardwired. They literally cannot do anything else. If it's add then multiply, the hardwired logic will only do add then multiply (unlike a CPU or GPU, which can change the order and the data, and do other things).

I can certainly believe that a large systolic array is exponentially faster at this job. But the downside is that it's... Hardwired. Unchanging. Inflexible.
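The "hardwired data movement" idea can be made concrete with a small software simulation of a textbook systolic array. This is a sketch of one common variant (output-stationary, with skewed inputs), chosen here for illustration; nothing in the thread or the article says this is how Sohu works, and `systolic_matmul` is a made-up name. Each processing element (PE) only ever receives a value from its left neighbour and one from its upper neighbour, multiplies them, adds the result to its own accumulator, and passes both values on.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a @ b.

    a is n x k, b is k x m (lists of lists). PE (i, j) accumulates c[i][j].
    Rows of `a` enter from the left edge and columns of `b` from the top
    edge, each delayed by one cycle per row/column (the usual input skew).
    Returns (product, cycles).
    """
    n, k_dim, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]     # per-PE accumulator
    a_reg = [[0] * m for _ in range(n)]   # value each PE passes to the right
    b_reg = [[0] * m for _ in range(n)]   # value each PE passes downward
    cycles = n + m + k_dim - 2
    for t in range(cycles):
        # Walk from the bottom-right corner so that every PE reads the value
        # its neighbour held in the previous cycle, not the updated one.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:
                    k = t - i  # skewed feed into the left edge
                    a_in = a[i][k] if 0 <= k < k_dim else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    k = t - j  # skewed feed into the top edge
                    b_in = b[k][j] if 0 <= k < k_dim else 0
                else:
                    b_in = b_reg[i - 1][j]
                a_reg[i][j] = a_in
                b_reg[i][j] = b_in
                # The only operation a PE can perform: multiply, then add.
                acc[i][j] += a_in * b_in
    return acc, cycles


product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, "in", cycles, "cycles")
```

There is no instruction stream and no addressing in the inner loop: the wiring (left-to-right, top-to-bottom) fully determines what gets multiplied with what. That is the source of both the efficiency and the inflexibility described above.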

--------

Systolic arrays were deployed for error correction back in the CD-ROM days, since the error correction always had the same order of math in a regular matrix multiplication pattern. Same with hardware RAID or other ASICs. They've been superior in performance for decades.

The question of ASIC AI accelerators is not about the performance benefits. The pure question is if it is a worthy $Billion-ish investment and business plan. It's only a good idea if it makes all the money back.

#22

ADB1979

trsttte said: Let's put things a different way, everything is an ASIC, the A (application) can just be more or less generic.

The A, "application", is nothing more than a generic word that has no meaning in this context until it is placed next to the S, "specific". "Application Specific" is very, VERY different from "Application" in this context, and I would personally take ASIC and all four of its letters/words together as one, because that is how it is meant to be understood and used. I have not read anything else you wrote beyond this point, because all of your arguments hinge on this point about "Application" vs "Application Specific" in discussing this proposed new ASIC product.

#23

Lewzke

And what if someone comes up with a better model? Transformers are probably just the beginning.

#24

kondamin

Lewzke said: And what if someone comes up with a better model? Transformers are probably just the beginning.

Like mining, you need new machines.

Unlike mining, you get to keep running that old workload, maybe for a cheaper subscription fee, as it's still worth something.

Before the as-a-service model, buying something and having it do that specific thing until you bought something new was normal.

#25

dragontamer5788

Vya Domus said: I really doubt it. Things like MI300 look to be much faster in general-purpose compute and likely a lot cheaper. If demand for ML drops off a cliff, you don't want these on your hands; it will take ages until you reach ROI.
Nvidia really doesn't treat these as anything more than ML accelerators despite them still being "GPUs" technically, they have far inferior FP64/FP16 performance compared to MI300 for example.

AMD is making good hardware, but as usual the question is if AMD's software can keep up.

For the most part, people don't want to port off CUDA for minor gains that AMD's hardware represents. The HPC / Supercomputer guys probably aren't even using ROCm for the most part, but are instead writing programs at higher-levels and relying upon a smaller team of specialists to port just elements of their kernel to ROCm one step at a time. (A structure only possible because National Labs have much more $$$ to afford specialist programmers like this).

I think AMD is making good progress. They've found that NVidia is lagging on traditional SIMD compute and have carved out a niche for themselves. But NVidia still "wins" because of the overall software package in practice.
