Nvidia’s New AI Chips Are Reportedly Overheating in Server Farms


Customers are pissed.

Graphic Pushback Unit

Nvidia’s unreleased AI chips are reportedly overheating, with customers worrying that their already-delayed shipment may be pushed back yet again.

As The Information reports, the company’s uber-powerful Blackwell graphics processing units (GPUs) are overheating when connected in server racks that can hold up to 72 of them.

According to Nvidia employees who’ve been working on releasing the chips, as well as customers and vendors with knowledge of the issue, the firm has repeatedly asked its suppliers to redesign the racks to head off the overheating problem.

The issue is so problematic that the company informed Microsoft this week that shipment will be delayed at least another three months — the latest development in a series of pushbacks that have haunted the company since the Blackwell chips were first unveiled back in March.

That doesn’t bode well, considering the astronomical resources AI companies are pouring into building out server farms. Growing pains like these could hold back their efforts to train and roll out their next AI products.

Design and Demand

Nvidia claims its next-generation GPUs are extremely powerful and 30 times as fast as preceding models when it comes to AI applications. As CEO Jensen Huang told CNBC last month, demand for Blackwell chips has been “insane” as people rush to pre-order the chips that cost tens of thousands of dollars apiece.

Amid all that hype, however, rumors of design flaws have plagued the release of the Blackwell chips for months. Eventually, Huang admitted that some of the hearsay was true.

“We had a design flaw in Blackwell, it was functional, but the design flaw caused the yield to be low,” the CEO said during an October 23 press conference, per Reuters. “It was 100 percent Nvidia’s fault.”

While that admission appears to relate to a separate production issue, the flaw nevertheless seems to have caused yet another unnecessary delay in the shipment process.

In the meantime, an Nvidia spokesperson claimed that the latest overheating issue was nothing to worry about and that “the engineering iterations are normal and expected.”

The massive rack of 72 GPUs weighs a whopping 3,000 pounds and needs to be cooled using water, a departure from the air-cooling many AI data centers have come to rely upon. According to The Information, Nvidia was struggling with even a much smaller 36-GPU rack overheating.

As the immense hype surrounding the release of new AI products continues to grow, the pressure is rising considerably for Nvidia.

Customers have already been hit with delays of the new Blackwell chips, and the latest development likely won’t sit well with them either.

More on AI computing power: AI Expert Warns Crash Is Imminent As AI Improvements Hit Brick Wall




By stp2y
