Thursday, July 4, 2024

DDN Cranks the Data Throughput with AI400X2 Turbo

DDN today launched a new version of its high-end storage solution for AI and high-performance computing, which Nvidia uses to power its massive Eos supercomputer. The AI400X2 Turbo features a 30% performance boost compared to the system it replaces, which will enable customers to more efficiently train large language models when paired with Nvidia GPUs.

DDN has a long history developing storage solutions for the HPC business. In the new AI era, it has leveraged that leadership to serve the exploding need for high-speed storage to train large language models (LLMs) and other AI models.

While the training data for an LLM is rather modest by big data standards, the need to frequently back up, or checkpoint, the model during a training session has driven the demand. For instance, when Nvidia started working with AI400X2 systems two years ago, it required a group of storage systems capable of delivering 1TB per second for reads and 500GB per second for writes, according to James Coomer, senior vice president of products for DDN.

“That was very important to them,” Coomer says. “Even though this was an LLM and rationally you think that’s only words, that’s not huge volumes, the model size becomes very large and they need to be checkpointed a lot.”
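The relationship between model size and checkpoint pressure can be sketched with back-of-the-envelope arithmetic. The figures below are illustrative assumptions, not DDN or Nvidia specifications:

```python
# Sketch: how long one checkpoint flush takes when the write is
# bandwidth-bound. All inputs are hypothetical examples.

def checkpoint_seconds(params_billions: float,
                       bytes_per_param: float,
                       write_gb_per_sec: float) -> float:
    """Seconds to write one checkpoint of the given model size."""
    checkpoint_gb = params_billions * bytes_per_param  # 1e9 params * bytes = GB
    return checkpoint_gb / write_gb_per_sec

# A 175B-parameter model with ~16 bytes/param of weights plus optimizer
# state is roughly a 2.8 TB checkpoint; at the 500 GB/s write rate the
# article cites, that flush takes about 5.6 seconds of wall time.
print(round(checkpoint_seconds(175, 16, 500), 1))
```

The point of the exercise: checkpoints happen repeatedly during a run, and every second spent writing one is a second the GPUs sit idle, which is why write bandwidth, not capacity, drives the storage requirement.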

Nvidia’s Eos supercomputer, which is backed by AI400X2 storage, is number 9 on the TOP500 list

Nvidia, which is holding its GPU Technology Conference this week in San Jose, California, adopted the AI400X2 for its own supercomputer, dubbed Eos, which was launched in March 2022. The 18-exaflop cluster sports 48 AI400X2 appliances, which deliver 4.3 TB/sec reads and 3.1 TB/sec writes to the SuperPOD loaded with 576 DGX systems and more than 4,600 H100 GPUs.
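Those aggregate numbers line up with the per-appliance rates DDN quotes elsewhere for the AI400X2 (90 GB/s reads, 65 GB/s writes), as a quick consistency check shows:

```python
# Consistency check of the quoted Eos figures: 48 AI400X2 appliances
# at 90 GB/s read and 65 GB/s write per appliance.
appliances = 48
read_tb_s = appliances * 90 / 1000   # aggregate reads in TB/s
write_tb_s = appliances * 65 / 1000  # aggregate writes in TB/s
print(read_tb_s, write_tb_s)  # 4.32 and 3.12, matching the ~4.3/3.1 TB/s quoted
```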

“That write performance was a really big goal for them because of the checkpointing operations,” says Kurt Kuckein, vice president of marketing for DDN. “Their whole goal was to ensure around 2 TB/sec and we were able to achieve above 3 [TB/sec] for the write performance.”

That total throughput would theoretically go up 30% with the new AI400X2 Turbo that DDN announced today. As a 2U appliance, the AI400X2 Turbo can read data at speeds up to 120 GB/s and write data at speeds up to 75 GB/s, with total IOPS of 3 million. That compares with 90 GB/s for reads and 65 GB/s for writes for the AI400X2, which the AI400X2 Turbo replaces atop the DDN stack.

Customers will be able to leverage that 30% benefit in several ways: cranking through more work in the same amount of time, getting the same job done faster, or getting the same job done with fewer storage systems, DDN says.

“We can reduce the number of appliances provisioned, and so potentially you get 30% savings in power versus just raw performance, training times and things like that,” Kuckein says. “Depending on the number of GPUs and things that you have, potentially you’re just reducing the storage footprint.”

When customers string multiple AI400X2 appliances together to Nvidia DGX systems or SuperPODs over 200Gb InfiniBand or Ethernet networks, the total throughput goes up accordingly. But it’s not just about the hardware investment, Coomer says.

“For us of course the argument isn’t really that we do 120 GB/sec. The biggest argument by far is customers of ours have spent like $100 million on infrastructure and cooling and networks and data scientists and data centers and stuff. There’s a big competitive play out there to get your models done faster. It’s about spending 5% of that budget max on storage if you choose DDN, then you get more productive output.”

DDN has experienced a significant increase in sales as a result of the GenAI boom. The company says its 2023 sales for AI storage were double the 2022 level.

“We didn’t know it was going to be like this,” Coomer said. “We posted a press release last year saying we shipped as much in the first quarter as we did in the previous year. This year, it kind of looks like it might become comparable.”

The AI400X2 Turbo will be available soon. The appliances can be fitted with 2.5-inch NVMe drives in 30TB to 500TB capacities. Along with DDN’s file system, it includes quality of service, port zoning detection, and data integrity check/correction functionality.

Related Items:

AWS Delivers ‘Lightning’ Fast LLM Checkpointing for PyTorch

GenAI Doesn’t Need Bigger LLMs. It Needs Better Data

Why Object Storage Is the Answer to AI’s Biggest Challenge
