Crushing the cost curve: Enterprises adopt new strategies for efficient AI inference at scale

Steen Graham, chief executive officer of Metrum AI, and Ace Stryker, director of AI and ecosystem marketing at Solidigm, discuss cost-efficient architectures for AI inference at scale with theCUBE at SC25.

Artificial intelligence demands are forcing companies to rethink infrastructure design from the ground up. As organizations grapple with the exorbitant cost of the high-bandwidth memory required for modern workloads, the architectural approach to achieving AI inference at scale is undergoing a radical transformation.

While the initial phase of the AI boom focused on securing enough graphics processing units to run state-of-the-art models, the current phase is defined by a rigorous focus on total cost of ownership and system efficiency. This is the area where Metrum AI Inc. is making significant gains, according to Steen Graham (pictured, left), chief executive officer of Metrum AI.

“What if we offloaded our vector database and indexed in the solid-state drive?” Graham said. “There’s terabytes of memory there where … when you are just looking at the system [dynamic random access] memory, your footprint is dramatically lowered. Can you do something like that and still get the performance you are looking for and the accuracy you are looking for as well?”

Graham and Ace Stryker (right), director of AI and ecosystem marketing at Solidigm, a trademark of SK Hynix NAND Products Solutions Corp., spoke with John Furrier and Dave Vellante at SC25, during an exclusive interview on theCUBE, SiliconANGLE Media’s livestreaming studio. They discussed cost-efficient AI inference at scale, infrastructure and offloading workloads to storage. (* Disclosure below.)

Architecting storage for AI inference at scale

The industry is moving beyond simple text-based queries into bandwidth-intensive video intelligence. To manage this, companies are looking to offload data from expensive DRAM to high-performance storage. Solidigm and Metrum AI showcased a solution using vision-language models that runs an AI inference workflow in real time, parsing video and generating clips based on specific points of interest. This approach leverages high-density NAND flash storage to house massive datasets. By moving data that typically resides in memory onto high-density storage, organizations can dramatically reduce costs without sacrificing the recall performance required for enterprise applications, according to Stryker.
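To make the pattern concrete, here is a minimal sketch of the memory-offload idea the demo illustrates: the embedding index lives in a file on the SSD and is memory-mapped, so DRAM only holds the chunk currently being scanned. The file path, sizes and brute-force cosine search are illustrative assumptions, not Metrum AI's or Solidigm's implementation.

```python
# Minimal sketch: keep the vector index on the SSD instead of in DRAM, and
# stream it through memory in chunks at query time. Paths, sizes and the
# brute-force search are illustrative assumptions only.
import numpy as np

DIM = 768                                   # embedding width (assumed)
N_VECTORS = 50_000_000                      # far larger than available DRAM
INDEX_PATH = "/mnt/nvme/embeddings.f32"     # hypothetical SSD-resident index file

# Memory-map the index: the OS pages data in from the SSD on demand,
# so resident DRAM stays small regardless of the index size.
index = np.memmap(INDEX_PATH, dtype=np.float32, mode="r", shape=(N_VECTORS, DIM))

def search(query: np.ndarray, top_k: int = 10, chunk: int = 1_000_000):
    """Brute-force cosine search, scanning the SSD-resident index in chunks."""
    q = query / np.linalg.norm(query)
    best_scores = np.full(top_k, -np.inf, dtype=np.float32)
    best_ids = np.full(top_k, -1, dtype=np.int64)
    for start in range(0, N_VECTORS, chunk):
        block = np.asarray(index[start:start + chunk])   # read one chunk from SSD
        norms = np.linalg.norm(block, axis=1) + 1e-12
        scores = (block @ q) / norms
        local = np.argpartition(scores, -top_k)[-top_k:]
        merged_scores = np.concatenate([best_scores, scores[local]])
        merged_ids = np.concatenate([best_ids, local + start])
        keep = np.argpartition(merged_scores, -top_k)[-top_k:]
        best_scores, best_ids = merged_scores[keep], merged_ids[keep]
    order = np.argsort(-best_scores)
    return best_ids[order], best_scores[order]
```

A production system would use a disk-native approximate-nearest-neighbor index rather than a linear scan, but the memory math is the same: the working set in DRAM is bounded by the chunk size, not the corpus size.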

“Part [of it] has to do with kind of zooming out and looking at things, not just in terms of the storage device itself, but at a system level,” Stryker said. “How are the components and the resources working together? — which is really at the heart of the work we’re doing.”

The collaboration between the two companies involves sophisticated software engineering, such as leveraging the DiskANN algorithm to optimize retrieval. They have also successfully offloaded parts of the neural network layers onto the SSD for GPU-constrained models, Graham noted. This allows enterprises to run multi-hundred-billion-parameter models on legacy hardware by batching workloads, rather than requiring the latest, most expensive GPUs.
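The snippet below is a rough sketch of the layer-streaming idea in plain PyTorch, not the companies' pipeline or the DiskANN work itself: each layer's weights are loaded from the SSD just before they are needed, and a large batch amortizes each transfer. The layer files, shapes and the simple Linear stack are assumptions.

```python
# Rough sketch of streaming model layers from SSD when the full model does not
# fit in GPU memory. Illustrative only; layer files, shapes and the plain
# Linear stack are assumptions, not any vendor's implementation.
import torch
import torch.nn as nn

NUM_LAYERS = 80
HIDDEN = 8192
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# One state_dict file per layer, pre-saved to fast NVMe storage (hypothetical paths).
LAYER_FILES = [f"/mnt/nvme/model/layer_{i}.pt" for i in range(NUM_LAYERS)]

def forward_streamed(batch: torch.Tensor) -> torch.Tensor:
    """Run a large batch through the model, loading one layer at a time from SSD."""
    x = batch.to(DEVICE)
    for path in LAYER_FILES:
        layer = nn.Linear(HIDDEN, HIDDEN).to(DEVICE)      # placeholder layer type
        layer.load_state_dict(torch.load(path, map_location=DEVICE))
        with torch.no_grad():
            x = torch.relu(layer(x))                      # compute while weights are resident
        del layer                                         # free GPU memory before the next load
        if DEVICE == "cuda":
            torch.cuda.empty_cache()
    return x

# Batching many requests together amortizes the SSD-to-GPU transfer of each layer:
# out = forward_streamed(torch.randn(1024, HIDDEN))
```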

“The North Star is always accuracy that drives business outcomes and generating that [total cost of ownership] story,” Graham said. “From … a product perspective, you’re always focusing on how I can deliver the most accurate product that saves the most time, delivers the most top line revenue results.”

Looking ahead, the focus is shifting toward optimizing the key-value cache, often referred to as the operating system of the AI factory. As context windows grow larger and interactions become more complex, the KV cache expands rapidly.

“The other big untapped area of opportunity for storage in AI inference and driving efficiency at inference is really around the KV cache,” Stryker stated. “There are opportunities to offload some of that to SSD as well.”
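As a hedged illustration of that opportunity, the sketch below spills a session's key/value tensors to an SSD directory between turns of a long conversation and restores them when the session resumes, so GPU memory holds only the active context. The paths and cache layout are assumptions rather than a reference to any vendor's implementation.

```python
# Illustrative sketch of spilling KV cache blocks to SSD between conversation
# turns. Paths and the list-of-(key, value) layout are assumptions only.
import os
import torch

CACHE_DIR = "/mnt/nvme/kv_cache"          # hypothetical SSD spill directory
os.makedirs(CACHE_DIR, exist_ok=True)

def spill_kv(session_id: str, past_key_values) -> None:
    """Move a session's KV tensors off the GPU and persist them to SSD."""
    cpu_cache = [(k.to("cpu"), v.to("cpu")) for k, v in past_key_values]
    torch.save(cpu_cache, os.path.join(CACHE_DIR, f"{session_id}.pt"))

def restore_kv(session_id: str, device: str = "cuda"):
    """Reload a session's KV tensors from SSD when the conversation resumes."""
    cpu_cache = torch.load(os.path.join(CACHE_DIR, f"{session_id}.pt"), map_location="cpu")
    return [(k.to(device), v.to(device)) for k, v in cpu_cache]
```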

Here’s the complete video interview, part of SiliconANGLE’s and theCUBE’s coverage of SC25:

(* Disclosure: Solidigm sponsored this segment of theCUBE. Neither Solidigm nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)

Photo: SiliconANGLE

