Ship fast, optimize later: Top AI engineers don’t care about cost — they’re prioritizing deployment

by SkillAiNest

Across industries, rising compute costs are often cited as a barrier to AI adoption, but leading companies are finding that cost is no longer the real barrier. The toughest challenges, and the ones top of mind for many tech leaders? Latency, flexibility and capacity. At Wonder, for example, AI adds just a few cents to each order; the food delivery and takeout company is far more concerned about cloud capacity amid sky-high demand. Recursion, for its part, handles small- and large-scale training and deployment through a mix of on-premises clusters and the cloud, a balance that has given the biotech company the flexibility for rapid experimentation. The two companies’ real-world experiences highlight a broader industry trend: for enterprises pursuing AI at scale, economics aren’t the key deciding factor. The conversation has shifted from how to pay for AI to how quickly it can be deployed and maintained. AI leaders from both companies recently sat down with VentureBeat CEO and editor-in-chief Matt Marshall as part of VB’s Traveling AI Impact Series. Here’s what they shared.

Wonder: Rethink what you assume about capacity

Wonder has used AI to power everything from recommendations to logistics, and so far, CTO James Chen reports, AI has added only a few cents per order. Chen explained that the technology component of a food order costs about 14 cents, with AI adding another 2 to 3 cents, although that share is “growing really fast,” toward 5 to 8 cents. Still, that is almost negligible compared to total operating costs. Instead, the cloud-native company’s main concern has been capacity amid growing demand. Chen noted that Wonder was built on the assumption (which turned out to be wrong) that there would be “infinite capacity,” so that the team could move “very fast” and not have to worry about infrastructure management. But the company has grown considerably over the past few years, he said. As a result, about six months ago, “we started getting a little bit of a signal from cloud providers saying, ‘Hey, you might need to consider moving to region two,’” because they were running out of CPU or data storage capacity at their facilities as demand increased. It was “very surprising” that they had to fall back to plan B earlier than they expected. “Obviously it’s good practice to be multi-region, but we were thinking that was another two years down the road,” Chen said.

Micromodels are not economically feasible (yet)

Chen noted that Wonder built its models to maximize conversion rate; the goal is to surface each new restaurant to as many relevant customers as possible. These are “isolated scenarios” where models are trained over time to be “very, very efficient and very fast.” For Wonder’s use case, Chen noted, larger models are currently the best bet. But in the long term, the company wants to move to smaller models that are hyper-customized to individuals (via AI agents or concierges) based on their purchase history and even their clickstream. “It’s certainly great to have these micromodels, but the cost is too expensive right now,” Chen noted. “If you try to create one for everyone, it’s not economically feasible.”
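Chen’s one-per-customer objection can be made concrete with a back-of-the-envelope comparison. The figures below are entirely hypothetical (they are not Wonder’s numbers); the sketch only shows how per-customer micromodels multiply a small unit cost across the whole customer base:

```python
# Hypothetical cost comparison: one large shared model vs. one
# hyper-customized micromodel per customer. Every figure is made up
# for illustration; none come from Wonder.

SHARED_MODEL_COST = 250_000   # $/year to serve one shared large model
MICROMODEL_COST = 1.50        # $/year to train + host one per-user model
CUSTOMERS = 2_000_000         # hypothetical customer base

shared_total = SHARED_MODEL_COST
micro_total = MICROMODEL_COST * CUSTOMERS

print(f"shared model: ${shared_total:,.0f}/yr")
print(f"micromodels:  ${micro_total:,.0f}/yr "
      f"({micro_total / shared_total:.0f}x)")
```

Even at $1.50 per customer per year, the per-user approach costs 12x the shared model in this toy scenario, which is the shape of the trade-off Chen describes: the per-unit cost is tiny, but the multiplier is the entire customer base.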

Budgeting is an art, not a science

Chen said Wonder gives its devs and data scientists plenty of room to experiment, and internal teams review usage costs to make sure no one has changed a model and “jacked up massive compute and run up a huge bill.” The company is trying different things to offload AI work and trim costs at the margins. “But then it’s very difficult to budget, because you have no idea,” he said. One of the difficult things is the pace of development: when a new model comes out, “We can’t just sit there, right? We have to use it.” Budgeting for the unknown economics of a token-based system is “certainly art versus science.” A key component of the software development lifecycle is preserving context when using large language models, he explained. When you find something that works, you can add it to your company’s “corpus of context,” which is then sent along with every request. That context is huge, and it costs money every time. “Over 50%, 80% of your costs are just resending the same information over and over again to the same engine on every request,” Chen said. In theory, as volume grows, the cost per unit should come down. “I know that when a transaction happens, I will pay X cents of tax on each one, but I don’t want to be limited to using the technology for all these creative ideas."
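Chen’s “50%, 80%” figure falls out of simple token arithmetic: if a large, fixed context corpus is resent with every request, its input tokens dominate per-request cost. The sketch below uses made-up token counts and per-million-token prices, not Wonder’s actual numbers:

```python
# Illustrative per-request cost when a fixed "corpus of context" is
# resent with every LLM call. Token counts and prices are hypothetical.

def request_cost(context_tokens, prompt_tokens, output_tokens,
                 in_price_per_m=3.00, out_price_per_m=15.00):
    """Return (input_cost, output_cost) in dollars for one request,
    given per-million-token prices."""
    input_cost = (context_tokens + prompt_tokens) / 1e6 * in_price_per_m
    output_cost = output_tokens / 1e6 * out_price_per_m
    return input_cost, output_cost

ctx, prompt, out = 50_000, 2_000, 1_000   # tokens (illustrative)
in_cost, out_cost = request_cost(ctx, prompt, out)
total = in_cost + out_cost

# Share of total spend that is just the repeated context corpus.
context_share = (ctx / 1e6 * 3.00) / total
print(f"total ${total:.4f} per request, context share {context_share:.0%}")
```

With these placeholder numbers, close to 90% of each request’s cost is the same 50K-token corpus being resent, which is exactly the pattern Chen flags; provider-side prompt caching, where available, discounts those repeated input tokens.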

A ‘proving moment’ for Recursion

Recursion, for its part, is focused on meeting its massive compute needs through a hybrid infrastructure of on-premise clusters and the cloud. When initially looking to build out its AI infrastructure, the company had to go with its own setup, because “the cloud providers didn’t have a lot of good offerings,” CTO Ben Mabey explained. “The proving moment was that we needed more compute, and we looked at the cloud providers, and they were like, ‘Maybe in a year or so.’” The company’s first cluster, in 2017, incorporated NVIDIA gaming GPUs (1080s, launched in 2016); it has since added NVIDIA H100s and A100s, and runs a Kubernetes cluster that spans the cloud and on-prem. Addressing the question of hardware longevity, Mabey noted: “These gaming GPUs are still being used today, which is crazy, right? The myth that a GPU’s life span is only three years, that’s definitely not the case. The A100s are still top of the list, they’re the workhorse of the industry.”

Best use cases on-prem vs. cloud, and the cost differences

More recently, Mabey’s team has been training a foundation model on Recursion’s image repository, which contains petabytes of data and over 200 million images. This and other types of large training jobs require “massive clusters” and connected, multi-node setups. The team goes on-prem when it needs tightly networked nodes with access to large amounts of its data in a highly parallel file system, he explained. Shorter workloads, on the other hand, run in the cloud. Recursion’s method there is to use preemptible GPUs and Google Tensor Processing Units (TPUs): instances the provider can interrupt in favor of higher-priority workloads. “Because we don’t care about speed on some of these workloads, where we’re uploading biological data, whether it’s an image or sequencing data, DNA data,” he said. “We can say, ‘Give it to us in an hour,’ and if it kills a job, we’re fine.” From a cost perspective, moving large workloads on-prem is 10 times cheaper, Mabey noted; against a five-year TCO, it’s half the price. For smaller storage needs, on the other hand, the cloud can be “quite competitive” on cost. Ultimately, Mabey urged tech leaders to step back and determine whether they’re really willing to commit to AI; cost-effective options typically require a multi-year commitment. “From a psychological point of view, I’ve seen peer companies that won’t invest in compute, and as a result they’re always paying on demand," said Mabey. "Their teams use very little compute because they don’t want to run up the cloud bill." Innovation, in other words, is held back when people are afraid to burn money.
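Mabey’s on-prem vs. cloud math can be sketched with a toy five-year TCO model. All rates below (GPU-hour price, hardware cost, opex) are hypothetical placeholders, not Recursion’s figures; the real ratio swings heavily on utilization and committed-use discounts:

```python
# Toy 5-year TCO comparison for a steady, always-on GPU training load.
# Every number here is a hypothetical placeholder, not Recursion's data.

HOURS_PER_YEAR = 8760
YEARS = 5
GPUS = 64

# On-demand cloud: assume $4.00 per GPU-hour, billed around the clock.
cloud_rate = 4.00
cloud_tco = GPUS * cloud_rate * HOURS_PER_YEAR * YEARS

# On-prem: assume $30,000 per GPU up front, plus power/cooling/ops.
capex = GPUS * 30_000
opex_per_year = GPUS * 3_000
onprem_tco = capex + opex_per_year * YEARS

print(f"cloud on-demand: ${cloud_tco / 1e6:.1f}M")
print(f"on-prem:         ${onprem_tco / 1e6:.1f}M "
      f"({cloud_tco / onprem_tco:.1f}x cheaper on-prem)")
```

With these placeholder rates, on-prem comes out roughly 4x cheaper; lower utilization or multi-year committed cloud discounts narrow the gap, which is how Mabey’s “10x cheaper” and “half the price” figures can both hold for different workloads and discount structures.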
