
ScaleOps has expanded its cloud resource management platform with a new product aimed at enabling enterprises to run self-hosted large language models (LLMs) and GPU-based AI applications.
The AI Infra Product, announced today, extends the company’s existing automation capabilities to meet the growing need for efficient GPU utilization, predictable performance, and reduced operational load in large-scale AI deployments.
The company said the system is already running in enterprise production environments and is delivering substantial gains for early adopters, reducing GPU costs by between 50% and 70%. The company does not publicly list enterprise pricing for this solution and instead invites interested customers to request a customized quote based on the size and needs of their operation.
Explaining how the system behaves under heavy load, ScaleOps CEO and co-founder Yodar Shafrir said in an email to VentureBeat that the platform “uses proactive and reactive mechanisms to handle sudden spikes without performance impact,” noting that its workload rightsizing policies “automatically manage capacity to keep resources available.”
He added that minimizing GPU cold-start latency is a priority, emphasizing that the system “ensures quick response during traffic spikes,” especially for AI workloads where model load times are substantial.
Extending resource automation to AI infrastructure
Enterprises deploying self-hosted AI models face performance variability, long load times, and persistent underutilization of GPU resources. ScaleOps positioned the new AI Infra product as a direct response to these issues.
The platform allocates and scales GPU resources in real time, adapting to changes in traffic demand without requiring modifications to existing model deployment pipelines or application code.
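ScaleOps has not published implementation details, but the core idea of reactive, demand-driven GPU scaling can be sketched in a few lines. Everything below — the function name, the headroom parameter, and the thresholds — is an illustrative assumption for this article, not a ScaleOps API.

```python
import math

def desired_gpu_replicas(requests_per_sec: float,
                         capacity_per_replica: float,
                         headroom: float = 0.2,
                         min_replicas: int = 1) -> int:
    """Reactive sizing sketch: provision enough GPU replicas to serve
    the observed load plus a safety headroom, so short spikes land on
    capacity that is already warm rather than triggering a cold start."""
    needed = requests_per_sec * (1 + headroom) / capacity_per_replica
    return max(min_replicas, math.ceil(needed))

# 90 req/s at 25 req/s per replica, 20% headroom -> 5 replicas
print(desired_gpu_replicas(90, 25))
```

A real controller would run logic like this in a loop against live metrics and apply rate limits and stabilization windows before acting, but the sizing decision itself reduces to this kind of calculation.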
According to ScaleOps, the system manages production environments for organizations including Wiz, DocuSign, Rubrik, Coupa, Alkami, Vantor, Grubhub, Island, Chewy, and several Fortune 500 companies.
The AI Infra product introduces workload-aware scaling policies that proactively adjust capacity to maintain performance during spikes in demand. The company said these policies reduce the cold-start delays associated with loading large AI models, improving responsiveness under bursty traffic.
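A proactive policy of this kind typically watches utilization trends and begins loading a model onto spare capacity before the spike arrives, since a multi-gigabyte model load can take minutes. The sketch below is a hypothetical trigger condition, not ScaleOps' actual policy; the threshold values are assumptions.

```python
def should_prewarm(utilization_history: list[float],
                   threshold: float = 0.7,
                   slope_trigger: float = 0.05) -> bool:
    """Proactive pre-warm sketch: start loading a model onto a spare
    GPU when utilization is either already high or rising quickly,
    so incoming traffic never waits on a cold model load."""
    if len(utilization_history) < 2:
        return False
    current = utilization_history[-1]
    slope = current - utilization_history[-2]
    return current >= threshold or slope >= slope_trigger

# High absolute utilization triggers a pre-warm;
# low, flat utilization does not.
print(should_prewarm([0.50, 0.75]))  # True
print(should_prewarm([0.50, 0.52]))  # False
```

Combining a reactive sizer with a proactive trigger like this is one plausible reading of the "proactive and reactive mechanisms" Shafrir describes.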
Technical integration and platform compatibility
The product is designed to be compatible with common enterprise infrastructure patterns. It works across Kubernetes distributions, major cloud platforms, and on-premises data centers. ScaleOps emphasizes that deployment does not require code changes, infrastructure rewrites, or modifications to existing manifests.
The platform “seamlessly integrates into existing model deployment pipelines without requiring any code or infrastructure changes,” Shafrir said, adding that teams can immediately begin optimizing with their existing GitOps, CI/CD, monitoring, and deployment tooling.
Shafrir also addressed how the automation interacts with existing systems. The platform operates without disrupting workflows or creating conflicts with custom scheduling or scaling logic, he said, explaining that the system “does not expose or change deterministic logic” and instead extends schedulers, autoscalers, and custom policies by adding real-time operational context while respecting existing configuration boundaries.
Performance, visibility, and user control
The platform provides full visibility into GPU utilization, model behavior, performance metrics, and scaling decisions at multiple levels, including pods, workloads, nodes, and clusters. Although the system ships with predefined workload scaling policies, ScaleOps notes that engineering teams retain the ability to tune these policies as needed.
In practice, the company aims to reduce or eliminate the manual tuning that DevOps and AIOps teams typically perform to handle AI workloads. Installation is intended to require minimal effort, described by ScaleOps as a two-minute process using a single Helm flag, after which optimization can be enabled with a single action.
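ScaleOps has not published the exact install command in this announcement, but a two-minute Helm-based install generally has the following shape. The repository URL, chart name, and the `aiInfra.enabled` value are placeholders for illustration only, not documented ScaleOps values.

```shell
# Placeholder repo URL and chart name -- the real ones are
# provided to ScaleOps customers at onboarding.
helm repo add scaleops https://charts.example.com/scaleops
helm repo update

# Install into its own namespace; a single --set flag is the kind
# of "single Helm flag" toggle the company describes.
helm install scaleops scaleops/scaleops \
  --namespace scaleops-system --create-namespace \
  --set aiInfra.enabled=true   # illustrative flag, not a documented value
```

Because the agent is installed alongside workloads rather than into them, this pattern is consistent with the company's claim that no manifests or application code need to change.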
Cost savings and enterprise case studies
ScaleOps reports that initial deployments of the AI Infra product have achieved a 50% to 70% reduction in GPU costs in customer environments. The company cited two examples:
A large creative software company running thousands of GPUs averaged 20% utilization before adopting ScaleOps. The product increased utilization, consolidated underutilized capacity, and made GPU nodes scale with demand. These changes cut overall GPU costs by more than half. The company also reported a 35% reduction in latency for key workloads.
A global gaming company used the platform to optimize dynamic LLM workloads running on hundreds of GPUs. According to ScaleOps, the product increased utilization by a factor of seven while maintaining service-level performance. The customer projects $1.4 million in annual savings from this workload alone.
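The reported figures are internally consistent under a simple model: if the same workload is served at higher average utilization, the GPU fleet (and its cost) shrinks in proportion. This back-of-envelope check assumes cost scales linearly with fleet size, which ignores reserved-capacity discounts and other pricing nuances.

```python
def cost_reduction_from_utilization(before: float, after: float) -> float:
    """Fractional cost reduction if a fixed workload moves from
    `before` to `after` average GPU utilization: the fleet shrinks
    by the ratio before/after, so cost falls by 1 - before/after."""
    return 1 - before / after

# Creative-software example: starting from 20% utilization, reaching
# just 45% already cuts GPU spend by more than half.
print(cost_reduction_from_utilization(0.20, 0.45))

# Gaming example: a 7x utilization gain implies up to ~86% fewer GPUs
# for the same work under this simplified model.
print(cost_reduction_from_utilization(0.10, 0.70))
```

The actual savings ScaleOps reports sit comfortably inside what this simple proportionality predicts, which is why low baseline utilization is the headline lever.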
ScaleOps noted that the expected GPU savings typically outweigh the cost of adopting and operating the platform, and that customers with limited infrastructure budgets have reported a rapid return on investment.
Industry context and company perspective
The rapid adoption of self-hosted AI models has created new operational challenges for enterprises, particularly around GPU performance and the complexity of handling large-scale workloads. Shafrir described a broader landscape in which “cloud-native AI infrastructure is reaching a tipping point.”
“Cloud-native architectures unlock tremendous flexibility and control, but they also introduce a new level of complexity,” he said in the announcement. “Managing GPU resources at scale has become chaotic.”
The product brings together the full set of cloud resource management functions needed to manage diverse workloads at scale, Shafrir added. The company pitched the platform as a comprehensive system for continuous, automated optimization.
A unified vision for the future
With the addition of the AI Infra product, ScaleOps aims to establish a unified approach to managing GPU and AI workloads that integrates with existing enterprise infrastructure.
The platform’s initial performance measurements and reported cost savings position it within the fast-growing ecosystem of self-hosted AI deployments.