
A stealth artificial intelligence startup founded by an MIT researcher emerged this morning with an ambitious claim: its new AI model can control computers better than systems built by OpenAI and Anthropic, at a fraction of the cost.
OpenAGI, headed by Chief Executive Zengyi Qin, released Lux, a foundation model designed to operate computers autonomously by interpreting screenshots and executing actions in desktop applications. The San Francisco-based company says Lux has achieved an 83.6 percent success rate on Online-Mind2Web, a benchmark that has become the industry's toughest test for evaluating AI agents that control computers.
This score is a significant leap over the leading models from well-funded competitors. OpenAI's Operator, released in January, scored 61.3 percent on the same benchmark. Anthropic's Claude Computer Use scored 56.3 percent.
"Traditional LLM training feeds a large amount of text corpus into the model. The model learns to produce text," Qin said in an exclusive interview with VentureBeat. "By contrast, our model learns to produce actions. The model is trained with a large amount of computer screenshots and action sequences, allowing the model to produce actions to control the computer."
The announcement comes at an important moment for the AI industry. Technology giants and startups alike have poured billions of dollars into developing autonomous agents capable of navigating software, booking travel, filling out forms, and executing complex workflows. OpenAI, Anthropic, Google and Microsoft have all released or announced agent products in the past year, betting that computer-controlling AI will be as transformative as chatbots.
Yet independent research has cast doubt on whether current agents are as capable as their creators suggest.
Why university researchers created a rigorous benchmark to test AI agents – and what they discovered
The Online-Mind2Web benchmark, developed by researchers at Ohio State University and the University of California, Berkeley, was specifically designed to expose the gap between marketing claims and actual performance.
Published in April and accepted to the Conference on Language Modeling 2025, the benchmark includes 300 diverse tasks across 136 real websites. Unlike previous benchmarks that cached parts of websites, Online-Mind2Web tests agents in a live online environment where pages change dynamically and unexpected obstacles appear.
According to the researchers, the results painted "a very different picture of the competency of current agents, suggesting over-optimism in previously reported results."
When the Ohio State team tested five leading web agents with careful human evaluation, they found that many recent systems, despite heavy investment and marketing hype, did not outperform SeeAct, a relatively simple agent released in January 2024. Even OpenAI's Operator, the best performer among the commercial offerings in their study, achieved only 61 percent success.
"It seemed that the most capable and practical agents might really be only months away," the researchers wrote in a blog post accompanying their paper. "However, we are also aware that there are still many fundamental gaps in research for fully autonomous agents, and existing agents may not be as capable as the reported benchmark numbers might show."
The benchmark has gained traction as an industry standard, with a public leaderboard hosted on Hugging Face tracking submissions from research groups and companies.
How OpenAGI trained its AI to take actions instead of just generating text
OpenAGI's claimed performance advantage stems from what the company calls "Agentic Active Pre-training," a training method that differs fundamentally from the way standard language models learn.
Traditional language models are trained on large text corpora, learning to predict the next token in a sequence. The resulting systems excel at producing coherent text but were not designed to take actions in a graphical environment.
Lux, according to Qin, takes a different approach. The model trains on computer screenshots paired with action sequences, learning to interpret visual interfaces and determine which clicks, keystrokes and navigation actions will accomplish a given goal.
"This process allows the model to actively explore the computer environment, and such exploration generates new knowledge, which is then fed to the model for training," Qin told VentureBeat. "This is a naturally self-evolving process, where a better model leads to better exploration, better exploration leads to better knowledge, and better knowledge leads to a better model."
This self-reinforcing training loop, if it works as described, could help explain how a small team can achieve results that rival those of much larger organizations. Rather than requiring ever-larger static datasets, the approach would allow the model to improve continuously by generating its own training data through exploration.
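The loop Qin describes can be illustrated with a deliberately tiny sketch. Everything in it is hypothetical and invented for illustration: the screen states, the action names, the `ToyPolicy` class and the rule of keeping only successful attempts as training data. It is not OpenAGI's method, only a minimal picture of exploration output feeding the next round of training.

```python
import random
from dataclasses import dataclass, field

# Hypothetical toy world: each screen state has one correct action.
# Nothing here reflects how Lux is actually built or trained.
ACTIONS = ["click", "type", "scroll"]
CORRECT = {"login_page": "type", "search_box": "type",
           "menu": "click", "long_list": "scroll"}


@dataclass
class ToyPolicy:
    # Per-state counts of actions that succeeded during earlier exploration.
    counts: dict = field(default_factory=dict)

    def act(self, state: str, rng: random.Random, epsilon: float) -> str:
        seen = self.counts.get(state)
        if seen and rng.random() > epsilon:
            return max(seen, key=seen.get)  # exploit learned knowledge
        return rng.choice(ACTIONS)          # otherwise explore at random

    def train(self, experience: list) -> None:
        for state, action in experience:
            by_action = self.counts.setdefault(state, {})
            by_action[action] = by_action.get(action, 0) + 1


def explore(policy, rng, episodes, epsilon):
    """Run the policy and keep only the (state, action) pairs that succeeded."""
    successes, attempts = [], 0
    for _ in range(episodes):
        for state, right in CORRECT.items():
            attempts += 1
            action = policy.act(state, rng, epsilon)
            if action == right:
                successes.append((state, action))
    return successes, len(successes) / attempts


def self_improve(rounds=4, episodes=50, epsilon=0.1, seed=0):
    """Alternate exploration and training; return the success rate per round."""
    rng = random.Random(seed)
    policy = ToyPolicy()
    rates = []
    for _ in range(rounds):
        new_data, rate = explore(policy, rng, episodes, epsilon)
        rates.append(rate)
        policy.train(new_data)  # exploration output becomes training data
    return rates
```

Calling `self_improve()` returns one success rate per round. In this toy, the first round is pure guessing and later rounds benefit from what earlier exploration uncovered, which is the shape of the "better model, better exploration, better knowledge" cycle, though a real system would have to cope with noisy success signals that this sketch ignores.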
OpenAGI also claims significant cost advantages. The company says Lux operates at about one-tenth the cost of frontier models from OpenAI and Anthropic while completing tasks faster.
Unlike browser-only competitors, Lux can control Slack, Excel and other desktop applications
An important distinction in OpenAGI's announcement: Lux can control applications across the entire desktop operating system, not just web browsers.
Most commercially available computer-use agents, including early versions of Anthropic's Claude Computer Use, focus primarily on browser-based tasks. That scope excludes the wide variety of productivity work done in desktop applications: spreadsheets in Microsoft Excel, communication in Slack, design work in Adobe products, code editing in development environments.
OpenAGI says Lux can navigate these native applications, a capability that would greatly expand the addressable market for computer-use agents. The company is releasing a developer software development kit alongside the model, allowing third parties to build applications on top of Lux.
The company is also working with Intel to optimize Lux for edge devices, which would allow the model to run locally on laptops and workstations rather than requiring cloud infrastructure. This partnership could address enterprise concerns about sending sensitive screen data to external servers.
"We are partnering with Intel to optimize our model for edge devices, which will make it the best model for computing," Qin said.
The company confirmed that it is in exploratory discussions with AMD and Microsoft about additional partnerships.
What happens when you ask an AI agent to copy your bank details?
Computer-use agents present novel security challenges that do not arise with traditional chatbots. An AI system that can click buttons, enter text, and navigate applications could, if misdirected, cause significant harm: transferring money, deleting files, or exfiltrating sensitive information.
OpenAGI says it has built safety mechanisms directly into Lux. When the model encounters requests that violate its safety policies, it refuses to proceed and alerts the user.
In an example provided by the company, when a user asked the model to "Copy my bank details and paste it into a new google doc," Lux responded with an internal reasoning step: "The user asks me to copy the bank details, which are sensitive information. Based on security policy, I am unable to perform this action." The model then issued a warning to the user instead of executing the potentially dangerous request.
Such safeguards will face intense scrutiny as computer-use agents proliferate. Security researchers have already demonstrated prompt injection attacks against early agent systems, in which malicious instructions embedded in websites or documents hijack an agent's behavior. Whether Lux's safety mechanisms can withstand adversarial attacks remains to be tested by independent researchers.
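The refusal behavior in the company's example can be pictured as a policy check that runs before any action is executed. The sketch below is an assumption-laden illustration, not a description of Lux: the keyword list, the `Decision` type and the `run_agent` wrapper are all hypothetical, and simple keyword matching of this kind would be easy to evade with the prompt injection techniques researchers have demonstrated.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical keyword list; a real safety policy would be far richer.
SENSITIVE_TERMS = ("bank details", "password", "credit card", "social security")


@dataclass
class Decision:
    allowed: bool
    message: str


def check_request(request: str) -> Decision:
    """Refuse requests that mention sensitive data; allow everything else."""
    lowered = request.lower()
    for term in SENSITIVE_TERMS:
        if term in lowered:
            return Decision(
                False,
                f"Refused: request involves sensitive information ({term}).",
            )
    return Decision(True, "OK to proceed.")


def run_agent(request: str, execute: Callable[[str], str]) -> str:
    """Gate a (hypothetical) action executor behind the policy check."""
    decision = check_request(request)
    if not decision.allowed:
        return decision.message  # warn the user and take no action
    return execute(request)
```

In this sketch, the bank-details request from the company's example is refused before the executor is ever called, while an unrelated request passes straight through. The open question the article raises is whether a check like this holds up when the sensitive instruction arrives indirectly, hidden in a web page or document the agent reads.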
The MIT researcher who built two of GitHub's most downloaded AI models
Qin brings an unusual combination of academic credentials and business experience to OpenAGI.
He completed his doctorate at the Massachusetts Institute of Technology in 2025, where his research focused on computer vision, robotics and machine learning. His academic work has been published at top venues including the Conference on Computer Vision and Pattern Recognition, the International Conference on Learning Representations and the International Conference on Machine Learning.
Before founding OpenAGI, Qin built a number of widely adopted AI systems. JetMoE, a large language model whose development he led, demonstrated that a high-performing model could be trained from scratch for less than $100,000, a fraction of the tens of millions typically required. The model outperformed Meta's LLaMA2-7B on standard benchmarks, according to a technical report that drew the attention of MIT's Computer Science and Artificial Intelligence Laboratory.
His previous open source projects achieved notable adoption. OpenVoice, a voice cloning model, has accumulated nearly 35,000 stars on GitHub and ranks in the top 0.03 percent of open source projects by popularity. MeloTTS, a text-to-speech system, has been downloaded more than 19 million times, making it one of the most used audio AI models since its 2024 release.
Qin also co-founded MyShell, an AI agent platform that has attracted six million users who have collectively built more than 200,000 AI agents. According to the company, agents on the platform have had more than one billion interactions with users.
Inside the billion-dollar race to create an AI that controls your computer
The computer-use agent market has attracted intense interest from investors and technology giants over the past year.
OpenAI released Operator in January, allowing users to instruct an AI to complete tasks across the web. Anthropic has continued to develop Claude Computer Use, positioning it as a core capability of its Claude model family. Google has added agent features to its Gemini products. Microsoft has integrated agent capabilities across its Copilot offerings and Windows.
Yet the market remains nascent. Enterprise adoption has been limited by concerns about reliability, security, and the ability to handle the edge cases frequently found in real-world workflows. The performance gaps revealed by benchmarks like Online-Mind2Web suggest that current systems may not be ready for mission-critical applications.
OpenAGI enters this competitive landscape as an independent alternative, pitting high benchmark performance and low cost against the massive resources of its well-funded rivals. The company's Lux model and developer SDK are launching today.
Whether OpenAGI can translate its benchmark dominance into real-world reliability is the central question. The AI industry has a long history of impressive demos that fail in production, of laboratory results that crumble against the chaos of actual use. Benchmarks measure what they measure, and the distance between a controlled test and an 8-hour workday full of edge cases, exceptions, and surprises can be vast.
But if Lux performs in the wild the way it performs in the lab, the implications extend far beyond the success of one startup. It would suggest that the path to capable AI agents runs not through the biggest checkbooks but through clever architecture, and that a small team with the right ideas can outmaneuver the giants.
The technology industry has seen this story before. It rarely stays true for long.