
This weekend, Andrej Karpathy, former director of AI at Tesla and a founding member of OpenAI, decided he wanted to read a book. But he didn’t want to read it alone. He wanted to study it with a council of artificial intelligences, each offering its own point of view, critiquing the others, and finally synthesizing an answer under the guidance of a "Chairman."
To do this, Karpathy wrote what he called a "vibe code" project: a piece of software written quickly, largely by AI assistants, intended for fun rather than function. He posted the result, a repository called "llm-council," to GitHub with a blunt disclaimer: "I’m not going to support it in any way… Code is ephemeral now and libraries are over."
Yet for technical decision-makers across the enterprise landscape, looking past the casual disclaimer reveals something far more important than a weekend toy. In a few hundred lines of Python and JavaScript, Karpathy outlines a reference architecture for the most critical, still-undefined layer of the modern software stack: the orchestration middleware sitting between corporate applications and the volatile market for AI models.
As companies finalize their platform investments for 2026, LLM Council offers a stripped-down look at the "build vs. buy" reality of AI infrastructure. It shows that while the logic behind routing and aggregating AI models is surprisingly simple, the operational wrapper needed to make it enterprise-ready is where the real complexity lies.
How LLM Council Works: Debate, Critique, and Synthesis Across Four AI Models
To the casual observer, the LLM Council web application looks almost identical to ChatGPT: the user types a query into a chat box. But behind the scenes, the application triggers a sophisticated three-step workflow that mirrors how human deliberative bodies operate.
First, the system routes the user query to a panel of frontier models. In Karpathy’s default configuration, the panel includes OpenAI’s GPT-5.1, Google’s Gemini 3.0 Pro, Anthropic’s Claude Sonnet 4.5, and xAI’s Grok 4. These models generate their initial responses in parallel.
In the second step, the software conducts a peer review. Each model is fed the anonymized responses of its counterparts and asked to evaluate them on accuracy and insight. This step transforms the AI from generator to critic, forcing a layer of quality control that is rarely present in standard chatbot interactions.
Finally, a designated "Chairman LLM" (currently configured as Google’s Gemini 3) receives the original query, the individual responses, and the peer rankings. It synthesizes this mass of context into a single, authoritative answer for the user.
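The three stages described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not Karpathy's actual code: the model identifiers and the `call_model()` stub stand in for real OpenRouter chat-completion calls.

```python
# Illustrative sketch of the three-stage "council" flow. All names and the
# call_model() stub are assumptions; the real repo dispatches over HTTP.

COUNCIL_MODELS = ["openai/gpt-5.1", "google/gemini-3-pro",
                  "anthropic/claude-sonnet-4.5", "x-ai/grok-4"]
CHAIRMAN = "google/gemini-3-pro"

def call_model(model: str, prompt: str) -> str:
    """Stand-in for a real chat-completion API call."""
    return f"[{model}] answer to: {prompt}"

def run_council(query: str) -> str:
    # Stage 1: every council member answers the query (in parallel in the
    # real app; serially here for clarity).
    answers = {m: call_model(m, query) for m in COUNCIL_MODELS}

    # Stage 2: each member reviews its counterparts' anonymized answers.
    reviews = {}
    for reviewer in COUNCIL_MODELS:
        peers = [a for m, a in answers.items() if m != reviewer]
        anonymized = "\n".join(f"Response {i + 1}: {a}"
                               for i, a in enumerate(peers))
        reviews[reviewer] = call_model(
            reviewer,
            f"Rank these responses for accuracy and insight:\n{anonymized}")

    # Stage 3: the chairman synthesizes query, answers, and reviews
    # into one final reply.
    context = f"Query: {query}\nAnswers: {answers}\nReviews: {reviews}"
    return call_model(CHAIRMAN, f"Synthesize a final answer.\n{context}")
```

The essential point is how little logic the core workflow needs: two fan-out passes and one synthesis call.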
Karpathy notes that the results are often surprising. "Often, models are surprisingly willing to choose another LLM’s response as better than their own," he wrote on X (formerly Twitter). He described using the tool to read book chapters, observing that the models consistently rated GPT-5.1 as the most insightful while ranking Claude lowest. However, Karpathy’s own qualitative assessment diverged from that of his digital council: he found GPT-5.1 too verbose and preferred the more condensed, processed output of Gemini.
The case for treating FastAPI, OpenRouter, and frontier models as interchangeable components
For CTOs and platform architects, the value of LLM Council lies not in its literary criticism but in its construction. The repository serves as a baseline document showing what a modern, minimal AI stack looks like in late 2025.
The application is built on a deliberately "thin" architecture. The backend uses FastAPI, a modern Python framework, while the front-end is a standard React app created with Vite. Data storage is handled not by a complex database but by simple JSON files written to local disk.
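The storage layer really is that minimal. A sketch of the JSON-files-on-disk approach (function and file names here are hypothetical, not taken from the repository):

```python
# Minimal JSON-file persistence, illustrating the "no database" choice
# described above. Names are illustrative assumptions.
import json
from pathlib import Path

def save_conversation(conv_id: str, messages: list, data_dir: str = "data") -> Path:
    """Persist one conversation as a JSON file on local disk."""
    path = Path(data_dir)
    path.mkdir(parents=True, exist_ok=True)
    out = path / f"{conv_id}.json"
    out.write_text(json.dumps(messages, indent=2))
    return out

def load_conversation(conv_id: str, data_dir: str = "data") -> list:
    """Read a conversation back from its JSON file."""
    return json.loads((Path(data_dir) / f"{conv_id}.json").read_text())
```

For a single-user weekend tool this is entirely adequate; it is exactly the kind of choice that fails first under concurrent enterprise use.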
The linchpin of the entire operation is OpenRouter, an API aggregator that normalizes the differences between model providers. By routing requests through this single broker, Karpathy avoided writing separate integration code for OpenAI, Google, and Anthropic. The application neither knows nor cares which company provides the intelligence; it simply sends a prompt and waits for a response.
This design choice highlights a growing trend in enterprise architecture: the commoditization of the model layer. By treating frontier models as interchangeable components that can be swapped by editing a single line in a configuration file (specifically the council_models list in the backend code), the architecture protects the application from vendor lock-in. If a new model from Meta or another lab tops the leaderboards next week, it can be added to the council in seconds.
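The single-broker pattern is easy to see in code. The sketch below builds requests against OpenRouter's OpenAI-compatible chat-completions endpoint; the endpoint path and payload shape follow OpenRouter's published API, but treat the details as illustrative rather than a copy of the repository's code.

```python
# Every provider is reached through one broker URL, so swapping vendors
# means editing the council_models list and nothing else.
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

# One-line vendor swap: edit this list; the rest of the app is unchanged.
council_models = ["openai/gpt-5.1", "google/gemini-3-pro",
                  "anthropic/claude-sonnet-4.5", "x-ai/grok-4"]

def build_request(model: str, query: str, api_key: str) -> urllib.request.Request:
    """Build one chat-completion request; no per-vendor integration code."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": query}]}
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"})

requests_out = [build_request(m, "Summarize chapter 1", "sk-...")
                for m in council_models]
```

The application code never branches on the provider; the broker handles authentication formats, rate limits, and response normalization.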
What’s Missing Between Prototype and Production: Authentication, PII Redaction, and Compliance
While the core logic of LLM Council is elegant, it also serves as a clear illustration of the gap between a weekend hack and a production system. For an enterprise platform team, cloning Karpathy’s repository is only the first step in a marathon.
A technical audit of the code reveals the missing "boring" infrastructure that commercial vendors sell at premium prices. The system lacks authentication: anyone with access to the web interface can query the models. There is no concept of user roles, meaning a junior developer has the same access rights as a CIO.
Furthermore, the governance layer is nonexistent. In a corporate environment, sending data to four different external AI providers simultaneously triggers immediate compliance concerns. There is no mechanism to redact personally identifiable information (PII) before it leaves the local network, nor an audit log showing who asked what.
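To make the gap concrete, here is a minimal sketch of the kind of PII-redaction gate the prototype omits: scrub obvious identifiers before a prompt is dispatched to any external provider. Real deployments use far more robust detection (named-entity recognition, dictionaries, context rules); the patterns below are deliberately simplistic assumptions.

```python
# Minimal PII-redaction pass applied before a prompt leaves the network.
# Patterns are illustrative; production systems need far stronger detection.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(prompt: str) -> str:
    """Replace matched PII with typed placeholders before dispatch."""
    for label, pattern in PII_PATTERNS.items():
        prompt = pattern.sub(f"[{label}]", prompt)
    return prompt
```

Adding this gate, plus an audit log of every redacted dispatch, is precisely the sort of unglamorous work that separates the prototype from a deployable system.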
Reliability is another open question. The system assumes the OpenRouter API is always up and that the models will respond in a timely manner. It lacks the circuit breakers, fallback strategies, and retry logic that keep business-critical applications running when a provider suffers an outage.
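A basic version of that missing resilience layer might look like the following: retry transient failures with backoff, then fall back to the next model in the list. This is a generic sketch under assumed names, not code from the repository.

```python
# Sketch of retry-with-fallback logic the prototype omits. Function and
# parameter names are illustrative assumptions.
import time

def call_with_fallback(call, models, retries=2, delay=0.0):
    """Try each model in order; retry transient errors before moving on."""
    last_error = None
    for model in models:
        for attempt in range(retries + 1):
            try:
                return call(model)
            except ConnectionError as err:  # treated as a transient outage
                last_error = err
                time.sleep(delay * (2 ** attempt))  # exponential backoff
    raise RuntimeError("all providers failed") from last_error
```

Production gateways add circuit breakers on top of this, so that a provider that keeps failing is skipped entirely for a cooldown period instead of being retried on every request.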
These absences aren’t flaws in Karpathy’s code (he clearly states he doesn’t intend to support or improve the project), but they do define the value proposition for the commercial AI infrastructure market.
Companies like LangChain, AWS Bedrock, and various AI gateway startups are essentially selling the "hardening" around the basic logic Karpathy demonstrated. They provide the security, observability, and compliance wrappers that turn a raw orchestration script into a viable enterprise platform.
Why Karpathy Believes Code Is Now "Ephemeral" and Traditional Software Libraries Are Obsolete
Perhaps the most provocative aspect of the project is the philosophy under which it was built. Karpathy described the development process as "99% vibe coded," meaning he relied heavily on AI assistants to generate the code.
"Code is ephemeral now and libraries are over, ask your LLM to change it in whatever way you like," he wrote in the repository documentation.
This statement marks a fundamental shift in software engineering culture. Traditionally, companies build internal libraries and abstractions to manage complexity, maintaining them for years. Karpathy is gesturing at a future where code is treated as disposable: easily rewritten by AI and not intended to last.
For enterprise decision-makers, this raises a difficult strategic question. If internal tools can be "vibe coded" over a weekend, does it make sense to buy expensive, rigid software suites for internal workflows? Or should platform teams empower their engineers to generate custom, disposable tools that fit their exact needs at a fraction of the cost?
When AI Models Judge AI: The Dangerous Gap Between Machine Preferences and Human Needs
Beyond architecture, the LLM Council project inadvertently highlights a specific risk in deploying automated AI: the disconnect between human and machine judgment.
Karpathy’s observation that his models preferred GPT-5.1 while he preferred Gemini suggests that AI models may share a common bias. They may favor verbosity, particular formatting, or rhetorical confidence that does not necessarily align with human business needs for concision and accuracy.
As businesses increasingly rely on "LLM-as-a-judge" systems to evaluate the quality of customer-facing bots, this divergence matters. If the automated evaluator consistently rewards verbose answers while human users want concise solutions, the metrics will show success even as customer satisfaction declines. Karpathy’s experience suggests that relying solely on AI graders is a strategy fraught with hidden alignment issues.
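One practical response to this risk is to calibrate the judge before trusting it: measure how often the LLM's preferences agree with human preferences on the same items. A minimal sketch of such a check (the function and data are illustrative):

```python
# Sanity check for an LLM-as-a-judge pipeline: before using automated
# ratings as a metric, measure agreement against human ratings on a
# shared sample. Names and data are illustrative assumptions.

def judge_agreement(llm_winners, human_winners):
    """Fraction of items where the LLM judge and humans picked the same winner."""
    matches = sum(1 for l, h in zip(llm_winners, human_winners) if l == h)
    return matches / len(llm_winners)

# Low agreement is a warning that the judge may be rewarding qualities
# (e.g. verbosity) that human users do not actually value.
```

If agreement is low, the automated metric should be treated as a proxy for the judge's tastes, not the users'.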
What Enterprise Platform Teams Can Learn from a Weekend Hack Before Building Their 2026 Stack
Ultimately, LLM Council acts as a Rorschach test for the AI industry. For hobbyists, it’s a fun way to read books. For vendors, it’s a threat, proving that the core functionality of their products can be replicated in a few hundred lines of code.
But for enterprise technology leaders, it’s a reference architecture. It demystifies the orchestration layer, showing that the technical challenge lies not in calling the models but in everything that has to surround those calls.
As platform teams head into 2026, many will likely find themselves studying Karpathy’s code, not to deploy it, but to understand it. It proves that a multi-model strategy is not technically out of reach. The question is whether companies will build the governance layer themselves or pay someone else to wrap that "vibe code" in enterprise-grade armor.