“As these AI systems become more powerful, they will integrate into more and more critical domains,” says Leo Gao, a research scientist at OpenAI, in an exclusive preview of the new work given to MIT Technology Review. “It’s very important to make sure they’re safe.”
This is still preliminary research. The new model, called a weight-sparse transformer, is far smaller and far less capable than top-tier mass-market models like the firm’s GPT-5, Anthropic’s Claude, and Google DeepMind’s Gemini. At most it is as capable as GPT-1, a model OpenAI developed in 2018, Gao says (though he and his colleagues haven’t made a direct comparison).
But the aim isn’t to compete with the best in class (at least, not yet). Instead, by looking at how this experimental model works, OpenAI hopes to learn about the hidden mechanisms inside those bigger and better versions of the technology.
Elisenda Grigsby, a mathematician at Boston College who studies how LLMs work and who was not involved in the project, says: “I’m sure the methods it introduces will have a significant impact.”
Lee Sharkey, a research scientist at AI startup Goodfire, agrees. “This work is aimed at the right target and seems to be well executed.”
Why models are hard to understand
OpenAI’s work is part of a relatively new field of research known as mechanistic interpretability, which tries to map the internal mechanisms that models use when they carry out different tasks.
That’s harder than it sounds. LLMs are built from neural networks, which consist of nodes, called neurons, arranged in layers. In most networks, each neuron is connected to every other neuron in its adjacent layers. Such a network is known as a dense network.
Dense networks are relatively efficient to train and run, but they spread what they learn across a vast web of connections. The result is that simple concepts or functions can be split up between neurons in different parts of the model. At the same time, a single neuron can also end up representing several different features, a phenomenon known as superposition (a term borrowed from quantum physics). The upshot is that you can’t relate specific parts of a model to specific concepts.
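For readers who want a concrete picture, here is a minimal sketch (not OpenAI’s code; the concept labels and numbers are invented for illustration) of the two ideas above: a dense layer has no zero weights, and superposition stores two unrelated features as overlapping patterns across the same handful of neurons, so no single neuron stands for either concept.

```python
import numpy as np

# Toy dense layer: every input feeds every neuron, so the weight matrix
# has no zero entries (this is what "dense" means here).
rng = np.random.default_rng(0)
dense_weights = rng.normal(size=(4, 3))    # 4 inputs, 3 neurons, all connected
print((dense_weights != 0).all())          # True: fully connected

# Superposition, in miniature: two made-up concepts ("quotation mark",
# "past tense") are each encoded as a direction over the SAME 3 neurons.
concept_quote = np.array([0.7, 0.5, 0.1])
concept_past  = np.array([0.1, 0.6, 0.8])

def activations(has_quote: bool, has_past: bool) -> np.ndarray:
    """Neuron activations when either concept is present in the input."""
    return has_quote * concept_quote + has_past * concept_past

print(activations(True, False))   # [0.7 0.5 0.1] -- all three neurons respond
print(activations(False, True))   # [0.1 0.6 0.8] -- the same neurons respond again
```

In a weight-sparse model of the kind described above, most of those weight-matrix entries would be forced to zero, which is presumably what makes the resulting circuits easier to pick apart.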