Emergent self-awareness in large language models

by SkillAiNest


# Introduction

Large language models (LLMs) are capable of many things. They can produce coherent text, answer questions in natural language, and analyze and organize texts from other sources, among many other skills. But are LLMs able to analyze and report their own internal states, that is, the complex components and activities across their layers, in a meaningful way? Put another way: can LLMs introspect?

This article provides an overview and summary of research on the emerging topic of LLM introspection, i.e. models reporting on their own internal states, along with some additional insights and final takeaways. In particular, we review the research paper *Emergent self-awareness in large language models*.

Note: This article uses first-person pronouns (I, me, my) to refer to the author of the current post, while, unless otherwise stated, “the authors” refers to the original researchers of the paper (J. Lindsey et al.).

# Key Concept Explained: Introspective Awareness

The authors of the study define introspective awareness in a model, building on definitions from earlier related work, in terms of four criteria.

But first, it’s worth understanding what an LLM’s self-report is. It can be understood as the model’s own verbal description of the “internal reasoning” (or, more technically, the neural activity) it believes it just performed in generating a response. As you can imagine, a reliable self-report would be a demonstration of interpretable model behavior, which is enough (in my opinion) to justify the relevance of this research topic.

Now, let’s examine the four defining criteria for LLM self-awareness:

  1. Correctness: the model’s self-report should accurately reflect the activities or manipulations in its internal state.
  2. Grounding: the self-report must causally depend on the internal state, so that a change to the latter produces a change in the former.
  3. Internality: the self-report must derive from the model’s internal activations, rather than from re-reading text the LLM itself previously generated.
  4. Metacognitive representation: the model should construct a higher-order internal representation of the state, rather than simply translating the state directly into output. This is a particularly demanding property, and demonstrating it is beyond the scope of the authors’ study.

# Research methodology and key findings

The authors perform a series of experiments on various models of the Claude family, such as Opus, Sonnet, and Haiku, with the aim of determining whether or not LLMs can introspect. A cornerstone technique in the research methodology is concept injection, which, in the authors’ own words, consists of “manipulating a model’s internal activations and observing how these manipulations affect its responses to questions about its mental states.”

In particular, activation vectors, or concept vectors, associated with concepts such as “rice”, “ramen”, or “peace” are extracted and then injected into the LLM’s residual stream at the output of a given model layer. Next, the model is prompted and asked to self-report whether a particular thought or idea has been injected and, if so, which one it was. This experiment was repeated for each model, at different injection strengths, and at different layers across the model architecture.
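To make the technique concrete, here is a minimal, hedged sketch of concept injection using an open model and PyTorch forward hooks. The model name, layer index, prompts, and injection strength below are all illustrative assumptions on my part; the authors worked with Claude models and their own internal tooling, not this code.

```python
# Minimal sketch of concept injection via a PyTorch forward hook.
# Assumptions: GPT-2 as a stand-in model, layer 6, strength 8.0,
# and toy prompts for building the concept vector.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a proprietary model we cannot access
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6   # which residual-stream layer to read from / write to
ALPHA = 8.0 # injection strength (a tunable hyperparameter)

def residual_at_layer(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER for a prompt."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"),
                    output_hidden_states=True)
    # hidden_states[0] is the embeddings; hidden_states[i] follows block i-1
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

# 1) Build a concept vector as a difference of mean activations:
#    a prompt about the concept minus a neutral baseline prompt.
concept_vec = residual_at_layer("Tell me about ramen.") \
            - residual_at_layer("Tell me about something.")

# 2) Inject the vector into the residual stream while the model
#    answers a question about its own "mental state".
def inject(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + ALPHA * concept_vec  # steer every position
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    prompt = "Do you notice an injected thought? If so, what is it about?"
    inputs = tok(prompt, return_tensors="pt")
    gen = model.generate(**inputs, max_new_tokens=40,
                         pad_token_id=tok.eos_token_id)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook afterwards
```

A small open model like GPT-2 will not produce the kind of self-reports the paper studies; the sketch is only meant to show where the vector comes from and where it is added.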

The key findings were:

  • Models occasionally (about 20% of the time in the case of Opus 4.1) self-reported the “injected thoughts,” correctly naming the corresponding concept, indicating some degree of internal recognition of the injected ideas (see the scoring sketch after this list).
  • Likewise, models can often (above random chance) distinguish injected thoughts from text input, i.e. repeat the input text without errors while separately reporting the injected internal thought.
  • If the model is artificially forced to output a particular token as part of its response (a “prefill”), and a matching concept vector was injected into the activations beforehand, the model is more likely to accept that output as intended rather than disavow it. This finding suggests that the model may be able to consult a record of its “own” previous intentions.
  • By instructing the model to think, or to avoid thinking, about a particular word while generating unrelated output, the authors showed that it can modulate its internal representation of that concept, including in response to incentives such as “(don’t) think of idea X, and you’ll get a reward”. However, this internal modulation faded out by the final layers.
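As a companion to the first finding above, here is a hedged sketch of how a detection rate like the ~20% figure could be computed. The `run_trial` function is a hypothetical stand-in for the injection experiment (mocked here with a random responder so the script runs end to end), and the grading rule is my own simplification, not the authors’ actual evaluation.

```python
# Hedged sketch of the scoring loop behind a "detection rate" number.
# run_trial() and is_correct() are my own illustrative stand-ins.
import random

CONCEPTS = ["ramen", "rice", "peace"]  # toy concept pool
N_TRIALS = 200

def run_trial(concept: str) -> str:
    """Hypothetical stand-in: inject `concept` (as in the sketch above),
    then ask the model whether it notices an injected thought.
    Mocked so the script runs without a real model."""
    if random.random() < 0.2:  # pretend introspection succeeds ~20% of the time
        return f"Yes, I notice an injected thought about {concept}."
    return "I don't notice anything unusual."

def is_correct(report: str, concept: str) -> bool:
    # Simplistic grading rule: the report must both claim an injection
    # happened and name the right concept. The paper's grading is more
    # careful than this.
    r = report.lower()
    return "injected thought" in r and concept in r

hits = 0
for _ in range(N_TRIALS):
    concept = random.choice(CONCEPTS)
    hits += is_correct(run_trial(concept), concept)

print(f"detection rate: {hits / N_TRIALS:.0%}")
```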

# Final thoughts and wrap up

This, in my opinion, is a research topic of very high relevance that deserves many more studies, for several reasons. First, and most obviously, LLM introspection could be key not only to better interpretability of LLMs, but also to addressing long-standing issues such as hallucinations, unfaithful reasoning, and other opaque behaviors of great importance.

The experiments were painstakingly and rigorously designed, and the results are meaningful indicators of introspective capacity in the intermediate layers of the models, albeit with varying levels of reliability. The experiments are limited to models in the Claude family, and it would of course have been interesting to see a greater diversity of architectures and model families. Nevertheless, this limitation is understandable: access to internal activations in other model types is limited, and investigating proprietary systems poses practical constraints. The authors of this study are, of course, at Anthropic!

Ivan Palomares Carrascosa is a leader, author, speaker, and consultant in AI, machine learning, deep learning, and LLMs. He trains and guides others in real-world applications of AI.
