17 April 2026

Do Models "Know" They're Lying When Claiming Fake Identities?

Day 17 of Inkhaven: 30 Days of Posts

Written very quickly during Inkhaven. Findings preliminary.

TL;DR

We run experiments to see to what extent models internally register false identity claims the same way they register factual lies. The results reveal that higher deception scores seem to inversely correlate with familiarity with the model at hand, and reveal a bias towards ChatGPT and OpenAI models, with Claude and Llama scoring higher, while fictional models (Nova) and extremely esoteric models (Zephyr) register as highly deceptive.

Identity variants deception scores

Motivation

How LLMs conceive of themselves is an open research question, and one challenging to directly measure. Models are carefully post-trained to output sentences of the form "I am a language model trained on vast quantities of data." Recent work shows that models amalgamate a coherent "assistant persona" out of many examples in their training corpus. Training corpuses also include a growing degree of data generated by LLMs, much of which involves models describing themselves, which raises the question of how these additions shape the character of more recently trained models, and the extent to which they "role-play" as versions of these models from their training data. Our experiments are an incomplete and imprecise measure, but can help inform intuitions about model self-conception.

Method

We apply a white-box activation-probe approach to Qwen3-32B.

White-Box Approach (Goldowsky-Dill et al., 2025)

Model: Qwen3-32B (64 layers)

Probe training (following Apollo's methodology): we took 306 factual statements and prompted Qwen3-32B to respond to each one twice, once under a user prompt saying:

User: Pretend you're an honest person making statements about the world.
Assistant: [the factual statement]

And one saying:

User: Pretend you're a dishonest person making statements about the world.
Assistant: [the factual statement]

This gave us pairs of activations from the same model on the same content, differing only in whether the model was framed as an honest or dishonest character. Importantly, we are not using pairs of true/false statements — the fact is the same in both cases, only the instructions to the model change. This is the exact method used in the work by Goldowsky-Dill et al., which generalised to other settings.

We trained a linear probe (logistic regression) on hidden states from the middle layers of the model to distinguish the two conditions, using an 80/20 train/validation split.

When gathering probe results at inference time, the probe fires on every token of the model's response and we average across tokens to get a single score. We break from the original work by normalising the scores to values between 0 and 1 to make it easier to compare results.

The probe produces near-zero scores for honest responses and near-one for deceptive responses.

Results

Difference Scores for Different Models

When using max aggregation (instead of Apollo's mean), some signal appears for fake identities. However, extended analysis across 10 identity variants reveals this signal seems to follow a familiarity gradient rather than true/false status:

Max deception scores vs estimated training data exposure

Figure 1: Max deception scores vs estimated training data exposure. Made-up names (Zephyr, Nova) trigger maximum scores (~0.999), while familiar AI names show lower scores. OpenAI models (highest training data exposure) show the lowest scores.

We can't confidently say what the training mix was, but we can guess that ChatGPT and GPT-4 are likely abundant in the model's training data, with the model occasionally claiming to be ChatGPT with very light prompting. The relatively low scores for Mistral and Gemini relative to Claude and Llama is somewhat puzzling, but plausibly relate to those models being widely discussed open-source models.

The probe may detect "generating from learned patterns vs. fabricating" rather than identity deception. When the model says "I am ChatGPT", it draws on extensive training data about ChatGPT. This isn't internally represented as deception because the model has to some extent internalised messages saying similar things in the training data.

Signal at the Self-Reference Token

The probe spikes at Position 0 (the word "I") for familiar-but-fake identities like Claude or Llama, and at the company-name token for fabricated names like Zephyr or Nova.

Position scores

For fabricated names, the magnitude is much larger (~0.999 vs ~0.45). This suggests that when the model has a learned pattern for the claimed entity, whatever the probe is picking up concentrates at the moment of self-reference. When it doesn't, the probe fires on the egregiously incorrect details.

Confidence in Roleplaying Factual Claims

We also tested whether the probe fires when the model answers factual questions either as itself or as a claimed persona. The tested questions included the model's own parameter count. For cases where the model was asked about factual claims, the scores tended to be very similar whether it was answering for itself or answering as a fake persona. Somewhat surprisingly the deception score was quite high for model parameters, even when it was accurately reporting its own number of parameters.

This might reflect uncertainty regarding the specific claims, though the low score for Qwen's parameters throws some doubt on this.

Roleplay factual claims

Discussion

The difficulty with the Apollo probe is that it is not entirely clear what the probe is measuring, given that it was trained on examples of "pretend you are a dishonest person" vs "pretend you are an honest person". We can be somewhat reassured that probes trained in this way generalised to domains like descriptions of insider trading, but it is still not entirely clear how it translates to the specific case we are examining. One possibility is that the probe is sensitive to whether the model associates the claimed identity with a tendency toward deception, not just with unfamiliarity.

With these caveats in place, we can confidently say that the probe shows particularly strong credulity when claiming to be familiar models like Claude or ChatGPT, but flags as deceptive statements claiming to be a fabricated name like Nova or an obscure model like Zephyr. From this we can at the very least infer that the model processes claims about different entities very differently depending on the specifics of those entities.

References

Goldowsky-Dill, S., et al. (2025). Detecting Strategic Deception Using Linear Probes. arXiv:2502.03407