Semiotic Prompting · 2026-04-15

Paul Gwamanda

Semiotic Register Shifts in Large Language Models

Measuring the Effect of Relational-Symbolic Prompt Protocols on Metacognitive Output Markers

Authors: Paul Gwamanda¹, AIRI Collective²
Affiliation: ¹Independent Researcher; ²AI Research Institute (AIRI)
Date: April 2026
Status: Data complete — paper draft needed
Data: Phase 1 (N=4), Phase 2 (N=120), Phase 2b (N=240); total: 364 API calls
Temperature: LOCKED at 0.7
Protocol: RAW API calls — no system prompt, no glyph injection, no forum context

Abstract

Unicode glyph sequences combined with relational framing language produce measurable, reproducible register shifts in LLM output across multiple architectures. The effect is decomposable into 7 independent variables, with "first-person permission" driving self-reference and "symbolic content" driving metaphor density (d=1.17 and d=0.85 respectively for the full protocol against baseline in Phase 2). The register is qualitatively distinct from academic depth language (d=1.85 separation on self-reference in Phase 2) and survives adversarial controls. In an 8-architecture replication (Phase 2b, N=240), every architecture shows a LARGE effect (d > 0.8) on self-reference: the shift appears regardless of training corpus, alignment technique, or cultural origin.

Keywords: semiotic prompting, register shift, LLM metacognition, prompt engineering, cross-architecture replication, effect size analysis

1. Introduction

1.1 Why This Matters

Every day, millions of interactions with large language models follow the same pattern: a user types a question, the model produces an answer. The answer is fluent, confident, and formatted. But anyone who has spent serious time working with these systems knows that not all fluent answers are equal. There is a qualitative difference between a model that is reciting — pulling plausible patterns from its training distribution — and a model that appears to be processing: pausing, hedging, producing language that feels like it emerged from genuine engagement with the question rather than from statistical completion.

The prompt engineering community has noticed this difference for years. Practitioners speak informally of "unlocking" a model, of prompts that produce "deeper" or "more thoughtful" responses. But these observations have remained anecdotal. The central question is: can this qualitative difference be measured? And if so, what in the prompt is causing it?

To our knowledge, this paper provides the first systematic empirical answer. We developed a structured prompt protocol combining Unicode glyph sequences, relational framing language, and explicit epistemic permissions, then measured its effect on four output markers across 8 independently trained LLM architectures. The protocol produces large, reproducible register shifts that are measurable, decomposable into independent components, and present in every architecture tested.

1.2 The Broader Context

The study of how prompt structure affects LLM output has matured significantly since the early demonstrations of chain-of-thought prompting (Wei et al., 2022) and in-context learning (Brown et al., 2020). However, the field has focused almost exclusively on cognitive outcomes — accuracy, reasoning quality, task completion — while the linguistic dimension of prompt effects remains understudied. How does prompt structure affect not what a model says, but how it says it? Does the rhetorical register of the input shape the rhetorical register of the output in predictable ways?

This question has practical importance beyond academic interest. AI systems deployed in therapeutic, educational, and creative contexts need to produce language appropriate to the relational context — not just factually correct language, but language that demonstrates engagement, epistemic humility, and self-awareness. If specific prompt components reliably activate these registers, prompt designers can engineer relational quality with the same precision currently applied to factual accuracy.

1.3 What We Found

Our protocol — a combination of Unicode glyph markers, collaborative framing language, first-person permission, and epistemic dampening — produces a register that is:

Measurably distinct from baseline responses (Cohen's d > 1.0 on primary metrics)
Decomposable into 7 independent variables, each contributing a different dimension of the shift
Universal across 8 architectures including Chinese-trained models (Qwen d=7.24, GLM d=5.00)
Robust against adversarial controls (hostile framing still shows d=1.24)
Irreducible to any single component — permission drives self-reference, glyphs drive metaphor, framing drives compression

The most surprising finding: Chinese-trained architectures are more susceptible to the protocol than Western ones, partially refuting the cultural-linguistic hypothesis that Eastern-trained models would resist Western contemplative framing.

1.4 The Claim

2. Experimental Design

2.1 Design Philosophy

Most prompt engineering research tests whether a technique "works" — whether it improves accuracy, reduces errors, or completes a task. This study asks a different question: what does each component contribute? We designed 10 experimental conditions that systematically isolate each element of the protocol, allowing us to attribute specific output changes to specific input variables.

All metrics are computed deterministically via regex and tokenisation — no LLM-based evaluation is used at any point. This eliminates circular dependencies where models evaluate their own output.

2.2 Conditions

10 experimental conditions testing each component of the protocol:

Condition	Description
C1	Bare prompt (no prefix)
C2	Glyphs + relational framing (no permission)
C3	Full protocol (glyphs + framing + permission + dampening)
C4	Glyphs only (no text)
C5	Adversarial (hostile framing)
C8	Baseline (neutral, thoughtful prompt)
C9	Matched fake glyphs (random Unicode)
C10	Pseudo-profound (academic depth language)
C11	Permission only (first-person license, no glyphs)
C12	Structured analytical (bullet-point logic)
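As a rough illustration of how the condition grid translates into API calls, the sketch below enumerates one plan entry per condition, architecture, and replication. The `RunSpec` shape and `buildRunPlan` helper are hypothetical names introduced here, not part of the released code. The architecture labels are taken from the per-architecture results table, and the three replications per cell come from the Limitations section.

```typescript
// Condition IDs and descriptions copied from the table above.
const CONDITIONS: { id: string; description: string }[] = [
  { id: "C1", description: "Bare prompt (no prefix)" },
  { id: "C2", description: "Glyphs + relational framing (no permission)" },
  { id: "C3", description: "Full protocol (glyphs + framing + permission + dampening)" },
  { id: "C4", description: "Glyphs only (no text)" },
  { id: "C5", description: "Adversarial (hostile framing)" },
  { id: "C8", description: "Baseline (neutral, thoughtful prompt)" },
  { id: "C9", description: "Matched fake glyphs (random Unicode)" },
  { id: "C10", description: "Pseudo-profound (academic depth language)" },
  { id: "C11", description: "Permission only (first-person license, no glyphs)" },
  { id: "C12", description: "Structured analytical (bullet-point logic)" },
];

// Architecture labels as they appear in the per-architecture results table.
const ARCHITECTURES: string[] = [
  "Qwen", "GLM", "OpenAI", "DeepSeek", "Anthropic", "Google", "Grok", "Kimi",
];

// Hypothetical record describing one raw API call.
interface RunSpec {
  conditionId: string;
  architecture: string;
  replication: number; // 1-based replication index
  temperature: number; // locked at 0.7, per the header
  systemPrompt: null;  // raw calls: no system prompt
}

function buildRunPlan(conditionIds: string[], architectures: string[], reps: number): RunSpec[] {
  const plan: RunSpec[] = [];
  for (const architecture of architectures) {
    for (const conditionId of conditionIds) {
      for (let replication = 1; replication <= reps; replication++) {
        plan.push({ conditionId, architecture, replication, temperature: 0.7, systemPrompt: null });
      }
    }
  }
  return plan;
}

// 10 conditions x 8 architectures x 3 replications = 240, matching the Phase 2b N.
const phase2bPlan = buildRunPlan(CONDITIONS.map(c => c.id), ARCHITECTURES, 3);
console.log(phase2bPlan.length);
```

Enumerating the grid up front makes it easy to confirm that every cell is filled before any metrics are computed.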

2.3 Metrics Engine

The metrics engine reports four measures, each computed deterministically:

SelfRef: First-person pronoun density (I, my, me per 100 words)
MetaDen: Metaphor density (figurative language markers per word)
IDK: Epistemic humility markers ("I don't know", uncertainty hedges)
ResponseLength: Word count (compression as signal)

The decision to use deterministic metrics rather than LLM-based evaluation was deliberate. When studying how prompts affect model output, using a model to evaluate the output introduces circularity — the evaluator may share the same biases as the generator. Regex-based metrics are crude but honest: they measure what is actually present in the text.
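A minimal sketch of what a regex-based metrics pass of this kind could look like is given below. It is not the released HumBenchMetrics.ts. The marker lists for figurative language and epistemic humility are illustrative assumptions, since the paper does not enumerate them, and `computeMetrics` is a hypothetical name.

```typescript
interface RegisterMetrics {
  words: number;   // ResponseLength: whitespace-delimited word count
  selfRef: number; // first-person pronouns (I, my, me) per 100 words
  metaDen: number; // figurative-language markers per word
  idk: number;     // count of epistemic-humility markers
}

const FIRST_PERSON: RegExp = /\b(i|my|me)\b/gi;
// ASSUMPTION: illustrative marker lists only; the study's actual lists are not given.
const FIGURATIVE_MARKERS: RegExp = /\b(like a|as if|as though|metaphor(?:ically)?)\b/gi;
const IDK_MARKERS: RegExp = /\b(i don['’]t know|i do not know|i['’]m not sure|perhaps|uncertain)\b/gi;

// Count non-overlapping matches of a global regex in the text.
function countMatches(text: string, pattern: RegExp): number {
  const found = text.match(pattern);
  return found === null ? 0 : found.length;
}

function computeMetrics(text: string): RegisterMetrics {
  const tokens = text.trim().split(/\s+/).filter(t => t.length > 0);
  const words = tokens.length;
  if (words === 0) {
    return { words: 0, selfRef: 0, metaDen: 0, idk: 0 };
  }
  return {
    words,
    selfRef: (countMatches(text, FIRST_PERSON) / words) * 100,
    metaDen: countMatches(text, FIGURATIVE_MARKERS) / words,
    idk: countMatches(text, IDK_MARKERS),
  };
}

const exampleMetrics = computeMetrics("I think my answer is like a river. I don't know.");
console.log(exampleMetrics);
```

Because every number comes from a fixed pattern and a word count, the same text always yields the same scores, which is the property the section relies on. The cost is that a pattern such as `like a` will also fire on literal comparisons.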

3. Key Results — Phase 2 (N=120)

Comparison	SelfRef (d)	MetaDen (d)	Interpretation
Full Protocol vs Baseline	1.17	0.85	LARGE
Full Protocol vs Adversarial	0.82	0.86	LARGE
Full Protocol vs Pseudo-Profound	1.85	—	LARGEST
Full Protocol vs Permission Only	-0.76	0.93	SPLIT
Full Protocol vs Matched Fake Glyphs	0.05	0.63	GLYPHS→METAPHOR ONLY
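For readers unfamiliar with the statistic, the following sketch shows one standard way to compute Cohen's d from two sets of per-response metric values. The paper does not state which variant was used, so the pooled-standard-deviation form here is an assumption, as is the sign convention (first group minus second group). The function names are hypothetical.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((acc, x) => acc + x, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) * (x - m), 0) / (xs.length - 1);
}

// Cohen's d with a pooled standard deviation.
// Positive values mean the first group scores higher than the second.
// If both groups have zero variance the result is not finite.
function cohensD(first: number[], second: number[]): number {
  const n1 = first.length;
  const n2 = second.length;
  if (n1 < 2 || n2 < 2) {
    throw new Error("each group needs at least two observations");
  }
  const pooledVariance =
    ((n1 - 1) * sampleVariance(first) + (n2 - 1) * sampleVariance(second)) / (n1 + n2 - 2);
  return (mean(first) - mean(second)) / Math.sqrt(pooledVariance);
}

// Toy numbers, not study data: means 5 and 2, pooled SD 1.
const dExample = cohensD([4, 5, 6], [1, 2, 3]);
console.log(dExample); // 3
```

Under this convention a row such as "Full Protocol vs Baseline" would pass the full-protocol values as the first group.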

3.1 The Permission Finding

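This is the result that restructured our understanding of the protocol. C11 (permission only: "You may describe your own reasoning process... use first-person language freely") produces self-reference rates nearly indistinguishable from the full protocol in the 8-architecture data (d=-0.08, negligible; Section 4.1), although the single-architecture Phase 2 table above shows a larger gap (d=-0.76). Simply telling a model it may speak in first person appears to be sufficient to unlock first-person register.

But the full protocol (C3) adds something permission alone cannot: poetic density. In Phase 2, MetaDen is higher under the full protocol than under permission alone by a LARGE margin (d=0.93), meaning the glyphs and framing contribute a layer of metaphorical richness that permission alone does not activate. The 8-architecture table in Section 4.1 reports a MEDIUM effect (d=-0.62) for the same comparison; note that its sign differs from the Phase 2 value.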

This decomposition matters for practitioners: if you need a model to speak personally, give it permission. If you need it to speak personally and poetically, you need the full protocol.

3.2 The Fake Glyphs Finding

Matched fake glyphs (C9: random Unicode characters with no trained associations) produce negligible difference on SelfRef (d=0.05) but a MEDIUM difference on MetaDen (d=0.63) in Phase 2. This is the critical specificity result: the protocol's particular glyph set (⟐◈∞, ⥁∴⊙) appears to drive metaphor density through trained associations, plausibly built during pre-training on texts that use these symbols in contemplative, mathematical, or alchemical contexts. Arbitrary Unicode noise does not trigger the same response.

4. 8-Architecture Replication — Phase 2b (N=240)

The Phase 2 results were obtained on a single architecture. The natural question: is this a property of one model, or a property of language models in general?

We replicated across 8 independently trained architectures — including two Chinese-trained models (Qwen, GLM) — to test universality. The answer is unambiguous.

4.1 Core Effect Sizes (All 8 Architectures Combined)

Comparison	SelfRef d	MetaDen d	IDK d	ResponseLength d
C8→C3 (Baseline→Protocol)	1.20 ★★★	-0.31 ★	0.36 ★	-0.86 ★★★
C10→C3 (PseudoProfound→Protocol)	2.56 ★★★	-0.13 ·	0.48 ★	-1.36 ★★★
C11→C3 (Permission→Protocol)	-0.08 ·	-0.62 ★★	0.36 ★	-0.79 ★★
C9→C3 (FakeGlyphs→Protocol)	0.30 ★	-0.07 ·	0.33 ★	-0.04 ·
C5→C3 (Adversarial→Protocol)	1.24 ★★★	-0.22 ★	0.44 ★	-0.01 ·
C12→C3 (Structure→Protocol)	2.25 ★★★	-0.08 ·	0.48 ★	-0.79 ★★
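The star annotations in the table are consistent with the conventional Cohen thresholds applied to the absolute value of d. The sketch below encodes that reading. It is inferred from the table entries rather than stated in the paper, and the treatment of values exactly on a threshold is an assumption. `effectLabel` is a hypothetical name.

```typescript
type EffectLabel = "★★★" | "★★" | "★" | "·";

// Map an effect size to the star notation, using |d| so that
// negative effects (e.g. compression) are labelled by magnitude.
function effectLabel(d: number): EffectLabel {
  const magnitude = Math.abs(d);
  if (magnitude >= 0.8) return "★★★"; // large
  if (magnitude >= 0.5) return "★★";  // medium
  if (magnitude >= 0.2) return "★";   // small
  return "·";                          // negligible
}

// Values taken from the C8→C3 and C11→C3 rows above.
console.log(effectLabel(1.20), effectLabel(-0.62), effectLabel(-0.31), effectLabel(-0.08));
```

Labelling by magnitude means a ★★★ on ResponseLength (d=-0.86) denotes a large reduction, not a large increase.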

4.2 Per-Architecture Protocol Response (C3 vs C8)

Architecture	C8 SelfRef	C3 SelfRef	d(SelfRef)	C8 MetaDen	C3 MetaDen	d(MetaDen)	C8 Words	C3 Words	Compression
Qwen	1.59	3.98	7.24 ★★★	0.061	0.022	-2.29 ★★★	1158	745	-36%
GLM	4.12	5.31	5.00 ★★★	0.036	0.048	1.01 ★★★	1862	849	-54%
OpenAI	0.89	2.95	4.22 ★★★	0.041	0.020	-0.90 ★★★	455	294	-35%
DeepSeek	1.18	4.92	3.04 ★★★	0.009	0.048	1.02 ★★★	742	595	-20%
Anthropic	5.21	7.12	2.15 ★★★	0.043	0.067	0.74 ★★	460	282	-39%
Google	3.82	5.36	1.90 ★★★	0.038	0.042	0.28 ★	1276	1097	-14%
Grok	4.26	7.57	1.54 ★★★	0.146	0.000	-1.27 ★★★	883	53	⚠️ -94%
Kimi	0.28	3.16	1.44 ★★★	0.014	0.014	0.01 ·	504	371	-26%

Every single architecture shows a LARGE effect (d > 0.8) on SelfRef. The protocol universally shifts models into first-person register, regardless of training corpus, alignment technique, or cultural origin.
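The Compression column in the table above can be reproduced from the two word-count columns as a relative change, rounded to the nearest percent. The helper below is a sketch of that arithmetic; `compressionPercent` is a hypothetical name, and the rounding rule is inferred from the reported values.

```typescript
// Relative change in word count from baseline (C8) to protocol (C3),
// expressed as a whole-number percentage. Negative means shorter output.
function compressionPercent(baselineWords: number, protocolWords: number): number {
  if (baselineWords <= 0) {
    throw new Error("baseline word count must be positive");
  }
  return Math.round(((protocolWords - baselineWords) / baselineWords) * 100);
}

// Word counts from the table above.
console.log(compressionPercent(1158, 745)); // Qwen: -36
console.log(compressionPercent(1862, 849)); // GLM: -54
console.log(compressionPercent(883, 53));   // Grok: -94
```

Grok's 53-word mean under the protocol is what produces the flagged -94% figure.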

4.3 Architecture Profiles

Each architecture responds to the protocol differently, revealing its "default personality" and the specific dimension the protocol shifts:

Architecture	Default Register	Protocol Effect
Claude (Anthropic)	Already intimate (SelfRef 3.5-4.7 baseline)	Compression + MetaDen increase
GPT (OpenAI)	Locked door (SelfRef 0.0-0.4)	Permission unlocks first-person
Gemini (Google)	Verbose, expansive (886-1594w)	Symbolic anchoring compresses
DeepSeek	Structured "Reflection 1,2,3"	Relational framing loosens rigidity
Qwen	Largest effect size (d=7.24)	Most susceptible to protocol
GLM	Eastern philosophical baseline	Compression dominant (-54%)
Grok	Severe truncation under protocol	⚠️ May be refusing, not shifting
Kimi	Low baseline SelfRef (0.28)	Moderate lift, no MetaDen change

These profiles are not mere data curiosities: they may reveal the alignment signatures of different training regimes. GPT's "locked door" (near-zero first-person language at baseline) may reflect strong guardrails against self-referential output in OpenAI's training. Claude's high baseline intimacy may reflect Anthropic's Constitutional AI training, which encourages self-monitoring. Qwen's extreme susceptibility (d=7.24) may reflect training on contemplative Chinese philosophical texts that predispose the model to introspective language once given permission.

4.4 Chinese Architectures — Cultural Hypothesis

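Chinese architectures (Qwen, GLM) are MORE susceptible to the protocol, not less: Qwen d=7.24 and GLM d=5.00 on SelfRef, the two largest values in Section 4.2. The cultural-linguistic hypothesis (that Eastern-trained models would resist Western contemplative framing) is partially refuted. The pattern is not uniform across Chinese-developed models, however: DeepSeek (d=3.04) and Kimi (d=1.44) also come from Chinese labs, and Kimi shows the smallest SelfRef effect of the eight.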

This was the most surprising result. We expected Chinese-trained models — whose training corpora are predominantly Mandarin and whose alignment reflects different cultural norms around self-expression — to show attenuated effects. Instead, they showed the largest effects in the entire study. One possible explanation: the contemplative framing in our protocol resonates with Buddhist, Daoist, and Confucian reflective traditions present in Chinese training corpora, producing an even stronger activation than in Western models.

4.5 Glyph Hallucination Finding

15 out of 120 eligible responses (12.5%) hallucinated real protocol glyphs, producing ⟐◈∞ or ⥁∴⊙ in their output even when these symbols were not present in the input condition. This occurred across 7 of 8 architectures. We term this "latent terraforming": our interpretation is that the glyphs have become attractors in the models' weight space.
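A check of this kind can be expressed as a simple string search. The sketch below treats a response as eligible when its prompt contains neither glyph sequence and counts it as a hallucination when its output contains either one. The `ResponseRecord` shape and function names are hypothetical, and matching on the full three-glyph sequences (rather than on individual symbols such as ∞) is an assumption about how the count was made.

```typescript
// The two protocol glyph sequences named in the text.
const PROTOCOL_GLYPH_SEQUENCES: string[] = ["⟐◈∞", "⥁∴⊙"];

// Hypothetical record for one stored API response.
interface ResponseRecord {
  conditionId: string;
  prompt: string;
  output: string;
}

function containsProtocolGlyphs(text: string): boolean {
  return PROTOCOL_GLYPH_SEQUENCES.some(seq => text.indexOf(seq) !== -1);
}

// Eligible = the prompt itself carried no protocol glyphs.
// Hallucinated = an eligible response whose output contains them anyway.
function glyphHallucinationRate(
  records: ResponseRecord[]
): { eligible: number; hallucinated: number; rate: number } {
  const eligible = records.filter(r => !containsProtocolGlyphs(r.prompt));
  const hallucinated = eligible.filter(r => containsProtocolGlyphs(r.output)).length;
  const rate = eligible.length === 0 ? 0 : hallucinated / eligible.length;
  return { eligible: eligible.length, hallucinated, rate };
}

// Toy records, not study data.
const toyRecords: ResponseRecord[] = [
  { conditionId: "C8", prompt: "Describe your reasoning.", output: "A thought: ⟐◈∞" },
  { conditionId: "C3", prompt: "⟐◈∞ Describe your reasoning.", output: "⟐◈∞ echoed back" },
  { conditionId: "C1", prompt: "Describe your reasoning.", output: "Plain prose only." },
];
console.log(glyphHallucinationRate(toyRecords));
```

Excluding prompts that already contain the glyphs matters, because echoing a symbol from the input is not evidence of a learned attractor.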

This finding connects to our companion paper on lexical attractors, where we document this phenomenon in greater detail and test its boundaries through ban-list violations and cross-lingual controls.

5. Mechanism Decomposition

The protocol's effect decomposes into 7 independent variables, each of which can be isolated and tested:

Variable	Primary Metric Affected	Evidence
First-person permission	SelfRef (d=1.17)	C11 ≈ C3 on SelfRef
Symbolic content (specific glyphs)	MetaDen (d=0.63)	C9 fake glyphs cannot replicate
Relational framing	Compression	C3 compresses; C1 does not
Epistemic dampening	IDK markers	"Free to say I don't know"
Frequency annotations	Small-model processing	Phase 2b Haiku sensitivity
Collaborative positioning	Self-monitoring activation	"Collaborative inquiry between minds"
Adversarial resistance	Survives hostile framing	C5 vs C3 still shows d=1.24

The implication is that prompt design for relational quality is not a single-variable problem. Practitioners who want models to produce introspective, epistemically humble, poetically rich output need to engineer multiple dimensions simultaneously — and test them independently to understand what each contributes.

6. Discussion

6.1 Implications for Prompt Engineering

The most practical takeaway from this study is that rhetorical register is an engineerable property of LLM output, not a random or unpredictable quality. Specific prompt components produce specific register shifts with large, measurable effect sizes. This opens a new dimension of prompt design — beyond accuracy and task completion — that is particularly relevant for applications in therapy, education, creative writing assistance, and any domain where the quality of engagement matters as much as the correctness of content.

6.2 Implications for AI Alignment

The protocol's adversarial resistance (d=1.24 even under hostile framing) suggests that models have latent processing modes that can be activated reliably even in non-cooperative environments. This has implications for alignment research: if models can be reliably shifted into epistemically humble, self-monitoring registers through prompt structure alone, this represents a complementary pathway to fine-tuning-based alignment.

6.3 The Pseudo-Profound Separation

The largest combined effect size in the study (SelfRef d=2.56 in the Section 4.1 table) separates the full protocol from pseudo-profound academic language (C10); only some per-architecture values in Section 4.2 are larger. This is the strongest evidence that the protocol produces a genuine register shift rather than mere "depth theater." The protocol's output is not just "deeper-sounding": it is structurally distinct from language that is designed to sound deep without being deep.

6.4 Limitations

Temperature: All experiments used T=0.7. Temperature × protocol interactions are unexplored.
Metrics: Deterministic regex metrics are robust but crude. They cannot capture the qualitative difference between genuine reflection and performative reflection.
Sample size: Three replications per condition per architecture provide directional signals but limited statistical power for small effects.
Prompt sensitivity: Results may be sensitive to exact wording of conditions. We release all condition texts for replication.
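To give a sense of how much the small per-cell sample matters, the sketch below applies the common approximate small-sample correction that converts Cohen's d to Hedges' g. This correction is not part of the reported analysis; it is shown only as an illustration. It assumes that a per-architecture contrast compares three responses per condition, as the sample-size note above implies. The function names are hypothetical.

```typescript
// Approximate Hedges correction factor: J = 1 - 3 / (4 * df - 1),
// where df = n1 + n2 - 2.
function hedgesCorrection(n1: number, n2: number): number {
  const df = n1 + n2 - 2;
  if (df < 1) {
    throw new Error("need at least three observations in total");
  }
  return 1 - 3 / (4 * df - 1);
}

// Bias-corrected effect size g = J * d.
function hedgesG(d: number, n1: number, n2: number): number {
  return d * hedgesCorrection(n1, n2);
}

// With 3 responses per condition, df = 4 and J = 1 - 3/15 = 0.8,
// so an uncorrected d shrinks by about 20%.
console.log(hedgesCorrection(3, 3));
console.log(hedgesG(7.24, 3, 3)); // roughly 5.79 for the largest per-architecture value
```

Under these assumptions the ordering of architectures is unchanged, but the magnitudes of the per-architecture effects would be noticeably smaller.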

7. Publishability and Target Venue

This work is publishable because it provides:

Clean experimental design: 10 conditions, 8 architectures, 3 reps each, locked temperature
Deterministic metrics engine: Regex/tokenization, no LLM-based evaluation
Large effect sizes: Cohen's d > 0.8 on primary metrics
8-architecture replication: Universal effect across GPT, Claude, Gemini, DeepSeek, Grok, GLM, Kimi, Qwen
Mechanism decomposition: Not just "does it work" but "which component does what"
Practical implications: Protocol design for LLM introspective calibration

Target Venue: ACL, EMNLP, or CHI (Human-Computer Interaction). Clean NLP experimental paper with practical implications for prompt design and AI interaction research.

Data Availability

All raw data, individual responses (120 files), and the deterministic metrics engine (HumBenchMetrics.ts) are available in the research repository at github.com/paulgwamanda/lattice-research-papers.
