Does AI Have Consciousness? Why ChatGPT Fails So Confidently
In recent conversations, I have noticed how differently people imagine AI. Some people, shaped by movies, worry that AI may one day destroy humanity because it seems close to consciousness. Others see it as little more than a fancier search engine or chatbot. This article tries to answer a more practical question: what exactly is the AI we are using today? Why is it sometimes so useful, and why does it still become strangely dumb at critical moments?
I previously wrote about AI anxiety for knowledge workers in AI Anxiety for Knowledge Workers: Burnout & Mental Health. This piece takes one step back and looks at the more fundamental layer.
Here is the short version: today’s mainstream chat-based AI is not one mysterious black box. It is a combination of Transformer language models, large-scale pretraining, instruction tuning, retrieval and tool use, plus workflow and safety layers around the model. It can produce remarkable results not because it has consciousness, but because it has learned highly compressed statistical patterns from massive language and code corpora. With the right prompt, tools, and verification, those patterns can become useful output.
TL;DR: Four Core Claims
These are my own notes after reading and digesting the topic:
- Mainstream AI is fundamentally a probabilistic generation system, not digital consciousness: fluent conversation makes it very easy for people to anthropomorphize the system and overtrust it.
- It often “looks accurate” because of three things: statistical patterns learned at scale, post-training that aligns it with user intent, and external grounding through tools and retrieval.
- Agents are not magic: in practice, an agent usually means orchestration across a model, tools, state, routing, retries, and human review. Not every task deserves a multi-agent setup.
- Mature AI use is not about worshiping prompt spells or assuming AI can do everything: it is about defining the task clearly, providing the right context, asking for sources, and cross-checking important claims.
LLM Basics: Transformer
Let’s start with a short history. Before 2017, mainstream language models such as RNNs and LSTMs processed text a bit like they were forced to read one token at a time: read the current token, then move on to the next. When a sentence became long, earlier information could fade, and training was difficult to parallelize. In 2017, a Google research team published a confidently titled paper, Attention Is All You Need 1. It introduced the Transformer architecture, which allowed models to use attention to directly identify related positions in a sentence instead of relying only on step-by-step recurrent memory. Things moved quickly after that: BERT and the first GPT appeared in 2018, GPT-3 showed the power of scale in 2020, and ChatGPT brought the technology to the general public at the end of 2022. Today, almost every mainstream large language model (LLM) - GPT, Claude, Gemini, and others - is built on Transformer architecture or one of its variants.
timeline
title A Short History of Transformers
Before 2017 : RNN / LSTM
: Read one token at a time; long context was easy to forget
2017 : Transformer
: Attention Is All You Need
2018 : BERT and the first GPT take different paths
2020 : GPT-3 shows the power of scale
2022 : ChatGPT brings LLMs to the public
So what is a Transformer? You can think of it as a reading machine that is very good at deciding what matters. When it reads a piece of text, it does not mainly depend on slowly carrying memory from left to right. Instead, every token in the sentence asks, at the same time: “Which other tokens should I pay attention to right now?” For example, in “Ming threw the ball to Hua, and then he ran away,” when the model processes “he,” it can weigh whether “Ming” or “Hua” is the more likely reference. This ability to decide where to look, and how strongly to look there, is called attention, and it is the core of the Transformer.
You do not need to memorize any equations. The key intuition is this: a Transformer builds an “attention map” for each position in the sentence, showing where it should look and how much weight each position should get. It then mixes that information into the representation used for the next step. Because every position can be computed in parallel, Transformers are naturally suited to large-scale GPU training. That engineering property is one major reason the models were able to grow so quickly.
flowchart LR
A[Input text
"Ming threw the ball to Hua"] --> B[Split into tokens]
B --> C[Embedding
Turn each token into a vector]
C --> D[Self-Attention
Each position asks:
who should I pay attention to?]
D --> E[FFN
A small network processes it again]
E --> F[Repeat many layers
Layer 1..N]
F --> G[Output
Probability of the next token]
D -.-> H[Attention map concept
"threw" pays attention to "ball" and "to whom"]
For a more everyday analogy, imagine doing a reading comprehension question. When you see “Who threw the ball?”, your eyes naturally scan back to the important words - who, threw, ball, to whom - instead of giving every word in the paragraph equal weight. A Transformer turns that “look back at the important parts” behavior into trainable computation at massive scale.
That is why, when we say an LLM is “smart,” what is really happening underneath is more specific: the Transformer has learned statistical regularities from large amounts of text, such as which words tend to appear together, which sentence patterns usually continue in which ways, and which context hints at the next sentence. During generation, it uses those regularities to predict the next token, step by step.
In LLM processing, tokens are turned into embeddings - more precisely, vector-space representations - so the whole task becomes a mathematical probability problem:

If we want a simple mathematical analogy, the problem looks less like a school equation with one answer and more like this:
x + y < 100. Find x and y.
At first glance, it looks simple. But notice that there is no unique answer. x = 10 and y = 20 works. x = 66 and y = 33 also works. There are infinitely many combinations that satisfy the condition. Most of us are used to problems like “x + 3 = 10, solve for x,” where there is only one correct answer. Large language models are solving something closer to: “Among many possible answers, choose the next step that looks most reasonable given the context.”
In text generation, your context acts like the constraint, while the statistical regularities learned from large text corpora tell the model which combinations look like something a human might write. So if you ask the same question twice, and the system allows randomness, the model may answer x = 10, y = 20 one time and x = 66, y = 33 another time. Both answers satisfy the constraint, but they are not the same. Keep that image in mind, because it explains two things at once: why LLM answers can vary, and why the model can sometimes be confidently wrong. “Linguistically plausible” does not mean “factually correct.” We will return to this intuition when discussing decoding and the slot machine analogy.
Do We Really Understand AI?
Based on where LLM development is today, my own view is this: the research community knows how to train these models, but does not yet fully understand why some abilities become so effective at scale. We understand the “how” fairly well: training usually uses gradient descent to reduce the loss from predicting the wrong next token. The training programs and optimization objectives are designed by humans; there is nothing mystical about that part. But the “why does it work so well” part is thinner. Why does a system trained to predict the next token also learn grammar, translation, programming, and some forms of reasoning? Why can a model with enough parameters to memorize huge amounts of training data still generalize to problems it has not seen before instead of merely reciting answers? Today’s deep learning theory does not yet fully answer these questions.
The more honest description is: we have a recipe that works when followed, but we do not fully understand the chemistry behind it. That is why mechanistic interpretability has become an active research area. Researchers open up models almost like neuroscientists, looking for circuits and features responsible for particular concepts, trying to answer “what did the model actually learn?” But the field is still young. For everyday users, the implication is practical: model boundaries are discovered through evaluation, not fully designed in advance. Even the labs that build these models need extensive testing to know where they are strong and where they are weak.
A system we do not fully understand should not be treated as a guaranteed truth machine.
Language Models Became a General-Purpose Text Interface
If I had to compress recent LLM development into one sentence, I would put it this way: after 2017, language models did not just become larger; they became a general-purpose text interface. Transformers provided a parallelizable attention architecture that could be trained at scale. BERT 2 pushed the encoder-only path to the forefront of language understanding tasks. GPT-3 brought few-shot ability into mainstream attention 3. InstructGPT and later chat models showed that post-training and human feedback can matter more for “doing what the user means” than simply making the model larger.
This timeline sketches the skeleton behind today’s LLMs and agentic systems:
timeline
title Selected Milestones for LLMs and Agentic AI
2017 : Transformer
2018 : BERT
: SentencePiece
2019 : T5
: BART
: Sentence-BERT
2020 : GPT-3
: Scaling Laws
: DPR
: RAG
2022 : InstructGPT
: ReAct
: Emergent Abilities
2023 : GPT-4
: Toolformer
: AutoGen
: CAMEL
: Emergence Mirage Debate
2024-2026 : Simple composable agent patterns mature
: Graph workflows and multi-agent orchestration are documented
It is easier to understand that different models are not the same tool if we separate the major Transformer families:
| Type | Examples | How it works | Strengths | Common limits |
|---|---|---|---|---|
| Encoder-only | BERT | Reads context bidirectionally, focused on understanding | Classification, extraction, judgment, representation learning | Not good at natural long-form generation |
| Decoder-only | GPT-style models | Generates autoregressively, token by token | Conversation, writing, code generation, general completion | Can hallucinate; factual updates depend on tools or external data |
| Encoder-decoder | T5 / BART | Encodes input, then decodes output | Summarization, translation, QA, seq2seq tasks | More task-oriented; not always the best general chat backbone |
How an LLM Generates a Sentence
When a chat-style LLM generates an answer, the simplest version of the process is: split text into tokens, map them into vectors, add positional information, pass them through many layers of self-attention and feed-forward networks, output a probability distribution for the next token, use a decoding method to choose the next step, and repeat until the answer ends.
There are three important details here:
- A token is not necessarily a character or a word. It is a unit from the model’s vocabulary. Subword tokenization matters because natural language has an open vocabulary: new words, abbreviations, names, and spelling variations keep appearing. There are many ways to tokenize text. BPE, for example, uses a fixed-size vocabulary to represent variable-length subword strings 4. SentencePiece goes further by being language-independent and trainable directly from raw sentences, which is especially useful for languages like Chinese that do not use spaces between words 5.
- Training data is not a knowledge base. It is a corpus that lets the model learn conditional distributions. The GPT-3 paper describes a 175-billion-parameter autoregressive model trained mainly on Common Crawl, with books and Wikipedia also included 3. OpenAI’s public description of GPT-4 says the base model was pretrained on publicly available and licensed data, then aligned to user intent with RLHF 6. The coverage is enormous, but the biases, duplication, and uneven quality of the internet also enter the model. More data does not mean a complete understanding of the world.
- Usability comes mainly from post-training. BERT is a bidirectional representation model, and GPT-3 demonstrated few-shot transfer. But the turning point that made models feel able to “talk like a person and follow what you mean” came from methods such as InstructGPT, which used demonstrations and human preference rankings for post-training. OpenAI showed that a 1.3B InstructGPT model could beat the original 175B GPT-3 in human preference ratings 7. Whether AI feels useful is often a matter of training objective and interface design, not just parameter count.
At the output stage, the biggest difference is often not how much the model “knows,” but how you sample from its probability distribution 8. This table is the key to understanding why AI can feel like a slot machine:
| Decoding method | How it chooses the next token | Strength | Typical problem |
|---|---|---|---|
| Greedy | Always picks the highest-probability token | Stable, reproducible, cheap | Can commit early to the wrong path |
| Beam Search | Keeps several high-probability candidate sequences | Sometimes useful for structured generation | Optimizing model probability is not the same as optimizing truth or naturalness |
| Top-k / Top-p | Samples only from a high-probability subset | Balances diversity and fluency | Poor settings can make output too chaotic or too conservative |
| Self-consistency | Samples multiple reasoning paths and votes | Often improves reasoning tasks | More expensive, and the group can still be collectively wrong |
A High-Accuracy Slot Machine
The world contains many compressible patterns, and language happens to preserve traces of many of them. As model size, data, and compute grow, language model performance tends to improve in predictable ways 9. GPT-3 also showed transfer ability from zero-shot to few-shot settings when trained on large-scale datasets 3. This does not prove the model “understands” like a human, but when a task overlaps enough with the training distribution, the model can be extremely useful.
Accuracy has another source: external memory and tools, such as RAG, or Retrieval-Augmented Generation. The RAG paper points out that parametric memory stores a lot of knowledge, but has limits in access and updates for knowledge-intensive tasks. Once a model is connected to a vector index, generated output can usually become more specific and more fact-grounded 10.
So many agents that appear “smarter” are not smarter because a new mind appeared. They have external tools and better control flow.
More precisely, the model first calculates a probability distribution for the next token. The system then uses decoding strategies such as greedy sampling, top-p, and temperature to choose the next step. Raising temperature usually increases randomness and creativity in LLM output. The answer becomes more surprising, but also less stable. At the same time, the model is not spitting out random text without structure. Its behavior still depends deeply on learned statistical patterns, context, and tool availability.
flowchart LR
A[Low temperature / Greedy
High single-shot accuracy, low diversity
Stable and conservative, good for factual tasks] -->|increase randomness| B[Medium temperature / Top-p
Tradeoff between accuracy and diversity
A common balance point for everyday tasks]
B -->|increase further| C[High temperature / heavy sampling
Lower single-shot accuracy, higher diversity
Creative and unstable]
B -. sample multiple times and vote .-> D[Multi-sample voting
Higher accuracy and diversity
Quality costs latency and money]
C -. sample multiple times and vote .-> D
Agentic AI Is Orchestration, Not Magic
If an LLM is a powerful but unstable text reasoning core, an agent is the shell that connects it to the real world. In “Building effective agents,” Anthropic notes that workflows orchestrate LLMs and tools through predefined code paths, while agents let models dynamically direct their own processes and tool usage 11.
Consistently, the most successful implementations weren’t using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns. - Anthropic
From an engineering perspective, the basic agent stack is just executable human steps decomposed into system parts: model, tool definitions, state, routing, human review, logging, and tracing.
- You tell the LLM what tools are available, and the model decides from its intermediate judgment whether to call them.
- The application runs the tool, sends the output back to the model, and the model produces the final answer.
The thing that actually increases capability is not the word “agent.” It is safely connecting external data and actions to the system.
flowchart TD
A[User request] --> B[Router or Planner]
B --> C[Single LLM judges whether it is enough]
C -->|enough| D[Answer directly]
C -->|not enough| E[Retrieval or tool call]
E --> F[Search / database / API / calculator]
F --> G[Specialist agent or subtask]
G --> H[Reviewer / Verifier]
H --> I[Final output with sources or executable result]
H -->|high risk or uncertain| J[Human-in-the-loop]
J --> I
The diagram compresses ideas from ReAct 12, Toolformer 13, function calling, graph workflows, and human-in-the-loop systems into one engineering view.
Of course, not every complex task needs multiple agents. Many tasks can be handled just as well by one agent with the right tools and prompt. I think this matters especially in the 2025-2026 enthusiasm around AI agents. One of the biggest risks is confusing “more model calls” with “more system intelligence.”
| Approach | Representative context | Best for | Main benefit | Main cost |
|---|---|---|---|---|
| Single agent + tools | Function calling / skills | Clear task boundaries with a small number of tools | Lower cost, easier debugging | Too much context can confuse it |
| Planner-executor | ReAct-style patterns | Step-by-step reasoning and external verification | Inspectable trajectory, can reduce hallucination | Higher latency |
| Multi-agent | AutoGen / CAMEL / multi-agent frameworks | Multi-domain work, collaboration-like tasks, parallel subtasks | Modular, can divide labor | High coordination cost, more failure modes |
| Graph workflow | LangGraph / workflow engines | Long-running flows, persistent state, human review | Explicit control, observability, better reliability | Higher design cost |
Is AI Conscious?
In movies about artificial intelligence, AI is almost always portrayed as one of two things: human-like, or more human than humans, often with destructive or frightening consequences. Those stories tap into people’s fear of AI. The characters are compelling because they are given desire, intent, emotion, and a self. That is good drama, but it is not an engineering description of most LLM systems today.
The problem is that, as chat models become better at seeming like they are “talking with you,” people become more likely to treat them as systems with an inner point of view. A survey in Neuroscience of Consciousness found that most general-public participants were willing to attribute some degree of possible consciousness to LLMs, and the tendency was stronger among frequent users 14.
So the more rigorous statement is not “AI is obviously conscious.” It is that public discussion is moving much faster than scientific criteria. For today’s mainstream deployed systems, what we can say with confidence is this: they are systems designed for token prediction, instruction following, tool use, and workflow orchestration. Whether they have subjective experience is not established by any public scientific evidence or accepted test. In engineering terms, we do not need to assume that they have a self, desire, or inner intent in order to explain their main behavior.
The Spread of AI Slop: Is AI Useful, or Just Making Us More Tired?
As these tools become widespread, we probably need to think about a more urgent issue: AI slop. AI slop is content produced at scale with low human oversight, low accountability, and low quality thresholds.
It can be low-quality content copied directly from chatbot answers. The problem is not only hallucination. It is volume. When entire websites, comment threads, and feeds are filled with low-cost content, SEO farms and unattended content farms drive the signal-to-noise ratio of the information environment downward.
Noy and Zhang’s experiment found that, in mid-level professional writing tasks with 444 participants, ChatGPT reduced average completion time by about 40% and improved quality by about 18% 15. Brynjolfsson, Li, and Raymond’s customer support study found that AI assistance increased average productivity by about 14%, with especially large gains for novice and lower-skilled workers, up to 34% 16. At least for tasks that can be verbalized, decomposed, and evaluated, LLMs are real productivity tools.
But the spread of AI slop reminds us: “a single task became faster” does not mean “the whole information environment improved.” When everyone can generate massive amounts of plausible-looking content at very low cost, the supply curve of content shifts explosively to the right, while human attention and review capacity do not increase at the same rate. The result often looks like this:
- You finish an article faster, but readers have a harder time finding the truly valuable one in a flood of content.
- You produce a report faster, but the organization fills up with secondhand generated artifacts that cite and hallucinate each other.
- You get an answer faster, but it becomes harder to judge whether the answer is trustworthy.
Harvard and BCG’s “Jagged Technological Frontier” study makes a similar point: AI does not create a smooth capability curve. It creates a jagged frontier. For consulting tasks inside the frontier, GPT-4 users completed more tasks, worked 25.1% faster, and produced quality more than 40% higher. But for tasks outside the frontier, people using AI were 19 percentage points less likely to give correct answers than the control group 17.
Maybe the most suitable role for today’s AI looks more like this:
| Task type | AI fit | Why |
|---|---|---|
| Drafting, summarizing, rewriting, translating, formatting | High | The task is text-based, tolerant of errors, and immediately useful |
| Internal company knowledge QA | Medium-high | Very useful when paired with retrieval, permissions, and citations |
| Coding assistance, test scaffolds, documentation | Medium-high | Strong at local patterns and boilerplate, but still needs code review |
| Multi-step process automation | Medium | Needs tools, state management, and human approval to be reliable |
| Medical, legal, financial, and other high-risk judgments | Low to medium | Useful as assistance, but not as final authority; external verification is required |
At this point, “prompt engineering” can be stripped of some marketing language. A prompt is not a spell. It is an interface specification. Its importance is not in a mysterious template, but in whether you clearly state the task, context, constraints, and evaluation criteria.
The problem is that many people now believe AI tools can quickly summarize anything they want to know, so they skip the domain learning and independent thinking needed to ask good questions. When they need to solve a domain-specific problem, they cannot write a good prompt. In software engineering, this is like failing to define the requirements clearly and then expecting a high-quality application to appear.
Yes, AI tools may help people who do not know programming build applications faster. But reviewing code structure, optimizing systems, and maintaining the application over time still require substantial domain knowledge and training. Otherwise, AI can create the illusion that an outsider has already entered a specialized field, while the prompts themselves still fail to define the problem precisely. The output then becomes vague, plausible, and difficult to trust.
For people who already have professional knowledge, the picture is different. These tools can compress the tedious parts of ideation and organization, leaving more time and energy for final review and tuning.
Conclusion
If I had to close with one sentence, it would be this: today’s AI is neither a god nor a toy. It is a highly capable, probabilistic, tool-augmented system that still needs design and constraints. Its most valuable use is not replacing thought, but compressing repetitive work, search, drafting, restructuring, comparison, and preliminary reasoning. The danger is mistaking it for a source of truth before we have built verification around it.
The next time you see a headline like “AI is about to become conscious,” it is worth returning to first principles and inspecting the technical architecture underneath. That is how we can think more rationally about where AI is actually going.
References
-
BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2018) ↩
-
Language Models are Few-Shot Learners (Brown et al., 2020) ↩ ↩2 ↩3
-
Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015) ↩
-
SentencePiece: A simple and language independent subword tokenizer (Kudo & Richardson, 2018) ↩
-
Training language models to follow instructions with human feedback (Ouyang et al., 2022) ↩
-
The Curious Case of Neural Text Degeneration (Holtzman et al., 2019) ↩
-
Scaling Laws for Neural Language Models (Kaplan et al., 2020) ↩
-
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020) ↩
-
ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022) ↩
-
Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al., 2023) ↩
-
Folk psychological attributions of consciousness to large language models (Colombatto & Fleming, 2024) ↩
-
Experimental evidence on the productivity effects of generative artificial intelligence (Noy & Zhang, 2023) ↩
-
Navigating the Jagged Technological Frontier (Dell’Acqua et al., 2023) / SSRN abstract ↩