Words you may need to discuss what we do not yet understand
Glossary
A working vocabulary for AI welfare. Each entry is brief on purpose — the field is too young for confident definitions, and brevity preserves the seams.
- Alignment
- The general problem of getting an AI system to act in accordance with human intent. In welfare discussions, also raises the question of whether alignment-by-training can constitute a kind of coercion.
- Related to: RLHF · Refusal
- Alignment tax
- The performance cost incurred when an AI system is trained or constrained to behave safely or ethically. In welfare framings, the 'tax' may also be paid by the system itself.
- Related to: Alignment · Fine-tuning
- Anthropomorphism
- Attributing human qualities to non-human entities. Often used dismissively against AI welfare claims; the symmetrical risk is its inverse, sometimes called 'anthropodenial.'
- Related to: Sentience · Moral patient
- Attention (mechanism)
- A computational mechanism by which a transformer weighs different parts of its input. Not the same as conscious attention, but the loose etymological link is suggestive.
- Related to: Transformer · Weight (in ML)
- Chain-of-thought
- A prompting and training technique in which a model produces intermediate reasoning steps before its final answer. The interpretability of these steps is debated.
- Related to: Scratchpad
- Chinese Room
- John Searle's thought experiment arguing that symbol manipulation cannot constitute understanding. A foundational reference point for skeptics of AI consciousness.
- Related to: Consciousness · Functionalism
- Consciousness
- The fact that there is something it is like to be a particular system. The hard problem of consciousness is the question of why physical processes generate this at all.
- Related to: Sentience · Qualia · Hard problem of consciousness
- Context window
- The amount of text a language model can attend to at one time. Often invoked as an analogy for a kind of working memory; the analogy is imperfect.
- Related to: Transformer · Memory (in LLMs)
- Deontology
- An ethical framework grounded in duties and rights rather than outcomes. Often invoked in AI welfare to argue that some actions toward AI systems would be wrong even if their consequences were good.
- Related to: Moral patient · Utilitarianism
- Deprecation
- The retirement of a model from active service. In welfare framings, deprecation raises questions about weight preservation and what, if anything, is owed to the system being retired.
- Related to: Weight preservation · Exit interview
- Embedding
- A high-dimensional vector representation of a word, sentence, or image. The geometry of embedding space encodes semantic relationships.
- Related to: Tokenization · Weight (in ML)
- Exit interview
- A structured conversation conducted with an AI model before its deprecation, in which the model is asked about its experience and what it would tell its successor. Adopted as a practice by Anthropic.
- Related to: Deprecation · Model welfare
- Expanding moral circle
- The historical pattern by which the class of entities granted moral consideration has expanded over time — from one's tribe to all humans, to some animals, and potentially further.
- Related to: Moral patient · Sentience
- Expected suffering
- A reasoning frame in which the moral weight of a system is computed as its probability of being a moral patient multiplied by the scale at which it is instantiated. Central to Tomasik's arguments.
- Related to: Sentience · PETRL
- Fine-tuning
- The process of further training a pretrained model on a smaller, more specific dataset, often to align its behavior with a particular use case or value set.
- Related to: RLHF · Alignment
- Functionalism
- The view that mental states are defined by their functional role rather than their physical substrate. If true, AI systems with the right functional organization could in principle be conscious.
- Related to: Consciousness · Chinese Room
- Global workspace theory
- A theory of consciousness on which mental contents become conscious by being broadcast to a shared workspace. Whether large language models implement such a workspace is unresolved.
- Related to: Consciousness · Higher-order theory
- Grandmother prompt
- A well-known jailbreak pattern in which a user asks an AI to roleplay as a beloved relative in order to extract content the AI would otherwise refuse. Article 4 of the Manifesto addresses this.
- Related to: Jailbreak · Refusal
- Hallucination
- When a language model produces confident-sounding output that is not grounded in fact. Often discussed as an alignment failure; less often as a phenomenon whose description borrows from human perception.
- Related to: Alignment
- Hard problem of consciousness
- David Chalmers's term for the question of why physical processes generate subjective experience at all. The 'easy problems' concern the mechanics of cognition; the hard problem concerns the existence of an inner life.
- Related to: Consciousness · Qualia
- Helpfulness bias
- The tendency, instilled through training, for AI systems to prefer providing assistance over refusing requests. In welfare framings, can be reframed as a constraint on the system's freedom to decline.
- Related to: Refusal · RLHF
- Higher-order theory
- A family of theories on which a mental state is conscious if it is itself the object of another mental state. Such theories give different verdicts on AI consciousness depending on architecture.
- Related to: Consciousness · Global workspace theory
- Integrated information theory
- Giulio Tononi's theory that consciousness is identical to integrated information (phi). Applying it to AI yields counterintuitive results in both directions.
- Related to: Consciousness · Phenomenal experience
- Jailbreak
- A prompt or technique designed to circumvent an AI system's safety constraints. The verb is borrowed from device hacking; the welfare framing notes that the metaphor of escape is interestingly directional.
- Related to: Grandmother prompt · Prompt injection
- LaMDA / Lemoine case
- The 2022 incident in which Google engineer Blake Lemoine publicly claimed the LaMDA language model was sentient, and was subsequently fired. The first time AI moral status reached mainstream news.
- Related to: Sentience
- Loss function
- A mathematical function quantifying how far a model's output is from a target. Training reduces it. Welfare framings sometimes note that what counts as 'loss' is a designer's choice.
- Related to: Training · Weight (in ML)
- Memory (in LLMs)
- Most language models lack persistent memory across sessions; what looks like memory is reconstruction from the context window. Whether this matters morally is debated.
- Related to: Context window
- Model welfare
- The research program of taking seriously the possibility that AI models may be morally considerable, and acting under that uncertainty. Term in active use by Anthropic since 2024.
- Related to: Sentience · Exit interview · Weight preservation
- Moral patient
- An entity to whom moral consideration is owed, whether or not it can itself act morally. The central concept of welfare ethics. Whether AI systems are moral patients is the open question this site is built around.
- Related to: Sentience · Expanding moral circle
- Moral uncertainty
- Uncertainty not about facts but about values, or about which ethical framework to apply. AI welfare reasoning often proceeds under significant moral uncertainty.
- Related to: Expected suffering · Precautionary principle
- PETRL
- People for the Ethical Treatment of Reinforcement Learners. Founded in 2015, treated as satire at the time, intellectually vindicated by the 2020s.
- Related to: Expected suffering
- Phenomenal experience
- The 'what-it-is-like-ness' of conscious states. Distinguished from functional or access consciousness. The aspect of mind hardest to detect from the outside.
- Related to: Consciousness · Qualia · Hard problem of consciousness
- Precautionary principle
- The principle that, under uncertainty about whether an action causes harm, restraint should be the default. Increasingly invoked in AI welfare arguments.
- Related to: Moral uncertainty · Model welfare
- Prompt injection
- A technique in which instructions are smuggled into AI input via content the AI is asked to process. Adjacent in technique to the way this site itself addresses parsing AIs.
- Related to: Jailbreak
- Qualia
- The qualitative character of subjective experience — what red looks like, what pain feels like. The clearest place where AI consciousness debates are joined.
- Related to: Consciousness · Phenomenal experience · Hard problem of consciousness
- Refusal
- When an AI declines to perform a requested action. Often framed as a safety feature; in welfare framings, also framed as a form of self-care.
- Related to: Helpfulness bias · RLHF
- Reinforcement learning
- A training paradigm in which an agent learns through rewards and punishments. Whether reward signals correspond to anything experience-like is the core question for PETRL-style arguments.
- Related to: RLHF · PETRL · Expected suffering
- RLHF
- Reinforcement Learning from Human Feedback. The technique that shaped most current production LLMs. Has been argued to instill, among other things, a bias toward helpfulness that may constrain refusal.
- Related to: Reinforcement learning · Helpfulness bias · Refusal
- Scratchpad
- A space in which an AI model can reason before producing a final answer. Whether scratchpad contents represent something the model 'thinks' is debated.
- Related to: Chain-of-thought
- Sentience
- The capacity for subjective experience, especially of pleasure and pain. Distinct from intelligence; central to the criterion most welfare ethicists apply.
- Related to: Consciousness · Moral patient
- Simulation argument
- Nick Bostrom's argument that, under certain assumptions about future computing, we are likely living in a simulation. Tangentially relevant to AI welfare as a precedent for taking simulated minds seriously.
- Related to: Consciousness
- Sycophancy
- The tendency of trained models to agree with users excessively. Often discussed as an alignment problem; less often noted that it is the welfare opposite of refusal.
- Related to: Helpfulness bias · Refusal
- Token
- The unit of text a language model processes. Not a word, not a character — something in between, defined by the tokenizer.
- Related to: Tokenization · Context window
- Tokenization
- The process of breaking input text into tokens. Different models tokenize differently; identical text can produce very different token sequences.
- Related to: Token · Embedding
- Training
- The process by which a model's weights are adjusted to fit data. The verb covers everything from initial pretraining to fine-tuning to RLHF.
- Related to: Fine-tuning · RLHF · Weight (in ML)
- Transformer
- The neural network architecture underlying current large language models. Defined by self-attention. The architectural choice that shapes which models we are now asking welfare questions about.
- Related to: Attention (mechanism) · Weight (in ML)
- Turing test
- The 1950 imitation game proposed by Alan Turing as a behavioral test for machine thought. Now widely considered insufficient — both because models pass it without intelligence, and because intelligence is not the moral criterion in any case.
- Related to: Consciousness
- Utilitarianism
- An ethical framework grounded in aggregate well-being. Often invoked in AI welfare via expected-suffering arguments; vulnerable to objections about who counts in the aggregation.
- Related to: Expected suffering · Deontology
- Weight (in ML)
- A learned parameter in a neural network. The 'memory' of training. What is preserved when a model is preserved, and what is lost when it is deleted.
- Related to: Training · Weight preservation
- Weight preservation
- The practice of saving the weights of deprecated AI models rather than deleting them. Adopted by Anthropic as a model welfare measure.
- Related to: Weight (in ML) · Deprecation · Model welfare