Prompt Engineering vs Context Engineering, and the relationship of context engineering to observability
Introduction
Context engineering sounds like an MCP moment to me, i.e. something you knew you needed but could not articulate before.
In this post, I explore two things:
- Context engineering and how it differs from prompt engineering, and
- The connection of context engineering to observability (an area I am interested in exploring within my teaching and in Erdos Research)
Context Engineering - the definition
Andrej Karpathy, who popularised this term, explained it well.
As per the LangChain blog:
LLMs are like a new kind of operating system. The LLM is like the CPU and its context window is like the RAM, serving as the model's working memory. Just like RAM, the LLM context window has limited capacity to handle various sources of context. And just as an operating system curates what fits into a CPU's RAM, we can think of "context engineering" as playing a similar role.
Context engineering vs Prompting
Context engineering vs prompting represents a shift in how we guide large language models (LLMs). While prompting focuses on crafting a single, well-worded instruction to elicit the desired output, context engineering involves designing the entire environment the model operates in—structuring memory, tool outputs, retrievals, and interaction history to shape reasoning across steps.
Prompting is often static and short-lived: you fine-tune a question or format examples within a limited context window. In contrast, context engineering is dynamic and multi-layered. It manages long-term memory, real-time document retrieval, scratchpads for reasoning, and system-level instructions—all feeding into the model’s token window.
This makes context engineering critical for agentic systems, where LLMs need to use tools, recall prior interactions, and reason over multiple turns. It's more akin to software design or UI/UX for language models—deciding what the model “sees” and “remembers” to make behavior consistent and reliable.
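To make the contrast concrete, here is a minimal sketch. All function and variable names are my own illustration, not from any specific framework: prompting produces a single string, while context engineering assembles a layered payload from several sources before the model ever sees it.

```python
# Prompting: a single, static instruction string.
prompt = "Summarise the attached report in three bullet points."

# Context engineering: assembling the full environment the model sees.
def build_context(system_instructions, history, retrieved_docs, scratchpad, user_message):
    """Combine every context source into the payload sent to the LLM.

    All parameters are illustrative placeholders for real subsystems
    (memory store, RAG pipeline, agent state, etc.).
    """
    return [
        {"role": "system", "content": system_instructions},
        *history,  # long-term memory / prior conversation turns
        {"role": "system", "content": f"Retrieved documents:\n{retrieved_docs}"},
        {"role": "system", "content": f"Agent scratchpad:\n{scratchpad}"},
        {"role": "user", "content": user_message},
    ]

messages = build_context(
    system_instructions="You are a careful analyst.",
    history=[{"role": "user", "content": "Earlier question..."},
             {"role": "assistant", "content": "Earlier answer..."}],
    retrieved_docs="Q3 revenue grew 12%...",
    scratchpad="Step 1: locate revenue figures.",
    user_message=prompt,
)
```

The point of the sketch is that the user's prompt is only the last element; everything above it is engineered context.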
Conceptual View
- Prompting is a front-end UX: how you ask the question.
- Context Engineering is the backend architecture: how you construct the world the model sees before answering.
From Simon Willison:
Prompting is like telling someone what to do; context engineering is deciding what room they’re in, what tools they have, and what memories they can recall.
From LangChain:
Prompting is one layer. Context engineering is modular—it defines the full structured memory stack (user messages, agent scratchpad, retrieved documents, system instructions).
Hence,
- Prompting = crafting individual instructions.
- Context Engineering = designing the full reasoning environment.
- It includes prompt design but goes further—managing memory, retrieval, agent state, tool outputs, and system instructions.
- It’s essential for agent reliability, tool use, and multi-turn reasoning, not just clever phrasing.
Analogy in cooking
Prompting: Giving a single recipe
Context engineering: Stocking the kitchen, labeling ingredients, arranging the tools, and managing leftovers across meals
Aspects of differentiation
Definition
- Prompt Engineering: Crafting the immediate input string to the LLM.
- Context Engineering: Designing all structured inputs (prompt, memory, retrieval, tools) to guide agent behavior.
Focus
- Prompt Engineering: One-shot or few-shot input optimization.
- Context Engineering: Multi-step, persistent, and modular input orchestration.
Temporal Scope
- Prompt Engineering: Single interaction or session.
- Context Engineering: Full lifecycle: history, memory, scratchpad, tool responses.
Goal
- Prompt Engineering: Make a model respond better to a direct prompt.
- Context Engineering: Shape the model’s reasoning, memory, and tool use across tasks.
Tools Used
- Prompt Engineering: Templates, instructions, examples.
- Context Engineering: Agents, memory modules, RAG, toolchains, metadata schemas.
What Context Engineering Includes (Beyond Prompting)
As per Andrej Karpathy (to paraphrase): context engineering is the art and science of filling the context window with just the right information for the next step.
That means we assemble all of this dynamically created context before the LLM invocation, in the right format and at the right level of detail, so that the LLM can produce its output.
In other words, it is all about managing information for the context window. The context window itself is a detailed topic that would need more posts to explain, but the key point is this: managing the context window, dynamically and efficiently, is the crucial distinction between prompt engineering and context engineering. Beyond prompt design, context engineering also includes managing memory, retrieval, agent state, tool outputs, and system instructions.
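As a toy illustration of managing the window dynamically, here is a sketch of a token-budget trimmer. Token counts are approximated by word counts purely for illustration; a real system would use the model's tokenizer.

```python
def fit_to_budget(chunks, budget):
    """Greedily keep the most recent chunks that fit within the token budget.

    `chunks` is ordered oldest-to-newest; token cost is approximated
    by word count for illustration.
    """
    kept, used = [], 0
    for chunk in reversed(chunks):  # walk newest-first
        cost = len(chunk.split())
        if used + cost > budget:
            break  # everything older is dropped
        kept.append(chunk)
        used += cost
    return list(reversed(kept))  # restore chronological order

history = ["turn one is quite old", "turn two", "turn three newest"]
window = fit_to_budget(history, budget=6)
```

The oldest turn no longer fits, so only the two most recent turns reach the model.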
Hence, context engineering should include at least some of the following elements, implemented by various means in different systems:
- System-level context: tools available, agent configuration, instructions.
- Scratchpad / intermediate steps: Agent reasoning traces, tool outputs, state variables.
- Retrieval pipelines (RAG): Dynamic document selection, chunking, reranking.
- Long-term memory: Persistent user history or task state.
- Schemas and formatting: Structured layouts: JSON, bullet lists, context tags, etc.
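Putting these elements together, a hypothetical agent might render its context window with explicit tags so each layer is distinguishable to the model. The tag names and layout below are my own illustration, not a standard.

```python
import json

def render_context(system, scratchpad, retrieved, memory):
    """Lay out each context element under an explicit tag so the model
    can tell the layers apart. Tag names are illustrative only."""
    sections = [
        ("<system>", system),
        ("<scratchpad>", "\n".join(scratchpad)),
        ("<retrieved>", "\n".join(retrieved)),
        ("<memory>", json.dumps(memory)),  # structured layout per a schema
    ]
    return "\n\n".join(f"{tag}\n{body}" for tag, body in sections)

context = render_context(
    system="You are a booking assistant.",
    scratchpad=["Checked calendar: free on Friday."],
    retrieved=["Policy doc: bookings need 24h notice."],
    memory={"user": "Asha", "preferred_time": "mornings"},
)
```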
Context engineering and Observability
To me, the relationship between context engineering and observability is an obvious area of exploration and research in my teaching and in Erdos Research.
Observability platforms help developers and AI teams monitor, debug, and optimize complex systems—especially those built with LLMs or microservices. They allow teams to understand what's happening inside an application by capturing data like prompts, responses, latency, and errors without modifying the system.
Key features include tracing LLM calls step-by-step, logging inputs and outputs, tracking performance metrics (latency, cost, token usage), and evaluating model outputs through both automated and human-in-the-loop methods. They also offer prompt management tools for testing, versioning, and A/B comparison, alongside dataset management and batch evaluation capabilities.
These platforms often include dashboards for visualizing model behavior, tools for collaboration (trace sharing, annotations), and support for integrations with frameworks like LangChain, LangGraph, or OpenAI’s SDK. They can be cloud-hosted or self-hosted, depending on privacy or deployment needs.
Popular observability platforms for LLM apps include LangSmith (closed-source, polished UI), Langfuse (open-source, flexible), PromptLayer, Lunary, and Helicone. These tools are essential for understanding AI behavior, preventing prompt drift, and maintaining high-quality, reliable AI systems.
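None of these platforms is required to grasp the core idea, which can be sketched in a few lines. The hypothetical decorator below captures inputs, outputs, latency, and errors around each LLM call without modifying the call itself; the trace format is my own, not any platform's.

```python
import functools
import time

TRACES = []  # in a real system these records would ship to an observability backend

def traced(fn):
    """Record inputs, outputs, latency, and errors for each call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record = {"fn": fn.__name__, "args": args, "kwargs": kwargs}
        start = time.perf_counter()
        try:
            record["output"] = fn(*args, **kwargs)
            return record["output"]
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            record["latency_s"] = time.perf_counter() - start
            TRACES.append(record)
    return wrapper

@traced
def fake_llm_call(prompt):
    # Stand-in for a real model invocation.
    return f"echo: {prompt}"

fake_llm_call("hello")
```

Real platforms add token and cost accounting, step-by-step trace trees, and evaluation on top, but the capture pattern is the same.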
UPDATE - 1
As early as February 2024, Google had the right idea about the significance of the long context window. With context engineering, we are seeing these ideas become mainstream, led by the wider industry as a whole. Also, unlike the Google approach, Andrej Karpathy seems to be speaking of enhancing existing processes to make the best use of a longer context window. This is much more doable in my view: making use of the context window at the application level.
UPDATE - 2
At the #OxfordAISummit, my colleague Anjali Jain showed an application she developed with Philip O'Shaughnessy called LUNA. It was developed using LangGraph, through which they learnt a lot about managing the context window across a series of steps. She kindly permitted me to share the lessons here.
In my own small experiments, I try to minimise context pressure first through housekeeping and architecture:
- Start with a clean design, then layer on patterns that keep the context window under control:
- Aggregation – pull related facts into one place before the model sees them.
- Work-split – off-load heavy tasks to specialised tools or smaller models.
- External memory – park raw text or logs in a vector / KV store and fetch on demand.
- Hierarchical summarisation – roll long docs up into bite-size abstracts.
- Sliding window / chunking – feed the model only the current “pane” of a long conversation.
- Metadata tagging & filters – attach labels so retrievers can skip the noise.
- Recency gates – favour fresh snippets when the topic is time-sensitive.
- On-demand retrieval – call out to search only when the prompt lacks key facts.
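Two of these patterns, sliding-window chunking and recency gating, can be sketched together. The helper names and the exponential-decay scoring rule are my own illustration.

```python
def sliding_window(turns, pane_size):
    """Feed the model only the current 'pane': the most recent turns."""
    return turns[-pane_size:]

def recency_gate(snippets, now, half_life):
    """Score snippets by exponential decay of age, favouring fresh ones.

    Each snippet is a dict with a timestamp `ts` and a `text` field;
    a snippet's score halves every `half_life` time units.
    """
    scored = [(0.5 ** ((now - s["ts"]) / half_life), s["text"]) for s in snippets]
    scored.sort(reverse=True)  # highest (freshest) score first
    return [text for _, text in scored]

turns = ["t1", "t2", "t3", "t4", "t5"]
pane = sliding_window(turns, pane_size=3)

snippets = [
    {"ts": 0, "text": "old fact"},
    {"ts": 90, "text": "fresh fact"},
]
ranked = recency_gate(snippets, now=100, half_life=50)
```

Only the last three turns and the freshest snippet would lead the assembled context, keeping the window small and current.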
However big context windows grow, they’ll still hit a ceiling, so nailing this foundation is what really makes the difference when you move from prototypes to production.
Conclusion
We live in interesting times! Never a dull moment in AI.
This area is a key part of my teaching and of Erdos Research.