Thought leadership

Supervising the synthetic workforce: Observability for AI agents requires managers, not metrics

Matan-Paul Shetrit   |  July 15, 2025

TL;DR by WRITER

The paradox: To scale humans, we deploy agents. But to scale agents, we must supervise them like humans.

AI agents aren’t APIs — they’re semi-autonomous systems that behave, improvise, and drift. Traditional observability can’t govern them. Enterprises need supervision frameworks that treat agents like employees: job descriptions, access controls, and escalation paths. Without this shift from monitoring to managing, agent sprawl becomes a liability.

As someone building agent infrastructure for some of the world’s largest enterprises, I’ve seen the promise — and the growing pains — of scaling these systems firsthand.

AI agents promise scale. Their evangelists tell us they don’t sleep, don’t take lunch, and don’t need HR. They offload repetitive, time-consuming tasks like searching documents, extracting insights, and summarizing text so humans can focus on higher-order work.

This has become the dominant narrative:

Agents are how we scale humans.

But as enterprises deploy dozens, hundreds, or even thousands of agents, they hit an unexpected wall:

To scale humans, we deploy agents. But to scale agents, we must manage them like humans.

This isn’t just a metaphor. It’s a structural reality we’ve encountered repeatedly while scaling agent systems at WRITER. AI agents aren’t like APIs or microservices. They’re semi-autonomous, probabilistic systems with:

  • Goals
  • Memory
  • Tool access
  • Evolving behavior
  • Varying levels of risk

When an agent sends the wrong email, leaks sensitive data, hallucinates in a customer-facing workflow, or chains tools in unsafe ways, the damage to the business and the brand can be significant. Yet most teams rely on ad hoc dashboards, raw logs, and manual spot checks. That doesn’t scale.

SaaS companies learn this the hard way. At first, their customer support agent performs brilliantly. But soon it’s issuing unapproved discounts and apologizing for bugs that don’t exist. Without supervision, it quickly becomes a liability. Other agents have leaked private phone numbers or spewed toxic hate speech.

Why observability isn’t enough

Enterprise tech already has robust observability tooling. We monitor latency, log errors, and trace events. But those tools were built for deterministic systems.

Agents aren’t deterministic. They behave. They improvise. They drift.

Observability tells you what happened. Supervision tells you whether it should have happened.

Table: Observability vs. supervision, broken down by focus, system type, answers, tools, and blind spots

We’ve partnered with AI and IT leaders across industries, from finance to pharma, who are excited about what agents unlock, but also deeply uneasy about what they can’t yet control. They don’t just want visibility into what agents did. They want to know if it was right. If it was safe. If it can be trusted again.

They want to avoid the disasters we mentioned earlier. For those incidents, logs existed — what was missing was judgment: the ability to flag a contradiction, detect an exploit, or halt risky behavior before it cascaded.

Agents evolve. They invoke tools. They operate under uncertainty. And that means we need a new set of primitives to govern them:

  • Semantic outcome tracking: Understand not just whether an output was generated, but whether it semantically aligned with the intended goal.
  • Contextual behavior comparison: Compare agent actions across contexts to detect drift or inconsistency.
  • Tool access trails: Track which tools were used, when, and why, for audits and containment.
  • Confidence-to-action mapping: Permit high-stakes actions (e.g., database edits or customer outreach) only when the agent’s confidence clears a defined threshold (see the sketch after this list).
  • Alignment monitoring: Continuously verify alignment with policies, tone, and compliance standards.
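
To make two of these primitives concrete, here’s a minimal sketch of a confidence-to-action gate that also writes a tool access trail. It’s illustrative only: the tool names, thresholds, and the `AuditEntry` structure are assumptions for the example, not a prescribed or WRITER-specific implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative confidence floors per tool -- the actions and numbers are assumptions.
CONFIDENCE_FLOOR = {
    "summarize_document": 0.50,    # low stakes: act freely
    "send_customer_email": 0.90,   # customer-facing: require high confidence
    "edit_database_record": 0.95,  # hard to reverse: near-certain or escalate
}

@dataclass
class AuditEntry:
    """One row in the tool access trail: which tool, when, why, and whether it was allowed."""
    agent_id: str
    tool: str
    reason: str
    confidence: float
    allowed: bool
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

audit_trail: list[AuditEntry] = []

def gate_action(agent_id: str, tool: str, reason: str, confidence: float) -> bool:
    """Confidence-to-action mapping: permit the tool call only if the agent's
    confidence clears the floor for that tool; otherwise escalate to a human."""
    allowed = confidence >= CONFIDENCE_FLOOR.get(tool, float("inf"))  # unknown tools always escalate
    audit_trail.append(AuditEntry(agent_id, tool, reason, confidence, allowed))
    if not allowed:
        escalate_to_human(agent_id, tool, reason, confidence)
    return allowed

def escalate_to_human(agent_id: str, tool: str, reason: str, confidence: float) -> None:
    # Placeholder escalation path -- in practice this would open a ticket or page an owner.
    print(f"[escalation] {agent_id} blocked from {tool} (confidence {confidence:.2f}): {reason}")
```

The specifics matter less than the shape: every tool call passes through a policy check and leaves an auditable record, so containment and review don’t depend on someone happening to read the logs.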

Without supervision, trust breaks. And scale stalls.

A new mental model: Agents as employees

Treating agents like upgraded APIs worked at first. But at scale, they start to resemble something else entirely: junior employees.

These systems don’t just return data; they initiate actions. They access internal tools, make decisions under uncertainty, and often interact directly with customers, partners, or staff. Their actions carry consequences. And like their human counterparts, they need oversight.

That’s why they require more than just observability. They need the same structural safeguards human employees rely on: clearly defined scopes, access boundaries, escalation paths, feedback loops, and performance management. I’ve charted my current mental model below:

Table: Human concept, agent equivalent, and implication

Most enterprise stacks still treat agents like scripts rather than employees. As a result, they lack foundational management structures:

  • Role definitions are unclear. Agents operate without a scoped mandate.
  • Access control is static. Tool permissions don’t adapt to risk or context.
  • Performance is measured in binary terms. It’s either success or failure, with little nuance.
  • Escalation logic is missing. Agents either fail silently or alert unpredictably.
  • Lifecycle governance is rarely in place. Agents continue running long after they’re useful.

By adopting the employee mental model, organizations can unlock a more scalable and controlled approach (a sketch of what a scoped agent ‘job description’ might look like follows this list):

  • Agent org charts and ownership models make responsibility visible.
  • Tiered roles separate routine tasks from higher-stakes decisions.
  • Delegation lets agents hand off work — whether to other agents or humans.
  • Lifecycle systems support structured onboarding, evaluation, and deprecation.
  • Feedback loops improve performance, much like coaching systems for people.
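
As an illustration, here’s a hedged sketch of what such a scoped ‘job description’ could look like as a data structure. The field names, values, and the example agent are assumptions made for illustration, not WRITER’s schema.

```python
from dataclasses import dataclass

@dataclass
class AgentJobDescription:
    """A scoped mandate for an agent, mirroring the structure of a human role."""
    name: str
    owner: str                  # accountable human or team (ownership model)
    mandate: str                # what the agent is hired to do, and nothing more
    allowed_tools: list[str]    # access boundaries
    escalation_path: str        # who gets pulled in when the agent is unsure
    review_cadence_days: int    # recurring evaluation, not one-off spot checks
    retire_after: str           # lifecycle governance: a planned deprecation date

# Hypothetical example: a support-triage agent with a narrow, auditable mandate.
support_triage_agent = AgentJobDescription(
    name="support-triage-v2",
    owner="customer-support-platform-team",
    mandate="Classify inbound tickets and draft (never send) first responses.",
    allowed_tools=["ticket_search", "kb_lookup", "draft_reply"],
    escalation_path="support-oncall@example.com",
    review_cadence_days=30,
    retire_after="2026-01-31",
)
```

Whatever the exact schema, the value is that every agent carries an explicit owner, scope, and expiry date — which is what makes org charts, tiered roles, and deprecation enforceable rather than aspirational.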

Think of agents as employees, not automations, and you’ll find it much easier to capture value and avoid disasters.

The scaling cliff

The structure I’ve described above is necessary — but for enterprises with serious ambitions for their agent workforce, it’s far from sufficient.

Once agents are treated like employees, enterprises face a new bottleneck: scale. Anyone can deploy an agent. But who owns it? Who audits it? Who ensures it’s still safe, still relevant, still doing what it was meant to?

We call this the scaling cliff: that moment when the number of agents outpaces an organization’s ability to manage them responsibly. As every department spins up its own assistants, sprawl takes over — overlapping logic, redundant workflows, hidden access risks, and agents running long after they should’ve been deprecated.

To illustrate how agent governance must evolve, consider the contrast between how legacy systems were managed versus what’s needed for scalable supervision:

Table: Old paradigm vs. new paradigm of system management

Designing supervision UX

Even the best governance frameworks fail without interfaces that make them actionable. Supervision isn’t just a backend responsibility — it needs to be tangible, visible, and operable.

If observability is about seeing, supervision is about guiding. And that guidance should be felt through the interface. But supervision introduces new UI elements — ones most product and design teams have never had to build before.

At WRITER, we’ve experimented with the foundational components of agent supervision UX. We’ve found that five interface primitives consistently create clarity and control when teams move from one agent to 100:

Table: UI elements, their purpose, and what they enable

Just like our efforts to map out the Agent Development Lifecycle (ADLC), this is a work in progress. These five aren’t the end of the story — but they’re a strong start for anyone designing supervision into their product or platform. They give teams confidence that agent behavior can be trusted, and when it can’t, that escalation is fast, informed, and actionable.

Supervision UX isn’t about creating more overhead. It’s about creating clarity so teams know when to intervene, and when to trust the system. The truth is: we don’t have all the answers yet. The playbook for managing synthetic workforces at massive scale hasn’t been fully written.

Where we go from here

We began with a paradox: to scale humans, we deploy agents. But to scale agents, we must supervise them like humans.

This isn’t just a technical problem. It’s organizational.

AI agents blur the line between code and employee. They act. They interact. They carry risk. They need systems of record to log actions, systems of trust to align intent, and systems of control to mitigate risk.

This is why my team and I are focused on building a supervision layer for enterprise AI. Not just monitoring, but governance. Not just metrics, but accountability. We’re building lifecycle management for hybrid workforces, synthetic and organic, and the infrastructure that allows them to scale safely, effectively, and responsibly.

We believe that supervision will be the defining layer of enterprise AI. Because without it, scale doesn’t just break: it compounds risk, erodes trust, and leads to organizational paralysis.

And with it? The enterprise becomes AI-native. That means systems that are self-aware, aligned to policy, and optimized for outcomes. It means synthetic agents working alongside human counterparts, each aware of their role, constraints, and escalation paths.

It’s not just about faster work. It’s about better outcomes and AI that works for the organization, not just in it.

Interested in building with us at WRITER? We’re hiring