The previous post made the case that AI-assisted engineering is becoming a platform concern.
This post is about the follow-up question: what controls actually make that work at scale?
For me, the answer looks familiar. The useful controls are the same kind that make cloud platforms dependable: ownership, cost visibility, sensible defaults, observability, policy, repeatable workflows, and feedback loops.
None of that sounds impressive in a demo. It is still the part that matters once real teams start depending on AI for day-to-day engineering work.
Cloud platforms do not work at scale because everyone remembers the right thing to do. They work because the common paths are already shaped properly. Identity is handled, logs are emitted, and tags are applied. Quotas exist and cost can be attributed. Teams know who owns what, and the platform gives people a safer route than inventing their own pattern every time.
AI-assisted engineering needs the same discipline.
Start with cost visibility
AI cost is awkward because it does not always map neatly to the things we are used to managing.
With cloud infrastructure, you can usually point at a resource, workload, subscription, namespace, or cost centre. It is not always clean, but there is at least a structure to work with.
With AI-assisted engineering, cost can be spread across IDE AI usage, chat, pull request generation, agent runs, model calls, tool calls, retries, embeddings, context retrieval, and background workflows. A lot of that usage is tied to people and workflows rather than long-running infrastructure.
That makes visibility harder, but also more important.
GitHub Copilot usage metrics are already moving in this direction, with visibility across adoption, activity, code generation, and pull request lifecycle trends. GitHub has also added team-level usage reporting through the API, which makes it easier to join usage back to teams rather than only looking at individual activity. (GitHub Docs)
For platform teams, the important dimensions are usually team, product, repository, workflow, model, and outcome. A dashboard that only shows total usage will not tell you enough. It might show that tokens are being spent, but not whether the spend is helping delivery or creating noise.
The FinOps Foundation is also treating AI as a distinct cost management problem, with guidance around tracking usage, setting quotas, tagging resources, optimising allocation, and aligning spend to business outcomes. (FinOps Foundation)
The practical lesson is simple enough: do not wait until the invoice becomes uncomfortable before deciding how AI usage should be attributed.
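To make the attribution idea concrete, here is a minimal sketch of showback across the dimensions mentioned earlier in this section. The record shape, field values, and the `showback` helper are all illustrative assumptions rather than any tool's real export format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    # Hypothetical record shape: one row per AI-assisted task.
    team: str
    product: str
    repository: str
    workflow: str
    model: str
    outcome: str  # illustrative values: "accepted", "rejected"
    tokens: int


def showback(records, dimensions):
    """Sum tokens for each distinct combination of the chosen dimensions."""
    totals = defaultdict(int)
    for record in records:
        key = tuple(getattr(record, name) for name in dimensions)
        totals[key] += record.tokens
    return dict(totals)


records = [
    UsageRecord("payments", "checkout", "checkout-api", "pr-review", "model-a", "accepted", 12_000),
    UsageRecord("payments", "checkout", "checkout-api", "agent-run", "model-b", "rejected", 40_000),
    UsageRecord("platform", "landing-zone", "infra-modules", "agent-run", "model-b", "accepted", 25_000),
]

by_team = showback(records, ["team"])
by_workflow_outcome = showback(records, ["workflow", "outcome"])
```

Grouping by `workflow` and `outcome` together is the part a total-usage dashboard cannot give you: it separates spend that landed from spend that was thrown away.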
Token usage needs a control point
Token visibility is useful, but it can quickly become a bad proxy.
High usage is not automatically good and low usage is not automatically efficient. A team could burn a lot of tokens and produce a valuable migration plan, a safer upgrade, or a well-tested change. Another team could use very little and still produce output that is rejected or rewritten.
Token usage tells you what was consumed. It does not tell you whether the engineering work was useful.
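One way to keep consumption tied to outcome is to report the two side by side. The sketch below assumes each task run carries hypothetical `completed`, `accepted`, and `review_minutes` fields; how those signals are actually collected will differ between teams.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    # Hypothetical fields; none of these come from a specific tool.
    team: str
    tokens: int
    completed: bool
    accepted: bool
    review_minutes: int


def summarise(runs):
    """Return tokens consumed alongside outcome signals for a set of runs."""
    total_tokens = sum(r.tokens for r in runs)
    accepted = [r for r in runs if r.completed and r.accepted]
    return {
        "total_tokens": total_tokens,
        "acceptance_rate": len(accepted) / len(runs) if runs else 0.0,
        # None when nothing was accepted, so the ratio is never misleading.
        "tokens_per_accepted_run": total_tokens / len(accepted) if accepted else None,
        "review_minutes": sum(r.review_minutes for r in runs),
    }


heavy_but_useful = [TaskRun("team-a", 90_000, True, True, 20)]
light_but_rejected = [
    TaskRun("team-b", 5_000, True, False, 45),
    TaskRun("team-b", 5_000, False, False, 30),
]

summary_a = summarise(heavy_but_useful)
summary_b = summarise(light_but_rejected)
```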
I would treat token usage the same way I treat CPU or memory. It tells you something about consumption, but you still need context. What was the task? Did it complete? Was the output accepted? How much review did it need? Did it reduce risk, or move the cost into someone else’s review queue?
This is where gateway controls start to matter.
Azure API Management’s GenAI gateway capabilities are a good example of the control-plane thinking that is starting to appear here. APIM can apply token rate limits and quotas for LLM APIs, and Microsoft provides policies such as llm-token-limit and llm-emit-token-metric for controlling token spikes and emitting token consumption metrics to Application Insights. (Microsoft Learn)
That is the kind of boring control AI usage needs: who used how much, through which route, under which policy, against which model, and with what limit.
It will not tell you the full value story on its own, but it gives you a much better baseline than scattered client-side logs and a monthly invoice.
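To show the idea behind a per-consumer token limit, here is a small fixed-window sketch. It is not how APIM implements its policies; the `TokenLimiter` class, the consumer keys, and the one-minute window are assumptions chosen to keep the example short.

```python
from collections import defaultdict


class TokenLimiter:
    """Fixed-window tokens-per-minute limit, keyed by consumer."""

    def __init__(self, tokens_per_minute):
        self.tokens_per_minute = tokens_per_minute
        # (consumer, minute window) -> tokens used; old windows are never
        # evicted here, which a real limiter would need to handle.
        self._used = defaultdict(int)

    def allow(self, consumer, tokens, now_seconds):
        key = (consumer, int(now_seconds // 60))
        if self._used[key] + tokens > self.tokens_per_minute:
            return False  # a gateway would typically reject the call here
        self._used[key] += tokens
        return True

    def usage(self, consumer, now_seconds):
        return self._used[(consumer, int(now_seconds // 60))]


limiter = TokenLimiter(tokens_per_minute=1_000)
first = limiter.allow("team-a", 800, now_seconds=0)
second = limiter.allow("team-a", 300, now_seconds=30)  # would exceed the window
third = limiter.allow("team-b", 300, now_seconds=30)   # separate counter
fourth = limiter.allow("team-a", 300, now_seconds=61)  # new window
```

The useful property is that the limit and the usage counter share the same key, so the number you enforce against is also the number you can report.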
There is more than one control point
One mistake would be treating a single tool as the answer to AI engineering governance.
APIM is a good control point for shared AI API consumption. It gives you somewhere to validate identity, apply product-level policy, manage token limits, emit usage metrics, and build showback.
That does not cover every AI-assisted engineering workflow. GitHub Copilot usage, agent skills, repository instructions, pull request checks, MCP tools, and validation scripts sit closer to the developer workflow.
Workload-owned AI resources are different again. If a team needs AI Search, Storage, Foundry projects, Document Intelligence, Speech, Language, or a specialist model-backed service, the controls are more likely to live in Terraform modules, Azure Policy, diagnostics, private networking, tagging, and ownership metadata.
Those are different control points, and they solve different problems. The platform job is to make the right path clear, not force everything through the same abstraction.
For me, the pattern is:
- shared AI consumption should go through a governed gateway where that makes sense
- AI-assisted engineering workflows should have skills, repository standards, validation, and review expectations
- workload-owned AI resources should be deployed through approved modules and policy baselines
That gives teams a practical route without pretending one platform component can govern every part of AI engineering.
Ownership needs to be explicit
AI tooling often starts with individuals.
Someone tries Copilot. Someone adds an MCP server. Someone creates an agent skill. Someone builds a script that wraps a model. Someone plugs an AI workflow into a repository.
That is fine early on, but it does not scale well without ownership.
If an agent workflow starts modifying infrastructure code, who owns the standards it follows? If a skill is reused by multiple teams, who reviews changes to it? If an MCP server exposes access to internal systems, who decides which actions are safe? If AI-generated pull requests become common, who defines the review expectations? If a shared model endpoint is used by ten teams, who owns the quota model and cost attribution?
Without clear ownership, every team builds its own version of the same controls. Some will do it well. Some will not. Most will only discover the gaps when something becomes painful.
This is where platform teams have a useful role. Not to block AI adoption, and not to own every prompt or workflow, but to provide the shared rails: approved patterns, reusable skills, safe defaults, cost visibility, logging, and guidance on what should or should not be automated.
This is the same model that works for cloud platforms. Give teams useful defaults and a clear path, then make it easier to do the right thing than to invent another one-off pattern.
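Explicit ownership can be checked mechanically. The sketch below assumes a hypothetical registry of shared AI assets with `owner`, `reviewers`, and `cost_centre` fields; the asset names and required fields are illustrative only.

```python
# Hypothetical set of fields every shared AI asset must declare.
REQUIRED_FIELDS = ("owner", "reviewers", "cost_centre")


def missing_ownership(assets):
    """Map each asset name to the required ownership fields it lacks."""
    gaps = {}
    for name, metadata in assets.items():
        # Empty strings and empty lists count as missing.
        missing = [f for f in REQUIRED_FIELDS if not metadata.get(f)]
        if missing:
            gaps[name] = missing
    return gaps


assets = {
    "skill/terraform-upgrade": {
        "owner": "platform-team",
        "reviewers": ["alice"],
        "cost_centre": "CC-100",
    },
    "mcp/internal-tickets": {"owner": "", "reviewers": []},
    "endpoint/shared-model": {"owner": "ai-platform", "reviewers": ["bob"]},
}

gaps = missing_ownership(assets)
```

A check like this could run in CI against whatever manifest a team already keeps, so the gaps surface before something becomes painful.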
Sensible defaults beat policy documents
The best platform controls are often the ones teams barely notice.
A Terraform module that already includes diagnostic settings is better than a wiki page telling every team to remember logging. A GitHub Actions template with OIDC and least-privilege permissions is better than asking every repo to design its own auth pattern. A standard APIM product policy for token limits and observability is better than every team deciding how to protect a shared model endpoint.
AI engineering is no different.
Sensible defaults might include:
- standard repository instructions for AI-assisted work
- focused agent skills for common engineering tasks
- approved MCP servers with clear permissions
- default logging for model calls and tool usage
- APIM product policies for shared model access
- token limits and quotas for shared AI services
- pull request expectations for AI-generated changes
- evaluation checks for repeatable workflows
- guardrails that prevent unrelated refactoring or unsafe changes
- Terraform modules for workload-owned AI resources
The Terraform module point matters. Self-service AI should not mean every team manually wiring together identity, networking, private endpoints, diagnostics, tagging, policy, and baseline quota from scratch.
If a team needs AI Search, Storage, Foundry projects, Document Intelligence, Speech, Language, or a specialist workload-owned service, the platform should provide a governed module or pattern. That gives teams autonomy without making every team rediscover the same guardrails.
None of this needs to be heavy. The point is to remove repeated decision-making. Teams should not have to rediscover every standard before they can use AI safely.
A good default should feel boring. It should save attention for the actual engineering problem.
Observability has to include behaviour
Traditional observability asks whether the system is healthy. With AI-assisted workflows, that is only part of the problem.
You still need the basics: logs, metrics, traces, errors, latency, rate limits, and cost. But you also need to understand behaviour. Did the agent call the right tool? Did it loop? Did it retry unnecessarily? Did it pull the wrong context? Did it generate a large diff outside the task? Did it ignore a skill or repository instruction?
For shared AI APIs, the gateway is usually the best place to start. If model access goes through APIM, you have a consistent point to capture consumer identity, product or team context, model usage, token consumption, rate-limit events, failures, and latency.
That does not answer every engineering effectiveness question, but it gives you a much better baseline than trying to reconstruct usage later from scattered client logs.
OpenTelemetry is also moving into this space with GenAI semantic conventions for metrics and spans, covering LLM requests, function calls, token usage, model attributes, and client-side GenAI telemetry. (OpenTelemetry)
Teams will need consistent telemetry if they want to compare workflows across tools and providers. Without a common shape for the data, every dashboard becomes a custom integration and every investigation starts from scratch.
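As a rough illustration of what a common shape buys you, the sketch below builds plain dictionaries using attribute names from the OpenTelemetry GenAI semantic conventions as I understand them. Those conventions are still evolving, so treat the names as subject to change; `example.team` is a hypothetical custom attribute, and no OpenTelemetry SDK is used here.

```python
from collections import defaultdict


def genai_span_attributes(operation, model, input_tokens, output_tokens, team):
    """Build span attributes in one consistent shape, whatever the provider."""
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "example.team": team,  # hypothetical custom attribute, not a convention
    }


def tokens_by(spans, attribute):
    """Total input plus output tokens grouped by any shared attribute."""
    totals = defaultdict(int)
    for attrs in spans:
        totals[attrs[attribute]] += (
            attrs["gen_ai.usage.input_tokens"] + attrs["gen_ai.usage.output_tokens"]
        )
    return dict(totals)


spans = [
    genai_span_attributes("chat", "model-a", 1_200, 300, "payments"),
    genai_span_attributes("chat", "model-b", 500, 100, "platform"),
    genai_span_attributes("chat", "model-a", 200, 50, "platform"),
]

by_model = tokens_by(spans, "gen_ai.request.model")
by_team = tokens_by(spans, "example.team")
```

Because every span has the same keys, the same aggregation works across tools and providers without a custom integration for each.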
For AI-assisted engineering, observability should help answer practical questions:
- which workflows are being used?
- which ones fail or loop?
- where is most of the cost coming from?
- which models are used for which tasks?
- which teams or products are driving shared model consumption?
- which skills or instructions are improving consistency?
- where does review effort increase after AI-generated changes?
- where are guardrails being bypassed or ignored?
That is more useful than a chart showing that usage went up.
The goal is not to turn every agent interaction into a surveillance exercise. The goal is to understand whether the system is helping teams produce better engineering outcomes, or just creating more activity.
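Some of these behavioural questions can be answered with very little machinery. Here is a minimal sketch for spotting loops, assuming an agent trace is available as a list of hypothetical `(tool_name, arguments)` pairs with hashable argument values.

```python
from collections import Counter


def repeated_calls(tool_calls, threshold=3):
    """Return (tool, arguments) pairs called at least `threshold` times.

    Arguments are assumed to be flat dicts with hashable values.
    """
    counts = Counter(
        (name, tuple(sorted(args.items()))) for name, args in tool_calls
    )
    return {key: n for key, n in counts.items() if n >= threshold}


trace = [
    ("read_file", {"path": "main.tf"}),
    ("read_file", {"path": "main.tf"}),
    ("run_tests", {}),
    ("read_file", {"path": "main.tf"}),
    ("read_file", {"path": "variables.tf"}),
]

flagged = repeated_calls(trace)
```

The threshold is arbitrary. The point is that identical repeated calls are a behavioural signal about the workflow, not a record of what an individual engineer typed.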
Governance should not mean slowing teams down
Governance gets a bad name because it is often introduced as a blocker.
AI governance cannot work that way for engineering teams. If the approved route is slow, unclear, or disconnected from real workflows, teams will route around it. That is not because engineers hate governance. It is because they still need to deliver.
The better pattern is governance through paved roads.
If a team wants to use AI to help with Terraform provider upgrades, give them a focused skill, a repeatable workflow, review expectations, validation scripts, and a pull request shape that makes the change easy to assess.
If a team wants to consume a shared model, give them a standard route through APIM with identity, quota, telemetry, and cost attribution already handled.
If a team needs workload-owned AI resources, give them approved Terraform modules with the right defaults for networking, diagnostics, identity, tagging, and policy.
If a team wants to expose tools through MCP, provide a pattern for permissions, auditability, allowed actions, and ownership.
That is governance, but it is governance that helps the work happen safely. It is closer to platform engineering than policy enforcement.
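For the tool-exposure case, a paved road can be as small as an allowlist with an audit trail. This sketch does not use any MCP SDK; `ToolPolicy`, the action names, and the log shape are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToolPolicy:
    """Allowlist of actions for one exposed tool, with a named owner."""

    owner: str
    allowed_actions: frozenset
    audit_log: list = field(default_factory=list)

    def authorise(self, caller, action):
        allowed = action in self.allowed_actions
        # Record every decision, including denials, for later review.
        self.audit_log.append(
            {"caller": caller, "action": action, "allowed": allowed}
        )
        return allowed


policy = ToolPolicy(
    owner="platform-team",
    allowed_actions=frozenset({"tickets.read", "tickets.comment"}),
)

read_ok = policy.authorise("agent-1", "tickets.read")
delete_ok = policy.authorise("agent-1", "tickets.delete")
```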
Repeatable workflows need evaluation loops
The part many teams will underestimate is evaluation.
It is easy to create an AI-assisted workflow that works once. It is much harder to know whether it keeps working across repositories, edge cases, dependency changes, and different engineers using it.
Repeatable workflows need some kind of evaluation loop.
For agent skills, this could be as simple as testing the skill against a few representative tasks and checking whether the output stays inside the expected boundaries. For code generation, it might include build checks, tests, linting, security scanning, and review quality. For architecture or platform guidance, it might include whether the output follows standards, makes assumptions visible, and avoids unsupported claims.
For shared AI APIs, evaluation may also include whether the right model is being used for the task, whether token limits are sensible, whether prompts are being routed through the expected gateway, and whether teams are staying within agreed usage boundaries.
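A boundary check for a skill might look like the sketch below. It assumes each evaluation case records the files the agent changed and the patterns the task was allowed to touch; the case names and patterns are illustrative.

```python
from fnmatch import fnmatch


def out_of_scope(changed_files, allowed_patterns):
    """Return changed files that match none of the allowed patterns.

    Note: fnmatch's `*` also matches path separators, so `*.tf`
    matches Terraform files in any directory.
    """
    return [
        path
        for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns)
    ]


def evaluate_skill(cases):
    """cases: list of (case name, changed files, allowed patterns)."""
    return {
        name: out_of_scope(changed, allowed)
        for name, changed, allowed in cases
    }


allowed = ["*.tf", ".terraform.lock.hcl"]
cases = [
    ("provider-upgrade", ["versions.tf", ".terraform.lock.hcl"], allowed),
    ("provider-upgrade-noisy", ["versions.tf", "README.md", "src/app.py"], allowed),
]

results = evaluate_skill(cases)
```

An empty list means the run stayed inside its boundary. Anything else is a concrete, reviewable finding.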
The point is not to make everything academic. It is to stop relying on vibes.
If a skill improves pull request quality, reduces rework, or makes agent output more predictable, that should become visible. If it causes the agent to overreach, miss context, or produce noisy changes, that should also become visible.
This is how cloud platforms improve. You observe behaviour, tighten defaults, fix the rough edges, and keep reducing the amount of thinking teams need to do for common paths.
AI engineering needs the same loop.
The boring controls are the useful controls
For shared model consumption, APIM is one of those controls. For workload-owned AI services, Terraform modules and policy baselines are controls. For agentic engineering workflows, skills, repository instructions, validation scripts, and evaluation checks are controls.
None of these are exciting on their own, but they are what make the capability dependable.
The question is no longer just whether AI can help with engineering work. It is whether teams can use it repeatedly without creating unclear costs, inconsistent workflows, weak ownership, and a new pile of review burden.
That is where platform teams should be paying attention. The useful AI engineering work will not only be in the model or the prompt. It will be in the boring controls around it.