I have been looking at AI billing, token visibility, agent skills, and MCP tooling across a few recent pieces of work, and the same pattern keeps showing up: the hard part is no longer proving that agents can do something useful. The hard part is making that work repeatable, reviewable, safe, and worth the cost.
The first wave has mostly been about proving capability. Agents can explain repositories, generate tests, change code, and open pull requests. Useful, but only the starting point.
Once teams start using these tools regularly, the harder question is not whether an agent can complete a task once. It is whether the workflow can be repeated, reviewed safely, kept within sensible cost, and improved over time.
I have seen this become more obvious in practical tasks like Terraform provider upgrades, Redis migration planning, APIM policy authoring, and repository-specific standards. The agent can often produce something useful, but the real value comes when the workflow is guided by clear examples, expected patterns, validation steps, and boundaries for what should and should not change.
That is when AI-assisted engineering starts to look less like a productivity feature and more like a platform capability.

Usage is not value
Token visibility is a good example. It helps to know which tools, models, users, repositories, and workflows are driving consumption; without that, teams are guessing. But token usage on its own does not tell you whether the work was valuable.
A high-token session might be worthwhile if it helps with a difficult migration, reduces review effort, or gets a team through work that would otherwise take days. A low-token interaction can still be waste if the output is rejected, rewritten, or ignored.
The useful question is not “how do we use fewer tokens?” It is “what did those tokens produce?”
That means looking at task success, review effort, rework, guardrail failures, context quality, and whether the change helped the team move forward.
Cloud cost is similar: compute spend is not good or bad in isolation. You look at whether the workload is right-sized, reliable, secure, observable, and justified. AI usage needs the same sort of discipline.

Repeatability changes the measurement problem
This is one reason I keep coming back to agent skills.
A good skill gives the agent guidance before it produces or changes output. It can capture standards, examples, expected behaviour, and the boundaries of the task. That makes the work easier to repeat and easier to compare.
Without that, every AI-assisted task depends heavily on the prompt. One engineer gives detailed context, another gives a vague instruction, someone else pastes half the repository into the context window, and the results are difficult to reason about.
You can still measure token usage in that world, but you are not measuring a consistent workflow.
Skills do not make the output perfect. They give the agent a better starting point and make the behaviour easier to review, improve, and test over time.
The interesting direction is not bigger prompts. It is smaller pieces of reusable engineering judgement that can be applied consistently.
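As a rough illustration of what a small, reusable piece of guidance could look like as data, here is a minimal sketch. The `Skill` class, its field names, and the example values are all hypothetical; this is not the format of any particular agent tool, only one way to make standards, examples, and task boundaries explicit and consistent.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """Hypothetical container for reusable task guidance (not a real tool's format)."""
    name: str
    standards: list[str] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)
    out_of_scope: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Turn the skill into a short, consistently structured block of guidance."""
        sections = [
            ("Standards", self.standards),
            ("Examples", self.examples),
            ("Do not change", self.out_of_scope),
        ]
        lines = [f"Skill: {self.name}"]
        for title, items in sections:
            if items:  # skip empty sections so the guidance stays small
                lines.append(f"{title}:")
                lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


# Illustrative values only.
provider_upgrade = Skill(
    name="terraform-provider-upgrade",
    standards=["Pin provider versions explicitly"],
    out_of_scope=["Unrelated module refactoring"],
)
print(provider_upgrade.render())
```

Because every task of the same type starts from the same rendered guidance, differences in results are easier to attribute to the task rather than to whoever wrote the prompt.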

Context needs boundaries
A common mistake is assuming the agent should have as much context as possible.
Sometimes it does need more. Repository structure, platform standards, architecture decisions, API contracts, and existing conventions can all change the quality of the output.
But giving the agent everything is not a strategy. It increases cost, slows the workflow down, and gives the model more to interpret. It can also make the task less focused, especially when old decisions, unrelated files, or broad documentation get pulled in by accident.
The better pattern is controlled context reuse.
That might mean focused agent skills, small reference files, repository conventions, MCP servers, validation scripts, or evaluation checks. The implementation will vary, but the principle is the same: give the agent the right context for the task, not every bit of context available.
Tooling around skills, MCP, and skill evaluation starts to matter here. Not because every team needs another framework, but because repeatable AI-assisted engineering needs more than someone remembering the perfect prompt.
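One way to picture controlled context reuse is a selector that only considers sources tagged for the task and stops at a token budget. The sketch below is an assumption-heavy illustration: `ContextSource`, `select_context`, the file names, and the token estimates are all invented, and "smallest relevant source first" is a deliberately naive heuristic rather than a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextSource:
    name: str
    task_types: frozenset[str]  # which kinds of task this source is relevant to
    approx_tokens: int          # rough size estimate, however you choose to measure it


def select_context(sources, task_type, token_budget):
    """Pick sources tagged for the task, smallest first, without exceeding the budget."""
    relevant = sorted(
        (s for s in sources if task_type in s.task_types),
        key=lambda s: s.approx_tokens,
    )
    chosen, used = [], 0
    for source in relevant:
        if used + source.approx_tokens <= token_budget:
            chosen.append(source)
            used += source.approx_tokens
    return chosen, used


# Hypothetical sources with made-up sizes.
SOURCES = [
    ContextSource("terraform-conventions.md", frozenset({"terraform-upgrade"}), 800),
    ContextSource("apim-policy-examples.md", frozenset({"apim-policy"}), 1200),
    ContextSource(
        "architecture-decisions.md",
        frozenset({"terraform-upgrade", "apim-policy"}),
        3000,
    ),
    ContextSource("full-wiki-export.md", frozenset({"terraform-upgrade"}), 50000),
]

chosen, used = select_context(SOURCES, "terraform-upgrade", token_budget=5000)
print([s.name for s in chosen], used)
# ['terraform-conventions.md', 'architecture-decisions.md'] 3800
```

The broad wiki export is relevant on paper but never fits the budget, which is the point: relevance alone is not enough reason to include something.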

Guardrails are part of engineering quality
AI guardrails often get discussed as a security or compliance topic. Those matter, but for engineering teams the day-to-day guardrails are usually more practical.
Keep the change inside the task. Avoid unrelated refactoring. Follow repository standards. Check for breaking changes. Add tests where they make sense. Make assumptions visible. Produce a pull request a reviewer can understand.
Those are engineering quality controls.
A lot of AI-generated work looks impressive until someone has to review it. If the output creates a large diff, mixes unrelated changes, or hides assumptions, the cost has not disappeared. It has moved from generation to review.
Skills, scripts, repository standards, and evaluation checks can work together here. The skill sets the expected behaviour. Scripts validate the repeatable parts. Reviewers can spend more time on judgement and less time cleaning up avoidable mistakes.
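A script that validates the repeatable parts can be very small. The sketch below checks that a change stays inside the task by comparing changed file paths against allowed patterns and a file-count limit. The function name, the patterns, and the limit of 20 files are assumptions for illustration; how you obtain the list of changed files (for example from your version control tooling) is left out.

```python
from fnmatch import fnmatch


def check_scope(changed_files, allowed_patterns, max_files=20):
    """Return human-readable guardrail failures; an empty list means the check passed."""
    failures = []
    if len(changed_files) > max_files:
        failures.append(
            f"diff touches {len(changed_files)} files (limit {max_files})"
        )
    for path in changed_files:
        # fnmatch's "*" also matches "/" so "infra/*.tf" covers nested directories.
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns):
            failures.append(f"out-of-scope change: {path}")
    return failures


# Hypothetical provider-upgrade task that should only touch Terraform files.
failures = check_scope(
    changed_files=["infra/main.tf", "infra/versions.tf", "src/app/handlers.py"],
    allowed_patterns=["infra/*.tf"],
)
print(failures)
# ['out-of-scope change: src/app/handlers.py']
```

A check like this does not judge whether the change is good. It only removes one avoidable class of problem before a reviewer has to look at it.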

Start measuring the right things
I would not start with a huge AI engineering dashboard. That usually creates more noise than insight.
The first measurements should be simple:
- token and model usage by team
- success rate of repeated workflows
- review effort and rework
- guardrail failures
- context sources used
- cost per useful outcome
The important part is connecting usage to outcomes. Token spend by itself is just consumption data. It becomes useful when it is tied to whether the work was accepted, reviewed safely, and helped the team move forward.
This does not need to be perfect on day one. It just needs to avoid the obvious trap of treating activity as value.
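To make "cost per useful outcome" concrete, here is a minimal sketch that divides total spend per workflow by the number of accepted runs. It assumes that "useful" means the change was accepted in review, which is only one possible definition; the `TaskRun` fields and the sample figures are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    workflow: str
    cost_usd: float
    accepted: bool  # assumption: "useful" means the change was accepted in review


def cost_per_useful_outcome(runs):
    """Total spend per workflow divided by accepted runs; None if nothing was accepted."""
    totals = {}
    for run in runs:
        spend, accepted = totals.get(run.workflow, (0.0, 0))
        totals[run.workflow] = (spend + run.cost_usd, accepted + int(run.accepted))
    return {
        workflow: (spend / accepted if accepted else None)
        for workflow, (spend, accepted) in totals.items()
    }


# Made-up sample data.
runs = [
    TaskRun("provider-upgrade", 4.0, accepted=True),
    TaskRun("provider-upgrade", 2.0, accepted=False),
    TaskRun("provider-upgrade", 3.0, accepted=True),
    TaskRun("ad-hoc-refactor", 0.5, accepted=False),
]
result = cost_per_useful_outcome(runs)
print(result)
# {'provider-upgrade': 4.5, 'ad-hoc-refactor': None}
```

The rejected run still counts towards spend, so the cheap ad hoc session that produced nothing shows up as having no useful outcome at all rather than as a bargain.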

What platform teams will end up owning
Once AI usage spreads across teams, the same questions keep appearing.
- Who is using which tools and models?
- What is the cost by team, product, repository, or workflow?
- Which tasks are succeeding?
- Where are agents looping or failing?
- Which skills are improving output quality?
- Which workflows are safe enough to automate further?
Those questions cut across developer experience, observability, governance, reliability, and cost management. They also need ownership.
If nobody owns the standards, evaluation approach, cost visibility, or guardrails, each team ends up solving the same problem slightly differently.
That is not very different from cloud platform work. At small scale, teams can work things out locally. At larger scale, you need sensible defaults, shared patterns, telemetry, and a way to improve the platform without blocking teams.
For AI-assisted engineering, platform teams will probably need to own or influence:
- usage visibility across tools, models, teams, repositories, and workflows
- common patterns for repeatable agent workflows
- standards for skills, repository instructions, and reusable context
- evaluation approaches for common AI-assisted tasks
- guardrails for reviewability, security, cost, and scope control
- reporting that connects AI usage to engineering outcomes
Usage dashboards are a start, but they are not enough. Token spend only becomes useful when it connects to delivery outcomes, review quality, risk reduction, and business value. Agent success rates only mean something when the task is defined clearly enough to compare. Skills only matter when they improve the quality and consistency of the work.
The first phase was proving agents could do useful work. The next phase is making that work dependable enough for teams to use without creating a new kind of operational mess.
That is the part platform teams should be watching: not whether AI creates more activity, but whether the work becomes easier to trust, easier to review, and easier to repeat.
