# What Makes a Good GitHub Copilot Agent Skill?

After building a growing set of GitHub Copilot skills across areas such as Azure API Management, infrastructure as code authoring, and diagram generation with tools like Excalidraw and Draw.io, I have found that the difference between a skill that is genuinely useful and one that quietly disappoints usually comes down to a small number of design choices.

Most of those choices are easy to miss when you first start building skills.

## A good skill is not just knowledge. It is task shape.

One of the easiest mistakes is to treat a skill as a place to dump expertise.

That usually gives you something that reads well to a human but performs less well in practice. A skill is not just a markdown file full of guidance. It is part trigger, part boundary, and part operating model. It has to help the agent recognise when it should engage, what it is responsible for, and how far that responsibility extends.

If it gets any of those wrong, the rest of the content often matters a lot less. The skill may never load. It may load at the wrong time. Or it may load into a conversation where it is competing with other instructions that would have been a better fit.

That is why the most important part of a skill is often not the long body of guidance people spend most of their time writing.

It is the short description line in the YAML frontmatter.

## The description is doing more work than most people think

The description field is short, so it is easy to underestimate. In practice, it carries a lot of the real operational weight.

This is how the skill presents itself to the agent. It is the first clue about relevance. If it is too vague, too broad, or written like a category label instead of something that reflects a real task, the skill can be technically sound and still fail in the only way that matters: it does not get used when it should.

For example:

```yaml
description: Helps with Azure cost analysis.
```

There is nothing obviously wrong with that, but it is weak. Engineers do not usually phrase requests that way. They ask to reduce spend, review an architecture for waste, right-size resources, or find out why a subscription has become expensive.

A stronger description sounds more like the work itself:

```yaml
description: >-
  Analyze Azure architectures for cost optimization opportunities and provide
  savings recommendations. Use when reviewing Azure spending, asked to reduce
  costs, optimize resources, right-size VMs, or find savings across
  subscriptions. Do NOT use for general architecture design (use
  architecture-design skill instead).
```

That works better because it sounds closer to real prompt language. It is explicit about when to use the skill, and just as importantly, when not to.

That second part matters more as your skill library grows.

## Skills need boundaries, not just purpose

Once you move past a couple of personal experiments and start building a real library, the question changes.

You are no longer asking whether a skill is useful on its own. You are asking whether it behaves well alongside everything else.

That is where overlap starts causing problems.

If you have one skill for APIM security review, another for APIM policy authoring, another for API architecture, and another for deployment patterns, the agent has to decide which one belongs to the request in front of it. Loose descriptions make that harder. The wrong skill can load, or several can load together and create noise instead of clarity.

This usually does not show up early. The first few skills often seem fine. The problems start once the library gets broader and more people depend on it. Then you begin to notice inconsistent behaviour for similar prompts, and confidence in the whole setup starts to slip.

The fix is simple in theory, but it does require discipline: skills need negative scope as well as positive scope.

They need to say what they are for, but also what they are not for.

That turns the description from a summary into something more like a contract. It reduces accidental triggering, makes adjacent skills easier to maintain, and gives the agent a better chance of routing a borderline request to the right place.

That is not just an authoring detail. It is part of how your customisation model scales.
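
As a rough sketch, two adjacent skills can declare that boundary against each other directly in their descriptions. The skill names and wording below are hypothetical, but the mutual negative scope is the part that matters:

```yaml
# Hypothetical skill: apim-policy-security-review
description: >-
  Review Azure API Management policies for security issues such as weak
  authentication, overly broad CORS, or exposed secrets. Use when asked to
  review, audit, or harden an existing APIM policy. Do NOT use for authoring
  new policies (use apim-policy-authoring instead).
---
# Hypothetical skill: apim-policy-authoring
description: >-
  Author and modify Azure API Management policies such as rate limiting,
  caching, and transformations. Use when asked to add or change policy
  behaviour. Do NOT use for security review of existing policies (use
  apim-policy-security-review instead).
```

Each description names its neighbour, so a borderline request has an explicit handoff instead of a coin flip.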

## Coexistence matters more than elegance

A lot of early skill examples look neat because they are written as self-contained documents. That is fine for demos. It is less useful when multiple skills live in the same environment and have to coexist.

I found fairly quickly that tidy conventions do not always help once several skills start triggering together. Generic headings like “Overview”, “Usage”, or “Examples” look organised, but they do not add much meaning. When multiple skill bodies are in play, they blur together. Broader examples can also expand the perceived scope of the skill without you meaning them to.

The better pattern is usually narrower and more deliberate.

I tend to think about skills a bit like microservices. They should have a clear job, a sensible interface, and a decent understanding of what belongs to their neighbours. That does not mean every skill has to be tiny. It means it should be specific enough that it behaves predictably when it shares space with other skills.

That distinction matters much more in a platform setting than in a single repo. In a local setup, the human usually carries most of the context anyway. At broader scale, the agent needs sharper separation because it is dealing with many repositories, many prompts, and many slight variations of the same request.
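
To make that concrete, here is a sketch of the difference in body structure. Both outlines are hypothetical; the point is that the second carries meaning in its headings instead of leaning on generic labels:

```markdown
<!-- Generic: blurs together when several skills load -->
## Overview
## Usage
## Examples

<!-- Specific: the structure itself tells the agent what the job is -->
## When to run a policy security review
## Checks to apply to every policy
## How to report findings
## Out of scope (hand off to apim-policy-authoring)
```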

## Keep the body lean and let detail load on demand

Another common mistake is trying to make the body of a skill complete for every possible path.

It feels thorough while you are writing it. It feels expensive when the skill is actually in use.

The body is not free. Every extra line adds context cost. If the skill carries deep detail for every framework, every output format, and every edge case, the agent pays for all of that whether or not the current task needs it.

That is why progressive disclosure is such a useful pattern.

Keep the body focused on the decision logic, the task framing, and the parts that genuinely need to be present when the skill loads. Put detailed format-specific or domain-specific content into `references/` and load it when the conversation reaches that branch. The difference in usability is noticeable.

For example:

```markdown
## Workflow

1. Choose your IaC format — Bicep, Terraform, ARM, or Pulumi
2. Load the format-specific guide:
   - Bicep: see [references/bicep.md](references/bicep.md)
   - Terraform: see [references/terraform.md](references/terraform.md)
```

This is one of those changes that looks obvious afterwards, but it improves both maintainability and usability. A lean core body is easier to keep sharp, and it lets the supporting material grow without turning every invocation into a context-heavy event.

## Agents respond better to reasoning than blunt rules

One thing I have found interesting is that rigid instruction language is not always the most reliable.

A lot of people reach for MUST, NEVER, ALWAYS, and similar terms because they feel precise. Sometimes that is right, especially where there are genuine hard constraints. But for general engineering guidance, agents often do better when the skill explains the reasoning behind the instruction rather than just issuing the command.

For example:

```
ALWAYS use DefaultAzureCredential. NEVER hardcode credentials.
```

That is clear, but it is thin. It gives the rule without helping the agent understand why the rule exists.

This is usually stronger:

```
Use DefaultAzureCredential so credentials are never hardcoded and the same
code works in local development, CI/CD, and production through the standard
identity chain without requiring code changes between environments.
```

The second version gives the agent something it can generalise from. That matters because real engineering work is full of messy cases. No skill will anticipate all of them. If the agent understands the reason, it stands a better chance of making a sensible decision when the exact scenario was not written down.

That is a big part of what separates useful skills from restrictive ones. Good skills carry judgement, not just instruction.

## Test the trigger, not just the content

A skill can be well written and still be ineffective because it never shows up when the user actually needs it.

That is why trigger testing matters.

Before I am happy with a skill, I want to know whether the description maps cleanly to the kinds of phrases real users are likely to type. Not the tidy examples I had in my head when I wrote it. The messy versions.

If the skill is meant for APIM policy security review, then “review my APIM policy for security issues” should clearly match. “Audit my API gateway config” probably should too if the intent is still review. But “rate limit my API” may belong to policy authoring instead, and “deploy my APIM” clearly belongs somewhere else.

This kind of testing is simple, but it catches a lot. Missing synonyms. Scope that is too broad. Neighbouring skills that are likely to compete. It also forces you to think in user phrasing rather than author taxonomy, which is usually where trigger quality is won or lost.
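
A lightweight way to keep that honest is a small trigger matrix kept alongside the skill, mapping realistic prompts to the skill you expect to load. The expected skill names below are hypothetical:

```markdown
| Prompt                                    | Expected skill              |
|-------------------------------------------|-----------------------------|
| review my APIM policy for security issues | apim-policy-security-review |
| audit my API gateway config               | apim-policy-security-review |
| rate limit my API                         | apim-policy-authoring       |
| deploy my APIM instance                   | apim-deployment             |
```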

A skill that cannot be activated reliably is just documentation wearing a different hat.

## Bundled resources are often the point where skills become properly useful

There is usually a point where a skill stops feeling like a markdown file and starts feeling like a more complete working package.

That tends to happen when you notice the same supporting materials showing up again and again. Maybe you are validating the same output patterns. Maybe you need schemas, templates, coordinate maps, colour palettes, helper scripts, or boilerplate structures. Once that repetition appears, it usually makes sense to package those assets with the skill instead of forcing the agent to recreate the same logic every time.

This is where directories like `scripts/`, `references/`, and `assets/` start to pay for themselves.

Scripts help with deterministic repeated work. References keep deep material out of the core body. Assets help the agent align to templates or output structures without dragging all of that detail into every invocation.
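
As an illustration, a bundled skill might end up looking something like this on disk. The file names are hypothetical and the exact layout will vary from skill to skill:

```
azure-cost-optimization/
├── SKILL.md               # lean core: description, triggers, decision logic
├── references/
│   ├── compute.md         # deep right-sizing guidance, loaded on demand
│   └── storage.md
├── scripts/
│   └── estimate_costs.py  # deterministic calculations the agent can run
└── assets/
    └── report-template.md # output structure the agent aligns to
```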

The main benefit is not neatness. It is consistency.

When the agent can rely on the same supporting materials each time, the output becomes less dependent on phrasing and more stable across different engineers and repositories. That matters a lot once a skill is being used by more than one person in more than one place.

## The best skills usually come from workflows that already succeeded

The least glamorous way to build good skills is probably also the most effective: stop inventing them from scratch.

The best skills I have worked on usually came from workflows that had already worked reasonably well in live use. Not because the first attempt was perfect, but because it revealed the right things: where the agent drifted, where it needed steering, which decisions were repeatedly important, and what kind of output was actually accepted in the end.

Those course corrections are usually the useful bit. They show where the naive version goes wrong, which is exactly what the skill should help avoid next time.

So instead of starting from a blank page and trying to imagine the ideal instruction set, I find it more useful to mine real interactions, strip out the one-off specifics, and capture the decision logic that would have made the workflow cleaner in the first place.

That usually produces something much closer to genuine engineering guidance.

There is an important distinction here too. “Find the existing naming convention before introducing new resources” is a reusable behavioural instruction. “Use this exact resource group name” is not. One captures the pattern. The other captures a temporary local detail.

If you want skills that survive across teams, repos, and environments, that distinction matters a lot.
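
In skill text, that difference might look like this. Both lines are illustrative, and the resource group name is a made-up placeholder:

```markdown
<!-- Reusable behaviour: survives across teams and repos -->
Before introducing new resources, read the existing templates and follow the
naming convention already in use.

<!-- Temporary local detail: belongs in the conversation, not the skill -->
Put all new resources in the resource group rg-contoso-prod-001.
```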

## What actually makes a skill good

A good GitHub Copilot agent skill is not the one with the most content.

It is the one that helps the agent recognise the task properly, stay within the right boundaries, carry just enough context, and make sensible decisions when the prompt is a bit rough around the edges.

In practice, that usually means a description that is specific and scoped properly. A body that stays lean. Detail that loads when needed instead of arriving all at once. Instructions that explain the reasoning rather than just stating rules. Trigger testing against real prompt language. And, ideally, design shaped by workflows that already proved useful in real use.

The main thing I would stress is that good skills are less about showing the agent everything you know and more about helping it behave well under normal engineering conditions.

That is a different goal.

It is much closer to platform design than documentation writing.

Anyone can produce a skill that looks complete. The more interesting challenge is producing one that triggers when it should, stays quiet when it should not, cooperates with the rest of the library, and keeps being useful once more people start depending on it.

That is usually the point where you find out whether you built a clever demo or something that is actually engineered.

My agent skills are all in this GitHub repo if you want to check them out. 🙂
