Author here, if you don't want to read all that, I'll post one excerpt that I think sums it up nicely:
> My point is, the spec must live somewhere, even if you don’t write it down. The spec is what you want the software to be. It often exists only in your head or in conversations. You and your team and your business will always care what the spec says, and that’s never going to change. So you’re better off writing it down now! And I think that a plain old list of acceptance criteria is a good place to start. (That’s really all that `feature.yaml` is.)
I wrote something similar recently about how agent-generated code lacks the institutional memory that human-written code has. There's nobody to ask why a decision was made (1).
“Specsmaxxing” is basically the right response to this. When you can't rely on authorial memory, you have to put the intent somewhere durable. Specs become the source of truth by default if we continue down the road of AI generated code.
the recursive-mode workflow has full traceability, including why decisions were made, what the original requirement was, what the previous state was, etc. https://recursive-mode.dev/introduction
I had a similar experience refactoring a large codebase. The only thing that made it possible was that each commit message had a JIRA ticket number tying it to a requirement or task. I could find the people behind the business logic and ask them about it.
All through the agile era I wrote detailed specs for projects and then followed an agile process. The most successful parts of every project were the ones that we were able to spec best even when they diverged significantly from the original spec.
You don't plan to follow the plan. You plan in order to understand the whole problem space. Obviously no plan survives contact with reality.
Another point of view is that LLMs perform, to an extent, on the same level as outsourcing does. This interface requires a bit more contract mass than doing everything within a single team.
We never left waterfall in the end.
I've worked with and for dozens of software companies, and collaborated with probably a hundred, at different scales. Every single one said:
We do agile
Guess what?
Every single one of them was doing waterfall.
Their agile included preplanning and pre-specifying the full spec and each task, before the project kicked off. We'd have meetings where we'd drill down into tasks, folks would write them down so detailed that there would be no other way than doing that. Agile would be claimed, but the start date, end date, end spec and number of developers was always concrete.
Sometimes, the end date was too late, so a panic would ensue. Most of the time, the date was too late because developers had "unknowns" which then had to be "drilled down and specced so they wouldn't be unknowns". Sometimes, nearly 50% of the workweek was spent on meetings.
A few times, a project was running late - so to make sure we were _really_ doing it agile, we'd have morning standups, evening standups, weekly plannings, retrospectives, and backlog refinement. It would waste time, and the "unknowns" aka "tickets to refine" were again, as always, dependent upon the PM/PO/CEO's wishes, which wouldn't get crystallized until it was _really last minute_.
One customer wanted us to do a 2 year agile plan on building their product. We had gigantic calls with 20+ people in them, out of which at least half had some kind of "Agile SCRUM Level 3 Black belt Jirajitsu" certificates.
To them, Agile was just a thing you say before you plan things. Agile was just an excuse to deal with a project being late by pinning it on Agile. Agile was just a cop-out for "PM didn't know what to do here so he didn't write anything down". Agile was a "we are modern and cool" sticker for a company.
And unfortunately, to most of them, agile was just a thing you say for the job, as their minds worked in waterfall mode, their obligations worked in waterfall mode, companies worked in waterfall mode, and if they failed their obligation to the waterfall, their job would go down one.
So while we were doing the Agile ceremonies, prancing around with our Scrum master hats, using the right words to fit into the Agile™ worldview - we were doing waterfall all along.
And after 15 years, I'm not even sure - did agile really ever exist?
> When rewriting the entire codebase is very quick and cheap, why bother iterating on small components?
We are nowhere near this scenario tbh. Token cost is very high and is currently heavily subsidized by VC money to gain market share. Also, this realistically only applies to small projects, small codebases, and mostly greenfield ones. No way you can rewrite the whole codebase quickly and cheaply in any mid-sized or larger project.
But even assuming token cost plummets, any non-trivial piece of software that is valuable enough to generate income for the company is also big, complex, and interconnected enough that it cannot be rewritten quickly even by AI, and there are business reasons not to try. If a piece of code works, is stable, and is tested, then rewriting it will always bring a high degree of risk and uncertainty that in a lot of business-critical applications is just not worth it. A stable system can stay untouched for years besides minor dependency updates.
What's the difference between this and Jira? Your specs already live somewhere: it's where you defined them. That's why it's nice to put the Jira ticket number in your code / commit, so you can refer back to the spec when something breaks.
A specification isn't a series of change requests! Using Jira as your source of truth is no different to just recording all your prompts. There's nothing you can easily review to spot contradictions or how things interact with one another.
I've been doing "specmaxxing" for a few months now. Unlike the author I don't use Yaml, I use a mix of Markdown and Gherkin. If you haven't encountered Gherkin before, it's not new and you might know it under the name Cucumber or BDD.
Gherkin is basically a structured form of English that can be fed into a unit testing framework to match against methods.
The nice thing about writing acceptance criteria this way is that they become executable and analyzable. You write some Gherkin and then ask the model to make the tests execute and pass. Now in a good IDE (IntelliJ has good support) you can run the acceptance criteria to ensure they pass, navigate from any specific acceptance criteria to the code which tests it (and from there to the code that implements it), you can generate reports, integrate it into CI and so on.
And when writing out acceptance tests that are quite similar, the IDE will help you with features like auto-complete. But if you need something that isn't implemented in the test-side code yet, no big deal. Just write it anyway and the model will write the mapping code.
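To make the "matched against methods" part concrete, here is a minimal sketch of how step text can be bound to functions. It is not how Cucumber itself is implemented; the `step` decorator, `run_scenario`, and the shopping-cart scenario are all hypothetical names made up for illustration, and real Gherkin tooling handles far more (backgrounds, tables, tags, and so on).

```python
import re

# Hypothetical step registry: (compiled pattern, implementing function) pairs.
STEPS = []

def step(pattern):
    """Register a function as the implementation of steps matching `pattern`."""
    def decorator(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return decorator

@step(r"a cart with (\d+) items")
def given_cart(ctx, count):
    ctx["items"] = int(count)

@step(r"the user removes (\d+) items?")
def when_remove(ctx, count):
    ctx["items"] -= int(count)

@step(r"the cart contains (\d+) items?")
def then_cart_contains(ctx, count):
    assert ctx["items"] == int(count), f"expected {count}, got {ctx['items']}"

def run_scenario(text):
    """Run each Given/When/Then line; return the final context and unmatched steps."""
    ctx, unmatched = {}, []
    for line in text.strip().splitlines():
        # Strip the leading keyword and keep only the step text.
        m = re.match(r"\s*(Given|When|Then|And|But)\s+(.*)", line)
        if not m:
            continue  # e.g. the "Scenario:" title line
        body = m.group(2)
        for pattern, func in STEPS:
            found = pattern.fullmatch(body)
            if found:
                func(ctx, *found.groups())
                break
        else:
            # No mapping code exists yet for this step.
            unmatched.append(body)
    return ctx, unmatched

SCENARIO = """
Scenario: Removing items from the cart
  Given a cart with 3 items
  When the user removes 1 item
  Then the cart contains 2 items
  And the total is recalculated
"""

ctx, unmatched = run_scenario(SCENARIO)
print(ctx, unmatched)
```

The last step has no implementation, so it ends up in `unmatched`. That is the situation where you write the step anyway and have the model produce the missing mapping function.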
There's a variant of Gherkin specifically designed for writing UI tests for web apps that also looks quite interesting. And because it's an old ecosystem there's lots of tooling around it.
Another thing I've found works well is asking the models to review every spec simultaneously and find contradictions. I've built myself a tool that does this and highlights the problems as errors in IntelliJ, like compiler errors. So I can click a button in the toolbar and then navigate between paragraphs that contradict each other. It's like a word processor but for writing specs.
Once you're doing spec driven development, you don't need to write prompts anymore. Every prompt can just be "Update the code and tests to match the changes to the specs."
The problem with gherkin is that it was a badly designed language.
The general idea of "readable specification language" was an inspired one but it failed on execution - it has gnarly syntax, no typing and bad abstractions.
This results in poor tests which are hard to maintain and diverge between being either too repetitive to be useful or too vague to be useful.
The ecosystem is big but it's built on crumbling foundations which is why when most people used it most of them got frustrated and gave up on it.
Annoyingly there's a certain amount of gaslighting around it too ("it didnt work for you coz you werent using it correctly") which is eleven different kinds of wrong.
Jira is only a set of changes though. What happens on a long (10+ year) and complex (10+ developer) project with many changes and revisions? Eventually you need an explicit specification that itself has a "current state", and a change log. Theoretically you could generate this from Jira, but in my experience it eventually became a mess on any larger project that didn't have explicit and maintained written requirements.
Nice! Your spec-maxxing is very resonant. I've been working with explicit requirements: elicit them from conversation with me or by introspecting another piece of software; one-shot from them; and keep them up-to-date as I do the "old man shouts at Claude" iterations after whatever the one-shotting came up with.
Unlike you, I wish for the LLM to do as much of the work as possible -- but "as possible" is doing a lot of work in that sentence. I'm still trying to get clear on exactly where I am needed and where Opus and iterations will get there eventually.
It has really challenged me to get clearer on what a requirement is vs a constraint (e.g., "you don't get to reinvent the database schema, we're building part of a larger system"). And I still battle with when and how to specify UI behaviours: so much UI is implicit, and it seems quite daunting to have to specify so much to get it working. I have new respect for whoever wrote the undoubtedly bajillion tests for Flutter and other UI toolkits.
Forgot to add: I get several benefits from doing this.
1. Specifications that live outside the code. We have a lot of code for which "what should this do?" is a subjective answer, because "what was this written to do?" is either oral legend or lost in time. As future Claude sessions add new features, this is how Claude can remember what was intentional in the existing code and what were accidents of implementation. And they're useful for documenters, support, etc.
2. Specifications that stay up to date as code is written. No spec survives first contact with the enemy (implementation in the real world). "Huh, there are TWO statuses for Missing orders, but we wrote this assuming just one. How do we display them? Which are we setting or is it configurable?" etc. Implementer finds things the specifier got wrong about reality, things the specifier missed that need to be specified/decided, and testing finds what they both missed.
I have a colleague working on saving architecture decisions, and his description of it feels like a higher-abstraction version of my saving and maintaining requirements.
I do (1) the same but (2) differently. In my workflow, (2) are AI-generated specs using human-written (1) as the input. It's an intermediate stage between (1) and the codebase, allowing for a gradual token expansion from 30k to 250k to the final code, which is 2-3M. The benefit I've found with this approach is that it gives the AI a way to iterate on the details of the whole system in one context window, whereas fitting the whole codebase into one prompt is impossible. The code is then nothing more than a style transfer from (2).
I actually read it all since it did not contain any hints of being AI generated (although I wouldn't be surprised to learn you did use AI to write it), so thank you for that. It's kind of crazy how I now have the default expectation that posts posted here are AI slop with little thought or care put in.
I am also stealing the idea of talking to LLMs as if it's an email. So funny, we need to be joymaxxing a bit more I think :)
:) Here is a crazy thought - what if we had some kind of a narrowed down, specific subset of normal language which would translate into specific computer-level instructions. So for example, instead of telling computer to read something from a file and transform it in a certain way, you actually had a specific instruction to open a file, which worked the same each time you used it and guaranteed to fail if you used it the wrong way? Wow, the possibilities are endless :)
Don’t be ridiculous, that would be extremely hard. Oppressive even, because it’s unattainable to an average person. And it is, otherwise there would be millions of programmers in the world. Was it unattainable or “we have to pay these suckers money, and they have rights and lives outside of work”? Bah! Just make sure to renew your subscription, agent will do the thinking and you bring the money.
But Paul Graham says that the guy from Replit whom he funded told him the source code is "object code" now, so we don't need to look at it at all? It must be utter wisdom, since PG managed to get wealthy by selling some website during dotcom-mania, so he must have insights we are missing?
Behaviour Driven Development or Spec Driven Development are, loosely, forms of Test Driven Development where you encode the specification into the code base. No impedance, full insight, formality through code.
I think people get really dogmatic about “test” projects, but with a touch of effort a unit test harness can be split up into integration tests, acceptance tests, and specification compliance tests. Pull the data out as human readable reports and you have a living, verifiable, specification.
Particularly using something comparable to back-ticks in F#, which let test names be defined with spaces and punctuation (ie “fulfills requirement 5.A.1, timeouts fail gracefully on mobile”), you can create specific layers of compiled, versioned, and verifiable specification baked into the codebase and available for PMs and testers and clients and approval committees.
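Python has no F#-style back-tick identifiers, but a rough equivalent is to attach the requirement text to each test and pull it back out as a report. The sketch below assumes a hypothetical `fulfills` decorator and a stand-in `fetch_with_timeout` function; none of these names come from a real framework.

```python
# Hypothetical registry of (requirement id, description, test function).
SPEC_TESTS = []

def fulfills(req_id, description):
    """Tag a test with the requirement it verifies (illustrative helper)."""
    def decorator(func):
        SPEC_TESTS.append((req_id, description, func))
        return func
    return decorator

def fetch_with_timeout(seconds):
    """Stand-in for the real code under test (assumption for this example)."""
    return "timeout" if seconds <= 0 else "ok"

@fulfills("5.A.1", "timeouts fail gracefully on mobile")
def test_timeout_is_reported_not_raised():
    assert fetch_with_timeout(0) == "timeout"

@fulfills("5.A.2", "successful requests return ok")
def test_success():
    assert fetch_with_timeout(5) == "ok"

def compliance_report():
    """Run every tagged test and return one human-readable line per requirement."""
    lines = []
    for req_id, description, func in sorted(SPEC_TESTS, key=lambda t: t[0]):
        try:
            func()
            status = "PASS"
        except AssertionError:
            status = "FAIL"
        lines.append(f"{status}  requirement {req_id}: {description}")
    return lines

for line in compliance_report():
    print(line)
```

The report lines are what you would hand to PMs, testers, or an approval committee, and they are versioned with the code because they live in it.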
Fair, I could have made that point clearer. It's a couple things. First is that I finally stopped experimenting with TUIs, harnesses, models, subagents, roles, skills, mcp, md libraries etc. and have mostly settled on this approach, and got back to building other things with it. I'm sure that won't last forever though.
Second is that I'm doing a lot less "seat of my pants prompting" and doing more engineering and ideating, which was a big goal of mine. So I'm feeling less psychotic there too.
And sort of tangentially to that, I think a significant subset of devs actually are willing to just prompt their way to nirvana, day in and day out. I'm not. I think the spec will carry a lot of weight for a long time. Maybe they will get further than I give them credit for? Maybe the whole digital world becomes a single chat box?
Some people seem to give very little thought to semantics and semiotics lately, to the point where people use words vaguely without even looking them up.
I guess I misappropriated the term then, whoops. AI OCD? AI obsession? Whatever you call the behavior that I saw myself and others falling into. Getting obnoxiously fixated on the tooling and the models to a counterproductive degree.
AI psychosis: (informal) A phenomenon wherein individuals reportedly develop or experience worsening psychosis, such as paranoia and delusions, in connection with their use of chatbots.
That’s the best part: you don’t. “You would extend the prompt to improve it”. They’ll just ask Claude to write an AI tool to overcome psychosis (the program will spam Anthropic servers with racial slurs which will promptly cause ban of the user, success).
So...is this just Cucumber cough cough behavior driven design again, but stored in YAML so that LLMs can read it easier by loading the AST instead of tokenizing the text?
It’s like a YAML of an event model but less graphical. Right? I think I will prefer Event Modeling, especially with Martin Dilger now building tooling very much with agents in mind. There is no one place to read about his most recent efforts except for his LinkedIn feed, I fear, so I won’t post any URLs, but the information is easy enough to find.
A full blown event model facilitates all communication, human (management, devs, ops) and agentic. But maybe I’m missing something, maybe the dashboard can have this function I didn’t dig into it too much.
Old is new, I guess. This is independent of whether A"I" or a human executes; the point is that you need this if specifying and execution lie apart, be it in time or space. This is basically the whole point of the V-Model and processes (if used correctly as a tool and not treated as goals) and was already researched and formalized in the 60s and 70s.
I use OpenSpec for my spec management, and I scrolled down to the comparison. The gripe seems to be with a semantic difference. Specs describing a current system is the basis for AS/IS Gap Analysis.
Also, I mainly pursue these tools so that I can have AI accelerate this process and broker an agreement after negotiating specs with the agent.
I've also been using OpenSpec for a few months now, and it's really good if you invest enough in the specs (in the beginning I skimmed over much; now I pay attention to all details and fix anything that's wrong or where I see a gap).
The one thing I like that OP brings is to tie specs and code together. The openspec flow does help a lot in keeping code synced with specs, but when a spec changes, AI needs to find the relevant code to change it. It's pretty easy to miss something in large codebase (especially when there is lots of legacy stuff).
Being able to search for numbered spec tags to find relevant bits of code makes it much more likely to find what needs to be changed (and probably with less token use too).
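A minimal sketch of that kind of tag search, assuming tags look like `AC-7` and appear in comments. The tag format, file names, and `index_spec_tags` helper are made up for illustration and are not the OP's actual scheme; files are passed in as strings to keep the example self-contained.

```python
import re
from collections import defaultdict

# Assumed tag format, e.g. "AC-12"; the real tool's format may differ.
TAG = re.compile(r"\bAC-\d+\b")

def index_spec_tags(files):
    """Map each spec tag to the (filename, line number) pairs that mention it."""
    index = defaultdict(list)
    for name, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for tag in TAG.findall(line):
                index[tag].append((name, lineno))
    return dict(index)

# Hypothetical file contents standing in for a real checkout.
FILES = {
    "orders.py": (
        "def status(order):\n"
        "    # AC-7: missing orders show a warning\n"
        "    return order.state\n"
    ),
    "test_orders.py": (
        "# covers AC-7 and AC-9\n"
        "def test_status():\n"
        "    pass\n"
    ),
}

index = index_spec_tags(FILES)
print(index["AC-7"])
```

When a criterion changes, the agent (or a human) can start from the index entry for that tag instead of searching the whole codebase.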
I can see one benefit to a structured yaml for specs like the OP is doing: it gives you more control over what you include in the context window. But coming up with a good schema that doesn't handicap you or add cognitive burden, compared to the freeform flexibility of md/txt, is a challenge.
If the selling point is a new file format for spec management, it would be more interesting to provide an offering with org-mode. The author admits they were unaware of other pre-existing solutions before this project so I am providing context to their critique of OpenSpec.
But have you thought about “fun factor”? It’s where you sit like an addict in a casino for weeks and burn tokens in a hope of winning a software that you could’ve written?
Who doesn’t consider “fun” thinking about work crap all the time, writing to your agent, verifying walls of slop?
I just spent a week training up in spec driven development through bmad, which was awful, and speckit which was ok but not great. Both had what seemed like unnecessary ceremony around the specs, generating fields of spec documents which presumably fill up the context window quickly. I just kept thinking "this should be using something simpler, all this markdown is unnecessary"
What is your agentic development experience with Elixir? I used to like Elixir a lot during the pre-agentic era, but with coding agents it feels like the language isn't the best choice: slow compile times, weak type system (at least it was a year ago, I know there is work on that front), small ecosystem...
I’m using Opus 4.6 and I’m so confused! Maybe I should try Opus 4.7, which is almost twice as expensive to get some clarity (but not too much, I need to save money for Opus 4.8)?
That's what the article is about: overcoming problems with AI coding tools using specs in YAML. If we've got that far, it might be better to write specs in a proper programming language instead and skip the AI layer altogether.
Think the idea is to still get monumental acceleration between fancy YAML specs (bullet points with some indentation that an intelligent technical manager could write) and production ready code.
Completely subjective take, but I feel like 95% of these "tools" that are prompt-engineering inventions created by the authors with their bias and to suit their needs don't have anything supporting them besides the authors' subjective experience.
I have seen the same idea with processes, pipelines, lists, bullet points, jsons, yamls, trees, prioritization queues all for LLM context and instruction alignment. It's like the authors take the structure they are familiar with, and go 100% in on it until it provides value for them and then they think it's the best thing since sliced bread.
I would like, for once, to see some kind of exploration/ablation against other methods. Or even better, a tool that uses your data to figure out your personal bias and structure preference for writing specs, so that you can have a way of providing yourself value.
> We are entering the post-slop era. My software is more robust, better tested, better integrated, and more observable than ever before. And my velocity keeps increasing!
Don't we just love the hard fact conclusions based on sample size N=1 and hand-waving arguments?
I have also started numbering my acceptance criteria and pushing that across the team(s). It’s going pretty well. Some notes, however:
1. Don’t write in yaml. It’s really hard for humans. Write in markdown and use a standard means to convert to lists / yaml.
2. Think beyond you writing your own specs - how does this expand into teams of tens or more? The ticketing system you have (Jira? Bugzilla?) is not designed for discussion of the acceptance criteria. I think we are heading into a world of waterfall again where we have discussions around the acceptance criteria. This is not a bad thing - it used to be called product management and they would write an upfront spec.
If this new world has a tech and a business user lead the writing of a new spec (like a PEP), and then AI implements it and it’s put into a UAT harness for larger review and a daily cycle begins, we might have something.
It's ok friend, all I did is put acceptance criteria in a list so I can parse it and quickly track cross-references. The rest is just Elixir/Phoenix and some creative writing.
YAML is one of the worst technologies ever invented, it has more warts than features. One of the benefits of LLMs is that they can write YAML for me, wherever I am forced to use it.
Otherwise, I like the idea of machine-readable specs.
the token usage isn’t sustainable. formal english is a barrier but requirement for specification. brevity is the language of money and that’s the premise of management using ai.
fyi language alone can’t define/describe requirements which is why UML existed.
Natural language is a fully general system and can define and describe everything.
You could deterministically process any UML diagram into a prose equivalent.
And in fact you couldn't do the other way around (any prose -> UML) because UML is less powerful than natural language and actually can't express everything that natural language can.
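As a rough illustration of the UML-to-prose direction, here is a sketch that turns a tiny class-diagram model into sentences. It covers only classes, attributes, and one kind of association; the `UmlClass` and `Association` types and the sentence templates are assumptions for this example, not any standard UML tooling.

```python
from dataclasses import dataclass, field

@dataclass
class UmlClass:
    name: str
    attributes: list = field(default_factory=list)

@dataclass
class Association:
    source: str
    target: str
    multiplicity: str  # e.g. "0..*"

# Fixed wording per multiplicity keeps the translation deterministic.
MULTIPLICITY_WORDS = {
    "1": "exactly one",
    "0..1": "at most one",
    "0..*": "zero or more",
    "1..*": "one or more",
}

def to_prose(classes, associations):
    """Render a (very small) class diagram as English sentences."""
    sentences = []
    for c in classes:
        if c.attributes:
            attrs = ", ".join(c.attributes)
            sentences.append(f"The class {c.name} has the attributes {attrs}.")
        else:
            sentences.append(f"The class {c.name} has no attributes.")
    for a in associations:
        words = MULTIPLICITY_WORDS[a.multiplicity]
        sentences.append(
            f"Each {a.source} is associated with {words} {a.target} objects."
        )
    return " ".join(sentences)

prose = to_prose(
    [UmlClass("Customer", ["name", "email"]), UmlClass("Order", ["total"])],
    [Association("Customer", "Order", "0..*")],
)
print(prose)
```

The same input always yields the same text, which is the "deterministic" part. Going from arbitrary prose back to this structure has no equally mechanical mapping.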
Can it also fully describe a composition by Bach or a Rembrandt's painting? In some weird, overly complex way it probably 'could', but it would be very painful. That's why we pick other forms of expression. We use other forms of expression to compact and optimise information delivery. Another benefit is that we cut out the noise. So yes UML cannot describe everything natural language can, but then again why should it - it was designed as a specific framework for designing relations between objects. Not more and not less. Similar for sequence diagrams or other forms of communicating ideas efficiently.
Could it be that slop PRs are less frequently rejected/commented due to (unfortunate) increased acceptance of it?
As it turns out when maxxing AI on leaf parts of a program, the quality of the code doesn't matter that much anymore when compared to building the fundament.
What is it with people and procrastinating with the most useless shit you can imagine?
First it was choice of editor: people were micro optimizing every aspect of their typing experience, editor wars where people would literally slaughter over suggesting another camp.
Editor wars v2: IDEs arrived and second editor war began.
Revenge of the note taking apps: Obsidian/Roam/Joplin/Apple Notes/Logseq. Just one plugin, just one more knowledge graph, bro, and I’ll have peak productivity. 10x is almost here.
AI: you’re witnessing it now.
Do people NOT have anything else in life? How are y’all finding time to do all of this shit? Are you doing it on company time? Do you have hobbies, do you learn foreign languages, travel, have kids or spouses, drive a car, other thousand “normie” things outside of staring at the freaking monitor or thinking about this shit 24/7? Did I miss the invention of a Time Machine?
Lmfao. Going to a site for computer geeks and complaining that they are computer geeks.
Also, a lot of folks don't write code anymore, and barely have the time to read the volume of code that AI produces. This may just be one of the most profound changes in an industry, and some folks are excited about it and want to get better at building with it.
I think the person who wrote this post made a good faith effort to share his learnings while promoting his tool.
It's fun how people brag of their agentmaxxing, but if you ask them what those agents are busy actually producing, it's invariably another agent harness so they can agentmaxx better. NFT/blockchain ecosystem was much the same.
>Do people NOT have anything else in life? How are y’all finding time to do all of this shit? Are you doing it on company time? Do you have hobbies, do you learn foreign languages, travel, have kids or spouses, drive a car, other thousand “normie” things outside of staring at the freaking monitor or thinking about this shit 24/7? Did I miss the invention of a Time Machine?
How are any of those things even remotely as interesting as arguing with people about an Emacs config?
Should I apologize for being excited about something I built and use daily and for wanting people to try it, discuss it, critique it? Not sure by the tone of your message.
Read the room. What you "built" is neither exciting, nor something most people want to "try". Why? Because just like other AI boosters, you are still trying to somehow optimise the usage of natural language to make it work. But it will never "work" because the way the stochastic ML system is built, it has a failure built into the system.
Totally agree it's not exciting, even though I am personally excited by it, and I also agree it's not something most people want to try, even though some people do want to try it-- and I found a few of them right here on HN.
Disagree on the bit about it "never going to work" though.
Failure-prone stochastic ML systems produce testable, auditable code... just like failure-prone human brains can produce testable, auditable code. And in fact, in both cases, changes to our process can reduce the number of failures that slip past testing and audit. Or can reap other rewards. Finding a better process is what I'm interested in right now.
1: https://ossature.dev/blog/ai-generated-code-has-no-author/
Did I miss something or is everyone back in 1970s, working in waterfall processes now?
distributed teams do well when proposals, decisions, etc. are written down and can be easily found and referenced
it doesn't mean docs are frozen in time and can't be patched like code
https://cucumber.io/docs/
Gherkin is basically a structured form of English that can be fed into a unit testing framework to match against methods.
The nice thing about writing acceptance criteria this way is that they become executable and analyzable. You write some Gherkin and then ask the model to make the tests execute and pass. Now in a good IDE (IntelliJ has good support) you can run the acceptance criteria to ensure they pass, navigate from any specific acceptance criteria to the code which tests it (and from there to the code that implements it), you can generate reports, integrate it into CI and so on.
And when writing out acceptance tests that are quite similar, the IDE will help you with features like auto-complete. But if you need something that isn't implemented in the test-side code yet, no big deal. Just write it anyway and the model will write the mapping code.
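To make the "mapping code" idea concrete, here is a minimal sketch of how Given/When/Then lines can be matched to functions. It is not Cucumber's actual implementation; the `step` decorator, the `STEPS` registry, `run_scenario`, and the cart scenario are all hypothetical names made up for illustration, and it uses only the Python standard library.

```python
import re

# Hypothetical step registry: (compiled pattern, handler function) pairs.
STEPS = []

def step(pattern):
    """Register a function as the handler for step text matching `pattern`."""
    def register(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return register

@step(r"a cart with (\d+) items?")
def given_cart(ctx, count):
    ctx["cart"] = int(count)

@step(r"I remove (\d+) items?")
def when_remove(ctx, count):
    ctx["cart"] -= int(count)

@step(r"the cart has (\d+) items?")
def then_cart_has(ctx, count):
    assert ctx["cart"] == int(count), f"expected {count}, got {ctx['cart']}"

SCENARIO = """
Scenario: Removing items from the cart
  Given a cart with 3 items
  When I remove 1 item
  Then the cart has 2 items
"""

def run_scenario(text):
    """Run each Given/When/Then line against the registered steps."""
    ctx = {}
    for line in text.strip().splitlines():
        keyword, _, rest = line.strip().partition(" ")
        if keyword not in {"Given", "When", "Then", "And", "But"}:
            continue  # skip the Scenario title and any other line
        for pattern, func in STEPS:
            match = pattern.fullmatch(rest)
            if match:
                func(ctx, *match.groups())
                break
        else:
            # This is the "not implemented on the test side yet" case.
            raise LookupError(f"no step matches: {rest!r}")
    return ctx
```

Calling `run_scenario(SCENARIO)` runs the three steps in order. A step with no matching function raises `LookupError`, which is the gap you would ask the model to fill.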
There's a variant of Gherkin specifically designed for writing UI tests for web apps that also looks quite interesting. And because it's an old ecosystem there's lots of tooling around it.
Another thing I've found works well is asking the models to review every spec simultaneously and find contradictions. I've built myself a tool that does this and highlights the problems as errors in IntelliJ, like compiler errors. So I can click a button in the toolbar and then navigate between paragraphs that contradict each other. It's like a word processor but for writing specs.
Once you're doing spec driven development, you don't need to write prompts anymore. Every prompt can just be "Update the code and tests to match the changes to the specs."
The general idea of "readable specification language" was an inspired one but it failed on execution - it has gnarly syntax, no typing and bad abstractions.
This results in poor tests which are hard to maintain and diverge between being either too repetitive to be useful or too vague to be useful.
The ecosystem is big but it's built on crumbling foundations, which is why most people who used it got frustrated and gave up on it.
Annoyingly there's a certain amount of gaslighting around it too ("it didnt work for you coz you werent using it correctly") which is eleven different kinds of wrong.
> I use a mix of Markdown and Gherkin
Gherkin also has a Markdown based syntax that is not well known:
https://github.com/cucumber/gherkin/blob/main/MARKDOWN_WITH_...
I prefer that to the 'verbose' original syntax. MDG also renders nicely in code forges.
Unlike you, I wish for the LLM to do as much of the work as possible -- but "as possible" is doing a lot of work in that sentence. I'm still trying to get clear on exactly where I am needed and where Opus and iterations will get there eventually.
It has really challenged me to get clearer on what a requirement is vs a constraint (e.g., "you don't get to reinvent the database schema, we're building part of a larger system"). And I still battle with when and how to specify UI behaviours: so much UI is implicit, and it seems quite daunting to have to specify so much to get it working. I have new respect for whoever wrote the undoubtedly bajillion tests for Flutter and other UI toolkits.
1. Specifications that live outside the code. We have a lot of code for which "what should this do?" is a subjective answer, because "what was this written to do?" is either oral legend or lost in time. As future Claude sessions add new features, this is how Claude can remember what was intentional in the existing code and what were accidents of implementation. And they're useful for documenters, support, etc.
2. Specifications that stay up to date as code is written. No spec survives first contact with the enemy (implementation in the real world). "Huh, there are TWO statuses for Missing orders, but we wrote this assuming just one. How do we display them? Which are we setting or is it configurable?" etc. Implementer finds things the specifier got wrong about reality, things the specifier missed that need to be specified/decided, and testing finds what they both missed.
I have a colleague working on saving architecture decisions, and his description of it feels like a higher-abstraction version of my saving and maintaining requirements.
I am also stealing the idea of talking to LLMs as if it's an email. So funny, we need to be joymaxxing a bit more I think :)
You probably don't want people associating your work with abusing crystal meth and hitting yourself in the face with a hammer.
For anyone missing the reference, SNL has a pretty good explainer:
https://www.youtube.com/watch?v=4XMPLdiXB1k
It's why famously, programmers always say, the code is the documentation, because writing detailed docs is very tedious and nobody wants to do it.
Behaviour Driven Development or Spec Driven Development are, loosely, forms of Test Driven Development where you encode the specification into the code base. No impedance, full insight, formality through code.
I think people get really dogmatic about “test” projects, but with a touch of effort a unit test harness can be split up into integration tests, acceptance tests, and specification compliance tests. Pull the data out as human readable reports and you have a living, verifiable, specification.
Particularly using something comparable to back-ticks in F#, which let test names be defined with spaces and punctuation (i.e. “fulfills requirement 5.A.1, timeouts fail gracefully on mobile”), you can create specific layers of compiled, versioned, and verifiable specification baked into the codebase and available for PMs and testers and clients and approval committees.
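Python has no back-tick test names, but a rough equivalent can be sketched with a decorator that attaches the requirement ID to a `unittest` method and a small function that turns the results into a readable report. The `requirement` decorator, `compliance_report`, and the stand-in `fetch_with_timeout` are hypothetical names for this sketch, not part of any framework.

```python
import unittest

def requirement(req_id, summary):
    """Attach a requirement ID and summary to a test method."""
    def tag(func):
        func.requirement = (req_id, summary)
        return func
    return tag

def fetch_with_timeout(seconds):
    """Stand-in for real code: returns a fallback instead of raising."""
    return "fallback" if seconds <= 0 else "ok"

class MobileTimeouts(unittest.TestCase):
    @requirement("5.A.1", "timeouts fail gracefully on mobile")
    def test_timeout_returns_fallback(self):
        self.assertEqual(fetch_with_timeout(0), "fallback")

def compliance_report(test_case_class):
    """Run every tagged test and return lines like '5.A.1 PASS summary'."""
    lines = []
    for name in unittest.TestLoader().getTestCaseNames(test_case_class):
        tag = getattr(getattr(test_case_class, name), "requirement", None)
        if tag is None:
            continue  # untagged tests are not part of the spec report
        result = unittest.TestResult()
        test_case_class(name).run(result)
        status = "PASS" if result.wasSuccessful() else "FAIL"
        lines.append(f"{tag[0]} {status} {tag[1]}")
    return lines
```

`compliance_report(MobileTimeouts)` gives one line per tagged requirement, which is the kind of output that could be handed to PMs or testers.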
Second is that I'm doing a lot less "seat of my pants prompting" and doing more engineering and ideating, which was a big goal of mine. So I'm feeling less psychotic there too.
And sort of tangentially to that, I think a significant subset of devs actually are willing to just prompt their way to nirvana, day in and day out. I'm not. I think the spec will carry a lot of weight for a long time. Maybe they will get further than I give them credit for? Maybe the whole digital world becomes a single chat box?
https://en.wiktionary.org/wiki/AI_psychosis
A full blown event model facilitates all communication, human (management, devs, ops) and agentic. But maybe I’m missing something; maybe the dashboard can have this function. I didn’t dig into it too much.
Also, I mainly pursue these tools so that I can have AI accelerate this process and broker an agreement after negotiating specs with the agent.
The one thing I like that OP brings is to tie specs and code together. The openspec flow does help a lot in keeping code synced with specs, but when a spec changes, AI needs to find the relevant code to change it. It's pretty easy to miss something in a large codebase (especially when there is lots of legacy stuff).
Being able to search for numbered spec tags to find relevant bits of code makes it much more likely to find what needs to be changed (and probably with less token use too).
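As a rough illustration of the tag search, here is a minimal sketch that builds an index from spec tag to the places in the code that mention it. The `SPEC-<number>` tag format, the `index_spec_tags` function, and the example files are assumptions made up for this sketch; the project under discussion may tag things differently.

```python
import re
from collections import defaultdict

# Assumed tag format: "SPEC-<number>" somewhere in a line, e.g. in a comment.
TAG = re.compile(r"\bSPEC-(\d+)\b")

def index_spec_tags(files):
    """Map each spec tag to the (filename, line number) pairs that mention it.

    `files` is a dict of filename -> source text so the sketch runs without
    touching disk; a real version would walk the repository instead.
    """
    index = defaultdict(list)
    for filename, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in TAG.finditer(line):
                index[f"SPEC-{match.group(1)}"].append((filename, lineno))
    return dict(index)

# Hypothetical file contents, for illustration only.
EXAMPLE = {
    "orders.py": "# SPEC-12: missing orders show a banner\ndef banner(): ...\n",
    "legacy/report.py": "def total(): ...\n# SPEC-12, SPEC-40: totals exclude missing orders\n",
}
```

When spec 12 changes, `index_spec_tags(EXAMPLE)["SPEC-12"]` lists both the new module and the legacy one, so neither gets missed.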
https://haskellforall.com/2026/03/a-sufficiently-detailed-sp...
This seems like the answer to that thought!
An executable spec like gherkin or hitchstory is config - it has no loops or conditionals. There are a number of rarely recognized benefits to this.
People do that? Actual professionals?
If you're genuinely confused, and haven't tried Opus for coding, then it's not surprising you're confused!
It is also okay for you to just not like the idea of LLMs for coding (but say that!).
I wanted to star the project to track the progress but it feels a bit weird. Which repo shall I track? Server? CLI? It looks like a bunch of miscellaneous repos.
I have seen the same idea with processes, pipelines, lists, bullet points, jsons, yamls, trees, prioritization queues all for LLM context and instruction alignment. It's like the authors take the structure they are familiar with, and go 100% in on it until it provides value for them and then they think it's the best thing since sliced bread.
I would like, for once, to see some kind of exploration/ablation against other methods. Or even better, a tool that uses your data to figure out your personal bias and structure preference for writing specs, so that you can have a way of providing yourself value.
"Don't write prompts like that, do it like this! I swear it's better. Claude says so!"
[1] https://www.lat.md/
Don't we just love the hard fact conclusions based on sample size N=1 and hand-waving arguments?
1. Don’t write in yaml. It’s really hard for humans. Write in markdown and use a standard means to convert to lists / yaml.
2. Think beyond you writing your own specs - how does this expand into teams of tens or more? The ticketing system you have (Jira? Bugzilla?) is not designed for discussion of the acceptance criteria. I think we are heading into a world of waterfall again where we have discussions around the acceptance criteria. This is not a bad thing - it used to be called product management and they would write an upfront spec.
If, in this new world, a tech and a business user lead the writing of a new spec (like a PEP), and then AI implements it and it’s put into a UAT harness for larger review and a daily cycle begins, we might have something.
Good luck
This industry has become a parody of itself, and people are celebrating.
Otherwise, I like the idea of machine-readable specs.
fyi language alone can’t define/describe requirements which is why UML existed.
You could deterministically process any UML diagram into a prose equivalent.
And in fact you couldn't do the other way around (any prose -> UML) because UML is less powerful than natural language and actually can't express everything that natural language can.
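The forward direction (diagram to prose) can be sketched for a tiny slice of UML. This assumes a deliberately simplified model with only classes, attributes, and one-way associations with a multiplicity; `UmlClass`, `Association`, and `to_prose` are hypothetical names, and real UML has far more constructs than this covers.

```python
from dataclasses import dataclass, field

@dataclass
class UmlClass:
    name: str
    attributes: list = field(default_factory=list)

@dataclass
class Association:
    source: str
    target: str
    multiplicity: str  # one of the keys in MULTIPLICITY_WORDS

# Fixed wording per multiplicity keeps the translation deterministic.
MULTIPLICITY_WORDS = {
    "1": "exactly one instance",
    "0..1": "at most one instance",
    "1..*": "at least one instance",
    "*": "any number of instances",
}

def to_prose(classes, associations):
    """Turn the simplified diagram into one sentence per element."""
    sentences = []
    for cls in classes:
        if cls.attributes:
            sentences.append(f"A {cls.name} has {', '.join(cls.attributes)}.")
        else:
            sentences.append(f"There is a {cls.name}.")
    for assoc in associations:
        words = MULTIPLICITY_WORDS[assoc.multiplicity]
        sentences.append(
            f"Each {assoc.source} is associated with {words} of {assoc.target}."
        )
    return sentences
```

The same input always yields the same sentences, which is all "deterministically" needs to mean here. Going from arbitrary prose back to a diagram has no such lookup table.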
Can it also fully describe a composition by Bach or a Rembrandt painting? In some weird, overly complex way it probably 'could', but it would be very painful. That's why we pick other forms of expression. We use other forms of expression to compact and optimise information delivery. Another benefit is that we cut out the noise. So yes, UML cannot describe everything natural language can, but then again why should it - it was designed as a specific framework for designing relations between objects. Not more and not less. Similar for sequence diagrams or other forms of communicating ideas efficiently.
This industry is just getting more and more bonkers.
First it was choice of editor: people were micro optimizing every aspect of their typing experience, editor wars where people would literally slaughter each other over suggesting another camp's editor.
Editor wars v2: IDEs arrived and second editor war began.
Revenge of the note taking apps: Obsidian/Roam/Joplin/Apple Notes/Logseq. Just one plugin, just one more knowledge graph, bro, and I’ll have peak productivity. 10x is almost here.
AI: you’re witnessing it now.
Do people NOT have anything else in life? How are y’all finding time to do all of this shit? Are you doing it on company time? Do you have hobbies, do you learn foreign languages, travel, have kids or spouses, drive a car, other thousand “normie” things outside of staring at the freaking monitor or thinking about this shit 24/7? Did I miss the invention of a Time Machine?
Also, a lot of folks don't write code anymore, and barely have the time to read the volume of code that AI produces. This may just be one of the most profound changes in an industry, and some folks are excited about it and want to get better at building with it.
I think the person who wrote this post made a good faith effort to share his learnings while promoting his tool.
How are any of those things even remotely as interesting as arguing with people about an Emacs config?
People are people.
Disagree on the bit about it "never going to work" though.
Failure-prone stochastic ML systems produce testable, auditable code... just like failure-prone human brains can produce testable, auditable code. And in fact, in both cases, changes to our process can reduce the number of failures that slip past testing and audit. Or can reap other rewards. Finding a better process is what I'm interested in right now.