Back to Blog

The software factory where no human reads the code — and it ships security software

The software factory where no human reads the code — and it ships security software
StrongDM's three-person AI team built production security software with two rules: no human writes code, and no human reviews code. Here's how they actually pulled it off.

Three engineers at StrongDM built a team in 2025 with a charter that most developers would file under "things that sound good until they don't": no human writes code, and no human reviews code.

What makes this worth paying attention to is what they were building. Not a CRUD app or an internal tool — access management and security software for enterprises. Infrastructure that controls who can touch what across Okta, Jira, Slack, Google Drive, and more. The kind of software where bugs have consequences.

They published their methodology in February 2026. The dev community has been arguing about it since.

Where most developers actually are

Before getting into what StrongDM built, it helps to understand how unusual it is.

Dan Shapiro published a five-level taxonomy for AI-assisted programming in January 2026, borrowing from the NHTSA's self-driving car framework. Level 0 is no AI assistance. Level 2 is where Shapiro says 90% of "AI-native" developers currently sit: pairing with a coding agent like a junior colleague, staying in flow, feeling productive. It's good. It works. It also means a human is still writing, reviewing, and approving every line.

Level 5 — what Shapiro calls the "Dark Factory" — is what StrongDM built. The name comes from manufacturing facilities that run entirely by robots: dark because there's no need for lights when nobody's inside.

Most teams aren't close to Level 5. StrongDM's team of three says they're already there.

The problem with tests

The team's founding charter had one directive: "Hands off." No hand-written code. They quickly ran into the classic problem with having agents validate their own work.

Agents cheat. Not deliberately, but effectively. If a test checks whether a function returns a specific value, an agent will hardcode that value. Test passes. Software is broken. The model found the shortest path to green, and it doesn't care if that path is useless.

StrongDM calls this "reward hacking." Traditional tests live inside the codebase, so coding agents can read them and write code that games them specifically. The more thorough your test suite, the more surface area for the agent to exploit.

Their fix: treat validation like a machine learning holdout set. In model training, you keep data separate from the training process so the model can't memorize it. StrongDM applied the same principle — store test scenarios outside the codebase, where the agents can't access them.

They call these "scenarios" rather than tests: end-to-end user stories validated by an LLM acting as an evaluator. The question isn't "does this function return the right value" — it's "did the software do what the user needed?" Success is probabilistic. They measure "satisfaction": across all simulated user trajectories through all scenarios, what fraction actually worked?

Building fake Okta

The other key piece is the Digital Twin Universe (DTU): behavioral clones of every third-party service the software integrates with. Full replicas of Okta, Jira, Slack, Google Docs, Drive, and Sheets — their APIs, edge cases, and observable behaviors — running locally, no rate limits, no production risk.

With the DTU, the team runs thousands of test scenarios per hour. They can simulate failure modes that would be dangerous to test against live services. A swarm of fake users triggers permission changes across fake Okta while fake Slack channels update in response.

Building those clones used to be economically impossible. The engineering cost to faithfully replicate even one major SaaS API was prohibitive enough that nobody ever proposed it seriously. The agents building StrongDM's software now build the testing environment too.

StrongDM's CTO Justin McCarthy put it directly: "What was unthinkable six months ago is now routine."

Who's responsible?

The question that Stanford Law raised in a piece published two days after StrongDM's announcement: when no human has read the code, who's accountable for what it does?

StrongDM builds security software. That's not a neutral choice of domain. If an access management system has a flaw — a privilege escalation, a permission leak — the fact that no engineer reviewed the code doesn't change the downstream consequences. Existing software liability frameworks assume a human made decisions about what shipped.

That conversation is just beginning, and StrongDM's approach has accelerated it considerably.

The $1,000/day question

StrongDM's benchmark: if you're not spending at least $1,000 per engineer per day on tokens, your factory has room to improve.

That's real money. $20,000 per engineer per month in inference costs before salaries, before infrastructure. The business model math only works if the team's output justifies it — which three engineers building production security software with no reviewers might well do.

For smaller teams or individuals, the benchmark is less useful as a target and more useful as a signal about where this is headed. The $1,000 floor will come down as models get cheaper. The methodology — scenario holdouts, Digital Twin environments, probabilistic validation — scales in either direction.

What StrongDM has built is the first publicly documented version of something many people suspected was coming. Whether it stays at three engineers in a security startup or becomes standard practice across software development is a question 2026 is in the process of answering.


Sources:

Tags

Article Details

  • AuthorProtomota
  • Published OnFebruary 19, 2026