We are terrible at estimating progress with AI
Our assumptions are broken - and the first step to a solution is admitting we have a problem
There’s an idea that has been rattling around in my head about why tech execs continue to overestimate AI and continue to think they are near the finish line for AI-driven projects - when they aren’t. Apple’s announcements for Apple Intelligence are a particularly public example, but there are many others. How can executives get the timelines so wrong? My hypothesis is that AI breaks our preconceived notions about estimating software projects in ways we haven’t internalized yet. Read on for more…
Let’s take a detour through Software History
There’s some precedent for this in software that will sound familiar to anyone experienced with delivering software projects. A project *should* be a linear progression from start to finish, perhaps with a certain amount of deviation from the plan. But somehow that isn’t how it works in practice. The assumption fails so often that we have rules and laws that amount to jokes at the expense of the whole profession of software engineering:
The Ninety-Ninety Rule. “The first 90% of the code accounts for the first 90% of the development time. The remaining 10% of the code accounts for the other 90% of the development time” (credited to Tom Cargill).
Hofstadter’s Law. “It always takes longer than you expect, even when you take into account Hofstadter’s Law” (credited to Douglas Hofstadter).
Brooks’ Law. “Adding manpower to a late software project makes it later” (credited to Fred Brooks in The Mythical Man-Month).
And yet. These rules and laws were all popularized in the ’70s and ’80s. As a profession, the software industry has learned to create scaffolding that helps us avoid the worst of each of these laws (in most cases). A few of these might sound familiar to you:
Agile development and iterations/iterative delivery. Breaking work into sprints and focusing on delivering working software early and often, so that “done” means well tested, integrated, and shippable.
Definition of Done (DoD). Teams have agreed that “done” doesn’t mean “works on my machine”. Tests, code, documentation, integration.
Test-Driven Development (TDD). Writing the test before the code forces you to design for correctness up front, and leaves you with a test that proves whether the implementation works (a minimal sketch follows this list).
Timeboxing. Instead of promising all the features, promise working code with the most important features in a fixed period of time.
MVP mindset. Delivering a minimum viable product rather than a maximal coverage product. This mindset helps engineering teams make tradeoffs in favor of completion, rather than missing deadlines.
Pair programming. Working with a colleague directly as you code, rather than doing your work separately.
DevOps and CI/CD. Making builds, automation, and documentation automated and repeatable for each new person joining the team.
And of course many more…
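To make the TDD item above concrete, here is a minimal sketch in Python. The `apply_discount` function and its tests are hypothetical, invented purely for illustration - the point is just the shape of the workflow: write a failing test first, then the smallest implementation that makes it pass.

```python
# Hypothetical TDD example: the tests below would be written first (and fail),
# then apply_discount is implemented until they pass.
import unittest


def apply_discount(price: float, percent: float) -> float:
    """Return the price after applying a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class TestApplyDiscount(unittest.TestCase):
    def test_ten_percent_off(self):
        self.assertEqual(apply_discount(100.0, 10), 90.0)

    def test_invalid_percent_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(100.0, 150)


if __name__ == "__main__":
    unittest.main()
```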
Also, experienced engineers and managers have developed some intuition about how close to “done” they really are, based on velocity, defect rates, test coverage, and other statistics, as well as familiarity with the software being developed.
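As a toy illustration of the velocity statistic (not a real estimation tool - the `sprints_remaining` helper and the numbers are made up), a manager might sanity-check intuition with something as simple as:

```python
# Naive velocity-based forecast: remaining story points divided by the
# average of recent sprint velocities. Illustrative only.
def sprints_remaining(remaining_points: float, recent_velocities: list[float]) -> float:
    avg_velocity = sum(recent_velocities) / len(recent_velocities)
    return remaining_points / avg_velocity


# e.g. 120 points left; the last three sprints delivered 28, 32, and 30 points
print(sprints_remaining(120, [28, 32, 30]))  # -> 4.0 sprints
```

The number is only as good as the assumption that future work behaves like past work - exactly the assumption AI projects break.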
In the old world, a convincing demo could be built that was known to be 90% fakery. But when the “real software” reached parity or better with what the demo pretended to show, an engineering manager would have a good sense of how real it was, and how close to complete it was.
I suspect that rather than leaning on all of the statistics and methods, many executives focus on their intuition. It has served them well over the years when it comes to estimating projects and second-guessing the rosy estimates their teams give them.
But assumptions, when they are broken, can really bite you. I experienced this in my first job at Trilogy. I had grown accustomed to the way our very experienced engineering team delivered our early products: SalesBUILDER, PriceBUILDER, and QuoteBUILDER, to name a few. I knew the style of our company in terms of how products or user interfaces could be expected to be extended (or not) and customized (or not). And I knew what to expect from each release (steady forward progress, very few regressions).
And then I tackled my first project with what we called SellingChain. It was largely built by a software development team that was new to the company, staffed by new software developers (fresh college graduates), and it ignored the tenets of our bread-and-butter software. It also didn’t improve in a steady march forward; in fact, new releases sometimes regressed significant features. Extensions and customizations were not offered in a way that was consistent with the old products.
Needless to say, I lost my intuition about how to estimate projects on this new software. It was a rude awakening. And a lesson learned.
What does that have to do with AI?
I think we’re experiencing the same disconnect between expectations, intuition, and reality with AI.
Our intuition tells us the thing our team built with AI is almost done. It looks done. It sounds done. The team tells us we’re 80% done. But it isn’t. This alone fools many - in the same way that a traditional software project would fool executives. This isn’t new.
But even if I apply my standard skepticism and think “surely we’re no worse than 50% done, and no better than 80% done,” I’d still be wrong.
AI breaks those old software assumptions. This is a new world.
It may be that the thing we’re trying to build cannot be built at all without leveraging other technical approaches as well.
It may be that we have to do R&D to develop the technique to finish the project.
It may be that we can’t reach the level of quality we need - basically everything works, but without meeting that quality bar, we can’t use it. I picture regulated industries or medical or financial applications where the answers must be correct.
There is no guarantee that the right methods and tech are out there. And there’s no guarantee that the R&D we have to do to develop a new technique will finish in reasonable time, nor that it will work at all.
To the extent that your team is vibe-coding their solution, you may be finding that productivity takes a nonlinear path downward as they spend more time redirecting and correcting the coding agents, rather than having their simple directives met with simple code. Unfortunately, this situation can continue to deteriorate, especially if the time to validate a code change is lengthy or compute-intensive.
Example: Fully Autonomous Self-Driving Cars - Any day now for 20 years.
Take, as an example, the self-driving cars that, 18 years ago, my friends told me would mean my kids would never have to learn how to drive. Everyone thought we were just a few years away from a solved problem.
Arguably, it is still not *quite* a solved problem, nearly two decades later. Waymo is steadily expanding its geographic reach, but it was only a month or two ago that one of its cars dropped off a passenger on the highway, and that they were temporarily removed from the road for blocking emergency vehicles. There’s still work to do, and yet we’re closer than ever.
Certainly we are in the “last 10%” part of the self-driving-car progression - and yet progress remains very incremental. This isn’t because people didn’t understand cars, or AI, or machine vision, or propulsion or any of those things.
But AI projects don’t move linearly in proportion to effort - and new technology needed to be developed to make this work. Acceptable failure rates are very low for driving applications.
Autonomous self-driving cars are not, by and large, an LLM-driven technology, but the long timeframes and cost overruns are instructive nonetheless.
Another Example: Radiology
Over on social media, Deena Mousa shared a great example of this inability to define “done” in AI, and therefore an inability to predict how well AI will perform the job.
“In 2016 Geoffrey Hinton said ‘we should stop training radiologists now’ since AI would soon be better at their jobs. He was right: models have outperformed radiologists on benchmarks for ~a decade. Yet radiology jobs are at record highs, with an average salary of $520k, why?”
That got my attention. I have friends in radiology and I had assumed AI was wrecking that job, and I knew that, even before 2016, offshoring was having some impact on it as well.
Deena goes on to share that there are over 700 models approved by the FDA for radiology alone! That is both astonishing and proof of something I’ve always said: AI tends to be super-human in one extremely narrowly defined way, rather than in some generalized way - and the best AI is the really narrowly defined AI.
Apparently the AI algorithms don’t provide a holistic view the way a doctor would; doctors need to make sense of an array of answers (some with conflicting medical implications) and use their knowledge of the patient and of medicine to formulate medical advice.
Deena’s phrase “many narrow islands of automation” really rings true about AI in the radiology sphere.
Much like self-driving cars haven’t eliminated drivers yet, amazing AI-driven algorithms for radiology have not replaced doctors yet - despite outperforming them on benchmarks. So when we see AI do better than humans on a law exam or medical exam, we should take it with a grain of salt. Benchmarks for computer algorithms are not the same as real-world applicability.
And this is cautionary for executives looking to apply AI. Don’t get taken in by the pace of progress in the press, or by the fact that AI meets a certain benchmark for a certain task. Focus on the real-world impact in your operations - when AI algorithms can positively impact those operations, then you’re talking money.
Right now you’re thinking: okay, fine, that’s driving and radiology. But that’s two examples, or anecdotes. What about all those other industries?
We’re not replacing jobs in other industries either
Greg Kamradt’s post on X puts this succinctly:
Since we’re -3pp away from human level parity in 44 industries but haven’t seen mass job displacement yet, what can we deduce?
He posits three answers:
Perhaps AI is augmenting humans in their jobs, rather than replacing them
Perhaps the AI is great, but integrating with your business operations isn’t happening yet (he would frame this as the app layer, or a product issue)
Perhaps expert-level benchmarks are interesting but don’t predict job displacement well (see radiology, above)
I think all three are true. And there’s a fourth possibility:
The AI hasn’t been distributed to all the places it can be most effective yet, so these 44 industries’ jobs are *about* to be severely impacted; it just hasn’t happened yet.
The supporting data for this one would be trucking. Autonomous driving now exists, but has not been widely rolled out for trucking use cases. That might be correct - I’m not close enough to the needs of trucking to posit whether AI is ready to supplant human drivers. Highway driving seems simple for AI to master, whereas city driving may not be. Conversely, human drivers tend to fall asleep driving long highway routes. Maybe the outcome is a marriage of AI for long-haul and navigation, and humans for driving in tight quarters?
Yet another possibility is that AI may actually be eroding productivity - by generating “slop” that requires humans to intervene and clean up.
The comments to Greg’s post are fascinating: votes and anecdotes in favor of each of these rationales.
Another Example: Our very own Software Industry
Anthropic’s CEO Dario Amodei, who surely deserves plenty of credit for the success that Anthropic has had, and for producing a product (Claude) that is generally well received by its target market (software developers), nevertheless also falls into the trap of overestimating how far and how fast progress will travel. Six months ago (in Business Insider):
Dario Amodei, the CEO of the AI startup Anthropic, said on Monday that AI, and not software developers, could be writing all of the code in our software in a year.
“I think we will be there in three to six months, where AI is writing 90% of the code. And then, in 12 months, we may be in a world where AI is writing essentially all of the code,” Amodei said at a Council of Foreign Relations event on Monday.
John Gruber’s response on Daring Fireball, on August 20th:
Complete bullshit, but, I guess he still has one month to go. - John Gruber
Well, that month has come and gone. If the product is “software that helps software developers achieve super-human levels at certain tasks with the aid of AI” then by all accounts Claude is a success - I’ve been a satisfied user myself. But if the goal is to build a product that “does 90% of what a software developer does, or more” then I’d have to say it not only fails to meet that objective, but also that it isn’t clear when it could meet that standard.
In what other market would we consider it a success if our software product routinely gave us wrong answers, forcing us to point out the errors, only to get responses like “you’re right! I did get that wrong. I think I know how to address that issue by modifying… [..]”. And often you’ll note that the fix doesn’t work - and it will have to guess again at what might be a correct solution.
Imagine if this were a coffee-robot AI - constantly making the wrong drink, which you would point out, and it would helplessly admit to its failure and then exclaim at an obvious fix to the mistake it had just made, while whipping you up a new coffee.
This is not acceptable in medicine, banking, investing, insurance, or any field where physical damage could be done, regulations broken, or money lost. Or in driving cars. Or in drawing medical conclusions from radiological exams. We have a long way to go, and it is *very* hard to estimate how far we are from the finish line for a fully autonomous coding agent - because, likely, new R&D with new techniques needs to happen to make that possible. LLMs on their own, it seems, will not get us there.
Fortunately, developers are a forgiving sort, and will engage with their AI chatbots to shape the code correctly. And, also fortunately, these products *can* produce reasonable solutions to well-trod, well-described problems often enough that software developers will continue to try to use them.
None of this is to say we shouldn’t build with AI - this is just evidence that our ability to estimate how close we are to the answer, with AI-driven systems, is still formative and inaccurate. This is why it pays to have people with realistic experience and a deeper understanding of the mechanisms employed, who can offer better advice.
In the meantime, I’d suggest defining our AI products as assistants to people getting their work done, rather than as replacements for those people.
Humanizing AI with our use of Language interferes with Understanding
Tech execs and pundits keep using humanizing language to characterize AI and what AI is doing - and that is interfering with their understanding of what is actually happening. This is generally called anthropomorphism.
Dwarkesh has a fantastic blog, podcast, and video platform. I wish I’d landed on it sooner. In one recent episode he interviews Richard Sutton, who recently won the Turing Award (effectively the Nobel Prize for computer science). Sutton is a luminary in AI circles and his views carry weight due to his deep expertise. In the discussion with Dwarkesh, they start off with LLMs and whether they might be a dead end, and you can see the difficulty in the conversation quite quickly:
Dwarkesh Patel: You would think that to emulate the trillions of tokens in the corpus of Internet text, you would have to build a world model. In fact, these models do seem to have very robust world models. They’re the best world models we’ve made to date in AI, right? What do you think is missing?
Richard Sutton: I would disagree with most of the things you just said. To mimic what people say is not really to build a model of the world at all. You’re mimicking things that have a model of the world: people.
The opening question has a framing that Sutton has to largely disagree with. It’s sort of inserting facts into the official record in the form of a question - but the facts are wrong. Still, Dwarkesh doesn’t give up easily, positing that you can train LLMs on experience, overcoming this issue. Sutton’s response:
“No. I agree that it’s the large language model perspective. I don’t think it’s a good perspective. A prior bit of knowledge should be the basis for actual knowledge. What is actual knowledge? There’s no definition of actual knowledge in that large-language framework. […] There’s no ground truth. […]”
Effectively, the first three questions start with Sutton definitively, and with great context, answering “No.” The fourth answer is disagreement. The fifth is clarifying why Dwarkesh’s statement is incorrect. Then a disagreement on what a goal is (I think we can all agree it isn’t “next token”).
They move on to human learning and whether humans do imitation learning or not. Dwarkesh seeks to equate human learning methods with LLM learning methods. But they are fundamentally different, as Richard points out in painstaking detail. Honestly, it is a great discussion - but the crux of it is that Dwarkesh comes off as believing in the similarity of LLMs and humans, and Sutton does not buy it. It is clear as well who is speaking from actual expertise in how all the underlying technology works, and who is operating from “I’ve talked to smart people and I’ve been a user of these technologies and it *feels* like it works this way”.
I think a lot of tech executives are sitting in a similar position to Dwarkesh. Obviously smart, polymaths, who understand a fair amount about many topics, and may even have been quite deep technically at some point in their career. But they are mostly not deep technically on AI techniques, and specifically LLMs, and they are likely to have trouble predicting the future course of events (erring on the side of being too optimistic). Unfortunately, this is sometimes true of the executives of AI companies as well.
All of this leads to misconstruing what AI can and will do, and how close to AGI we are.
It culminates with a question about transfer - whether learning one thing (in AI) will lead to understanding another thing (in AI) without being trained on it. A human exhibits this kind of learning all the time. Richard Sutton’s response:
We’re not seeing transfer anywhere. Critical to good performance is that you can generalize well from one state to another state. We don’t have any methods that are good at that.
They go on to clarify that humans are setting up the conditions for success, rather than success emerging from the behavior of AI algorithms. In fact, in deep learning, training on one thing can really impede the prior training on all other things. An interference, if you will.
The whole podcast is so fascinating. Worth a read *and* a listen. Update: And lest my words above appear to be too harsh a critique, Dwarkesh has published a great follow-up to his own interview of Richard Sutton that is nearly as interesting as the podcast.
I’ve been thinking about it myself. I have a much better understanding of Sutton’s perspective now than I did during the interview itself. So I want to reflect on it a bit.
Richard, apologies for any errors or misunderstandings. It’s been very productive to learn from your thoughts.
I think it is this openness and willingness to be humble that makes Dwarkesh a compelling listen/read. He doesn’t elevate himself above his guests, nor does he just cave to their opinions without challenging them to further explain. It’s good stuff.
However, a nitpick: Dwarkesh is still stuck on the idea that LLMs build a world model, when that is probably not the case - they can have the text of the rules of something without actually understanding it (no model to go with the text). Similarly, you can have an AI with a “world model” that isn’t as interesting or “intelligent-seeming” as an LLM. Fun times ahead!
This mismatch of expertise to narrative is punishing traditionally excellent journalists as well. Melanie Mitchell writes about how wrong Thomas Friedman’s views and explanations of AI are, as a result of “Magical Thinking”. It’s also a sobering read on how easily we can convince ourselves of things that just aren’t correct - and I’m being kind. Melanie’s work is excellent.
Wrapping it up
All of this leads me to believe that our hard-earned software engineering intuitions for estimating the future have been broken by the introduction of AI, and generative AI in particular. Having 90% of a solution solved does not, in the world of AI, say anything about how much cost and effort is required to solve the last 10% - except that if you need that last 10%, you can assume it will be much more expensive than the first 90%.
In order to predict how close you are to having your product or project meet your expectations, you need to build new intuition about the kinds of blind corners that AI-driven solutions can uncover, to understand why certain approaches will have diminishing returns or cannot be generalized - and why you might need to train several specific AI algorithms that are leveraged at run time to build a composite view (because a generalized model won’t get the job done).
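As a hedged sketch of what that run-time “composite view” might look like - the detector functions below are stubs standing in for separately trained narrow models, and none of the names reflect a real product or API:

```python
# Several narrow models, each superhuman at one thing, queried at run time
# and merged into one composite report. All detectors here are stubs.
from typing import Callable


def detect_fracture(image: bytes) -> float:  # stub narrow model
    return 0.12


def detect_nodule(image: bytes) -> float:  # stub narrow model
    return 0.87


def detect_effusion(image: bytes) -> float:  # stub narrow model
    return 0.05


NARROW_MODELS: dict[str, Callable[[bytes], float]] = {
    "fracture": detect_fracture,
    "nodule": detect_nodule,
    "effusion": detect_effusion,
}


def composite_view(image: bytes, threshold: float = 0.5) -> dict[str, float]:
    """Run every narrow model and keep only findings above a confidence threshold."""
    scores = {name: model(image) for name, model in NARROW_MODELS.items()}
    return {name: score for name, score in scores.items() if score >= threshold}


print(composite_view(b"..."))  # -> {'nodule': 0.87}
```

A human (or another system) still has to reconcile those findings into advice - which is exactly the gap the radiology example describes.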
We have much to learn about building systems with generative AI or machine learning as a central tent-pole, but all is not lost. I imagine these issues will look quaint 30 years from now, much as the software engineering issues of the 1980s seem kind of charming by comparison. Perhaps a few new “AI Laws” will be put into black and white to reflect this new dynamic of building solutions that we are not good at estimating.




