
A Fresh Perspective on Forecasting in Software Development

Philip Rogers · Published in A Path Less Taken · 18 min read · Apr 16, 2023


In a recent post I delved into some foundational concepts from the book Noise: A Flaw in Human Judgment, by Kahneman, Sibony, and Sunstein. I’m now going to dig deeper into some of the specific ramifications of both bias and noise, in terms of how they surface in forecasts.

Examples of Problems with Forecasts

In just about any domain where human judgement is involved, and where there is a need to make a forecast, the evidence of both bias and noise is present. As the authors observe:

“Analysts of forecasting — of when it goes wrong and why — make a sharp distinction between bias and noise (also called inconsistency or unreliability). Everyone agrees that in some contexts, forecasters are biased. For example, official agencies show unrealistic optimism in their budget forecasts. On average, they project unrealistically high economic growth and unrealistically low deficits. For practical purposes, it matters little whether their unrealistic optimism is a product of cognitive bias or political considerations.

… Forecasters are also noisy. A reference text, J. Scott Armstrong’s Principles of Forecasting, points out that even among experts, ‘unreliability is a source of error in judgemental forecasting.’ In fact noise is a major source of error. Occasion noise is common; forecasters do not always agree with themselves. Between-person noise [system noise] is also pervasive; forecasters disagree with one another, even if they are specialists.”

A strange thing about forecasts is that we view some of them with inherent skepticism (weather forecasts being an obvious example), while for others, many of us tend to lend the forecast a higher degree of credibility, often because we place trust in people we view as having an expert opinion, or because we have a “gut feeling” about it (such as which team is favored to win a basketball game). And yet, there is error in just about every form of forecast, and in some cases, the error rates are quite high, much higher than we might care to believe.

The Good Judgement Project

The authors describe the Good Judgement Project in some detail, and it’s worthwhile to summarize the nature of this project. The project, launched in 2011, was led by the behavioral scientists Philip Tetlock, Barbara Mellers, and Don Moore. Not only did they seek to achieve a better understanding of forecasting, they also wanted to determine why some people are better at it than others.

It’s important to point out the following things about this study:

  • it was massive, with tens of thousands of volunteers
  • it included a broad sample of people, with varied backgrounds (not just specialists or experts)
  • it included hundreds of broad questions about world events, such as whether Russia would officially annex Ukrainian territory within the next three months, or whether any country would withdraw from the Eurozone within the next year

Based on the preceding description, the following things might not be evident about the study:

  • a large number of forecasts were used (to eliminate random chance being responsible for success or failure); that is, the researchers evaluated how respondents did, on average, across numerous forecasts
  • questions were articulated in terms of probability that an event would happen, not an absolute will/will not happen (that is, they used probabilistic forecasts, not deterministic forecasts — we’ll return to these concepts later)
  • participants had the opportunity to update their forecasts as new information became available (in much the same way that professional forecasters update their judgements in light of emerging information)
  • to score forecaster performance, a measure called a Brier score was used, which captures the gap between the probabilities people forecast and what actually transpires (see the short sketch after this list)
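
To make the scoring idea a bit more concrete, here is a minimal sketch (in Python) of how a Brier score can be computed for a set of yes/no forecasts. The forecasts and outcomes below are invented purely for illustration.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and what happened.

    forecasts: probabilities (0.0 to 1.0) that each event will happen
    outcomes:  1 if the event happened, 0 if it did not
    Lower is better; 0.0 is a perfect score.
    """
    assert len(forecasts) == len(outcomes)
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Invented example: three forecasts and what actually transpired.
print(brier_score([0.9, 0.3, 0.6], [1, 0, 1]))  # ≈ 0.087
```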

To summarize the findings from the Good Judgement Project:

  • most respondents did poorly on their forecasts
  • about 2% of forecasters stood out as doing far better than would be possible due to mere chance (Tetlock used the term “superforecasters” for this group)

How well did the superforecasters do? “Remarkably, one government official said the group did significantly ‘better than the average for intelligence community analysts, who could read intercepts and other secret data.’” All of this leads to an obvious question: in what ways do superforecasters differ from everyone else? In the superforecaster group, there was evidence of a couple of the more predictable traits, such as higher-than-average intelligence and mathematical aptitude. However, what stood out, above all else, was that

superforecasters are good at thinking analytically and probabilistically

In practice, Kahneman, Sibony, and Sunstein observe, thinking analytically and probabilistically looks a lot like this:

  • the ability to structure and disaggregate problems

“Rather than form a holistic judgement about a big geopolitical question (whether a nation will leave the European Union, whether a war will break out in a particular place, whether a public official will be assassinated), they break it up into its component parts. They ask, ‘What would it take for the answer to be yes? What would it take for the answer to be no?’ Instead of offering a gut feeling or some kind of global hunch, they ask and try to answer an assortment of subsidiary questions.”

  • the tendency to take an outside view and consider base rates

“Asked whether the next year will bring an armed clash between China and Vietnam over a border dispute, superforecasters do not focus only or immediately on whether China and Vietnam are getting along right now. They might also have an intuition about this, in light of the news and analysis they have read. But they know their intuition about one event is not generally a good guide. Instead, they start looking for a base rate: they ask how often past border disputes have escalated into armed clashes. If such clashes are rare, superforecasters will begin by incorporating that fact and only then turn to the details of the China-Vietnam situation.”
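
As a toy illustration of the “outside view first” habit (my own simplification, not something the authors prescribe), the sketch below anchors on a base rate and only then adjusts toward a case-specific, inside-view estimate. The base rate, the inside-view number, and the blending weight are all invented.

```python
# Toy "outside view first" blend: anchor on a base rate, then adjust toward
# an inside-view (case-specific) estimate. All numbers are invented.
base_rate = 0.05       # how often similar disputes escalated in the past
inside_view = 0.30     # gut-level read of this specific situation
weight_on_base = 0.7   # how strongly to anchor on the outside view

forecast = weight_on_base * base_rate + (1 - weight_on_base) * inside_view
print(f"blended forecast: {forecast:.3f}")  # 0.125, much closer to the base rate
```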

Perpetual Beta

Conveniently, since I will soon be turning my attention to forecasting in a software development context, Tetlock used the phrase “perpetual beta,” stating that

“the strongest predictor of rising into the ranks of superforecasters is perpetual beta, the degree to which one is committed to belief updating and self-improvement… What makes them so good is less what they are than what they do — the hard work of research, the careful thought and self-criticism, the gathering and synthesizing of other perspectives, the granular judgements and relentless updating.”

The good news is that via another study that Tetlock and his team conducted, they were able to determine that it is possible for people to get better at forecasting. For this study, they divided ordinary forecasters (non-superforecasters) into three groups, and they tested the extent to which the following interventions helped them improve:

  1. Training. The training consisted of a tutorial covering probabilistic reasoning, different forms of bias, and the value of averaging more than one prediction from different sources, among other things.
  2. Teaming. Some forecasters were instructed to work in groups, where there was active dialog and debate about one another’s predictions.
  3. Selection. All of the participants were scored for accuracy, and the top 2% were selected at the end of a full year, and put into elite teams that worked together the following year.

Ultimately, all three interventions were effective, in that each one improved the participants’ ability to forecast, where:

  • training made a difference
  • teaming made a bigger difference
  • selection had the largest overall effect

Definitions of Terms

Before we dive into the wonderful world of forecasting in a software development context, let’s revisit the definitions of some terms that the authors use, which I mentioned in my previous post. One key difference is that in the definitions below, I use examples not from people working in a judiciary context, as the authors do, but instead from decision-makers working in a software development context.

Bias and Noise

  • Bias. A skew in the results for judgements of the same problem that follows an observable pattern, where most errors are in the same direction. (Bias can be seen as the average error, e.g., when people making forecasts on when something will be done are overly optimistic, and more corrosively, when leaders choose to ignore forecasts and insist that something must be done by a deadline that practitioners insist is impossible.)
  • Noise. Variability in results for judgements of the same problem that does not follow an observable pattern, and where the judgements can reasonably be expected to be identical. (Noise can also be seen as any error that remains after bias has been removed; it is often manifested as system noise (see definition below), for example when a company hears from different practitioners about forecasts, such as engineering leads in different business units, who might give considerably different answers about how long the same or a similar body of work might take.)
  • Level noise. Variability in results for judgements of the same problem, where the variability may be associated with the values or pre-dispositions of the decision-maker. (That is, level noise is “the variability of the average judgements made by different individuals,” e.g., a practitioner who is known for giving particularly pessimistic or optimistic forecasts.)
  • Pattern noise. Variability in results for judgements of the same problem, where the variability may be associated with a particular judgement scenario or with the particulars of a specific case. (That is, pattern noise is often manifested as “principles or values that the individuals follow, whether consciously or not,” e.g., a practitioner is asked to make a forecast where they have to consider things outside their area of expertise, and thus, even if they typically give more optimistic forecasts in general, they are understandably less likely to be optimistic when trying to factor in unknowns.)
  • System noise. The sum of level noise and pattern noise or, to be more precise: (System Noise)² = (Level Noise)² + (Pattern Noise)². (See the sketch after these definitions for one way to make the decomposition concrete.)
  • Occasion noise. Variability in results for judgements of the same problem, where the variability may be associated with a person’s circumstances at the moment of decision, due to factors such as what mood they’re in, how well-rested they are, what time of day it is, what the weather is like, and whether or not they are hungry. (It’s a transient component of pattern noise, e.g., two Managers have just had a heated debate with a Director they report to before walking into a meeting where a key decision is to be made.)
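
To make the level/pattern/system distinction a bit more tangible, here is a rough, simplified sketch of the kind of decomposition a small “noise audit” might perform, using a made-up table of estimates in which several practitioners each size the same set of work items. Treat it as an illustration of the definitions above, not a rigorous implementation.

```python
import statistics

# Rows: estimators (judges); columns: the same set of work items (cases).
# Values: estimated days. All numbers are invented for illustration.
estimates = [
    [3, 5, 8, 2],   # estimator A
    [4, 7, 9, 3],   # estimator B
    [2, 4, 12, 2],  # estimator C
]

grand_mean = statistics.mean(v for row in estimates for v in row)
judge_means = [statistics.mean(row) for row in estimates]
case_means = [statistics.mean(col) for col in zip(*estimates)]

# Level noise: variability of the estimators' average judgements.
level_noise_sq = statistics.pvariance(judge_means)

# Pattern noise: the judge-by-case interaction that remains after removing
# each estimator's overall tendency and each item's overall difficulty.
residuals = [
    estimates[j][c] - judge_means[j] - case_means[c] + grand_mean
    for j in range(len(estimates))
    for c in range(len(estimates[0]))
]
pattern_noise_sq = statistics.mean(r ** 2 for r in residuals)

# System noise, per the formula above.
system_noise_sq = level_noise_sq + pattern_noise_sq
print(level_noise_sq, pattern_noise_sq, system_noise_sq)
```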

Group Decision-Making

In a group decision-making setting, the following additional definitions apply:

Noise amplification. The appearance of any one of the following factors, which can introduce variability into a group’s judgement of the same problem:

  • Social influence. The extent to which the members of the group are exposed to the judgements of another individual or group, before making their own judgement. (Example: members of a software development team always defer to the judgement of their Tech Lead when choosing a way to solve a complex problem.)
  • Informational cascade. The extent to which the order in which a member of the group gets a chance to offer their opinion affects the opinion that they give. (Example: A Vice President is the first person to give an opinion in a meeting, then a Director, and little dialog occurs among the rest of the meeting participants before making a decision about what deployment strategy to take for an upcoming release.)
  • Group polarization. The extent to which a pre-existing view moves further in the same direction when hearing similar points of view expressed by others in the group. (Example: The members of a team who need to make a recommendation on whether to move from one cloud provider to another, and who start the conversation with only mild reservations about the move, leave the meeting absolutely sure, as a group, that it would be a terrible idea.)

Challenges with Forecasting in Software Development

Even in light of the preceding information about how problematic forecasting is in general, and how bias and noise impact the ability to forecast, what is it about software development that makes it unique? A great deal has been written about software estimation and forecasting, which I will not attempt to repeat here.

I will instead attempt to summarize with this set of assertions:

  • Each implementation is unique. Although there might be similarities with other work that has been done, the unique combination of the problem to be solved, and <n> ways to solve it, is characteristic of “knowledge work” in general, and is especially true in software development.
  • Taking shortcuts leads to disaster. Especially when under schedule pressure, teams often feel they have no choice other than to use a third-party library that they have security or performance concerns about, to put off bug fixes, or to do less testing, to name a few examples. All of these things are manifestations of technical debt, and nothing is more pernicious if allowed to accumulate over time. Ultimately, mounting technical debt means the team needs more time, not less, to release features, and could eventually make an application difficult or impossible to maintain.
  • Estimates are prone to error. For reasons including, but not limited to, the assertions I make here, it is folly to believe that it’s possible to make software estimates “more exact.” In the face of uncertainty, given a number of variables that require consideration, and the normal process of discovery that occurs, spending more and more time on estimation is not only unlikely to result in better estimates (whatever that means ;), it is a form of waste.
  • Adding more people is often less helpful than it might appear on the surface. Leaders often jump to the conclusion that throwing more people at a problem when things are tracking on a slower trajectory than desired will automatically mean getting what they want sooner. As Fred Brooks famously asserted, “adding manpower to a late software project makes it later.” This maxim, often known as Brooks’ Law, reflects the reality that when one or more people are added to a team, it initially slows the team down, since the new people need to partner with existing members of the team to get up to speed. Adding more people could result in faster delivery, just not right away.
  • Keeping people highly utilized impedes flow. Leaders in particular are prone to the mistake of insisting that every person be 100% utilized. The metrics of flow tell a completely different story: keeping Work In Progress (WIP) low positively correlates with improvements in both Throughput and Cycle Time (see the brief illustration after this list).
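
As a back-of-the-envelope illustration of that last point, Little’s Law (average Cycle Time ≈ average WIP ÷ average Throughput) shows why piling up WIP at a roughly fixed Throughput stretches Cycle Time. The numbers below are made up.

```python
# Little's Law (steady state): average cycle time ≈ average WIP / average throughput.
# Illustrative numbers only.
throughput_per_week = 5  # completed items per week (assumed roughly fixed)

for wip in (5, 10, 20):
    cycle_time_weeks = wip / throughput_per_week
    print(f"WIP={wip:>2} items -> average cycle time ≈ {cycle_time_weeks:.1f} weeks")
```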

Bias and Noise in Software Estimates

There are plenty of other things that I could mention, but I’ll stick to the five assertions above, which collectively account for much of the consternation and uncertainty around software development forecasts. Let’s also consider the extent to which bias and noise typically exist when making forecasts in a software development context, and to that end, I make the following additional assertion:

A huge amount of bias and noise exists in software forecasts

I certainly could go into an excruciating amount of detail about manifestations of bias and noise in software forecasts, just based on what I’ve seen over the course of a long career, not to mention what I’ve read or heard about from others. I’m going to do my best to keep it simple, and make the next assertion:

Software estimates inherently have both bias and noise

Bias

Let’s start with bias. Software estimates are most often (and some would say almost always) biased in the direction of being overly optimistic. There are many reasons why this is so. To name just a few of them:

  • Vague requirements. Opinions might differ on what a “well-formed user story” might look like, in terms of level of detail (some of which is context-dependent); however, there is no question that missing or incorrect information results in tasks taking longer, often manifested as rework.
  • Discovery. Even when user stories are consistently well-articulated, based on the information that is available, it is a common occurrence for information to emerge that can have significant schedule ramifications.
  • Technical debt. As mentioned above, the shortcuts that teams might take have consequences, and the cumulative impact of those decisions grows with the passage of time, making it harder to deliver.
  • Repeatability. Although it is certainly true that some tasks are essentially clones of one another, it is often the case that what might have seemed on the surface to be virtually identical tasks prove to have important differences which could manifest in all sorts of ways.
  • Wait time. The greater the number of hand-offs that need to occur, the higher the likelihood that something will take longer to complete than originally planned.
  • Availability. A lack of availability can be manifested in both a technical and a people sense. In a technical sense, it might mean that a job failed to run or that an environment needs to have a data refresh, just to name a couple of examples. And in a people sense, it could be because of changes to team composition, because team members are frequently context-switching, or because a team member is simply unavailable, since “life happens.”
  • Telling people what they want to hear. It is painfully common for teams to provide estimates that are more optimistic than they think is realistic, most often because they fear that a realistic estimate would not be acceptable to one or more stakeholders.

Noise

The simplest way to express why noise exists in software estimation is to consider how it is most often done in an agile team context. Given many of the challenges articulated above, it’s common to use abstract units called story points. I’m not going to go into all the whys and wherefores related to story points here; that’s well-trodden territory.

What I will say is that, as commonly practiced, software estimation is likely to suffer from all of the noise amplification factors that are characteristic of group decision-making, as described above:

  • Social influence
  • Informational cascade
  • Group polarization

Furthermore, at least one of the following types of noise is likely to be present when doing software estimation. Let’s list them once again:

  • Level noise
  • Pattern noise
  • Occasion noise

Reminder: Level noise and pattern noise are sub-components of system noise.

What Are the Implications for Software Estimation?

If we accept that some bias and some noise is inherent in software estimation, then it’s not a huge leap to make assertions such as the following:

  • Even under the best of circumstances, software estimates have a significant amount of error
  • In zero-sum terms, time spent on software estimation is time NOT available for other forms of work

I’ll pause here, because I fully recognize I’m wading into potentially treacherous waters, where some readers may feel like I’m pushing them toward “no-estimates.”

What I’ll offer instead is a more nuanced view, and to do that, I’ll use Ron Jeffries’ 3 C’s (Card — Conversation — Confirmation). One of the things that the 3 C’s does so beautifully is to articulate the importance of alignment in a team context, that is, shared understanding. How teams arrive at a shared understanding is up to them. For example, it could mean any of the following (and certainly other variations exist):

  1. Every story a team works on has to have an estimate
  2. Stories don’t have to include estimates, but they do need to be clearly understood with respect to complexity, dependencies, and relative effort
  3. As long as a story is small enough to be completed within a <team-agreed time limit>, it can be worked on

Decision-Making Heuristic for Minimalist Estimation

Let’s zero in on Option 3 for a moment. Returning to the 3 C’s, a team working in this fashion may very well need to drill down into details; however, the time they spend in conversation tends to result in capturing important nuances and possibly clarifying the requirements.

Feel free to leverage these steps to come up with your own decision-making process, for your team:

Agree on a maximum duration for a ticket. (To start with, you might want to set a relatively easy target for the team, like a week, or even a bit more, depending on your situation.)

  • For instance, let’s say three days is the maximum duration that the team decides on
  • Let’s further assume that the duration of tickets the team works on ranges from 1 to 3 days
  • If the ticket size is at or below the maximum, there is no need to talk about it further (at least not from a sizing perspective)
  • If the ticket size is larger than the maximum, it likely needs to be split, such that none of the split (child) tickets exceeds the maximum (see the sketch after this list)
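
If it helps to make the heuristic concrete, here is a tiny sketch of the decision rule described above. The three-day maximum and the function name are placeholders for whatever your team agrees on.

```python
MAX_DURATION_DAYS = 3  # team-agreed maximum; pick whatever suits your context


def sizing_decision(estimated_days: float) -> str:
    """Apply the minimalist sizing rule to a single ticket."""
    if estimated_days <= MAX_DURATION_DAYS:
        return "No further sizing conversation needed; pull it when capacity allows."
    return "Split the ticket so that no child ticket exceeds the maximum."


print(sizing_decision(2))  # within the limit
print(sizing_decision(8))  # needs to be split
```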

Note: Some teams may also find it useful to agree on a sizing convention they can use, to help them make the transition to either spending less time on estimation, or none at all. Let’s say they agree on T-shirt-sizes, and start with a convention something like this:

  • Small. Less than a day.
  • Medium. From one to three days.
  • Large. More than three days.

Concluding Thoughts About Estimation

Coming back to what no-estimates advocates might say, the way I would put it is that it comes down to YAGNI (You Aren’t Gonna Need It) for estimation in general. And by extension, YAGNI applies to Velocity as well, the ever-popular, and yet much-maligned, metric so closely associated with so many agile teams.

An important reason behind this assertion is that what tends to surface, when looking at team data closely, is that Throughput (a simple count of completed work items) is at least as good an indicator for making forecasts as Velocity.

As before, I will refrain from a lengthy discussion of Velocity and Throughput here. If you’d like to read more, there is plenty to read on this topic. If you’d like to check out my perspective, see:

Probabilistic Forecasting

So, dear reader, if you’ve stuck around this long, I’m not going to just leave you hanging. Regardless of what I think or what anyone else thinks about fraught topics like estimation and Velocity, the good news is that making probabilistic forecasts is open to all of us. And on that note, I will make another assertion:

Probabilistic forecasting is a great way to show respect for people when making forecasts in a software development context

The reason I articulate it this way is because:

  • if and when we are transparent with stakeholders (which is much easier to do with a probabilistic forecast than a deterministic forecast), we are showing them respect
  • the very process of making a probabilistic forecast — which consists of a range of outcomes — is a way of respecting our teams, because it’s far better at reflecting the uncertainty that is inherent in software development, and does so in a way that makes a “death march” scenario less likely

Note: Probabilistic forecasting does not magically make all problems go away. However, when it comes to expectation-setting, it is a step in a positive direction for all parties.

The Difference Between Deterministic and Probabilistic Forecasts

Let’s start with a conventional example of a forecast, where we might be looking about a calendar quarter into the future, and say something like this, if we’re working on the next release of a “Consumer Confidence Assessor” app:

“Based on the work in the Product Backlog that’s in-scope for Release 2.0 of the Consumer Confidence Assessor (and some additional assumptions, like how many teams will be working on it), we think we can finish Release 2.0 in 12 weeks” (or 6 Sprints, if we’re using Scrum and are on a 2-week Sprint cadence).

The example above, which is focused on a single possible outcome (finishing in 12 weeks, in this case), is a “deterministic forecast.”

A probabilistic forecast has two components, not one:

  • A range
  • A probability

Getting back to the example above, a probabilistic forecast would sound more like this for the “Consumer Confidence Assessor” app:

“There is an 85 percent chance we can finish Release 2.0 in 12 weeks (6 Sprints).”

It’s important to point out that the outcome in the example above is one of a range of outcomes. As part of that same analysis, we might also have concluded that there is a “60 percent chance we can finish Release 2.0 in 10 weeks (5 Sprints).” (And similarly, if we set a date further out on the calendar, the probability would increase accordingly.)

Reminder: When doing probabilistic forecasting, it’s possible to use either Velocity or Throughput.
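
To show what producing a range-plus-probability statement can look like in practice, here is a minimal Monte Carlo sketch that resamples historical weekly Throughput to estimate how long a backlog might take to finish. The throughput history, backlog size, and confidence levels are all invented for illustration, and a real forecast would need more care (for example, accounting for backlog growth, scope change, and team changes).

```python
import random

random.seed(7)

# Invented history: items completed in each of the last 12 weeks.
weekly_throughput = [4, 6, 3, 5, 7, 2, 5, 4, 6, 5, 3, 6]
backlog_size = 55       # items believed to be in scope for the release
simulations = 10_000

weeks_needed = []
for _ in range(simulations):
    remaining, weeks = backlog_size, 0
    while remaining > 0:
        remaining -= random.choice(weekly_throughput)  # resample a historical week
        weeks += 1
    weeks_needed.append(weeks)

weeks_needed.sort()
for confidence in (0.60, 0.85, 0.95):
    idx = int(confidence * len(weeks_needed)) - 1
    print(f"{int(confidence * 100)}% chance of finishing within {weeks_needed[idx]} weeks")
```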

Bias and Noise in Software Forecasts

It will probably not come as a surprise, based on what I described above, that I also make the following assertion:

Software forecasts inherently have both bias and noise

A large component of that bias and noise is directly attributable to software estimates. So, as a starting point, it’s helpful to take a critical (and yes, unbiased) look at how the software forecasting process works in your organization, and the extent to which estimates contribute to bias and noise. It’s also important to look at all of the considerations that do (or don’t) factor into making those forecasts, some of which might (or might not) be directly related to estimates. To name several examples:

  • How frequently people move from team to team
  • How familiar (or unfamiliar) a particular business domain might be to a team
  • How common it is for a team to need to pivot in a different direction
  • How much unplanned work there is, for example, production incidents that pull one or more team members away

Conclusion

It is my hope that starting this post with the valuable perspective on bias and noise provided by Kahneman, Sibony, and Sunstein helps frame the conversation about forecasting in software development in a different light. If nothing else, I hope it leads to open and honest dialogue on this topic.

Regardless of what methods an organization might choose to use, continuous learning and continuous improvement are vital. May we all consider the Good Judgement Project and its “perpetual beta” as a model when seeking to improve with respect to decision-making, recognize bias and noise for what they are, and take steps to minimize how often they surface.
