On hidden context: Behavioral science is the key to studying engineering effectiveness

In the last eight months, I've talked about our Developer Thriving framework with junior developers, with industry analysts, with crowds at conferences, and most of all, with engineering leaders and managers. It has been such an amazing journey since we launched this research project last March. I've gotten coffee and ice cream and tea with folks, walking down the streets of Stockholm and London and Salt Lake City and Chicago. I've heard about software teams that work on ships, that run hospitals and banks, that build core infrastructure or make art. And over and over again engineers ask me this question: how do I know what's working for my team?

Sometimes when I talk to engineering managers about doing diagnosis and triage and evaluation, I can *feel* the frustration radiating off them. Things that once seemed so simple from the outside -- our planning practices! How we diagnose friction! Use the DORA four and everything is fixed! Which software metric is good and which is bad! -- turn out to be wildly complex once you're the one on the ground, in charge of making the decisions about them. I can also feel how deeply these folks care. How much they want support in making the world of their software team a little bit better.

Often when we're faced with a challenge, particularly in this data-driven world, we begin to ask where the one true answer lives. We seek the security of the most immediate measures around us. "Just give me a score of team health!" This kind of call is poignant to me because I see the grain of truth in it. Engineering managers need better support. Everyone needs simple and actionable steps they can put into practice. People need measures that are tractable and translatable, that can travel into a quick sliver of time with a leader's ear. Sometimes a small measure in the right direction is better than no measure.

But the call to understand the sociocognitive reality of software teams will not be answered by something like DORA. That framework simply wasn't designed to do that. The SPACE framework, which gets a bit closer to the sociocognitive side of software work, tells you that "Satisfaction" matters, but as an exploratory set of ideas, it doesn't unpack how to create it or link back to the larger bodies of work on this in social science. Different work is designed to do different things, and that's why we need many areas brought in and many forms of evidence.

It doesn't surprise me that so many engineering managers come up to me after my talks to share their overwhelm. Becoming an evidence-based engineering manager is not easy, and I don't think we're giving managers the skills they need to succeed at first diagnosing, and then creating, sociocognitive structures that support their people.

But those skills do exist. Running an effective engineering team is a practice. Without much guidance, new engineering managers are trying to figure out what matters and why. And they're realizing that figuring out what matters means becoming aware of an entire world of hidden moderators and mediators. "A boss" can be a life-devastating tyrant or an incredible life-changing mentor. "A meeting" can be an amazing time of getting work done or a useless interruption. Practices change depending on how they're implemented in the real world. This is why when you have little information to go on, it's often more useful to ask "what makes a boss good & where do we see it" than set the bar incredibly high and ask, "are bosses good and by the way what's the best single score for a good boss that will always correctly identify it...?" The second question depends on understanding the first.

The thing is, measuring software team effectiveness and the related ways in which developers on a software team might feel well, fulfilled, happy and unhappy, engaged and motivated -- this is full of hidden moderators and mediators. Why does a planning ritual work for one team but not another? Because there's some key difference that changes *what that ritual is doing.* This is because "software team effectiveness" is about human behavior in its real world context. This doesn't render "metrics" meaningless. It just means they are happening at a certain layer of the problem. Many software metrics contain, but are not at sufficient resolution to diagnose, the consequences of our larger sociocognitive environments.

But we shouldn't stop at throwing our hands up and saying "oh well, it's all context" as if it's totally unknowable. Studying human behavior IS about mapping and giving us a compass through this context. We can measure those larger elements and by treating them as important moderators and mediators, we can understand why we see the same rituals and practices fail or succeed across many examples. We can then turn that over to engineering managers and help them gain access to a body of evidence they can adapt and translate to their own lives and the lives of their teams. This is the role behavioral science can play in the world, if we try very hard, if we take it seriously, if we build enough pieces of this plane while we're all flying in it.

Behavioral science is the key. It's where we test interventions. It's the place to learn the toolkits for context. Behavioral science is where we have bodies of evidence to draw on about how human beings interpret their performance environments. Some contexts matter more than other contexts, because they're causally related to the largest decisions we make about how to behave in our environment: I believe learning culture, agency, belonging, and self-efficacy are very good foundational signals for the vital sociocognitive contexts of our teams. They're amazing entry points for change. In our Developer Thriving framework, we didn't adapt & create these four software-team-specific measures of social science constructs because we thought that they made a good soundbite or a good acronym. We identified each factor because it was already robustly evidenced as a driver of sustainable human achievement and performance, by areas of work that we thought had strong cognitive translatability to software engineering. And then we tested that hypothesis, and found it supported.

And there are loads of other factors that also matter a lot to the human experience of developers -- like overall resources, and socioeconomic stability of the country the work is happening in, and quality of sleep, and home life, and health. But doing behavioral science and attempting to define important elements of context doesn't mean you have to fall down the rabbit hole of infinite measurement and analysis paralysis. Applied science always means weighing things and choosing to focus on factors that we have reason to believe will matter to the most people in the most immediately accessible ways. As applied scientists, we also use community-based methods like qualitative interviews and pilot testing to ensure developers' voices are casting the vote for what matters to them and generating unique and important edge cases/categories to consider for later measurement. And we build evidence in a series of studies, not from a single study. And we choose things to measure and document not just because they exist and we're nerds who love to learn, but fundamentally because we think they are strong candidates for interventions that will work. We try to help people now, because people need help now. The factors in Developer Thriving are also good practical targets because they are less distal than "wellbeing"; they are things we think managers and organizations can actually do something about. I consider myself an action researcher -- and I think software needs a lot more of this.

Here is a very magical thing about action research grounded in behavioral science. When you challenge yourself to really measure and include the hard context, even if imperfectly, it turns out all your other stuff can get easier. Once we have gone through this design process and can identify big sociocognitive levers that are practically significant and attainable, we grasp the shape of other things better. It's like you've turned on the light in your lab. Sure, you still have experiments to do, but many things will become obvious.

So it means when we get to the level of individual "work" measures, we suddenly understand a bit of the previously unmapped "context." For example: inside of a learning culture, software teams might use a velocity measure entirely differently, and use slowdowns as a signal for learning investment. But software teams that have a high contest culture, and expect to get punished for learning? Those teams might shy away entirely from velocity measures and consider them wildly inaccurate. So we can see learning culture as a core thing that changes the effect of "using a velocity measure" on a software team.
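To make the idea of a moderator concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the twelve teams, the outcome scores, and the names `TEAMS`, `velocity_effect`, and `effect_within` are hypothetical and do not come from the Developer Thriving studies. It compares teams that use a velocity measure with teams that don't, first pooled across everyone and then separately within high and low learning cultures.

```python
from statistics import mean

# Invented toy data for illustration only; NOT results from any real study.
# Each row: (learning_culture, uses_velocity_measure, team_outcome_score)
TEAMS = [
    ("high", True, 7), ("high", True, 8), ("high", True, 9),
    ("high", False, 6), ("high", False, 6), ("high", False, 6),
    ("low", True, 3), ("low", True, 4), ("low", True, 5),
    ("low", False, 5), ("low", False, 6), ("low", False, 7),
]


def velocity_effect(rows):
    """Mean outcome of teams using the measure minus mean outcome of teams that don't."""
    using = [score for _, uses, score in rows if uses]
    not_using = [score for _, uses, score in rows if not uses]
    return mean(using) - mean(not_using)


def effect_within(culture):
    """Same comparison, restricted to teams with the given learning culture."""
    return velocity_effect([row for row in TEAMS if row[0] == culture])


pooled = velocity_effect(TEAMS)   # ignores context entirely
high = effect_within("high")      # effect inside a high learning culture
low = effect_within("low")        # effect inside a low learning culture

print("pooled effect:", pooled)
print("effect in high learning culture:", high)
print("effect in low learning culture:", low)
print("difference between contexts (the moderation):", high - low)
```

In this made-up data the pooled comparison shows no effect at all, while the split shows the measure helping in one context and hurting in the other. Real analyses would need real measures, far more teams, and a proper statistical model, but the shape of the problem is the same: the context is sitting inside the average.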

Now, this doesn't mean that there isn't good work to be done to refine the overall quality of measures inside of our engineering organizations. But past a threshold of quality, consistency, and accuracy -- we can all easily figure out highly motivated reasoning that justifies choosing one software metric over another. And we will forever see them working and failing because of outside context. We can't even validate these measures if we're not taking this larger ecosystem seriously -- we'll be forever grabbing different pieces of the elephant. Are software metrics right or wrong? It depends on how we use them.

I believe that these conversations will never progress -- and software researchers will never help the engineering managers and teams that truly need our support -- if we don't take this fundamental multi-systems view of human behavior. That includes org culture, sociocognitive factors, and the messages in our orgs that maintain those individual beliefs. It includes bringing in the sciences that may fill in fundamental theories and mechanisms that haven't yet gotten their own methods in software teams. It includes being collaborative, translational, and seeking methodological understanding.

A chronic misconception that behavioral/social scientists have to face, in every industry and area we apply our work, is the idea that social science is "soft," that studying human things is "fuzzy," or "squishy." These words get thrown at work like our Dev Thriving studies, both intentionally and unintentionally devaluing the technicality of our work. But the study of sociocognitive mechanisms is also *technically* interesting, grounded in quantitative methods, and difficult. Often much more difficult than aggregating simple velocity and activity metrics, and necessary for interpreting these metrics. Fully taking a multi-systems view of human behavior is what will unlock our ability to see the intervention points, model change, and understand cause and effect inside of our most important environments -- our workplaces, our schools, our hospitals, our households. Our software teams.

Over the last eight months, I have come to believe in the real world impact of behavioral science more than ever before. There are millions of developers in this world who deserve this work. And we ignore it at our peril -- because all of us depend on their work.