
Rethinking On-Call: Incident and Postmortem Lessons from Top Dropbox and AWS Engineers

In my previous post, The Science of On-Call Anxiety: Phantom Pages, Sleep Debt, and Alert Fatigue, I focused on the cost of oncall and how to defend against it. This article takes a different angle: mindset.

I recently listened to two interviews, one with James Cowling, formerly Dropbox’s most senior engineer, and another with AWS Distinguished Engineer Marc Brooker. Both were eye-opening, but what surprised me most was their shared attitude toward oncall and incident response: they did not avoid it. Marc Brooker went further and actively chose the work, staying on call for 15 years.

That made me ask a different question: what if incident response is not something we should automatically resist? Maybe I had been treating it as a pure burden and threat, and missing its other side: a fast track for learning and a way to demonstrate ownership. This post is my reading notes and reflection.

TL;DR: Three mindset shifts

If you only take away three things, let them be these:

Oncall is one of the best places to learn: Marc Brooker said that most of his practical knowledge about distributed systems came from being on call and reading postmortems. Treat incidents as first-hand evidence of how systems really behave, not just as annoying work.
Separate firefighting from learning: Repeatedly closing the same kind of ticket is a sign that the work should be automated. Unusual incidents are the gold mine worth deep investigation.
Replace complaining with ownership: James Cowling’s core mindset was simple: “if there is no one else, it is us.” Incident response is not something that happens to you; it is a choice to say, “I will help fix this.”

mindmap
  root(("Positive Oncall Mindset"))
    Oncall is a learning field (Marc Brooker)
      Practical knowledge comes from oncall
      See how systems behave in reality
      See how customers really use them
    Separate two kinds of work
      Repeated tickets should be automated
      Unusual incidents deserve deep dives
      The goal is understanding, not just closure
    The power of postmortems
      First understand what really happened
      Ask why at multiple levels
      Look across incidents for patterns
      Turn lessons into tools
    Watch out for heroism traps
      Being paged 100 times feels impressive
      Inside, it looks like ownership
      Outside, root causes remain unfixed
    Ownership over complaint (James Cowling)
      If there is no one else, it is us
      High ownership does not mean doing everything alone
      Ask whether we should fix this now
      Commit to a cause

1. Why top engineers lean toward oncall

In the interview, the host asked Marc Brooker a simple question: many senior engineers try to negotiate their way out of oncall because it looks like low-leverage, high-friction work. Why did he stay on call for 15 years?

His answer challenged my assumptions:

I would say that the majority of my in practice knowledge about how to build distributed systems has come from being on call and analyzing and deeply understanding these post mortems and COEs. - Marc Brooker

His point was that junior engineers usually have solid CS fundamentals, coding ability, and math skills, but they often lack grounded knowledge: how systems actually behave, how they fail, and how customers really use them, including the ways they misuse them. Oncall is one of the most direct ways to gain that knowledge.

In other words, oncall is not only an operational cost. It is also a high-bandwidth learning channel. When you are on the front line of an incident, you see how the system behaves under real pressure, not just how the design doc says it should behave. For engineers who want to grow faster, that is an advantage other people cannot easily copy.

2. Separate “firefighting” from “learning”

Marc also made it clear that this does not mean you should become a mindless firefighting team. He separated oncall work into two categories:

Repeated, predictable tickets: if you keep closing the same ticket over and over, that is toil and should be automated. Automation is easier and more powerful now than it used to be, and senior engineers should not spend all their time on this.
Unusual, unexpected incidents: these are the cases worth investigating deeply. The goal of oncall should be to understand weird or surprising system behavior, bring that learning back into the system, and share it across the company and the wider community.

If you have folks in your teams who are on call and they’re just closing the same ticket over and over and over, well, that’s where you need to just build some automation. - Marc Brooker
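One way to make this split concrete is to group a shift's tickets by a signature and see which ones recur. The sketch below is only an illustration of the idea, not something from the interview: the `Ticket` fields, the `(service, alert)` signature, and the threshold of three repeats are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    # Hypothetical fields; real ticketing systems will differ.
    service: str
    alert: str


def split_tickets(tickets, repeat_threshold=3):
    """Split ticket signatures into automation and deep-dive candidates.

    A signature seen at least `repeat_threshold` times is treated as toil;
    everything else is treated as unusual. The threshold is an assumption.
    """
    counts = Counter((t.service, t.alert) for t in tickets)
    toil = sorted(sig for sig, n in counts.items() if n >= repeat_threshold)
    unusual = sorted(sig for sig, n in counts.items() if n < repeat_threshold)
    return toil, unusual


# Made-up sample data for one shift.
tickets = [
    Ticket("billing", "disk_full"),
    Ticket("billing", "disk_full"),
    Ticket("billing", "disk_full"),
    Ticket("search", "latency_spike_after_deploy"),
]
toil, unusual = split_tickets(tickets)
print("automate:", toil)
print("investigate:", unusual)
```

A count like this cannot tell you whether a one-off incident is interesting, only that it is not a repeat. It is a starting point for deciding where attention should go.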

This distinction was especially helpful to me. In the previous post, I talked about reducing noise: if a shift has more than two incidents that truly need attention, alerting should be reviewed. The point here is the next step: once you remove noise, invest the saved attention in the incidents that are actually worth understanding.

The value of oncall is not how many tickets you close. It is how much you learn about system behavior that you did not understand before.

3. What a good postmortem looks like

Marc estimates that throughout his career he has read around 3,000 to 4,000 industry postmortems and Amazon internal COEs (Correction of Errors). As he put it, even if each one teaches you only a little, that knowledge compounds.

So what does a good postmortem look like?

First, understand what actually happened. Do not start with your own assumptions.

If you can’t understand what happened, well, that teaches you something about your logging and metrics and observability. - Marc Brooker

In other words, if you cannot reconstruct the incident itself, that is a signal that your logging, metrics, or observability have a gap.

Second, keep asking why at different layers. Do not stop at the proximal cause:

There was a code bug. Good, fix it. But do not stop there.
Why did tests or validation miss it?
Why is our test process shaped this way?
Why did we make that assumption about system behavior in the first place?

A strong postmortem does not just fix one line of code. It also improves the testing process, team process, and organizational process.
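The layered questions above can be sketched as a small record that makes it obvious when a postmortem stopped at the code bug. This is a minimal sketch under my own assumptions: the `Postmortem` class, the four layer names, and the sample findings are hypothetical and do not come from Amazon's COE format.

```python
from dataclasses import dataclass, field

# Hypothetical layer names, loosely following the questions above.
LAYERS = ("code", "testing", "process", "assumption")


@dataclass
class Postmortem:
    title: str
    # Maps a layer name to the finding recorded for that layer.
    findings: dict = field(default_factory=dict)

    def missing_layers(self):
        """Return the layers that have no recorded finding yet."""
        return [layer for layer in LAYERS if not self.findings.get(layer)]


# Made-up example: the analysis covers the bug and the tests, nothing deeper.
pm = Postmortem(
    title="Checkout errors after deploy",
    findings={
        "code": "Null check missing in retry path",
        "testing": "Integration tests never exercised retries",
    },
)
print(pm.missing_layers())  # ['process', 'assumption']
```

An empty result does not prove the analysis is good. It only shows that each layer was at least asked about.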

Third, look across incidents for patterns. If the same class of issue keeps showing up, step back and ask how to abstract it: can it become a service, a library, a community of practice, or a technical control that removes an entire category of problems? Marc mentioned Aurora DSQL as an example. While designing it, the team reviewed many relational-database-related postmortems and deliberately designed out common failure modes, such as paused or disconnected clients holding transaction locks for too long.
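Looking across incidents can start as something very simple: tag each postmortem with the failure classes involved and count which classes recur. The sketch below assumes a hypothetical dictionary shape with `service` and `classes` keys, and the tags and sample data are made up for illustration.

```python
from collections import defaultdict


def recurring_classes(postmortems, min_incidents=2):
    """Return failure classes seen in at least `min_incidents` postmortems,
    mapped to the sorted list of services they affected."""
    services_by_class = defaultdict(set)
    incidents_by_class = defaultdict(int)
    for pm in postmortems:
        # Deduplicate tags so one postmortem counts once per class.
        for cls in set(pm["classes"]):
            incidents_by_class[cls] += 1
            services_by_class[cls].add(pm["service"])
    return {
        cls: sorted(services_by_class[cls])
        for cls, n in incidents_by_class.items()
        if n >= min_incidents
    }


# Made-up sample data.
postmortems = [
    {"service": "orders", "classes": ["long_held_lock"]},
    {"service": "billing", "classes": ["long_held_lock", "bad_config"]},
    {"service": "search", "classes": ["retry_storm"]},
]
print(recurring_classes(postmortems))  # {'long_held_lock': ['billing', 'orders']}
```

A class that shows up across several services is a candidate for a shared fix, such as a library or a technical control, rather than another per-team patch.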

How do we turn all of these lessons into new services and into service improvements? - Marc Brooker

The interview also mentioned AWS’s weekly Wednesday COE meeting: cross-team leaders read postmortems together, discuss what they learned, and think about how to apply those lessons across the company. Marc sees this mechanism as close to a direct cause of AWS’s success, because it forces the best engineers to spend their time deeply understanding why systems behave the way they do.

4. Beware the heroism trap: it feels good, but it is wrong

This part matters a lot if, like me, you tend to equate “I held the line” with “I was responsible.”

Marc observed that teams with weak postmortem culture usually fall into two failure modes:

Failure mode 1: not caring enough about outcomes. The team does not care enough whether the product is actually working well or whether customers are satisfied. That is a leadership and culture problem, because it means the team is not setting the right standards.

Failure mode 2: normalizing operational heroism.

We don’t need to fix these root causes because our on calls are superheroic and they’re going to stay up all night and they don’t mind being paged 100 times a week. - Marc Brooker

From the inside, this can look like strong ownership: the people are highly committed, extremely capable, willing to hack through the night, and never complain about being paged 100 times. But from the outside, the truth is different: the team is not fixing root causes. It is just feeding a group of highly responsible people into an expensive break-fix loop.

The most dangerous part is that the heroism makes the whole operation feel good. That sense of “we care about customers and we are working hard” makes it difficult to admit that effort is being spent at the wrong layer. The real move is to redirect that energy into postmortems, root-cause analysis, and strategic improvements. Once you break the loop, you often free up a surprising amount of time to actually improve the product.

Being proud of getting paged 100 times in a week and surviving the night is not a badge of honor. It is a signal that the system needs attention. Individual toughness cannot solve structural problems.

5. James Cowling: replace complaining with “I will fix it”

If Marc Brooker gave me the lens that “oncall is a learning field,” James Cowling gave me the deeper mindset for handling incidents and system problems.

He recalled a time when Dropbox’s infra team was only seven or eight people. Someone who joined from Google said, “Someone should build a logging framework, because I cannot do my job without it.” James’s reaction was:

Well, who is someone? Because it’s just us. It’s just us. We build it or we don’t build it. There’s no other idiots out there. We’re the idiots. - James Cowling

This kind of accountability is a useful mindset for incident response too. Too often, we treat incidents as a burden that was dropped on us and quietly wait for “someone better suited” to show up. But very often, that “someone” is you, the person who just got paged.

When James led teams, he intentionally tried to model that kind of ownership around oncall and paging:

Very specifically, with regards to on-call and people getting paged, I wanted people to have high ownership. I’d be the first one to respond to a page. I would always be writing up the reports, really falling over myself to show how I wanted people to be. But from their perspective, all they see is that the lead is just doing all these jobs. - James Cowling

He wanted the team to jump in quickly when something happened, so he personally went first: first to respond, first to write the reports, first to grab the bug. But the result was a little counterproductive. What others saw was not “I should jump in too,” but “this must be James’s job” or “maybe he is the only one who knows how to do this.” That is the subtle trap: ownership cannot be demonstrated by one person doing everything. The more pages you absorb yourself, the more likely the team is to treat incident response as one person’s job instead of everyone’s responsibility.

“What’s the point? Who cares? It’s a huge organization, and nothing I do will change it.”

James also talked about a bigger barrier: he had seen too many junior engineers become cynical over time at big companies and start believing that nothing they do matters. That sense of helplessness is exactly what turns oncall and incident response into something that feels unrelated to you.

His counterexample was his own experience:

Some of the happiest times in my life have been dedicating myself to a cause, trying really hard, and trying to do the right thing. - James Cowling

He added that this may sound old-fashioned, but it really works, especially when you intentionally move closer to people who genuinely want to get things right. If big-company politics have left you discouraged, look for teams like that instead of turning cynical.

For him, this mindset was not naive or a compromise. It was the essence of engineering: not building something complex or flashy just to look good in a promotion packet, but building the coolest thing that actually solves the problem.

One thing that helps sustain this mindset is turning complaints into action. James described how, at Dropbox early on, he poured himself into the work almost obsessively - sometimes 16 hours a day at first, although he does not recommend that. Over time, people knew he truly cared. That earned him enough trust to speak bluntly when needed. He said he was not afraid of losing his job; he was afraid of watching the wrong decision get made. Later, when he managed many staff+ engineers, he often heard complaints like “that area is inefficient” or “we are not doing the right thing.” He would not join the complaint. Instead, he used a sharp question to reframe the problem as prioritization:

Do you think we should solve this problem right now? - James Cowling

His follow-up went roughly like this: if the answer is yes, tell him which team to pull from and which product to pause, and he will reallocate the resources so the problem gets fixed now. The point is simple: instead of staying in complaint mode, bring the conversation back to “should we solve this, and if so, how?” If it truly matters, own it and put in the time. If it does not, be honest that there is something more important right now and redirect attention there.

James’s conclusion was that people who move up are usually not the best complainers. They are the people willing to say, “I think this direction is wrong, I think there is a better way, and I am willing to own the work needed to get us there.”

Applied to oncall, incident response is no longer something forced on you. It becomes a choice to say, “I will fix this.” When I reframe it that way, the passive dread fades.

6. How we can rethink our own oncall mindset

Combining these two interviews with the structure-and-sleep perspective from the previous post, here are a few ways to reframe how we think about oncall:

Treat each incident as first-hand material: after the fact, ask yourself, “What system behavior did I learn that I did not understand before?” Write it into a personal runbook or notes. The real output of oncall is the quality of what you learn from incidents, not the number of tickets you close.
Split your attention: if the work is repetitive toil, ask whether it can be automated. If it is a weird new problem, allow yourself to slow down and dig in. That is where the value is.
Write multi-layer postmortems: do not stop at the code bug. Keep asking about test coverage, team process, and even the original assumption. When a pattern appears across incidents, propose turning it into a tool.
Reject heroism: being paged 100 times is not a medal. If I keep putting out the same fire, that is a signal to raise the structural issue in the retrospective, not a reason to keep gritting my teeth.
Replace waiting with ownership: when the pager goes off, stop waiting for “the right person” to show up. Remind yourself that it is us; there is no one else. Incident response is not a burden someone else handed you. It is a problem you choose to pick up and own.

Conclusion

These two engineers took very different career paths, but when they talked about oncall, they converged on something surprising: this is not a kind of work we should try to dodge at all costs. Marc Brooker stayed on call for 15 years because he treated incidents and postmortems as the fastest route to system knowledge. James Cowling’s version is simpler: “if there is no one else, it is us.”

The cost of oncall is real. Anxiety, interrupted sleep, and always being on standby are real burdens, and I spent a lot of the previous post on how to defend against them. But putting these interviews side by side makes one thing clear: rejecting oncall does not change anything by itself. What they suggest instead is that if we change our perspective, learn a little more about the system in every incident, and treat response as a chance to own the problem, we can build a healthier and more constructive relationship with the work.

References

Dropbox’s Former Most Senior Eng: Building Great Systems and Advice for the AI Era - James Cowling (The Peterman Pod, 2026-05-25)
AWS Distinguished Eng: Learnings From 3000 Incidents And How Engineering Is Changing - Marc Brooker (The Peterman Pod, 2026-04-13)
Further reading: The Science of On-Call Anxiety: Phantom Pages, Sleep Debt, and Alert Fatigue

06 Jun 2026


Eason Cao is an engineer working at FAANG and living in Europe. He was accredited as an AWS Professional Solution Architect, AWS Professional DevOps Engineer, and CNCF Certified Kubernetes Administrator. He started his Kubernetes journey in 2017 and enjoys solving real-world business problems.