Reduce Urgent Crises
In 2015, I quit my job to start a company with some buddies of mine.
I worked 14-hour days. I built real-time data pipelines with Kafka, and I’m still impressed with what we built.
But after 9 months, we ran out of money.
I was so tired. I couldn’t work 14-hour days for the rest of my life. I had to figure out a way to work less.
Luckily, right before I took another job, I found a book called “The Effective Engineer”.
I learned there's more to being an effective engineer than writing code. I learned that engineering should be a 40-hour-a-week job. But most importantly, I learned about “Engineering Leverage”.
“Engineering Leverage” is a lens for identifying which activities add the most value for the smallest investment of time.
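The Effective Engineer frames leverage roughly as a ratio of impact to time invested:

$$
\text{Leverage} = \frac{\text{Impact produced}}{\text{Time invested}}
$$

Under this ratio, an activity that produces the same impact in half the time has twice the leverage.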
And the activities with the best leverage are in Quadrant 2 of Stephen Covey’s urgency/importance matrix: work that is important but not urgent.
Quadrant 2 activities have no natural deadlines, so urgency never pushes them to the top of the list. But in the long run, they help us grow both personally and professionally.
I maximize my Quadrant 2 time by minimizing the time I spend in Quadrants 1 and 3.
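The quadrants come from Stephen Covey’s matrix, which sorts activities by two questions: is it urgent, and is it important? Here is a minimal Python sketch of that classification, plus a tally of weekly hours per quadrant. The function names and the sample activities are hypothetical and only illustrate the idea.

```python
from enum import IntEnum


class Quadrant(IntEnum):
    """Covey's urgency/importance matrix."""
    URGENT_IMPORTANT = 1        # e.g. production crises
    IMPORTANT_NOT_URGENT = 2    # e.g. learning, planning, relationships
    URGENT_NOT_IMPORTANT = 3    # e.g. many interruptions
    NEITHER = 4                 # e.g. busywork


def classify(urgent: bool, important: bool) -> Quadrant:
    """Map an activity's urgency and importance to its quadrant."""
    if important:
        return Quadrant.URGENT_IMPORTANT if urgent else Quadrant.IMPORTANT_NOT_URGENT
    return Quadrant.URGENT_NOT_IMPORTANT if urgent else Quadrant.NEITHER


def quadrant_hours(activities):
    """Sum hours per quadrant from (name, urgent, important, hours) tuples."""
    totals = {q: 0.0 for q in Quadrant}
    for _name, urgent, important, hours in activities:
        totals[classify(urgent, important)] += hours
    return totals


if __name__ == "__main__":
    # Hypothetical week, for illustration only.
    week = [
        ("debug a paging alert", True, True, 16),
        ("learn a new skill", False, True, 4),
        ("answer a non-critical ping", True, False, 2),
    ]
    for quadrant, hours in quadrant_hours(week).items():
        print(f"Q{int(quadrant)}: {hours:g} h")
```

Shrinking the Quadrant 1 and 3 totals is what frees hours for Quadrant 2.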
Minimizing Quadrant 1 Activities: Reducing Urgent Crises
On my first day at Jet.com, my coworker gave his two weeks’ notice.
My coworker owned a system that generated Jet.com’s catalog for advertising. It sent the catalog to Google, Facebook, and other partners. It was responsible for about half of Jet.com’s revenue ($150M/year).
In week 3, I inherited his system.
And it paged me 4 of 5 days a week.
I regularly lost the entire morning to debugging issues with it.
Those lost mornings added up to 16 hours of crises a week, so I spent 40% of a 40-hour week firefighting. I wasn’t adding new value; I was only maintaining what had already been promised.
Addressing The Underlying Cause
I was spending too much time addressing the symptoms.
The system’s caching job joined data from the catalog, inventory, and price streams. It exited successfully only if the catalog contained at least a certain number of items. The problem was that we didn’t know how many items the catalog should contain.
And the number of items changed every day.
For example, if the catalog had at least 26 million items, the job exited successfully. If it had 25.9 million, the job failed.
The underlying issue was that the job’s exit criterion was probabilistic, not deterministic.
By defining a valid item by its attributes on the streams, instead of by the total number of items, we made the job exit deterministically.
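To make the difference between the two exit criteria concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `Item` fields, the SKU join key, and the rules in `is_valid`. It is not the actual Jet.com job. The real job consumed streams, while this version joins in-memory dicts.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Item:
    sku: str
    title: Optional[str]      # from the catalog stream
    quantity: Optional[int]   # from the inventory stream
    price: Optional[float]    # from the price stream


def join_streams(catalog: Dict[str, str],
                 inventory: Dict[str, int],
                 prices: Dict[str, float]) -> List[Item]:
    """Join the three sources on SKU; fields missing from a source stay None."""
    return [
        Item(sku, title, inventory.get(sku), prices.get(sku))
        for sku, title in sorted(catalog.items())
    ]


def old_exit_ok(items: List[Item], min_items: int) -> bool:
    """Old criterion: succeed only if today's count clears a guessed threshold."""
    return len(items) >= min_items


def is_valid(item: Item) -> bool:
    """Hypothetical per-item rule built from stream attributes."""
    return (
        bool(item.title)
        and item.quantity is not None and item.quantity > 0  # in stock
        and item.price is not None and item.price > 0
    )


def run_new_job(items: List[Item]) -> Tuple[List[Item], List[Item]]:
    """New criterion: judge each item on its own attributes, then return
    (items to publish, rejected items to report) instead of failing the run."""
    valid = [i for i in items if is_valid(i)]
    rejected = [i for i in items if not is_valid(i)]
    return valid, rejected


if __name__ == "__main__":
    items = join_streams(
        {"A1": "Coffee maker", "B2": "Desk lamp", "C3": ""},
        {"A1": 12, "B2": 0, "C3": 5},
        {"A1": 39.99, "C3": 9.5},
    )
    print(old_exit_ok(items, min_items=4))  # False: 3 items < 4
    valid, rejected = run_new_job(items)
    print([i.sku for i in valid])     # ['A1']
    print([i.sku for i in rejected])  # ['B2', 'C3']
```

With the old check, the outcome hinges on where the day’s total lands relative to a number nobody could justify. With the new one, each item’s fate depends only on its own attributes, so the same input always yields the same published set and the same reject report.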
It took 6 weeks to implement the change to the caching job. We sped up the job by 600%. By addressing the underlying cause, we reduced the frequency of pages by 88%.
I reduced the time spent in Quadrant 1 from 16 hours to 2 hours a week.
I got back 14 hours a week to learn new skills.
I got back 14 hours a week to work proactively, not reactively.
I got back 14 hours a week to build relationships with my coworkers.
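As a quick sanity check, a few lines of Python reproduce the figures above. The 40-hour week and the 16 and 2 weekly hours come from the text. Treating firefighting hours as proportional to page frequency is an assumption. Under that assumption, the 87.5% drop in hours lines up with the reported 88% drop in pages.

```python
WORK_WEEK_HOURS = 40  # the 40-hour week from the text


def share_of_week(hours: float, week: float = WORK_WEEK_HOURS) -> float:
    """Percentage of the work week spent on an activity."""
    return 100 * hours / week


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return 100 * (before - after) / before


if __name__ == "__main__":
    before, after = 16, 2  # Quadrant 1 hours per week, from the text
    print(share_of_week(before))             # 40.0
    print(share_of_week(after))              # 5.0
    print(percent_reduction(before, after))  # 87.5
    print(before - after)                    # 14 hours reclaimed
```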
How Reducing Urgent Crises Helps Our Teams
Reducing urgent crises creates more Quadrant 2 time, but it also makes our entire team less error-prone and more inclusive.
Working too many hours leads to decreased productivity and burnout. Output can even turn negative when the team has to spend its time repairing mistakes made by fatigued engineers.
Being on-call is a big ask. It’s usually a 7-day, 24-hour shift.
Even for someone with limited obligations, it means carrying your laptop and hotspot everywhere. Now imagine what it’d be like for people with young children. What about someone taking care of an elderly person?
The burden of a noisy on-call will force many to choose between their personal life and their job.
And a team full of only one kind of human is a team full of blind spots.