These are the pitfalls we've encountered when implementing You Build It You Run It. Each pitfall drains confidence in the operating model, and ultimately puts it in jeopardy.
It's particularly important to guard against these pitfalls if your organisation previously migrated away from Ops Run It. A predisposition to central operations teams may still exist in your organisation. If one or more pitfalls have a negative impact, your senior leadership may see a return to Ops Run It as a quick, cost efficient way to safeguard the reliability of digital services.
These pitfalls aren't in any particular order.
This looks like:
- Run cost directly linked to product team count. Your product teams each have a product team member on-call out of hours, and each one impacts on-call funding.
This pitfall happens when you have an on-call standby cost model of $R(n), where R is the remuneration rate per person and n is the number of teams on-call. As you increase your number of product teams, your on-call standby costs increase in proportion. 1 team costs $R(1), 10 teams cost $R(10), and 20 teams cost $R(20).
A linear run cost can't happen with Ops Run It. One app support analyst on-call for all your foundational systems produces a cost model of $R(1). In addition, the remuneration rate per person R can be lower, if your app support team is outsourced.
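As a rough illustration of the two cost models, here is a minimal sketch. It assumes $R(n) means the rate R multiplied by n people on-call, which is our reading of the notation. The function names and the rate of 100 are hypothetical and are not real remuneration figures.

```python
def you_build_it_you_run_it_cost(rate: float, teams_on_call: int) -> float:
    """Standby cost with one person on-call per product team: R x n."""
    if teams_on_call < 0:
        raise ValueError("teams_on_call must be non-negative")
    return rate * teams_on_call


def ops_run_it_cost(rate: float) -> float:
    """Standby cost with one app support analyst on-call for everything: R x 1."""
    return rate * 1


# Hypothetical rate per person per standby period (illustrative only).
RATE = 100.0

for teams in (1, 10, 20):
    linear = you_build_it_you_run_it_cost(RATE, teams)
    flat = ops_run_it_cost(RATE)
    print(f"{teams} teams: You Build It You Run It={linear}, Ops Run It={flat}")
```

The sketch only covers standby cost. It says nothing about the opportunity costs of either operating model.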
On-call standby costs are a key operating model cost, and a linear run cost creates a perception that You Build It You Run It is too expensive. If you also suffer from the responsible but unaccountable pitfall, on-call funding is a line item in a strained opex budget and it shows linear growth.
You Build It You Run It has a risk of high run costs, but it's not inevitable. Furthermore, run cost alone is a flawed comparison of You Build It You Run It and Ops Run It. Operating models are different insurance policies for different business outcomes, and their costs are multi-faceted. The high opportunity costs of Ops Run It are unacceptable for digital services.
To avoid this pitfall:
- Select availability targets on financial exposure, so each product team balances availability with engineering effort
- Select out of hours schedules on financial exposure, so product teams at scale do not produce linear run costs
This looks like:
- Inflated availability targets. Your product managers are tempted to select the maximum availability target and highest on-call level.
- Low priority operational features. Your product managers have little reason to prioritise operational features alongside product features.
- Weak operability incentives. Your product teams have low motivation to constantly build operability into digital services.
- On-call funding pressure. Your product teams come under pressure to cut corners on on-call spend, whenever opex funding is scrutinised.
This pitfall happens when you’ve adopted You Build It You Run It and your Head of Operations is still accountable for the reliability of digital services. Your product teams may feel some responsibility for their digital services, but they are not held accountable for production incidents. In addition, their on-call funding comes from an opex budget owned by the Head of Operations.
Keeping accountability away from product teams dilutes operability incentives all round. Product managers select whatever availability target they want, because someone else is paying their on-call costs. Product teams cut corners on designing for adaptive capacity, because they won’t be questioned about inoperability later on.
To avoid this pitfall:
- Make product team budget holders accountable for business outcomes, so product teams have sole accountability for their digital services
- Fund on-call costs from product team budgets, so product teams have to fund on-call themselves
The impact of this pitfall can be lessened if your Head of Operations is supportive of You Build It You Run It, and seeks to empower on-call product teams at every opportunity.
This looks like:
- Slow delivery of planned work. Your product teams don't deliver in a timely fashion the product and operational features prioritised by product managers.
- High number of callouts. Your product teams spend most of their time fighting operational problems e.g. intermittent alerts, deployment failures, infrastructure errors.
Unplanned operational work doesn't count towards planned product or operational features. We've heard this expressed as "BAU work". Day to day operational tasks simply need to be completed by a product team, and they constitute rework. If the amount of BAU work is excessive, a product team cannot complete planned work when expected by the product manager.
This pitfall happens when digital services cannot gracefully degrade on failure, when production deployments aren't consistently applied in all environments, and when the telemetry toolchain isn't fully automated. Excessive BAU can be hard to detect, because product teams don't always track operational rework in a ticketing system like Jira. Product team members might fix intermittent alerts, deployment failures, infrastructure errors etc. without a ticket. It can be spotted by measuring the percentage of time product team members spend on planned work, each week.
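The weekly measurement can be as simple as the sketch below. It assumes each product team records hours of planned and unplanned work per week. The `WeekLog` type and its field names are hypothetical, and any threshold for "excessive" is left for you to choose.

```python
from dataclasses import dataclass


@dataclass
class WeekLog:
    """Hours a product team logged in one week (hypothetical structure)."""
    planned_hours: float
    unplanned_hours: float


def planned_work_percentage(log: WeekLog) -> float:
    """Return the percentage of logged time spent on planned work."""
    total = log.planned_hours + log.unplanned_hours
    if total == 0:
        raise ValueError("no hours logged for this week")
    return 100.0 * log.planned_hours / total


# Example with made-up numbers: 30 hours planned, 10 hours of BAU work.
week = WeekLog(planned_hours=30, unplanned_hours=10)
print(planned_work_percentage(week))
```

Tracking this figure week by week shows a trend, which matters more than any single value.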
To avoid this pitfall:
- Treat unplanned operational rework like planned work, so operational rework can be tracked and managed like any planned work
- Re-architect digital services for adaptability, so graceful degradation on failure is possible
- Create a fully automated deployment pipeline, so failed deployments are minimised and can be quickly reverted
- Establish an automated telemetry toolchain, so dashboards and alerts are reliable and can be updated at any time
We don't see this pitfall in practice as often as it's feared. We believe the fear of it comes from the historic levels of operational rework incurred for on-premise software services. For cloud-based digital services, a lot of operational changes are automatically handled by the cloud provider.
This looks like:
- Insufficient deployment frequency. Your product teams can't achieve weekly deployments or more.
- Prolonged change management process. Your change management team insists each deployment completes a time-consuming change management process.
We refer to slow, prolonged change management processes as treacle, and they have a significant impact on deployment lead times. They naturally encourage fewer, larger deployments, which make it harder to understand the changeset and diagnose any production problems.
This pitfall happens when your change management process is entirely reactive, a single change category is applied to all production deployments, and each deployment requires a change approval. It creates a fraught relationship between your change management team and your product teams.
At the same time, it's important that product teams comply with internal requirements on change management, particularly if your organisation follows IT standards such as ITIL v3. A discussion with a change management team to streamline change approvals needs to be accompanied by a commitment to preserve change auditing.
To avoid this pitfall:
- Pre-approve low risk, repeatable changes to accelerate a majority of deployments.
- Automate change auditing for compliance with change management processes.
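One way to picture these two practices together is the sketch below. It is an assumption-laden illustration, not a real change management integration: the `Deployment` fields, the category names, and the JSON audit format are all hypothetical, and your change management team would define the actual criteria for a low risk, repeatable change.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Deployment:
    """A production deployment request (hypothetical fields)."""
    service: str
    version: str
    low_risk: bool
    repeatable: bool


def change_category(deployment: Deployment) -> str:
    """Pre-approve only changes that are both low risk and repeatable."""
    if deployment.low_risk and deployment.repeatable:
        return "pre-approved"
    return "needs-approval"


def audit_record(deployment: Deployment, recorded_at: datetime) -> str:
    """Build a JSON audit entry so every deployment leaves a trail."""
    record = asdict(deployment)
    record["category"] = change_category(deployment)
    record["recorded_at"] = recorded_at.isoformat()
    return json.dumps(record, sort_keys=True)


deployment = Deployment("shifts-api", "1.4.2", low_risk=True, repeatable=True)
print(audit_record(deployment, datetime.now(timezone.utc)))
```

The audit entry is written for every deployment, whether pre-approved or not, so streamlining approvals doesn't weaken the audit trail.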
This looks like:
- Lack of incident response collaboration. Your product teams don't know how to involve other product teams and/or operational enabler teams in major incidents.
- Major incident ignorance. Your incident management team doesn't know when digital services are experiencing major incidents.
- No crisis communications. Your senior leadership doesn't know when major incidents are creating significant financial losses.
A lack of incident management creates inconsistent behaviours and communication pathways during major incidents. It has a negative impact on resolution times and financial losses incurred, when an incident requires more than one product team to be resolved.
This pitfall happens when your incident management team is excluded from the incident response process for digital services. It means different product teams will have distinct behaviours and communication methods during production incidents. It creates an impression that product teams aren't rigorous during incident response.
Crisis communications are particularly important during a major incident, and product teams won't know who to contact or how often to contact them with incident updates. A lack of clear, timely information to senior leadership during a major incident is an easy way to create doubts about an entire operating model.
To avoid this pitfall:
- Integrate into incident management as is, to ensure incident managers can be incident commanders for digital services as well as foundational systems.
This looks like:
- Embedded specialists shared between product teams. Your organisational model calls for N embedded specialists, each dedicated to one of your N product teams, but you've got fewer than N specialists and they're assigned to multiple teams.
- Unpredictable workload for embedded specialists. Your embedded specialists are either bored from a lack of work, or burned out from too much work across multiple teams.
- Loneliness for embedded specialists. Your embedded specialists don't have opportunities to work together, learn from one another, or even talk to one another.
This applies to any technology specialty tied to operational work - DBAs, InfoSec analysts, network admins, operability engineers, and more. Balancing breadth of cross-functional product teams and depth of specialist expertise is hard.
This pitfall happens as a countermeasure, after a small, central team of specialists can't handle demand from your growing number of product teams. There's little desire for developers to debug Postgres in production or learn Terraform on the job, so the answer is to embed a specialist in each product team. However, there's a scarcity of affordable specialists in the marketplace, and the need for expertise in product teams fluctuates. The result is that one large expertise bottleneck is split into multiple expertise bottlenecks. Instead of spending hours or days waiting on a central specialist team, a product team waits for hours or days for its own embedded specialist to be available.
To avoid this pitfall:
- Establish specialists as a service, so repeatable, common tasks are automated as self-service functions, and specialists in a central team are freed up to offer ad hoc expertise on demand.
This looks like:
- Product teams have a minority of team members in out of hours on-call schedules
- Product team members participating in on-call have significant disruption to their personal lives, and are on the verge of burnout
Each product team with a limited on-call schedule has digital services at risk of lengthy incident resolution times out of hours. If product team members need time off work to cope with burnout, team morale will suffer and planned product features will take longer to complete than expected.
This pitfall happens when product team members feel unprepared for on-call support, are unhappy with their remuneration, or burn out from too much time on-call over a period of time. It's important to respect the circumstances and decisions of different product team members.
To avoid this pitfall:
- Prepare for on-call from day one, so product team members are well equipped to handle out of hours production alerts.
- Re-architect digital services for adaptability, so digital services don't require substantial human intervention, and are fast to fix on failure
- Ensure fair remuneration for on-call developers, so product team members feel compensated for the disruption to their personal lives.
- Craft a sustainable on-call schedule, so no one product team member spends too much time on-call out of hours.
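As one possible starting point for a sustainable schedule, the sketch below rotates on-call weeks evenly across team members. It is a simplification under stated assumptions: the function name and member names are hypothetical, and a real schedule would also account for holidays, opt-outs, and individual circumstances.

```python
from itertools import cycle, islice


def weekly_rotation(members: list[str], weeks: int) -> list[str]:
    """Assign one on-call member per week, cycling through the team in order."""
    if not members:
        raise ValueError("at least one on-call member is required")
    if weeks < 0:
        raise ValueError("weeks must be non-negative")
    return list(islice(cycle(members), weeks))


# Hypothetical team of three covering five weeks.
for week, member in enumerate(weekly_rotation(["Asha", "Ben", "Chloe"], 5), start=1):
    print(f"Week {week}: {member}")
```

With more members in the rotation, each person is on-call less often, which is why a minority of team members in the schedule puts those few at risk of burnout.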
This pitfall affects operations teams as well. Your app support team may have some useful organisational context and experiences to contribute.
Lost accountability in retail |
---|
I worked on a team at a retail customer, building a shifts app for staff mobile devices in bricks and mortar stores. We built a cloud-based platform in Azure and the mobile app with You Build It You Run It, and the benefits were clear to our customer sponsor. We achieved a time to repair of less than 10 minutes, and we deployed on average twice a day for over six months. |