-
👏🏻 this 👏🏻 is 👏🏻 YES 👏🏻 -- this is much yes. honestly, i think this has the highest ratio of importance to coverage i've seen in topic ideas to date. it's so critical, and i don't think i've ever seen a thorough article about it beyond product-type content that's like 'you can get alerts!'. what do you do with those alerts?? that's the hard part. anyway, i love this.
-
to kick things off - i can outline the baseline that i see implemented usually:
that's it. what i would like to see in the wild is not just the data team getting alerts, but folks using metadata panels in BI tools so that stakeholders can see the problem and contact the right owner. dbt exposures allow for this but i think they're under-utilized. i'd also love more PagerDuty-style ownership rotations. SWEs know that there will be incidents, and that those incidents don't care about day or time -- but most data teams still seem to operate as if they expect nothing to ever break on a Saturday.
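a minimal sketch of what a rotation like that could look like, assuming a weekly handoff. the names, the start date, and the `on_call` helper are all hypothetical and only here to illustrate; a real setup would more likely live in a paging tool than in code:

```python
from datetime import date

# Hypothetical rotation: these names and the weekly handoff are assumptions.
ROTATION = ["ana", "ben", "chloe"]
ROTATION_START = date(2024, 1, 1)  # a Monday, treated as week 0 of the rotation


def on_call(day: date) -> str:
    """Return who owns data incidents on `day`, weekends included."""
    # Whole weeks elapsed since the rotation started (floor division).
    weeks_elapsed = (day - ROTATION_START).days // 7
    # Wrap around the list so the rotation repeats indefinitely.
    return ROTATION[weeks_elapsed % len(ROTATION)]
```

the point is just that a Saturday resolves to a named owner the same way a Tuesday does, so an alert always has somewhere to go.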
-
I really like this topic, and I feel like my last team had a really nice setup. When I first joined, we had errors on a daily basis, and they took up many of our mornings. One year later, the errors had been drastically reduced, and resolving them was also taking less of our time on average. On top of the Slack alerts that @gwenwindflower mentioned, which are a must-do, we also had some other stuff set up:
-
Gloria Lin shared this great article describing what engineering teams can learn from the handling of the April 2022 Atlassian outage:
-
What topic would you like to see discussed more in-depth?
As an analytics engineer, I want to know successful patterns for responding to data incidents
Why do you think this is important?
Because of the way data moves and flows, we can expect unexpected breakages within any analytics pipeline.
In an ideal world, there are lots of people using the data that we produce as analytics engineers, which increases the surface area of impact during incidents.
Software engineers know their plans and procedures for dealing with incidents -- we should know our plans too!
Optional:
Are there any existing resources that you’ve seen explore this topic?
I haven't seen any write-ups focused on analytics engineering on the topic of incident management. (I also don't have any go-to resources for incident management for software engineers.)
What do you think they’re missing?
N/A