Stitcher for Podcasts


Episode Info

Why Real-Time Operations Matters

Alex Solomon (CTO and Co-Founder of PagerDuty) kicks us off with a definition of real-time operations and why it matters.

Alex: “Real-time operations to me, what that means is, it’s about dealing with problems and incidents and alerts in real-time. Making sure that the right people are pulled in whenever you have an issue with your production software, and only the right people. Those teams and individuals are looped in quickly, looped in via multiple channels to make sure they get there fast. Then once they are paged and looped in it’s about collaborations, it’s about communication, it’s about coordinations, it’s about defining clear roles for all the individuals and making sure they can collaborate and communicate effectively to make decisions quickly and resolve the underlying problems with those systems.”

Matt adds that real-time operations also encompasses how we learn about incidents and how we continue to learn from them. George describes how broad the definition is: it extends to every facet of online operations that might affect our team, whether that is web services or the code we write and how it behaves in production.

The Myth of Real-Time Operations

Alex talks about the main myth he sees with real-time operations.

Alex: “The myth that you can buy a software platform like a PagerDuty or a Datadog or a New Relic or any of these toolboxes that we all have when running digital systems, and that buying the platforms will solve all your problems and be a silver bullet. In my experience what I see over and over is that yes you can buy the platform but the hard part is changing culture and transforming culture and transforming the way people work, and that comes down to people and process.”

Alex goes on to say that it comes down to the people supporting the services and to full-service ownership. Matt talks about the myth that we can prevent failure.

Matt: “The reality is we can do a lot to kind of steady ourselves and be ready to respond and take information we’ve already had, but our systems are so complex there’s no way to be fully predictive, and we need to understand how to make our systems - our socio-technical systems - more resilient rather than thinking if we just build in enough failover, enough automation, or write the best runbook ever, we'll be able to prevent failure.”

The discussion moves toward designing systems with failure in mind, so that we have ways to detect problems and resolve them quickly.

Sharing What We Have Learned at PagerDuty

The conversation moves to what we have each learned during our collective time at PagerDuty, whether it is the incident response process or postmortems. Scott talks about how his time at PagerDuty has been entirely remote and how to be successful as a remote worker by being vocal about your wins, taking time...
