74 minutes | Jul 9, 2021
Spoons (Daniel Spoonhower) - On building Lightstep, being customer focused, developing systems at Google scale and much more - #12
Spoons is the Co-founder and Chief Architect of Lightstep. He joins the show to talk about building systems at Google scale and the various aspects that make Google a weirder place than other companies. We talked about Spoons's journey of leaving Google and deciding to join Lightstep as a co-founder. We dig into the challenges during the early days of Lightstep and discuss the importance of speaking to customers to build the right product. We talk about what it's like to start a family and run a startup and how one can be intentional about building a company’s culture. As always, we go through some of the misadventures, and one of them involves a cable being cut under the English Channel.
73 minutes | Jun 11, 2021
Emmanuel Ameisen - On production ML at Stripe scale, leading 100+ ML projects, iterating fast, and much more - #11
Having led 100+ ML projects at Insight and built ML systems at Stripe scale, Emmanuel joins the show to chat about how to build useful ML products and what happens next when the model is in production. Throughout the conversation, Manu shares stories and advice on topics like the common mistakes people make when starting a new ML project, what’s similar and different about the lifecycle of ML systems compared to traditional software, and writing a technical book.
68 minutes | May 7, 2021
Todd Underwood - On lessons from running ML systems at Google for a decade, what it takes to be an ML SRE, challenges with generalized ML platforms and much more - #10
Todd is a Sr Director of Engineering at Google where he leads Site Reliability Engineering teams for Machine Learning. Having recently presented on how ML breaks in production, based on more than a decade of outage postmortems at Google, Todd joins the show to chat about why many of the ways that ML systems break in production have nothing to do with ML, what’s different about engineering reliable systems for ML vs. traditional software (and the many ways that they are similar), what he looks for when hiring ML SREs, and more.
73 minutes | Apr 23, 2021
Evan Estola - On recommendation systems going bad, hiring ML engineers, giving constructive feedback, filter bubbles and much more - #9
Evan Estola (https://twitter.com/estola) is a Director of Engineering at Flatiron Health where he's leading software engineering teams focused on building Machine Learning products. Throughout this episode, Evan shares various stories of recommendation systems that didn’t work as expected, like the time when members saw the mathematically worst recommendations for meetups near them. He also shares why Schenectady, NY pops up on some lists of most popular cities and the story behind the Wall Street Journal article titled 'Orbitz steers Mac users to pricier hotels'. We also discuss the skills Evan looks for when hiring ML engineers, how to give constructive feedback, filter bubbles and much more.
62 minutes | Apr 9, 2021
Uma Chingunde - On managing migrations, growing engineering teams and much more - #8
Uma is a VP of Engineering at Render. We had a great time speaking with Uma! Our major focus in this episode was large scale infrastructure migrations, and Uma shared many insights on how to manage them successfully. We discussed the importance of communicating the “why” behind a migration, identifying success metrics, creating a culture where migrations are identified as highly impactful projects and much more. Uma also shared stories where parts of a migration didn’t go as planned, how the team fixed the issue and the kind of engineers she thinks would make good tech leads for these projects. There’s a lot to learn from Uma’s experience. Please enjoy this highly educational conversation with Uma Chingunde!
67 minutes | Mar 20, 2021
Charity Majors - On database outages, journey as a co-founder, thriving under pressure and growing as an engineer - #7
Charity Majors (https://twitter.com/mipsytipsy) is the co-founder and CTO of Honeycomb.io. Before this she worked at Facebook, Parse and Linden Lab on infrastructure and developer tools, and always seemed to wind up running the databases. She is the co-author of the book Database Reliability Engineering and also has an amazing blog at charity.wtf. We love her blog posts and have learned a lot from them. We had a lot of fun speaking with Charity in this lively conversation! We learned about her journey from being an engineer to co-founding Honeycomb, what it was like being on-call when she was only 17, and staying calm during production incidents. We talked about various production outages throughout the episode, and our favorite involved driving to a datacenter to flip a DB switch. Charity also shares what it takes to build an awesome engineering culture, the engineer/manager pendulum, and the qualities she looks for when hiring senior engineers.
64 minutes | Mar 7, 2021
Tammy Bryant Butow - On failure injection, chaos engineering, extreme sports and being curious - #6
Tammy Bryant Butow is a Principal SRE at Gremlin where she works on Chaos Engineering. In this episode, we discuss how her curiosity led her to the world of infrastructure engineering, an outage from her early days where a core switch took down half the datacenter, her experience running a disaster recovery test and how it taught her about the importance of injecting failures into a system to make it more resilient. We also touch on advanced failure injection techniques, how chaos engineering is evolving and how extreme sports help Tammy keep calm under pressure. Lastly, Tammy has some great advice for teams looking to get started with chaos engineering.
61 minutes | Feb 19, 2021
Oliver Leaver-Smith - On how "just a monitoring change" took down the entire site and resilience engineering - #5
Oliver Leaver-Smith, better known as Ols, is a Senior DevOps Engineer at Sky Betting and Gaming. In this episode, we discuss how a seemingly simple monitoring change ended up taking down the entire site. We also talk about chaos and resilience engineering. We discuss how the team at Sky Betting and Gaming conducts fire drills (chaos engineering exercises) where they test the resiliency of not only their software systems but also their people systems. We walk through a recent example of a fire drill, how the drills have evolved over the past few years and the lessons learned in the process.
63 minutes | Feb 6, 2021
Ryan Underwood - On debugging the Linux kernel - #4
Ryan Underwood is a Staff SRE and tech lead on the Helix and Zookeeper SRE team at LinkedIn. Prior to LinkedIn, he was an SRE at Machine Zone and Google. Apart from his regular responsibilities, Ryan’s interests and expertise include debugging production kernel, I/O and containerization issues. His opinion about not treating software as a black box and his persistent approach to debugging complex problems are truly inspiring. On several occasions, Ryan’s colleagues have leaned on him to solve an esoteric problem that everyone thought was insurmountable. Our main focus today is one such problem that Ryan and his team ran into while upgrading machines to the 4.x kernel, which resulted in elevated 99th percentile latencies. We dive into what the problem was, how it was identified and how it was fixed. We discuss some of the tools and practices that are helpful in debugging system performance issues. And we also talk about Ryan’s background and how his curiosity landed him a career in Site Reliability Engineering. Please enjoy this deeply technical and highly educational conversation with Ryan Underwood. Website link: https://softwaremisadventures.com/ryan Music Credits: Vlad Gluschenko — Forest License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en
58 minutes | Jan 23, 2021
David Henke - On building a culture of "Site Up" at LinkedIn and Yahoo! - #3
David is LinkedIn’s former SVP of Engineering and Operations. He came out of retirement to join LinkedIn in 2009 during a time of rapid growth. After 4 years at LinkedIn, he retired in 2013. Throughout his career, David has been in multiple leadership positions and has been recognized as one of the best Operations Executives. This was an extremely fascinating conversation. David shares insightful stories from early days at LinkedIn and what it took to develop the culture of “Site Up and Secure”. He shares one of the most severe outages he has experienced in his career - this one was at Yahoo!, which he calls the 10g massacre. We talk about David’s 3 retirements throughout his career, his advice on developing operational excellence and lessons on being an effective leader. Throughout this conversation you’ll also hear various nuggets of wisdom from David, better known as Henkeisms. Please enjoy this highly entertaining and deeply insightful conversation with David Henke. Website link: https://softwaremisadventures.com/henke Music Credits: Vlad Gluschenko — Forest License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en
46 minutes | Jan 6, 2021
Julia Evans - On Kubernetes scheduler bugs, TCP performance regressions and debugging tips - #2
In this episode, we speak with Julia Evans. Julia runs a programming zines business called Wizard Zines (https://wizardzines.com/), where she creates comics about various programming concepts. She has been creating zines since she was a software engineer at Stripe. Her zines are extremely approachable and highly educational. In addition to creating zines, Julia is a prolific blogger and has around 500 posts on her blog at jvns.ca. Her blog posts are another great source for learning about fundamental programming concepts. We had a lot of fun speaking with Julia for this episode. We discuss two bugs she came across at Stripe. We talk about how she identified and fixed a bug in the Kubernetes scheduler and how her understanding of TCP helped her fix a performance regression. We also cover other topics like blogging, zines, debugging and learning new things. Please enjoy this fun conversation with the amazing Julia Evans! Website link: https://softwaremisadventures.com/julia Links: https://jvns.ca https://wizardzines.com/ https://twitter.com/b0rk https://github.com/jvns Music Credits: Vlad Gluschenko — Forest License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en
61 minutes | Dec 4, 2020
Kelsey Hightower - On ways Kubernetes can break, being an effective leader and much more - #1
In this episode, we speak with Kelsey Hightower who is currently a Principal Developer Advocate at Google and one of the most influential individuals in the Kubernetes community. He is also an author and a keynote speaker, with a knack for demystifying complex topics, doing live demos and enabling others to succeed. In this insightful conversation, we cover wide ranging topics from his role at Google to the art of storytelling. We get into some very interesting details of how Kubernetes can break in production and practices that work for Kelsey in being an effective leader. Links: https://twitter.com/kelseyhightower https://github.com/kelseyhightower Music Credits: Vlad Gluschenko — Forest License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en
4 minutes | Nov 28, 2020
Introducing Software Misadventures Podcast - #0
In this episode, Ronak, Austin and Guang share the origin story - who they are, what this podcast is about and why they are doing this. They've seen first hand how stressful it is when something breaks in production but also found it to be the best opportunity to learn about a system more deeply. They started this podcast to have in-depth conversations with software and devops experts and hear their stories from the trenches about how software breaks in production. In upcoming conversations, they discuss the principles and practical tips to build resilient software as well as advice to grow as technical leaders. Learn more at https://softwaremisadventures.com. Music Credits: Vlad Gluschenko — Forest License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en