Data Engineering Podcast

378 Episodes

54 minutes | Jun 4, 2023
Reduce The Overhead In Your Pipelines With Agile Data Engine's DataOps Service
Summary A significant portion of the time spent by data engineering teams is on managing the workflows and operations of their pipelines. DataOps has arisen as a parallel set of practices to that of DevOps teams as a means of reducing wasted effort. Agile Data Engine is a platform designed to handle the infrastructure side of the DataOps equation, as well as providing the insights that you need to manage the human side of the workflow. In this episode Tevje Olin explains how the platform is implemented, the features that it provides to reduce the amount of effort required to keep your pipelines running, and how you can start using it in your own team. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) Your host is Tobias Macey and today I'm interviewing Tevje Olin about Agile Data Engine, a platform that combines data modeling, transformations, continuous delivery and workload orchestration to help you manage your data products and the whole lifecycle of your warehouse Interview Introduction How did you get involved in the area of data management? Can you describe what Agile Data Engine is and the story behind it? What are some of the tools and architectures that an organization might be able to replace with Agile Data Engine? How does the unified experience of Agile Data Engine change the way that teams think about the lifecycle of their data? What are some of the types of experiments that are enabled by reduced operational overhead? What does CI/CD look like for a data warehouse? How is it different from CI/CD for software applications? Can you describe how Agile Data Engine is architected? How have the design and goals of the system changed since you first started working on it? What are the components that you needed to develop in-house to enable your platform goals? What are the changes in the broader data ecosystem that have had the most influence on your product goals and customer adoption? Can you describe the workflow for a team that is using Agile Data Engine to power their business analytics? What are some of the insights that you generate to help your customers understand how to improve their processes or identify new opportunities? In your "about" page it mentions the unique approaches that you take for warehouse automation. How do your practices differ from the rest of the industry? How have changes in the adoption/implementation of ML and AI impacted the ways that your customers exercise your platform? What are the most interesting, innovative, or unexpected ways that you have seen the Agile Data Engine platform used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Agile Data Engine? When is Agile Data Engine the wrong choice? What do you have planned for the future of Agile Data Engine? 
Guest Contact Info LinkedIn (https://www.linkedin.com/in/tevjeolin/?originalSubdomain=fi) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? About Agile Data Engine Agile Data Engine unlocks the potential of your data to drive business value - in a rapidly changing world. Agile Data Engine is a DataOps Management platform for designing, deploying, operating and managing data products, and managing the whole lifecycle of a data warehouse. It combines data modeling, transformations, continuous delivery and workload orchestration into the same platform. Links Agile Data Engine (https://www.agiledataengine.com/agile-data-engine-x-data-engineering-podcast) Bill Inmon (https://en.wikipedia.org/wiki/Bill_Inmon) Ralph Kimball (https://en.wikipedia.org/wiki/Ralph_Kimball) Snowflake (https://www.snowflake.com/en/) Redshift (https://aws.amazon.com/redshift/) BigQuery (https://cloud.google.com/bigquery) Azure Synapse (https://azure.microsoft.com/en-us/products/synapse-analytics/) Airflow (https://airflow.apache.org/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
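One of the interview threads above asks what CI/CD looks like for a data warehouse versus a software application. As a rough, generic illustration (this is not how Agile Data Engine itself is implemented, and the table and type names are made up), a warehouse deployment pipeline often gates changes by classifying a proposed schema diff as additive or breaking before promoting it:

```python
# Minimal sketch of a CI gate for warehouse schema changes.
# Additive changes (new columns) can deploy automatically; anything that
# drops or retypes an existing column is flagged for review.

CURRENT = {"orders": {"id": "BIGINT", "amount": "NUMERIC", "created_at": "TIMESTAMP"}}
PROPOSED = {"orders": {"id": "BIGINT", "amount": "NUMERIC", "created_at": "TIMESTAMP",
                       "currency": "VARCHAR"}}

def classify_change(current: dict, proposed: dict) -> str:
    for table, cols in current.items():
        new_cols = proposed.get(table, {})
        for col, col_type in cols.items():
            if col not in new_cols:
                return "breaking: column dropped"
            if new_cols[col] != col_type:
                return "breaking: column retyped"
    return "additive"

if __name__ == "__main__":
    verdict = classify_change(CURRENT, PROPOSED)
    print(f"schema change is {verdict}")  # -> additive; safe to auto-deploy
```

In a real pipeline a check like this would run against model definitions in version control, alongside data tests, before the orchestrator applies the change.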
43 minutes | May 29, 2023
A Roadmap To Bootstrapping The Data Team At Your Startup
Summary Building a data team is hard in any circumstance, but at a startup it can be even more challenging. The requirements are fluid, you probably don't have a lot of existing data talent to manage the hiring and onboarding, and there is a need to move fast. Ghalib Suleiman has been on both sides of this equation and joins the show to share his hard-won wisdom about how to start and grow a data team in the early days of company growth. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) Your host is Tobias Macey and today I'm interviewing Ghalib Suleiman about challenges and strategies for building data teams in a startup Interview Introduction How did you get involved in the area of data management? Can you start by sharing your conception of the responsibilities of a data team? What are some of the common fallacies that organizations fall prey to in their first efforts at building data capabilities? Have you found it more practical to hire outside talent to build out the first data systems, or grow that talent internally? What are some of the resources you have found most helpful in training/educating the early creators and consumers of data assets? When there is no internal data talent to assist with hiring, what are some of the problems that manifest in the hiring process? What are the concepts that the new hire needs to know? How much does the hiring manager/interviewer need to know about those concepts to evaluate skill? What are the most critical skills for a first hire to have to start generating valuable output? As a solo data person, what are the uphill battles that they need to be prepared for in the organization? What are the rabbit holes that they should beware of? What are some of the tactical What are the most interesting, innovative, or unexpected ways that you have seen initial data hires tackle startup challenges? What are the most interesting, unexpected, or challenging lessons that you have learned while working on starting and growing data teams? When is it more practical to outsource the data work? Contact Info LinkedIn (https://www.linkedin.com/in/ghalibs/) @ghalib (https://twitter.com/ghalib) on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Polytomic (https://www.polytomic.com/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
56 minutes | May 21, 2023
Keep Your Data Lake Fresh With Real Time Streams Using Estuary
Summary Batch vs. streaming is a long running debate in the world of data integration and transformation. Proponents of the streaming paradigm argue that stream processing engines can easily handle batched workloads, but the reverse isn't true. The batch world has been the default for years because of the complexities of running a reliable streaming system at scale. In order to remove that barrier, the team at Estuary have built the Gazette and Flow systems from the ground up to resolve the pain points of other streaming engines, while providing an intuitive interface for data and application engineers to build their streaming workflows. In this episode David Yaffe and Johnny Graettinger share the story behind the business and technology and how you can start using it today to build a real-time data lake without all of the headache. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) Your host is Tobias Macey and today I'm interviewing David Yaffe and Johnny Graettinger about using streaming data to build a real-time data lake and how Estuary gives you a single path to integrating and transforming your various sources Interview Introduction How did you get involved in the area of data management? Can you describe what Estuary is and the story behind it? Stream processing technologies have been around for around a decade. How would you characterize the current state of the ecosystem? What was missing in the ecosystem of streaming engines that motivated you to create a new one from scratch? With the growth in tools that are focused on batch-oriented data integration and transformation, what are the reasons that an organization should still invest in streaming? What is the comparative level of difficulty and support for these disparate paradigms? What is the impact of continuous data flows on dags/orchestration of transforms? What role do modern table formats have on the viability of real-time data lakes? Can you describe the architecture of your Flow platform? What are the core capabilities that you are optimizing for in its design? What is involved in getting Flow/Estuary deployed and integrated with an organization's data systems? What does the workflow look like for a team using Estuary? How does it impact the overall system architecture for a data platform as compared to other prevalent paradigms? How do you manage the translation of poll vs. push availability and best practices for API and other non-CDC sources? What are the most interesting, innovative, or unexpected ways that you have seen Estuary used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Estuary? When is Estuary the wrong choice? What do you have planned for the future of Estuary? 
Contact Info Dave Y (mailto:dave@estuary.dev) Johnny G (mailto:johnny@estuary.dev) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Estuary (https://estuary.dev) Try Flow Free (https://dashboard.estuary.dev/register) Gazette (https://gazette.dev) Samza (https://samza.apache.org/) Flink (https://flink.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57/) Storm (https://storm.apache.org/) Kafka Topic Partitioning (https://www.openlogic.com/blog/kafka-partitions) Trino (https://trino.io/) Avro (https://avro.apache.org/) Parquet (https://parquet.apache.org/) Fivetran (https://www.fivetran.com/) Podcast Episode (https://www.dataengineeringpodcast.com/fivetran-data-replication-episode-93/) Airbyte (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/) Snowflake (https://www.snowflake.com/en/) BigQuery (https://cloud.google.com/bigquery) Vector Database (https://learn.microsoft.com/en-us/semantic-kernel/concepts-ai/vectordb) CDC == Change Data Capture (https://en.wikipedia.org/wiki/Change_data_capture) Debezium (https://debezium.io/) Podcast Episode (https://www.dataengineeringpodcast.com/debezium-change-data-capture-episode-114/) MapReduce (https://en.wikipedia.org/wiki/MapReduce) Netflix DBLog (https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b) JSON-Schema (http://json-schema.org/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
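To make the batch-versus-streaming discussion concrete, here is a minimal sketch of the core reduction a real-time data lake performs: folding a change-data-capture stream into the latest state of each row as events arrive, instead of reloading the table on a schedule. The event shape is hypothetical, and this is not Flow's or Gazette's actual API.

```python
# Sketch: folding a CDC stream into current row state -- the same reduction a
# streaming data lake performs continuously instead of via a nightly batch load.

cdc_events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "status": "new"}},
    {"op": "update", "key": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "status": "new"}},
    {"op": "delete", "key": 2, "row": None},
]

def materialize(events):
    table = {}
    for event in events:
        if event["op"] in ("insert", "update"):
            table[event["key"]] = event["row"]   # upsert the latest version
        elif event["op"] == "delete":
            table.pop(event["key"], None)        # tombstone removes the row
    return table

print(materialize(cdc_events))  # {1: {'id': 1, 'status': 'shipped'}}
```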
27 minutes | May 15, 2023
What Happens When The Abstractions Leak On Your Data
Summary All of the advancements in our technology are based on the principles of abstraction. These are valuable until they break down, which is an inevitable occurrence. In this episode the host Tobias Macey shares his reflections on recent experiences where the abstractions leaked and some observations on how to deal with that situation in a data platform architecture. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) Your host is Tobias Macey and today I'm sharing some thoughts and observations about abstractions and impedance mismatches from my experience building a data lakehouse with an ELT workflow Interview Introduction impact of community tech debt hive metastore new work being done but not widely adopted tensions between automation and correctness data type mapping integer types complex types naming things (keys/column names from APIs to databases) disaggregated databases - pros and cons flexibility and cost control not as much tooling invested vs. Snowflake/BigQuery/Redshift data modeling dimensional modeling vs. answering today's questions What are the most interesting, unexpected, or challenging lessons that you have learned while working on your data platform? When is ELT the wrong choice? What do you have planned for the future of your data platform? Contact Info LinkedIn (https://www.linkedin.com/in/tmacey/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links dbt (https://www.getdbt.com/) Airbyte (https://airbyte.com/) Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/) Dagster (https://dagster.io/) Podcast Episode (https://www.dataengineeringpodcast.com/dagster-software-defined-assets-data-orchestration-episode-309/) Trino (https://trino.io/) Podcast Episode (https://www.dataengineeringpodcast.com/presto-distributed-sql-episode-149/) ELT (https://en.wikipedia.org/wiki/Extract,_load,_transform) Data Lakehouse (https://www.forbes.com/sites/bernardmarr/2022/01/18/what-is-a-data-lakehouse-a-super-simple-explanation-for-anyone/?sh=5c0e333f6088) Snowflake (https://www.snowflake.com/en/) BigQuery (https://cloud.google.com/bigquery) Redshift (https://aws.amazon.com/redshift/) Technical Debt (https://en.wikipedia.org/wiki/Technical_debt) Hive Metastore (https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+Administration) AWS Glue (https://aws.amazon.com/glue/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
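The episode's notes call out data type mapping (integer types, complex types) as a place where abstractions leak between source systems and the warehouse. The sketch below, with made-up type names, shows the kind of lossy mapping table that sits behind an "automatic" EL pipeline and where it quietly narrows or flattens data:

```python
# Sketch: naive source-to-warehouse type mapping, and the kind of leak the
# episode describes -- an unsigned 64-bit source column has no lossless
# signed BIGINT equivalent, so the "automatic" mapping silently narrows it.

TYPE_MAP = {
    "int32": "INTEGER",
    "int64": "BIGINT",
    "uint64": "BIGINT",     # leak: max uint64 overflows a signed BIGINT
    "decimal": "NUMERIC",
    "json": "VARCHAR",      # leak: nested structure flattened to text
}

def map_column(source_type: str) -> str:
    try:
        return TYPE_MAP[source_type]
    except KeyError:
        raise ValueError(f"no mapping defined for source type {source_type!r}")

for src in ("int64", "uint64", "json"):
    print(f"{src:8} -> {map_column(src)}")
```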
55 minutes | May 7, 2023
Use Consistent And Up To Date Customer Profiles To Power Your Business With Segment Unify
Summary Every business has customers, and a critical element of success is understanding who they are and how they are using the company's products or services. The challenge is that most companies have a multitude of systems that contain fragments of the customer's interactions and stitching that together is complex and time-consuming. Segment created the Unify product to reduce the burden of building a comprehensive view of customers and synchronizing it to all of the systems that need it. In this episode Kevin Niparko and Hanhan Wang share the details of how it is implemented and how you can use it to build and maintain rich customer profiles. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) Your host is Tobias Macey and today I'm interviewing Kevin Niparko and Hanhan Wang about Segment's new Unify product for building and syncing comprehensive customer profiles across your data systems Interview Introduction How did you get involved in the area of data management? Can you describe what Segment Unify is and the story behind it? What are the net-new capabilities that it brings to the Segment product suite? What are some of the categories of attributes that need to be managed in a prototypical customer profile? What are the different use cases that are enabled/simplified by the availability of a comprehensive customer profile? What is the potential impact of more detailed customer profiles on LTV? How do you manage permissions/auditability of updating or amending profile data? Can you describe how the Unify product is implemented? What are the technical challenges that you had to address while developing/launching this product? What is the workflow for a team who is adopting the Unify product? What are the other Segment products that need to be in use to take advantage of Unify? What are some of the most complex edge cases to address in identity resolution? How does reverse ETL factor into the enrichment process for profile data? What are some of the issues that you have to account for in synchronizing profiles across platforms/products? How do you mitigate the impact of "regression to the mean" for systems that don't support all of the attributes that you want to maintain in a profile record? What are some of the data modeling considerations that you have had to account for to support historical changes (e.g. slowly changing dimensions)? What are the most interesting, innovative, or unexpected ways that you have seen Segment Unify used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Segment Unify? When is Segment Unify the wrong choice? What do you have planned for the future of Segment Unify? 
Contact Info Kevin LinkedIn (https://www.linkedin.com/in/kevin-niparko-5ab86b54/) Blog (https://n2parko.com/) Hanhan LinkedIn (https://www.linkedin.com/in/hansquared/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Segment Unify (https://segment.com/product/unify/) Segment (https://segment.com/) Podcast Episode (https://www.dataengineeringpodcast.com/segment-customer-analytics-episode-72/) Customer Data Platform (CDP) (https://blog.hubspot.com/service/customer-data-platform-guide) Golden Profile (https://www.uniserv.com/en/business-cases/customer-data-management/golden-record-golden-profile/) Reverse ETL (https://medium.com/memory-leak/reverse-etl-a-primer-4e6694dcc7fb) MarTech Landscape (https://chiefmartec.com/2023/05/2023-marketing-technology-landscape-supergraphic-11038-solutions-searchable-on-martechmap-com/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
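Identity resolution comes up repeatedly in this conversation. As a simplified illustration of the underlying idea (not Segment's actual algorithm), a union-find structure can merge any events that share an identifier into a single profile; the identifiers below are hypothetical:

```python
# Sketch: merging event identities into one profile when any identifier
# (email, device id, user id) is shared. Union-find keeps merges cheap and
# transitive, which is the essence of CDP identity resolution.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

events = [
    {"anonymous_id": "dev-1", "email": None},
    {"anonymous_id": "dev-1", "email": "ada@example.com"},
    {"anonymous_id": "dev-2", "email": "ada@example.com"},
]

uf = UnionFind()
for e in events:
    ids = [v for v in e.values() if v]
    for other in ids[1:]:
        uf.union(ids[0], other)   # any shared identifier links the identities

profiles = {}
for e in events:
    root = uf.find(next(v for v in e.values() if v))
    profiles.setdefault(root, set()).update(v for v in e.values() if v)
print(profiles)  # one merged profile containing dev-1, dev-2, ada@example.com
```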
45 minutes | Apr 24, 2023
Realtime Data Applications Made Easier With Meroxa
Summary Real-time capabilities have quickly become an expectation for consumers. The complexity of providing those capabilities is still high, however, making it more difficult for small teams to compete. Meroxa was created to enable teams of all sizes to deliver real-time data applications. In this episode DeVaris Brown discusses the types of applications that are possible when teams don't have to manage the complex infrastructure necessary to support continuous data flows. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) Your host is Tobias Macey and today I'm interviewing DeVaris Brown about the impact of real-time data on business opportunities and risk profiles Interview Introduction How did you get involved in the area of data management? Can you describe what Meroxa is and the story behind it? How have the focus and goals of the platform and company evolved over the past 2 years? Who are the target customers for Meroxa? What problems are they trying to solve when they come to your platform? Applications powered by real-time data were the exclusive domain of large and/or sophisticated tech companies for several years due to the inherent complexities involved. What are the shifts that have made them more accessible to a wider variety of teams? What are some of the remaining blockers for teams who want to start using real-time data? With the democratization of real-time data, what are the new categories of products and applications that are being unlocked? How are organizations thinking about the potential value that those types of apps/services can provide? With data flowing constantly, there are new challenges around oversight and accuracy. How does real-time data change the risk profile for applications that are consuming it? What are some of the technical controls that are available for organizations that are risk-averse? What skills do developers need to be able to effectively design, develop, and deploy real-time data applications? How does this differ when talking about internal vs. consumer/end-user facing applications? What are the most interesting, innovative, or unexpected ways that you have seen Meroxa used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Meroxa? When is Meroxa the wrong choice? What do you have planned for the future of Meroxa? Contact Info LinkedIn (https://www.linkedin.com/in/devarispbrown/) @devarispbrown (https://twitter.com/devarispbrown) on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. 
The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Meroxa (https://meroxa.com/) Podcast Episode (https://www.dataengineeringpodcast.com/meroxa-data-integration-episode-153/) Kafka (https://kafka.apache.org/) Kafka Connect (https://docs.confluent.io/platform/current/connect/index.html) Conduit (https://github.com/ConduitIO/conduit) - golang Kafka connect replacement Pulsar (https://pulsar.apache.org/) Redpanda (https://redpanda.com/) Flink (https://flink.apache.org/) Beam (https://beam.apache.org/) Clickhouse (https://clickhouse.tech/) Druid (https://druid.apache.org/) Pinot (https://pinot.apache.org/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
49 minutes | Apr 16, 2023
Building Self Serve Business Intelligence With AI And Semantic Modeling At Zenlytic
Summary Business intelligence has been chasing the promise of self-serve data for decades. As the capabilities of these systems have improved and become more accessible, the target of what self-serve means has shifted. With the availability of AI powered by large language models combined with the evolution of semantic layers, the team at Zenlytic have taken aim at this problem again. In this episode Paul Blankley and Ryan Janssen explore the power of natural language driven data exploration combined with semantic modeling that enables an intuitive way for everyone in the business to access the data that they need to succeed in their work. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) Your host is Tobias Macey and today I'm interviewing Paul Blankley and Ryan Janssen about Zenlytic, a no-code business intelligence tool focused on emerging commerce brands Interview Introduction How did you get involved in the area of data management? Can you describe what Zenlytic is and the story behind it? Business intelligence is a crowded market. What was your process for defining the problem you are focused on solving and the method to achieve that outcome? Self-serve data exploration has been attempted in myriad ways over successive generations of BI and data platforms. What are the barriers that have been the most challenging to overcome in that effort? What are the elements that are coming together now that give you confidence in being able to deliver on that? Can you describe how Zenlytic is implemented? What are the evolutions in the understanding and implementation of semantic layers that provide a sufficient substrate for operating on? How have the recent breakthroughs in large language models (LLMs) improved your ability to build features in Zenlytic? What is your process for adding domain semantics to the operational aspect of your LLM? For someone using Zenlytic, what is the process for getting it set up and integrated with their data? Once it is operational, can you describe some typical workflows for using Zenlytic in a business context? Who are the target users? What are the collaboration options available? What are the most complex engineering/data challenges that you have had to address in building Zenlytic? What are the most interesting, innovative, or unexpected ways that you have seen Zenlytic used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Zenlytic? When is Zenlytic the wrong choice? What do you have planned for the future of Zenlytic? Contact Info Paul Blankley (LinkedIn) (https://www.linkedin.com/in/paulblankley/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. 
Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Zenlytic (https://zenlytic.com/) OLAP Cube (https://analyticsengineers.club/whats-an-olap-cube/) Large Language Model (https://en.wikipedia.org/wiki/Large_language_model) Starburst (https://www.starburst.io/) Prompt Engineering (https://en.wikipedia.org/wiki/Prompt_engineering) ChatGPT (https://openai.com/blog/chatgpt) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
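A recurring point in this episode is that the semantic layer, not the language model, generates the SQL: the model only has to choose among named measures and dimensions. The sketch below illustrates that division of labor with hypothetical metric and table names; it is not Zenlytic's implementation.

```python
# Sketch: a semantic layer exposes named measures and dimensions, and the
# language model's job is reduced to picking a valid pair -- the layer turns
# that choice into SQL, so the model never writes raw queries.

SEMANTIC_MODEL = {
    "measures": {"total_revenue": "SUM(order_total)", "orders": "COUNT(*)"},
    "dimensions": {"order_month": "DATE_TRUNC('month', ordered_at)",
                   "channel": "acquisition_channel"},
    "table": "analytics.orders",
}

def compile_query(measure: str, dimension: str) -> str:
    """Turn a validated (measure, dimension) pair into SQL."""
    if measure not in SEMANTIC_MODEL["measures"]:
        raise ValueError(f"unknown measure: {measure}")
    if dimension not in SEMANTIC_MODEL["dimensions"]:
        raise ValueError(f"unknown dimension: {dimension}")
    m = SEMANTIC_MODEL["measures"][measure]
    d = SEMANTIC_MODEL["dimensions"][dimension]
    return (f"SELECT {d} AS {dimension}, {m} AS {measure}\n"
            f"FROM {SEMANTIC_MODEL['table']}\nGROUP BY 1\nORDER BY 1")

# Pretend an LLM mapped "how has revenue trended by month?" to this pair:
print(compile_query("total_revenue", "order_month"))
```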
72 minutes | Apr 10, 2023
An Exploration Of The Composable Customer Data Platform
Summary The customer data platform is a category of services that was developed early in the evolution of the current era of cloud services for data processing. When it was difficult to wire together the event collection, data modeling, reporting, and activation it made sense to buy monolithic products that handled every stage of the customer data lifecycle. Now that the data warehouse has taken center stage a new approach of composable customer data platforms is emerging. In this episode Darren Haken is joined by Tejas Manohar to discuss how Autotrader UK is addressing their customer data needs by building on top of their existing data stack. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) Your host is Tobias Macey and today I'm interviewing Darren Haken and Tejas Manohar about building a composable CDP and how you can start adopting it incrementally Interview Introduction How did you get involved in the area of data management? Can you describe what you mean by a "composable CDP"? What are some of the key ways that it differs from the ways that we think of a CDP today? What are the problems that you were focused on addressing at Autotrader that are solved by a CDP? One of the promises of the first generation CDP was an opinionated way to model your data so that non-technical teams could own this responsibility. What do you see as the risks/tradeoffs of moving CDP functionality into the same data stack as the rest of the organization? What about companies that don't have the capacity to run a full data infrastructure? Beyond the core technology of the data warehouse, what are the other evolutions/innovations that allow for a CDP experience to be built on top of the core data stack? added burden on core data teams to generate event-driven data models When iterating toward a CDP on top of the core investment of the infrastructure to feed and manage a data warehouse, what are the typical first steps? What are some of the components in the ecosystem that help to speed up the time to adoption? (e.g. pre-built dbt packages for common transformations, etc.) What are the most interesting, innovative, or unexpected ways that you have seen CDPs implemented? What are the most interesting, unexpected, or challenging lessons that you have learned while working on CDP related functionality? When is a CDP (composable or monolithic) the wrong choice? What do you have planned for the future of the CDP stack? Contact Info Darren LinkedIn (https://www.linkedin.com/in/darrenhaken/?originalSubdomain=uk) @DarrenHaken (https://twitter.com/darrenhaken) on Twitter Tejas LinkedIn (https://www.linkedin.com/in/tejasmanohar) @tejasmanohar (https://twitter.com/tejasmanohar) on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. 
Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Autotrader (https://www.autotrader.co.uk/) Hightouch (https://hightouch.com/) Customer Studio (https://hightouch.com/platform/customer-studio) CDP == Customer Data Platform (https://blog.hubspot.com/service/customer-data-platform-guide) Segment (https://segment.com/) Podcast Episode (https://www.dataengineeringpodcast.com/segment-customer-analytics-episode-72/) mParticle (https://www.mparticle.com/) Salesforce (https://www.salesforce.com/) Amplitude (https://amplitude.com/) Snowplow (https://snowplow.io/) Podcast Episode (https://www.dataengineeringpodcast.com/snowplow-with-alexander-dean-episode-48/) Reverse ETL (https://medium.com/memory-leak/reverse-etl-a-primer-4e6694dcc7fb) dbt (https://www.getdbt.com/) Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/) Snowflake (https://www.snowflake.com/en/) Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/) BigQuery (https://cloud.google.com/bigquery) Databricks (https://www.databricks.com/) ELT (https://en.wikipedia.org/wiki/Extract,_load,_transform) Fivetran (https://www.fivetran.com/) Podcast Episode (https://www.dataengineeringpodcast.com/fivetran-data-replication-episode-93/) DataHub (https://datahubproject.io/) Podcast Episode (https://www.dataengineeringpodcast.com/acryl-data-datahub-metadata-graph-episode-230/) Amundsen (https://www.amundsen.io/) Podcast Episode (https://www.dataengineeringpodcast.com/amundsen-data-discovery-episode-92/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
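The activation half of a composable CDP is essentially reverse ETL: diff a warehouse-modeled audience against the last sync and push only the changes to downstream tools. A minimal sketch of that planning step follows, with stand-in data and without any real destination API:

```python
# Sketch: plan a reverse-ETL sync by diffing the current warehouse audience
# against the previously synced state, emitting only creates/updates/deletes.

previous_state = {"u1": {"email": "a@example.com", "ltv": 120}}
warehouse_rows = {
    "u1": {"email": "a@example.com", "ltv": 150},   # changed -> update
    "u2": {"email": "b@example.com", "ltv": 40},    # new -> create
}

def plan_sync(previous: dict, current: dict):
    ops = []
    for key, row in current.items():
        if key not in previous:
            ops.append(("create", key, row))
        elif previous[key] != row:
            ops.append(("update", key, row))
    for key in previous.keys() - current.keys():
        ops.append(("delete", key, None))
    return ops

for op, key, row in plan_sync(previous_state, warehouse_rows):
    print(op, key, row)   # a real sync would call the destination's API here
```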
62 minutes | Apr 3, 2023
Mapping The Data Infrastructure Landscape As A Venture Capitalist
Summary The data ecosystem has been building momentum for several years now. As a venture capital investor Matt Turck has been trying to keep track of the main trends and has compiled his findings into the MAD (ML, AI, and Data) landscape reports each year. In this episode he shares his experiences building those reports and the perspective he has gained from the exercise. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) today to learn more Your host is Tobias Macey and today I'm interviewing Matt Turck about his annual report on the Machine Learning, AI, & Data landscape and the insights around data infrastructure that he has gained in the process Interview Introduction How did you get involved in the area of data management? Can you describe what the MAD landscape report is and the story behind it? At a high level, what is your goal in the compilation and maintenance of your landscape document? What are your guidelines for what to include in the landscape? As the data landscape matures, how have you seen that influence the types of projects/companies that are founded? What are the product categories that were only viable when capital was plentiful and easy to obtain? What are the product categories that you think will be swallowed by adjacent concerns, and which are likely to consolidate to remain competitive? The rapid growth and proliferation of data tools helped establish the "Modern Data Stack" as a de-facto architectural paradigm. As we move into this phase of contraction, what are your predictions for how the "Modern Data Stack" will evolve? Is there a different architectural paradigm that you see as growing to take its place? How has your presentation and the types of information that you collate in the MAD landscape evolved since you first started it? What are the most interesting, innovative, or unexpected product and positioning approaches that you have seen while tracking data infrastructure as a VC and maintainer of the MAD landscape? What are the most interesting, unexpected, or challenging lessons that you have learned while working on the MAD landscape over the years? What do you have planned for future iterations of the MAD landscape? Contact Info Website (https://mattturck.com/) @mattturck (https://twitter.com/mattturck) on Twitter MAD Landscape Comments Email (mailto:mad2023@firstmarkcap.com) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. 
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links MAD Landscape (https://mad.firstmarkcap.com) First Mark Capital (https://firstmark.com/) Bayesian Learning (https://en.wikipedia.org/wiki/Bayesian_inference) AI Winter (https://en.wikipedia.org/wiki/AI_winter) Databricks (https://www.databricks.com/) Cloud Native Landscape (https://landscape.cncf.io/) LUMA Scape (https://lumapartners.com/lumascapes/) Hadoop Ecosystem (https://www.analyticsvidhya.com/blog/2020/10/introduction-hadoop-ecosystem/) Modern Data Stack (https://www.fivetran.com/blog/what-is-the-modern-data-stack) Reverse ETL (https://medium.com/memory-leak/reverse-etl-a-primer-4e6694dcc7fb) Generative AI (https://generativeai.net/) dbt (https://www.getdbt.com/) Transform (https://transform.co/) Podcast Episode (https://www.dataengineeringpodcast.com/transform-co-metrics-layer-episode-206/) Snowflake IPO (https://www.cnn.com/2020/09/16/investing/snowflake-ipo/index.html) Dataiku (https://www.dataiku.com/) Iceberg (https://iceberg.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/tabular-iceberg-lakehouse-tables-episode-363) Hudi (https://hudi.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/hudi-streaming-data-lake-episode-209/) DuckDB (https://duckdb.org/) Podcast Episode (https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/) Trino (https://trino.io/) Y42 (https://www.y42.com/) Podcast Episode (https://www.dataengineeringpodcast.com/y42-full-stack-data-platform-episode-295) Mozart Data (https://www.mozartdata.com/) Podcast Episode (https://www.dataengineeringpodcast.com/mozart-data-modern-data-stack-episode-242/) Keboola (https://www.keboola.com/) MPP Database (https://www.techtarget.com/searchdatamanagement/definition/MPP-database-massively-parallel-processing-database) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
74 minutes | Mar 25, 2023
Unlocking The Potential Of Streaming Data Applications Without The Operational Headache At Grainite
Summary The promise of streaming data is that it allows you to react to new information as it happens, rather than introducing latency by batching records together. The peril is that building a robust and scalable streaming architecture is always more complicated and error-prone than you think it's going to be. After experiencing this unfortunate reality for themselves, Abhishek Chauhan and Ashish Kumar founded Grainite so that you don't have to suffer the same pain. In this episode they explain why streaming architectures are so challenging, how they have designed Grainite to be robust and scalable, and how you can start using it today to build your streaming data applications without all of the operational headache. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) today to learn more Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender (https://www.dataengineeringpodcast.com/timextender) where you can do two things: watch us build a data estate in 15 minutes and start for free today. Join in with the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) today Your host is Tobias Macey and today I'm interviewing Ashish Kumar and Abhishek Chauhan about Grainite, a platform designed to give you a single place to build streaming data applications Interview Introduction How did you get involved in the area of data management? Can you describe what Grainite is and the story behind it? What are the personas that you are focused on addressing with Grainite? 
What are some of the most complex aspects of building streaming data applications in the absence of something like Grainite? How does Grainite work to reduce that complexity? What are some of the commonalities that you see in the teams/organizations that find their way to Grainite? What are some of the higher-order projects that teams are able to build when they are using Grainite as a starting point vs. where they would be spending effort on a fully managed streaming architecture? Can you describe how Grainite is architected? How have the design and goals of the platform changed/evolved since you first started working on it? What does your internal build vs. buy process look like for identifying where to spend your engineering resources? What is the process for getting Grainite set up and integrated into an organization's technical environment? What is your process for determining which elements of the platform to expose as end-user features and customization options vs. keeping internal to the operational aspects of the product? Once Grainite is running, can you describe the day 0 workflow of building an application or data flow? What are the day 2 - N capabilities that Grainite offers for ongoing maintenance/operation/evolution of those applications? What are the most interesting, innovative, or unexpected ways that you have seen Grainite used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Grainite? When is Grainite the wrong choice? What do you have planned for the future of Grainite? Contact Info Ashish LinkedIn (https://www.linkedin.com/in/ashishkumarprofile/) Abhishek LinkedIn (https://www.linkedin.com/in/abhishekchauhan/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Grainite (https://www.grainite.com/) Blog about the challenges of streaming architectures (https://www.grainite.com/blog/there-was-an-old-lady-who-swallowed-a-fly) Getting Started Docs (https://gitbook.grainite.com/developers/getting-started) BigTable (https://research.google/pubs/pub27898/) Spanner (https://research.google/pubs/pub39966/) Firestore (https://cloud.google.com/firestore) OpenCensus (https://opencensus.io/) Citrix (https://www.citrix.com/) NetScaler (https://www.citrix.com/blogs/2022/10/03/netscaler-is-back/) J2EE (https://www.oracle.com/java/technologies/appmodel.html) RocksDB (https://rocksdb.org/) Pulsar (https://pulsar.apache.org/) SQL Server (https://en.wikipedia.org/wiki/Microsoft_SQL_Server) MySQL (https://www.mysql.com/) RAFT Protocol (https://raft.github.io/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
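To ground the discussion of what a streaming data application contains once the infrastructure is abstracted away, here is a minimal sketch of the application-level piece: a handler invoked per event that reads and updates per-key state. The in-memory dict stands in for the durable, ordered state such a platform manages; this is not Grainite's actual API, and the event fields are hypothetical.

```python
# Sketch: the application-level shape of a streaming data app -- a per-event
# handler that updates per-key state and reacts when a condition is crossed.
# Ordering, retries, and durable storage are what the platform would provide.

state = {}   # per-account running balances (stand-in for durable state)

def handle_event(event: dict) -> None:
    account = event["account"]
    balance = state.get(account, 0) + event["amount"]
    state[account] = balance
    if balance < 0:
        print(f"alert: account {account} overdrawn ({balance})")

for evt in [{"account": "acct-1", "amount": 500},
            {"account": "acct-1", "amount": -650},
            {"account": "acct-2", "amount": 75}]:
    handle_event(evt)

print(state)   # {'acct-1': -150, 'acct-2': 75}
```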
52 minutes | Mar 19, 2023
Aligning Data Security With Business Productivity To Deploy Analytics Safely And At Speed
Summary As with all aspects of technology, security is a critical element of data applications, and the different controls can be at cross purposes with productivity. In this episode Yoav Cohen from Satori shares his experiences as a practitioner in the space of data security and how to align with the needs of engineers and business users. He also explains why data security is distinct from application security and some methods for reducing the challenge of working across different data systems. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Join in with the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) today RudderStack makes it easy for data teams to build a customer data platform on their own warehouse. Use their state of the art pipelines to collect all of your data, build a complete view of your customer and sync it to every downstream tool. Sign up for free at dataengineeringpodcast.com/rudder (https://www.dataengineeringpodcast.com/rudder) Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender (https://www.dataengineeringpodcast.com/timextender) where you can do two things: watch us build a data estate in 15 minutes and start for free today. Your host is Tobias Macey and today I'm interviewing Yoav Cohen about the challenges that data teams face in securing their data platforms and how that impacts the productivity and adoption of data in the organization Interview Introduction How did you get involved in the area of data management? Data security is a very broad term. Can you start by enumerating some of the different concerns that are involved? How has the scope and complexity of implementing security controls on data systems changed in recent years? In your experience, what is a typical number of data locations that an organization is trying to manage access/permissions within? What are some of the main challenges that data/compliance teams face in establishing and maintaining security controls? How much of the problem is technical vs. procedural/organizational? 
As a vendor in the space, how do you think about the broad categories/boundary lines for the different elements of data security? (e.g. masking vs. RBAC, etc.) What are the different layers that are best suited to managing each of those categories? (e.g. masking and encryption in storage layer, RBAC in warehouse, etc.) What are some of the ways that data security and organizational productivity are at odds with each other? What are some of the shortcuts that you see teams and individuals taking to address the productivity hit from security controls? What are some of the methods that you have found to be most effective at mitigating or even improving productivity impacts through security controls? How does up-front design of the security layers improve the final outcome vs. trying to bolt on security after the platform is already in use? How can education about the motivations for different security practices improve compliance and user experience? What are the most interesting, innovative, or unexpected ways that you have seen data teams align data security and productivity? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data security technology? What are the areas of data security that still need improvements? Contact Info Yoav Cohen (https://www.linkedin.com/in/yoav-cohen-7a4ba23/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Satori (https://satoricyber.com) Podcast Episode (https://www.dataengineeringpodcast.com/satori-cloud-data-governance-episode-165) Data Masking (https://en.wikipedia.org/wiki/Data_masking) RBAC == Role Based Access Control (https://en.wikipedia.org/wiki/Role-based_access_control) ABAC == Attribute Based Access Control (https://en.wikipedia.org/wiki/Attribute-based_access_control) Gartner Data Security Platform Report (https://www.gartner.com/en/documents/4006252) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
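As a rough sketch of the masking-versus-access-control distinction raised in this conversation, the example below applies a per-role masking policy to a small table in Python. The role names, column names, and policy structure are illustrative assumptions rather than Satori's implementation: access control decides whether a role may read a table at all, while masking decides what form of a sensitive column that role sees.

```python
import pandas as pd

# Hypothetical role -> column-masking policy. RBAC would decide *whether* a
# role can query the table; masking decides *what form* of the values it sees.
MASKING_POLICY = {
    "analyst": {"email": lambda v: v.split("@")[0][:2] + "***@" + v.split("@")[1]},
    "admin": {},  # admins see raw values
}

def apply_masking(df: pd.DataFrame, role: str) -> pd.DataFrame:
    """Return a copy of df with sensitive columns masked for the given role."""
    masked = df.copy()
    for column, mask_fn in MASKING_POLICY.get(role, {}).items():
        if column in masked.columns:
            masked[column] = masked[column].map(mask_fn)
    return masked

users = pd.DataFrame({"name": ["Ada", "Grace"], "email": ["ada@example.com", "grace@example.com"]})
print(apply_masking(users, "analyst"))  # emails partially redacted
print(apply_masking(users, "admin"))    # emails unchanged
```

In practice the policy would live in a central service rather than in application code, which is part of what makes layering these controls across many data stores difficult.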
49 minutes | Mar 10, 2023
Use Your Data Warehouse To Power Your Product Analytics With NetSpring
Summary With the rise of the web and digital business came the need to understand how customers are interacting with the products and services that are being sold. Product analytics has grown into its own category and brought with it several services with generational differences in how they approach the problem. NetSpring is a warehouse-native product analytics service that allows you to gain powerful insights into your customers and their needs by combining your event streams with the rest of your business data. In this episode Priyendra Deshwal explains how NetSpring is designed to empower your product and data teams to build and explore insights around your products in a streamlined and maintainable workflow. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Join in with the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) today! RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder (https://www.dataengineeringpodcast.com/rudder) Your host is Tobias Macey and today I'm interviewing Priyendra Deshwal about how NetSpring is using the data warehouse to deliver a more flexible and detailed view of your product analytics Interview Introduction How did you get involved in the area of data management? Can you describe what NetSpring is and the story behind it? What are the activities that constitute "product analytics" and what are the roles/teams involved in those activities? When teams first come to you, what are the common challenges that they are facing and what are the solutions that they have attempted to employ? Can you describe some of the challenges involved in bringing product analytics into enterprise or highly regulated environments/industries? How does a warehouse-native approach simplify that effort? There are many different players (both commercial and open source) in the product analytics space. Can you share your view on the role that NetSpring plays in that ecosystem? How is the NetSpring platform implemented to be able to best take advantage of modern warehouse technologies and the associated data stacks? What are the pre-requisites for an organization's infrastructure/data maturity for being able to benefit from NetSpring? How have the goals and implementation of the NetSpring platform evolved from when you first started working on it? Can you describe the steps involved in integrating NetSpring with an organization's existing warehouse? 
What are the signals that NetSpring uses to understand the customer journeys of different organizations? How do you manage the variance of the data models in the warehouse while providing a consistent experience for your users? Given that you are a product organization, how are you using NetSpring to power NetSpring? What are the most interesting, innovative, or unexpected ways that you have seen NetSpring used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on NetSpring? When is NetSpring the wrong choice? What do you have planned for the future of NetSpring? Contact Info LinkedIn (https://www.linkedin.com/in/priyendra-deshwal/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links NetSpring (https://www.netspring.io/) ThoughtSpot (https://www.thoughtspot.com/) Product Analytics (https://theproductmanager.com/topics/product-analytics-guide/) Amplitude (https://amplitude.com/) Mixpanel (https://mixpanel.com/) Customer Data Platform (https://blog.hubspot.com/service/customer-data-platform-guide) GDPR (https://en.wikipedia.org/wiki/General_Data_Protection_Regulation) CCPA (https://en.wikipedia.org/wiki/California_Consumer_Privacy_Act) Segment (https://segment.com/) Podcast Episode (https://www.dataengineeringpodcast.com/segment-customer-analytics-episode-72/) Rudderstack (https://www.rudderstack.com/) Podcast Episode (https://www.dataengineeringpodcast.com/rudderstack-open-source-customer-data-platform-episode-263/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
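For a concrete sense of what "warehouse-native" product analytics means in practice, here is a minimal sketch of a two-step funnel computed directly against an event table with plain SQL, using SQLite as a stand-in for a cloud warehouse. The table and column names are assumptions for illustration, not NetSpring's schema.

```python
import sqlite3

# In-memory stand-in for a warehouse "events" table; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id TEXT, event_name TEXT, event_ts TEXT);
INSERT INTO events VALUES
  ('u1', 'signup',   '2023-03-01'),
  ('u1', 'purchase', '2023-03-02'),
  ('u2', 'signup',   '2023-03-01');
""")

# A signup -> purchase funnel expressed as a join over the raw event stream,
# the kind of query a warehouse-native tool can run without copying events
# into a separate analytics store.
funnel_sql = """
SELECT
  COUNT(DISTINCT s.user_id) AS signed_up,
  COUNT(DISTINCT p.user_id) AS purchased
FROM events s
LEFT JOIN events p
  ON p.user_id = s.user_id
 AND p.event_name = 'purchase'
 AND p.event_ts >= s.event_ts
WHERE s.event_name = 'signup';
"""
print(conn.execute(funnel_sql).fetchone())  # (2, 1)
```

Because the query runs where the data already lives, the same events stay joinable with the rest of the business data in the warehouse.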
46 minutes | Mar 6, 2023
Exploring The Nuances Of Building An Intentional Data Culture
Summary The ecosystem for data professionals has matured to the point that there are a large and growing number of distinct roles. With the scope and importance of data steadily increasing it is important for organizations to ensure that everyone is aligned and operating in a positive environment. To help facilitate the nascent conversation about what constitutes an effective and productive data culture, the team at Data Council have dedicated an entire conference track to the subject. In this episode Pete Soderling and Maggie Hays join the show to explore this topic and their experience preparing for the upcoming conference. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender (https://www.dataengineeringpodcast.com/timextender) where you can do two things: watch us build a data estate in 15 minutes and start for free today. Your host is Tobias Macey and today I'm interviewing Pete Soderling and Maggie Hays about the growing importance of establishing and investing in an organization's data culture and their experience forming an entire conference track around this topic Interview Introduction How did you get involved in the area of data management? Can you describe what your working definition of "Data Culture" is? In what ways is a data culture distinct from an organization's corporate culture? How are they interdependent? What are the elements that are most impactful in forming the data culture of an organization? What are some of the motivations that teams/companies might have in fighting against the creation and support of an explicit data culture? Are there any strategies that you have found helpful in counteracting those tendencies? In terms of the conference, what are the factors that you consider when deciding how to group the different presentations into tracks or themes? What are the experiences that you have had personally and in community interactions that led you to elevate data culture to be its own track? What are the broad challenges that practitioners are facing as they develop their own understanding of what constitutes a healthy and productive data culture? What are some of the risks that you considered when forming this track and evaluating proposals? What are your criteria for determining whether this track is successful? What are the most interesting, innovative, or unexpected aspects of data culture that you have encountered through developing this track? What are the most interesting, unexpected, or challenging lessons that you have learned while working on selecting presentations for this year's event?
What do you have planned for the future of this topic at Data Council events? Contact Info Pete @petesoder (https://twitter.com/petesoder) on Twitter LinkedIn (https://www.linkedin.com/in/petesoder) Maggie LinkedIn (https://www.linkedin.com/in/maggie-hays) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Data Council (https://datacouncil.ai/austin) Podcast Episode (https://www.dataengineeringpodcast.com/data-council-data-professional-community-episode-96) Data Community Fund (https://www.datacommunity.fund) DataHub (https://datahubproject.io/) Podcast Episode (https://www.dataengineeringpodcast.com/acryl-data-datahub-metadata-graph-episode-230/) Database Design For Mere Mortals (https://amzn.to/3ZFV6dU) by Michael J. Hernandez (affiliate link) SOAP (https://en.wikipedia.org/wiki/SOAP) REST (https://en.wikipedia.org/wiki/Representational_state_transfer) Econometrics (https://en.wikipedia.org/wiki/Econometrics) DBA == Database Administrator (https://www.careerexplorer.com/careers/database-administrator/) Conway's Law (https://en.wikipedia.org/wiki/Conway%27s_law) dbt (https://www.getdbt.com/) Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
47 minutes | Feb 27, 2023
Building A Data Mesh Platform At PayPal
Summary There has been a lot of discussion about the practical application of data mesh and how to implement it in an organization. Jean-Georges Perrin was tasked with designing a new data platform implementation at PayPal and wound up building a data mesh. In this episode he shares that journey and the combination of technical and organizational challenges that he encountered in the process. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender (https://www.dataengineeringpodcast.com/timextender) where you can do two things: watch us build a data estate in 15 minutes and start for free today. Your host is Tobias Macey and today I'm interviewing Jean-Georges Perrin about his work at PayPal to implement a data mesh and the role of data contracts in making it work Interview Introduction How did you get involved in the area of data management? Can you start by describing the goals and scope of your work at PayPal to implement a data mesh? What are the core problems that you were addressing with this project? Is a data mesh ever "done"? What was your experience engaging at the organizational level to identify the granularity and ownership of the data products that were needed in the initial iteration? What was the impact of leading multiple teams on the design of how to implement communication/contracts throughout the mesh? What are the technical systems that you are relying on to power the different data domains? What is your philosophy on enforcing uniformity in technical systems vs. relying on interface definitions as the unit of consistency? What are the biggest challenges (technical and procedural) that you have encountered during your implementation? How are you managing visibility/auditability across the different data domains? (e.g. observability, data quality, etc.) What are the most interesting, innovative, or unexpected ways that you have seen PayPal's data mesh used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data mesh? When is a data mesh the wrong choice? What do you have planned for the future of your data mesh at PayPal? Contact Info LinkedIn (https://www.linkedin.com/in/jgperrin/) Blog (https://jgp.ai/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. 
The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Data Mesh (https://www.thoughtworks.com/en-us/what-we-do/data-and-ai/data-mesh) O'Reilly Book (https://amzn.to/3Z5nC8T) (affiliate link) The next generation of Data Platforms is the Data Mesh (https://medium.com/paypal-tech/the-next-generation-of-data-platforms-is-the-data-mesh-b7df4b825522) PayPal (https://about.pypl.com/about-us/default.aspx) Conway's Law (https://en.wikipedia.org/wiki/Conway%27s_law) Data Mesh For All Ages - US (https://amzn.to/3YzVRop), Data Mesh For All Ages - UK (https://amzn.to/3YzVRop) Data Mesh Radio (https://daappod.com/data-mesh-radio/) Data Mesh Community (https://datameshlearning.com/) Data Mesh In Action (http://jgp.ai/dmia) Great Expectations (https://greatexpectations.io/) Podcast Episode (https://www.dataengineeringpodcast.com/great-expectations-technical-debt-data-pipeline-episode-117/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
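Since data contracts are central to this episode, here is a minimal sketch of what enforcing one could look like: a producer publishes an explicit, versioned schema and a consumer validates records against it before accepting them. The contract format and field names are invented for illustration and are not the format used at PayPal.

```python
# A data contract as an explicit, versioned schema that producers and
# consumers agree on. Structure and field names here are hypothetical.
ORDERS_CONTRACT = {
    "name": "orders",
    "version": "1.0.0",
    "fields": {
        "order_id": str,
        "amount_usd": float,
        "placed_at": str,  # ISO-8601 timestamp
    },
}

def validate_record(record: dict, contract: dict) -> list[str]:
    """Return a list of contract violations for one record."""
    errors = []
    for field, expected_type in contract["fields"].items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors

print(validate_record({"order_id": "o-1", "amount_usd": 12.5, "placed_at": "2023-02-27T00:00:00Z"}, ORDERS_CONTRACT))  # []
print(validate_record({"order_id": "o-2", "amount_usd": "12.5"}, ORDERS_CONTRACT))  # two violations
```

The point is less the validation code than the interface: each data domain can choose its own storage and tooling as long as it honors the contract it publishes.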
55 minutes | Feb 19, 2023
The View Below The Waterline Of Apache Iceberg And How It Fits In Your Data Lakehouse
Summary Cloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because of their complete ownership of your data they constrain the possibilities of what data you can store and how it can be used. Projects like Apache Iceberg provide a viable alternative in the form of data lakehouses that provide the scalability and flexibility of data lakes, combined with the ease of use and performance of data warehouses. Ryan Blue helped create the Iceberg project, and in this episode he rejoins the show to discuss how it has evolved and what he is doing in his new business Tabular to make it even easier to implement and maintain. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to timextender.com/dataengineering where you can do two things: watch us build a data estate in 15 minutes and start for free today. Your host is Tobias Macey and today I'm interviewing Ryan Blue about the evolution and applications of the Iceberg table format and how he is making it more accessible at Tabular Interview Introduction How did you get involved in the area of data management? Can you describe what Iceberg is and its position in the data lake/lakehouse ecosystem? Since it is fundamentally a specification, how do you manage compatibility and consistency across implementations? What are the notable changes in the Iceberg project and its role in the ecosystem since our last conversation in October of 2018? Around the time that Iceberg was first created at Netflix a number of alternative table formats were also being developed. What are the characteristics of Iceberg that lead teams to adopt it for their lakehouse projects? Given the constant evolution of the various table formats it can be difficult to determine an up-to-date comparison of their features, particularly earlier in their development. What are the aspects of this problem space that make it so challenging to establish unbiased and comprehensive comparisons? For someone who wants to manage their data in Iceberg tables, what does the implementation look like? How does that change based on the type of query/processing engine being used? Once a table has been created, what are the capabilities of Iceberg that help to support ongoing use and maintenance? What are the most interesting, innovative, or unexpected ways that you have seen Iceberg used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Iceberg/Tabular? When is Iceberg/Tabular the wrong choice? What do you have planned for the future of Iceberg/Tabular?
Contact Info LinkedIn (https://www.linkedin.com/in/rdblue/) rdblue (https://github.com/rdblue) on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Iceberg (https://iceberg.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/) Hadoop (https://hadoop.apache.org/) Data Lakehouse (https://www.forbes.com/sites/bernardmarr/2022/01/18/what-is-a-data-lakehouse-a-super-simple-explanation-for-anyone/) ACID == Atomic, Consistent, Isolated, Durable (https://en.wikipedia.org/wiki/ACID) Apache Hive (https://hive.apache.org/) Apache Impala (https://impala.apache.org/) Bodo (https://www.bodo.ai/) Podcast Episode (https://www.dataengineeringpodcast.com/bodo-parallel-data-processing-python-episode-223/) StarRocks (https://www.starrocks.io/) Dremio (https://www.dremio.com/) Podcast Episode (https://www.dataengineeringpodcast.com/dremio-open-data-lakehouse-episode-333/) DDL == Data Definition Language (https://en.wikipedia.org/wiki/Data_definition_language) Trino (https://trino.io/) PrestoDB (https://prestodb.io/) Apache Hudi (https://hudi.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/hudi-streaming-data-lake-episode-209/) dbt (https://www.getdbt.com/) Apache Flink (https://flink.apache.org/) TileDB (https://tiledb.com/) Podcast Episode (https://www.dataengineeringpodcast.com/tiledb-universal-data-engine-episode-146/) CDC == Change Data Capture (https://en.wikipedia.org/wiki/Change_data_capture) Substrait (https://substrait.io/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
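To make the "what does the implementation look like" question more concrete, here is a minimal sketch of creating and querying an Iceberg table through Spark SQL. It assumes the Iceberg Spark runtime jar is available and that a catalog named demo is configured; the catalog, database, table, and column names are placeholders rather than a Tabular-specific setup.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime is on the classpath; "demo" is a
# hypothetical Hadoop-style catalog backed by a local warehouse path.
spark = (
    SparkSession.builder
    .appName("iceberg-sketch")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Iceberg tables are ordinary SQL tables with a different format provider,
# so the same statements work across engines that implement the spec.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, ts TIMESTAMP, payload STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp(), 'hello')")
spark.sql("SELECT * FROM demo.db.events").show()

# Metadata tables expose snapshots, which is what enables time travel and
# table maintenance discussed in the episode.
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()
```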
52 minutes | Feb 11, 2023
Let The Whole Team Participate In Data With The Quilt Versioned Data Hub
Summary Data is a team sport, but it's often difficult for everyone on the team to participate. For a long time the mantra of data tools has been "by developers, for developers", which automatically excludes a large portion of the business members who play a crucial role in the success of any data project. Quilt Data was created as an answer to make it easier for everyone to contribute to the data being used by an organization and collaborate on its application. In this episode Aneesh Karve shares the journey that Quilt has taken to provide an approachable interface for working with versioned data in S3 that empowers everyone to collaborate. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring (https://materialize.com/careers/) across all functions! Your host is Tobias Macey and today I'm interviewing Aneesh Karve about how Quilt Data helps you bring order to your chaotic data in S3 with transactional versioning and data discovery built in Interview Introduction How did you get involved in the area of data management? Can you describe what Quilt is and the story behind it? How have the goals and features of the Quilt platform changed since I spoke with Kevin in June of 2018? What are the main problems that users are trying to solve when they find Quilt? What are some of the alternative approaches/products that they are coming from? How does Quilt compare with options such as LakeFS, Unstruk, Pachyderm, etc.? Can you describe how Quilt is implemented? What are the types of tools and systems that Quilt gets integrated with? How do you manage the tension between supporting the lowest common denominator, while providing options for more advanced capabilities? What is a typical workflow for a team that is using Quilt to manage their data? What are the most interesting, innovative, or unexpected ways that you have seen Quilt used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Quilt? When is Quilt the wrong choice? What do you have planned for the future of Quilt? Contact Info LinkedIn (https://www.linkedin.com/in/aneeshkarve/) @akarve (https://twitter.com/akarve) on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. 
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Quilt Data (https://quiltdata.com/) Podcast Episode (https://www.dataengineeringpodcast.com/quilt-data-with-kevin-moore-episode-37/) UW Madison (https://www.wisc.edu/) Docker Swarm (https://docs.docker.com/engine/swarm/) Kaggle (https://www.kaggle.com/) open.quiltdata.com (https://open.quiltdata.com/) FinOS Perspective (https://perspective.finos.org/) LakeFS (https://lakefs.io/) Podcast Episode (https://www.dataengineeringpodcast.com/lakefs-data-lake-versioning-episode-157/) Pachyderm (https://www.pachyderm.com/) Podcast Episode (https://www.dataengineeringpodcast.com/pachyderm-data-lineage-episode-82) Unstruk (https://www.unstruk.com/) Podcast Episode (https://www.dataengineeringpodcast.com/unstruk-unstructured-data-warehouse-episode-196/) Parquet (https://parquet.apache.org/) Avro (https://avro.apache.org/) ORC (https://orc.apache.org/) Cloudformation (https://aws.amazon.com/cloudformation/) Troposphere (https://github.com/cloudtools/troposphere) CDK == Cloud Development Kit (https://aws.amazon.com/cdk/) Shadow IT (https://en.wikipedia.org/wiki/Shadow_IT) Podcast Episode (https://www.dataengineeringpodcast.com/shadow-it-data-analytics-episode-121) Delta Lake (https://delta.io/) Podcast Episode (https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/) Apache Iceberg (https://iceberg.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/) Datasette (https://datasette.io/) Frictionless (https://frictionlessdata.io/) DVC (https://dvc.org/) Podcast.__init__ Episode (https://www.pythonpodcast.com/data-version-control-episode-206/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
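As a rough illustration of the versioned-package workflow described here, the sketch below uses the quilt3 Python client to build, push, and re-fetch a package backed by S3. The bucket, package, and file names are placeholders, and the exact calls should be checked against the current Quilt documentation.

```python
import quilt3

# Build a versioned data package and push it to an S3-backed registry.
# Bucket and package names below are placeholders.
pkg = quilt3.Package()
pkg.set("data/users.csv", "local/users.csv")           # add a local file
pkg.set_meta({"owner": "analytics", "source": "crm"})   # package-level metadata

# Each push creates a new immutable revision that collaborators can browse
# and install without overwriting previous versions.
pkg.push("analytics/users", registry="s3://example-quilt-bucket", message="initial load")

# Later, anyone on the team can pull that same revision back down.
restored = quilt3.Package.browse("analytics/users", registry="s3://example-quilt-bucket")
restored["data/users.csv"].fetch("downloaded/users.csv")
```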
32 minutes | Feb 6, 2023
Reflecting On The Past 6 Years Of Data Engineering
Summary This podcast started almost exactly six years ago, and the technology landscape was much different than it is now. In that time there have been a number of generational shifts in how data engineering is done. In this episode I reflect on some of the major themes and take a brief look forward at some of the upcoming changes. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Your host is Tobias Macey and today I'm reflecting on the major trends in data engineering over the past 6 years Interview Introduction 6 years of running the Data Engineering Podcast Around the first time that data engineering was discussed as a role Followed on from hype about "data science" Hadoop era Streaming Lambda and Kappa architectures Not really referenced anymore "Big Data" era of capture everything has shifted to focusing on data that presents value Regulatory environment increases risk, better tools introduce more capability to understand what data is useful Data catalogs Amundsen and Alation Orchestration engine Oozie, etc. -> Airflow and Luigi -> Dagster, Prefect, Lyft, etc. Orchestration is now a part of most vertical tools Cloud data warehouses Data lakes DataOps and MLOps Data quality to data observability Metadata for everything Data catalog -> data discovery -> active metadata Business intelligence Read only reports to metric/semantic layers Embedded analytics and data APIs Rise of ELT dbt Corresponding introduction of reverse ETL What are the most interesting, unexpected, or challenging lessons that you have learned while working on running the podcast? What do you have planned for the future of the podcast? Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
51 minutes | Jan 30, 2023
Let Your Business Intelligence Platform Build The Models Automatically With Omni Analytics
Summary Business intelligence has gone through many generational shifts, but each generation has largely maintained the same workflow. Data analysts create reports that are used by the business to understand and direct the business, but the process is very labor and time intensive. The team at Omni have taken a new approach by automatically building models based on the queries that are executed. In this episode Chris Merrick shares how they manage integration and automation around the modeling layer and how it improves the organizational experience of business intelligence. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring (https://materialize.com/careers/) across all functions! Your host is Tobias Macey and today I'm interviewing Chris Merrick about the Omni Analytics platform and how they are adding automatic data modeling to your business intelligence Interview Introduction How did you get involved in the area of data management? Can you describe what Omni Analytics is and the story behind it? What are the core goals that you are trying to achieve with building Omni? Business intelligence has gone through many evolutions. What are the unique capabilities that Omni Analytics offers over other players in the market? What are the technical and organizational anti-patterns that typically grow up around BI systems? What are the elements that contribute to BI being such a difficult product to use effectively in an organization? Can you describe how you have implemented the Omni platform? How have the design/scope/goals of the product changed since you first started working on it? What does the workflow for a team using Omni look like? What are some of the developments in the broader ecosystem that have made your work possible? What are some of the positive and negative inspirations that you have drawn from the experience that you and your team-mates have gained in previous businesses? What are the most interesting, innovative, or unexpected ways that you have seen Omni used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Omni? When is Omni the wrong choice? What do you have planned for the future of Omni? Contact Info LinkedIn (https://www.linkedin.com/in/merrickchristopher/) @cmerrick (https://twitter.com/cmerrick) on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. 
The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Omni Analytics (https://www.exploreomni.com/) Stitch (https://www.stitchdata.com/) RJ Metrics (https://en.wikipedia.org/wiki/RJMetrics) Looker (https://www.looker.com/) Podcast Episode (https://www.dataengineeringpodcast.com/looker-with-daniel-mintz-episode-55/) Singer (https://www.singer.io/) dbt (https://www.getdbt.com/) Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/) Teradata (https://www.teradata.com/) Fivetran (https://www.fivetran.com/) Apache Arrow (https://arrow.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/voltron-data-apache-arrow-episode-346/) DuckDB (https://duckdb.org/) Podcast Episode (https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/) BigQuery (https://cloud.google.com/bigquery) Snowflake (https://www.snowflake.com/en/) Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
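To illustrate the general idea of deriving reusable models from the queries people already run (only the idea; this is not how Omni is implemented), here is a deliberately naive sketch that scans a query log and surfaces the most frequently co-queried tables as a candidate model. The queries are made up, and a real system would use a proper SQL parser rather than a regular expression.

```python
import re
from collections import Counter

# Hypothetical query log; in practice this would come from warehouse history.
query_log = [
    "SELECT o.id, c.name FROM orders o JOIN customers c ON o.customer_id = c.id",
    "SELECT c.name, SUM(o.total) FROM orders o JOIN customers c ON o.customer_id = c.id GROUP BY c.name",
    "SELECT * FROM pageviews",
]

# Count which pairs of tables are queried together to suggest a reusable join.
join_pairs = Counter()
for sql in query_log:
    tables = re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql, flags=re.IGNORECASE)
    if len(tables) >= 2:
        join_pairs[tuple(sorted(tables[:2]))] += 1

most_common, count = join_pairs.most_common(1)[0]
print(f"candidate model: join of {most_common}, seen {count} times")
# -> candidate model: join of ('customers', 'orders'), seen 2 times
```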
46 minutes | Jan 22, 2023
Safely Test Your Applications And Analytics With Production Quality Data Using Tonic AI
Summary The most interesting and challenging bugs always happen in production, but recreating them is a constant challenge due to differences in the data that you are working with. Building your own scripts to replicate data from production is time-consuming and error-prone. Tonic is a platform designed to solve the problem of having reliable, production-like data available for developing and testing your software, analytics, and machine learning projects. In this episode Adam Kamor explores the factors that make this such a complex problem to solve, the approach that he and his team have taken to turn it into a reliable product, and how you can start using it to replace your own collection of scripts. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring (https://materialize.com/careers/) across all functions! Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda (https://www.dataengineeringpodcast.com/gartnerda) today to find out more. Your host is Tobias Macey and today I'm interviewing Adam Kamor about Tonic, a service for generating data sets that are safe for development, analytics, and machine learning Interview Introduction How did you get involved in the area of data management? Can you describe what Tonic is and the story behind it? What are the core problems that you are trying to solve? What are some of the ways that fake or obfuscated data is used in development and analytics workflows? What are the challenges of reliably subsetting data? What is the impact of ORMs and the bad habits developers get into with database modeling? Can you describe how Tonic is implemented? What are the units of composition that you are building to allow for evolution and expansion of your product? How have the design and goals of the platform evolved since you started working on it? Can you describe some of the different workflows that customers build on top of your various tools? What are the most interesting, innovative, or unexpected ways that you have seen Tonic used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Tonic? When is Tonic the wrong choice? What do you have planned for the future of Tonic? Contact Info LinkedIn (https://www.linkedin.com/in/adam-kamor-85720b48/) @AdamKamor (https://twitter.com/adamkamor) on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Tonic (https://hubs.la/Q01yX4qN0) Djinn (https://hubs.la/Q01yX4FL0) Django (https://www.djangoproject.com/) Ruby on Rails (https://rubyonrails.org/) C# (https://learn.microsoft.com/en-us/dotnet/csharp/tour-of-csharp/) Entity Framework (https://learn.microsoft.com/en-us/dotnet/csharp/tour-of-csharp/) PostgreSQL (https://www.postgresql.org/) MySQL (https://www.mysql.com/) Oracle DB (https://www.oracle.com/database/) MongoDB (https://www.mongodb.com/) Parquet (https://parquet.apache.org/) Databricks (https://www.databricks.com/) Mockaroo (https://www.mockaroo.com/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
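For readers who want a feel for the underlying problem, here is a minimal sketch of generating production-shaped but entirely synthetic rows with the Faker library. It only illustrates the concept; it does not attempt the harder parts discussed in the episode, such as subsetting related tables while preserving referential integrity, and the column names are invented.

```python
import random
from faker import Faker

fake = Faker()
Faker.seed(42)
random.seed(42)

def fake_customer(customer_id: int) -> dict:
    """Generate one production-shaped but entirely synthetic customer row."""
    return {
        "customer_id": customer_id,
        "name": fake.name(),
        "email": fake.email(),
        "signup_date": fake.date_between(start_date="-2y", end_date="today").isoformat(),
        "plan": random.choice(["free", "pro", "enterprise"]),
    }

# Ten rows that are safe to ship to a developer laptop or a CI environment.
dev_customers = [fake_customer(i) for i in range(1, 11)]
print(dev_customers[0])
```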
49 minutes | Jan 16, 2023
Building Applications With Data As Code On The DataOS
Summary The modern data stack has made it more economical to use enterprise grade technologies to power analytics at organizations of every scale. Unfortunately it has also introduced new overhead to manage the full experience as a single workflow. At the Modern Data Company they created the DataOS platform as a means of driving your full analytics lifecycle through code, while providing automatic knowledge graphs and data discovery. In this episode Srujan Akula explains how the system is implemented and how you can start using it today with your existing data systems. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring (https://materialize.com/careers/) across all functions! Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo (http://www.dataengineeringpodcast.com/montecarlo) to learn more. Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda (https://www.dataengineeringpodcast.com/gartnerda) today to find out more. Your host is Tobias Macey and today I'm interviewing Srujan Akula about DataOS, a pre-integrated and managed data platform built by The Modern Data Company Interview Introduction How did you get involved in the area of data management? Can you describe what your mission at The Modern Data Company is and the story behind it? Your flagship (only?) product is a platform that you're calling DataOS. What is the scope and goal of that platform? Who is the target audience? On your site you refer to the idea of "data as software". What are the principles and ways of thinking that are encompassed by that concept? What are the platform capabilities that are required to make it possible? 
There are 11 "Key Features" listed on your site for the DataOS. What was your process for identifying the "must have" vs "nice to have" features for launching the platform? Can you describe the technical architecture that powers your DataOS product? What are the core principles that you are optimizing for in the design of your platform? How have the design and goals of the system changed or evolved since you started working on DataOS? Can you describe the workflow for the different practitioners and stakeholders working on an installation of DataOS? What are the interfaces and escape hatches that are available for integrating with and extending the operation of the DataOS? What are the features or capabilities that you are expressly choosing not to implement? (e.g. ML pipelines, data sharing, etc.) What are the design elements that you are focused on to make DataOS approachable and understandable by different members of an organization? What are the most interesting, innovative, or unexpected ways that you have seen DataOS used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on DataOS? When is DataOS the wrong choice? What do you have planned for the future of DataOS? Contact Info LinkedIn (https://www.linkedin.com/in/srujanakula/) @srujanakula (https://twitter.com/srujanakula) on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Modern Data Company (https://themoderndatacompany.com/) Alation (https://www.alation.com/) Airbyte (https://airbyte.com/) Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/) Fivetran (https://www.fivetran.com/) Podcast Episode (https://www.dataengineeringpodcast.com/fivetran-data-replication-episode-93/) Airflow (https://airflow.apache.org/) Dremio (https://www.dremio.com/) Podcast Episode (https://www.dataengineeringpodcast.com/dremio-with-tomer-shiran-episode-58/) PrestoDB (https://prestodb.io/) GraphQL (https://graphql.org/) Cypher (https://neo4j.com/developer/cypher/) graph query language Gremlin (https://en.wikipedia.org/wiki/Gremlin_(query_language)) graph query language The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)