In this episode David Yaffe and Johnny Graettinger share the story behind the business and technology and how you can start using it today to build a real-time data lake without all of the headache. What is the impact of continuous data flows on DAGs/orchestration of transforms? Closing Announcements: Thank you for listening!
Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data.
The goal of this post is to understand how data integrity best practices have been embraced time and time again, no matter the underlying technology. In the beginning, there was a data warehouse. The data warehouse (DW) was an approach to data architecture and structured data management that really hit its stride in the early 1990s.
Summary: Data lake architectures have largely been biased toward batch processing workflows due to the volume of data that they are designed for. With more real-time requirements and the increasing use of streaming data, there has been a struggle to merge fast, incremental updates with large, historical analysis.
Summary: A data lakehouse is intended to combine the benefits of data lakes (cost-effective, scalable storage and compute) and data warehouses (user-friendly SQL interface). Data lakes are notoriously complex. To start, can you share your definition of what constitutes a "Data Lakehouse"?
A few months ago, I uploaded a video where I discussed data warehouses, data lakes, and transactional databases. However, the world of data management is evolving rapidly, especially with the resurgence of AI and machine learning.
In this episode DeVaris Brown discusses the types of applications that are possible when teams don't have to manage the complex infrastructure necessary to support continuous data flows. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.
Over the years, the technology landscape for data management has given rise to various architecture patterns, each thoughtfully designed to cater to specific use cases and requirements. In keeping up with ever-evolving data management needs, we’re announcing new capabilities that support customers across all of these patterns.
Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team.
Summary: Data lakes offer a great deal of flexibility and the potential for reduced cost for your analytics, but they also introduce a great deal of complexity. What used to be entirely managed by the database engine is now a composition of multiple systems that need to be properly configured to work in concert.
Summary: Data lakes are gaining popularity due to their flexibility and reduced cost of storage. Along with the benefits there are some additional complexities to consider, including how to safely integrate new data sources or test out changes to existing pipelines. What do you have planned for the future of the project?
Summary: One of the perennial challenges posed by data lakes is how to keep them up to date as new data is collected. In this episode Ori Rafael shares his experiences from Upsolver and building scalable stream processing for integrating and analyzing data, and what the tradeoffs are when coming from a batch-oriented mindset.
Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake. Can you describe what SQLMesh is and the story behind it? DataOps is a term that has been co-opted and overloaded.
Summary: The Presto project has become the de facto option for building scalable open source analytics in SQL for the data lake. Can you give an overview of the options that are available for someone wanting to use its SQL engine for querying their data? (Hudi, Delta Lake, Iceberg, Nessie, LakeFS, etc.)
Summary: Designing a data platform is a complex and iterative undertaking which requires accounting for many conflicting needs. Designing a platform that relies on a data lake as its central architectural tenet adds further layers of difficulty. What are the elements that are still cumbersome or intractable?
Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. Data lakes in various forms have been gaining significant popularity as a unified interface to an organization's analytics. When is Fabric the wrong choice?
Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. Data lakes are notoriously complex. Go to dataengineeringpodcast.com/dagster today to get started.
What if your data lake could do more than just store information? What if it could think like a database? As data lakehouses evolve, they transform how enterprises manage, store, and analyze their data. Hudi, with its robust community and technical innovation, is well-positioned to lead this charge.
In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable. Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex.
In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform. Can you describe what role Trino and Iceberg play in Stripe's data architecture?
In this episode Pete Hunt, CEO of Dagster Labs, outlines these new capabilities, how they reduce the burden on data teams, and the increased collaboration that they enable across teams and business units. Data lakes are notoriously complex. Go to dataengineeringpodcast.com/dagster today to get started.
With Hybrid Tables’ fast, high-concurrency point operations, you can store application and workflow state directly in Snowflake, serve data without reverse ETL and build lightweight transactional apps while maintaining a single governance and security model for both transactional and analytical data — all on one platform.
In this episode she shares the practical steps to implementing a data governance practice in your organization, and the pitfalls to avoid. Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex.
In this episode Artyom Keydunov, creator of Cube, discusses the evolution and applications of the semantic layer as a component of your data platform, and how Cube provides speed and cost optimization for your data consumers. Data lakes are notoriously complex. Your first 30 days are free!
He highlights the role of data teams in modern organizations and how Synq is empowering them to achieve this. Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. Can you describe what Synq is and the story behind it?
Colleen Tartow has worked across all stages of the data lifecycle, and in this episode she shares her hard-earned wisdom about how to conduct an AI program for your organization. Data lakes are notoriously complex. What do you have planned for the future of your work at VAST Data? Your first 30 days are free!
Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. This episode is supported by Code Comments, an original podcast from Red Hat. Data lakes are notoriously complex. My thanks to the team at Code Comments for their support.
Summary: Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. Data lakes are notoriously complex. What is involved in integrating Nessie into a given data stack?
In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process. What is the current state of the ecosystem for data sharing protocols/practices/platforms?
In this episode Tasso Argyros, CEO of ActionIQ, gives a summary of the major epochs in database technologies and how he is applying the capabilities of cloud data warehouses to the challenge of building more comprehensive experiences for end-users through a modern customer data platform (CDP).
Summary: The most complicated part of data engineering is the effort involved in making the raw data fit into the narrative of the business. Master Data Management (MDM) is the process of building consensus around what the information actually means in the context of the business and then shaping the data to match those semantics.
Ghalib Suleiman has been on both sides of this equation and joins the show to share his hard-won wisdom about how to start and grow a data team in the early days of company growth. Can you start by sharing your conception of the responsibilities of a data team? When is it more practical to outsource the data work?
Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. Data lakes are notoriously complex.
Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. This episode is brought to you by Datafold, a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment.
In this episode Tobias Macey shares his thoughts on the challenges that he is facing as he prepares to build the next set of architectural layers for his data platform to enable a larger audience to start accessing the data being managed by his team. Data lakes are notoriously complex.
Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. This episode is supported by Code Comments, an original podcast from Red Hat. Data lakes are notoriously complex.
In this episode Gleb Mezhanskiy, founder and CEO of Datafold, discusses the different error conditions and solutions that you need to know about to ensure the accuracy of your data. Data lakes are notoriously complex. Can you start by outlining some of the situations where reconciling data between databases is needed?
While data warehouses are still in use, their use cases are limited because they only support structured data. Data lakes add support for semi-structured and unstructured data, and data lakehouses add further flexibility with better governance in a true hybrid solution built from the ground up.
Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains, even simple reports can become unwieldy to maintain. Data lakes are notoriously complex.
In this episode the host Tobias Macey shares his reflections on recent experiences where the abstractions leaked and some observations on how to deal with that situation in a data platform architecture. What do you have planned for the future of your data platform?
In this episode he explains the data collection and preparation process, the collection of model types and sizes that work together to power the experience, and how to incorporate it into your workflow to act as a second brain. Data lakes are notoriously complex.