The goal of this post is to understand how data integrity best practices have been embraced time and time again, regardless of the underlying technology. In the beginning, there was the data warehouse. The data warehouse (DW) is an approach to data architecture and structured data management that really hit its stride in the early 1990s.
A few months ago, I uploaded a video where I discussed data warehouses, data lakes, and transactional databases. However, the world of data management is evolving rapidly, especially with the resurgence of AI and machine learning.
Over the years, the technology landscape for data management has given rise to various architecture patterns, each thoughtfully designed to cater to specific use cases and requirements. Customers who require a hybrid of these patterns to support many different tools and languages have built a data lakehouse.
In this episode David Yaffe and Johnny Graettinger share the story behind the business and technology and how you can start using it today to build a real-time data lake without all of the headache. What is the impact of continuous data flows on DAGs/orchestration of transforms? Closing Announcements: Thank you for listening!
Summary A data lakehouse is intended to combine the benefits of data lakes (cost-effective, scalable storage and compute) and data warehouses (a user-friendly SQL interface). Data lakes are notoriously complex. To start, can you share your definition of what constitutes a "Data Lakehouse"?
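As a rough illustration of that combination, here is a minimal sketch (assuming a hypothetical local directory of Parquet files and the DuckDB library, not any particular vendor's product) of how a lakehouse-style engine presents a warehouse-like SQL interface directly over open file formats in cheap storage:

```python
import duckdb

# Hypothetical layout: a "lake" of Parquet files sitting in plain storage.
# A lakehouse-style engine queries them in place with warehouse-style SQL,
# rather than requiring a load step into a proprietary store.
con = duckdb.connect()
result = con.sql(
    """
    SELECT region, SUM(amount) AS total_sales
    FROM read_parquet('lake/sales/*.parquet')  -- path is illustrative
    GROUP BY region
    ORDER BY total_sales DESC
    """
).fetchall()
print(result)
```

The point of the sketch is the direction of travel: the SQL layer comes to the files, instead of the files being loaded into the SQL engine.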
In this episode DeVaris Brown discusses the types of applications that are possible when teams don't have to manage the complex infrastructure necessary to support continuous data flows. You can collect, transform, and route data across your entire stack with the platform's event streaming, ETL, and reverse ETL pipelines.
Summary Data lake architectures have largely been biased toward batch processing workflows due to the volume of data that they are designed for. With more real-time requirements and the increasing use of streaming data, there has been a struggle to merge fast, incremental updates with large, historical analysis.
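One common way to reconcile the two, sketched below under assumed column names (`id`, `amount`, `updated_at`) using pandas, is an incremental upsert: fold a small batch of fresh records into a large historical table by keeping only the newest version of each key:

```python
import pandas as pd

# Hypothetical data: a large historical table and a small batch of updates.
history = pd.DataFrame(
    {"id": [1, 2, 3], "amount": [100, 200, 300], "updated_at": [1, 1, 1]}
)
updates = pd.DataFrame(
    {"id": [2, 4], "amount": [250, 400], "updated_at": [2, 2]}
)

# Upsert: union both sets, then keep the latest row per key.
merged = (
    pd.concat([history, updates])
    .sort_values("updated_at")
    .drop_duplicates(subset="id", keep="last")
    .sort_values("id")
)
print(merged)
# Row 2 now reflects the update; row 4 is newly inserted.
```

Streaming systems apply essentially this reconciliation continuously rather than in scheduled batches, which is where the operational difficulty comes from.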
In this episode Priyendra Deshwal explains how NetSpring is designed to empower your product and data teams to build and explore insights around your products in a streamlined and maintainable workflow. Visit: dataengineeringpodcast.com/data-council today! Don't miss out on their only event this year!
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake. Can you describe what SQLMesh is and the story behind it? DataOps is a term that has been co-opted and overloaded.
Summary Building and maintaining a data lake is a choose-your-own-adventure of tools, services, and evolving best practices. The flexibility and freedom that data lakes provide allow for generating significant value, but they can also lead to anti-patterns and inconsistent quality in your analytics.
Summary The current trend in data management is to centralize the responsibilities of storing and curating the organization's information to a data engineering team. This organizational pattern is reinforced by the architectural pattern of data lakes as a solution for managing storage and access.
Summary There is a lot of attention on the database market and cloud data warehouses. While they provide a measure of convenience, they also require you to sacrifice a certain amount of control over your data. Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt to get started.
Summary Data lakes offer a great deal of flexibility and the potential for reduced cost for your analytics, but they also introduce a great deal of complexity. What used to be entirely managed by the database engine is now a composition of multiple systems that need to be properly configured to work in concert.
Summary Data warehouse technology has been around for decades and has gone through several generational shifts in that time. The current trends in data warehousing are oriented around cloud-native architectures that take advantage of dynamic scaling and the separation of compute and storage.
Summary A data lake can be a highly valuable resource, as long as it is well built and well managed. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.
Despite these limitations, data warehouses, introduced in the late 1980s based on ideas developed even earlier, remain in widespread use today for certain business intelligence and data analysis applications. While data warehouses are still in use, their use cases are limited because they only support structured data.
Summary One of the perennial challenges posed by data lakes is how to keep them up to date as new data is collected. In this episode Ori Rafael shares his experiences from Upsolver and building scalable stream processing for integrating and analyzing data, and what the tradeoffs are when coming from a batch-oriented mindset.
This article looks at the options available for storing and processing big data, which is too large for conventional databases to handle. There are two main options available: a data lake and a data warehouse. What is a Data Warehouse? What is a Data Lake?
Summary The market for data warehouse platforms is large and varied, with options for every use case. It was interesting to learn about some of the custom data types and performance optimizations that are included. For someone getting started with ClickHouse, can you describe how they should be thinking about data modeling?
In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex.
Summary Designing a data platform is a complex and iterative undertaking which requires accounting for many conflicting needs. Designing a platform that relies on a data lake as its central architectural tenet adds additional layers of difficulty. What are the elements that are still cumbersome or intractable?
When it was difficult to wire together the event collection, data modeling, reporting, and activation, it made sense to buy monolithic products that handled every stage of the customer data lifecycle. Now that the data warehouse has taken center stage, a new approach of composable customer data platforms is emerging.
The terms "data warehouse" and "data lake" may have confused you, and you have some questions. Essentially, the difference between a lake and a warehouse comes down to structure: a data lake stores raw data in its native format, while a data warehouse contains historical data that has been cleaned and arranged.
In this episode Tasso Argyros, CEO of ActionIQ, gives a summary of the major epochs in database technologies and how he is applying the capabilities of cloud data warehouses to the challenge of building more comprehensive experiences for end-users through a modern customer data platform (CDP).
Ghalib Suleiman has been on both sides of this equation and joins the show to share his hard-won wisdom about how to start and grow a data team in the early days of company growth. Can you start by sharing your conception of the responsibilities of a data team? When is it more practical to outsource the data work?
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. This episode is supported by Code Comments, an original podcast from Red Hat. Data lakes are notoriously complex. My thanks to the team at Code Comments for their support.
He describes how the platform is architected, the challenges related to selling cloud technologies into enterprise organizations, and how you can adopt Matillion for your own workflows to reduce the maintenance burden of data integration. No more shipping and praying: you can now know exactly what will change in your database!
"Data Lake vs Data Warehouse = Load First, Think Later vs Think First, Load Later." The terms data lake and data warehouse are frequently encountered when it comes to storing large volumes of data. Data Warehouse Architecture. What is a Data Lake?
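The slogan maps to schema-on-read versus schema-on-write. Below is a minimal sketch of the two load styles, assuming invented field names and raw JSON-lines input rather than any real pipeline:

```python
import json

# Hypothetical raw events, as they might land in a data lake ("load first").
raw_lines = [
    '{"user": "a", "amount": "19.99", "ts": "2024-01-01"}',
    '{"user": "b", "amount": "oops"}',  # malformed value, missing field
]

# Data lake style (schema-on-read): store everything now, interpret later.
lake = [json.loads(line) for line in raw_lines]

# Warehouse style (schema-on-write): validate and shape before loading;
# bad rows are rejected at ingest time ("think first").
warehouse = []
for record in lake:
    try:
        warehouse.append(
            {"user": str(record["user"]),
             "amount": float(record["amount"]),
             "ts": record["ts"]}
        )
    except (KeyError, ValueError):
        pass  # a real pipeline would route this to a dead-letter queue

print(len(lake), "rows in the lake;", len(warehouse), "rows in the warehouse")
```

The lake keeps both rows and defers interpretation; the warehouse admits only the row that satisfies the schema, which is exactly the "think first, load later" tradeoff.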
Summary Cloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because of their complete ownership of your data, they constrain the possibilities of what data you can store and how it can be used. We feel your pain.
Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Sign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses.
In this episode the host Tobias Macey shares his reflections on recent experiences where the abstractions leaked and some observations on how to deal with that situation in a data platform architecture. What do you have planned for the future of your data platform?
Different vendors offering data warehouses, data lakes, and now data lakehouses all offer their own distinct advantages and disadvantages for data teams to consider. So let's get to the bottom of the big question: what kind of data storage layer will provide the strongest foundation for your data platform?
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake. Can you describe what Agile Data Engine is and the story behind it?
In this episode Paul Blankley and Ryan Janssen explore the power of natural language driven data exploration combined with semantic modeling that enables an intuitive way for everyone in the business to access the data that they need to succeed in their work. Can you describe what Zenlytic is and the story behind it?
Open table formats track the data files within a table along with their column statistics, enabling efficient data management and retrieval by storing these files chronologically, with a history of DDL and DML actions and an index of data file locations. Such a format can also be integrated into major data platforms like Snowflake.
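To make the metadata idea concrete, here is a toy sketch (all names invented, not any real table format's API) of a table that records its data files together with per-column min/max statistics, which a query engine could use to skip files entirely:

```python
from dataclasses import dataclass, field

@dataclass
class DataFile:
    path: str
    row_count: int
    column_stats: dict  # per-column (min, max), e.g. {"amount": (0.0, 49.0)}

@dataclass
class TableMetadata:
    files: list = field(default_factory=list)
    history: list = field(default_factory=list)  # log of DDL/DML actions

    def append_file(self, data_file: DataFile) -> None:
        self.files.append(data_file)
        self.history.append(("ADD_FILE", data_file.path))

    def prune(self, column: str, value: float) -> list:
        """Return only files whose stats admit rows where column == value."""
        return [
            f for f in self.files
            if f.column_stats[column][0] <= value <= f.column_stats[column][1]
        ]

table = TableMetadata()
table.append_file(DataFile("part-0.parquet", 1000, {"amount": (0.0, 49.0)}))
table.append_file(DataFile("part-1.parquet", 1000, {"amount": (50.0, 99.0)}))

# A query filtering on amount = 75 only needs to read part-1.parquet.
print([f.path for f in table.prune("amount", 75.0)])
```

Real formats persist this metadata alongside the data files themselves, which is what lets any compatible engine prune and time-travel over the same table.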
Data management today is a continually developing field that requires careful consideration when deciding which solution to implement to store, process, and analyze data effectively. Two forms are frequently selected: the data warehouse and the data lake.
In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented and the long term improvements in your productivity that it provides. RudderStack helps you build a customer data platform on your warehouse or data lake.
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode.
In this episode Satish Jayanthi explores the benefits of incorporating column-aware tooling in the data modeling process. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake.
The Dominance of Lakehouses and Mutation Support. Lakehouses have become a standard pattern in data infrastructure, combining the best features of data lakes and warehouses. Unlike data lakes, which are predominantly append-only, lakehouses support data mutation natively.
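To see why mutation is hard on append-only storage, consider a toy copy-on-write scheme (the layout below is invented for illustration, not any real format's internals): immutable files are never edited in place; an update rewrites the affected file and commits a new table version that points at it:

```python
# Toy copy-on-write update over immutable files: an update rewrites the
# affected file and commits a new table version in the metadata.
versions = {
    1: {"part-0": [("a", 100), ("b", 200)]},  # version 1 of the table
}

def update(table_version: int, key: str, new_value: int) -> int:
    """Rewrite files containing `key`, producing a new table version."""
    new_files = {}
    for name, rows in versions[table_version].items():
        if any(k == key for k, _ in rows):
            rows = [(k, new_value if k == key else v) for k, v in rows]
            name = name + "-rewritten"
        new_files[name] = rows
    new_version = table_version + 1
    versions[new_version] = new_files  # an atomic metadata swap, in spirit
    return new_version

v2 = update(1, "b", 250)
print(versions[v2])   # reads at v2 see the update
print(versions[1])    # the old version stays readable (time travel)
```

Lakehouse table formats bake this bookkeeping into the table itself, which is what lets them offer native updates and deletes over storage that only supports appends.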
In this episode Gleb Mezhanskiy shares some valuable advice and insights into how you can build reliable and well-tested data assets with dbt and data-diff (see also the July 2021 and July 2022 interviews about data-diff). What are the roadblocks to data testing/validation that you see teams run into most often?
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies, and lead with purpose. Missing data? Can you describe what the SQLake product is and the story behind it?
In this episode Srujan Akula explains how the system is implemented and how you can start using it today with your existing data systems. Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. What is the scope and goal of your flagship (only?) platform?
In this episode Abe Gong brings his experiences with the Great Expectations project and community to discuss the technical and organizational considerations involved in applying these constraints to your data workflows. Can you describe your conception of a data contract? Closing Announcements: Thank you for listening!