Data storage has been evolving, from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew.
The batch world has been the default for years because of the complexities of running a reliable streaming system at scale. In this episode David Yaffe and Johnny Graettinger share the story behind the business and technology, and how you can start using it today to build a real-time data lake without all of the headaches.
Summary: Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. Cliff Crosland co-founded Scanner to provide fast querying of high-scale log data for security auditing. Can you describe what Scanner is and the story behind it?
Responding to data overload with a security data lake: Security professionals have to continually up their game to make sure that, from all the data at their disposal, they're using the correct inputs to identify vulnerabilities and incidents. We also discuss three layers of AI that can become an attack surface.
The simple idea was: how can we get more value from the transactional data in our operational systems spanning finance, sales, customer relationship management, and other siloed functions? There was no easy way to consolidate and analyze this data to more effectively manage our business. The answer: a data lake!
The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms.
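To make the OTF idea concrete, here is a minimal sketch of writing to an Apache Iceberg table from Python. It assumes the pyiceberg (0.6+) and pyarrow packages and an already-configured catalog named "default"; the namespace, table name, and columns are made up for illustration.

```python
# Minimal sketch: write to an Iceberg (open table format) table with pyiceberg.
# Assumes a catalog named "default" is configured (e.g. in ~/.pyiceberg.yaml);
# the namespace, table, and columns below are hypothetical.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")

rows = pa.table({
    "event_id": pa.array([1, 2, 3], type=pa.int64()),
    "payload": pa.array(["a", "b", "c"], type=pa.string()),
})

# Create the table from the Arrow schema if it does not already exist.
table = catalog.create_table_if_not_exists("analytics.events", schema=rows.schema)

# Append the Arrow batch; Iceberg records the new data files in a snapshot,
# so any engine reading the same catalog sees a consistent view of the table.
table.append(rows)
```

Because the table metadata lives in the catalog rather than in any one engine, Spark, Trino, or Snowflake could in principle read the same files without copying them.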
Summary: Any software system that survives long enough will require some form of migration or evolution. When that system is responsible for the data layer, the process becomes more challenging. Sriram Panyam has been involved in several projects that required migrating large volumes of data in high-traffic environments.
Summary: The first step of any data pipeline is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor.
Summary: A data lakehouse is intended to combine the benefits of data lakes (cost-effective, scalable storage and compute) and data warehouses (a user-friendly SQL interface).
Over the years, the technology landscape for data management has given rise to various architecture patterns, each thoughtfully designed to cater to specific use cases and requirements. These patterns include both centralized storage patterns, such as the data warehouse, data lake, and data lakehouse, and distributed patterns such as data mesh.
Summary: Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging.
Summary: One of the perennial challenges posed by data lakes is how to keep them up to date as new data is collected. In this episode Ori Rafael shares his experiences from Upsolver building scalable stream processing for integrating and analyzing data, and what the tradeoffs are when coming from a batch-oriented mindset.
Recognizing this shortcoming, and the capabilities that could be unlocked by a robust solution, Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies.
Today we’re focusing on customers who migrated from a cloud data warehouse to Snowflake and some of the benefits they saw. A consolidated data system to accommodate a big(ger) WHOOP: when a company experiences exponential growth over a short period, it’s easy for its data foundation to feel a bit like it was built on the fly.
In a recent customer workshop with a large retail data science media company, one of the attendees, an engineering leader, made the following observation: “Every time I go to a competitor's website, they only care about their system. How to onboard data into their system? I don’t care about their system.”
Summary: The Presto project has become the de facto option for building scalable, open source SQL analytics on the data lake. Another area that has been seeing a lot of activity is data lakes and the projects that make them more manageable and feature complete (e.g. Hudi, Delta Lake, Iceberg, Nessie, LakeFS).
Summary: Designing a data platform is a complex and iterative undertaking which requires accounting for many conflicting needs. Designing a platform that relies on a data lake as its central architectural tenet adds additional layers of difficulty. When is a data lake architecture the wrong choice?
Summary: Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. In this episode, Yingjun Wu explains how his engine is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
Beyond working with well-structured data in a data warehouse, modern AI systems can use deep learning and natural language processing to work effectively with unstructured and semi-structured data in data lakes and lakehouses.
This reduces the overall complexity of getting streaming data ready to use: simply create an external access integration with your existing Kafka solution. SnowConvert is an easy-to-use code conversion tool that accelerates legacy relational database management system (RDBMS) migrations to Snowflake.
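As a rough illustration of that step, the sketch below issues the relevant DDL through the Snowflake Python connector. It assumes snowflake-connector-python and a role with the privilege to create network rules and integrations; the object names and broker host are hypothetical placeholders, not part of the product copy above.

```python
# Minimal sketch: allow Snowflake to reach an external Kafka broker by
# creating a network rule and an external access integration.
# Account, user, rule, integration, and broker names are all placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # placeholder
    user="my_user",              # placeholder
    password="...",              # use a secure credential store in practice
    role="ACCOUNTADMIN",
)

statements = [
    """
    CREATE OR REPLACE NETWORK RULE kafka_egress_rule
      MODE = EGRESS
      TYPE = HOST_PORT
      VALUE_LIST = ('broker.example.com:9092')
    """,
    """
    CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION kafka_access_integration
      ALLOWED_NETWORK_RULES = (kafka_egress_rule)
      ENABLED = TRUE
    """,
]

cur = conn.cursor()
try:
    for stmt in statements:
        cur.execute(stmt)  # permit outbound access to the Kafka endpoint
finally:
    cur.close()
    conn.close()
```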
Petr shares his journey from being an engineer to founding Synq, emphasizing the importance of treating data systems with the same rigor as engineering systems. He discusses the challenges and solutions in data reliability, including the need for transparency and ownership in data systems.
To address this shortcoming, Datorios created an observability platform for Flink that brings visibility to the internals of this popular stream processing system.
Snowflake is now making it even easier for customers to bring the platform’s usability, performance, governance and many workloads to more data with Iceberg tables (now generally available), unlocking full storage interoperability. Iceberg tables provide compute engine interoperability over a single copy of data.
In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform. What are the other systems that feed into and rely on the Trino/Iceberg service?
Data warehouses and data lakes play a crucial role for many businesses. They give businesses access to the data from all of their various systems, and often integrate that data so that end users can answer business-critical questions.
Anyone who’s been roaming around the forest of Data Engineering has probably run into many of the newish tools that have been growing rapidly around the concepts of Data Warehouses, Data Lakes, and Lakehouses … the merging of old relational database functionality with TB- and PB-level cloud-based file storage systems.
Summary: Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems, a number of common components have emerged that are shared across implementations.
Summary: Modern businesses aspire to be data-driven, and technologists enjoy working through the challenge of building data systems to support that goal. Data governance is the binding force between these two parts of the organization. What are some of the misconceptions that you encounter about data governance?
In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.
A substantial amount of the data that is being managed in these systems is related to customers and their interactions with an organization.
In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Start trusting your data with Monte Carlo today! Supercharge your business teams with customer data using Hightouch for Reverse ETL today.
WhyLogs is a powerful library for flexibly instrumenting all of your data systems to understand the entire lifecycle of your data, from source to productionized model. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies.
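For a sense of what that instrumentation looks like, here is a minimal whylogs sketch that profiles a pandas DataFrame. It assumes whylogs v1 and pandas are installed; the columns and values are made up.

```python
# Minimal sketch: profile a DataFrame with whylogs and inspect the summary.
# The data below is hypothetical; in practice you would log each batch that
# flows through a pipeline and compare profiles over time.
import pandas as pd
import whylogs as why

df = pd.DataFrame({
    "amount": [10.5, 99.0, 3.2],
    "country": ["US", "DE", "US"],
})

results = why.log(df)                 # build a statistical profile of the batch
profile_view = results.view()         # per-column counts, types, distributions
print(profile_view.to_pandas())       # summary table, one row per column
```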
The Machine Learning Platform (MLP) team at Netflix provides an entire ecosystem of tools around Metaflow, an open source machine learning infrastructure framework we started, to empower data scientists and machine learning practitioners to build and manage a variety of ML systems.
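As a rough sketch of what a Metaflow workflow looks like, the example below defines a two-step flow: steps are methods decorated with @step and wired into a DAG with self.next(). The flow name and artifact are made up for illustration.

```python
# Minimal, self-contained Metaflow flow. Run it with: python hello_flow.py run
from metaflow import FlowSpec, step


class HelloFlow(FlowSpec):

    @step
    def start(self):
        # Anything assigned to self becomes a versioned artifact of the run.
        self.message = "hello from Metaflow"
        self.next(self.end)

    @step
    def end(self):
        print(self.message)


if __name__ == "__main__":
    HelloFlow()
```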
Summary: Building a data platform is a substantial engineering endeavor. The services and systems need to be kept up to date, but so does the code that controls their behavior.
They opted for Snowflake, a cloud-native data platform ideal for SQL-based analysis. The team landed the data in a data lake implemented with cloud storage buckets and then loaded it into Snowflake, enabling fast access and smooth integrations with analytical tools. Here is another example.
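A minimal sketch of that "land in cloud storage, then load into Snowflake" pattern, again using the Python connector: the stage, table, and bucket names are hypothetical, and credentials/storage-integration details are deliberately elided.

```python
# Minimal sketch: point an external stage at the data lake bucket, then bulk
# load the staged Parquet files into a Snowflake table with COPY INTO.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",  # placeholders
    warehouse="LOAD_WH", database="ANALYTICS", schema="RAW",
)
cur = conn.cursor()
try:
    # External stage over the data lake location (auth details elided).
    cur.execute("""
        CREATE STAGE IF NOT EXISTS raw_events_stage
        URL = 's3://example-data-lake/events/'
        FILE_FORMAT = (TYPE = PARQUET)
    """)
    # Load the staged files into a table, matching columns by name.
    cur.execute("""
        COPY INTO raw_events
        FROM @raw_events_stage
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """)
finally:
    cur.close()
    conn.close()
```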
Summary: Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. What is involved in integrating Nessie into a given data stack?
While data warehouses are still in use, they are limited in use cases as they only support structured data. Data lakes add support for semi-structured and unstructured data, and data lakehouses add further flexibility with better governance in a true hybrid solution built from the ground up.
ERP and CRM systems are designed and built to fulfil a broad range of business processes and functions. This generalisation makes their data models complex and cryptic, and working with them requires domain expertise. As you do not want to start your development with uncertainty, you decide to go for the operational raw data directly.