Top Data Engineering Digest Cloud Computing Data Architecture Content for August, 2022

August, 2022

Data Lake / Lakehouse Guide: Powered by Data Lake Table Formats (Delta Lake, Iceberg, Hudi)

Simon Späti

AUGUST 25, 2022

Image by Rachel Claire on Pexels.

Ever wanted or been asked to build an open-source Data Lake offloading data for analytics? Asked yourself what components and features that would include? Didn’t know the difference between a Data Lakehouse and a Data Warehouse? Or you just wanted to govern your hundreds to thousands of files and have more database-like features but didn’t know how?

Data Lake

Data Lake Data Warehouse Government Data

ShortCircuitOperator in Apache Airflow: The guide

Marc Lamberti

AUGUST 11, 2022

The ShortCircuitOperator in Apache Airflow is simple but powerful. It allows skipping tasks based on the result of a condition. There are many reasons why you may want to stop running tasks. Let’s see how to use the ShortCircuitOperator and what you should be aware of. By the way, if you are new to Airflow, check out my courses here; you will get a special discount.

Coding

Coding Python Process IT
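
To make the idea of skipping tasks on a condition concrete, here is a minimal sketch, not taken from the linked guide. It assumes Airflow 2.3-style import paths. The DAG id, task ids, and the weekday condition are hypothetical names chosen for illustration. The Airflow import is wrapped in a guard only so that the condition function can be tried without Airflow installed.

```python
from datetime import datetime


def is_weekday(logical_date: datetime, **_) -> bool:
    """Condition: True on Monday-Friday, False on Saturday and Sunday."""
    return logical_date.weekday() < 5


try:
    # Import paths as of Airflow 2.3 (assumption); other versions may differ.
    from airflow import DAG
    from airflow.operators.empty import EmptyOperator
    from airflow.operators.python import ShortCircuitOperator
except ImportError:
    DAG = None  # allows trying the condition alone when Airflow is absent

if DAG is not None:
    with DAG(
        dag_id="short_circuit_sketch",  # hypothetical DAG name
        start_date=datetime(2022, 8, 1),
        catchup=False,
    ) as dag:
        # When the callable returns a falsy value, downstream tasks are skipped.
        check = ShortCircuitOperator(
            task_id="only_on_weekdays",  # hypothetical task id
            python_callable=is_weekday,
        )
        load = EmptyOperator(task_id="load")
        report = EmptyOperator(task_id="report")

        check >> load >> report
```

In this sketch, a run whose logical date falls on a weekend would mark `load` and `report` as skipped, while weekday runs continue as usual.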

Trending Sources

How to gather requirements for your data project

Start Data Engineering

AUGUST 11, 2022

1. Introduction
2. Gathering requirements
2.1. Identify the end-users
2.2. Help end-users define the requirements
2.3. End-user validation
2.4. Deliver iteratively
2.5. Handling changing requirements/new features
3. Conclusion
4. Further reading
5. Reference

1. Introduction

Data engineers are often caught off guard by undefined end-user assumptions.

Project

Project Data Engineer Data Engineering Data

7 Techniques to Handle Imbalanced Data

KDnuggets

AUGUST 24, 2022

This blog post introduces seven techniques that are commonly applied in domains like intrusion detection or real-time bidding, because the datasets are often extremely imbalanced.

Datasets

Datasets Data Machine Learning

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Business Intelligence

An Exploration Of What Data Automation Can Provide To Data Engineers And Ascend's Journey To Make It A Reality

Data Engineering Podcast

AUGUST 28, 2022

Summary The dream of every engineer is to automate all of their tasks. For data engineers, this is a monumental undertaking. Orchestration engines are one step in that direction, but they are not a complete solution. In this episode Sean Knapp shares his views on what constitutes proper automation and the work that he and his team at Ascend are doing to help make it a reality.

Data Engineer

Data Engineer Data Engineering MongoDB Metadata

Real-Time Wildlife Monitoring with Apache Kafka

Confluent

AUGUST 17, 2022

Confluent Hackathon ’22: Using Apache Kafka, a Raspberry Pi, and a camera, Simon Aubury builds a detection and monitoring system to better understand wildlife population trends over time.

Kafka

Kafka Systems Building

Data Mesh - A Data Movement and Processing Platform @ Netflix

Netflix Tech

AUGUST 1, 2022

Data Mesh - A Data Movement and Processing Platform @ Netflix. By Bo Lei, Guilherme Pires, James Shao, Kasturi Chatterjee, Sujay Jain, Vlad Sydorenko. Background: Real-time processing technologies (a.k.a. stream processing) are one of the key factors that enable Netflix to maintain its leading position in entertaining our users. Our previous-generation streaming pipeline solution, Keystone, has a proven track record of serving several of our key business needs.

Process

Process Transportation Kafka Entertainment

More Trending

Incremental Strategies to Move Your Data Strategy Forward: Remove Obstacles to Unlock Possibilities in Financial Services

Cloudera

AUGUST 30, 2022

Firms are burdened with tech debt and endless regulatory compliance, often leaving innovation last to receive the necessary budgets. Data-fuelled innovation requires a pragmatic strategy. This blog lays out some steps to help you incrementally advance efforts to be a more data-driven, customer-centric organization. Embrace incremental progress. The financial sector’s evolution is unleashing myriad demands on firms operating in the market.

Cloud Storage

Cloud Storage Government Data Governance Retail

Teradata VantageCloud Lake and ClearScape Analytics: Empowering Enterprise Analytical Innovation

Teradata

AUGUST 29, 2022

Teradata's new offerings, VantageCloud Lake and ClearScape Analytics, make it the complete cloud analytics & data platform, with cloud-native deployment and expanded analytics capabilities.

Cloud

Cloud IT Data

What Does ETL Have to Do with Machine Learning?

KDnuggets

AUGUST 15, 2022

In the process of producing effective machine learning algorithms, ETL sits at the base, as the foundation. Let’s go through the steps that show why ETL is important to machine learning.

Machine Learning

Machine Learning Algorithm Process

Alumni Of AirBnB's Early Years Reflect On What They Learned About Building Data Driven Organizations

Data Engineering Podcast

AUGUST 28, 2022

Summary: AirBnB pioneered a number of the organizational practices that have become the goal of modern data teams. Out of that culture a number of successful businesses were created to provide the tools and methods to a broader audience. In this episode several alumni of AirBnB’s formative years who have gone on to found their own companies join the show to reflect on their shared successes, missed opportunities, and lessons learned.

Building

Building MongoDB Scala MySQL

Prepare Now: 2025's Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, and Terrence Sheflin

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

Data

Getting Started with Stream Processing: The Ultimate Guide

Confluent

AUGUST 11, 2022

Whether you’re new to stream processing or evaluating real-time data use cases, learn how stream processing works, its benefits, and the best way to get started.

Process

Process IT Data

Reinforcement Learning for Budget Constrained Recommendations

Netflix Tech

AUGUST 24, 2022

By Ehtsham Elahi with James McInerney, Nathan Kallus, Dario Garcia Garcia and Justin Basilico. Introduction: This writeup is about using reinforcement learning to construct an optimal list of recommendations when the user has a finite time budget to make a decision from the list of recommendations. Working within the time budget introduces an extra resource constraint for the recommender system.

Algorithm

Algorithm Systems Datasets Architecture

Speeding up Queries With Z-Order

Cloudera

AUGUST 4, 2022

Z-order is an ordering for multi-dimensional data, e.g. rows in a database table. Once data is in Z-order it is possible to efficiently search against more columns. This article reveals how Z-ordering works and how one can use it with Apache Impala. In a previous blog post, we demonstrated the power of Parquet page indexes, which can greatly improve the performance of selective queries.

Telecommunication

Telecommunication Algorithm Raw Data Python
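
As a rough illustration of what a Z-order looks like, the sketch below interleaves the bits of two non-negative integer columns into a single sort key (often called a Morton code). This is a generic textbook construction, not Impala's implementation. The function name `interleave_bits`, the 16-bit width, and the toy 4x4 table are assumptions made for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Build a Z-order (Morton) key by alternating the bits of x and y.

    Bit i of x goes to position 2*i, bit i of y goes to position 2*i + 1.
    Assumes 0 <= x, y < 2**bits.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Toy table with two columns, each taking values 0..3.
rows = [(x, y) for x in range(4) for y in range(4)]

# Sorting by the interleaved key keeps rows that are close in BOTH columns
# near each other, instead of favouring only the first sort column.
z_sorted = sorted(rows, key=lambda r: interleave_bits(r[0], r[1]))

for row in z_sorted:
    print(row, interleave_bits(*row))
```

Because neither column dominates the ordering, min/max statistics over consecutive chunks of `z_sorted` tend to be narrow for both columns, which is the property that makes filtering on more than one column efficient.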

Reflections on Data Literacy for Financial Services Leaders

Teradata

AUGUST 17, 2022

In conversations with C-level execs at banks & financial institutions, one theme always crops up. How do we change our operating model to be more agile & customer-focused in a digital-first world?

Banking

Banking Data

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

Manufacturing

The Importance of Experiment Design in Data Science

KDnuggets

AUGUST 12, 2022

Do you feel overwhelmed by the sheer number of ideas that you could try while building a machine learning pipeline? You cannot take the liberty of trying every possible way to arrive at a solution, hence we discuss the importance of experiment design in data science projects.

Data Science

Data Science Designing Machine Learning Data

An Exploration Of The Expectations, Ecosystem, and Realities Of Real-Time Data Applications

Data Engineering Podcast

AUGUST 21, 2022

Summary: Data has permeated every aspect of our lives and the products that we interact with. As a result, end users and customers have come to expect interactions and updates with services and analytics to be fast and up to date. In this episode Shruti Bhat gives her view on the state of the ecosystem for real-time data and the work that she and her team at Rockset are doing to make it easier for engineers to build those experiences.

Lambda Architecture

Lambda Architecture MongoDB Scala MySQL

Data Enrichment in Existing Data Pipelines Using Confluent Cloud

Confluent

AUGUST 16, 2022

Learn how you can integrate data streams into your environment, and enrich data across your existing data pipelines using Confluent Cloud.

Data Pipeline

Data Pipeline Cloud Data

How we shaved 90 minutes off our longest running model

dbt Developer Hub

AUGUST 17, 2022

When running a job that has over 1,700 models, how do you know what a “good” runtime is? If the total process takes 3 hours, is that fantastic or terrible? While there are many possible answers depending on dataset size, complexity of modeling, and historical run times, the crux of the matter is normally “did you hit your SLAs”? However, in the cloud computing world where bills are based on usage, the question is really “did you hit your SLAs and stay within budget”?

Data Warehouse

Data Warehouse Datasets Cloud Coding

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.

Systems

How Universal Data Distribution Accelerates Complex DoD Missions

Cloudera

AUGUST 11, 2022

We’ve come a long way since 1778 when George Washington’s spies gathered and shared military intelligence on the British Army’s tactical operations in occupied New York. But information broadly, and the management of data specifically, is still “the” critical factor for situational awareness, streamlined operations, and a host of other use cases across today’s tech-driven battlefields.

Transportation

Transportation Data Ingestion Architecture Data

Escaping the Prison of Forecasting

Teradata

AUGUST 10, 2022

Retail and CPG businesses are trapped by the disconnect between today’s digital customers and long-established demand forecasting and supply-chain processes. Find out more.

Retail

Retail Process

How Do Data Scientists and Data Engineers Work Together?

KDnuggets

AUGUST 18, 2022

If you’re considering a career in data science, it’s important to understand how these two fields differ, and which one might be more appropriate for someone with your skills and interests.

Data Engineer

Data Engineer Data Engineering Engineering Data Science

Understanding The Role Of The Chief Data Officer

Data Engineering Podcast

AUGUST 21, 2022

Summary The position of Chief Data Officer (CDO) is relatively new in the business world and has not been universally adopted. As a result, not everyone understands what the responsibilities of the role are, when you need one, and how to hire for it. In this episode Tracy Daniels, CDO of Truist, shares her journey into the position, her responsibilities, and her relationship to the data professionals in her organization.

Metadata

Metadata MongoDB MySQL Data Lake

The Ultimate Guide To Data-Driven Construction: Optimize Projects, Reduce Risks, & Boost Innovation

Speaker: Donna Laquidara-Carr, PhD, LEED AP, Industry Insights Research Director at Dodge Construction Network

In today’s construction market, owners, construction managers, and contractors must navigate increasing challenges, from cost management to project delays. Fortunately, digital tools now offer valuable insights to help mitigate these risks. However, the sheer volume of tools and the complexity of leveraging their data effectively can be daunting. That’s where data-driven construction comes in.

Project

Serverless Stream Processing with Apache Kafka, Azure Functions, and ksqlDB

Confluent

AUGUST 10, 2022

Confluent’s ksqlDB product offers powerful, serverless stream processing tools that maximize Kafka on Azure.

Kafka

Kafka Process

August 2022 dbt Update: v1.3 beta, Tech Partner Program, and Coalesce!

dbt Developer Hub

AUGUST 30, 2022

Semantic layer, Python model support, the new dbt Cloud UI and IDE… there’s a lot our product team is excited to share with you at Coalesce in a few weeks. But how these things fit together, because of where dbt Labs is headed, is what I’m most excited to discuss. You’ll hear more in Tristan’s keynote, but this feels like a good time to remind you that Coalesce isn’t just for answering tough questions… it’s for surfacing them.

Programming

Programming Consulting Python BI

How to Use Apache Iceberg in CDP’s Open Lakehouse

Cloudera

AUGUST 8, 2022

In June 2022, Cloudera announced the general availability of Apache Iceberg in the Cloudera Data Platform (CDP). Iceberg is a 100% open table format, developed through the Apache Software Foundation, which helps users avoid vendor lock-in and implement an open lakehouse. The general availability covers Iceberg running within some of the key data services in CDP, including Cloudera Data Warehouse (CDW), Cloudera Data Engineering (CDE), and Cloudera Machine Learning (CML).

Data Warehouse

Data Warehouse BI Machine Learning SQL

An "Everything Data" Approach to Smart Cities

Teradata

AUGUST 3, 2022

Teradata’s approach to the Smart City is an analytics-centric, city-data-ecosystem approach designed to give access across all relevant data. Find out more.

Data

Data Designing Accessible Accessibility

Driving Responsible Innovation: How to Navigate AI Governance & Data Privacy

Speaker: Aindra Misra, Senior Manager, Product Management (Data, ML, and Cloud Infrastructure) at BILL

Join us for an insightful webinar that explores the critical intersection of data privacy and AI governance. In today’s rapidly evolving tech landscape, building robust governance frameworks is essential to fostering innovation while staying compliant with regulations. Our expert speaker, Aindra Misra, will guide you through best practices for ensuring data protection while leveraging AI capabilities.

Government

Data Transformation: Standardization vs Normalization

KDnuggets

AUGUST 12, 2022

Increasing accuracy in your models is often obtained through the first steps of data transformations. This guide explains the difference between the key feature scaling methods of standardization and normalization, and demonstrates when and how to apply each approach.

Data

Data Machine Learning
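
The two scaling methods named above can be compared side by side in a few lines. This is a minimal sketch using only the Python standard library. It assumes "normalization" means min-max scaling to the range [0, 1] and "standardization" means z-scores, and that the input values are not all identical (otherwise both functions would divide by zero). The function names and sample data are made up for illustration.

```python
from statistics import mean, pstdev


def standardize(values: list[float]) -> list[float]:
    """Z-score scaling: subtract the mean, divide by the population std dev."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def normalize(values: list[float]) -> list[float]:
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


data = [2.0, 4.0, 6.0, 8.0]  # hypothetical feature column

print("standardized:", standardize(data))
print("normalized:  ", normalize(data))
```

Standardized values are centred on zero with unit spread and are not bounded, while normalized values always fall between 0 and 1, which is one reason the choice between them depends on the model and on how outliers should be treated.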

Collecting And Retaining Contextual Metadata For Powerful And Effective Data Discovery

Data Engineering Podcast

AUGUST 13, 2022

Summary Data is useless if it isn’t being used, and you can’t use it if you don’t know where it is. Data catalogs were the first solution to this problem, but they are only helpful if you know what you are looking for. In this episode Shinji Kim discusses the challenges of data discovery and how to collect and preserve additional context about each piece of information so that you can find what you need when you don’t even know what you’re looking for yet.

Metadata

Metadata MongoDB Scala MySQL

Getting Started with the KRaft Protocol

Confluent

AUGUST 31, 2022

Kafka Raft lets you use Apache Kafka without ZooKeeper by consolidating metadata management. Here’s how you can learn and do more with KRaft.

Kafka

Kafka Metadata Management

Loan Prediction using Machine Learning Project Source Code

ProjectPro

AUGUST 30, 2022

This article will walk you through how one can start by exploring a loan prediction system as a data science and machine learning problem and build a system/application for loan prediction using your own machine learning project. Loan sanctioning and credit scoring form a multi-billion dollar industry in the US alone. With everyone from young students, entrepreneurs, and multi-million dollar companies turning to banks to seek financial support for their ventures, processing these application

Machine Learning

Machine Learning Coding Project Datasets

What Is Entity Resolution? How It Works & Why It Matters

Sometimes referred to as data matching or fuzzy matching, entity resolution is critical for data quality, analytics, graph visualization and AI. Learn what entity resolution is, why it matters, how it works and its benefits. Advanced entity resolution using AI is crucial because it efficiently and easily solves many of today’s data quality and analytics problems.
