This post focuses on practical data pipelines, with examples from web-scraping real-estate listings, uploading them to S3 with MinIO, Spark and Delta Lake, adding some data science magic with Jupyter Notebooks, ingesting into the Apache Druid data warehouse, visualising dashboards with Superset, and managing everything with Dagster. The goal is to touch on common data engineering challenges while using promising new technologies, tools and frameworks, most of which I have written about in Business Intelligence…
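As a minimal illustration of the storage layer described above, the sketch below writes a small scraped dataset to Delta Lake on a MinIO bucket through Spark's S3A connector. The endpoint, credentials, bucket and table path are placeholders, and the required delta-spark and hadoop-aws packages must match your Spark version.

```python
from pyspark.sql import SparkSession

# Assumed local MinIO endpoint, credentials and bucket -- replace with your own.
spark = (
    SparkSession.builder.appName("real-estate-ingest")
    # Enable the Delta Lake format (delta-spark must be on the classpath).
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Point the S3A filesystem at MinIO instead of AWS S3.
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# A couple of scraped listings stand in for the real web-scraping output.
listings = spark.createDataFrame(
    [("zurich", 1250000, 4.5), ("bern", 890000, 3.5)],
    ["city", "price", "rooms"],
)

# Append the batch to a Delta table in the MinIO bucket.
listings.write.format("delta").mode("append").save("s3a://real-estate/listings")
```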
Contents: Introduction; Testing your data pipeline (1. End-to-end system testing, 2. Data quality testing, 3. Monitoring and alerting, 4. Unit and contract testing); Conclusion; Further reading. Testing data pipelines is different from testing other applications, like a website backend.
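To make the data quality idea above concrete, here is a small, hedged sketch of a check that could run as a pipeline step; the table shape, column names and thresholds are invented for illustration and are not from the article.

```python
import pandas as pd


def check_orders_quality(orders: pd.DataFrame) -> None:
    """Fail the pipeline run when basic data quality expectations are violated."""
    # Completeness: the primary key must never be null.
    assert orders["order_id"].notna().all(), "order_id contains nulls"

    # Uniqueness: no duplicate orders.
    assert orders["order_id"].is_unique, "duplicate order_id values found"

    # Validity: amounts must be non-negative.
    assert (orders["amount"] >= 0).all(), "negative order amounts found"

    # Volume: a near-empty batch usually signals an upstream failure.
    assert len(orders) > 100, f"suspiciously small batch: {len(orders)} rows"


if __name__ == "__main__":
    sample = pd.DataFrame({"order_id": range(1, 201), "amount": [9.99] * 200})
    check_orders_quality(sample)  # raises AssertionError on bad data
```

The same assertions can also live in a unit test, which covers the unit and contract testing item in the list above.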
This list of best data science companies aims to go beyond the usual and expected. Some great and perhaps underrated options to get a job as a data scientist.
Introduction. The Fulfillment Platform is a foundational Uber domain that enables the rapid scaling of new verticals. The platform handles billions of database transactions each day, ranging from user actions (e.g., a driver starting a trip) and system actions … The post Building Uber’s Fulfillment Platform for Planet-Scale using Google Cloud Spanner appeared first on Uber Engineering Blog.
With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You'll learn how to: understand the building blocks of DAGs, combine them into complex pipelines, and schedule your DAG to run exactly when you want it to; write DAGs that adapt to your data at runtime and set up alerts and notifications; scale your…
I’m pleased to announce the release of Apache Kafka 3.0 on behalf of the Apache Kafka® community. Apache Kafka 3.0 is a major release in more ways than one. Apache […].
Today marks the beginning of an exciting new chapter for Cloudera. Cloudera will become a private company with the flexibility and resources to accelerate product innovation, cloud transformation and customer growth. Cloudera will benefit from the operating capabilities, capital support and expertise of Clayton, Dubilier & Rice (CD&R) and KKR – two of the most experienced and successful global investment firms in the world recognized for supporting the growth strategies of the businesses
Humans have been trying to make machines chat for decades. Alan Turing considered computers’ ability to generate natural speech a proof of their ability to think. Today, we converse with virtual companions all the time. But despite years of research and innovation, their unnatural responses remind us that no, we’re not yet at the HAL 9000-level of speech sophistication.
By Alok Tiagi, Hariharan Ananthakrishnan, Ivan Porto Carrero and Keerti Lakshminarayan. Netflix has developed a network observability sidecar called Flow Exporter that uses eBPF tracepoints to capture TCP flows in near real time. At much less than 1% of CPU and memory on the instance, this highly performant sidecar provides flow data at scale for network insight.
Summary: The data mesh is a thesis that was presented to address the technical and organizational challenges that businesses face in managing their analytical workflows at scale. Zhamak Dehghani introduced the concepts behind this architectural pattern in 2019, and since then it has been gaining popularity, with many companies adopting some version of it in their systems.
Speaker: Ben Epstein, Stealth Founder & CTO | Tony Karrer, Founder & CTO, Aggregage
When tasked with building a fundamentally new product line with deeper insights than previously achievable for a high-value client, Ben Epstein and his team faced a significant challenge: how to harness LLMs to produce consistent, high-accuracy outputs at scale. In this new session, Ben will share how he and his team engineered a system (based on proven software engineering approaches) that employs reproducible test variations (via temperature 0 and fixed seeds) and enables non-LLM evaluation m…
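As a rough sketch of the reproducibility idea mentioned above (temperature 0 plus a fixed seed), here is what that could look like with the OpenAI Python client; the model name, prompt and seed are placeholders, and this is not necessarily how the speakers implement it.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def run_test_variation(prompt: str, seed: int = 42) -> str:
    """Call the model as deterministically as possible so test runs are comparable.

    temperature=0 removes sampling randomness; a fixed seed asks the API to keep
    the remaining nondeterminism as stable as it can (best effort, not guaranteed).
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0,
        seed=seed,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Two runs with the same prompt and seed should produce (near-)identical text,
# which makes simple non-LLM checks (exact match, regex, schema validation) viable.
first = run_test_variation("Extract the invoice total from: 'Total due: $42.10'")
second = run_test_variation("Extract the invoice total from: 'Total due: $42.10'")
print(first == second)
```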
We have discussed Linked Service parameterization through the UI in a previous post. But not all Linked Service types support parameterization using the UI. In this post, we will discuss the Linked Services that can't be parameterized using the UI (i.e., they don't have any option to add parameters). If you are familiar with Azure services, you might know that Linked Services, like any other Azure artefact, have corresponding underlying JSON code.
Contents: What is an idempotent function; Pre-requisites; Why idempotency matters; Making your data pipeline idempotent; Conclusion; Further reading; References. What is an idempotent function? "Idempotence is the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application" (Wikipedia), defined as f(f(x)) = f(x). In the data engineering context, this can come to mean that running a data pipeline…
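A minimal sketch of what an idempotent load step might look like, assuming a daily partition keyed by run date: rerunning the same date overwrites that partition instead of duplicating rows. The table, columns and SQLite backend are invented for illustration.

```python
import sqlite3
from datetime import date


def load_daily_sales(conn: sqlite3.Connection, run_date: date, rows: list) -> None:
    """Idempotent load: delete-then-insert the partition for run_date.

    Running this twice with the same inputs leaves the table in the same
    state as running it once -- f(f(x)) = f(x).
    """
    with conn:  # one transaction, so a failed rerun cannot leave half a partition behind
        conn.execute("DELETE FROM daily_sales WHERE sale_date = ?", (run_date.isoformat(),))
        conn.executemany(
            "INSERT INTO daily_sales (sale_date, item, amount) VALUES (?, ?, ?)",
            [(run_date.isoformat(), item, amount) for item, amount in rows],
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE daily_sales (sale_date TEXT, item TEXT, amount REAL)")
    batch = [("widget", 9.99), ("gadget", 24.50)]
    load_daily_sales(conn, date(2021, 10, 1), batch)
    load_daily_sales(conn, date(2021, 10, 1), batch)  # rerun: still exactly 2 rows
    print(conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0])  # -> 2
```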
Feature selection methodologies go beyond filter, wrapper and embedded methods. In this article, I describe 3 alternative algorithms to select predictive features based on a feature importance score.
Uber recently launched a new capability: Ads on UberEats. With this new ability came new challenges that needed to be solved at Uber, such as systems for ad auctions, bidding, attribution, reporting, and more. This article focuses on how we … The post Real-Time Exactly-Once Ad Event Processing with Apache Flink, Kafka, and Pinot appeared first on Uber Engineering Blog.
In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!
The full inventory of three online Kafka Summits in 2021 is now complete. Kafka Summit Americas wrapped just yesterday. Being a part of the event team and the Program Committee, […].
Introduction. In the previous blog post in this series, we walked through the steps for leveraging Deep Learning in your Cloudera Machine Learning (CML) projects. This year, we expanded our partnership with NVIDIA, enabling your data teams to dramatically speed up compute processes for data engineering and data science workloads with no code changes using RAPIDS AI.
Airflow Timetable. This new concept introduced in Airflow 2.2 is going to change your way of scheduling your data pipelines. Or I would say, you’re finally going to have all the freedom and flexibility you ever dreamt of for scheduling your DAGs. What if you want to run your DAG for specific schedule intervals with “holes” in between?
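To illustrate the kind of "holes" mentioned above, here is a hedged sketch of a custom timetable that runs once per weekday and skips weekends entirely, registered through a plugin. It follows the Airflow 2.2 Timetable interface, but the class names and the weekday rule are invented for this example rather than taken from the article.

```python
from datetime import timedelta
from typing import Optional

from pendulum import DateTime

from airflow.plugins_manager import AirflowPlugin
from airflow.timetables.base import DagRunInfo, DataInterval, TimeRestriction, Timetable


class WeekdayOnlyTimetable(Timetable):
    """One run per weekday; Saturdays and Sundays are simply skipped."""

    def infer_manual_data_interval(self, *, run_after: DateTime) -> DataInterval:
        # For manual triggers, cover the 24 hours before the trigger time.
        return DataInterval(start=run_after - timedelta(days=1), end=run_after)

    def next_dagrun_info(
        self,
        *,
        last_automated_data_interval: Optional[DataInterval],
        restriction: TimeRestriction,
    ) -> Optional[DagRunInfo]:
        if last_automated_data_interval is not None:
            next_start = last_automated_data_interval.end
        else:
            if restriction.earliest is None:
                return None  # no start_date, nothing to schedule
            next_start = restriction.earliest
        next_start = next_start.start_of("day")
        # The "hole" in the schedule: jump over Saturday (5) and Sunday (6).
        while next_start.weekday() >= 5:
            next_start = next_start + timedelta(days=1)
        if restriction.latest is not None and next_start > restriction.latest:
            return None
        return DagRunInfo.interval(start=next_start, end=next_start + timedelta(days=1))


class WeekdayTimetablePlugin(AirflowPlugin):
    name = "weekday_timetable_plugin"
    timetables = [WeekdayOnlyTimetable]
```

A DAG would then be declared with timetable=WeekdayOnlyTimetable() instead of a cron-style schedule_interval.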
Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.
By leveraging data to create a 360-degree view of their citizenry, government agencies can create better experiences and improve outcomes, such as closing the tax gap or improving quality of care.
Taking notes helps you remember things, teaches you to express yourself, lets you brainstorm your thoughts, research a topic, and so much more. I have taken notes all my life. Maybe it's because I'm Swiss; they say we are well organised. I wrote in OneNote for 10+ years. I have notebooks for my bachelor studies and every workplace I have worked at.
Summary: One of the perennial challenges of data analytics is having a consistent set of definitions, along with a flexible and performant API endpoint for querying them. In this episode Artom Keydunov and Pavel Tiunov share their work on Cube.js and the various ways that it is being used in the open source community.
In some scenarios in Azure Data Factory, we may want to intentionally stop the execution of the pipeline. An example could be when we want to check the existence of a file or folder using the Get Metadata activity. We may want to fail the pipeline if the file/folder does not exist. To achieve this, we can use the Fail Activity. Invoking the Fail Activity ensures that the pipeline execution will be stopped.
Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?
Contents: 1. Introduction; 2. Requirements; 3. Components; 4. Choosing tools (4.1 Requirement x Component framework, 4.2 Filters); 5. Conclusion; 6. Further reading. If you are building data pipelines from the ground up, the number of available data engineering tools to choose from can be overwhelming. If you are thinking "Most of the tools seem to be doing the same/similar thing, which one should I choose?"…
Data Science models come in different flavors and techniques; luckily, most advanced models are based on a couple of fundamentals. Which models should you learn when you want to begin a career as a Data Scientist? This post brings you 6 models that are widely used in the industry, either in standalone form or as a building block for other advanced techniques.
Uber delivers efficient and reliable transportation across the global marketplace, which is powered by hundreds of services, machine learning models, and tens of thousands of datasets. While growing rapidly, we’re also committed to maintaining data quality, as it can greatly … The post How Uber Achieves Operational Excellence in the Data Quality Experience appeared first on Uber Engineering Blog.
Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali
As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.
At the heart of Apache Kafka® sits the log—a simple data structure that uses sequential operations that work symbiotically with the underlying hardware. Efficient disk buffering and CPU cache usage, […].
Apache YuniKorn (Incubating) has just released 0.10.0 (release announcement). As part of this release, a new feature called Gang Scheduling has become available. By leveraging the Gang Scheduling feature, scheduling Spark jobs on Kubernetes becomes more efficient. What is Apache YuniKorn (Incubating)? Apache YuniKorn (Incubating) is a new Apache incubator project that offers rich scheduling capabilities on Kubernetes.
By default, your tasks get executed once all the parent tasks succeed. This behaviour is what you expect in general. But what if you want something more complex? What if you would like to execute a task as soon as one of its parents succeeds? Or maybe you would like to execute a different set of tasks if a task fails? Or act differently depending on whether a task succeeds, fails or even gets skipped?
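These behaviours map to Airflow trigger rules. Below is a hedged sketch of a toy DAG with invented task names, showing one_success (run as soon as any parent succeeds) and all_failed (run a fallback branch only when every parent fails); other rules such as all_done or none_failed cover the remaining cases.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="trigger_rule_demo",
    start_date=datetime(2021, 10, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    extract_a = DummyOperator(task_id="extract_source_a")
    extract_b = DummyOperator(task_id="extract_source_b")

    # Runs as soon as ONE of its parents succeeds, not once all of them do.
    merge = DummyOperator(task_id="merge", trigger_rule="one_success")

    # Fallback branch: runs only if every upstream task failed.
    alert_on_failure = DummyOperator(task_id="alert_on_failure", trigger_rule="all_failed")

    [extract_a, extract_b] >> merge
    [extract_a, extract_b] >> alert_on_failure
```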
Just an illustration, not the truth, and you certainly can do it with other technologies. TL;DR: after setting up and organizing the teams, we describe 4 topics to make data mesh a reality: the self-serve platform based on a serverless philosophy (life is too short to do provisioning); the building of data products as code (we are building data workflows, not data pipelines); the promotion of data domains, where the metadata on the data life cycle is as important as your data. The old dat…
Speaker: Nikhil Joshi, Founder & President of Snic Solutions
Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.