Top Data Engineering Digest Data and Data Pipeline Content for May, 2021

May, 2021

How to make data pipelines idempotent

Start Data Engineering

MAY 13, 2021

Contents: What is an idempotent function; Pre-requisites; Why idempotency matters; Making your data pipeline idempotent; Conclusion; Further reading; References

What is an idempotent function

“Idempotence is the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application” - wikipedia

Defined as f(f(x)) = f(x)

In the data engineering context, this can come to mean that: running a data pipeline
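The definition f(f(x)) = f(x) can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions, not the approach taken in the linked post: the in-memory `Warehouse` dictionary, the `normalize` function, and the two load functions are all hypothetical names. It contrasts a load step that appends on every rerun with one that overwrites the partition for its run date.

```python
from typing import Dict, List

# Hypothetical in-memory "warehouse": partition key (run date) -> rows.
Warehouse = Dict[str, List[dict]]


def normalize(name: str) -> str:
    """An idempotent function: normalize(normalize(x)) == normalize(x)."""
    return name.strip().lower()


def load_append(warehouse: Warehouse, run_date: str, rows: List[dict]) -> None:
    """Non-idempotent load: every rerun appends the same rows again."""
    warehouse.setdefault(run_date, []).extend(rows)


def load_overwrite(warehouse: Warehouse, run_date: str, rows: List[dict]) -> None:
    """Idempotent load: replace the whole partition for this run date."""
    warehouse[run_date] = list(rows)


rows = [{"user": normalize("  Alice "), "amount": 10}]

appended: Warehouse = {}
overwritten: Warehouse = {}
for _ in range(3):  # simulate rerunning the same pipeline run three times
    load_append(appended, "2021-05-13", rows)
    load_overwrite(overwritten, "2021-05-13", rows)

print(len(appended["2021-05-13"]))     # 3 rows: duplicates after reruns
print(len(overwritten["2021-05-13"]))  # 1 row: same result as a single run
```

Overwriting a partition keyed by run date is one common way to get rerun-safe loads; whether it fits depends on how the target table is partitioned and who else writes to it.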

Data Pipeline

Data Pipeline Computer Science Data Data Engineering

The Architecture of Uber’s API gateway

Uber Engineering

MAY 19, 2021

API gateways have become an integral part of microservices architecture in recent years. An API gateway provides a single point of entry for all our apps and provides an interface to access data, logic, or functionality from back-end microservices. It also … The post The Architecture of Uber’s API gateway appeared first on Uber Engineering Blog.
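The "single point of entry" idea can be sketched as a small prefix router. This is a toy Python illustration and makes no claim about how Uber's gateway is built: the `Gateway` class, the route prefixes, and the two back-end handlers are hypothetical.

```python
from typing import Callable, Dict, Tuple

Handler = Callable[[str], str]


def rides_service(path: str) -> str:
    """Hypothetical back-end microservice for ride requests."""
    return f"rides handled {path}"


def payments_service(path: str) -> str:
    """Hypothetical back-end microservice for payment requests."""
    return f"payments handled {path}"


class Gateway:
    """Single entry point that forwards each request to one back end."""

    def __init__(self, routes: Dict[str, Handler]) -> None:
        # Check longer prefixes first so the most specific route wins.
        self._routes = sorted(
            routes.items(), key=lambda item: len(item[0]), reverse=True
        )

    def handle(self, path: str) -> Tuple[int, str]:
        for prefix, handler in self._routes:
            if path.startswith(prefix):
                return 200, handler(path)
        return 404, "no route"


gateway = Gateway({"/rides": rides_service, "/payments": payments_service})
print(gateway.handle("/rides/42"))
print(gateway.handle("/unknown"))
```

A production gateway would typically also handle concerns such as authentication, rate limiting, and protocol translation; the sketch covers routing only.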

Architecture

Architecture Engineering Accessible Accessibility

Webinars

Agent Tooling: Connecting AI to Your Tools, Systems & Data

How to Modernize Manufacturing Without Losing Control

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Trending Sources

NVIDIA RAPIDS in Cloudera Machine Learning

Cloudera

MAY 19, 2021

Introduction. In the previous blog post in this series, we walked through the steps for leveraging Deep Learning in your Cloudera Machine Learning (CML) projects. This year, we expanded our partnership with NVIDIA, enabling your data teams to dramatically speed up compute processes for data engineering and data science workloads with no code changes using RAPIDS AI.

Machine Learning

Machine Learning Data Science Datasets Raw Data

Introducing Confluent for Kubernetes

Confluent

MAY 12, 2021

We are excited to announce that Confluent for Kubernetes is generally available! Today, we are enabling our customers to realize many of the benefits of our cloud service with the […].

Cloud

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Speaker: Tamara Fingerlin, Developer Advocate

Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.

Data

Paving The Road For Fast Analytics On Distributed Clouds With The Yellowbrick Data Warehouse

Data Engineering Podcast

MAY 27, 2021

Summary The data warehouse has become the focal point of the modern data platform. With increased usage of data across businesses, and a diversity of locations and environments where data needs to be managed, the warehouse engine needs to be fast and easy to manage. Yellowbrick is a data warehouse platform that was built from the ground up for speed, and can work across clouds and all the way to the edge.

Data Warehouse

Data Warehouse Cloud PostgreSQL Kafka

What’s the Secret Recipe for DataOps?

DataKitchen

MAY 3, 2021

Catalog & Cocktails podcast hosts Tim Gasper & Juan Sequeda of data.world interview DataKitchen CEO Chris Bergh on how to create the right DataOps culture & measuring the value of your DataOps strategy. The post What’s the Secret Recipe for DataOps? first appeared on DataKitchen.

Twelve Thoughts About the Data Mesh

Teradata

MAY 15, 2021

The concept of Data Mesh is abuzz in the industry right now. Find out why we're so enthusiastic about it.

Data

Data IT

More Trending

My (Seemingly) Random Walk to Netflix

Netflix Tech

MAY 28, 2021

Part of our series on who works in Analytics at Netflix, and what the role entails. By Sean Barnes, Studio Production Data Science & Engineering. I am going to tell you a story about a person that works for Netflix. That person grew up dreaming of working in the entertainment industry. They attended the University of Southern California, double majored in data science and television & film production, and graduated summa cum laude.

Entertainment

Entertainment Healthcare Data Science Finance

Spark on Kubernetes – Gang Scheduling with YuniKorn

Cloudera

MAY 5, 2021

Apache YuniKorn (Incubating) has just released 0.10.0 (release announcement). As part of this release, a new feature called Gang Scheduling has become available. By leveraging the Gang Scheduling feature, Spark job scheduling on Kubernetes becomes more efficient. What is Apache YuniKorn (Incubating)? Apache YuniKorn (Incubating) is a new Apache incubator project that offers rich scheduling capabilities on Kubernetes.

Metadata

Metadata Algorithm Big Data Machine Learning

Confluent CLI Launches Exciting New Features and an Intuitive UI

Confluent

MAY 18, 2021

With so many technologies in the modern development ecosystem, a common complaint is having to go through the mental gymnastics of adopting new products and keeping up with ever-expanding feature […].

Technology

Technology Cloud

Easily Build Advanced Similarity Search With The Pinecone Vector Database

Data Engineering Podcast

MAY 25, 2021

Summary Machine learning models use vectors as the natural mechanism for representing their internal state. The problem is that in order for the models to integrate with external systems their internal state has to be translated into a lower dimension. To eliminate this impedance mismatch Edo Liberty founded Pinecone to build a database that works natively with vectors.

Database

Database Building Data Warehouse Machine Learning

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage

There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.

Systems

Data Observability and Monitoring with DataOps

DataKitchen

MAY 10, 2021

Data errors impact decision-making. When analytics and dashboards are inaccurate, business leaders may not be able to solve problems and pursue opportunities. Data errors infringe on work-life balance. They cause people to work long hours at the expense of personal and family time. Data errors also affect careers. If you have been in the data profession for any length of time, you probably know what it means to face a mob of stakeholders who are angry about inaccurate or late analytics.

Manufacturing

Manufacturing Data Pipeline Data Data Analytics

Big Data Analytics: How It Works, Tools, and Real-Life Applications

AltexSoft

MAY 14, 2021

Big Data enjoys the hype around it and for a reason. But the understanding of the essence of Big Data and ways to analyze it is still blurred. The truth is, there’s more to this term than just the size of information generated. Not only does Big Data apply to the huge volumes of continuously growing data that come in different formats, but it also refers to the range of processes, tools, and approaches used to gain insights from that data.

Big Data

Big Data Data Analytics IT NoSQL

The Ethics of AI Comes Down to Conscious Decisions

Cloudera

MAY 27, 2021

This blog post was written by Pedro Pereira as a guest author for Cloudera. Right now, someone somewhere is writing the next fake news story or editing a deepfake video. An authoritarian regime is manipulating an artificial intelligence (AI) system to spy on technology users. No matter how good the intentions behind the development of a technology, someone is bound to corrupt and manipulate it.

Algorithm

Algorithm Media Consulting Machine Learning

How to Modernize Manufacturing Without Losing Control

Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives

Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri

Manufacturing

Kafka Summit Europe 2021 Recap

Confluent

MAY 17, 2021

And that’s a wrap on Kafka Summit Europe 2021, the first of three global Kafka Summits this year. We’ve seen 17,000 registrations from over 7,000 companies and 137 different countries. […].

Kafka

A Holistic Approach To Data Governance Through Self Reflection At Collibra

Data Engineering Podcast

MAY 20, 2021

Summary Data governance is a phrase that means many different things to many different people. This is because it is actually a concept that encompasses the entire lifecycle of data, across all of the people in an organization who interact with it. Stijn Christiaens co-founded Collibra with the goal of addressing the wide variety of technological aspects that are necessary to realize such an important and expansive process.

Data Governance

Data Governance Government Data Warehouse Data

Netflix Drive

Netflix Tech

MAY 5, 2021

A file and folder interface for Netflix Cloud Services. Written by Vikram Krishnamurthy, Kishore Kasi, Abhishek Kapatkar, and Tejas Chopra. In this post, we are introducing Netflix Drive, a cloud drive for media assets, and providing a high-level overview of some of its features and interfaces. We intend this to be the first post in a series of posts covering Netflix Drive.

Metadata

Metadata Bytes Media Cloud Storage

How DataOps Enables a Data Fabric

DataKitchen

MAY 4, 2021

The post How DataOps Enables a Data Fabric first appeared on DataKitchen.

Data

The Ultimate Guide to Apache Airflow DAGS

With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You’ll learn how to: understand the building blocks of DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to; write DAGs that adapt to your data at runtime and set up alerts and notifications; scale you

Data Engineer

Beyond Resilience-The Next Generation of Supply Chain

Teradata

MAY 11, 2021

After the shock of COVID exposed the brittle nature of many global supply chains, focus has shifted to resilience, a necessary consideration but not the only one.

Automating CDP Private Cloud Installations with Ansible

Cloudera

MAY 10, 2021

The introduction of CDP Public Cloud has dramatically reduced the time in which you can be up and running with Cloudera’s latest technologies, be it with containerised Data Warehouse, Machine Learning, Operational Database or Data Engineering experiences or the multi-purpose VM-based Data Hub style of deployment. In CDP Private Cloud, the introduction of Cloudera Data Warehouse and Cloudera Machine Learning Experiences on RedHat OpenShift Kubernetes clusters means that we can deploy new

Cloud

Cloud Consulting Data Warehouse Machine Learning

Confluent Update Regarding Codecov Incident

Confluent

MAY 5, 2021

Our team was recently notified of unauthorized read-only access to Confluent’s GitHub account stemming from the recent Codecov incident (more information here). The security of our customers and their data […].

Accessible

Accessible Accessibility Data

Unlocking The Power of Data Lineage In Your Platform with OpenLineage

Data Engineering Podcast

MAY 18, 2021

Summary Data lineage is the common thread that ties together all of your data pipelines, workflows, and systems. In order to get a holistic understanding of your data quality, where errors are occurring, or how a report was constructed you need to track the lineage of the data from beginning to end. The complicating factor is that every framework, platform, and product has its own concepts of how to store, represent, and expose that information.

Metadata

Metadata Kafka Data Warehouse Hadoop

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

Cloud

Achieving observability in async workflows

Netflix Tech

MAY 14, 2021

Written by Colby Callahan, Megha Manohara, and Mike Azar. Managing and operating asynchronous workflows can be difficult without the proper tools and architecture that put observability, debugging, and tracing at the forefront. Imagine getting paged outside normal work hours: users are having trouble with the application you’re responsible for, and you start diving into logs.

Java

Java Programming Language Media Architecture

Asynchronous APIs in CRM and marketing tools

Grouparoo

MAY 27, 2021

When integrating with Destinations, there are generally two main approaches made available by API providers: single or batched. With the "single" approach, one API request usually affects a single profile in the destination. The "batched" approach, which you can read more about here, allows you to affect multiple profiles in a single API request.
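The difference between the two approaches shows up mainly in the number of API requests. The Python sketch below counts requests for each approach; it is illustrative only and does not reflect Grouparoo's actual API. The `Profile` type, the `send_one` and `send_many` callables, and the batch size are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Profile = Dict[str, str]


def chunked(items: List[Profile], size: int) -> Iterable[List[Profile]]:
    """Yield consecutive slices of at most `size` profiles."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def sync_single(profiles: List[Profile],
                send_one: Callable[[Profile], None]) -> int:
    """'Single' approach: one API request per profile."""
    calls = 0
    for profile in profiles:
        send_one(profile)
        calls += 1
    return calls


def sync_batched(profiles: List[Profile],
                 send_many: Callable[[List[Profile]], None],
                 batch_size: int) -> int:
    """'Batched' approach: one API request per group of profiles."""
    calls = 0
    for batch in chunked(profiles, batch_size):
        send_many(batch)
        calls += 1
    return calls


# Fake destination: record what each approach would have sent.
profiles = [{"email": f"user{i}@example.com"} for i in range(5)]
sent_single: List[Profile] = []
sent_batched: List[Profile] = []

single_calls = sync_single(profiles, sent_single.append)
batched_calls = sync_batched(profiles, sent_batched.extend, batch_size=2)

print(single_calls)   # 5 requests for 5 profiles
print(batched_calls)  # 3 requests: batches of 2, 2 and 1
```

Both approaches deliver the same profiles; batching reduces request count at the cost of having to handle partial failures within a batch.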

Process

Process Systems Designing IT

Open Banking is Transforming Financial Services and Chipping Away the Relevance of Traditional Banks

Teradata

MAY 10, 2021

The sharing of client data in an Open Banking marketplace challenges banks to adopt a customer-centric approach & collaborate with new players to re-define their relevance.

Banking

Banking Data

Driving Agility and Scalability through Smart Data

Cloudera

MAY 3, 2021

Last year presented business and organizational challenges that hadn’t been seen in a century and the troubling fact is that the challenges applied pains and gains unequally across industry segments. While brick-and-mortar retail was crushed a year ago with mandated store closures, digital commerce retailers realized ten years of digital sales penetration in only three months.

Scala

Scala Retail Java SQL

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Business Intelligence

Announcing ksqlDB 0.18.0

Confluent

MAY 26, 2021

We’re pleased to announce ksqlDB 0.18.0! This release includes pull queries on table-table joins and support for variable substitution in the Java client and ksqlDB’s migration tool. We’ll step through […].

Java

Java Process

Building Your Data Warehouse On Top Of PostgreSQL

Data Engineering Podcast

MAY 13, 2021

Summary There is a lot of attention on the database market and cloud data warehouses. While they provide a measure of convenience, they also require you to sacrifice a certain amount of control over your data. If you want to build a warehouse that gives you both control and flexibility then you might consider building on top of the venerable PostgreSQL project.

PostgreSQL

PostgreSQL Data Warehouse Building MySQL

Data Transformations Using the Data Build Tool

Ripple Engineering

MAY 27, 2021

At Ripple, we are moving towards building complex business models out of raw data. To do this successfully, we need to automate our historically manual processes. Even with a digital-first approach, many of our internal processes were done by hand, making them great candidates to be automated. A prime example of this was the process of managing our data transformation workflows.

Building

Building Raw Data SQL Data

What is Azure Data Factory? A beginner’s guide to ADF

A Cloud Guru: Data Engineering

MAY 26, 2021

With Microsoft Build 2021 currently underway, what better time to take a beginner-friendly deep dive into Azure Data Factory? In this post, we’ll talk about what Azure Data Factory is, how to get started using it, and what you might use it for. Keep up with all things Azure in the ACG original series Azure […] The post What is Azure Data Factory?

Data

Data Cloud Building IT

Apache Airflow® Best Practices: DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

Data
