Top Data Engineering Digest Data Data Pipeline Content for May, 2021

May, 2021

How to make data pipelines idempotent

Start Data Engineering

MAY 13, 2021

What is an idempotent function? “Idempotence is the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application” (Wikipedia). Defined as f(f(x)) = f(x). In the data engineering context, this can come to mean that: running a data pipeline
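The definition f(f(x)) = f(x) can be illustrated with a minimal Python sketch. The excerpt is cut off before it spells out the pipeline case, so the reading used here (re-running a load with the same inputs leaves the output unchanged) is an assumption, and `normalize`, `load_partition`, and `append_rows` are hypothetical names invented for illustration, not taken from the article.

```python
from typing import Dict, List


def normalize(name: str) -> str:
    """Idempotent: applying it twice gives the same result as applying it once."""
    return name.strip().lower()


def load_partition(store: Dict[str, List[str]], run_date: str, rows: List[str]) -> None:
    """Hypothetical loader: overwrite the partition for run_date.

    Re-running the same load replaces the same data, so the store is unchanged.
    """
    store[run_date] = list(rows)


def append_rows(store: Dict[str, List[str]], run_date: str, rows: List[str]) -> None:
    """Non-idempotent counterpart: every re-run appends the rows again."""
    store.setdefault(run_date, []).extend(rows)


if __name__ == "__main__":
    # f(f(x)) == f(x)
    print(normalize(normalize("  Alice ")) == normalize("  Alice "))

    overwrite_store: Dict[str, List[str]] = {}
    append_store: Dict[str, List[str]] = {}
    for _ in range(2):  # simulate running the same pipeline twice
        load_partition(overwrite_store, "2021-05-13", ["a", "b"])
        append_rows(append_store, "2021-05-13", ["a", "b"])
    print(overwrite_store)  # one copy of the rows
    print(append_store)     # rows duplicated by the second run
```

The overwrite-by-partition choice is only one way to get this property; the point of the sketch is the contrast between a re-runnable load and an append that duplicates data.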

Data Pipeline

Data Pipeline Computer Science Data Data Engineer

The Architecture of Uber’s API gateway

Uber Engineering

MAY 19, 2021

API gateways have become an integral part of microservices architecture in recent years. An API gateway provides a single point of entry for all our apps and provides an interface to access data, logic, or functionality from back-end microservices. It also …
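The "single point of entry" idea can be sketched in a few lines of Python. This is a toy illustration only and says nothing about how Uber's gateway is built: the `ApiGateway` class, the path prefixes, and the two back-end functions are all hypothetical.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def rides_service(path: str) -> str:
    """Hypothetical back-end microservice."""
    return f"rides-service handled {path}"


def payments_service(path: str) -> str:
    """Hypothetical back-end microservice."""
    return f"payments-service handled {path}"


class ApiGateway:
    """Toy gateway: one entry point that forwards requests by path prefix."""

    def __init__(self, routes: Dict[str, Handler]) -> None:
        self.routes = routes

    def handle(self, path: str) -> str:
        # Forward to the first back end whose prefix matches the request path.
        for prefix, backend in self.routes.items():
            if path.startswith(prefix):
                return backend(path)
        return f"404 no backend for {path}"


# Clients only ever talk to the gateway, never to a back end directly.
gateway = ApiGateway({"/rides": rides_service, "/payments": payments_service})

if __name__ == "__main__":
    print(gateway.handle("/rides/42"))
    print(gateway.handle("/unknown"))
```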

Architecture

Architecture Engineering Accessible Accessibility

Webinars

Prepare Now: 2025's Must-Know Trends For Product And Data Leaders

Apache Airflow®: The Ultimate Guide to DAG Writing

Trending Sources

Introducing Confluent for Kubernetes

Confluent

MAY 12, 2021

We are excited to announce that Confluent for Kubernetes is generally available! Today, we are enabling our customers to realize many of the benefits of our cloud service with the […].

Cloud

Spark on Kubernetes – Gang Scheduling with YuniKorn

Cloudera

MAY 5, 2021

Apache YuniKorn (Incubating) has just released 0.10.0 (release announcement). As part of this release, a new feature called Gang Scheduling has become available. By leveraging the Gang Scheduling feature, scheduling Spark jobs on Kubernetes becomes more efficient. What is Apache YuniKorn (Incubating)? Apache YuniKorn (Incubating) is a new Apache incubator project that offers rich scheduling capabilities on Kubernetes.

Metadata

Metadata Algorithm Big Data Machine Learning

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.

Architecture

Paving The Road For Fast Analytics On Distributed Clouds With The Yellowbrick Data Warehouse

Data Engineering Podcast

MAY 27, 2021

Summary The data warehouse has become the focal point of the modern data platform. With increased usage of data across businesses, and a diversity of locations and environments where data needs to be managed, the warehouse engine needs to be fast and easy to manage. Yellowbrick is a data warehouse platform that was built from the ground up for speed, and can work across clouds and all the way to the edge.

Data Warehouse

Data Warehouse Cloud PostgreSQL Kafka

What’s the Secret Recipe for DataOps?

DataKitchen

MAY 3, 2021

Catalog & Cocktails podcast hosts Tim Gasper & Juan Sequeda of data.world interview DataKitchen CEO Chris Bergh on how to create the right DataOps culture and how to measure the value of your DataOps strategy.

My (Seemingly) Random Walk to Netflix

Netflix Tech

MAY 28, 2021

Part of our series on who works in Analytics at Netflix, and what the role entails. By Sean Barnes, Studio Production Data Science & Engineering. I am going to tell you a story about a person that works for Netflix. That person grew up dreaming of working in the entertainment industry. They attended the University of Southern California, double majored in data science and television & film production, and graduated summa cum laude.

Entertainment

Entertainment Healthcare Data Science Finance

More Trending

Big Data Analytics: How It Works, Tools, and Real-Life Applications

AltexSoft

MAY 14, 2021

Big Data enjoys the hype around it and for a reason. But the understanding of the essence of Big Data and ways to analyze it is still blurred. The truth is, there’s more to this term than just the size of information generated. Not only does Big Data apply to the huge volumes of continuously growing data that come in different formats, but it also refers to the range of processes, tools, and approaches used to gain insights from that data.

Big Data

Big Data Data Analytics IT NoSQL

Confluent CLI Launches Exciting New Features and an Intuitive UI

Confluent

MAY 18, 2021

With so many technologies in the modern development ecosystem, a common complaint is having to go through the mental gymnastics of adopting new products and keeping up with ever-expanding feature […].

Technology

Technology Cloud

NVIDIA RAPIDS in Cloudera Machine Learning

Cloudera

MAY 19, 2021

Introduction. In the previous blog post in this series, we walked through the steps for leveraging Deep Learning in your Cloudera Machine Learning (CML) projects. This year, we expanded our partnership with NVIDIA, enabling your data teams to dramatically speed up compute processes for data engineering and data science workloads with no code changes using RAPIDS AI.

Machine Learning

Machine Learning Datasets Data Science Raw Data

Easily Build Advanced Similarity Search With The Pinecone Vector Database

Data Engineering Podcast

MAY 25, 2021

Summary Machine learning models use vectors as the natural mechanism for representing their internal state. The problem is that in order for the models to integrate with external systems their internal state has to be translated into a lower dimension. To eliminate this impedance mismatch, Edo Liberty founded Pinecone to build a database that works natively with vectors.

Database

Database Building Data Warehouse Machine Learning

Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

Data

Data Observability and Monitoring with DataOps

DataKitchen

MAY 10, 2021

Data errors impact decision-making. When analytics and dashboards are inaccurate, business leaders may not be able to solve problems and pursue opportunities. Data errors infringe on work-life balance. They cause people to work long hours at the expense of personal and family time. Data errors also affect careers. If you have been in the data profession for any length of time, you probably know what it means to face a mob of stakeholders who are angry about inaccurate or late analytics.

Manufacturing

Manufacturing Data Pipeline Data Data Analytics

Twelve Thoughts About the Data Mesh

Teradata

MAY 15, 2021

The concept of Data Mesh is abuzz in the industry right now. Find out why we're so enthusiastic about it.

Data

Data IT

Computer Vision in Healthcare: Creating an AI Diagnostic Tool for Medical Image Analysis

AltexSoft

MAY 12, 2021

Our lungs are the only body organs that constantly interact with the external environment, through the air we breathe. This exposure makes the respiratory system extremely susceptible to a wide range of diseases, from long-familiar asthma to novel COVID-19. Subtle at early stages, the signs of lung conditions are easy to overlook. And delays in diagnosis often lead to harsh consequences.

Medical

Medical Healthcare Datasets Machine Learning

Error Handling Patterns for Apache Kafka Applications

Confluent

MAY 21, 2021

Apache Kafka® applications run in a distributed manner across multiple containers or machines. And in the world of distributed systems, what can go wrong often goes wrong. This blog post […].

Kafka

Kafka Systems

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

Cloud

The Ethics of AI Comes Down to Conscious Decisions

Cloudera

MAY 27, 2021

This blog post was written by Pedro Pereira as a guest author for Cloudera. Right now, someone somewhere is writing the next fake news story or editing a deepfake video. An authoritarian regime is manipulating an artificial intelligence (AI) system to spy on technology users. No matter how good the intentions behind the development of a technology, someone is bound to corrupt and manipulate it.

Algorithm

Algorithm Media Consulting Machine Learning

A Holistic Approach To Data Governance Through Self Reflection At Collibra

Data Engineering Podcast

MAY 20, 2021

Summary Data governance is a phrase that means many different things to many different people. This is because it is actually a concept that encompasses the entire lifecycle of data, across all of the people in an organization who interact with it. Stijn Christiaens co-founded Collibra with the goal of addressing the wide variety of technological aspects that are necessary to realize such an important and expansive process.

Data Governance

Data Governance Government Data Warehouse Data Pipeline

Netflix Drive

Netflix Tech

MAY 5, 2021

A file and folder interface for Netflix Cloud Services. Written by Vikram Krishnamurthy, Kishore Kasi, Abhishek Kapatkar, and Tejas Chopra. In this post, we are introducing Netflix Drive, a cloud drive for media assets, and providing a high-level overview of some of its features and interfaces. We intend this to be the first post in a series of posts covering Netflix Drive.

Metadata

Metadata Bytes Media Cloud Storage

Beyond Resilience-The Next Generation of Supply Chain

Teradata

MAY 11, 2021

After the shock of COVID exposed the brittle nature of many global supply chains, focus has shifted to resilience, a necessary consideration but not the only one.

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Business Intelligence

Intelligent Document Processing: Technology Overview

AltexSoft

MAY 27, 2021

Whatever the industry, various documents accompany at least a quarter of business operations. Healthcare, for example, is filled with millions of patient records and medical forms. In transportation, these can be maintenance and driver logs. The documents often come in semi-structured and unstructured data formats, which makes them difficult to process quickly and accurately.

Technology

Technology Process Insurance Medical

Kafka Summit Europe 2021 Recap

Confluent

MAY 17, 2021

And that’s a wrap on Kafka Summit Europe 2021, the first of three global Kafka Summits this year. We’ve seen 17,000 registrations from over 7,000 companies and 137 different countries. […].

Kafka

Automating CDP Private Cloud Installations with Ansible

Cloudera

MAY 10, 2021

The introduction of CDP Public Cloud has dramatically reduced the time in which you can be up and running with Cloudera’s latest technologies, be it with containerised Data Warehouse, Machine Learning, Operational Database or Data Engineering experiences or the multi-purpose VM-based Data Hub style of deployment. In CDP Private Cloud, the introduction of Cloudera Data Warehouse and Cloudera Machine Learning Experiences on RedHat OpenShift Kubernetes clusters means that we can deploy new

Cloud

Cloud Consulting Data Warehouse Machine Learning

Unlocking The Power of Data Lineage In Your Platform with OpenLineage

Data Engineering Podcast

MAY 18, 2021

Summary Data lineage is the common thread that ties together all of your data pipelines, workflows, and systems. In order to get a holistic understanding of your data quality, where errors are occurring, or how a report was constructed you need to track the lineage of the data from beginning to end. The complicating factor is that every framework, platform, and product has its own concepts of how to store, represent, and expose that information.

Metadata

Metadata Kafka Data Warehouse Hadoop

Prepare Now: 2025's Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, and Terrence Sheflin

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

Data

Achieving observability in async workflows

Netflix Tech

MAY 14, 2021

Written by Colby Callahan, Megha Manohara, and Mike Azar. Managing and operating asynchronous workflows can be difficult without the proper tools and architecture that put observability, debugging, and tracing at the forefront. Imagine getting paged outside normal work hours: users are having trouble with the application you’re responsible for, and you start diving into logs.

Java

Java Programming Language Media Architecture

Open Banking is Transforming Financial Services and Chipping Away the Relevance of Traditional Banks

Teradata

MAY 10, 2021

The sharing of client data in an Open Banking marketplace challenges banks to adopt a customer-centric approach & collaborate with new players to re-define their relevance.

Banking

Banking Data

Asynchronous APIs in CRM and marketing tools

Grouparoo

MAY 27, 2021

When integrating with Destinations, there are generally two main approaches made available by API providers: single or batched. With the "single" approach, one API request usually affects a single profile in the destination. The "batched" approach, which you can read more about here, allows you to affect multiple profiles in a single API request.
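The difference between the two approaches can be sketched by counting requests. This is an illustrative sketch only: `FakeDestination`, its two methods, and the `batch_size` parameter are hypothetical and do not correspond to Grouparoo's code or to any real provider's API.

```python
from typing import Dict, List

Profile = Dict[str, str]


class FakeDestination:
    """Hypothetical in-memory stand-in for a destination API; counts requests."""

    def __init__(self) -> None:
        self.requests = 0
        self.profiles: Dict[str, Profile] = {}

    def update_profile(self, profile: Profile) -> None:
        # "Single" endpoint: one request affects one profile.
        self.requests += 1
        self.profiles[profile["id"]] = profile

    def update_profiles(self, batch: List[Profile]) -> None:
        # "Batched" endpoint: one request affects every profile in the batch.
        self.requests += 1
        for profile in batch:
            self.profiles[profile["id"]] = profile


def sync_single(dest: FakeDestination, profiles: List[Profile]) -> None:
    for profile in profiles:
        dest.update_profile(profile)


def sync_batched(dest: FakeDestination, profiles: List[Profile], batch_size: int = 2) -> None:
    # Slice the profiles into chunks of at most batch_size and send each chunk once.
    for start in range(0, len(profiles), batch_size):
        dest.update_profiles(profiles[start:start + batch_size])


if __name__ == "__main__":
    people = [{"id": str(i), "email": f"user{i}@example.com"} for i in range(5)]
    single, batched = FakeDestination(), FakeDestination()
    sync_single(single, people)
    sync_batched(batched, people, batch_size=2)
    print(single.requests, batched.requests)
```

Both destinations end up with the same profiles; only the number of requests differs.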

Process

Process Systems Designing IT

Using kafka-merge-purge to Deal with Failure in an Event-Driven System at FLYERALARM

Confluent

MAY 13, 2021

Failures are inevitable in any system, and there are various options for mitigating them automatically. This is made possible by event-driven applications leveraging Apache Kafka® and built with fault tolerance […].

Kafka

Kafka Systems

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

Manufacturing

5 Factors to Consider When Choosing a Stream Processing Engine

Cloudera

MAY 13, 2021

Are you using the right stream processing engine for the job at hand? You might think you are—and you very well might be!—but have you really examined the stream processing engines out there in a side-by-side comparison to make sure? Our Choose the Right Stream Processing Engine for Your Data Needs whitepaper makes those comparisons for you, so you can quickly and confidently determine which engine best meets your key business requirements.

Process

Process Engineering Kafka Architecture

Building Your Data Warehouse On Top Of PostgreSQL

Data Engineering Podcast

MAY 13, 2021

Summary There is a lot of attention on the database market and cloud data warehouses. While they provide a measure of convenience, they also require you to sacrifice a certain amount of control over your data. If you want to build a warehouse that gives you both control and flexibility then you might consider building on top of the venerable PostgreSQL project.

PostgreSQL

PostgreSQL Data Warehouse Building MySQL

Data Transformations Using the Data Build Tool

Ripple Engineering

MAY 27, 2021

At Ripple, we are moving towards building complex business models out of raw data. To do this successfully, we need to automate our historically manual processes. Even with a digital-first approach, many of our internal processes were done by hand, making them great candidates to be automated. A prime example of this was the process of managing our data transformation workflows.

Building

Building Raw Data SQL Data

What Isaac Newton Did in Lockdown – And What it Tells Us About Data Science

Teradata

MAY 5, 2021

The end of the pandemic may well be in sight, but it’s highlighted the incredible power of data science to transform economies, industries & people’s lives for the better.

Data Science

Data Science IT Data

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.

Systems
