Top Data Engineering Digest Generalist Data Analysis Content for July, 2021

July, 2021

Containerizing Apache Hadoop Infrastructure at Uber

Uber Engineering

JULY 22, 2021

Introduction. As Uber’s business grew, we scaled our Apache Hadoop (referred to as ‘Hadoop’ in this article) deployment to 21000+ hosts in 5 years, to support the various analytical and machine learning use cases. We built a team with varied …

Hadoop

Hadoop Machine Learning Engineering Architecture

Tyrannical Data and Its Antidotes in the Microservices World

Confluent

JULY 14, 2021

Data is the lifeblood of so much of what we build as software professionals, so it’s unsurprising that operations involving its transfer occupy the vast majority of developer time across […].

IT Data Building Database

Trending Sources

How to Validate Datatypes in Python

Start Data Engineering

JULY 21, 2021

Data type issues are one of the biggest concerns when processing data in Python. If you are wondering how to make sure that a column is of a specific data type (e.g.

Python

Python Database Process Data
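As a rough illustration of the kind of check the teaser above describes, here is a minimal sketch of column type validation in plain Python. The schema, column names, and the `find_type_errors` helper are hypothetical and are not taken from the linked article.

```python
from typing import Any

# Hypothetical schema for illustration: column name -> expected Python type.
EXPECTED_TYPES: dict[str, type] = {"user_id": int, "email": str, "amount": float}


def find_type_errors(row: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return one message for every column that is missing or has the wrong type."""
    errors = []
    for column, expected in schema.items():
        if column not in row:
            errors.append(f"{column}: missing")
            continue
        value = row[column]
        # bool is a subclass of int, so reject it explicitly for non-bool columns.
        if isinstance(value, bool) and expected is not bool:
            errors.append(f"{column}: expected {expected.__name__}, got bool")
        elif not isinstance(value, expected):
            errors.append(
                f"{column}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


good_row = {"user_id": 1, "email": "a@example.com", "amount": 9.99}
bad_row = {"user_id": "1", "email": "a@example.com"}

print(find_type_errors(good_row, EXPECTED_TYPES))
print(find_type_errors(bad_row, EXPECTED_TYPES))
```

Returning a list of messages instead of raising on the first failure lets a pipeline report every bad column in a row at once. A validation library would typically add type coercion and nested models on top of this.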

Airflow on Kubernetes : Get started in 10 mins

Marc Lamberti

JULY 6, 2021

Airflow on Kubernetes is quite popular, isn’t it? There is a good chance that you know Kubernetes, that you even have a Kubernetes cluster, and that you would like to deploy and run Airflow on it. However, Kubernetes is hard. There are so many things to deal with that it can be really laborious just to deploy an application. Fortunately for us, some super smart people have created Helm.

Accessible

Accessible Accessibility Data Pipeline Building

A Guide to Debugging Apache Airflow® DAGs

In Airflow, DAGs (your data pipelines) support nearly every use case. As these workflows grow in complexity and scale, efficiently identifying and resolving issues becomes a critical skill for every data engineer. This is a comprehensive guide with best practices and examples for debugging Airflow DAGs. You’ll learn how to: create a standardized process for debugging to quickly diagnose errors in your DAGs; identify common issues with DAGs, tasks, and connections; distinguish between Airflow-relate

Data Pipeline

Delivering Modern Enterprise Data Engineering with Cloudera Data Engineering on Azure

Cloudera

JULY 13, 2021

After the launch of CDP Data Engineering (CDE) on AWS a few months ago, we are thrilled to announce that CDE, the only cloud-native service purpose-built for enterprise data engineers, is now available on Microsoft Azure. CDP Data Engineering offers an all-inclusive toolset that enables data pipeline orchestration, automation, advanced monitoring, visual profiling, and a comprehensive management toolset for streamlining ETL processes and making complex data actionable across your analytic team

Data Engineering

Data Engineering Data Engineer Engineering Pipeline-centric

Data Movement in Netflix Studio via Data Mesh

Netflix Tech

JULY 26, 2021

By Andrew Nguonly, Armando Magalhães, Obi-Ike Nwoke, Shervin Afshar, Sreyashi Das, Tongliang Liu, Wei Liu, Yucheng Zeng. Background: Over the next few years, most content on Netflix will come from Netflix’s own Studio. From the moment a Netflix film or series is pitched and long before it becomes available on Netflix, it goes through many phases.

Data

Data MySQL Data Pipeline Data Warehouse

Uber’s Fulfillment Platform: Ground-up Re-architecture to Accelerate Uber’s Go/Get Strategy

Uber Engineering

JULY 27, 2021

Introduction to Fulfillment at Uber. Uber’s mission is to help our consumers effortlessly go anywhere and get anything in thousands of cities worldwide. At its core, we capture a consumer’s intent and fulfill it by matching it with the right …

Architecture

Architecture Engineering IT

More Trending

Create a Data Analysis Pipeline with Apache Kafka and RStudio

Confluent

JULY 13, 2021

In Data Science projects, we distinguish between descriptive analytics and statistical models running in production. Overall, these can be seen as one process. You start with analyzing historical data to […].

Data Analysis

Data Analysis Kafka Data Science Data

Adding Context And Comprehension To Your Analytics Through Data Discovery With SelectStar

Data Engineering Podcast

JULY 30, 2021

Summary: Companies of all sizes and industries are trying to use the data that they and their customers generate to survive and thrive in the modern economy. As a result, they are relying on a constantly growing number of data sources being accessed by an increasingly varied set of users. In order to help data consumers find and understand the data that is available, and help the data producers understand how to prioritize their work, SelectStar has built a data discovery platform that brings everyone

BI SQL Data Engineering Data Engineer

COVID-19 Pandemic Analytics for a Safe Return-To-Office

Teradata

JULY 11, 2021

Learn how Teradata is using Advanced Analytics to guide its safe Return-To-Office (RTO) policy for its global employees.

Reflecting on Cloudera’s Commitment to Address Workplace Inequality: One Year Later

Cloudera

JULY 8, 2021

It’s been a year of awakening and change across the U.S. and around the world. One year ago our CEO Rob Bearden vowed to take decisive action to make Cloudera a more diverse, equitable, and inclusive place to work and have Cloudera take an active role in promoting those attributes in the tech industry and our communities. There is no one-size-fits-all solution to creating an intentional and strategic plan for a diverse workforce.

Finance

Finance Programming Education Datasets

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Speaker: Tamara Fingerlin, Developer Advocate

Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.

Data

DevOps Is Not DataOps

DataKitchen

JULY 1, 2021

Arvind Murali, Intelligent Data podcast host, interviews DataKitchen CEO Chris Bergh about how DataOps helps improve the speed of data & analytics deployment.

Data

Elastic Distributed Training with XGBoost on Ray

Uber Engineering

JULY 7, 2021

Introduction. Since we productionized distributed XGBoost on Apache Spark™ at Uber in 2017, XGBoost has powered a wide spectrum of machine learning (ML) use cases at Uber, spanning from optimizing marketplace dynamic pricing policies for Freight, improving times of …

Machine Learning

Machine Learning Engineering Architecture

Protecting Data Integrity in Confluent Cloud: Over 8 Trillion Messages Per Day

Confluent

JULY 30, 2021

It’s about maintaining the right data even when no one is watching. Last year, Confluent announced support for Infinite Storage, which fundamentally changes data retention in Apache Kafka® by allowing […].

Data Integration

Data Integration Cloud Kafka Data

Building a Multi-Tenant Managed Platform For Streaming Data With Pulsar at Datastax

Data Engineering Podcast

JULY 27, 2021

Summary: Everyone expects data to be transmitted, processed, and updated instantly as more and more products integrate streaming data. The technology to make that possible has been around for a number of years, but the barriers to adoption have still been high due to the level of technical understanding and operational capacity that have been required to run at scale.

Management

Management Building Kafka Data Warehouse

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage

There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.

Systems

Data Engineers of Netflix - Interview with Kevin Wylie

Netflix Tech

JULY 15, 2021

This post is part of our “Data Engineers of Netflix” series, where our very own data engineers talk about their journeys to Data Engineering @ Netflix. Kevin Wylie is a Data Engineer on the Content Data Science and Engineering team. In this post, Kevin talks about his extensive experience in content analytics at Netflix since joining more than 10 years ago.

Data Engineering

Data Engineering Data Engineer Engineering Entertainment

#ClouderaLife Spotlight: Vinicius Cardoso, Sr Solutions Engineering

Cloudera

JULY 30, 2021

Meet Vinicius Cardoso, better known as Vini. He is a Sr. Solutions Engineer (SE) working in Australia. In his role, customers are at the center of everything he does. Wearing the hat of Enterprise Architect, he dives deep to understand customers’ organizational goals, initiatives and requirements in order to identify the key capabilities that need to be delivered.

Engineering

Engineering Professional Services Big Data Data Science

Does Your Organization Need a Chief Data Officer? Probably

DataKitchen

JULY 21, 2021

Data

Customer Support Automation Platform at Uber

Uber Engineering

JULY 14, 2021

High Level Overview of the Problem. Introduction. If you’ve used any online/digital service, chances are that you are familiar with what a typical customer service experience entails: you send a message (usually email aliased) to the company’s support staff, fill …

Engineering

Engineering Architecture Data

How to Modernize Manufacturing Without Losing Control

Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives

Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri

Manufacturing

Announcing ksqlDB 0.19.0

Confluent

JULY 20, 2021

We’re pleased to announce ksqlDB 0.19.0! This release includes a new NULLIF function and a major upgrade to ksqlDB’s data modeling capabilities—foreign-key joins. We’re excited to share this highly requested […].

Data

Data Process

Bringing The Metrics Layer To The Masses With Transform

Data Engineering Podcast

JULY 22, 2021

Summary: Collecting and cleaning data is only useful if someone can make sense of it afterward. The latest evolution in the data ecosystem is the introduction of a dedicated metrics layer to help address the challenge of adding context and semantics to raw information. In this episode, Nick Handel shares the story behind Transform, a new platform that provides a managed metrics layer for your data platform.

SQL

SQL BI Data Warehouse Data Engineering

Propensity Model: How to Predict Customer Behavior Using Machine Learning

AltexSoft

JULY 8, 2021

It’s common practice for companies and their marketing teams to try to predict how certain groups of customers are likely to act under certain circumstances. For this purpose, they create propensity models. When these models are built in a traditional statistical fashion, the accuracy of their predictions isn’t always high. To help companies unlock the full potential of personalized marketing, propensity models should use the power of machine learning technologies.

Machine Learning

Machine Learning Algorithm Education Data Science

Accelerate Offloading to Cloudera Data Warehouse (CDW) with Procedural SQL Support

Cloudera

JULY 16, 2021

Did you know Cloudera customers, such as SMG and Geisinger, offloaded their legacy DW environment to Cloudera Data Warehouse (CDW) to take advantage of CDW’s modern architecture and best-in-class performance? In addition to substantial cost savings upon moving to CDW, Geisinger is also able to search through hundreds of millions of patient note records in seconds, providing better treatment to their patients.

Data Warehouse

Data Warehouse SQL PostgreSQL Database

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

Cloud

What Is Referential Transparency and Why Should You Care?

Rock the JVM

JULY 28, 2021

Discover how referential transparency boosts your productivity as a functional programmer in Scala and why it's crucial

Scala

Scala IT

Data cleaning for nulls with SQL vs. code

Grouparoo

JULY 28, 2021

When preparing your data set for analysis, it is crucial to ensure that your data set is both complete and accurate. One step in this process is deciding how to handle null values. Depending on how your data is going to be used, you may not want null values at all! Let's clean some data: we're going to take a look at calculating the Lifetime Value (LTV) of a customer.

SQL

SQL Coding Aggregated Data Python
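To make the null-handling decision in the teaser above concrete, here is a minimal Python sketch. It assumes, purely for illustration, that LTV is the sum of a customer's purchase amounts. The records and the `lifetime_value` helper are hypothetical and are not taken from the linked article.

```python
from collections import defaultdict

# Hypothetical purchase records; None stands in for a SQL NULL amount.
purchases = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": None},
    {"customer_id": 2, "amount": 5.0},
    {"customer_id": 3, "amount": None},
]


def lifetime_value(rows, null_as_zero=True):
    """Sum purchase amounts per customer.

    null_as_zero=True counts a null amount as 0, similar in spirit to
    COALESCE(amount, 0) in SQL. null_as_zero=False skips null rows, so a
    customer whose amounts are all null does not appear in the result.
    """
    totals = defaultdict(float)
    for row in rows:
        amount = row["amount"]
        if amount is None:
            if not null_as_zero:
                continue
            amount = 0.0
        totals[row["customer_id"]] += amount
    return dict(totals)


print(lifetime_value(purchases))
print(lifetime_value(purchases, null_as_zero=False))
```

The two modes differ only for customer 3, whose single purchase has a null amount: treating nulls as zero keeps that customer in the result with a total of 0.0, while skipping nulls leaves the customer out.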

Speed, Scale, Storage: Our Journey from Apache Kafka to Performance in Confluent Cloud

Confluent

JULY 28, 2021

At Confluent, we focus on the holy trinity of performance, price, and availability, with the goal of delivering a similar performance envelope for all workloads across all supported cloud providers. […].

Cloud

Cloud Kafka Architecture

Strategies For Proactive Data Quality Management

Data Engineering Podcast

JULY 19, 2021

Summary: Data quality is a concern that has been gaining attention alongside the rising importance of analytics for business success. Many solutions rely on hand-coded rules for catching known bugs, or statistical analysis of records to detect anomalies retroactively. While those are useful tools, it is far better to prevent data errors before they become an outsized issue.

Management

Management Data Warehouse Data Pipeline Data

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Business Intelligence

Recommender Systems: Behind the Scenes of Machine-Learning-Based Personalization

AltexSoft

JULY 27, 2021

Steve Jobs once said, “People don’t know what they want until you show it to them”. Well, try arguing with that, considering that we all watch videos suggested by YouTube, buy goods suggested by Amazon, and watch TV shows suggested by Netflix. People like being guided and given relevant offers and recommendations. They like being treated in a personal manner.

Machine Learning

Machine Learning Systems Algorithm Deep Learning

Five Strategies to Accelerate Data Product Development

Cloudera

JULY 26, 2021

Introduction. With this first article of the two-part series on data product strategies, I am presenting some of the emerging themes in data product development and how they inform the prerequisites and foundational capabilities of an Enterprise data platform that would serve as the backbone for developing successful data product strategies. Once we have identified those capabilities, the second article explores how the Cloudera Data Platform delivers those prerequisite capabilities and has enab

Generalist

Generalist Telecommunication Healthcare Data Science

Knowledge Graph Technologies Accelerate and Improve the Data Model Definition for Master Data

Zalando Engineering

JULY 28, 2021

The Master Data Management Challenge: Master data management (MDM) is a technology-enabled discipline in which business and Information Technology work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise's official shared master data assets. At Zalando we are at an early phase of realising MDM for our internal data assets and we have chosen to do it in a consolidated style.

Technology

Technology Data Systems Data Governance

The Ultimate Guide to Apache Airflow DAGS

With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You’ll learn how to: understand the building blocks of DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to; write DAGs that adapt to your data at runtime and set up alerts and notifications; scale you

Data Engineering
