Top Data Engineering Digest Java Data Process Content for Week of May 01

Sat. May 01, 2021 - Fri. May 07, 2021

Spark on Kubernetes – Gang Scheduling with YuniKorn

Cloudera

MAY 5, 2021

Apache YuniKorn (Incubating) has just released 0.10.0 (release announcement). As part of this release, a new feature called Gang Scheduling has become available. With Gang Scheduling, Spark job scheduling on Kubernetes becomes more efficient. What is Apache YuniKorn (Incubating)? Apache YuniKorn (Incubating) is a new Apache incubator project that offers rich scheduling capabilities on Kubernetes.

Metadata

Metadata Algorithm Big Data Machine Learning

Making Spark Cloud Native At Data Mechanics

Data Engineering Podcast

MAY 6, 2021

Summary: Spark is one of the most well-known frameworks for data processing, whether for batch or streaming, ETL or ML, and at any scale. Because of its popularity, it has been deployed on every kind of platform you can think of. In this episode, Jean-Yves Stephan shares the work that he is doing at Data Mechanics to make it sing on Kubernetes. He explains how operating in a cloud-native context simplifies some aspects of running the system while complicating others, how it simplifies the developme

Cloud

Cloud Data Warehouse Data Engineer Data Engineering

Webinars

Agent Tooling: Connecting AI to Your Tools, Systems & Data

How to Modernize Manufacturing Without Losing Control

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Trending Sources

What’s the Secret Recipe for DataOps?

DataKitchen

MAY 3, 2021

Catalog & Cocktails podcast hosts Tim Gasper & Juan Sequeda of data.world interview DataKitchen CEO Chris Bergh on how to create the right DataOps culture & measuring the value of your DataOps strategy. The post What’s the Secret Recipe for DataOps? first appeared on DataKitchen.

Confluent Update Regarding Codecov Incident

Confluent

MAY 5, 2021

Our team was recently notified of unauthorized read-only access to Confluent’s GitHub account stemming from the recent Codecov incident (more information here). The security of our customers and their data […].

Accessibility

Accessibility Accessible Data

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Speaker: Tamara Fingerlin, Developer Advocate

Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.

Data

Driving Agility and Scalability through Smart Data

Cloudera

MAY 3, 2021

Last year presented business and organizational challenges that hadn’t been seen in a century and the troubling fact is that the challenges applied pains and gains unequally across industry segments. While brick-and-mortar retail was crushed a year ago with mandated store closures, digital commerce retailers realized ten years of digital sales penetration in only three months.

Scala

Scala Retail Java SQL

The Grand Vision And Present Reality of DataOps

Data Engineering Podcast

MAY 3, 2021

Summary: The data industry is changing rapidly, and one of the most active areas of growth is automation of data workflows. Taking cues from the DevOps movement of the past decade, data professionals are orienting around the concept of DataOps. More than just a collection of tools, there are a number of organizational and conceptual changes that a proper DataOps approach depends on.

Data Warehouse

Data Warehouse Data Pipeline BI Metadata

Netflix Drive

Netflix Tech

MAY 5, 2021

A file and folder interface for Netflix Cloud Services. Written by Vikram Krishnamurthy, Kishore Kasi, Abhishek Kapatkar, and Tejas Chopra. In this post, we are introducing Netflix Drive, a cloud drive for media assets, and providing a high-level overview of some of its features and interfaces. We intend this to be the first post in a series covering Netflix Drive.

Metadata

Metadata Bytes Media Cloud Storage

More Trending

How DataOps Enables a Data Fabric

DataKitchen

MAY 4, 2021

The post How DataOps Enables a Data Fabric first appeared on DataKitchen.

Data

Quantifying the value of multi-cloud deployment strategies with CDP Public Cloud

Cloudera

MAY 6, 2021

In the introductory article of this series, I presented the overarching framework for quantifying the value of the Cloudera Data Platform (CDP). In this article, I will focus on the contribution that a multi-cloud strategy makes towards these value drivers, and address a question that I regularly get from clients: Is there a quantifiable benefit to a multi-cloud deployment?

Cloud

Cloud Insurance Metadata Utilities

What Isaac Newton Did in Lockdown – And What it Tells Us About Data Science

Teradata

MAY 5, 2021

The end of the pandemic may well be in sight, but it’s highlighted the incredible power of data science to transform economies, industries & people’s lives for the better.

Data Science

Data Science IT Data

Streaming ETL with Confluent: Routing and Fan-Out of Apache Kafka Messages with ksqlDB

Confluent

MAY 4, 2021

In the world of data engineering, data routing decisions are crucial to successful distributed system design. Some organizations choose to route data from within application code. Other teams hand off […].

Kafka

Kafka Data Engineer Data Engineering Coding

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage

There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.

Systems

Functional Collections in Scala

Rock the JVM

MAY 7, 2021

Discover a powerful Scala feature that many developers overlook: a concise guide to functional collections that could revolutionize your Scala programming

Scala

Scala Programming

Streaming Market Data with Flink SQL Part I: Streaming VWAP

Cloudera

MAY 4, 2021

This article is the first of a multipart series to showcase the power and expressibility of Flink SQL applied to market data. Code and data for this series are available on GitHub. It was co-authored by Krishnen Vytelingum, Head of Quantitative Modeling, Simudyne. Speed matters in financial markets. Whether the goal is to maximize alpha or minimize exposure, financial technologists invest heavily in having the most up-to-date insights on the state of the market and where it is going.

SQL

SQL Business Analyst Data Java
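
For readers unfamiliar with the term in the entry above: VWAP (volume-weighted average price) is the sum of price times volume divided by the sum of volume. The sketch below is plain Python, not the Flink SQL used in the linked article, and only illustrates how the quantity can be maintained incrementally as trades arrive. The names RunningVwap and streaming_vwap and the sample trades are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningVwap:
    """Running volume-weighted average price for a single symbol (illustrative)."""
    notional: float = 0.0  # running sum of price * size
    volume: float = 0.0    # running sum of size

    def update(self, price: float, size: float) -> float:
        # Fold one trade into the running totals and return the new VWAP.
        self.notional += price * size
        self.volume += size
        return self.notional / self.volume


def streaming_vwap(trades):
    """Yield the VWAP after each (price, size) trade in arrival order."""
    state = RunningVwap()
    for price, size in trades:
        if size <= 0:
            continue  # ignore trades without volume so we never divide by zero
        yield state.update(price, size)


# Hypothetical sample trades: (price, size)
trades = [(100.0, 10), (102.0, 30), (101.0, 60)]
vwaps = list(streaming_vwap(trades))
```

Keeping only two running sums means each new trade is handled in constant time, which is the property a streaming aggregation relies on.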

Hardening Palantir’s Kubernetes Infrastructure with Cilium

Palantir

MAY 6, 2021

Containerized infrastructure has become an industry-wide trend as engineering teams lean on the likes of Docker or Kubernetes to manage, deploy, and scale their environments; here, Palantir is no exception. We built Rubix, Palantir’s Kubernetes infrastructure, with two primary goals in mind: streamlining and scaling the deployment of our software platforms and strengthening our security posture.

Bytes

Bytes Engineering Metadata Process

Announcing the MongoDB Atlas Sink and Source Connectors in Confluent Cloud

Confluent

MAY 6, 2021

Today, Confluent is announcing the general availability (GA) of the fully managed MongoDB Atlas Source and MongoDB Atlas Sink Connectors within Confluent Cloud. Now, with just a few simple clicks, […].

MongoDB

MongoDB Cloud Management Kafka

How to Modernize Manufacturing Without Losing Control

Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives

Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri

Manufacturing

Working with Mixed Data Types within a Field Using Rockset

Rockset

MAY 5, 2021

So, you think all your data in a particular field is of string type, but when you try to run your query, you get some errors. After more investigation, it looks like you have some int and undefined types as well. Bummer. Despair not! We can actually work around this (without data prep). To recap, in our first blog, we created an integration with MongoDB on Rockset, so Rockset can read and [update] the data coming in MongoDB.

MongoDB

MongoDB Unstructured Data Data SQL
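
The situation described in the entry above (a field that holds strings, ints, and undefined values) can be illustrated with a small sketch. This is plain Python, not Rockset SQL and not the approach taken in the linked post; the function name normalize_field and the field name "zip" are hypothetical, and an undefined field is modelled here as a missing dictionary key.

```python
def normalize_field(value):
    """Coerce a mixed-type field value to a string; keep undefined as None."""
    if value is None:
        return None  # field was missing (undefined) in the source record
    if isinstance(value, str):
        return value
    if isinstance(value, int):
        return str(value)
    # Anything else is unexpected in this sketch, so fail loudly.
    raise TypeError(f"unsupported type: {type(value).__name__}")


# Hypothetical records where the same field arrives as str, int, or not at all.
records = [{"zip": "98101"}, {"zip": 98101}, {}]
normalized = [normalize_field(r.get("zip")) for r in records]
```

After normalization the string and int variants compare equal, while missing values stay distinguishable from real data.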

Welcome, Teal!

Grouparoo

MAY 5, 2021

We are excited to have Teal Larson come aboard Grouparoo as an engineer. Teal has already started working on our www site, building out pages that help communicate what we are building and for whom. We have doubled our Pacific Northwest cohort. I think that means that we will have to plan a trip up there for a hiking offsite. The first thing I noticed about Teal was her time outside of tech as a language arts teacher.

Building

Building Engineering IT

Cloud Migration Series (Step 2 of 5): Start Planning

Cloud Academy

MAY 6, 2021

This is part 2 of a 5-part series on best practices for enterprise cloud migration. Released weekly from the end of April to the end of May 2021, each article will cover a new phase of a business’s transition to the cloud, what to be on the lookout for, and how to ensure the journey is a success. Be sure to subscribe to our blog to be notified when new content goes live!

Cloud

Cloud Google Cloud AWS Data Lake

The Ultimate Guide to Apache Airflow DAGS

With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You’ll learn how to: Understand the building blocks DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to Write DAGs that adapt to your data at runtime and set up alerts and notifications Scale you

Data Engineer

How to Conduct Data Incident Management for Data Teams

Monte Carlo

MAY 6, 2021

As data systems become increasingly distributed and companies ingest more and more data, the opportunity for error (and incidents) only increases. For decades, software engineering teams have relied on a multi-step process to identify, triage, resolve, and prevent issues from taking down their applications. As data operations mature, it’s time we treat data downtime, in other words, periods of time when data is missing, inaccurate, or otherwise erroneous, with the same diligence, particularly w

Management

Management Data Pipeline Software Engineer Software Engineering

Using Sync Modes in Grouparoo

Grouparoo

MAY 4, 2021

We've improved the Getting Started experience! Check out our UI Configuration method. The steps that use grouparoo generate will not be replicable, as the command will be fully deprecated in v0.8.1. A few weeks ago we wrote about Sync Modes and why they may be useful when it comes to syncing data to a destination. In short, Sync Modes allow you to have more control over what operations are performed and how Grouparoo interacts with contacts that may already exist in the destination system.

Utilities

Utilities Data Integration Systems Building

The Race to Transform

Teradata

MAY 3, 2021

For banks, the essential elements of survival include not only a comprehensive data strategy that drives real return, but also cultural and organizational changes.

Banking

Banking Data

Which Open Source Data Integration Tool is Best?

Preset

MAY 2, 2021

Airbyte and Meltano are the two leading open-source data integration platforms. In this post, we'll showcase the strengths of both platforms.

Data Integration

Data Integration Data

Apache Airflow® Best Practices: DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

Data
