Sat. May 01, 2021 - Fri. May 07, 2021

Spark on Kubernetes – Gang Scheduling with YuniKorn

Cloudera

Apache YuniKorn (Incubating) has just released 0.10.0 (release announcement). As part of this release, a new feature called Gang Scheduling has become available. By leveraging Gang Scheduling, Spark job scheduling on Kubernetes becomes more efficient. What is Apache YuniKorn (Incubating)? Apache YuniKorn (Incubating) is a new Apache incubator project that offers rich scheduling capabilities on Kubernetes.

Metadata 136

Making Spark Cloud Native At Data Mechanics

Data Engineering Podcast

Summary Spark is one of the most well-known frameworks for data processing, whether for batch or streaming, ETL or ML, and at any scale. Because of its popularity, it has been deployed on every kind of platform you can think of. In this episode, Jean-Yves Stephan shares the work that he is doing at Data Mechanics to make it sing on Kubernetes. He explains how operating in a cloud-native context simplifies some aspects of running the system while complicating others, how it simplifies the developme

Cloud 100

Trending Sources

What’s the Secret Recipe for DataOps?

DataKitchen

Catalog & Cocktails podcast hosts Tim Gasper & Juan Sequeda of data.world interview DataKitchen CEO Chris Bergh on how to create the right DataOps culture & measuring the value of your DataOps strategy. The post What’s the Secret Recipe for DataOps? first appeared on DataKitchen.

98

Confluent Update Regarding Codecov Incident

Confluent

Our team was recently notified of unauthorized read-only access to Confluent’s GitHub account stemming from the recent Codecov incident (more information here). The security of our customers and their data […].

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Speaker: Tamara Fingerlin, Developer Advocate

Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.

Driving Agility and Scalability through Smart Data

Cloudera

Last year presented business and organizational challenges that hadn’t been seen in a century and the troubling fact is that the challenges applied pains and gains unequally across industry segments. While brick-and-mortar retail was crushed a year ago with mandated store closures, digital commerce retailers realized ten years of digital sales penetration in only three months.

Scala 103

The Grand Vision And Present Reality of DataOps

Data Engineering Podcast

Summary The data industry is changing rapidly, and one of the most active areas of growth is automation of data workflows. Taking cues from the DevOps movement of the past decade, data professionals are orienting around the concept of DataOps. A proper DataOps approach is more than just a collection of tools; it depends on a number of organizational and conceptual changes.

More Trending

How DataOps Enables a Data Fabric

DataKitchen

The post How DataOps Enables a Data Fabric first appeared on DataKitchen.

Data 66

Quantifying the value of multi-cloud deployment strategies with CDP Public Cloud

Cloudera

In the introductory article of this series, I presented the overarching framework for quantifying the value of the Cloudera Data Platform (CDP). In this article, I will focus on the contribution that a multi-cloud strategy makes towards these value drivers, and address a question that I regularly get from clients: Is there a quantifiable benefit to a multi-cloud deployment?

Cloud 92

What Isaac Newton Did in Lockdown – And What it Tells Us About Data Science

Teradata

The end of the pandemic may well be in sight, but it’s highlighted the incredible power of data science to transform economies, industries & people’s lives for the better.

Streaming ETL with Confluent: Routing and Fan-Out of Apache Kafka Messages with ksqlDB

Confluent

In the world of data engineering, data routing decisions are crucial to successful distributed system design. Some organizations choose to route data from within application code. Other teams hand off […].

Kafka 59

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage

There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.

Functional Collections in Scala

Rock the JVM

Discover a powerful Scala feature that many developers overlook: a concise guide to functional collections that could revolutionize your Scala programming

Scala 52

Streaming Market Data with Flink SQL Part I: Streaming VWAP

Cloudera

This article is the first of a multipart series to showcase the power and expressibility of FlinkSQL applied to market data. Code and data for this series are available on GitHub. It was co-authored by Krishnen Vytelingum, Head of Quantitative Modeling, Simudyne. Speed matters in financial markets. Whether the goal is to maximize alpha or minimize exposure, financial technologists invest heavily in having the most up-to-date insights on the state of the market and where it is going.

SQL 80

Hardening Palantir’s Kubernetes Infrastructure with Cilium

Palantir

Containerized infrastructure has become an industry-wide trend as engineering teams lean on the likes of Docker or Kubernetes to manage, deploy, and scale their environments; here, Palantir is no exception. We built Rubix, Palantir’s Kubernetes infrastructure, with two primary goals in mind: streamlining and scaling the deployment of our software platforms and strengthening our security posture.

Bytes 52

Announcing the MongoDB Atlas Sink and Source Connectors in Confluent Cloud

Confluent

Today, Confluent is announcing the general availability (GA) of the fully managed MongoDB Atlas Source and MongoDB Atlas Sink Connectors within Confluent Cloud. Now, with just a few simple clicks, […].

MongoDB 52

How to Modernize Manufacturing Without Losing Control

Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives

Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri

Working with Mixed Data Types within a Field Using Rockset

Rockset

So, you think all the data in a particular field is of string type, but when you try to run your query, you get some errors. On further investigation, it looks like you have some int and undefined types as well. Bummer. Despair not! We can actually work around this (without data prep). To recap, in our first blog, we created an integration with MongoDB on Rockset, so Rockset can read and [update] the data coming in MongoDB.

MongoDB 52

Welcome, Teal!

Grouparoo

We are excited to have Teal Larson come aboard Grouparoo as an engineer. Teal has already started working on our www site, building out pages that help communicate what we are building and for whom. We have doubled our Pacific Northwest cohort. I think that means that we will have to plan a trip up there for a hiking offsite. The first thing I noticed about Teal was her time outside of tech as a language arts teacher.

Cloud Migration Series (Step 2 of 5): Start Planning

Cloud Academy

This is part 2 of a 5-part series on best practices for enterprise cloud migration. Released weekly from the end of April to the end of May 2021, each article will cover a new phase of a business’s transition to the cloud, what to be on the lookout for, and how to ensure the journey is a success. Be sure to subscribe to our blog to be notified when new content goes live!

Cloud 40

The Ultimate Guide to Apache Airflow DAGS

With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You’ll learn how to: understand the building blocks of DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to; write DAGs that adapt to your data at runtime and set up alerts and notifications; Scale you

How to Conduct Data Incident Management for Data Teams

Monte Carlo

As data systems become increasingly distributed and companies ingest more and more data, the opportunity for error (and incidents) only increases. For decades, software engineering teams have relied on a multi-step process to identify, triage, resolve, and prevent issues from taking down their applications. As data operations mature, it’s time we treat data downtime, in other words, periods of time when data is missing, inaccurate, or otherwise erroneous, with the same diligence, particularly w

Using Sync Modes in Grouparoo

Grouparoo

We've improved the Getting Started Experience! Check out our UI Configuration method. The steps utilizing grouparoo generate will not be replicable, as the command will be fully deprecated in v0.8.1. A few weeks ago we wrote about Sync Modes and why they may be useful when it comes to syncing data to a destination. In short, Sync Modes allow you to have more control over what operations are performed and how Grouparoo interacts with contacts that may already exist in the destination system.

The Race to Transform

Teradata

For banks, the essential elements of survival include not only a comprehensive data strategy that drives real return, but also cultural and organizational changes.

Banking 52

Which Open Source Data Integration Tool is Best?

Preset

Airbyte and Meltano are the two leading open-source data integration platforms. In this post, we'll showcase the strengths of both platforms.

Apache Airflow® Best Practices: DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!