This post focuses on practical data pipelines, with examples from web-scraping real-estate listings, uploading them to S3 with MinIO, Spark and Delta Lake, adding some Data Science magic with Jupyter Notebooks, ingesting into the Apache Druid data warehouse, visualising dashboards with Superset and managing everything with Dagster. The goal is to touch on common data engineering challenges using promising new technologies, tools and frameworks, most of which I wrote about in Business Intelligence
At the heart of Apache Kafka® sits the log—a simple data structure that uses sequential operations that work symbiotically with the underlying hardware. Efficient disk buffering and CPU cache usage, […].
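As a toy illustration of that idea (not Kafka's actual implementation), a log is simply an append-only sequence of records, written at the end and read back sequentially by offset. A minimal sketch in Python:

```python
# A toy append-only log: records are appended sequentially and read
# back by offset, mirroring the structure at Kafka's core.
# Illustrative only -- not Kafka's actual on-disk format.

class AppendOnlyLog:
    def __init__(self):
        self._records = []

    def append(self, record: bytes) -> int:
        """Append a record and return its offset (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> list[bytes]:
        """Read sequentially, starting at the given offset."""
        return self._records[offset:offset + max_records]

log = AppendOnlyLog()
log.append(b"event-1")
log.append(b"event-2")
print(log.read(0))  # [b'event-1', b'event-2']
```

Because both writes and reads proceed strictly in order, the pattern maps naturally onto sequential disk I/O and CPU prefetching, which is where the efficiency the article describes comes from.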
Just an illustration – not the truth, and you can certainly do it with other technologies. TL;DR: After setting up and organizing the teams, we describe four topics to make data mesh a reality: the self-serve platform, based on a serverless philosophy (life is too short to do provisioning); the building of data products (as code), where we build data workflows, not data pipelines; the promotion of data domains, where the metadata on the data life cycle is as important as your data; […]
Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.
While “software is [still actively] eating the world”, it’s also clear that open source is taking over software. Simply put, open source is a superior approach to building and distributing software because it provides important guarantees around how software can be discovered, tried, operated, collaborated on and packaged. For those reasons, it is not surprising that it has taken over most of the modern data stack: infrastructure, databases, orchestration, data processing, AI/ML and beyond.
By Michelle Brenner Netflix is poised to become the world’s most prolific producer of visual effects and original animated content. To meet that demand, we need to attract the world’s best artistic talent. Artists like to work at places where they can create groundbreaking entertainment instead of worrying about getting access to the software or source files they need.
Real-time analytics has become the need of the hour for modern internet companies. The ability to derive internal insights around business metrics, user growth and adoption as well as security […].
Finance-driven analytics might be the largest untapped opportunity for organizations & a catalyst for driving business value & strategic vision. But what exactly is CFO analytics?
Just an illustration – not the truth, and we will pivot if it does not work. I discovered Zhamak Dehghani’s first article about Data Mesh in August 2020. Thanks to YouTube, you have the live illustration in this video, with even more context and explanations. And then you have this second video, which is an introduction to her second article (December 2020).
Data flows are an integral part of every modern enterprise. No matter whether they move data from one operational system to another to power a business process, or fuel central data warehouses with the latest data for near-real-time reporting, life without them would be full of manual, tedious and error-prone data modification and copying tasks. At Cloudera, we’re helping our customers implement data flows on-premises and in the public cloud using Apache NiFi, a core component of Cloudera DataFlow.
Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage
There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.
ConsoleMe: A Central Control Plane for AWS Permissions and Access By Curtis Castrapel, Patrick Sanders, and Hee Won Kim At AWS re:Invent 2020, we open sourced two new tools for managing multi-account AWS permissions and access. We’re very excited to bring you ConsoleMe (pronounced: kuhn-soul-mee), and its CLI utility, Weep (pun intended)! If you missed the talk, check it out here.
Summary Data quality is at the top of everyone’s mind recently, but getting it right is as challenging as ever. One of the contributing factors is the number of people who are involved in the process and the potential impact on the business if something goes wrong. In this episode Maarten Masschelein and Tom Baeyens share the work they are doing at Soda to bring everyone on board to make your data clean and reliable.
Self-managing a highly scalable distributed system with Apache Kafka® at its core is not an easy feat. That’s why operators prefer tooling such as Confluent Control Center for administering and […].
Getting your Cloud data architecture right starts with understanding which data products you need, the roles they perform, & the functional & non-functional characteristics that those roles demand.
Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives
Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-driven decisions.
Event-driven pipelines · Lambda function to trigger Spark jobs · Setup and run · Monitoring and logging · Teardown · Conclusion · Further reading · References. Event-driven systems represent a software design pattern where logic is executed in response to an event: a file creation on S3, a new database row, an API call, etc.
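As a sketch of the pattern described here (assuming a Spark job running on an EMR cluster; the cluster ID, bucket and script path below are hypothetical placeholders), an S3-triggered Lambda handler might submit a Spark step like this:

```python
import boto3

emr = boto3.client("emr")

# Hypothetical values -- replace with your own cluster and script location.
CLUSTER_ID = "j-XXXXXXXXXXXXX"
SPARK_SCRIPT = "s3://my-bucket/jobs/process_listings.py"

def handler(event, context):
    """Triggered by an S3 ObjectCreated event; submits a Spark job for the new file."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Submit a spark-submit step to the running EMR cluster, passing
    # the newly created object's location as the job argument.
    emr.add_job_flow_steps(
        JobFlowId=CLUSTER_ID,
        Steps=[{
            "Name": f"process {key}",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", SPARK_SCRIPT, f"s3://{bucket}/{key}"],
            },
        }],
    )
```

Wiring the S3 bucket notification to the Lambda function is what makes the pipeline event-driven: no scheduler polls for new files; the arrival of data itself starts the job.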
In October 2020, Cloudera made a strategic acquisition of a company called Eventador, primarily to augment our streaming capabilities within Cloudera DataFlow. Eventador was adept at simplifying the process of building streaming applications. Their flagship product, SQL Stream Builder, made real-time data streams easily accessible with just SQL (Structured Query Language).
Orchestrated Functions as a Microservice by Frank San Miguel on behalf of the Cosmos team Introduction Cosmos is a computing platform that combines the best aspects of microservices with asynchronous workflows and serverless functions. Its sweet spot is applications that involve resource-intensive algorithms coordinated via complex, hierarchical workflows that last anywhere from minutes to years.
Summary The world of business is becoming increasingly dependent on information that is accurate up to the minute. For analytical systems, the only way to provide this reliably is by implementing change data capture (CDC). Unfortunately, this is a non-trivial undertaking, particularly for teams that don’t have extensive experience working with streaming data and complex distributed systems.
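To make the idea concrete, a CDC pipeline consumes a stream of change events from the source database's log and replays them against the analytical store. A minimal sketch, with the event shape loosely following Debezium's envelope (the field names and in-memory "replica" are illustrative, not a production design):

```python
# Apply Debezium-style change events to an in-memory replica of a table.
# Illustrative sketch only; real pipelines must handle schemas,
# event ordering, transactions and retries.

replica = {}  # primary key -> latest row state

def apply_change(event: dict) -> None:
    op = event["op"]  # "c" = create, "u" = update, "d" = delete
    if op in ("c", "u"):
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":
        replica.pop(event["before"]["id"], None)

apply_change({"op": "c", "after": {"id": 1, "status": "new"}})
apply_change({"op": "u", "after": {"id": 1, "status": "paid"}})
apply_change({"op": "d", "before": {"id": 1}})
print(replica)  # {} -- the row was created, updated, then deleted
```

The hard parts the episode alludes to live outside this loop: reliably reading the database log, preserving ordering across partitions, and evolving schemas without breaking downstream consumers.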
With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG-writing features with plenty of example code. You’ll learn how to: understand the building blocks of DAGs and combine them in complex pipelines; schedule your DAG to run exactly when you want it to; write DAGs that adapt to your data at runtime and set up alerts and notifications; scale your […]
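For readers new to the format, a minimal DAG using the TaskFlow API might look like the sketch below (assuming Airflow 2.4 or later; the DAG id, schedule and task bodies are illustrative):

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_pipeline():
    @task
    def extract():
        # Stand-in for pulling rows from a source system.
        return [1, 2, 3]

    @task
    def load(rows):
        print(f"loaded {len(rows)} rows")

    # Passing extract()'s output to load() defines the dependency edge.
    load(extract())

example_pipeline()
```

The call chain, not explicit operators, defines the graph: Airflow infers that load runs after extract because it consumes its return value.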
Apache Kafka ships with Kafka Streams, a powerful yet lightweight client library for Java and Scala to implement highly scalable and elastic applications and microservices that process and analyze data […].
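Kafka Streams itself is JVM-only, but the consume-transform-produce loop at its core can be sketched in Python with the confluent-kafka client (this is an analogue of the pattern, not Kafka Streams; the broker address and topic names are hypothetical):

```python
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # hypothetical broker
    "group.id": "uppercase-app",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["input-topic"])

# Consume, transform, produce -- the basic loop that Kafka Streams
# wraps with state stores, windowing and fault tolerance.
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    producer.produce("output-topic", msg.value().upper())
    producer.flush()
```

What Kafka Streams adds on top of this loop is exactly what makes it attractive for microservices: managed local state, rebalancing, and exactly-once processing without running a separate cluster.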
Big Tech giants dominate by using data to improve product & experience. The auto industry can emulate this by analyzing data to improve customer experience & guide individual choices.
Teams are centralizing their data in their data warehouse by loading data in and transforming it as necessary. Increasingly, we are seeing teams turn to dbt to do this transforming. The idea is to write *.sql files that, when run in the right order, create useful rollup tables or materialized views of the data. We've been asked by teams using dbt how Grouparoo can then sync their data to their cloud-based apps.
Streaming analytics is crucial to modern business – it opens up new product opportunities and creates massive operational efficiencies. In many cases, it’s the difference between creating an outstanding customer experience versus a poor one – or losing the customer altogether. However, in the typical enterprise, only a small team has the core skills needed to gain access to and create value from streams of data.
Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.
Stephanie Lane, Wenjing Zheng, Mihir Tendulkar Source credit: Netflix Within the rapid expansion of data-related roles in the last decade, the title Data Scientist has emerged as an umbrella term for myriad skills and areas of business focus. What does this title mean within a given company, or even within a given industry? It can be hard to know from the outside.
Summary The team at DoorDash has a complex set of optimization challenges to deal with using data that they collect from a multi-sided marketplace. In order to handle the volume and variety of information that they use to run and improve the business the data team has to build a platform that analysts and data scientists can use in a self-service manner.
Today, every company is a data company. There are many different data pipeline, integration, and ingestion tools in the market, but before you can feed your data analytics needs, data […].
As the auto sector transforms, vehicle data is becoming one of the most important sources of insight. But if it is left in fragmented silos, it quickly becomes a cost & delivers little value.
Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?
We announced the winners of the 2021 Cloudera Partner Awards at our Partner Sales Kickoff. These six awards recognize Cloudera partners who are dedicated to enabling customers to do more with their data by leveraging the power of an enterprise data cloud. Thank you to this year’s winners for their partnership in helping our joint customers drive value from their data in the hybrid cloud.
Written by Anton Margoline, Avinash Dathathri, Devang Shah and Murthy Parthasarathi. Credit to Netflix Studio’s Product, Design, and Content Hub Engineering teams, along with all of the supporting partner and platform teams. In this post, we will share a behind-the-scenes look at how Netflix delivers technology and infrastructure to help production crews create and exchange media during the production and post-production stages.
Summary A majority of the time spent in data engineering is copying data between systems to make the information available for different purposes. This introduces challenges such as keeping information synchronized, managing schema evolution, and building transformations to match the expectations of the destination systems. H.O. Maycotte was faced with these same challenges but at a massive scale, leading him to question whether there is a better way.
In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!
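Dynamic task mapping, one of the features the webinar covers, lets a DAG fan out over data discovered at runtime. A minimal sketch (assuming Airflow 2.3 or later; the file names and task bodies are hypothetical):

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def mapped_pipeline():
    @task
    def list_files():
        # In practice this might list objects in S3; hardcoded here.
        return ["a.csv", "b.csv", "c.csv"]

    @task
    def process(file_name: str):
        print(f"processing {file_name}")

    # expand() creates one process task instance per file at runtime,
    # so the DAG's width adapts to the data it finds.
    process.expand(file_name=list_files())

mapped_pipeline()
```

Because the mapping happens at run time rather than parse time, the same DAG handles three files today and three hundred tomorrow without any code change.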