Top Data Engineering Digest ETL Tools Generalist Content for June, 2022

June, 2022

Making Sense of CRISP-ML(Q): The Machine Learning Lifecycle Process

KDnuggets

JUNE 30, 2022

Learn about the standard process for building sustainable machine learning applications.

Machine Learning

Machine Learning Process Building

Data Orchestration Trends: The Shift From Data Pipelines to Data Products

Simon Späti

JUNE 20, 2022

Data consumers, such as data analysts and business users, care mostly about the production of data assets. Data engineers, on the other hand, have historically used an orchestrator to model the dependencies between tasks rather than between data assets. How can we reconcile both worlds? This article reviews open-source data orchestration tools (Airflow, Prefect, Dagster) and discusses how they introduce data assets as first-class objects.

Data Pipeline

Data Pipeline Data Data Engineering Data Engineer
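
The contrast between task dependencies and data assets can be made concrete with a small sketch. This is a plain-Python illustration under stated assumptions, not the API of Airflow, Prefect, or Dagster: the asset names (`raw_orders`, `order_total`, `order_report`) and the `materialize` helper are hypothetical. Each asset declares the assets it is built from, and the run order is derived from those declarations instead of being wired up task by task.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical assets: name -> (upstream asset names, function that builds the asset).
ASSETS = {
    "raw_orders": ((), lambda: [120, 80, 45]),
    "order_total": (("raw_orders",), lambda raw_orders: sum(raw_orders)),
    "order_report": (("order_total",), lambda order_total: f"total={order_total}"),
}


def materialize(assets):
    """Build every asset once, in dependency order, and return the values by name."""
    graph = {name: set(deps) for name, (deps, _) in assets.items()}
    values = {}
    for name in TopologicalSorter(graph).static_order():
        deps, compute = assets[name]
        # Pass the already-built upstream assets to the builder function.
        values[name] = compute(*(values[dep] for dep in deps))
    return values


results = materialize(ASSETS)
print(results["order_report"])
```

In this sketch the unit the user names and inspects is the asset, and the task graph is only an implementation detail derived from the asset declarations.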

Webinars

Agent Tooling: Connecting AI to Your Tools, Systems & Data

How to Modernize Manufacturing Without Losing Control

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Trending Sources

Azure Data Factory: New Monitoring View Features

Azure Data Engineering

JUNE 28, 2022

It is very easy to visually monitor previous pipeline runs using the Monitor page in Azure Data Factory, which we covered in a previous post. The monitoring view has recently been improved, and this post goes through the changes briefly. Data from the Azure Monitor view can be easily exported to CSV by clicking the newly added Export to CSV button.

Data

Data IT

Bring Geospatial Analytics Across Disparate Datasets Into Your Toolkit With The Unfolded Platform

Data Engineering Podcast

JUNE 26, 2022

Summary: The proliferation of sensors and GPS devices has dramatically increased the number of applications for spatial data and the need for scalable geospatial analytics. To reduce the friction involved in aggregating disparate data sets that share geographic similarities, the Unfolded team built a platform that supports working across raster, vector, and tabular data in a single system.

Datasets

Datasets Unstructured Data Metadata MongoDB

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Speaker: Tamara Fingerlin, Developer Advocate

Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.

Data

5 Steps to land a high paying data engineering job

Start Data Engineering

JUNE 24, 2022

1. Introduction
2. Steps
2.1. Choosing companies to work for
2.2. Optimizing your LinkedIn & resume
2.3. Landing interviews
2.4. Preparing for interviews
2.5. Offers & Negotiation
3. Conclusion
4. Further reading
5. Reference

1. Introduction: The data industry is booming, and data engineering salaries are skyrocketing. But landing a new job is not an easy task.

Data Engineering

Data Engineering Data Engineer Engineering Data

Dynamic Task Mapping in Apache Airflow

Marc Lamberti

JUNE 19, 2022

Dynamic Task Mapping is a new feature of Apache Airflow 2.3 that takes your DAGs to a new level. You can now create tasks dynamically without knowing in advance how many tasks you need. This feature is for you if you want to process a varying number of files, evaluate multiple machine learning models, or process a variable amount of data returned by a SQL query. Excited?

SQL

SQL Coding Machine Learning Python

24 SQL Questions You Might See on Your Next Interview

KDnuggets

JUNE 28, 2022

Preparing for the SQL job interview can be overwhelming enough. You don’t need someone telling you that you need to know everything on top of that! Be smart and focus on preparing the SQL questions that appear most often at the job interview.

SQL

More Trending

Azure Data Factory: Script Activity

Azure Data Engineering

JUNE 19, 2022

We discussed various ways of running custom SQL code in Azure Data Factory in a previous post. Recently, a new activity called Script Activity has been added to Azure Data Factory, providing a more flexible way of running custom SQL scripts. This activity supports execution of custom Data Query Language (DQL) statements as well as Data Definition Language (DDL) and Data Manipulation Language (DML) statements.

SQL

SQL Datasets Data Database
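
For readers unfamiliar with the three statement categories mentioned above, here is a minimal sketch of each. It uses Python's built-in `sqlite3` module against an in-memory database purely for illustration; it does not use or represent Azure Data Factory or its Script Activity, and the `runs` table and its rows are made up.

```python
import sqlite3

# In-memory database, used only to illustrate the statement categories.
conn = sqlite3.connect(":memory:")

# DDL (Data Definition Language): define the schema.
conn.execute("CREATE TABLE runs (pipeline TEXT, status TEXT)")

# DML (Data Manipulation Language): insert and update rows.
conn.executemany(
    "INSERT INTO runs (pipeline, status) VALUES (?, ?)",
    [
        ("load_sales", "Succeeded"),
        ("load_users", "Failed"),
        ("load_sales", "Succeeded"),
    ],
)
conn.execute("UPDATE runs SET status = 'Succeeded' WHERE pipeline = 'load_users'")

# DQL (Data Query Language): read the data back.
rows = conn.execute(
    "SELECT pipeline, COUNT(*) FROM runs GROUP BY pipeline ORDER BY pipeline"
).fetchall()
print(rows)

conn.close()
```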

Level Up Your Data Platform With Active Metadata

Data Engineering Podcast

JUNE 19, 2022

Summary: Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. A variety of platforms have been developed to capture and analyze that information to great effect, but they are inherently limited in their utility due to their nature as storage systems. In order to level up their value, a new trend of active metadata is being implemented, allowing use cases like keeping BI reports up to date, auto-scaling your warehouses, and automated data g

Metadata

Metadata MongoDB MySQL Scala

The Future Is Hybrid Data, Embrace It

Cloudera

JUNE 7, 2022

We live in a hybrid data world. In the past decade, the amount of structured data created, captured, copied, and consumed globally has grown from less than 1 ZB in 2011 to nearly 14 ZB in 2020. Impressive, but dwarfed by the amount of unstructured data, cloud data, and machine data – another 50 ZB. In fact, the total amount of data is expected to nearly triple by 2025.

IT Unstructured Data Data Architecture Government

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage

There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.

Systems

Modernizing a public health system with Teradata’s connected analytic architecture

Teradata

JUNE 30, 2022

How do you accelerate disease prevention and response? Teradata provides a response to help accelerate public health infrastructure modernization.

Analytics Architecture

Analytics Architecture Architecture Systems

Data Preparation with SQL Cheatsheet

KDnuggets

JUNE 27, 2022

If your raw data is in a SQL-based data lake, why spend the time and money to export the data into a new platform for data prep?

Data Preparation

Data Preparation SQL Raw Data Data Lake

Introducing the Current 2022 Program Committee

Confluent

JUNE 15, 2022

The committee will ensure Current has the best speakers from top companies in every industry, and cover all streaming data technologies.

Programming

Programming Technology Data

Azure Data Factory: Monitor Self Hosted Integration Runtime Metrics

Azure Data Engineering

JUNE 12, 2022

A self-hosted integration runtime, in the context of Azure Data Factory, is a gateway that connects on-prem data sources to datastores in the cloud. To learn more about integration runtimes, please refer to the previous post. We have discussed how to check whether an integration runtime is online or offline using a PowerShell command in a previous post. In today’s post, let’s have a look at how to monitor self-hosted integration runtime metrics such as CPU utilization, available memory, number of concu

Utilities

Utilities Cloud Data Process

How to Modernize Manufacturing Without Losing Control

Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives

Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri

Manufacturing

Combining The Simplicity Of Spreadsheets With The Power Of Modern Data Infrastructure At Canvas

Data Engineering Podcast

JUNE 19, 2022

Summary: Data analysis is a valuable exercise that is often out of reach of non-technical users as a result of the complexity of data systems. In order to lower the barrier to entry, Ryan Buick created the Canvas application with a spreadsheet-oriented workflow that is understandable to a wide audience. In this episode Ryan explains how he and his team have designed their platform to bring everyone onto a level playing field and the benefits that it provides to the organization.

Metadata

Metadata Unstructured Data MongoDB MySQL

Moving Enterprise Data From Anywhere to Any System Made Easy

Cloudera

JUNE 2, 2022

Since 2015, the Cloudera DataFlow team has been helping the largest enterprise organizations in the world adopt Apache NiFi as their enterprise standard data movement tool. Over the last few years, we have had a front-row seat in our customers’ hybrid cloud journey as they expand their data estate across the edge, on-premise, and multiple cloud providers.

Systems

Systems Data Lake Google Cloud Data Collection

Natively Connect Teradata QueryGrid to Google BigQuery

Teradata

JUNE 14, 2022

With the Teradata QueryGrid Google BigQuery Connector, we’re enabling our customers to natively join data between Vantage and BigQuery in real-time, at scale.

Data

Introducing Objectiv: Open-source product analytics infrastructure

KDnuggets

JUNE 21, 2022

Collect validated user behavior data that’s ready to model on without prepwork. Take models built on one dataset and deploy & run them on another.

Datasets

Datasets Data

The Ultimate Guide to Apache Airflow DAGS

With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You’ll learn how to: understand the building blocks of DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to; write DAGs that adapt to your data at runtime and set up alerts and notifications; scale you

Data Engineer

Machine Learning Metrics: How to Measure the Performance of a Machine Learning Model

AltexSoft

JUNE 16, 2022

Choosing the machine learning path when developing your software is half the success. Yes, it’s an advanced way of doing things. Yes, it brings automation, the much-discussed machine intelligence, and other awesome perks. But putting it there doesn’t guarantee that your project will do well and pay off. So, how would you measure the success of a machine learning model?

Machine Learning

Machine Learning Hospitality Retail Medical
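
As a rough illustration of what measuring model performance can look like, the sketch below computes four common metrics for a binary classifier in plain Python. The excerpt above does not say which metrics the article covers, so the choice of accuracy, precision, recall, and F1 is an assumption, and the labels and the `classification_metrics` helper are made up for the example.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up ground truth and predictions for eight examples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Which metric matters most depends on the cost of each kind of error, so a single number such as accuracy is rarely enough on its own.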

Autonomous Networks — The Telco and Media Growth Engine

Confluent

JUNE 16, 2022

How real-time integrations between modern and legacy systems benefit communication service providers with autonomous network features, enhanced customer experiences, and more.

Media

Media Engineering Systems Architecture

Discover And De-Clutter Your Unstructured Data With Aparavi

Data Engineering Podcast

JUNE 12, 2022

Summary: Unstructured data takes many forms in an organization. From a data engineering perspective that often means things like JSON files, audio or video recordings, images, etc. Another category of unstructured data that every business deals with is PDFs, Word documents, workstation backups, and countless other types of information. Aparavi was created to tame the sprawl of information across machines, datacenters, and clouds so that you can reduce the amount of duplicate data and save time an

Unstructured Data

Unstructured Data MongoDB MySQL Scala

Cloudera’s Applied ML Prototype Catalog Continues to Grow

Cloudera

JUNE 10, 2022

Here at Cloudera, we’re committed to helping make the lives of data practitioners as painless as possible. For data scientists, we continue to provide new Applied Machine Learning Prototypes (AMPs), which are open source and available on GitHub. These pre-built reference examples are complete end-to-end data science projects. In Cloudera Machine Learning (CML), you can deploy them with the single click of a button, bringing data scientists that much closer to providing value.

Machine Learning

Machine Learning Data Science Project Systems

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

Cloud

Operational excellence—data ensures airlines maintain the right trajectory

Teradata

JUNE 2, 2022

Learn how data and analytics can enable airlines to navigate towards more streamlined operations. Read more.

Data

Deep Learning Key Terms, Explained

KDnuggets

JUNE 13, 2022

Gain a beginner's perspective on artificial neural networks and deep learning with this set of 14 straight-to-the-point related key concept definitions.

Deep Learning

Deep Learning Machine Learning

How Netflix Content Engineering makes a federated graph searchable (Part 2)

Netflix Tech

JUNE 15, 2022

By Alex Hutter, Falguni Jhaveri, and Senthil Sayeebaba. In a previous post, we described the indexing architecture of Studio Search and how we scaled the architecture by building a config-driven self-service platform that allowed teams in Content Engineering to spin up search indices easily. This post will discuss how Studio Search supports querying the data available in these indices.

Engineering

Engineering Accessibility Accessible Architecture

How to Elastically Scale Apache Kafka Clusters on Confluent Cloud

Confluent

JUNE 7, 2022

How to elastically scale Kafka clusters from 0 to 100 MB/s and back with automatic cluster resizing, data rebalancing, real-time consumption optimization, and monitoring in seconds.

Kafka

Kafka Cloud Data

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Business Intelligence

Strategies And Tactics For A Successful Master Data Management Implementation

Data Engineering Podcast

JUNE 26, 2022

Summary: The most complicated part of data engineering is the effort involved in making the raw data fit into the narrative of the business. Master Data Management (MDM) is the process of building consensus around what the information actually means in the context of the business and then shaping the data to match those semantics. In this episode Malcolm Hawker shares his years of experience working in this domain to explore the combination of technical and social skills that are necessary to mak

Data Management

Data Management Management MongoDB MySQL

#ClouderaLife Spotlight: Hassan Mirza

Cloudera

JUNE 8, 2022

In this #ClouderaLife Spotlight Hassan talks about three life themes that have kept him moving and motivated: learning from his father’s work ethic despite his family’s forcible displacement from their country of origin, his early experience with organized sports, and the value of mentorship. Hassan describes how these experiences led him to give back to his family and community by becoming a Mental Health First Aider and a mentor for refugees seeking a better life.

Consulting

Consulting Recruitment Finance Certification

A Model Implementation

Teradata

JUNE 9, 2022

How do you take the first steps to free the power of analytics from on-premise systems whilst protecting valuable data and de-risking transformation? Find out more.

Systems

Systems Data

NLP, NLU, and NLG: What’s The Difference? A Comprehensive Guide

KDnuggets

JUNE 10, 2022

This article aims to quickly cover the similarities and differences between NLP, NLU, and NLG and talk about what the future for NLP holds.

Apache Airflow® Best Practices: DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

Data
