Top Data Engineering Digest Data Programming Cloud Storage Content for September, 2022

September, 2022

How to Correctly Select a Sample From a Huge Dataset in Machine Learning

KDnuggets

SEPTEMBER 28, 2022

We explain how choosing a small, representative dataset from a large population can improve model training reliability.
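The teaser does not say which sampling method the article uses. As an illustration only, the sketch below assumes stratified random sampling, one common way to keep a small sample representative: it draws the same fraction from every class so that rare classes are not lost. The function name `stratified_sample` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Draw roughly `fraction` of the rows from each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    sample = []
    for members in groups.values():
        # Keep at least one row per group so rare classes survive.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical imbalanced dataset: 900 rows of class "a", 100 of class "b".
rows = [("a", i) for i in range(900)] + [("b", i) for i in range(100)]
sample = stratified_sample(rows, key=lambda r: r[0], fraction=0.1)
```

The sample keeps the 9:1 class ratio of the full dataset, which a plain random draw would only approximate.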

Datasets

Datasets Machine Learning

Data Lake / Lakehouse Guide: Powered by Data Lake Table Formats (Delta Lake, Iceberg, Hudi)

Simon Späti

SEPTEMBER 30, 2022

Ever wanted or been asked to build an open-source Data Lake offloading data for analytics? Asked yourself what components and features that would include? Didn’t know the difference between a Data Lakehouse and a Data Warehouse? Or you just wanted to govern your hundreds to thousands of files and have more database-like features but don’t know how?

Data Lake

Data Lake Data Warehouse Government Data

Webinars

Agent Tooling: Connecting AI to Your Tools, Systems & Data

How to Modernize Manufacturing Without Losing Control

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Trending Sources

Airflow Taskflow API: The Guide

Marc Lamberti

SEPTEMBER 18, 2022

Airflow Taskflow is a new way of writing DAGs with ease. As you will see, you need to write fewer lines than before to obtain the same DAG. That helps to make DAGs easier to build, read, and maintain. The Taskflow API has three main aspects: XCom Args, decorators, and XCom backends. In this tutorial, you will learn what the Taskflow API is, why it is crucial for you, and how to create your DAGs.

SQL

SQL Python Coding Accessible

Building Data Pipelines That Run From Source To Analysis And Activation With Hevo Data

Data Engineering Podcast

SEPTEMBER 11, 2022

Summary: Any business that wants to understand its operations and customers through data requires some form of pipeline. Building reliable data pipelines is a complex and costly undertaking with many layered requirements. To reduce the time and effort required to build pipelines that power critical insights, Manish Jethani co-founded Hevo Data.

Data Pipeline

Data Pipeline Building MongoDB MySQL

A Guide to Debugging Apache Airflow® DAGs

In Airflow, DAGs (your data pipelines) support nearly every use case. As these workflows grow in complexity and scale, efficiently identifying and resolving issues becomes a critical skill for every data engineer. This is a comprehensive guide to debugging Airflow DAGs, with best practices and examples. You’ll learn how to: create a standardized process for debugging to quickly diagnose errors in your DAGs; identify common issues with DAGs, tasks, and connections; distinguish between Airflow-relate

Data Pipeline

Real-Time Gaming Infrastructure for Millions of Users with Apache Kafka, ksqlDB, and WebSockets

Confluent

SEPTEMBER 14, 2022

How gaming enterprises like Sony and Big Fish Games use Apache Kafka®, Confluent, and ksqlDB’s data streaming technologies for the best in-game experience, ROI, and real-time capabilities.

Kafka

Kafka Technology Data

Top 10 Globally Recognized Certifications for Cyber Security

U-Next

SEPTEMBER 27, 2022

Introduction: Cybersecurity, or computer security and information security, is the act of preventing theft, damage, loss, or unauthorized access to computers, networks, and data. As our interconnections grow, so do the chances for evil hackers to steal, destroy, or disrupt our lives. The increase in cybercrime has increased the demand for cybersecurity expertise.

Certification

Certification Consulting Computer Science Government

Become an AI Artist Using Phraser and Stable Diffusion

KDnuggets

SEPTEMBER 28, 2022

Generate the prompt using Phraser and create realistic art using the Diffusion model.

More Trending

The Rise of the Semantic Layer

Simon Späti

SEPTEMBER 29, 2022

A semantic layer is something we use every day. We build dashboards with yearly and monthly aggregations. We design dimensions for drilling down reports by region, product, or whatever metrics we are interested in. What has changed is that we no longer use a singular business intelligence tool; different teams use different visualizations (BI, notebooks, and embedded analytics).

BI Business Intelligence Designing Building

Large Scale Industrialization Key to Open Source Innovation

Cloudera

SEPTEMBER 7, 2022

We are now well into 2022 and the megatrends that drove the last decade in data — The Apache Software Foundation as a primary innovation vehicle for big data, the arrival of cloud computing, and the debut of cheap distributed storage — have now converged and offer clear patterns for competitive advantage for vendors and value for customers. Cloudera has been parlaying those patterns into clear wins for the community at large and, more importantly, streamlining the benefits of that innovation to

Big Data Ecosystem

Big Data Ecosystem Hadoop Big Data Architecture

Build A Common Understanding Of Your Data Reliability Rules With Soda Core and Soda Checks Language

Data Engineering Podcast

SEPTEMBER 25, 2022

Summary: Regardless of how data is being used, it is critical that the information is trusted. The practice of data reliability engineering has gained momentum recently to address that need. To help support the efforts of data teams, the folks at Soda Data created the Soda Checks Language and the corresponding Soda Core utility that acts on this new DSL.

Building

Building Metadata MongoDB MySQL

Keeping Multiple Databases in Sync Using Kafka Connect and CDC

Confluent

SEPTEMBER 20, 2022

Microservices have numerous benefits, but data silos are incredibly challenging. Learn how Kafka Connect and CDC provide real-time database synchronization, bridging data silos between all microservice applications.

Kafka

Kafka Database Data

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Speaker: Tamara Fingerlin, Developer Advocate

Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.

Data

Rejoice! The Vantage Analytics and Data Platform Provide Incredible Power for All in a “Cloudy” Environment

Teradata

SEPTEMBER 30, 2022

With the release of VantageCloud Lake and ClearScape Analytics, Teradata brings a cloud-native architecture to extend the technical innovations and differentiators that Vantage is well known for.

Architecture

Architecture Cloud Data

The Mistake Every Data Scientist Has Made at Least Once

KDnuggets

SEPTEMBER 22, 2022

How to increase your chances of avoiding the mistake.

Data

Data Data Science

Data Governance and Strategy for the Global Enterprise

Cloudera

SEPTEMBER 23, 2022

In a recent blog, Cloudera Chief Technology Officer Ram Venkatesh described the evolution of a data lakehouse, as well as the benefits of using an open data lakehouse, especially the open Cloudera Data Platform (CDP). If you missed it, you can read up about it here. Modern data lakehouses are typically deployed in the cloud. Cloud computing brings several distinct advantages that are core to the lakehouse value proposition.

Data Governance

Data Governance Government Amazon Web Services Cloud Computing

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage

There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.

Systems

Power Your Real-Time Analytics Without The Headache Using Fivetran's Change Data Capture Integrations

Data Engineering Podcast

SEPTEMBER 25, 2022

Summary: Data integration from source systems to their downstream destinations is the foundational step for any data product. The increasing expectation for information to be instantly accessible drives the need for reliable change data capture. The team at Fivetran have recently introduced that functionality to power real-time data products.

Food

Food MongoDB MySQL Scala

Excited to be back at Google Cloud Next 2022!

Confluent

SEPTEMBER 28, 2022

Highlighting sessions on the power of our Confluent-Google partnership: multi-layer data security, real-time cloud data streaming and analytics, database modernization, and more.

Google Cloud

Google Cloud Cloud Data Security Database

What Do You Want to be Famous for?

Teradata

SEPTEMBER 1, 2022

Financial services organizations that exhibit true data literacy avoid bottlenecks and instead choose to build best in class solutions that meet current and future needs. Find out more.

Building

Building Data

How to Modernize Manufacturing Without Losing Control

Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives

Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri

Manufacturing

Timestone: Netflix’s High-Throughput, Low-Latency Priority Queueing System with Built-in Support…

Netflix Tech

SEPTEMBER 29, 2022

Timestone: Netflix’s High-Throughput, Low-Latency Priority Queueing System with Built-in Support for Non-Parallelizable Workloads, by Kostas Christidis. Introduction: Timestone is a high-throughput, low-latency priority queueing system we built in-house to support the needs of Cosmos, our media encoding platform. Over the past 2.5 years, its usage has increased, and Timestone is now also the priority queueing engine backing Conductor, our general-purpose workflow orchestration engine, and BDP Sch

Systems

Systems Metadata Media Kafka

The Modern Data Lakehouse: An Architectural Innovation

Cloudera

SEPTEMBER 9, 2022

The promise of a modern data lakehouse architecture. Imagine having self-service access to all business data, anywhere it may be, and being able to explore it all at once. Imagine quickly answering burning business questions nearly instantly, without waiting for data to be found, shared, and ingested. Imagine independently discovering rich new business insights from both structured and unstructured data working together, without having to beg for data sets to be made available.

Architecture

Architecture Metadata Machine Learning Unstructured Data

Operational Analytics To Increase Efficiency For Multi-Location Businesses With OpsAnalitica

Data Engineering Podcast

SEPTEMBER 18, 2022

Summary: To improve efficiency in any business, you must first know what is contributing to wasted effort or missed opportunities. When your business operates across multiple locations, it becomes even more challenging and important to gain insights into how work is being done. In this episode, Tommy Yionoulis shares his experiences working in the service and hospitality industries and how that led him to found OpsAnalitica, a platform for collecting and analyzing metrics on multi location

Hospitality

Hospitality Food MongoDB MySQL

Event-Driven Microservices with Python and Apache Kafka

Confluent

SEPTEMBER 21, 2022

A deep dive into how microservices work, why they are the backbone of real-time applications, and how to build event-driven microservices applications with Python and Kafka.

Kafka

Kafka Python Building

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

Cloud

KonMari your data: Planning a query migration using the Marie Kondo method

dbt Developer Hub

SEPTEMBER 7, 2022

If you’ve ever heard of Marie Kondo, you’ll know she has an incredibly soothing and meditative method to tidying up physical spaces. Her KonMari Method is about categorizing, discarding unnecessary items, and building a sustainable system for keeping stuff. As an analytics engineer at your company, doesn’t that last sentence describe your job perfectly?!

Designing

Designing Data Project Coding

5 Concepts You Should Know About Gradient Descent and Cost Function

KDnuggets

SEPTEMBER 16, 2022

Why is Gradient Descent so important in Machine Learning? Learn more about this iterative optimization algorithm and how it is used to minimize a loss function.
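As a minimal sketch of the idea in this teaser (not the article's own code), the loop below repeatedly steps against the gradient of a cost function to approach its minimum. The cost function f(x) = (x - 3)^2, the learning rate, and the step count are assumptions chosen purely for illustration.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Iteratively move `x` against the gradient to reduce the cost."""
    x = start
    for _ in range(steps):
        # Step in the direction of steepest descent, scaled by the learning rate.
        x -= learning_rate * grad(x)
    return x


# Hypothetical cost f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
```

With these settings each step shrinks the distance to the minimum at x = 3 by a constant factor of 0.8, so `minimum` ends up very close to 3; a learning rate that is too large would make the iterates diverge instead.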

Machine Learning

Machine Learning Algorithm IT

7 Ways To Develop A Portfolio That Gets You Hired

U-Next

SEPTEMBER 30, 2022

“Show, don’t tell” is what people tell writers and screenwriters, but it applies just as well to aspirants who want to land their dream jobs. Apart from your university degree and professional certifications, what adds compelling weightage to your candidacy is a solid portfolio. A portfolio is like your business card and regardless of whether you are a fresher or someone experienced, moving up the corporate ladder, a portfolio is what will ensure you a job, that higher paycheck a

Portfolio

Portfolio Recruitment Certification Project

Improve Underwriting Using Data and Analytics

Cloudera

SEPTEMBER 22, 2022

Insurance carriers are always looking to improve operational efficiency. We’ve previously highlighted opportunities to improve digital claims processing with data and AI. In this post, I’ll explore opportunities to enhance risk assessment and underwriting, especially in personal lines and small and medium-sized enterprises. Underwriting is an area that can yield improvements by applying the old saying “work smarter, not harder.”

Insurance

Insurance Medical Machine Learning Data

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Business Intelligence

Building A Shared Understanding Of Data Assets In A Business Through A Single Pane Of Glass With Workstream

Data Engineering Podcast

SEPTEMBER 18, 2022

Summary: There is a constant tension in business data between growing silos and breaking them down. Even when a tool is designed to integrate information as a guard against data isolation, it can easily become a silo of its own, where you have to make a point of using it to seek out information. To help distribute critical context about data assets and their status into the locations where work is being done, Nicholas Freund co-founded Workstream.

Building

Building Metadata MongoDB MySQL

6 Ways Data Streaming is Transforming Financial Services

Confluent

SEPTEMBER 12, 2022

How banks and finance companies use Confluent to transform their digital systems with event-driven architecture, real-time payment processing, fraud detection, and analytics.

Banking

Banking Finance Architecture Data

The case against `git cherry pick`: Recommended branching strategy for multi-environment dbt projects

dbt Developer Hub

SEPTEMBER 12, 2022

Why do people cherry pick into upper branches? The simplest branching strategy for making code changes to your dbt project repository is to have a single main branch with your production-level code. To update the main branch, a developer will: (1) create a new feature branch directly from the main branch; (2) make changes on said feature branch; (3) test locally; (4) when ready, open a pull request to merge their changes back into the main branch. If you are just getting started in dbt and deciding which branchin

Project

Project Coding Process Cloud

5 Data Science Skills That Pay & 5 That Don’t

KDnuggets

SEPTEMBER 13, 2022

This article will go over the top 5 data science skills that pay you and 5 that don’t.

Data Science

Data Science Data

The Ultimate Guide to Apache Airflow DAGS

With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You’ll learn how to: understand the building blocks of DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to; write DAGs that adapt to your data at runtime and set up alerts and notifications; Scale you

Data Engineering

More Performance Evaluation Metrics for Classification Problems You Should Know
