Top Data Engineering Digest Data Schemas Data Cleanse Content for January, 2021

January, 2021

Data-driven 2021: Predictions for a new year in data, analytics and AI

DataKitchen

JANUARY 4, 2021

The post Data-driven 2021: Predictions for a new year in data, analytics and AI first appeared on DataKitchen.

Data Analytics

Data Analytics Data

Confluent and Microsoft Announce Strategic Alliance

Confluent

JANUARY 26, 2021

Today, I am thrilled to announce a new strategic alliance with Microsoft to enable a seamless, integrated experience between Confluent Cloud and the Azure platform. This represents a significant milestone […].

Cloud

Cloud Programming

Join 37,000+

Insiders

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Webinars

Agent Tooling: Connecting AI to Your Tools, Systems & Data

How to Modernize Manufacturing Without Losing Control

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

MORE WEBINARS

Trending Sources

Building a Machine Learning Application With Cloudera Data Science Workbench And Operational Database, Part 1: The Set-Up & Basics

Cloudera

JANUARY 6, 2021

Introduction. Python is used extensively among Data Engineers and Data Scientists to solve all sorts of problems from ETL/ELT pipelines to building machine learning models. Apache HBase is an effective data storage system for many workflows but accessing this data specifically through Python can be a struggle. For data professionals that want to make use of data stored in HBase the recent upstream project “hbase-connectors” can be used with PySpark for basic operations.

Machine Learning

Machine Learning Data Science Database Building

Webinars

Agent Tooling: Connecting AI to Your Tools, Systems & Data

How to Modernize Manufacturing Without Losing Control

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

MORE WEBINARS

Improving Population Health Through Citizen 360

Teradata

JANUARY 5, 2021

By leveraging data to create a 360 degree view of its citizenry, government agencies can create more optimal experiences & improve outcomes such as closing the tax gap or improving quality of care.

Government

Government IT Data

A Guide to Debugging Apache Airflow® DAGs

In Airflow, DAGs (your data pipelines) support nearly every use case. As these workflows grow in complexity and scale, efficiently identifying and resolving issues becomes a critical skill for every data engineer. This is a comprehensive guide with best practices and examples to debugging Airflow DAGs. You’ll learn how to: Create a standardized process for debugging to quickly diagnose errors in your DAGs Identify common issues with DAGs, tasks, and connections Distinguish between Airflow-relate

Data Pipeline

How to update millions of records in MySQL?

Start Data Engineering

JANUARY 30, 2021

Introduction Setup Problems with a single large update Updating in batches Conclusion Further reading Introduction When updating a large number of records in an OLTP database, such as MySQL, you have to be mindful about locking the records. If those records are locked, they will not be editable(update or delete) by other transactions on your database.

MySQL

MySQL Database

Why do you need Agile Software Development for your Data Use Case ?

François Nguyen

JANUARY 29, 2021

The agile part of DataOps In this previous article, we have defined dataops as a”combination of tools and methods inspired by Agile, Devops and Lean Manufacturing » (thanks to DataKitchen for this definition). Let’s focus on the agile part and why it is so relevant for your data use cases. “Data is like a box of chocolates, you never know what you’re gonna get.” It is about the nature of data : you cannot guess what will be the content and the quality of your data sources

Manufacturing

Manufacturing Data Analysis Data Building

Job conversion possibilities within Data Science

Team Data Science

JANUARY 14, 2021

Data science encompasses a range of fields, like data analysis, machine learning, statistics, computer science, infrastructure, and data architecture, and looking at how businesses are transforming on a day-to-day basis, we may infer that some data science jobs will be in high demand within the next ten years, there is a strong need for experts who understand the market demands, who can formulate a data-driven approach and then execute the way out.

Data Science

Data Science Computer Science Data Engineer Data Engineering

More Trending

Job conversion possibilities within Data Science

Team Data Science

JANUARY 14, 2021

Data Science

Data Science Computer Science Data Engineer Data Engineering

Helpful Tools for Apache Kafka Developers

Confluent

JANUARY 20, 2021

Apache Kafka® is at the core of a large ecosystem that includes powerful components, such as Kafka Connect and Kafka Streams. This ecosystem also includes many tools and utilities that […].

Kafka

Kafka Utilities

Building a Machine Learning Application With Cloudera Data Science Workbench And Operational Database, Part 3: Productionization of ML models

Cloudera

JANUARY 20, 2021

In this last installment, we’ll discuss a demo application that uses PySpark.ML to make a classification model based off of training data stored in both Cloudera’s Operational Database (powered by Apache HBase) and Apache HDFS. Afterwards, this model is then scored and served through a simple Web Application. For more context, this demo is based on concepts discussed in this blog post How to deploy ML models to production.

Machine Learning

Machine Learning Database Data Science Building

Optimizing the Aural Experience on Android Devices with xHE-AAC

Netflix Tech

JANUARY 22, 2021

By Phill Williams and Vijay Gondi Introduction At Netflix, we are passionate about delivering great audio to our members. We began streaming 5.1 channel surround sound in 2010, Dolby Atmos in 2017 , and adaptive bitrate audio in 2019. Continuing in this tradition, we are proud to announce that Netflix now streams Extended HE-AAC with MPEG-D DRC ( xHE-AAC ) to compatible Android Mobile devices (Android 9 and newer).

Metadata

Metadata Programming Algorithm Media

How to unit test sql transforms in dbt

Start Data Engineering

JANUARY 16, 2021

Introduction Setup Code Conditional logic to read from mock input Custom macro to test for equality Setup environment specific test Run ELT using dbt Conclusion Further reading Introduction With the recent advancements in data warehouses and tools like dbt most transformations(T of ELT) are being done directly in the data warehouse. While this provides a lot of functionality out of the box, it gets tricky when you want to test your sql code locally before deploying to production.

SQL

SQL Data Warehouse Coding IT

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Speaker: Tamara Fingerlin, Developer Advocate

Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.

Data

The very strange way of doing Data Quality at Airbnb

François Nguyen

JANUARY 23, 2021

or why you should have a look at Data Observability ! This article is the second part on how Airbnb is managing data quality : “Part 2 — A New Gold Standard” The first part can be found here and it was just good principles about roles and responsabilities. The second part is really how they do it and all the steps to have a “certification” They name it Midas, the famous king who can turn everything into gold ( with a not so good ending ).

Finance

Finance Certification Data Pipeline Data

Skills you should have as a Data Engineer

Team Data Science

JANUARY 8, 2021

Big Data has become the dominant innovation in all high-performing companies. Notable businesses today focus their decision-making capabilities on knowledge gained from the study of big data. Big Data is a collection of large data sets, particularly from new sources, providing an array of possibilities for those who want to work with data and are enthusiastic about unraveling trends in rows of new, unstructured data.

Data Engineer

Data Engineer Data Engineering Engineering Unstructured Data

Property Based Testing Confluent Server Storage for Fun and Safety

Confluent

JANUARY 12, 2021

Confluent uses property-based testing to test various aspects of Confluent Server’s Tiered Storage feature. Tiered Storage shifts data from expensive local broker disks to cheaper, scalable object storage, thereby reducing […].

Data

Digital Transformation is a Data Journey From Edge to Insight

Cloudera

JANUARY 20, 2021

Digital transformation is a hot topic for all markets and industries as it’s delivering value with explosive growth rates. Consider that Manufacturing’s Industry Internet of Things (IIOT) was valued at $161b with an impressive 25% growth rate, the Connected Car market will be valued at $225b by 2027 with a 17% growth rate, or that in the first three months of 2020, retailers realized ten years of digital sales penetration in just three months.

Manufacturing

Manufacturing Data Warehouse Kafka Retail

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage

There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.

Systems

Making It Easier To Stick B2B Data Integration Pipelines Together With Hotglue

Data Engineering Podcast

JANUARY 25, 2021

Summary Businesses often need to be able to ingest data from their customers in order to power the services that they provide. For each new source that they need to integrate with it is another custom set of ETL tasks that they need to maintain. In order to reduce the friction involved in supporting new data transformations David Molot and Hassan Syyid built the Hotlue platform.

Data Integration

Data Integration IT Python BI

How to Backfill a SQL query using Apache Airflow

Start Data Engineering

JANUARY 6, 2021

What is backfilling ? Setup Prerequisites Apache Airflow - Execution Day Backfill Conclusion Further Reading References What is backfilling ? Backfilling refers to any process that involves modifying or adding new data to existing records in a dataset. This is a common use case in data engineering. Some examples can be a change in some business logic may need to be applied to an already processed dataset.

SQL

SQL Datasets Data Engineer Data Engineering

The last (but not least)”ops” you need for your data : DataGovops

François Nguyen

JANUARY 18, 2021

To finish the trilogy (Dataops, MLops), let’s talk about DataGovOps or how you can support your Data Governance initiative. The origin of the term : Datakitchen We must give credit to Chris Bergh and his team DataKictchen. You should visit their website , you will find incredible good stuff there. This article was published in October 2020 with this title : “Data Governance as Code” The idea behind that is you should “actively promotes the safe use of data with automation

Data Governance

Data Governance Metadata Government Data Pipeline

Big Data in Retail & CPG Requires a Scalpel, Not an Axe

Teradata

JANUARY 24, 2021

To satisfy the evolving demands of customers, Chief Commercial Officers need to wield big data in Retail & CPG using a precise scalpel rather than a blunt axe.

Retail

Retail Big Data Data

How to Modernize Manufacturing Without Losing Control

Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives

Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri

Manufacturing

Implementing mTLS and Securing Apache Kafka at Zendesk

Confluent

JANUARY 7, 2021

At Zendesk, Apache Kafka® is one of our foundational services for distributing events among different internal systems. We have pods, which can be thought of as isolated cloud environments where […].

Kafka

Kafka Cloud Systems

New Applied ML Research: Few-shot Text Classification

Cloudera

JANUARY 7, 2021

Text classification is a ubiquitous capability with a wealth of use cases. For example, recommendation systems rely on properly classifying text content such as news articles or product descriptions in order to provide users with the most relevant information. Classifying user-generated content allows for more nuanced sentiment analysis. And in the world of e-commerce, assigning product descriptions to the most fitting product category ensures quality control. .

Machine Learning

Machine Learning Algorithm Deep Learning Designing

Using Your Data Warehouse As The Source Of Truth For Customer Data With Hightouch

Data Engineering Podcast

JANUARY 18, 2021

Summary The data warehouse has become the central component of the modern data stack. Building on this pattern, the team at Hightouch have created a platform that synchronizes information about your customers out to third party systems for use by marketing and sales teams. In this episode Tejas Manohar explains the benefits of sourcing customer data from one location for all of your organization to use, the technical challenges of synchronizing the data to external systems with varying APIs, and

Data Warehouse

Data Warehouse BI Data Data Engineer

How to do Change Data Capture (CDC), using Singer

Start Data Engineering

JANUARY 1, 2021

Introduction Why Change Data Capture Setup Prerequisites Source setup Destination setup Source, MySQL CDC, MySQL => PostgreSQL Pros and Cons Pros Cons Conclusion References Introduction Change data capture is a software design pattern used to track every change(update, insert, delete) to the data in a database. In most databases these types of changes are added to an append only log (Binlog in MySQL, Write Ahead Log in PostgreSQL).

PostgreSQL

PostgreSQL MySQL Database Data

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

Cloud

The Business Case for DataOps

DataKitchen

JANUARY 6, 2021

Savvy executives maximize the value of every budgeted dollar. Decisions to invest in new tools and methods must be backed up with a strong business case. As data professionals, we know the value and impact of DataOps: streamlining analytics workflows, reducing errors, and improving data operations transparency. Being able to quantify the value and impact helps leadership understand the return on past investments and supports alignment with future enterprise DataOps transformation initiatives.

Pharmaceutical

Pharmaceutical Consulting Utilities Programming

Digital Payments: An Explosion of Emerging Opportunities

Teradata

JANUARY 3, 2021

Digital payments generate 90% of financial institutions’ useful customer data. How can they exploit its value? Find out more.

IT Data

Apache Kafka Meets Table Football

Confluent

JANUARY 29, 2021

What happens when you let engineers go wild building an application to track the score of a table football game? This blog post shares SPOUD’s story of engineering a simple […].

Kafka

Kafka Engineering Building

Fostering community to help drive cultural change

Cloudera

JANUARY 18, 2021

2020 put on full display how humanity shows up in times of hardship. We saw everything from street celebrations to usher weary medical personnel home after long days fighting to save lives to places like food banks receiving more donations and volunteers than ever before. Some communities were harder hit than others, and we’ve seen the same in the global workplace.

Food

Food Medical Banking Programming

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Business Intelligence

Enabling Version Controlled Data Collaboration With TerminusDB

Data Engineering Podcast

JANUARY 11, 2021

Summary As data professionals we have a number of tools available for storing, processing, and analyzing data. We also have tools for collaborating on software and analysis, but collaborating on data is still an underserved capability. Gavin Mendel-Gleason encountered this problem first hand while working on the Sesshat databank, leading him to create TerminusDB and TerminusHub.

PostgreSQL

PostgreSQL Python Computer Science Data Lake

How To Convert 100% of Your Proofs of Concept into Happy Customers

Monte Carlo

JANUARY 28, 2021

I’ve never done Sales before. Most of my professional history was spent working for a Customer Success company, Gainsight, where I provided the data, insights, and tools for our customer-facing teams. At Gainsight, my bonus was tied to 1) verified customer outcomes, 2) renewal rate, and 3) customer advocacy. I learned to be maniacally customer-focused.

BI Machine Learning Engineering Data

Do You Need a DataOps Dojo?

DataKitchen

JANUARY 20, 2021

As DataOps activity takes root within an enterprise, managers face the question of whether to build centralized or decentralized DataOps capabilities. Centralizing analytics brings it under control but granting analysts free reign is necessary to foster innovation and stay competitive. The beauty of DataOps is that you don’t have to choose between centralization and freedom.

Education

Education Coding Project Engineering

Six Crucial Refinements to Conventional Wisdom About Data Strategy

Teradata

JANUARY 28, 2021

This post shines a spotlight on the difference between what you already know & the more specific guidance you really need to create a successful data strategy.

Data

The Ultimate Guide to Apache Airflow DAGS

With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You’ll learn how to: Understand the building blocks DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to Write DAGs that adapt to your data at runtime and set up alerts and notifications Scale you

Data Engineering

January, 2021

Data-driven 2021: Predictions for a new year in data, analytics and AI

Confluent and Microsoft Announce Strategic Alliance

Webinars

Trending Sources

Building a Machine Learning Application With Cloudera Data Science Workbench And Operational Database, Part 1: The Set-Up & Basics

Webinars

Improving Population Health Through Citizen 360

A Guide to Debugging Apache Airflow® DAGs

How to update millions of records in MySQL?

Why do you need Agile Software Development for your Data Use Case ?

Job conversion possibilities within Data Science

Sign up to get articles personalized to your interests!

More Trending

Job conversion possibilities within Data Science

Helpful Tools for Apache Kafka Developers

Building a Machine Learning Application With Cloudera Data Science Workbench And Operational Database, Part 3: Productionization of ML models

Optimizing the Aural Experience on Android Devices with xHE-AAC

How to unit test sql transforms in dbt

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

The very strange way of doing Data Quality at Airbnb

Skills you should have as a Data Engineer

Property Based Testing Confluent Server Storage for Fun and Safety

Digital Transformation is a Data Journey From Edge to Insight

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Making It Easier To Stick B2B Data Integration Pipelines Together With Hotglue

How to Backfill a SQL query using Apache Airflow

The last (but not least)”ops” you need for your data : DataGovops

Big Data in Retail & CPG Requires a Scalpel, Not an Axe

How to Modernize Manufacturing Without Losing Control

Implementing mTLS and Securing Apache Kafka at Zendesk

New Applied ML Research: Few-shot Text Classification

Using Your Data Warehouse As The Source Of Truth For Customer Data With Hightouch

How to do Change Data Capture (CDC), using Singer

Optimizing The Modern Developer Experience with Coder

The Business Case for DataOps

Digital Payments: An Explosion of Emerging Opportunities

Apache Kafka Meets Table Football

Fostering community to help drive cultural change

15 Modern Use Cases for Enterprise Business Intelligence

Enabling Version Controlled Data Collaboration With TerminusDB

How To Convert 100% of Your Proofs of Concept into Happy Customers

Do You Need a DataOps Dojo?

Six Crucial Refinements to Conventional Wisdom About Data Strategy

The Ultimate Guide to Apache Airflow DAGS

Stay Connected