Introduction “Let’s containerize your code to ship worldwide!” If you read the quote above, you must be wondering what it all means. Well, my friend, this is what Docker is. Let me explain it with an example. Say Harish and Lisa are two people working on the same project but on two different systems (say, Windows and […] The post Getting Started with The Basics of Docker appeared first on Analytics Vidhya.
Originally published 2 February 2023. 👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. We cover one out of seven topics in today’s subscriber-only The Scoop issue. To get this newsletter every week, subscribe here. Apple was the first Big Tech giant to mandate a proper return to the office, and back in September 2022 this initiative was in full swing: it was being rolled out in the US, with 3 days per week in the office mandated in the UK.
It's time to start the 4th part of the Table file formats series. This time the topic will be Change Data Capture, i.e. how to stream all changes made to a table. As in the 3rd part, I'm going to start with Delta Lake.
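As a hedged illustration of what consuming those changes can look like (this is not the post's own code), here is a minimal PySpark sketch that reads Delta Lake's change data feed between two table versions; the table path and version numbers are hypothetical, and the table is assumed to have been created with delta.enableChangeDataFeed = true:

```python
from pyspark.sql import SparkSession

# Assumes the Delta Lake package is on the classpath and the session is Delta-enabled.
spark = SparkSession.builder.appName("delta-cdc-demo").getOrCreate()

# Read all changes committed between versions 1 and 5 of a hypothetical table.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)
    .option("endingVersion", 5)
    .load("/tmp/tables/my_delta_table")  # hypothetical path
)

# Each row carries extra columns describing the change that produced it.
changes.select("_change_type", "_commit_version", "_commit_timestamp").show()
```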
Speaker: Jason Chester, Director, Product Management
In today’s manufacturing landscape, staying competitive means moving beyond reactive quality checks and toward real-time, data-driven process control. But what does true manufacturing process optimization look like—and why is it more urgent now than ever? Join Jason Chester in this new, thought-provoking session on how modern manufacturers are rethinking quality operations from the ground up.
Introduction Big data is revolutionizing the healthcare industry and changing how we think about patient care. In this case, big data refers to the vast amounts of data generated by healthcare systems and patients, including electronic health records, claims data, and patient-generated data. With the ability to collect, manage, and analyze vast amounts of data, […] The post The Impact of Big Data on Healthcare Decision Making appeared first on Analytics Vidhya.
Delivering the data news (credits) Hey you, it's already February. Every week it's the same analysis for me: I plan too many tasks but deliver slowly. I guess that's how it is. Still, I love this Friday rendezvous we have together. I'm still amazed by how I changed my old habits to add writing to my workflow. And it brings me a lot of joy.
Introduction Azure Functions is a serverless computing service provided by Azure that gives users a platform to write code that runs in response to a variety of events, without having to provision or manage infrastructure. Whether we are analyzing IoT data streams, managing scheduled events, processing document uploads, or responding to database changes, Azure Functions allows developers […] The post How to Develop Serverless Code Using Azure Functions?
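As a rough illustration of the programming model (not taken from the post), here is a minimal Python sketch of an HTTP-triggered Azure Function using the classic model; the function.json binding configuration that normally sits next to it is omitted:

```python
import logging

import azure.functions as func


def main(req: func.HttpRequest) -> func.HttpResponse:
    """HTTP-triggered function: echoes back an optional 'name' query parameter."""
    logging.info("Processing an HTTP request.")
    name = req.params.get("name", "world")
    return func.HttpResponse(f"Hello, {name}!", status_code=200)
```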
I wrote my first Apache Spark on Kubernetes blog post in 2018, trying to answer the question: what can Kubernetes bring to Apache Spark? Four years later this resource manager is a mature Spark component, but a new question has arisen in my head. Should I stay on YARN or switch to Kubernetes?
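For a rough idea of what the Kubernetes side can look like (a minimal sketch, not from the post), here is a PySpark session pointed at a Kubernetes API server in client mode; the endpoint and container image are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://my-k8s-apiserver:6443")  # hypothetical API server address
    .appName("yarn-vs-k8s-demo")
    # Executors run as pods built from this (hypothetical) Spark image.
    .config("spark.kubernetes.container.image", "my-registry/spark-py:3.3.0")
    .config("spark.executor.instances", "2")
    .getOrCreate()
)

print(spark.range(10).count())
```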
Can ChatGPT provide answers to data science questions to the same standard as humans? Check out this attempt to do so, and compare the answers to those from experts.
ETL and ELT are some of the most common data engineering use cases, but can come with challenges like scaling, connectivity to other systems, and dynamically adapting to changing data sources. Airflow is specifically designed for moving and transforming data in ETL/ELT pipelines, and new features in Airflow 3.0 like assets, backfills, and event-driven scheduling make orchestrating ETL/ELT pipelines easier than ever!
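To make the ETL/ELT pattern concrete, here is a minimal, hedged TaskFlow-style DAG sketch; it uses a plain daily schedule rather than the new 3.0 assets or event-driven triggers, and the extract/transform/load bodies are placeholders:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def simple_etl():
    @task
    def extract() -> list[dict]:
        # Placeholder source; a real pipeline would pull from an API or database.
        return [{"id": 1, "value": 10}, {"id": 2, "value": 20}]

    @task
    def transform(rows: list[dict]) -> list[dict]:
        # Trivial transformation for illustration.
        return [{**row, "value": row["value"] * 2} for row in rows]

    @task
    def load(rows: list[dict]) -> None:
        # Stand-in for a warehouse load.
        print(f"Loading {len(rows)} rows")

    load(transform(extract()))


simple_etl()
```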
Recently several consulting calls started with people asking, “Do we need a data warehouse?” This isn’t a question about whether you need data warehouse consultants, but instead whether you should even start a data warehouse project. Which is a very fair question. Not every company needs a data warehouse. That being said, data warehouses can… Read more The post Do You Need A Data Warehouse – A Quick Guide appeared first on Seattle Data Guy.
Introduction In today’s world, machine learning and artificial intelligence are widely used in almost every sector to improve performance and results. But are they still useful without data? The answer is no. Machine learning algorithms rely heavily on the data we feed them. The quality of the data we feed to the algorithms […] The post Practicing Machine Learning with Imbalanced Dataset appeared first on Analytics Vidhya.
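As a small, hedged illustration of one common remedy for imbalance (class weighting in scikit-learn, which is not necessarily the approach the post takes), consider:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced data (roughly 95% / 5%), purely illustrative.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# class_weight="balanced" reweights the loss so the minority class is not ignored.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
```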
Four months is a huge period of time in cloud history, even when 2 of the 4 months are the usual "holiday" months. As you can guess from the title, it's time to see what has changed recently in the cloud from a data engineering perspective!
Learn the basics of machine learning, including classification, SVM, decision tree learning, neural networks, convolutional neural networks, boosting, and K nearest neighbors.
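For a quick taste of a few of those techniques, here is a small scikit-learn sketch that fits an SVM, a decision tree, and a K nearest neighbors classifier on a toy dataset (illustrative only):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [
    ("SVM", SVC()),
    ("Decision tree", DecisionTreeClassifier(random_state=0)),
    ("KNN", KNeighborsClassifier(n_neighbors=5)),
]:
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```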
Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.
Summary Business intelligence has gone through many generational shifts, but each generation has largely maintained the same workflow. Data analysts create reports that the business uses to understand and direct its operations, but the process is very labor- and time-intensive. The team at Omni has taken a new approach by automatically building models based on the queries that are executed.
Introduction We are all aware of the Internet’s explosive expansion as a primary source of information and a platform for opinion expression. It has now become essential to gather and analyze the ever-expanding data that follows. While in the past, manual analysis of data has been possible and even served us well, the same cannot […] The post Top 10 Applications of Sentiment Analysis in Business appeared first on Analytics Vidhya.
Pushdowns in Apache Spark are a great way to delegate some operations to the data sources and reduce the data volume to be processed in the job. However, there is one important gotcha. Watch out for the definition of your predicate, because from time to time, even though the pushdown predicate is supported by the data source, the predicate can still be executed by the Apache Spark job!
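A hedged PySpark sketch of the gotcha, with a hypothetical Parquet dataset: a plain column comparison can usually be pushed down to the reader, while wrapping the column in a function often forces Spark to evaluate the filter itself. The PushedFilters section of the physical plan shows the difference:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()
df = spark.read.parquet("/tmp/datasets/orders")  # hypothetical dataset

# A direct comparison on a column is typically pushed to the Parquet reader.
pushed = df.filter(F.col("amount") > 100)

# Applying a function to the column usually prevents the pushdown,
# so the predicate runs inside the Spark job instead.
not_pushed = df.filter(F.upper(F.col("status")) == "SHIPPED")

# Compare the PushedFilters entries in the two physical plans.
pushed.explain()
not_pushed.explain()
```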
There are various challenges in MLOps and model sharing, including security and reproducibility. To tackle these for scikit-learn models, we've developed a new open-source library: skops. In this article, I will walk you through how it works and how to use it with an end-to-end example.
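A minimal sketch of the persistence side of skops (the dataset and model are illustrative, and exact signatures may vary slightly between versions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

from skops.io import dump, get_untrusted_types, load

X, y = make_classification(random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Persist without pickle; the .skops format is designed to be safer to share.
dump(model, "model.skops")

# Before loading a file from someone else, inspect which types it would instantiate
# and pass the ones you accept explicitly.
untrusted = get_untrusted_types(file="model.skops")
restored = load("model.skops", trusted=untrusted)
print(restored.score(X, y))
```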
Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage
There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.
Introduction Big data processing is crucial today. Big data analytics and learning help corporations foresee client demands, provide useful recommendations, and more. Hadoop, the open-source software framework for scalable and distributed computation of massive data sets, makes it easy. While MapReduce, Hive, Pig, and Cascading are all useful tools, completing all necessary processing or computing […] The post An Ultimate Manual to Apache Oozie appeared first on Analytics Vidhya.
In the previous blog post about Delta Lake, you discovered the logic behind the writing part. In the meantime, Delta Lake 2 was released, and it's for this brand new version that I'm going to share with you some findings related to data reading.
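As a hedged illustration of the reading side (not the post's code), here is a minimal PySpark sketch that reads the current state of a Delta table and an older snapshot via time travel; the path is hypothetical and the Delta Lake package is assumed to be available:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-read-demo").getOrCreate()

# Current state of a hypothetical table.
current = spark.read.format("delta").load("/tmp/tables/events")

# The same table as it looked at version 0 (time travel).
version_0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/tmp/tables/events")
)

print(current.count(), version_0.count())
```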
Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives
Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-driven decisions.
Introduction YARN stands for Yet Another Resource Negotiator. It is a powerful resource management system for a horizontal server environment. It is designed to be more flexible and generic than the original Hadoop MapReduce system, making it an attractive choice for companies looking to implement Hadoop. It allows companies to process data types and run […] The post YARN for Large Scale Computing: Beginner’s Edition appeared first on Analytics Vidhya.
Observability is a hot topic nowadays, not only in the data industry but also in the software industry. Apache Spark innovates a lot in this field, including new metrics for Structured Streaming and an important addition in the 3.0.0 release that I missed at the time: observable metrics.
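As a rough sketch of the API (PySpark exposes it through Observation since 3.3; the metric expressions here are illustrative):

```python
from pyspark.sql import Observation, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("observe-demo").getOrCreate()
df = spark.range(100)

# Attach named aggregate metrics to the DataFrame; they are computed alongside the action.
obs = Observation("my_metrics")
observed = df.observe(obs, F.count(F.lit(1)).alias("rows"), F.sum("id").alias("id_sum"))

observed.collect()  # trigger an action so the metrics are populated
print(obs.get)      # e.g. {'rows': 100, 'id_sum': 4950}
```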
We’ve made architecture changes to Meta’s event-driven asynchronous computing platform that have enabled easy integration with multiple event sources. We’re sharing our learnings from handling various workloads and how to tackle the trade-offs made with certain design choices in building the platform. Asynchronous computing is a paradigm where the user does not expect a workload to be executed immediately; instead, it gets scheduled for execution sometime in the near future without blocking the […]
Introduction In this constantly growing technical era, big data is at its peak, and there is a need for a tool to import and export data between RDBMS and Hadoop. Apache Sqoop stands for “SQL to Hadoop,” and is one such tool that transfers data between Hadoop (Hive, HBase, HDFS, etc.) and relational database servers (MySQL, Oracle, PostgreSQL, […] The post Top 8 Interview Questions on Apache Sqoop appeared first on Analytics Vidhya.
In Airflow, DAGs (your data pipelines) support nearly every use case. As these workflows grow in complexity and scale, efficiently identifying and resolving issues becomes a critical skill for every data engineer. This is a comprehensive guide with best practices and examples for debugging Airflow DAGs. You’ll learn how to: create a standardized process for debugging to quickly diagnose errors in your DAGs; identify common issues with DAGs, tasks, and connections; and distinguish between Airflow-related […]