Let’s set the scene: your company collects data, and you need to do something useful with it. Whether it’s customer transactions, IoT sensor readings, or just an endless stream of social media hot takes, you need a reliable way to get that data from point A to point B while doing something clever with it along the way. That’s where data pipeline design patterns come in.
There are plenty of statistics about the speed at which we are creating data in today's world. On the flip side of all that data creation is a need to manage it all, and that's where data teams come in. But leading these data teams is challenging, and yet many new data… Read more The post From IC to Data Leader: Key Strategies for Managing and Growing Data Teams appeared first on Seattle Data Guy.
Today's business landscape is increasingly competitive — and the right data platform can be the difference between teams that feel empowered and teams that feel impaired. I love talking with leaders across industries and organizations to hear about what's top of mind for them as they evaluate various data platforms. In these conversations, there are a number of questions that I hear time and time again: Will my data platform be scalable and reliable enough?
Key Takeaways:
- Data integrity is required for AI initiatives, better decision-making, and more – but data trust is on the decline.
- Data quality and data governance are the top data integrity challenges and priorities.
- A long-term approach to your data strategy is key to success as business environments and technologies continue to evolve.
The rapid pace of technological change has made data-driven initiatives more crucial than ever within modern business strategies.
In Airflow, DAGs (your data pipelines) support nearly every use case. As these workflows grow in complexity and scale, efficiently identifying and resolving issues becomes a critical skill for every data engineer. This is a comprehensive guide with best practices and examples for debugging Airflow DAGs. You'll learn how to:
- Create a standardized process for debugging to quickly diagnose errors in your DAGs
- Identify common issues with DAGs, tasks, and connections
- Distinguish between Airflow-relate
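To make the first bullet concrete, here is a minimal sketch of one common standardized debugging step: running a DAG in a single local process with dag.test() (available in Airflow 2.5+), so you get full tracebacks and can attach a debugger without a scheduler running. The DAG and task names are hypothetical placeholders, not from the guide.

```python
# Minimal sketch: debugging a DAG in-process with dag.test() (Airflow 2.5+).
# The DAG and tasks are hypothetical placeholders.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def debug_example():
    @task
    def extract():
        return {"rows": 42}

    @task
    def load(payload: dict):
        print(f"loading {payload['rows']} rows")

    load(extract())


dag_object = debug_example()

if __name__ == "__main__":
    # Runs every task in this process, so failures surface as ordinary
    # Python tracebacks instead of buried scheduler logs.
    dag_object.test()
```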
In my never-ending quest to plumb the most boring depths of every single data tool on the market, I found myself annoyed when recently using DuckDB for a benchmark that was reading parquet files from s3. What was not clear, or easy, was trying to figure out how DuckDB would LIKE to read default AWS […] The post DuckDB … reading from s3 … with AWS Credentials and more. appeared first on Confessions of a Data Guy.
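For readers hitting the same wall, here is a minimal sketch of one way to let DuckDB pick up default AWS credentials: the CREDENTIAL_CHAIN secret provider available in recent DuckDB releases, which resolves credentials the way the AWS SDK does. The bucket and path are hypothetical, and this is an assumption about the setup rather than the post's exact solution.

```python
# Minimal sketch: reading Parquet from S3 with DuckDB while letting the
# CREDENTIAL_CHAIN secret provider resolve default AWS credentials
# (env vars, ~/.aws/credentials, instance profiles). Bucket and path
# are hypothetical placeholders.
import duckdb

con = duckdb.connect()
for ext in ("httpfs", "aws"):  # the aws extension supplies credential_chain
    con.install_extension(ext)
    con.load_extension(ext)

con.sql("""
    CREATE SECRET s3_creds (
        TYPE S3,
        PROVIDER CREDENTIAL_CHAIN
    )
""")

rows = con.sql(
    "SELECT count(*) FROM read_parquet('s3://my-bucket/data/*.parquet')"
).fetchall()
print(rows)
```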
Scraping data from PDFs is a rite of passage if you work in data. Someone somewhere always needs help getting invoices parsed, contracts read through, or dozens of other use cases. Most of us will turn to Python and our trusty list of Python libraries and start plugging away. Of course, there are many challenges… Read more The post Challenges You Will Face When Parsing PDFs With Python – How To Parse PDFs With Python appeared first on Seattle Data Guy.
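As a baseline for the approach the post critiques, here is a minimal text-extraction sketch using pypdf; the filename is a hypothetical placeholder, and real-world PDFs (scans, odd encodings, multi-column layouts) routinely defeat it, which is where the challenges begin.

```python
# Minimal sketch: per-page text extraction with pypdf.
# The filename is a hypothetical placeholder.
from pypdf import PdfReader

reader = PdfReader("invoice.pdf")
for page_number, page in enumerate(reader.pages, start=1):
    text = page.extract_text() or ""  # extract_text() can return None
    print(f"--- page {page_number} ---")
    print(text)
```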
Liang Mou; Staff Software Engineer, Logging Platform | Elizabeth (Vi) Nguyen; Software Engineer I, Logging Platform | In today’s data-driven world, businesses need to process and analyze data in real-time to make informed decisions. Change Data Capture (CDC) is a crucial technology that enables organizations to efficiently track and capture changes in their databases.
Key Takeaways:
- Harness automation and data integrity to unlock the full potential of your data, powering sustainable digital transformation and growth.
- Data and processes are deeply interconnected. Successful digital transformation requires you to optimize both so that they work together seamlessly.
- Simplify complex SAP® processes with automation solutions that drive efficiency, reduce costs, and empower your teams to act quickly.
For over 20 years, Skyscanner has been helping travelers plan and book trips with confidence, including airfare, hotels, and car rentals. As digital natives, the organization is no stranger to staggering volume. Over the years, Skyscanner has grown organically to include a vast network of high-volume data producers and consumers, including:
- Serving over 110 million monthly users
- Partnering with hundreds of travel providers
- Operating in 30+ languages and 180 countries
- Fulfilling over 5,000
We are thrilled to unveil the finalists for the Databricks Generative AI Startup Challenge, a competition designed to spotlight innovative early-stage startups.
Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.
SQL2Fabric Mirroring is a new fully managed service offered by Striim to mirror on-premises SQL databases. It's a collaborative service between Striim and Microsoft based on Fabric Open Mirroring that enables real-time data replication from on-premises SQL Server databases to Azure Fabric OneLake. This fully managed service leverages Striim Cloud's integration with the Microsoft Fabric stack for seamless data mirroring to Fabric Data Warehouse and Lakehouse.
We’re excited to announce a new integration between Databricks Notebooks and AI/BI Dashboards, enabling you to effortlessly transform insights from your notebooks into.
Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage
There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.
1. Introduction
2. Analytical databases aggregate large amounts of data
3. Most platforms enable you to do the same thing but have different strengths
  3.1. Understand how the platforms process data
    3.1.1. A compute engine is a system that transforms data
    3.1.2. Metadata catalog stores information about datasets
    3.1.3. Data platform support for SQL, Dataframe, and Dataset APIs
    3.1.4.
MVC is an interesting concept from the late 70s that separates the View (presentation) from the Controller via the Model. It has been used in designing web applications and is still heavily used, for example, in Ruby on Rails or Laravel, a popular PHP framework. This design pattern got me thinking: Wouldn’t it be convenient to separate the presentation from the storage through a data modeling layer, similar to the model layer?
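As a rough illustration of that idea (all names hypothetical), a model layer can mediate between storage and presentation so that neither depends on the other:

```python
# Rough illustration of an MVC-style split for data: the model layer
# mediates between storage (where rows live) and presentation (how they
# are shown). All names here are hypothetical.
from dataclasses import dataclass


@dataclass
class Order:  # Model: the shape consumers program against
    order_id: int
    total_cents: int


class OrderStore:  # Storage: could be swapped for a real database
    def fetch(self, order_id: int) -> Order:
        row = {"order_id": order_id, "total_cents": 4999}  # stand-in for a query
        return Order(**row)


def render(order: Order) -> str:  # View: presentation only
    return f"Order {order.order_id}: ${order.total_cents / 100:.2f}"


print(render(OrderStore().fetch(1)))  # Order 1: $49.99
```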
In today’s data-driven world, choosing the right schema to store data is equally important as collecting it. Schema design plays a crucial role in the performance, scalability, and usability of your data systems. Different data use cases require the selection of different schema designs.
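As a small illustration of that point (table and column names hypothetical, DuckDB used only as a convenient local engine): the same order data can be kept normalized for transactional updates or denormalized into one wide table for analytical reads.

```python
# Small illustration: one dataset, two schema designs.
# Names are hypothetical; DuckDB is just a local stand-in engine.
import duckdb

con = duckdb.connect()

# Normalized (write-friendly): each fact lives in exactly one place.
con.sql("CREATE TABLE customers (customer_id INT, name TEXT)")
con.sql("CREATE TABLE orders (order_id INT, customer_id INT, total_cents INT)")
con.sql("INSERT INTO customers VALUES (1, 'Ada')")
con.sql("INSERT INTO orders VALUES (100, 1, 4999)")

# Denormalized (read-friendly): one wide table, no join at query time,
# at the cost of duplicating customer attributes per order.
con.sql("""
    CREATE TABLE orders_wide AS
    SELECT o.order_id, o.total_cents, c.name AS customer_name
    FROM orders o JOIN customers c USING (customer_id)
""")
print(con.sql("SELECT * FROM orders_wide").fetchall())
```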
Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives
Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri
We are excited to introduce the gated Public Preview of Predictive Optimization for statistics. Announced at the Data + AI Summit, Predictive Optimization.
Confluent’s CwC partner program introduces bidirectional data streaming for SAP Datasphere, powered by Apache Kafka and Apache Flink; CwC Q4 2024 new entrants.
In the modern field of data analytics, proper data management is essential to maximizing performance while minimizing costs. Google BigQuery, one of the leading cloud-based data warehouses, excels at managing huge datasets through partitioning and clustering.
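For concreteness, here is a minimal sketch of the DDL pattern in question, issued through the official BigQuery Python client; the dataset and table names are hypothetical placeholders.

```python
# Minimal sketch: a date-partitioned, clustered BigQuery table created
# via DDL. Queries filtered on event_ts prune partitions; filters on
# user_id benefit from clustering. Names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials/project
client.query("""
    CREATE TABLE IF NOT EXISTS my_dataset.events (
        event_ts TIMESTAMP,
        user_id  STRING,
        payload  JSON
    )
    PARTITION BY DATE(event_ts)
    CLUSTER BY user_id
""").result()  # .result() blocks until the DDL job finishes
```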
With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You'll learn how to:
- Understand the building blocks of DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to
- Write DAGs that adapt to your data at runtime and set up alerts and notifications
- Scale you
We're excited to announce that the Databricks Assistant, now fully hosted and managed within Databricks, is available in public preview! This version.
CDC has evolved to become a key component of data streaming platforms, and is easily enabled by managed connectors such as the Debezium PostgreSQL CDC connector.
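As an illustrative sketch of what "easily enabled" looks like in practice, the following registers a Debezium PostgreSQL connector through the Kafka Connect REST API; the hostnames, credentials, and topic prefix are hypothetical placeholders.

```python
# Illustrative sketch: registering the Debezium PostgreSQL CDC connector
# via the Kafka Connect REST API. Hostnames, credentials, and the topic
# prefix are hypothetical placeholders.
import requests

connector = {
    "name": "inventory-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "secret",
        "database.dbname": "inventory",
        "topic.prefix": "inventory",  # Kafka topics: inventory.<schema>.<table>
        "plugin.name": "pgoutput",    # logical decoding plugin built into Postgres 10+
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()  # 201 Created means change events start flowing
```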
In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!
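For a taste of one of those features, here is a minimal dynamic task mapping sketch (Airflow 2.3+), where the number of mapped task instances is decided at runtime; the file names are hypothetical placeholders.

```python
# Minimal sketch of dynamic task mapping: one mapped task instance is
# created per file discovered at runtime. Paths are hypothetical.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def mapped_pipeline():
    @task
    def list_files() -> list[str]:
        return ["a.csv", "b.csv", "c.csv"]  # stand-in for a real listing

    @task
    def process(path: str):
        print(f"processing {path}")

    process.expand(path=list_files())  # fan-out decided at runtime


mapped_pipeline()
```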
Speaker: Ben Epstein, Stealth Founder & CTO | Tony Karrer, Founder & CTO, Aggregage
When tasked with building a fundamentally new product line with deeper insights than previously achievable for a high-value client, Ben Epstein and his team faced a significant challenge: how to harness LLMs to produce consistent, high-accuracy outputs at scale. In this new session, Ben will share how he and his team engineered a system (based on proven software engineering approaches) that employs reproducible test variations (via temperature 0 and fixed seeds), and enables non-LLM evaluation m
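The session abstract does not name a vendor, so as an assumed illustration only, here is the general technique of pinning temperature and seed, shown with the OpenAI Python client; seeded sampling is best-effort even where supported, and the model name is a placeholder.

```python
# Sketch of the reproducibility technique named in the session abstract:
# temperature 0 plus a fixed seed to reduce run-to-run variation.
# The OpenAI client and model name are assumed stand-ins; the talk does
# not specify a stack, and seeded sampling is best-effort.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,  # minimize sampling noise
    seed=1234,      # best-effort determinism across identical calls
    messages=[{"role": "user", "content": "Classify this ticket: 'refund not received'"}],
)
print(response.choices[0].message.content)
```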