I’d like to share a story about an educational side project that could prove fruitful for a software engineer seeking a new job. Juraj created a systems design explainer on how he built this project and the technologies used. [Figure: the systems design diagram for the Rides application.] The app uses: Node.js
- Identify what tool to use to process data
- Define what the output dataset will look like
- Define SLAs so stakeholders know what to expect
- Define checks to ensure the output dataset is usable
- Data flow architecture
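As a concrete illustration of the "define checks" step, here is a minimal sketch of a publish-time validation in Python; the column names and the checks themselves are assumptions for illustration, not taken from the article:

```python
# A minimal sketch of output-dataset checks run before publishing.
# Column names ("order_id", "amount") are hypothetical examples.
import pandas as pd

def check_output(df: pd.DataFrame) -> None:
    """Fail loudly if the output dataset is not usable."""
    assert len(df) > 0, "output dataset is empty"
    assert df["order_id"].is_unique, "duplicate order_id values found"
    assert df["amount"].notna().all(), "NULL amounts found"

# Usage: run the checks on the freshly built output before handing it off.
check_output(pd.DataFrame({"order_id": [1, 2], "amount": [9.99, 5.00]}))
```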
We're thrilled to announce the release of a new Cloudera Accelerator for Machine Learning (ML) Projects (AMP): Summarization with Gemini from Vertex AI. Benchmark tests indicate that Gemini Pro demonstrates superior speed in token processing compared to competitors like GPT-4.
Despite this, it is still operationally challenging to deploy and maintain your own stream processing infrastructure. Decodable was built with a mission of eliminating all of the painful aspects of developing and deploying stream processing systems for engineering teams. How can you get the best results for your use case?
Why do some embedded analytics projects succeed while others fail? We surveyed 500+ application teams embedding analytics to find out which analytics features actually move the needle. Read the 6th annual State of Embedded Analytics Report to discover new best practices. Brought to you by Logi Analytics.
Summary Streaming data processing enables new categories of data products and analytics. Unfortunately, reasoning about stream processing engines is complex and lacks sufficient tooling. If you've learned something or tried out a project from the show then tell us about it!
To solve this, we’ll apply Projection Policies to ensure that only certain roles can see sensitive columns like customer numbers. Snowflake provides several layers of data security, including Projection Policies, Masking Policies, and Row Access Policies, that work together to restrict access based on roles.
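As a rough illustration of how such a policy can be wired up from Python (the role, table, and column names here are hypothetical, and the SQL follows Snowflake's documented CREATE PROJECTION POLICY syntax rather than anything shown in the excerpt):

```python
# A minimal sketch: create a Projection Policy and attach it to a column
# via the Snowflake Python connector. Credentials and object names are
# placeholders; adapt them to your account.
import snowflake.connector

conn = snowflake.connector.connect(
    account="YOUR_ACCOUNT", user="YOUR_USER", password="...",  # assumed creds
)
cur = conn.cursor()
# Only the (hypothetical) SUPPORT_ADMIN role may SELECT the column;
# other roles can still filter on it but not project it.
cur.execute("""
    CREATE OR REPLACE PROJECTION POLICY hide_customer_number
    AS () RETURNS PROJECTION_CONSTRAINT ->
    CASE WHEN CURRENT_ROLE() IN ('SUPPORT_ADMIN')
         THEN PROJECTION_CONSTRAINT(ALLOW => true)
         ELSE PROJECTION_CONSTRAINT(ALLOW => false)
    END
""")
cur.execute("""
    ALTER TABLE customers MODIFY COLUMN customer_number
    SET PROJECTION POLICY hide_customer_number
""")
```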
Summary The dbt project has become overwhelmingly popular across analytics and data engineering teams. Dustin Dorsey and Cameron Cyr co-authored a practical guide to building your dbt project. In this episode they share their hard-won wisdom about how to build and scale your dbt projects. What was your path to adoption of dbt?
Speaker: Donna Laquidara-Carr, PhD, LEED AP, Industry Insights Research Director at Dodge Construction Network
In today’s construction market, owners, construction managers, and contractors must navigate increasing challenges, from cost management to project delays. Fortunately, digital tools now offer valuable insights to help mitigate these risks. That’s where data-driven construction comes in.
Summary Data processing technologies have dramatically improved in their sophistication and raw throughput. When performing research and building prototypes of the projects, what is your process for incorporating user experience into the implementation of the product? Email hosts@dataengineeringpodcast.com with your story.
Explore how the Skeleton-of-Thought prompt engineering technique enhances generative AI by reducing latency, offering structured output, and optimizing projects.
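As a rough sketch of the Skeleton-of-Thought idea (ask the model for a short outline first, then expand each point in parallel to cut end-to-end latency), with a hypothetical `call_llm` stand-in for whatever model API you use:

```python
# A minimal, hedged sketch of the Skeleton-of-Thought pattern.
# `call_llm` is a hypothetical placeholder, not a real library function.
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call (e.g., an HTTP API to your provider)."""
    raise NotImplementedError("wire this to your model provider")

def skeleton_of_thought(question: str) -> str:
    # Step 1: a short skeleton of bullet points (fast, few tokens).
    skeleton = call_llm(
        f"List 3-5 short bullet points outlining an answer to: {question}")
    points = [p.strip("- ").strip() for p in skeleton.splitlines() if p.strip()]
    # Step 2: expand every point concurrently instead of one long generation.
    with ThreadPoolExecutor() as pool:
        expansions = pool.map(
            lambda pt: call_llm(f"Expand this point in 2-3 sentences: {pt}"),
            points)
    return "\n\n".join(expansions)
```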
Authors: Bingfeng Xia and Xinyu Liu Background At LinkedIn, Apache Beam plays a pivotal role in stream processing infrastructures that process over 4 trillion events daily through more than 3,000 pipelines across multiple production data centers.
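For readers unfamiliar with Beam's programming model, a toy Python pipeline looks something like the sketch below; LinkedIn's production pipelines are of course far larger and run on streaming sources rather than an in-memory list:

```python
# A minimal Apache Beam pipeline: count events per page.
# The input events and their format are invented for illustration.
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "ReadEvents" >> beam.Create(["click page=home", "click page=jobs",
                                    "click page=home"])
     | "ExtractPage" >> beam.Map(lambda e: (e.split("page=")[1], 1))
     | "CountPerPage" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))
```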
Assumptions mapping is the process of identifying and testing your riskiest ideas. You'll learn: Why every product leader goes into a new project with untested, hidden assumptions.
In particular, we expect both Business Intelligence and Data Engineering will be driven by AI operating on top of the context defined in your dbt projects. We've known for a while that the combination of structured data from your dbt project + LLMs is a potent combo (particularly when using the dbt Semantic Layer).
Natural Language Processing (NLP) has transformed technology by allowing machines to understand, decode, and generate human language. NLP plays a crucial role across domains and projects, from automating customer service to improving search engines and analyzing social media sentiment.
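As a small illustration of one such task, sentiment analysis, here is a sketch using Hugging Face's transformers pipeline (an assumed dependency, not one the article prescribes):

```python
# A tiny sentiment-analysis example with the transformers `pipeline` helper.
# Downloads a default pretrained model on first run.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
print(sentiment("The new release fixed every bug I reported. Fantastic!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```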
Summary Working with data is a complicated process, with numerous chances for something to go wrong. To bring observability to dbt projects the team at Elementary embedded themselves into the workflow. Can you start by outlining what elements of observability are most relevant for dbt projects?
My first project was supporting i18n (internationalization) in the app. This project helped onboard me to the software, its structure, its build, and our issue tracking and version control workflows. My goal was to fix the debt of hardcoded strings, but I learned a lot about the codebase and our process as I did it.
Introduction Data is fuel for the IT industry and for data science projects in today's online world. Handling and processing streaming data is among the hardest parts of data analysis, and IT industries rely heavily on real-time insights derived from streaming data sources.
Tommy has built his own video games, consulted on a wide variety of game projects, and taught game development at various universities for a decade. Each project typically takes several years to create, with shifting hardware specifications and emerging competitors and trends to anticipate and react to along the way.
That's why we are announcing that SnowConvert, Snowflake's high-fidelity code conversion solution to accelerate data warehouse migration projects, is now available for download for prospects, customers and partners free of charge. And today, we are announcing expanded support for code conversions from Amazon Redshift to Snowflake.
- Introduction
- Project demo
- Building efficient data pipelines with DuckDB
  - Use DuckDB to process data, not for multiple users to access data
  - Cost calculation: DuckDB + Ephemeral VMs = dirt cheap data processing
  - Processing data less than 100GB? Use DuckDB
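A minimal sketch of the idea, using DuckDB's Python API as a single-process engine for a pipeline step; the file names are hypothetical:

```python
# Read a CSV, aggregate with SQL, and write Parquet -- all in one ephemeral,
# in-memory DuckDB process. Input/output file names are placeholders.
import duckdb

con = duckdb.connect()  # in-memory database; cheap and throwaway
con.sql("""
    COPY (
        SELECT ride_id, COUNT(*) AS events
        FROM read_csv_auto('events.csv')
        GROUP BY ride_id
    ) TO 'ride_counts.parquet' (FORMAT PARQUET)
""")
```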
Our deployments were initially manual. Avoiding downtime was nerve-wracking, and the notion of a 'rollback' was as much a relief as a technical process. After this zero-byte file was deployed to prod, the Apache web server processes slowly picked up the empty configuration file, and Apache started to log like a maniac.
Code and raw data repository / version control: GitHub, with heavy use of GitHub Actions for things like getting warehouse data from vendor APIs, starting cloud servers, running benchmarks, processing results, and cleaning up after runs. Internal comms: Slack for chat, Linear for coordination / project management.
dbt Labs also develops dbt Cloud, a cloud product that hosts and runs dbt Core projects. A dbt project is a folder that contains all the dbt objects needed to work. You can initialise a project with the CLI command: dbt init. In a dbt project you can define YAML files everywhere.
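Besides the CLI, dbt Core 1.5+ also exposes a programmatic entry point; a minimal sketch (the project directory name is hypothetical):

```python
# Programmatic invocation of dbt, equivalent to running `dbt run`
# from inside the project folder. Requires dbt-core >= 1.5.
from dbt.cli.main import dbtRunner

res = dbtRunner().invoke(["run", "--project-dir", "my_dbt_project"])
print(res.success)  # True if all models built cleanly
```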
Our customers rely on NiFi as well as the associated sub-projects (Apache MiNiFi and Registry) to connect to structured, unstructured, and multi-modal data from a variety of data sources – from edge devices to SaaS tools to server logs and change data capture streams. Cloudera DataFlow 2.9: Accelerating GenAI with Powerful New Capabilities.
The project showed that smaller, empowered teams achieve higher impact than larger ones. These small, cross-functional teams ensured that members were deeply involved in the project operations, the technical setup, and the feedback cycle, leading to fewer delays, fewer bottlenecks, and faster decision-making.
Customer intelligence teams analyze reviews and forum comments to identify sentiment trends, while support teams process tickets to uncover product issues and inform gaps in a product roadmap. As data volumes grow and AI automation expands, cost efficiency in processing with LLMs depends on both system architecture and model flexibility.
This dampens confidence in the data and hampers access, in turn impacting the speed to launch new AI and analytic projects. This guarantees data quality and automates the laborious, manual processes required to maintain data reliability.
Beyond working with well-structured data in a data warehouse, modern AI systems can use deep learning and natural language processing to work effectively with unstructured and semi-structured data in data lakes and lakehouses. AI projects should not be about “the latest” or “the best.” Leadership will be the antidote to AI exhaustion.
These processes were costly and time-consuming and also introduced governance and security risks: once data is moved, customers lose all control. As a result, data often went underutilized. We take care of planning, executing and verifying upgrades, and we do so using a rolling process without downtime.
Many of these projects are under constant development by dedicated teams with their own business goals and development best practices, such as the system that supports our content decision makers, or the system that ranks which language subtitles are most valuable for a specific piece of content.
So, why do so many automation projects fail to deliver? Let’s break down the practical steps to make automation and AI projects successful and discuss common pitfalls. Examples include “reduce data processing time by 30%” or “minimize manual data entry errors by 50%.”
In this episode Tanya Bragin shares her experiences as a product manager for two major vendors and the lessons that she has learned about how teams should approach the process of tool selection. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold. Data projects are notoriously complex.
Enterprises are very security conscious when it comes to their data and their AI projects. Even with our own certifications, the security review process during deal cycles can be lengthy and really slow down momentum. How do security concerns impact the deployment of your platform?
We all know how it feels: staring at the terminal while your development server starts up, or watching your CI/CD pipeline crawl through yet another build process. Webpack processes everything through JavaScript, which is single-threaded by nature and slower at CPU-intensive tasks compared to lower-level languages like Go or Rust.
There are multiple ways to start a new year: with new projects, new ideas, new resolutions, or by just keeping on playing the same music. HNY 2025 (credits) Happy new year ✨ I wish you the best for 2025. I hope you will enjoy 2025. Thank you so much for your support through the years.
Recognize that artificial intelligence is a data governance accelerator and a process that must be governed to monitor ethical considerations and risk. Align people, processes, and technology Successful data governance requires a holistic approach. Tools are important, but they need to complement your strategy.
Scrum is a quality-driven process for producing excellent business outcomes. This certification is not as well-known as the PSM (Professional Scrum Master™) I, but it is a fantastic choice if you are interested in product ownership (for example, if you are a business analyst who wants to start working on Scrum projects).
Andrew Ng, Executive Chairman of LandingAI, has long been a leading proponent of AI agents and agentic workflows — the iterative processes of multiple AI agents collaborating to solve problems and ultimately carry out complex tasks automatically.
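A stripped-down sketch of one common agentic pattern, reflection (draft, critique, revise); `call_llm` is a hypothetical stand-in for your model API, and this is not Ng's or LandingAI's actual implementation:

```python
# A minimal reflection-style agent loop: draft, critique, revise.
# `call_llm` is a hypothetical placeholder for a real model call.

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call (e.g., an HTTP API)."""
    raise NotImplementedError("wire this to your model provider")

def agentic_refine(task: str, max_iterations: int = 3) -> str:
    draft = call_llm(f"Draft a solution to the task:\n{task}")
    for _ in range(max_iterations):
        critique = call_llm(f"Critique this solution:\n{draft}")
        if "no issues" in critique.lower():
            break  # the critic agent is satisfied; stop iterating
        draft = call_llm(f"Revise the solution.\nTask: {task}\n"
                         f"Current draft: {draft}\nCritique: {critique}")
    return draft
```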
However, with the introduction of the Transformer architecture—initially successful in Natural Language Processing (NLP)—the landscape has shifted. Each flattened patch is passed through a learnable linear projection, and positional embeddings are added.
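A minimal PyTorch sketch of that patch-embedding step; the sizes (224px images, 16x16 patches, 768-dim embeddings) follow the original ViT paper's defaults and are assumptions here, not values from the article:

```python
# ViT-style patch embedding: split the image into patches, apply a learnable
# linear projection to each flattened patch, then add positional embeddings.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided conv cuts the image into patches and applies the
        # learnable linear projection in a single step.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)       # (B, 196, dim) -- patch tokens
        return x + self.pos_embed              # add positional embeddings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```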
The solution: They use a data appending process to match their existing data with a third-party database that contains full street addresses. Our Chief Data Officer, Dave Shuman, recently walked through a data appending and enrichment project for our CRM data. Why does this matter?
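A toy sketch of the matching step with pandas; the join key and column names are hypothetical, not taken from Dave Shuman's walkthrough:

```python
# Data appending in miniature: left-join CRM records against a third-party
# table so every CRM row is kept and addresses are appended where they match.
import pandas as pd

crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "name": ["Ada", "Grace", "Alan"]})
third_party = pd.DataFrame({"customer_id": [1, 3],
                            "street_address": ["12 Main St", "9 Oak Ave"]})

enriched = crm.merge(third_party, on="customer_id", how="left")
print(enriched)  # customer 2 keeps a NaN address -- no third-party match
```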