Building, Coding and Definition - Data Engineering Digest

Indexing code at scale with Glean

Engineering at Meta

DECEMBER 19, 2024

We’re sharing details about Glean, Meta’s open source system for collecting, deriving and working with facts about source code. In August 2021 we open-sourced our code indexing system Glean. Glean collects information about source code and provides it to developer tools through an efficient and flexible query language.

Coding

Coding Programming Language SQL Programming

Why did Google close its coding competitions after 20 years?

The Pragmatic Engineer

MARCH 3, 2023

On 22 February 2023, Google announced its coding competitions are coming to an end: The visual that accompanied the announcement of the end of Google’s coding competitions. Code Jam: competitive programming. Hash Code: team programming. Google Code Jam I/O for Women: algorithmic programming.

Coding

Coding IT Software Engineer Software Engineering

Why are Cloud Development Environments Spiking in Popularity, Now?

The Pragmatic Engineer

SEPTEMBER 26, 2023

Every day, there’s more code at a tech company, not less. This means more repositories are needed, which are fast enough to build and work with, but which increase fragmentation. However, monorepos result in codebases growing large, so that even checking out the code or updating to the head can be time consuming.

Cloud

Cloud Software Engineer Software Engineering Cloud Computing

Webinars

How to Achieve High-Accuracy Results When Using LLMs

Building Holiday Finds: How Pinterest Engineers Reimagined Gift Discovery

Pinterest Engineering

MARCH 26, 2025

Personalization Stack: Building a Gift-Optimized Recommendation System. The success of Holiday Finds hinges on our ability to surface the right gift ideas at the right time. Unified Logging System: We implemented comprehensive engagement tracking that helps us understand how users interact with gift content differently from standard Pins.

Building

Building Engineering Algorithm Systems

Using Trino And Iceberg As The Foundation Of Your Data Lakehouse

Data Engineering Podcast

FEBRUARY 18, 2024

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20.

Data Lake

Data Lake High Quality Data Data Warehouse Google Cloud

Introducing Self-Service, No-Code Airflow Authoring UI in Cloudera Data Engineering

Cloudera

OCTOBER 19, 2021

This presented challenges for users in building more complex multi-step pipelines that are typical of DE workflows. In the process several key themes emerged: Low/No-code. Writing code is error prone and requires trial and error. By far the biggest barrier for new users is creating custom Airflow DAGs. Long-tail of operators.

Coding

Coding Data Engineer Data Engineering Engineering

A Tour Around Buck2, Meta's New Build System

Tweag

JULY 5, 2023

Buck2 is a from-scratch rewrite of Buck, a polyglot, monorepo build system that was developed and used at Meta (Facebook), and shares a few similarities with Bazel. As you may know, the Scalable Builds Group at Tweag has a strong interest in such scalable build systems.

Systems

Systems Building Java Programming Language

Going from Developer to CEO: Chronosphere

The Pragmatic Engineer

OCTOBER 10, 2023

He’s solved interesting engineering challenges along the way, too – like building observability for Amazon’s EC2 offering, and being one of the first engineers on Uber’s observability platform. From learning to code in Australia, to working in Silicon Valley How did I learn to code?

Software Engineer

Software Engineer Software Engineering Architecture Media

DevOps Lifecycle: Definition, Phases

Knowledge Hut

NOVEMBER 20, 2023

The DevOps lifecycle phases are in order from left to right, with each phase building upon the last. It is about automating the process of building, testing, deploying, and maintaining applications to reduce time-to-market for new features and functionality. Code - During this point, the code is being developed.

Utilities

Utilities Programming Coding Designing

How to build a Data Dashboard Prototype with Generative AI

Towards Data Science

JANUARY 27, 2025

How to Build a Data Dashboard Prototype with Generative AI: A book reading data visualization with Vizro-AI. This article is a tutorial that shows how to build a data dashboard to visualize book reading data taken from goodreads.com. It’s still not complete and can definitely be extended and improved upon.

Building

Building Datasets Coding Data

Data Pruning MNIST: How I Hit 99% Accuracy Using Half the Data

Towards Data Science

JANUARY 30, 2025

Building more efficient AI. TL;DR: Data-centric AI can create more efficient and accurate models. Full code and results available here on GitHub. Moving experiment configs to a YAML, automatically saving results to a file, and having o1 write my visualization code made life much easier.

Database-centric

Database-centric Datasets Data Architecture

Building a Kimball dimensional model with dbt

dbt Developer Hub

APRIL 19, 2023

This tutorial aims to solve this by providing the definitive guide to dimensional modeling with dbt. Don’t repeat yourself: Dimensions can be easily re-used with other fact tables to avoid duplication of effort and code logic. Joins between fact and dimension tables are made simple through the use of surrogate keys.
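
The point about surrogate keys can be illustrated with a minimal sketch. It is written in plain Python rather than dbt SQL, and every name in it (`surrogate_key`, `dim_product`, `fact_sales`, the sample rows) is hypothetical and not taken from the tutorial. It assumes the surrogate key is a hash of the natural-key columns, which is one common approach but not necessarily the one the tutorial uses.

```python
import hashlib


def surrogate_key(*natural_key_parts):
    """Hash the natural-key columns into one stable key (hypothetical helper)."""
    joined = "-".join("" if part is None else str(part) for part in natural_key_parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Hypothetical dimension rows, each tagged with a surrogate key.
dim_product = [
    {"product_key": surrogate_key(1), "product_id": 1, "name": "Widget"},
    {"product_key": surrogate_key(2), "product_id": 2, "name": "Gadget"},
]

# Hypothetical fact rows carry only the surrogate key, not the product attributes.
fact_sales = [
    {"product_key": surrogate_key(1), "quantity": 3},
    {"product_key": surrogate_key(2), "quantity": 5},
    {"product_key": surrogate_key(1), "quantity": 1},
]

# Joining fact to dimension becomes a single-column lookup on the surrogate key.
products_by_key = {row["product_key"]: row for row in dim_product}
sales_with_names = [
    {"name": products_by_key[sale["product_key"]]["name"], "quantity": sale["quantity"]}
    for sale in fact_sales
]
```

Because the fact rows reference the dimension through one key column, the same `dim_product` rows could be reused by any other fact table without repeating the product attributes.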

Building

Building PostgreSQL BI Database

Movie Recommendation System: Definition, Strategies, Usecase

Knowledge Hut

FEBRUARY 1, 2024

Today, we’ll talk about how Machine Learning (ML) can be used to build a movie recommendation system - from researching data sets & understanding user preferences all the way through training models & deploying them in applications. How to Build a Movie Recommendation System in Python?

Systems

Systems Entertainment Algorithm Datasets

Build Better Data Products By Creating Data, Not Consuming It

Data Engineering Podcast

NOVEMBER 6, 2022

In this episode Nick King discusses how you can be intentional about data creation in your applications and services to reduce the friction and errors involved in building data products and ML applications. Can you share your definition of "behavioral data" and how it is differentiated from other sources/types of data?

Building

Building IT Metadata MongoDB

Stop Creating Bad DAGs — Optimize Your Airflow Environment By Improving Your Python Code

Towards Data Science

JANUARY 30, 2025

That said, this tutorial aims to introduce airflow-parse-bench, an open-source tool I developed to help data engineers monitor and optimize their Airflow environments, providing insights to reduce code complexity and parse time. Parsing occurs every time Airflow processes your Python files to build the DAGs dynamically.
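
To make the idea of parse time concrete, here is a rough sketch that times how long the top-level code of a single Python file takes to execute. It uses only the standard library and is not how airflow-parse-bench or Airflow itself measures parsing; the helper `time_module_load` and the sample file contents are assumptions made for illustration.

```python
import importlib.util
import tempfile
import time
from pathlib import Path


def time_module_load(path):
    """Return the seconds spent executing the top-level code of one Python file."""
    spec = importlib.util.spec_from_file_location("timed_module", path)
    module = importlib.util.module_from_spec(spec)
    start = time.perf_counter()
    spec.loader.exec_module(module)  # runs everything at module level
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for a DAG file with slow top-level code.
    dag_file = Path(tmp) / "example_dag_file.py"
    dag_file.write_text("import time\ntime.sleep(0.05)\nTASKS = list(range(3))\n")
    elapsed = time_module_load(dag_file)

print(f"top-level code took {elapsed:.3f}s")
```

Anything that runs at module level, such as the `time.sleep` call in the sample file, is paid again on every parse, which is why moving expensive work out of the top level tends to help.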

Python

Python Coding Google Cloud Database

How to get started with dbt

Christophe Blefari

MARCH 1, 2023

a macro: a macro is a Jinja function that either does something or returns SQL or partial SQL code. In a nutshell, the dbt journey starts with source definitions, on which you define models that transform these sources into something else you'll need in your downstream usage of the data.

Data Warehouse

Data Warehouse SQL Metadata Raw Data

Top 15 Python IDEs and Code Editors to Use in 2024

Knowledge Hut

DECEMBER 22, 2023

For this feature, Python encloses certain code editors and Python IDEs used for software development say, Python itself. This article looks at the top Python IDEs and code editors, along with their features, pros, and cons, and discusses which are best suited for writing Python code. What is a Code Editor?

Python

Python Coding Programming Language Data Science

The job market for new grads: worse than in 2008, but better than 2002

The Pragmatic Engineer

FEBRUARY 23, 2023

Chris Lee is the founder of US-based Launch School, which is one of the “anti bootcamp coding schools,” and an organization which impresses me. As a coding school operator, Chris has a unique perspective that gives him insight into lots of different companies and engineering departments.

Software Engineer

Software Engineer Software Engineering Recruitment Portfolio

Are reports of StackOverflow’s fall greatly exaggerated?

The Pragmatic Engineer

AUGUST 10, 2023

Ayhan visualized this data and observed a definite fall in all metrics: page views, visits, questions asked, votes. Q&A activity is definitely down: the company is aware of this metric taking a dive, and said they’re actively working to address it. When it comes to GenAI, Stack Overflow for Teams is getting a lot more love.

Retail

Retail Utilities Software Engineer Software Engineering

Title Launch Observability at Netflix Scale

Netflix Tech

JANUARY 6, 2025

Part 2: Navigating Ambiguity. By: Varun Khaitan. With special thanks to my stunning colleagues: Mallika Rao, Esmir Mesic, Hugo Marques. Building on the foundation laid in Part 1, where we explored the what behind the challenges of title launch observability at Netflix, this post shifts focus to the how.

Metadata

Metadata Algorithm Systems Building

Pioneering Data Observability: Data, Code, Infrastructure, & AI

Towards Data Science

AUGUST 8, 2023

Pioneering Data Observability: Data, Code, Infrastructure, & AI. The four dimensions of data observability: data, code, infrastructure, and AI. Unreliable data doesn’t live in a silo… it’s impacted by all three ingredients of the data ecosystem: data + code + infrastructure. You look at the code.

Coding

Coding Data Software Engineer Software Engineering

Building Your Data Warehouse On Top Of PostgreSQL

Data Engineering Podcast

MAY 13, 2021

If you want to build a warehouse that gives you both control and flexibility then you might consider building on top of the venerable PostgreSQL project. In this episode Thomas Richter and Joshua Drake share their advice on how to build a production ready data warehouse with Postgres.

PostgreSQL

PostgreSQL Data Warehouse Building MySQL

Building Real-Time Data Platforms For Large Volumes Of Information With Aerospike

Data Engineering Podcast

OCTOBER 2, 2021

If you need to deal with massive data, at high velocities, in milliseconds, then Aerospike is definitely worth learning about. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription. Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code.

Building

Building BI Data Architecture Architecture

Improving the code quality of your dbt models with unit tests and TDD

Towards Data Science

JUNE 2, 2023

How to improve the code quality of your dbt models with unit tests and TDD: All you need to know to start unit testing your dbt SQL models. If you are a data or analytics engineer, you are probably comfortable writing SQL models and testing for data quality with dbt tests.

Coding

Coding SQL Software Engineer Software Engineering

Building a Customer 360 in the Snowflake Data Cloud with RudderStack

Snowflake

OCTOBER 2, 2023

To help customers overcome these challenges, RudderStack and Snowflake recently launched Profiles, a new product that allows every data team to build a customer 360 directly in their Snowflake Data Cloud environment. Gone are the months of complex data wrangling and the constraints of no-code SaaS tools.

Cloud

Cloud Building Insurance Data Engineer

Designing And Building Data Platforms As A Product

Data Engineering Podcast

SEPTEMBER 3, 2021

In this episode Lior Gavish, Lior Solomon, and Atul Gupte share their view of what it means to have a data platform, discuss their experiences building them at various companies, and provide advice on how to treat them like a software product. Who are the stakeholders in a data platform? When is a data platform the wrong choice?

Designing

Designing Building SQL BI

Announcing Open Source DataOps Data Quality TestGen 3.0

DataKitchen

FEBRUARY 20, 2025

It assesses your data, deploys production testing, monitors progress, and helps you build a constituency within your company for lasting change. Enhanced Testing & Profiling Copy & Move Tests with Ease The Test Definitions page now supports seamless test migration between test suites.

Datasets

Datasets Metadata Data Government

Version Your Data Lakehouse Like Your Software With Nessie

Data Engineering Podcast

MARCH 10, 2024

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today! Your first 30 days are free!

Data Lake

Data Lake High Quality Data Architecture Machine Learning

Low-Code Data Connectors and Destinations

Towards Data Science

OCTOBER 9, 2024

Get started with Airbyte and Cloud Storage. Coding the connectors yourself? Not only do you have to make it scalable and useful, but every architectural decision builds up over time. And building them yourself from scratch gives you full control of how you want them to behave. Building the data source.

Coding

Coding Cloud Storage Data Data Ingestion

Apache Spark MLlib vs Scikit-learn: Building Machine Learning Pipelines

Towards Data Science

MARCH 9, 2023

Code implementations for ML pipelines: from raw data to predictions. Real-life machine learning involves a series of tasks to prepare the data before the magic predictions take place. And that’s it.

Machine Learning

Machine Learning Building Datasets Big Data

Top 10 Data Engineering & AI Trends for 2025

Monte Carlo

NOVEMBER 26, 2024

Prediction: AI copilots that can complete a sentence, correct code errors, etc. (And if Twitter has taught us anything, Sam Altman definitely has a lot to say.) We’re seeing teams build out vector databases or embedding models at scale. According to Tomasz, the current state of AI can be summed up in three categories.

Data Engineer

Data Engineer Data Engineering Engineering Unstructured Data

When And How To Conduct An AI Program

Data Engineering Podcast

MARCH 3, 2024

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today! Your first 30 days are free!

Programming

Programming Data Lake High Quality Data Machine Learning

Build and Manage ML features for Production-Grade Pipelines

Snowflake

OCTOBER 7, 2024

When scaling data science and ML workloads, organizations frequently encounter challenges in building large, robust production ML pipelines. Define an Entity: Define a Feature View: feature_df is a Snowpark DataFrame object containing your feature definition.

Management

Management Building Datasets Government

Ready-to-go sample data pipelines with Dataflow

Netflix Tech

DECEMBER 3, 2022

One of the main reasons this feature exists is just like with food samples, to give you “a taste” of the production quality ETL code that you could encounter inside the Netflix data ecosystem. This is one way to build trust with our internal user base. " , country_code STRING COMMENT "Country code of the playback session."

Data Pipeline

Data Pipeline Scala Metadata Food

Building a maintainable and modular LLM application stack with Hamilton

Towards Data Science

JULY 13, 2023

Building a maintainable and modular LLM application stack with Hamilton in 13 minutes. LLM applications are dataflows; use a tool specifically designed to express them. Hamilton is great for describing any type of dataflow, which is exactly what you’re doing when building an LLM powered application.

Building

Building Database-centric Database Coding

Next-Level Apps with Snowpark Container Services and Snowflake Native Apps

Snowflake

NOVEMBER 20, 2023

While such apps are being created at a very fast pace, there are two main challenges: Many modern powerful apps utilize containers to package and use code; however, this typically requires data to be moved from protected environments, increasing data privacy and security risk.

Utilities

Utilities Machine Learning Coding AWS

Simplifying Data Architecture and Security to Accelerate Value

Snowflake

NOVEMBER 11, 2024

What if you could streamline your efforts while still building an architecture that best fits your business and technology needs? At BUILD 2024, we announced several enhancements and innovations designed to help you build and manage your data architecture on your terms. Here’s a closer look.

Data Architecture

Data Architecture Architecture Data Lake Kafka

Fast And Flexible Headless Data Analytics With Cube.JS

Data Engineering Podcast

DECEMBER 21, 2021

Summary: One of the perennial challenges of data analytics is having a consistent set of definitions, along with a flexible and performant API endpoint for querying them. a framework for building analytics APIs to power your applications and BI dashboards. Interview: Introduction. How did you get involved in the area of data management?

Data Analytics

Data Analytics BI Computer Science SQL

Build Confidence In Your Data Platform With Schema Compatibility Reports That Span Systems And Domains Using Schemata

Data Engineering Podcast

SEPTEMBER 11, 2022

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode.

Systems

Systems Metadata Building MongoDB

Announcing Topiary

Tweag

MARCH 8, 2023

Users benefit from uniform, comparable code style, across multiple languages, with the convenience of a single formatter tool. In this first release, we have concentrated on formatting OCaml code, capitalising on the OCaml expertise within the Topiary Team and our colleague, Nicolas Jeannerod. Expect idempotency.

Coding

Coding Engineering Designing Programming

Layoffs push down scores on Glassdoor: this is how companies respond

The Pragmatic Engineer

MAY 25, 2023

Such a log would build confidence that Glassdoor is a neutral platform which is only enforcing its own terms and conditions, and could validate this. However, there’s a definite and ongoing uptick since mid-2021. month-long code freeze at Stack Overflow. What’s going on, and when will Bedrock be available?

Software Engineer

Software Engineer Software Engineering AWS Engineering

Google Shutting down Firebase Dynamic Links

The Pragmatic Engineer

AUGUST 3, 2023

These links were especially helpful for: promotions and marketing campaigns; QR codes; content sharing links that “just work”; converting desktop users to mobile ones. The shutdown: Dynamic links powered Firebase Invites: an app invite service where users could send app invite links to their friends, to drive installation of the app.

Metadata

Metadata Engineering Building Technology

Audio Analysis With Machine Learning: Building AI-Fueled Sound Detection App

AltexSoft

MAY 12, 2022

Audacity doesn’t require coding skills. Commercial audio sets for machine learning are definitely more reliable in terms of data integrity than free ones. Building an app for snore and teeth grinding detection. AltexSoft & SleepScore Labs: Building an iOS App for Snoring and Teeth Grinding Detection.

Machine Learning

Machine Learning Building Deep Learning Healthcare

Cloud Native Data Orchestration For Machine Learning And Data Engineering With Flyte

Data Engineering Podcast

MAY 22, 2022

In this episode Ketan Umare and Haytham Abuelfutuh share the story of the Flyte project and how their work at Union is focused on supporting and scaling the code and community that has made Flyte successful. What are the core primitives that Flyte exposes for building up complex workflows? What do you see as the closest alternatives?

Machine Learning

Machine Learning Data Engineer Data Engineering Cloud
