Imagine walking through an art exhibition at the renowned Gagosian Gallery, where paintings seem to be a blend of surrealism and lifelike accuracy. One particular piece catches your eye: it depicts a child staring at the viewer with wind-tossed hair, evoking the feel of the Victorian era through its coloring and what appears to be a simple linen dress.
In the world of data, two crucial roles play a significant part in unlocking the power of information: Data Scientists and Data Engineers. But what sets these wizards of data apart? Welcome to the ultimate showdown of Data Scientist vs Data Engineer! In this captivating journey, we’ll explore the distinctive paths these tech titans take […] The post Data Engineer vs Data Scientist: Which Career to Choose?
Originally published 6 July 2023 👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. We cover one out of six topics in today’s subscriber-only The Scoop issue. If you’re not yet a full subscriber, you missed this week’s deep-dive on What a senior engineer is at Big Tech. To get the full issues twice a week, subscribe here.
Nothing gives me greater joy than rocking the boat. I take pleasure in finding what people love most in tech and trying to poke holes in it. Everything is sacred. Nothing is sacred. I also enjoy doing simple things, things that have a “real-life” feel to them. I suppose I could be like the others […] The post Polars vs Pandas. Inside an AWS Lambda. appeared first on Confessions of a Data Guy.
Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.
Meta recently announced they have made Buck2 open-source. Buck2 is a from-scratch rewrite of Buck, a polyglot, monorepo build system that was developed and used at Meta (Facebook), and shares a few similarities with Bazel. As you may know, the Scalable Builds Group at Tweag has a strong interest in such scalable build systems. We were thrilled to have the opportunity to work with Meta on Buck2 to help make the tool useful and successful in the open-source use case.
Summary All software systems are in a constant state of evolution. This makes it impossible to select a truly future-proof technology stack for your data platform, making an eventual migration inevitable. In this episode Gleb Mezhanskiy and Rob Goretsky share their experiences leading various data platform migrations, and the hard-won lessons that they learned so that you don't have to.
🧜♂️ (credits) Hey, this is a mid-2023 edition with some of my favourite articles and the popular articles that have been shared this year in the newsletter. There isn't any fancy calculation to find the popular articles. Here's how it's done. Every link sent in each newsletter is tracked in 2 ways: when you click on a link, it first redirects you to my blog, so I know that you've clicked on it; and it adds ref=blef.fr to the URL, so the original article's author knows the visit came from the newsletter.
Introduction In this era of Generative AI, data generation is at its peak. Building an accurate machine learning or AI model requires a high-quality dataset. Quality assurance of the dataset is the most critical task, as poor data causes inaccurate analytics and unreliable predictions that can affect the reputation of any business and […] The post Getting Started with Amazon SageMaker Ground Truth appeared first on Analytics Vidhya.
👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover topics related to Big Tech and startups through the lens of engineering managers and senior engineers. In this article, we cover one out of five topics from today's subscriber-only deep dive on Advice on how to sell a startup. To get full issues twice a week, subscribe here.
You certainly know it: the watermark (aka GC watermark) is responsible for cleaning the state store in Apache Spark Structured Streaming. But you may not know that it's not the only time-based condition. There is another one involved in stream-to-stream joins.
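To make the two conditions concrete, here is a minimal PySpark sketch (stream names, columns, and intervals are made up, not from the article): each side carries its own watermark, and the join's event-time range is the second time-based constraint Spark uses to decide when buffered rows can be evicted.

```python
# A sketch, not the article's code: a stream-to-stream join with two
# time-based cleanup conditions, the watermarks and the join time range.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("watermark-join-sketch").getOrCreate()

impressions = (
    spark.readStream.format("rate").load()
    .select(F.col("timestamp").alias("imp_time"),
            (F.col("value") % 100).alias("imp_ad_id"))
    .withWatermark("imp_time", "10 minutes")
)

clicks = (
    spark.readStream.format("rate").load()
    .select(F.col("timestamp").alias("click_time"),
            (F.col("value") % 100).alias("click_ad_id"))
    .withWatermark("click_time", "20 minutes")
)

# The BETWEEN range is the extra time-based condition: together with the
# watermarks it bounds how long each buffered row can still find a match,
# so the state store can evict expired rows.
joined = impressions.join(
    clicks,
    F.expr("""
        imp_ad_id = click_ad_id AND
        click_time BETWEEN imp_time AND imp_time + INTERVAL 1 HOUR
    """),
)

query = joined.writeStream.format("console").outputMode("append").start()
```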
1. Introduction
2. Sample project
3. Best practices
3.1. Use standard patterns that progressively transform your data
3.2. Ensure data is valid before exposing it to its consumers (aka data quality checks)
3.3. Avoid data duplicates with idempotent pipelines
3.4. Write DRY code & keep I/O separate from data transformation
3.5. Know the when, how, & what (aka metadata) of pipeline runs for easier debugging
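As a flavor of practices 3.2 and 3.3, here is a minimal sketch assuming a Spark pipeline partitioned by run date; the path and column names are hypothetical, not from the post:

```python
# Hypothetical sketch of an idempotent load: re-running the pipeline for
# the same run_date replaces that partition instead of appending duplicates.
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

def publish(df: DataFrame, path: str = "s3://lake/orders/") -> None:
    # Basic quality gate (practice 3.2): refuse to publish an empty batch.
    if df.count() == 0:
        raise ValueError("refusing to publish an empty dataset")
    # Dynamic partition overwrite (practice 3.3) rewrites only the
    # partitions present in df, so retries and backfills converge to the
    # same result as a single successful run.
    (df.write
       .mode("overwrite")
       .option("partitionOverwriteMode", "dynamic")
       .partitionBy("run_date")
       .parquet(path))
```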
Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage
There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.
Summary Real-time data processing has steadily been gaining adoption due to advances in the accessibility of the technologies involved. Despite that, it is still a complex set of capabilities. To bring streaming data within reach of application engineers, Matteo Pelati helped create Dozer. In this episode he explains how investing in high-performance, operationally simplified streaming with a familiar API can yield significant benefits for software and data teams together.
See you on the road (credits) Hey, I hope this newsletter finds you well. This is a small blogpost to give you a few reads while waiting for your next trip. Summer is already in the air, and I found fewer articles for the selection this week. Also, be ready for the Data News: Summer Edition. For the next 5 releases it will be a bit different than usual: less curation and more original articles written in advance, to allow me to take a break.
The ETL & ELT tool market is experiencing continuous transformation, propelled by fluctuating pricing structures and the advent of inventive alternatives. This industry remains fiercely competitive due to these changing elements and a swiftly growing user base. In the following sections, we will explore four emerging alternatives to Fivetran. Of course, that is if you… Read more The post 4 Alternatives to Fivetran: The Evolving Dynamics of the ETL & ELT Tool Market appeared first
👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. We cover one out of six topics in today's subscriber-only The Pulse issue. If you're not yet a full subscriber, you missed this week's deep dive on Software architect archetypes. To get the full issues twice a week, subscribe here. Before we start, a small change.
Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives
Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-driven decisions.
Starting from Apache Spark 3.2.0, it is now possible to load an initial state into arbitrary stateful pipelines. Even though the feature is easy to use, it hides some interesting implementation details!
Sometimes it seems like the Data Engineering landscape is starting to shoot off into infinity. With the rise of Rust and new tools like DuckDB, Polars, and whatever else, things do seem to be shifting at a fundamental level. It seems like there is someone at the base of a teetering rock with a crowbar, picking and […] The post Ballista (Rust) vs Apache Spark.
Summary Data has been one of the most substantial drivers of business and economic value for the past few decades. Bob Muglia has had a front-row seat to many of the major shifts driven by technology over his career. In his recent book "Datapreneurs" he reflects on the people and businesses that he has known and worked with and how they relied on data to deliver valuable services and drive meaningful change.
Have fun training models on this (credits) Hey, it's Saturday. I hope you're enjoying July, taking a well-deserved break, reading data engineering articles at the beach or traveling to unknown places. Sometimes there are Fridays when I can't find any glue between the articles for the newsletter; I have an idea of something to compensate with, but it takes me the whole Friday of exploration.
With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You'll learn how to:
- Understand the building blocks of DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to
- Write DAGs that adapt to your data at runtime and set up alerts and notifications
- Scale your DAGs
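For a taste of what such DAG code looks like, here is a minimal TaskFlow-style sketch; the task names and schedule are illustrative, not taken from the eBook, and it assumes Airflow 2.4+ for the `schedule` parameter.

```python
# Illustrative TaskFlow-style DAG, not from the eBook.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2023, 7, 1), catchup=False)
def example_pipeline():
    @task
    def extract() -> list[int]:
        return [1, 2, 3]

    @task
    def transform(rows: list[int]) -> list[int]:
        return [r * 2 for r in rows]

    @task
    def load(rows: list[int]) -> None:
        print(f"loaded {len(rows)} rows")

    # TaskFlow infers the dependency chain from these calls.
    load(transform(extract()))

example_pipeline()
```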
Some data teams need to have their data near real-time for dashboards and reporting. So how can they implement a near real-time data pipeline? One possible choice is a method called change data capture, also known as CDC. I have seen companies employ multiple ways to use CDC or CDC-like approaches to pull data from… Read more The post What Is Change Data Capture appeared first on Seattle Data Guy.
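As one concrete illustration, here is a minimal sketch of the simplest CDC-like approach, query-based polling on an updated_at column; the table, columns, and SQLite backend are assumptions for the example, and log-based CDC tools (which read the database's change log) avoid polling entirely.

```python
# Hypothetical query-based CDC: poll for rows changed since the last
# high-water mark. Table and column names are assumptions.
import sqlite3

def pull_changes(conn: sqlite3.Connection, last_seen: str) -> list[tuple]:
    # Fetch only rows modified since the previous pull; the caller stores
    # the max(updated_at) it saw as the next high-water mark.
    cur = conn.execute(
        "SELECT id, payload, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    )
    return cur.fetchall()
```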
Reinforcement learning basics to get your feet wet. Learn the components and key concepts in the reinforcement learning framework: from agents and rewards to value functions, policy, and more.
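For a taste of those components, here is a toy sketch of tabular Q-learning in Python; the two-action setup and constants are illustrative, not from the article.

```python
# A toy sketch of the pieces named above: an epsilon-greedy policy and a
# value-function update (tabular Q-learning).
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1   # learning rate, discount, exploration
ACTIONS = [0, 1]
q = defaultdict(float)                    # (state, action) -> estimated value

def choose_action(state):
    # Policy: mostly exploit the current value estimates, sometimes explore.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])

def update(state, action, reward, next_state):
    # Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
```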
That's a conference I heard about only recently. What a huge mistake! Despite the lack of the word "data" in its name, it covers many interesting data topics, and before I share my notes from this year's Data+AI Summit, let me do the same for Berlin Buzzwords!
Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.
Summary For business analytics the way that you model the data in your warehouse has a lasting impact on what types of questions can be answered quickly and easily. The major strategies in use today were created decades ago when the software and hardware for warehouse databases were far more constrained. In this episode Maxime Beauchemin of Airflow and Superset fame shares his vision for the entity-centric data model and how you can incorporate it into your own warehouse design.
Who's leading the data peloton? (credits) Hey you, this is the Saturday Data News edition 🥲 Time flies. I'm writing the August series of articles about "creating data platforms" in advance, and I'm looking for ideas about the data I could use for it. Some kind of simulated real-time data would be best.
Building a datalake for semi-structured data such as JSON has always been challenging. If the JSON documents stream continuously from healthcare vendors, we need a robust, modern architecture that can handle such high volume. At the same time, an analytics layer also needs to be created so as to generate value from it.
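One possible shape of such an architecture, sketched in PySpark under assumed paths and a made-up schema: land the streaming JSON with an explicit schema and checkpointing, so the analytics layer can query the resulting lake table.

```python
# A sketch under assumed paths and schema: streams cannot infer a schema,
# so one is declared explicitly; checkpointing gives fault tolerance.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("patient_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .schema(schema)
       .json("s3://landing/healthcare/"))

query = (raw.writeStream
         .format("parquet")  # or a lakehouse format such as Delta, if available
         .option("path", "s3://lake/healthcare_events/")
         .option("checkpointLocation", "s3://lake/_checkpoints/healthcare_events/")
         .start())
```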
Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?
It's often a dilemma: should multiple sinks working on the same data source live in the same Apache Spark Structured Streaming application or in different ones? Both solutions may be valid depending on your use case, but let's focus here on the former: multiple sinks together.
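A minimal sketch of the "multiple sinks together" option using foreachBatch, which is one common way to do it (sink paths are placeholders, not from the article): each micro-batch is cached once, then written to both destinations.

```python
# Sketch: two sinks fed from one source inside a single application.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = spark.readStream.format("rate").load()

def write_to_sinks(batch_df, batch_id):
    batch_df.persist()                                   # avoid recomputing the batch
    batch_df.write.mode("append").parquet("/tmp/sink_a")
    batch_df.write.mode("append").json("/tmp/sink_b")
    batch_df.unpersist()

query = (events.writeStream
         .foreachBatch(write_to_sinks)
         .option("checkpointLocation", "/tmp/checkpoints/multi_sink")
         .start())
```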
Today, we are excited to announce the public preview of Databricks Assistant, a context-aware AI assistant, available natively in Databricks Notebooks and the SQL editor.
Summary Feature engineering is a crucial aspect of the machine learning workflow. To make that possible, there are a number of technical and procedural capabilities that must be in place first. In this episode Razi Raziuddin shares how data engineering teams can support the machine learning workflow through the development and support of systems that empower data scientists and ML engineers to build and maintain their own features.
2 summits (credits: I cropped the image) Hey, since I said I should try to send the newsletter on a specific schedule, I did not. Haha. Still, here is the newsletter for last week. This is a small wrap-up of the Snowflake and Databricks Data + AI summits, which took place last week. There are so many sessions at both summits that it's impossible to watch everything; what's more, Databricks and Snowflake do not put everything online in free access, so I can't watch everything.
In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!
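As a taste of one of those features, here is a small sketch of dynamic task mapping, where one task instance is created per item returned at runtime; the file names are made up, and it assumes Airflow 2.3+ for expand().

```python
# Illustrative dynamic task mapping: one mapped task instance per file
# discovered at runtime.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2023, 7, 1), catchup=False)
def mapped_pipeline():
    @task
    def list_files() -> list[str]:
        return ["a.csv", "b.csv", "c.csv"]

    @task
    def process(path: str) -> None:
        print(f"processing {path}")

    # expand() creates one task instance per element at runtime.
    process.expand(path=list_files())

mapped_pipeline()
```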