Summary: A data lakehouse is intended to combine the benefits of data lakes (cost-effective, scalable storage and compute) and data warehouses (a user-friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offers the ease of use and execution speed of data warehouses with the infinite storage and scalability of data lakes.
1. Introduction
2. Setup & logging architecture
3. Data pipeline logging best practices
3.1. Metadata: information about pipeline runs & the data flowing through your pipeline
3.2. Obtain visibility into the code's execution sequence using text logs
3.3. Understand resource usage by tracking metrics
3.4. Monitoring UI & traceability
In Airflow, DAGs (your data pipelines) support nearly every use case. As these workflows grow in complexity and scale, efficiently identifying and resolving issues becomes a critical skill for every data engineer. This comprehensive guide offers best practices and examples for debugging Airflow DAGs. You'll learn how to: create a standardized debugging process to quickly diagnose errors in your DAGs, identify common issues with DAGs, tasks, and connections, and distinguish between Airflow-related and external issues.
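One concrete debugging technique worth knowing alongside any such process is Airflow's `dag.test()` (available since Airflow 2.5), which runs a DAG in a single process so ordinary breakpoints and stack traces behave like plain Python. A minimal sketch; the DAG and task names are illustrative, not from the guide:

```python
from datetime import datetime

from airflow.decorators import dag, task


# Illustrative DAG for local debugging; names are placeholders.
@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def debug_me():
    @task
    def boom():
        # A deliberate failure to step through in a debugger.
        raise ValueError("inspect me")

    boom()


if __name__ == "__main__":
    # Airflow 2.5+: runs the whole DAG in-process, no scheduler required,
    # so pdb breakpoints and IDE debuggers work as in any Python script.
    debug_me().test()
```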
I bet you know it already. You can limit the max throughput for Apache Spark Structured Streaming jobs for popular data sources such as Apache Kafka, Delta Lake, or raw files. But did you know that you can also control the lower limit, at least for Apache Kafka?
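For reference, both bounds are plain options on the Kafka source: `maxOffsetsPerTrigger` caps a micro-batch, while `minOffsetsPerTrigger` (Spark 3.2+) holds a micro-batch back until enough data has accumulated, subject to `maxTriggerDelay`. A minimal sketch; the broker address and topic name are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-rate-limits").getOrCreate()

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "events")                        # placeholder topic
    # Upper bound: at most this many offsets are processed per micro-batch.
    .option("maxOffsetsPerTrigger", "100000")
    # Lower bound (Spark 3.2+): a micro-batch fires only once at least this
    # many new offsets are available, or maxTriggerDelay has elapsed.
    .option("minOffsetsPerTrigger", "10000")
    .option("maxTriggerDelay", "5m")
    .load()
)
```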
We previously announced Snowflake’s Unistore workload, which continues Snowflake’s legacy of breaking down data silos by uniting transactional and analytical data in a consistent and governed platform. Today, we are pleased to announce that Hybrid Tables — the core feature powering Unistore — is in public preview in select AWS regions. Hybrid Tables is a new table type that enables transactional use cases within Snowflake with fast, high-concurrency point operations.
We are excited to announce the upcoming general availability of Azure Private Link support for Databricks SQL (DBSQL) Serverless, planned for April 2024.
A lot of the buzz around AI focuses on its future potential. And we get it — we’re talking about a transformative technology that presents seemingly limitless possibilities. But an important aspect of this world-changing tech story that gets lost in the hype is understanding exactly what AI solutions are available for you and your team to employ right now, today.
Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. The 3.0 release delivers the community's top-requested features, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.
In today's environment, proactive cybersecurity is crucial for any public sector agency. For many organizations, however, gathering the log data that security professionals need for effective monitoring and response remains a challenge.
There is a great evil Spirit that is haunting the streets of code in the land of programmers. It’s a Spirit of obfuscation and twisting things into what they are not. The Spirit wanders around on the loose looking for someone, and it finds ready victims among the ranks of new programmers and the innocent […]
As you design your career in data, you’ve got to avoid getting stuck in your comfort zone or allowing your manager or current situation to determine your path.
We’ve partnered with Voltron Data and the Arrow community to align and converge Apache Arrow with Velox, Meta’s open source execution engine. Apache Arrow 15 includes three new format layouts developed through this partnership: StringView, ListView, and Run-End Encoding (REE). This new convergence helps Meta and the larger community build data management systems that are unified, more efficient, and composable.
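To make the REE layout concrete, here is a small PyArrow sketch (the compute function assumes pyarrow 12 or newer): runs of repeated values collapse into parallel run-ends and values children instead of being stored element by element.

```python
import pyarrow as pa
import pyarrow.compute as pc

# Run-End Encoding: each run of equal values is stored once, alongside the
# cumulative index at which that run ends.
arr = pa.array(["a", "a", "a", "b", "b", "c"])
ree = pc.run_end_encode(arr)

print(ree.type)      # run_end_encoded<run_ends: int32, values: string>
print(ree.run_ends)  # [3, 5, 6]  -- each run's exclusive end index
print(ree.values)    # ["a", "b", "c"]
```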
Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage
There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.
Dive into transformative trends in data science, encompassing AI-powered automation, NLP, ethical considerations, decentralized computing, and interdisciplinary collaboration.
As the world grapples with the escalating climate crisis, many industries are re-examining their operations to identify and implement sustainable practices. The telecommunications industry is no exception. Telecom companies face growing pressure from consumers, investors and regulators to reduce their carbon footprint and achieve net-zero emissions.
Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives
Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-driven decisions.
New SQL Practice Problems. I’m trying something new. I get a lot of questions from folks about getting into the Data Engineering space, how to get better, grow, learn, etc. So I came up with a solution: SQL Practice Problems. Some moons ago I wrote a Data Engineering Practice repo on GitHub for free, and some 1.2K stars later […]
This week on Unapologetically Technical, I had the wonderful pleasure of interviewing Gunnar Morling, the creator of the Billion Row Challenge and Senior Staff Software Engineer at Decodable. In this episode, we talk about why it is so important to stay in a position long enough to gain experience and see the success or failure of decisions. He also shares his experiences at Red Hat and working on Debezium.
Learn a modern approach to streaming real-time data in a Jupyter Notebook. This guide covers dynamic visualizations, a quant finance use case in Python, and Bollinger Bands analysis with live data.
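The Bollinger Bands computation itself is compact. A minimal pandas sketch (the guide's own streaming implementation may differ) using the standard 20-period window and bands at two rolling standard deviations:

```python
import pandas as pd


def bollinger_bands(prices: pd.Series, window: int = 20,
                    num_std: float = 2.0) -> pd.DataFrame:
    """Classic Bollinger Bands: a rolling mean with upper/lower bands
    at +/- num_std rolling standard deviations."""
    mid = prices.rolling(window).mean()
    std = prices.rolling(window).std()
    return pd.DataFrame({
        "middle": mid,
        "upper": mid + num_std * std,
        "lower": mid - num_std * std,
    })


# Usage with placeholder data; live prices would stream in instead.
prices = pd.Series(range(100), dtype="float64")
print(bollinger_bands(prices).tail())
```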
Nearly every facet of society has felt the impact of AI since it burst into the mainstream in late 2022 with the public launch of ChatGPT. In 2024, the retail and consumer goods industry is expected to experience massive upheaval due to the proliferation of generative AI (gen AI) tools as well as changes in customer engagement and the general manner in which products are now sold.
With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You'll learn how to: understand the building blocks of DAGs and combine them in complex pipelines, schedule your DAG to run exactly when you want it to, write DAGs that adapt to your data at runtime, set up alerts and notifications, and scale your DAGs as your workloads grow.
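As a taste of those building blocks, here is a minimal TaskFlow-style DAG with dynamic task mapping, one of the runtime-adaptive features in this space; the task names and file list are illustrative, not taken from the eBook:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_etl():
    @task
    def list_files() -> list[str]:
        # Placeholder: in practice this might list objects in a bucket.
        return ["a.csv", "b.csv"]

    @task
    def process(path: str) -> None:
        print(f"processing {path}")

    # Dynamic task mapping: one mapped task instance per file, decided at runtime.
    process.expand(path=list_files())


example_etl()
```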
No. This question generated a lot of content last week, and a lot of words were written. I wanted to keep my answer short so as not to burden you with a few thousand more words to read. "Modern data stack" was coined by US companies and VCs (mainly Fivetran and dbt Labs) as a term to quickly describe a way of building a data stack in the cloud around ELT.
Why Stakeholder Management? One of the most critical aspects of project management is developing and managing relationships with everyone the project impacts. In this article, you will learn techniques for identifying stakeholders, analyzing their influence on the project, and developing strategies to communicate, set boundaries, and manage competing expectations.
Sam Wang | Sr. Technical Program Manager; Joe Gordon | Sr. Staff Software Engineer At Pinterest we are continuously looking for ways to improve our developer experience, and we have recently shipped AI-assisted development for everyone while balancing safety, security, and cost. In this blog post, we share our journey of unlocking AI-assisted development, from the initial idea to the General Availability (GA) stage.
In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!
How to Stream and Apply Real-Time Prediction Models on High-Throughput Time-Series Data. Most stream processing libraries are not Python-friendly, while the majority of machine learning and data mining libraries are Python-based. Although the Faust library aims to bring Kafka Streams ideas into the Python ecosystem, it may pose challenges in terms of ease of use.
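For orientation, a Faust pipeline is built from an app, typed topics, and async agents. A minimal sketch, assuming a local Kafka broker; the topic name, record fields, and the stand-in "model" threshold are placeholders:

```python
import faust  # pip install faust-streaming (the maintained fork)

app = faust.App("scoring-app", broker="kafka://localhost:9092")


class Reading(faust.Record):
    sensor_id: str
    value: float


readings = app.topic("readings", value_type=Reading)


@app.agent(readings)
async def score(stream):
    async for reading in stream:
        # A real pipeline would apply a prediction model per event;
        # a fixed threshold stands in for one here.
        if reading.value > 0.9:
            print(f"anomaly on {reading.sensor_id}: {reading.value}")
```

A worker is then started from the command line with `faust -A <module> worker`.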
What is Agile Testing? As the name implies, agile projects are executed quickly and with flexibility. Agile methods involve tasks executed in short iterations, or sprints. Agile Testing is likewise iterative and takes place after each sprint, rather than toward the end of the project. Testing iteratively helps validate client requirements and adapt to changing conditions more effectively.
by Herbert Kateu. The WebSocket protocol enables persistent two-way communication between a client and a server, where packets can be passed in both directions without the need for additional HTTP requests. The specification for this protocol is outlined in RFC 6455. WebSockets are used in applications such as instant messaging, gaming, simultaneous editing, and stock tickers, to mention but a few.
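To illustrate the persistent two-way channel RFC 6455 describes, here is a minimal sketch using the third-party Python `websockets` package; the echo URL is a placeholder test endpoint, not from the article:

```python
import asyncio

import websockets  # pip install websockets


async def main() -> None:
    # One HTTP handshake upgrades the connection; after that, messages
    # flow in both directions over the same socket.
    async with websockets.connect("wss://echo.websocket.org") as ws:
        await ws.send("ping")      # client -> server, no new HTTP request
        reply = await ws.recv()    # server -> client on the same socket
        print(reply)


asyncio.run(main())
```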
Speaker: Ben Epstein, Stealth Founder & CTO | Tony Karrer, Founder & CTO, Aggregage
When tasked with building a fundamentally new product line with deeper insights than previously achievable for a high-value client, Ben Epstein and his team faced a significant challenge: how to harness LLMs to produce consistent, high-accuracy outputs at scale. In this new session, Ben will share how he and his team engineered a system (based on proven software engineering approaches) that employs reproducible test variations (via temperature 0 and fixed seeds), and enables non-LLM evaluation methods to validate outputs at scale.