October, 2024

article thumbnail

Did Automattic commit open source theft?

The Pragmatic Engineer

The below was originally published in The Pragmatic Engineer. To get timely analysis on the tech industry like this, on a weekly basis: sign up to The Pragmatic Engineer Newsletter. If you are into podcasts, check out The Pragmatic Engineer Podcast. Imagine Apple decided Spotify was a big enough business threat that it had to take unfair measures to limit Spotify’s growth on the App Store.

article thumbnail

Unapologetically Technical Episode 14 – Cliff Crosland

Jesse Anderson

Unapologetically Technical’s newest episode is now live! In this episode of Unapologetically Technical, I interview Cliff Crosland, the co-founder and CEO of Scanner.dev. Cliff Crosland is a data engineer passionate about helping people wrangle massive log volumes. He sees logs as a treasure trove of insights and believes effective log analysis is critical in today’s complex systems.

Insiders

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Trending Sources

article thumbnail

Modern Data Architecture: Data Mesh and Data Fabric 101

Precisely

Key Takeaways: Data mesh is a decentralized approach to data management, designed to shift creation and ownership of data products to domain-specific teams. Data fabric is a unified approach to data management, creating a consistent way to manage, access, and share data across distributed environments. Both approaches empower your organization to be more agile, data-driven, and responsive so you can make informed decisions in real time.

article thumbnail

Testing DuckDB’s Large Than Memory Processing Capabilities.

Confessions of a Data Guy

I am a glutton for punishment, a harbinger of tidings, a storm crow, a prophet of the data land, my sole purpose is to plumb the depths of the tools we use every day in Data Engineering. I find the good, the bad, the ugly, and splay them out before you, string ’em up and […] The post Testing DuckDB’s Large Than Memory Processing Capabilities. appeared first on Confessions of a Data Guy.

Process 113
article thumbnail

A Guide to Debugging Apache Airflow® DAGs

In Airflow, DAGs (your data pipelines) support nearly every use case. As these workflows grow in complexity and scale, efficiently identifying and resolving issues becomes a critical skill for every data engineer. This is a comprehensive guide with best practices and examples to debugging Airflow DAGs. You’ll learn how to: Create a standardized process for debugging to quickly diagnose errors in your DAGs Identify common issues with DAGs, tasks, and connections Distinguish between Airflow-relate

article thumbnail

Demystifying Azure Storage Account network access

Towards Data Science

Demystifying Azure Storage Account Network Access Service endpoints and private endpoints hands-on: including Azure Backbone, storage account firewall, DNS, VNET and NSGs Connected Network — image by Nastya Dulhiier on Unsplash 1. Introduction Storage accounts play a vital role in a medallion architecture for establishing an enterprise data lake. They act as a centralized repository, enabling seamless data exchange between producers and consumers.

article thumbnail

Microsoft’s Drasi: An Open-Source Tool for Efficient Change Management Systems

Analytics Vidhya

Introduction Today, data systems evolve quickly, demanding efficient monitoring and response. Real-time change detection is essential to keeping systems stable, preventing failures, and ensuring business continuity. Microsoft’s open-source tool, Drasi, addresses this need by effortlessly detecting, monitoring, and responding to data changes across platforms, including relational and graph databases.

Systems 175

More Trending

article thumbnail

10 GitHub Repositories for Advanced Machine Learning Projects

KDnuggets

Where can you find projects dealing with advanced ML topics? GitHub is a perfect source with its many repositories. I’ve selected ten to talk about in this article.

article thumbnail

Data Engineering Weekly #195

Data Engineering Weekly

Astasia Myers: The three components of the unstructured data stack LLMs and vector databases significantly improved the ability to process and understand unstructured data. I never thought of PDF as a self-contained document database, but that seems a reality that we can’t deny. The blog is an excellent summary of the existing unstructured data landscape.

article thumbnail

What is an AI Data Engineer? 4 Important Skills, Responsibilities, & Tools

Monte Carlo

The rise of AI and GenAI has brought about the rise of new questions in the data ecosystem – and new roles. One job that has become increasingly popular across enterprise data teams is the role of the AI data engineer. Demand for AI data engineers has grown rapidly in data-driven organizations. But what does an AI data engineer do? What are they responsible for?

article thumbnail

Complete Guide to Data Transformation: Basics to Advanced

Ascend.io

What is Data Transformation? Data transformation is the process of converting raw data into a usable format to generate insights. It involves cleaning, normalizing, validating, and enriching data, ensuring that it is consistent and ready for analysis. Data transformation is key for data-driven decision-making, allowing organizations to derive meaningful insights from varied data sources.

article thumbnail

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Speaker: Tamara Fingerlin, Developer Advocate

Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.

article thumbnail

Cloudera and Snowflake Partner to Deliver the Most Comprehensive Open Data Lakehouse

Cloudera

In August, we wrote about how in a future where distributed data architectures are inevitable, unifying and managing operational and business metadata is critical to successfully maximizing the value of data, analytics, and AI. One of the most important innovations in data management is open table formats, specifically Apache Iceberg , which fundamentally transforms the way data teams manage operational metadata in the data lake.

article thumbnail

Interesting startup idea: benchmarking cloud platform pricing

The Pragmatic Engineer

Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover topics related to Big Tech and startups through the lens of engineering managers and senior engineers. In this article, we cover one section from this week’s from last week’s The Pulse issue. To get full issues twice a week, subscribe here.

Cloud 332
article thumbnail

7 Computer Vision Projects for All Levels

KDnuggets

Each project, from beginner tasks like Image Classification to advanced ones like Anomaly Detection, includes a link to the dataset and source code for easy access and implementation.

Project 153
article thumbnail

Skip Lines of CSV files with DuckDB and Polars

Confessions of a Data Guy

There are some things you don’t need until you need them. I ran into that situation recently with needing to process some CSV / Flatfiles on short notice. At first, it appeared to be easy, but then I realized, as usual, there was a little monkey wrench thrown into the middle of it. It is […] The post Skip Lines of CSV files with DuckDB and Polars appeared first on Confessions of a Data Guy.

Process 147
article thumbnail

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage

There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.

article thumbnail

Introducing Databricks Apps

databricks

Summary Databricks Apps, a new way to build and deploy internal data and AI applications, is now available in Public Preview on AWS.

AWS 144
article thumbnail

Introducing a New Visual Identity Reflecting Robinhood’s Growth and Vision for the Future

Robinhood

When Robinhood was founded, we set out to build a platform that gives everyone access to the financial markets. Over the last decade, we’ve disrupted and changed the industry for the better, becoming the first U.S. retail broker to offer commission-free trading, and saving investors billions in the process. In recent years, we’ve expanded our offering, ushering in a number of new cutting-edge products and services that help everyone – regardless of income – trade, invest, and earn.

Banking 136
article thumbnail

Migrating in-place from PostgreSQL to MySQL

Yelp Engineering

The Yelp Reservations service (yelp_res) is the service that powers reservations on Yelp. It was acquired along with Seatme in 2013, and is a Django service and webapp. It powers the reservation backend and logic for Yelp Guest Manager, our iPad app for restaurants, and handles diner and partner flows that create reservations. Along with that, it serves a web UI and backend API for our Yelp Reservations app, which has been superseded by Yelp Guest Manager but is still used by many of our restaur

MySQL 135
article thumbnail

Enabling Seamless Cloud Migration and Real-Time Data Integration for a Nonprofit Educational Healthcare Organization with Striim

Striim

A nonprofit educational healthcare organization is faced with the challenge of modernizing its critical systems while ensuring uninterrupted access to essential services. With Striim’s real-time data integration solution, the institution successfully transitioned to a cloud infrastructure, maintaining seamless operations and paving the way for future advancements.

article thumbnail

How to Modernize Manufacturing Without Losing Control

Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives

Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri

article thumbnail

7 Cool Data Science Project Ideas for Beginners

KDnuggets

Are you a data science beginner looking to build your portfolio? Start working on these projects today.

article thumbnail

25 SQL tips to level up your data engineering skills

Start Data Engineering

Introduction Setup SQL tips 1. Handy functions for common data processing scenarios 1.1. Need to filter on WINDOW function without CTE/Subquery use QUALIFY 1.2. Need the first/last row in a partition, use DISTINCT ON 1.3. STRUCT data types are sorted based on their keys from left to right 1.4. Get the first/last element with ROW_NUMBER() + QUALIFY 1.5.

SQL 130
article thumbnail

The Long Context RAG Capabilities of OpenAI o1 and Google Gemini

databricks

Retrieval Augmented Generation (RAG) is the top use case for Databricks customers who want to customize AI workflows on their own data. The.

Data 143
article thumbnail

The Rise of the Declarative Data Stack

Simon Späti

Data stacks have come a long way, evolving from monolithic, one-fits-all systems like Oracle/SAP to today’s modular open data stacks. This begs the question, what’s next? Or why is the current not meeting our needs? As we see more analytics engineering and software best practices, embracing codeful, Git-based, and more CLI-based workflows, the future looks more code-first.

SQL 130
article thumbnail

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

article thumbnail

How To Automate PDF Data Extraction – 3 Different Methods To Parse PDFs For Analytics

Seattle Data Guy

I.f you work in data, then at some point in your career, you’ll likely need to parse data from a PDF. You might need to parse thousands of PDFs in order to pull out invoice information. Or maybe you need to parse financial filing documents such as 10-Ks. This can seem challenging at first. Afterall,… Read more The post How To Automate PDF Data Extraction – 3 Different Methods To Parse PDFs For Analytics appeared first on Seattle Data Guy.

Data 130
article thumbnail

Webinar: DataOps For Beginners – 2024

DataKitchen

“That should take two hours, not two months. Can’t your Data & Analytics Team go any faster?” “The executives’ dashboard broke! The data’s wrong! Can I ever trust our data?” If you’ve ever heard (or had) these complaints about speed-to-insight or data reliability, you should watch our webinar, DataOps for Beginners, on demand. DataKitchen’s VP Gil Benghiat breaks down what DataOps is (spoiler: it’s not just DevOps for data) and how DataOps can take your Data & Analytics factory fro

Data 52
article thumbnail

Roadmap for AI Engineers

KDnuggets

Learn about 10 easy steps to becoming an AI engineer in 2024.

article thumbnail

How to use nested data types effectively in SQL

Start Data Engineering

1. Introduction 2. Code & Data 3. Using nested data types effectively 3.1. Use STRUCT for one-to-one & hierarchical relationships 3.2. Use ARRAY[STRUCT] for one-to-many relationships 3.3. Using nested data types in data processing 3.3.1. STRUCT enables more straightforward data schema and data access 3.3.2. Nested data types can be sorted 3.3.3.

SQL 130
article thumbnail

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

article thumbnail

Open Source Security at Databricks

databricks

The Databricks Product Security team is deeply committed to ensuring the security and integrity of its products, which are built on top of.

IT 138
article thumbnail

The Enterprise Case for DuckDB: 5 Key Categories and Why Use It

Simon Späti

DuckDB has a significant share and is frequently featured in the latest data engineering news. However, it’s still in its early adopter phase and has yet to be adopted by larger enterprises. Sure, all data creators and startups have used and potentially grown to love DuckDB, but is it also suitable for enterprises? What about scaling out and sharing it with others in the organization?

IT 130
article thumbnail

React at Meta Connect 2024

Engineering at Meta

At Meta, React and React Native are more than just tools; they are integral to our product development and innovation. With over five thousand people at Meta building products and experiences with React every month, these technologies are fundamental to our engineering culture and our ability to quickly build and ship high quality products. In this post, we will dive into the development experiences of some of the product teams who leveraged React and React Native to deliver exciting projects sh

Coding 126
article thumbnail

Tales from the Pipeline: 4 Data Horror Stories To Keep You Up at Night

Monte Carlo

“As he lay awake in his Bay Area apartment, the data leader couldn’t shake the feeling that something wasn’t right. He tried to shut his eyes—to force them closed—but the more the data engineer tried, the more convinced he became. Suddenly, a light appeared from the darkness. It was a Slack from the CEO. She was working late. And the data…it couldn’t be…it looked wrong.

article thumbnail

The Ultimate Guide to Apache Airflow DAGS

With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You’ll learn how to: Understand the building blocks DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to Write DAGs that adapt to your data at runtime and set up alerts and notifications Scale you