Sat. Aug 05, 2023 - Fri. Aug 11, 2023

Why Is Data Modeling So Challenging – How To Data Model For Analytics

Seattle Data Guy

Learning how to data model from basic star schemas on the internet is like learning data science using the IRIS data set. It works great as a toy example, but it doesn’t match real life at all. Data modeling in real life requires that you fully understand the data sources and your business use cases.…

A senior engineer/EM job search story

The Pragmatic Engineer

👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover topics related to Big Tech and startups through the lens of engineering managers and senior engineers. In this article, we cover one out of five topics from today’s subscriber-only The Pulse issue. To get full issues twice a week, subscribe here.

Senior Engineer – The Number One Skill

Confessions of a Data Guy

Do you think I’m just trying to get you to click? Maybe. Maybe not. After working in and around Data Teams for well over a decade, with both the smartest people to touch the keyboard, and the others, it’s become quite clear to me what the number one skill that identifies a Senior level Engineering […]

_spark_metadata in Apache Spark Structured Streaming issue is no more!

Waitingforcode

Thanks to table file formats, there are probably fewer people working with flat files in Structured Streaming today than 5 years ago. However, if you are in this group and are still generating CSV or JSON files with the streaming sink, brace yourself: memory problems are coming if you don't take action!

A Guide to Debugging Apache Airflow® DAGs

In Airflow, DAGs (your data pipelines) support nearly every use case. As these workflows grow in complexity and scale, efficiently identifying and resolving issues becomes a critical skill for every data engineer. This is a comprehensive guide with best practices and examples for debugging Airflow DAGs. You’ll learn how to: Create a standardized process for debugging to quickly diagnose errors in your DAGs; Identify common issues with DAGs, tasks, and connections; Distinguish between Airflow-relate

Quantifying The Return On Investment For Your Data Team

Data Engineering Podcast

Summary: As businesses increasingly invest in technology and talent focused on data engineering and analytics, they want to know whether they are benefiting. So how do you calculate the return on investment for data? In this episode Barr Moses and Anna Filippova explore that question and provide useful exercises to start answering that in your company.

Are reports of StackOverflow’s fall greatly exaggerated?

The Pragmatic Engineer

More Trending

Confluent Champion: Niki Kapsi’s Journey From SDR to Commercial Account Executive

Confluent

Meet Commercial AE Niki Kapsi and learn about the “entrepreneurial” side of her role at Confluent.

What is Data Observability? 5 Key Pillars To Know

Monte Carlo

Editor’s Note: So much has happened since we first published this post and created the data observability category and Monte Carlo in 2019. We have updated this post to reflect this rapidly maturing space. You can read the original article linked at the bottom of this page. What is data observability? The five pillars. My data observability definition has not changed since I first coined it in 2019: Data observability refers to an organization’s comprehensive understanding of the health an

What’s new with Databricks SQL?

databricks

At this year's Data+AI Summit, Databricks SQL continued to push the boundaries of what a data warehouse can be, leveraging AI across the.

Data Scientists Need to Specialize to Survive the Tech Winter

KDnuggets

In this article, I explore the benefits of specialization for data scientists. Drawing on my own experience as a data scientist, I argue that specializing in a specific area can help you stand out in a crowded job market and provide you with more fulfilling career opportunities.

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

Speaker: Tamara Fingerlin, Developer Advocate

Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.

5 Ways Generative AI Changes How Companies Approach Data (And How It Doesn’t)

Towards Data Science

Experts from venture capital, Snowflake, and more discuss how generative AI will benefit data teams and the challenges they must solve. Generative AI is not a new concept. It’s been studied for decades and applied in limited capacities. That is until ChatGPT shocked and awed our collective consciousness in late 2022.

Startup Spotlight: Tesorio Helps Finance Teams Tackle Cash Flow Challenges

Snowflake

Welcome to Snowflake’s Startup Spotlight, where we learn about awesome companies building businesses on Snowflake. Can accounts receivable be an agent of change? Tesorio Co-Founder and CTO Fabio Fleitas thinks so, and his startup’s AI/ML-driven platform aims to give finance teams better control over their cash flow so they can have greater impact on their organizations’ success.

How to execute your operating model for Data and AI

databricks

In Part 1 of this blog series, we discussed how Databricks enables organizations to develop, manage and operate processes that extract value from.

Best Python Tools for Building Generative AI Applications Cheat Sheet

KDnuggets

KDnuggets' new cheat sheet summarizes the top Python libraries for building generative AI apps, from OpenAI and Transformers to tools like Gradio, Diffusers, LangChain, and more. Ideal for both beginners and experts looking for a quick reference.

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage

There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.

How to Build a Fully Automated Data Drift Detection Pipeline

Towards Data Science

An Automated Guide to Detecting and Handling Data Drift

Supercharging your Rust static executables with mimalloc

Tweag

Why link statically against musl? Have you ever faced compatibility issues when dealing with Linux binary executables? The culprit is often the libc implementation, glibc. Acting as the backbone of nearly all Linux distros, glibc is the library responsible for providing standard C functions. Yet, its version compatibility often poses a challenge. Binaries compiled with a newer version of glibc may not function on systems running an older one, creating a compatibility headache.

How Verana Health Uses the Databricks Lakehouse to Democratize Data and Deploy AI for Medical Innovation

databricks

Across industries, data scientists spend up to 80% of their time trying to properly prepare and cleanse datasets for data mining and artificial.

Fundamentals Of Statistics For Data Scientists and Analysts

KDnuggets

Key statistical concepts for your data science or data analysis journey.

How to Modernize Manufacturing Without Losing Control

Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives

Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri

Pioneering Data Observability: Data, Code, Infrastructure, & AI

Towards Data Science

Outlining the past, present, and future of architecting reliable data systems. When we launched the data observability category in 2019, the term was something I could barely pronounce.

The LLM Factory: Driven by Snowflake and NVIDIA 

Snowflake

Snowflake recently announced a collaboration with NVIDIA to make it easy to run NVIDIA accelerated computing workloads directly within Snowflake accounts. One interesting use case is to train, customize, and deploy large language models (LLMs) safely and securely within Snowflake. Our new Snowpark Container Services, currently in private preview, together with NVIDIA AI, makes this possible.

A New Partnership with Redox and How We Unlock Healthcare Data to Drive Advanced Analytics

databricks

Healthcare is sitting on mountains of data. Pop quiz: Which industry accounts for about 30% of newly created data around the world and.

Overcoming Barriers in Multi-lingual Voice Technology: Top 5 Challenges and Innovative Solutions

KDnuggets

Voice assistants like Siri, Alexa and Google Assistant are household names, but they still don't do well in multilingual settings. This article first provides an overview of how voice assistants work, and then dives into the top 5 challenges for voice assistants when it comes to providing a superior multilingual user experience. It also provides strategies for mitigation of these challenges.

The Ultimate Guide to Apache Airflow DAGs

With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You’ll learn how to: Understand the building blocks of DAGs, combine them into complex pipelines, and schedule your DAG to run exactly when you want it to; Write DAGs that adapt to your data at runtime and set up alerts and notifications; Scale you

What is an Apache Kafka Cluster? (And Why You Should Care)

Confluent

Learn what an Apache Kafka cluster is, and what makes a cluster special.

Scaling the Instagram Explore recommendations system

Engineering at Meta

Explore is one of the largest recommendation systems on Instagram. We leverage machine learning to make sure people are always seeing content that is the most interesting and relevant to them. Using more advanced machine learning models, like Two Towers neural networks, we’ve been able to make the Explore recommendation system even more scalable and flexible.

Multiple Stateful Operators in Structured Streaming

databricks

In the world of data engineering, there are operations that have been used since the birth of ETL. You filter. You join. You.

A Comprehensive Guide to MLOps

KDnuggets

Machine Learning Operations (MLOps) is a relatively new discipline that provides the structure and support necessary for machine learning (ML) models to thrive in production environments.

Apache Airflow® Best Practices: DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

Reimagining a classic Cheysson thematic map

ArcGIS

Here's a re-think of a classic. I'll rationalize some data-viz and layout choices and end up with something completely different.

Using short-lived certificates to protect TLS secrets

Engineering at Meta

Short-lived certificates (SLCs) are part of our latest efforts to further secure our Transport Layer Security (TLS) private keys on our edge networks. SLCs have a very short exposure compared to traditional certificates and lower the chances of a compromised private key being abused. Implementing SLCs has required us to address tradeoffs between operability and reliability, while satisfying the strict security requirements of our edge environment.

HDFS Snapshot Best Practices

Cloudera

Introduction: The snapshots feature of the Apache Hadoop Distributed Filesystem (HDFS) enables you to capture point-in-time copies of the file system and protect your important data against corruption and user or application errors. This feature is available in all versions of Cloudera Data Platform (CDP), Cloudera Distribution for Hadoop (CDH) and Hortonworks Data Platform (HDP).

Unveiling StableCode: A New Horizon in AI-Assisted Coding

KDnuggets

This article explores StableCode, an innovative AI product by Stability AI, designed to enhance coding efficiency and accessibility. It delves into its unique features, underlying technology, and potential impact on the developer community.

How to Achieve High-Accuracy Results When Using LLMs

Speaker: Ben Epstein, Stealth Founder & CTO | Tony Karrer, Founder & CTO, Aggregage

When tasked with building a fundamentally new product line with deeper insights than previously achievable for a high-value client, Ben Epstein and his team faced a significant challenge: how to harness LLMs to produce consistent, high-accuracy outputs at scale. In this new session, Ben will share how he and his team engineered a system (based on proven software engineering approaches) that employs reproducible test variations (via temperature 0 and fixed seeds), and enables non-LLM evaluation m