My Obsidian Note-Taking Workflow
Simon Späti
JULY 28, 2024
A Vim-Inspired Approach to Efficient Note Management with Obsidian and Markdown
Simon Späti
JULY 28, 2024
A Vim-Inspired Approach to Efficient Note Management with Obsidian and Markdown
Start Data Engineering
JULY 16, 2024
1. Introduction 2. Data Quality(DQ) checks are run as part of your pipeline 2.1. Ensure your consumers don’t get incorrect data with output DQ checks 2.2. Catch upstream issues quickly with input DQ checks 2.3. Waiting a long time to run output DQ checks? Save time & money with mid-pipeline DQ checks. 2.4. Track incoming and outgoing row counts with Audit logs 3.
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Confessions of a Data Guy
JULY 24, 2024
I’ve had something rattling around in the old noggin for a while; it’s just another strange idea that I can’t quite shake out. We all keep hearing about Arrow this and Arrow that … seems every new tool built today for Data Engineering seems to be at least partly based on Arrow’s in-memory format. So, […] The post PyArrow vs Polars (vs DuckDB) for Data Pipelines. appeared first on Confessions of a Data Guy.
The Pragmatic Engineer
JULY 15, 2024
The past 18 months have seen major change reshape the tech industry. What does it all mean for businesses and dev teams – and what will pragmatic software engineering approaches look like in the future? I tackled these burning questions in my conference talk, “What’s Old is New Again,” which was the keynote of the Craft Conference in May 2024.
Advertisement
Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?
Christophe Blefari
JULY 26, 2024
Tallinn ( credits ) Dear members, it's Summer Data News, the only news you can consume by the pool, the beach or at the office—if you're not lucky. This week, I'm writing from the Baltics, nomading a bit in Eastern and Northern Europe. I'm pleased to announce that we have successfully closed the CfP for Forward Data Conf, we received nearly 100 submissions and the program committee is currently reviewing all submissions.
Waitingforcode
JULY 16, 2024
With this blog I'm starting a follow-up series for my Data+AI Summit 2024 talk. I missed this family of blog posts a lot as the previous DAIS with me as speaker was 4 years ago! As previously, this time too I'll be writing several blog posts that should help you remember the talk and also cover some of the topics left aside because of the time constraints.
Data Engineering Digest brings together the best content for data engineering professionals from the widest variety of industry thought leaders.
Start Data Engineering
JULY 26, 2024
1. Introduction 2. Project overview 3. Check your data before making it available to end-users; Write-Audit-Publish(WAP) pattern 4. TL;DR: How the greatexpectations library works 4.1. greatexpectations quick setup 5. From an implementation perspective, there are four types of tests 5.1. Running checks on one dataset 5.2. Checks involving the current dataset and its historical data 5.3.
Confluent
JULY 29, 2024
Apache Kafka 3.8 adds 17 new KIPs (13 for Core, 3 for Streams & 1 for Connect). Highlights include 2 new Docker images, the ability to set task assignors, and more!
Snowflake
JULY 9, 2024
Snowflake is committed to helping customers protect their accounts and data. That’s why we have been working on product capabilities that allow Snowflake admins to make multifactor authentication (MFA) mandatory and monitor compliance with this new policy. As part of that effort, today we’re announcing several key features: 1. A new authentication policy that requires MFA for all users in a Snowflake account 2.
Christophe Blefari
JULY 13, 2024
EuroSeagull ( credits ) Dear members, it's been a few weeks since I did not catch you on a proper Data News with a collection of links. Here we are. This week, I attended EuroPython in Prague. While I spent most of my time at the dltHub booth in the sponsors hall, I didn't attend many talks. However, I did give a few presentations on my SQL orchestration library, yato , which pairs well with dlt.
Speaker: Jay Allardyce, Deepak Vittal, and Terrence Sheflin
As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.
Waitingforcode
JULY 10, 2024
Welcome to the first Data+AI Summit 2024 retrospective blog post. I'm opening the series with the topic close to my heart at the moment, stream processing!
databricks
JULY 8, 2024
The Prodvana team joins Databricks to support new innovations in the Data Intelligence Platform infrastructure. Learn more about the vision and what's ahead.
KDnuggets
JULY 15, 2024
Is it possible to learn data engineering for free? I claim it is and present the evidence for that in the form of 10 free data engineering courses.
Data Engineering Weekly
JULY 21, 2024
Editor’s Note: A New Series on Data Engineering Tools Evaluation There are plenty of data tools and vendors in the industry. But how can we choose a tool for the specific need? The traditional evaluation of running PoC on all the selected vendor tools is time-consuming and practically unviable for growth-driven companies. Data Engineering Weekly is launching a new series on software evaluation focused on data engineering to better guide data engineering leaders in evaluating data tools.
Speaker: Nikhil Joshi, Founder & President of Snic Solutions
Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.
Snowflake
JULY 25, 2024
Snowflake Cortex Search, a fully managed search service for documents and other unstructured data, is now in public preview. With Cortex Search, organizations can effortlessly deploy retrieval-augmented generation (RAG) applications with Snowflake, powering use cases like customer service, financial research and sales chatbots. Cortex Search offers state-of-the-art semantic and lexical search over your text data in Snowflake behind an intuitive user interface, and it comes with the robust securi
Engineering at Meta
JULY 16, 2024
The key to developer velocity across AI lies in minimizing time to first batch (TTFB) for machine learning (ML) engineers. AI Lab is a pre-production framework used internally at Meta. It allows us to continuously A/B test common ML workflows – enabling proactive improvements and automatically preventing regressions on TTFB. AI Lab prevents TTFB regressions whilst enabling experimentation to develop improvements.
Precisely
JULY 12, 2024
Data can be your organization’s most valuable asset, but only if it’s data you can trust. When companies work with data that is untrustworthy for any reason, it can result in incorrect insights, skewed analysis, and reckless recommendations to become data integrity vs data quality. Two terms can be used to describe the condition of data: data integrity and data quality.
databricks
JULY 31, 2024
We’re excited to announce the Public Preview of LakeFlow Connect for SQL Server, Salesforce, and Workday. These ingestion connectors enable simple and efficient.
Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage
When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.
KDnuggets
JULY 12, 2024
Discover the essential tools every data scientist should know to elevate their data science game, from Python and R to SQL and advanced visualization tools.
Data Engineering Weekly
JULY 28, 2024
Meta: Introducing Llama 3.1: Our most capable models to date Probability one of the hottest announcements this week is Llama 3.1 release - the first-ever open-sourced frontier AI model competitive with leading foundation models across a range of tasks, including GPT-4, GPT-4o, and Claude 3.5 Sonnet. The Llama3 herd of models is an insightful paper that helps one deeply understand the foundational model.
Zalando Engineering
JULY 24, 2024
A disrupted gaming night I do not usually read code when dealing with production incidents, as it is one of the slower ways to understand and mitigate what is happening. But on that Friday night, I was glad I did. I was about to start another session of Elden Ring (a video game in which everything is pretty much trying to kill the player) when I was paged with the following: "campaign service is consuming all resources we throw at it".
Netflix Tech
JULY 22, 2024
By Jun He , Natallia Dzenisenka , Praneeth Yenugutala , Yingyi Zhang , and Anjali Norwood TL;DR We are thrilled to announce that the Maestro source code is now open to the public! Please visit the Maestro GitHub repository to get started. If you find it useful, please give us a star. What is Maestro Maestro is a general-purpose, horizontally scalable workflow orchestrator designed to manage large-scale workflows such as data pipelines and machine learning model training pipelines.
Speaker: Donna Laquidara-Carr, PhD, LEED AP, Industry Insights Research Director at Dodge Construction Network
In today’s construction market, owners, construction managers, and contractors must navigate increasing challenges, from cost management to project delays. Fortunately, digital tools now offer valuable insights to help mitigate these risks. However, the sheer volume of tools and the complexity of leveraging their data effectively can be daunting. That’s where data-driven construction comes in.
ArcGIS
JULY 12, 2024
Esri is working with partners (Maxar, TomTom) to enhance our 3D basemaps with high-quality commercial data for elevation and buildings layers.
databricks
JULY 23, 2024
We are excited to partner with Meta to release the Llama 3.1 series of models on Databricks, further advancing the standard of powerful.
KDnuggets
JULY 9, 2024
This article is for anyone looking to maximize their use of Amazon Web Services (AWS) generative AI (GenAI) services. Here are eight courses that range from beginner to expert level.
Engineering at Meta
JULY 10, 2024
Meta’s advertising business leverages large-scale machine learning (ML) recommendation models that power millions of ads recommendations per second across Meta’s family of apps. Maintaining reliability of these ML systems helps ensure the highest level of service and uninterrupted benefit delivery to our users and advertisers. To minimize disruptions and ensure our ML systems are intrinsically resilient, we have built a comprehensive set of prediction robustness solutions that ensure stability w
Speaker: Aindra Misra, Senior Manager, Product Management (Data, ML, and Cloud Infrastructure) at BILL
Join us for an insightful webinar that explores the critical intersection of data privacy and AI governance. In today’s rapidly evolving tech landscape, building robust governance frameworks is essential to fostering innovation while staying compliant with regulations. Our expert speaker, Aindra Misra, will guide you through best practices for ensuring data protection while leveraging AI capabilities.
Cloudera
JULY 10, 2024
There’s nothing worse than wasting money on unnecessary costs. In on-premises data estates, these costs appear as wasted person-hours waiting for inefficient analytics to complete, or troubleshooting jobs that have failed to execute as expected, or at all. They manifest as idle hardware waiting for urgent workloads to come in, ensuring sufficient spare capacity to run them amidst noisy neighbors and resource-hungry, lower-priority workloads.
Data Engineering Weekly
JULY 14, 2024
Canva: How Canva collects 25 billion events per day Canva writes about its event collection infrastructure capabilities, handling 25 billion events per day (800 billion events per month) with 99.999% uptime. At our team’s inception, a key decision we made, one we still believe to be a big part of our success, was that every collected event must have a machine-readable, well-documented schema.
ArcGIS
JULY 15, 2024
In this blog article, we'll explore BlueBikes data, a bike share service in bustling Boston, and uncover hidden insights through the power of visualization
databricks
JULY 22, 2024
Evaluating long-form LLM outputs quickly and accurately is critical for rapid AI development. As a result, many developers wish to deploy LLM-as-judge methods.
Advertisement
Entity Resolution Sometimes referred to as data matching or fuzzy matching, entity resolution, is critical for data quality, analytics, graph visualization and AI. Learn what entity resolution is, why it matters, how it works and its benefits. Advanced entity resolution using AI is crucial because it efficiently and easily solves many of today’s data quality and analytics problems.
Let's personalize your content