September, 2023

Top 20 Data Engineering Project Ideas [With Source Code]

Analytics Vidhya

Data engineering plays a pivotal role in the vast data ecosystem by collecting, transforming, and delivering data essential for analytics, reporting, and machine learning. Aspiring data engineers often seek real-world projects to gain hands-on experience and showcase their expertise. This article presents the top 20 data engineering project ideas with their source code.

Why are Cloud Development Environments Spiking in Popularity, Now?

The Pragmatic Engineer

👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover topics related to Big Tech and startups through the lens of engineering managers and senior engineers. In this article, we cover a fresh industry trend: Cloud Development Environments. This is an analysis that full subscribers received three weeks ago.

Airflow XCOM: The Ultimate Guide

Marc Lamberti

Wondering how to share data between tasks? What are XComs in Apache Airflow? Well, you are in the right place. In this tutorial, you will learn about XComs in Airflow: what they are, how they work, how you can define them, how to get them, and more. If you checked my course “Apache Airflow: The Hands-On Guide”, Airflow XCom should not sound unfamiliar.

ETL vs. ELT?

Waitingforcode

In our social media and marketing-driven era, it's quite hard to get things right. For me, one common misconception brought by the Modern Data Stack idea is that everything should now be ELT. In fact, no: it can be, but it doesn't have to be.

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.

An Overview Of The State Of Data Orchestration In An Increasingly Complex Data Ecosystem

Data Engineering Podcast

Summary Data systems are inherently complex and often require integration of multiple technologies. Orchestrators are centralized utilities that control the execution and sequencing of interdependent operations. This offers a single location for managing visibility and error handling so that data platform engineers can manage complexity. In this episode Nick Schrock, creator of Dagster, shares his perspective on the state of data orchestration technology and its application to help inform its im
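
To make "execution and sequencing of interdependent operations" concrete, here is a minimal sketch of that idea using only the Python standard library. It is not how Dagster or any other orchestrator is implemented; the task names, the `run_pipeline` helper, and the status strings are hypothetical and chosen for illustration.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order and report one status per task.

    tasks:        maps a task name to a zero-argument callable (hypothetical).
    dependencies: maps a task name to the set of tasks it must wait for.
    """
    status = {}
    # static_order() yields every task after all of its upstream tasks.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(name, ())
        if any(status[u] != "success" for u in upstream):
            # Central error handling: never run a task whose input is missing.
            status[name] = "skipped"
            continue
        try:
            tasks[name]()
            status[name] = "success"
        except Exception as exc:
            status[name] = f"failed: {exc}"
    return status


log = []
tasks = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": lambda: log.append("load"),
}
dependencies = {"transform": {"extract"}, "load": {"transform"}}

status = run_pipeline(tasks, dependencies)
```

The returned `status` dictionary is the "single location" for visibility in this toy version: one place records which steps ran, failed, or were skipped because something upstream broke.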

The Role of DevOps and CI/CD in Data Engineering

Confessions of a Data Guy

In the vast world of data, it’s not just about gathering and analyzing information anymore; it’s also about ensuring that data pipelines, processes, and platforms run seamlessly and efficiently. Nothing screams “we are flying by night” like coming into a Data Team only to find no tests, no docs, no deployments, no Docker, no nothing. […]

More Trending

Bun: lessons from disrupting a tech ecosystem

The Pragmatic Engineer

👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. We cover one out of four topics in yesterday’s subscriber-only The Pulse issue. To get full newsletters twice a week, subscribe here. Two weeks ago, a JavaScript runtime and toolkit called Bun was released and took the Node.js world by storm. Bun was mostly built by Jarred Sumner, a former Stripe engineer and recipient of the Thiel Fellowship (a grant of $100,000 for young people to drop out of s

Upgrade your Modern Data Stack

Christophe Blefari

Hello, another edition of Data News. This week, we're going to take a step back and look at the current state of data platforms. What are the current trends, and why are people fighting over the concept of the modern data stack? Early September is usually conference season. All over the world, people gather in huge venues to attend conferences.

Arbitrary stateful processing in PySpark with applyInPandasWithState

Waitingforcode

It's always a huge pleasure to see the PySpark API covering more and more Scala API features. Starting from Apache Spark 3.4.0 you can even write arbitrary stateful processing jobs! But since the API is a little bit different than the one available on the Scala side, I wanted to take a deeper look.

Building Linked Data Products With JSON-LD

Data Engineering Podcast

Summary A significant amount of time in data engineering is dedicated to building connections and semantic meaning around pieces of information. Linked data technologies provide a means of tightly coupling metadata with raw information. In this episode Brian Platz explains how JSON-LD can be used as a shared representation of linked data for building semantic data products.
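
As a rough illustration of what "coupling metadata with raw information" looks like in JSON-LD, the sketch below builds a small document whose `@context` maps short keys to vocabulary IRIs, then resolves those keys by hand. The dataset and team URLs are made up, and `expand_keys` is a hypothetical helper covering only a tiny subset of the expansion algorithm defined by the JSON-LD specification; a real project would rely on a conforming JSON-LD processor.

```python
import json

# A JSON-LD document: plain JSON plus an @context that gives each key a meaning.
doc = {
    "@context": {
        "name": "http://schema.org/name",
        "creator": {"@id": "http://schema.org/creator", "@type": "@id"},
    },
    "@id": "http://example.com/datasets/orders",          # made-up identifier
    "@type": "http://schema.org/Dataset",
    "name": "Orders",
    "creator": "http://example.com/teams/data-platform",  # made-up identifier
}


def expand_keys(document):
    """Replace each short key with the IRI its @context assigns to it.

    Simplified on purpose: handles only flat documents whose context terms
    are either an IRI string or a dict carrying an "@id".
    """
    context = document["@context"]
    expanded = {}
    for key, value in document.items():
        if key.startswith("@"):
            continue  # keywords such as @id and @type are left out here
        term = context[key]
        iri = term["@id"] if isinstance(term, dict) else term
        expanded[iri] = value
    return expanded


print(json.dumps(expand_keys(doc), indent=2))
```

Because the meaning of every key travels with the data, two systems that never agreed on field names can still tell that both are talking about `http://schema.org/name`.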

Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

DuckDB + Delta Lake (the new lake house?)

Confessions of a Data Guy

I always leave it to my dear readers and followers to give me pokes in the right direction. Nothing like the teeming masses to set you straight. Recently I was working on my Substack Newsletter, on the topic of Polars + Delta Lake, reading remote files from s3 … I left a question open on […]

5 Free Books to Help You Master Python

KDnuggets

From the basics of Python to clean architecture and more, here are five free books to level up your Python skills.

Working at a Startup vs in Big Tech

The Pragmatic Engineer

👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. We cover one out of four topics in today’s subscriber-only The Pulse issue. To get full newsletters twice a week, subscribe here. Willem Spruijt is a software engineer with whom I worked on the same team at Uber in Amsterdam, building payments systems.

GPT and LLMs from a Data Engineering Perspective

Jesse Anderson

There has been quite a bit of writing covering GPT and LLMs from data science and business perspectives. I haven’t seen much from the data engineering side. Let me share my perspective, having been in data and AI for a while and using LLMs before they became popular. It is interesting to see the general public having the same amount of excitement as there was a year ago in the LLM space.

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

Best Practices for LLM Evaluation of RAG Applications

databricks

Chatbots are the most widely adopted use case for leveraging the powerful chat and reasoning capabilities of large language models (LLM). The retrieval.

Eliminate The Overhead In Your Data Integration With The Open Source dlt Library

Data Engineering Podcast

Summary Cloud data warehouses and the introduction of the ELT paradigm have led to the creation of multiple options for flexible data integration, with a roughly equal distribution of commercial and open source options. The challenge is that most of those options are complex to operate and exist in their own silo. The dlt project was created to eliminate overhead and bring data integration into your full control as a library component of your overall data system.

Threads: The inside story of Meta’s newest social app

Engineering at Meta

Earlier this year, a small team of engineers at Meta started working on an idea for a new app. It would have all the features people expect from a text-based conversations app, but with one very key, distinctive goal – being an app that would allow people to share their content across multiple platforms. We wanted to build a decentralized (or federated) app that would enable people to post content that is viewable by anyone on other social apps, and vice versa.

Top 7 Free Cloud Notebooks for Data Science

KDnuggets

Cloud notebooks are game-changers for data science, providing free access to computing, pre-built environments, collaboration features, and third-party integrations - everything you need to enhance your workflow.

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Scala as a Junior Developer

Rock the JVM

By Lucas Nouguier. Hey everyone, Daniel here. Lucas’ story is shared by lots of beginner Scala developers, which is why I wanted to post it here on the blog. I’ve watched thousands of developers learn Scala from scratch, and, like Lucas, they love it! If you want to learn Scala well and fast, take a look at my Scala Essentials course at Rock the JVM.

Predicting Snow Crab Habitat Using Machine Learning

ArcGIS

In collaboration with NOAA, we used the Presence-Only Prediction (Maxent) tool to predict snow crab habitat under changing climate conditions.

Deploy Private LLMs using Databricks Model Serving

databricks

We are excited to announce public preview of GPU and LLM optimization support for Databricks Model Serving! With this launch, you can deploy.

Powering Vector Search With Real Time And Incremental Vector Indexes

Data Engineering Podcast

Summary The rapid growth of machine learning, especially large language models, has led to a commensurate growth in the need to store and compare vectors. In this episode Louis Brandy discusses the applications for vector search capabilities both in and outside of AI, as well as the challenges of maintaining real-time indexes of vector data.

Prepare Now: 2025's Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

Meta Quest 2: Defense through offense

Engineering at Meta

Meta’s Native Assurance team regularly performs manual code reviews as part of our ongoing commitment to improve the security posture of Meta’s products. In 2021, we discovered a vulnerability in the Meta Quest 2’s Android-based OS that never made it to production but helped us find new ways to improve the security of Meta Quest products. We’re sharing our journey to get arbitrary native code execution in the privileged VR Runtime service on the Meta Quest 2 by exploiting a memory corruption v

Ensemble Learning Techniques: A Walkthrough with Random Forests in Python

KDnuggets

A practical walkthrough for random forests in Python.

Getting started with Airflow in 10 mins

Marc Lamberti

At the end of this introduction to Airflow, you will be all set for getting started with Airflow. You will start with the basics, such as what Airflow is and the essential concepts. Then you will set up and run your local development environment using the Astro CLI to create your first data pipeline. I hope you’re getting excited. Fasten your seatbelt, take a deep breath, and let’s go! For a complete hands-on introduction to Apache Airflow, here is a 6-hour course at a discount.

Data News — Week 23.38 (late)

Christophe Blefari

Hey. This is a super late Data News. I wanted to send it earlier, but I was travelling, then enjoying time with friends and family. I'm still struggling a bit to write as fast as I would like, but 🤷‍♂️. So, sorry for the late edition and enjoy. Gen AI 🤖 Announcing Microsoft Copilot: having everything under a common brand is great and Copilot is a great name.

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

Introducing MLflow 2.7 with new LLMOps capabilities

databricks

As part of MLflow 2’s support for LLMOps, we are excited to introduce the latest updates to support prompt engineering in MLflow 2.7. A.

What's new on the cloud for data engineers - part 11 (06-09.2023)

Waitingforcode

It's time for another part of "What's new on the cloud for data engineers". Let's see what happened in the last 4 months.

Build Your Own PandasAI with LlamaIndex

KDnuggets

Learn how to leverage LlamaIndex and GPT-3.5-Turbo to easily add natural language capabilities to Pandas for intuitive data analysis and conversation.

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.