December, 2022

Data Pipeline Design Patterns - #1. Data flow patterns

Start Data Engineering

1. Introduction 2. Source & Sink 2.1. Source Replayability 2.2. Source Ordering 2.3. Sink Overwritability 3. Data pipeline patterns 3.1. Extraction patterns 3.1.1. Time ranged 3.1.2. Full Snapshot 3.1.3. Lookback 3.1.4. Streaming 3.2. Behavioral 3.2.1. Idempotent 3.2.2. Self-healing 3.3. Structural 3.3.1. Multi-hop pipelines 3.3.2. Conditional/ Dynamic pipelines 3.3.3.

Dataframe Showdown – Polars vs Spark vs Pandas vs DataFusion. Guess who wins?

Confessions of a Data Guy

There once was a day when no one used DataFrames that much. Back before Spark had really gone mainstream, Data Scientists were still plinking around with Pandas a lot. My, my, what would your mother say? How things have changed. Now everyone wants a piece of the DataFrame pie. I mean it tastes so good, […]

A Return to the Office (RTO) Wave?

The Pragmatic Engineer

👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. We cover one out of five topics in today’s subscriber-only The Scoop issue. To get this newsletter every week, subscribe here. On Thursday, 29 November, Snap CEO Evan Spiegel sent an email announcing Snap will mandate 4 days/week in the office, starting from January.

Data News — must-read 2022 articles

Christophe Blefari

kitsch moment, from me to you (credits) Hey you, this is the last article of the year and it's gonna be about the articles and trends that made 2022 according to me. You'll see articles that I've already shared during the year. 💡 You can also read the 2021 must-reads that I put together a year and a half ago, or how to learn data engineering, which contains key articles to understand the field.

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Should We Get Rid Of ETLs?

Seattle Data Guy

AWS has jumped on the bandwagon of removing the need for ETLs. Snowflake announced this both with their hybrid tables and their partnership with Salesforce. Now, I do take a little issue with the naming “Zero ETLs”. Because at the very surface the functionality described is often closer to a zero integration future, which probably…

Using Product Driven Development To Improve The Productivity And Effectiveness Of Your Data Teams

Data Engineering Podcast

Summary With all of the messaging about treating data as a product, it is becoming difficult to know what that even means. Vishal Singh is the head of products at Starburst, which means that he has to spend all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented and the long term i

More Trending

I asked ChatGPT to write a blog post about Data Engineering. Here it is.

Confessions of a Data Guy

Data engineering is a vital field within the realm of data science that focuses on the practical aspects of collecting, storing, and processing large amounts of data. It involves designing and building the infrastructure to store and process data, as well as developing the tools and systems to extract valuable insights and knowledge from that […]

What Can AI-Powered RPA and IA Mean For Businesses?

KDnuggets

RPA and IA have stunned the business world by making impressive intelligent automation capabilities available to businesses of all scales across industries, as this blog explains.

How to manage and schedule dbt

Christophe Blefari

Last week dbt Labs decided to change the pricing of their Cloud offering. I've already analysed this in week #22.50 of the Data News. In a nutshell, dbt Cloud pricing is seat-based, which means you pay for each dbt developer. Previously, for a team, it was $50/month/dev; it has increased to $100/month/dev, a 100% increase, with a team limit of 8 devs and only one project.

Data warehouses vs Data Lakes vs Databases – Which One Do You Need

Seattle Data Guy

By Reseun McClendon Today, your enterprise must effectively collect, store, and integrate data from disparate sources to provide both operational and analytical benefits. Whether it's helping increase revenue by finding new customers or reducing costs, all of it starts with data. Data analysts, data scientists, engineers, and managers all require a robust data storage solution for…

Prepare Now: 2025's Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, and Terrence Sheflin

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

Increase Your Odds Of Success For Analytics And AI Through More Effective Knowledge Management With AlignAI

Data Engineering Podcast

Summary Making effective use of data requires proper context around the information that is being used. As the size and complexity of your organization increases, the difficulty of ensuring that everyone has the necessary knowledge about how to get their work done scales exponentially. Wikis and intranets are a common way to attempt to solve this problem, but they are frequently ineffective.

Ready-to-go sample data pipelines with Dataflow

Netflix Tech

By Jasmine Omeke, Obi-Ike Nwoke, Olek Gorajek. Intro: This post is for all data practitioners who are interested in learning about bootstrapping, standardization and automation of batch data pipelines at Netflix. You may remember Dataflow from the post we wrote last year titled Data pipeline asset management with Dataflow. That article was a deep dive into one of the more technical aspects of Dataflow and didn’t properly introduce this tool in the first place.

What is Apache Arrow? Asking for a friend.

Confessions of a Data Guy

We’ve all been in that spot, especially in tech. You wanted to fit in, be cool, and look smart, so you didn’t ask any questions. And now it’s too late. You’re stuck. Now you simply can’t ask … you’re too afraid. I get it. Apache Arrow is probably one of those things. It keeps popping […]

Data Science Minimum: 10 Essential Skills You Need to Know to Start Doing Data Science

KDnuggets

Data science is ever-evolving, so mastering its foundational technical and soft skills will help you be successful in a career as a Data Scientist, as well as pursue advanced concepts, such as deep learning and artificial intelligence.

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

Data News — Week 22.50

Christophe Blefari

Prepping me to deliver Christmas' Data News (credits) Hey you, the end of the year is coming soon. I really liked this year with you. It was super fun to write my opinion on data topics every Friday of the year. I don't know yet if next year I'll be able to pull out stuff without repeating myself (I hate repeating myself), but for sure I'll try and I'll continue.

Reducing Data Analytics Costs In 2023 – Doing More With Less

Seattle Data Guy

If you haven’t started looking for ways to improve your data analytics budget for 2023, then you’re probably already behind. The truth is that between all of the various economic indicators and investor letters, everyone is looking to audit and improve all parts of their business, especially where there has likely been bloat. One of those…

Simple And Scalable Encryption Of Data In Use For Analytics And Machine Learning With Opaque Systems

Data Engineering Podcast

Summary Encryption and security are critical elements in data analytics and machine learning applications. We have well-developed protocols and practices around data that is at rest and in motion, but security around data in use is still severely lacking. Recognizing this shortcoming and the capabilities that could be unlocked by a robust solution, Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies.

Building a Telegram Bot Powered by Apache Kafka and ksqlDB

Confluent

ksqlDB use case: see how apps can use ksqlDB to ingest, filter, enrich, aggregate, and query data directly with Kafka—no complex architectures or data stores needed.

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.

Why Data Migrations Suck.

Confessions of a Data Guy

I’ve often wondered what purgatory would be like, doing penance for millennia into eternity. It would probably be doing data migrations. I suppose they are not all that dissimilar from normal software migrations, but there are a few things that make data migrations a little more horrible and soul-sucking. Data migrations are able to slow […]

More Data Science Cheatsheets

KDnuggets

It's time again to look at some data science cheatsheets. Here you can find a short selection of such resources which can cater to different existing levels of knowledge and breadth of topics of interest.

Data News — Week 22.49

Christophe Blefari

This is what we call a Chat in French (credits) Hello there, this is Christophe, live from the human world. Last week was totally driven by the ChatGPT frenzy: the social networks I follow are spammed with conversation screenshots and hype. On my side, I don't know what the future holds for us, but for sure MaaS (Models as a Service) does not look bright to me.

Best of 2022: 5 Most Popular Cybersecurity Blogs Of The Year

U-Next

Introduction. Are you a cybersecurity enthusiast looking to know the latest trends and goings-on in the cybersecurity industry? Or are you just a tech enthusiast who likes to stay updated on what is happening around you? Then you are in the perfect place. As another year comes to an end, we decided the best way to look back was to revisit the most popular and sought-after cybersecurity blogs and list them for all our cybersecurity enthusiasts.

The Ultimate Guide To Data-Driven Construction: Optimize Projects, Reduce Risks, & Boost Innovation

Speaker: Donna Laquidara-Carr, PhD, LEED AP, Industry Insights Research Director at Dodge Construction Network

In today’s construction market, owners, construction managers, and contractors must navigate increasing challenges, from cost management to project delays. Fortunately, digital tools now offer valuable insights to help mitigate these risks. However, the sheer volume of tools and the complexity of leveraging their data effectively can be daunting. That’s where data-driven construction comes in.

Making Sense Of The Technical And Organizational Considerations Of Data Contracts

Data Engineering Podcast

Summary One of the reasons that data work is so challenging is that no single person or team owns the entire process. This introduces friction in the process of collecting, processing, and using data. In order to reduce the potential for broken pipelines, some teams have started to adopt the idea of data contracts. In this episode Abe Gong brings his experiences with the Great Expectations project and community to discuss the technical and organizational considerations involved in implementing

Broadcom Modernizes Machine Learning and Anomaly Detection with ksqlDB

Confluent

Broadcom's Mainframe Operational Intelligence Product (MOI) collects and analyzes data at mass scale, using ksqlDB to improve anomaly detection and custom alarm filtering.

Safety First: Using vehicle data to make us all better drivers

Teradata

Vehicle data is invaluable in improving the safety & safe operation of vehicles for their occupants & other drivers. The next gen of vehicles will use real-time analysis to make driving even safer.

How To Overcome The Fear of Math and Learn Math For Data Science

KDnuggets

Many aspiring Data Scientists, especially when self-learning, fail to learn the necessary math foundations. These recommendations for learning approaches along with references to valuable resources can help you overcome a personal sense of not being "the math type" or belief that you "always failed in math."

Driving Responsible Innovation: How to Navigate AI Governance & Data Privacy

Speaker: Aindra Misra, Senior Manager, Product Management (Data, ML, and Cloud Infrastructure) at BILL

Join us for an insightful webinar that explores the critical intersection of data privacy and AI governance. In today’s rapidly evolving tech landscape, building robust governance frameworks is essential to fostering innovation while staying compliant with regulations. Our expert speaker, Aindra Misra, will guide you through best practices for ensuring data protection while leveraging AI capabilities.

Introducing Cloudera DataFlow Designer: Self-service, No-Code Dataflow Design

Cloudera

Cloudera has been providing enterprise support for Apache NiFi since 2015, helping hundreds of organizations take control of their data movement pipelines on premises and in the public cloud. Working with these organizations has taught us a lot about the needs of developers and administrators when it comes to developing new dataflows and supporting them in mission-critical production environments.

Making GHC faster at emitting code

Tweag

One common complaint from industrial users of Haskell is that of compilation times: they are sometimes painfully slow. Some of that slowness is difficult to avoid—no matter how you slice it, typechecking and optimizing Haskell code takes a lot of work—but nobody would argue that there is not ample room for improvement. For the past few months, Krzysztof Gogolewski and I have had the opportunity to work with Mercury to identify what some of those improvements might be, and I am pleased to report

Data Catalog - A Broken Promise

Data Engineering Weekly

Data catalogs are the most expensive data integration systems you never intended to build. Data Catalog as a passive web portal to display metadata requires significant rethinking to adopt modern data workflow, not just adding “modern” in its prefix. I know that is an expensive statement to make 😊 To be fair, I’m a big fan of data catalogs, or metadata management, to be precise.

From Eager to Smarter in Apache Kafka Consumer Rebalances

Confluent

Major improvements to the Kafka consumer, Streams, and ksqlDB for incremental cooperative rebalancing while maintaining at-least-once and exactly-once guarantees.

What Is Entity Resolution? How It Works & Why It Matters

Entity resolution, sometimes referred to as data matching or fuzzy matching, is critical for data quality, analytics, graph visualization and AI. Learn what entity resolution is, why it matters, how it works and its benefits. Advanced entity resolution using AI is crucial because it efficiently and easily solves many of today’s data quality and analytics problems.