Sat.Feb 25, 2023 - Fri.Mar 03, 2023

article thumbnail

AWS Lambdas – Python vs Rust. Performance and Cost Savings.

Confessions of a Data Guy

Save money, save money!! Hear Hear! Someone on Linkedin recently brought up the point that companies could save gobs of money by swapping out AWS Python lambdas for Rust ones. While it raised the ire of many a Python Data Engineer, I thought it sounded like a great idea. At least it’s an excuse to […] The post AWS Lambdas – Python vs Rust.

AWS 356
article thumbnail

Azure Databricks: A Comprehensive Guide

Analytics Vidhya

Introduction Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform that is built on top of the Microsoft Azure cloud. A collaborative and interactive workspace allows users to perform big data processing and machine learning tasks easily. In this blog post, we will take a closer look at Azure Databricks, its key features, […] The post Azure Databricks: A Comprehensive Guide appeared first on Analytics Vidhya.

Big Data 310
article thumbnail

Finding My Pathless Path

Simon Späti

As I sit down to write this article, I’m filled with a sense of vulnerability and excitement. You see, this is a story that only I can tell. It’s a tale of finding my Pathless Path and discovering who I am in the process. I have learned that some of my best decision-making comes from following my gut, heart, and intuition, a place of inner knowing.

Process 289
article thumbnail

How to get started with dbt

Christophe Blefari

This article is meant to be a resource hub in order to understand dbt basics and to help get started your dbt journey. When I write dbt, I often mean dbt Core. dbt Core is an open-source framework that helps you organise data warehouse SQL transformation. dbt Core has been developed by dbt Labs, which was previously named Fishtown Analytics. The company has been founded in May 2016. dbt Labs also develop dbt Cloud which is a cloud product that hosts and runs dbt Core projects.

article thumbnail

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.

article thumbnail

Big Tech job-switching stats

The Pragmatic Engineer

👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. We cover one out of five topics from The Scoop #39 , published two weeks ago, 23 February. To get full newsletters twice a week, subscribe here. I have collaborated with a tech recruiter - they’ve asked to be anonymous - who’s been running some very interesting queries on LinkedIn for software engineers.

article thumbnail

30 Best Data Science Books to Read in 2023

Analytics Vidhya

Introduction Data science has taken over all economic sectors in recent times. To achieve maximum efficiency, every company strives to use various data at every stage of its operations. Each aspect of data science, like data preparation, the importance of big data, and the process of automation, contributes to how data science is the future […] The post 30 Best Data Science Books to Read in 2023 appeared first on Analytics Vidhya.

More Trending

article thumbnail

Building A Data Mesh Platform At PayPal

Data Engineering Podcast

Summary There has been a lot of discussion about the practical application of data mesh and how to implement it in an organization. Jean-Georges Perrin was tasked with designing a new data platform implementation at PayPal and wound up building a data mesh. In this episode he shares that journey and the combination of technical and organizational challenges that he encountered in the process.

Building 147
article thumbnail

Why did Google close its coding competitions after 20 years?

The Pragmatic Engineer

👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. We cover one out of five topics in yesterday's subscriber-only The Scoop issue. To get full newsletters twice a week, subscribe here. On 22 February 2023, Google announced its coding competitions are coming to an end: The visual that accompanied the announcement of the end of Google’s coding competitions.

Coding 176
article thumbnail

How to Normalize Relational Databases With SQL Code?

Analytics Vidhya

Introduction Data is the new oil in this century. The database is the major element of a data science project. To generate actionable insights, the database must be centralized and organized efficiently. If a corrupted, unorganized, or redundant database is used, the results of the analysis may become inconsistent and highly misleading. So, we are […] The post How to Normalize Relational Databases With SQL Code?

article thumbnail

Top Free Data Science Online Courses for 2023

KDnuggets

Learn Data Science in 2023 for FREE with these online courses.

article thumbnail

Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

article thumbnail

Introducing Compute-Compute Separation for Real-Time Analytics

Rockset

Every database built for real-time analytics has a fundamental limitation. When you deconstruct the core database architecture, deep in the heart of it you will find a single component that is performing two distinct competing functions: real-time data ingestion and query serving. These two parts running on the same compute unit is what makes the database real-time: queries can reflect the effect of the new data that was just ingested.

article thumbnail

What is a Data Mesh?

Confessions of a Data Guy

The post What is a Data Mesh? appeared first on Confessions of a Data Guy.

Data 130
article thumbnail

Top 10 Hadoop Interview Questions You Must Know

Analytics Vidhya

Introduction The Hadoop Distributed File System (HDFS) is a Java-based file system that is Distributed, Scalable, and Portable. Due to its lack of POSIX conformance, some believe it to be data storage instead. Still, it does include shell commands and Java Application Programming Interface (API) functions that are similar to other file systems. HDFS and […] The post Top 10 Hadoop Interview Questions You Must Know appeared first on Analytics Vidhya.

Hadoop 233
article thumbnail

PySpark for Data Science

KDnuggets

In this tutorial, we will learn to Initiates the Spark session, load, and process the data, perform data analysis, and train a machine learning model.

article thumbnail

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

article thumbnail

Filtering rules accumulator

Waitingforcode

Data can have various quality issues, from missing to badly formatted values. However, there is another issue less people talk about, the erroneous filtering logic.

Data 130
article thumbnail

Finding My Pathless Path

Simon Späti

As I sit down to write this article, I’m filled with a sense of vulnerability and excitement. You see, this is a story that only I can tell. It’s a tale of finding my Pathless Path and discovering who I am in the process. I have learned that some of my best decision-making comes from following my gut, heart, and intuition - a place of inner knowing.

Process 130
article thumbnail

Understanding Dimensional Modeling

Analytics Vidhya

Introduction One of the most important assets of any organization is the data it produces on a daily basis. This data is used by an organization to find valuable insights which help in improving an organization’s growth and strategies and give them an upper hand over its competitors. This article explains to you the idea […] The post Understanding Dimensional Modeling appeared first on Analytics Vidhya.

article thumbnail

Top Posts February 20-26: 5 SQL Visualization Tools for Data Engineers

KDnuggets

5 SQL Visualization Tools for Data Engineers • Free TensorFlow 2.

SQL 137
article thumbnail

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

article thumbnail

Stream Rows and Kafka Topics Directly into Snowflake with Snowpipe Streaming

Snowflake

Snowflake enables organizations to be data-driven by offering an expansive set of features for creating performant, scalable, and reliable data pipelines that feed dashboards, machine learning models, and applications. But before data can be transformed and served or shared, it must be ingested from source systems. The volume of data generated in real time from application databases, sensors, and mobile devices continues to grow exponentially.

Kafka 128
article thumbnail

Announcing Ray support on Databricks and Apache Spark Clusters

databricks

Ray is a prominent compute framework for running scalable AI and Python workloads, offering a variety of distributed machine learning tools, large-scale hyperparameter.

article thumbnail

Top 5 Interview Questions on Cassandra

Analytics Vidhya

Introduction Cassandra is an Apache-developed free and open-source distributed NoSQL database management system. It manages huge volumes of data across many commodity servers, ensures fault tolerance with the swift transfer of data, and provides high availability with no single point of failure. Java-written Apache Cassandra is highly scalable for Big Data models and comprises flexible […] The post Top 5 Interview Questions on Cassandra appeared first on Analytics Vidhya.

NoSQL 223
article thumbnail

3 Julia Packages for Data Visualization

KDnuggets

A gentle introduction of Plots.jl, Gadfly.jl, and VegaLite with code examples.

Coding 134
article thumbnail

Prepare Now: 2025s Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

article thumbnail

Streaming Ingestion for Apache Iceberg With Cloudera Stream Processing

Cloudera

Recently, we announced enhanced multi-function analytics support in Cloudera Data Platform (CDP) with Apache Iceberg. Iceberg is a high-performance open table format for huge analytic data sets. It allows multiple data processing engines, such as Flink, NiFi, Spark, Hive, and Impala to access and analyze data in simple, familiar SQL tables. In this blog post, we are going to share with you how Cloudera Stream Processing ( CSP ) is integrated with Apache Iceberg and how you can use the SQL Stream

Process 119
article thumbnail

Upsert your datasets using the Append tool in ArcGIS Pro 3.1

ArcGIS

In ArcGIS Pro 3.1, you can use the Append tool to upsert (update and insert) a target dataset with data from a new or updated dataset.

Datasets 105
article thumbnail

Step-by-Step Roadmap to Learn SQL in 2023

Analytics Vidhya

Introduction Structured Query Language is a powerful language to manage and manipulate data stored in databases. SQL is widely used in the field of data science and is considered an essential skill to have if you work with data. After being introduced in the 70s, it has become the standard querying language for relational databases. […] The post Step-by-Step Roadmap to Learn SQL in 2023 appeared first on Analytics Vidhya.

SQL 223
article thumbnail

5 Data Analysis Projects For Beginners

KDnuggets

Are you a data analyst newbie looking to boost your resume to land your first job? If yes, then up your game as a beginner with these 5 projects that you can’t afford to miss.

article thumbnail

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

article thumbnail

Scalable Spark Structured Streaming for REST API Destinations

databricks

Spark Structured Streaming is the widely-used open source engine at the foundation of data streaming on the Databricks Lakehouse Platform. It can elegantly.

article thumbnail

Multi-Geo Replication 101 for Apache Kafka: The What, How, and Why

Confluent

Learn the what, how, and why for multi-geo replication. In this post, we’ll share the best tools, practices, and patterns for planning geo-replicated Kafka deployments.

Kafka 100
article thumbnail

Understanding the Basics of Database Normalization

Analytics Vidhya

Introduction Data normalization is the process of building a database according to what is known as a canonical form, where the final product is a relational database with no data redundancy. More specifically, normalization involves organizing data according to attributes assigned as part of a larger data model. The main goals of database normalization are […] The post Understanding the Basics of Database Normalization appeared first on Analytics Vidhya.

Database 221
article thumbnail

SQL Query Optimization Techniques

KDnuggets

Learn how to optimize the queries written in SQL to make them execute faster and more memory efficient.

SQL 122
article thumbnail

The Cloud Development Environment Adoption Report

Cloud Development Environments (CDEs) are changing how software teams work by moving development to the cloud. Our Cloud Development Environment Adoption Report gathers insights from 223 developers and business leaders, uncovering key trends in CDE adoption. With 66% of large organizations already using CDEs, these platforms are quickly becoming essential to modern development practices.