Blog and Bytes - Data Engineering Digest

Data Scientist Vs Data Analyst: Key Differences, Career Paths, and How to Choose the Right Role

WeCloudData

FEBRUARY 13, 2025

quintillion bytes of data are generated every day, and that's a great sign for anyone interested in a data-driven career. This blog focuses […] The post Data Scientist Vs Data Analyst: Key Differences, Career Paths, and How to Choose the Right Role appeared first on WeCloudData.

Bytes

Bytes BI Data Engineering

Handling Network Throttling with AWS EC2 at Pinterest

Pinterest Engineering

APRIL 7, 2025

In this blog post, we'll discuss our experiences in identifying the challenges associated with EC2 network throttling. In the remainder of this blog post, we'll share how we root cause and mitigate the above issues. This prompted us to engage with AWS and dive deep into the network performance of our clusters. 4xl with up to 12.5

AWS

AWS Bytes Database Data Ingestion

Fault Tolerance in Distributed Systems: Tracing with Apache Kafka and Jaeger

Confluent

JULY 24, 2019

Instead, in this post I will point you to an earlier blog post where I already answered that question and then I will focus on what should be your next question: now that I’m relying on Jaeger to trace how data is flowing through my distributed system, what if Jaeger goes down? Distributed tracing with Apache Kafka and Jaeger.

Kafka

Kafka Systems Bytes Project

Improving Efficiency Of Goku Time Series Database at Pinterest (Part 1)

Pinterest Engineering

NOVEMBER 22, 2023

In the first blog, we will share a short summary on the GokuS and GokuL architecture, data format for Goku Long Term, and how we improved the bootstrap time for our storage and serving components. More information about the architecture can be found in the GokuL blog and the cost reduction blog.

Database

Database Bytes Kafka Architecture

LLM finetuning memory requirements by Alex Birch

Scott Logic

NOVEMBER 23, 2023

Cost increases when gradient accumulation is enabled, or becomes ~free if used in concert with DDP. DDP usually costs ~4 bytes/param, but becomes cheaper if used in concert with AMP. DDP can be made 2.5 […] Transformer Math does not mention a "4 bytes/param master gradients" cost.

Bytes

Bytes Education IT Utilities

Open-Sourcing AvroTensorDataset: A Performant TensorFlow Dataset For Processing Avro Data

LinkedIn Engineering

JUNE 15, 2023

In this blog post, we will discuss the AvroTensorDataset API, techniques we used to improve data processing speeds by up to 162x over existing solutions (thereby decreasing overall training time by up to 66%), and performance results from benchmarks and production. an array within a map, within a union, etc…). Default is 128 * 1024 (128KB).

Datasets

Datasets Bytes Process Data Ingestion

Memory Optimizations for Analytic Queries in Cloudera Data Warehouse

Cloudera

MARCH 2, 2022

You can read previous blog posts on Impala’s performance and querying techniques here – “New Multithreading Model for Apache Impala”, “Keeping Small Queries Fast – Short query optimizations in Apache Impala” and “Faster Performance for Selective Queries”. Total size of the Bucket is 16 bytes. Folding data into pointers.

Data Warehouse

Data Warehouse Bytes Data Business Intelligence

Introducing Netflix’s Key-Value Data Abstraction Layer

Netflix Tech

SEPTEMBER 18, 2024

The first level is a hashed string ID (the primary key), and the second level is a sorted map of a key-value pair of bytes. A future blog post will describe the chunking architecture in more detail, including its intricacies and optimization strategies. The idempotency token ties all these writes together into one atomic operation.
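
The two-level layout described in this excerpt can be sketched in a few lines of Python. This is only an illustrative in-memory model, not Netflix's implementation: the class name `TwoLevelStore`, the use of SHA-256 for the hashed ID, and the way the idempotency token is used to ignore retried writes are all assumptions made for the example.

```python
import hashlib


class TwoLevelStore:
    """Toy model: hashed string ID -> map of byte keys to byte values."""

    def __init__(self):
        self._records = {}         # hashed ID -> {key bytes: value bytes}
        self._seen_tokens = set()  # idempotency tokens already applied

    @staticmethod
    def _hash_id(record_id):
        # Assumption: SHA-256 stands in for whatever hash the real system uses.
        return hashlib.sha256(record_id.encode("utf-8")).hexdigest()

    def put_items(self, record_id, items, token):
        """Apply a batch of writes once per token; retries are ignored."""
        if token in self._seen_tokens:
            return False
        inner = self._records.setdefault(self._hash_id(record_id), {})
        inner.update(items)
        self._seen_tokens.add(token)
        return True

    def scan(self, record_id):
        """Return the second-level items ordered by key."""
        inner = self._records.get(self._hash_id(record_id), {})
        return sorted(inner.items())


store = TwoLevelStore()
store.put_items("user:1", {b"b": b"2", b"a": b"1"}, token="t1")
store.put_items("user:1", {b"c": b"3"}, token="t1")  # same token, ignored
print(store.scan("user:1"))
```

Sorting on read keeps the sketch short; a real store would keep the second level ordered on disk.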

Bytes

Bytes Metadata Database Data

Life Cycle of Data Science Project

WeCloudData

MARCH 1, 2025

The world is becoming increasingly dependent on data: about 2.5 quintillion bytes of data are generated every day. Data is shaping our decisions, from personalized shopping experiences to checking weather forecasts before leaving home. All of these data science applications have a life cycle to follow.

Data Science

Data Science Project Bytes Data

Pinterest is now on HTTP/3

Pinterest Engineering

FEBRUARY 23, 2023

These advancements fit well with Pinterest use cases — enabling faster connection establishment (time to first byte of first request), improved congestion control (large media as we have), multiplexing without TCP head-of-line blocking (multiple downloads at the same time), and continued in-flight requests when pinners’ device network/ip changes.

Bytes

Bytes Media Software Engineering Software Engineer

Launching the Engineering Blog

Zalando Engineering

JUNE 30, 2020

Our Engineering Blog was launched in June 2020 after a long break of the previous tech blog. What customizations we applied to design the blog and the publishing process. Static Site Generator Our previous tech blog used a CMS which only a limited number of people had access to. So which static site generator to choose?

Engineering

Engineering Bytes AWS Python

Netflix Cloud Packaging in the Terabyte Era

Netflix Tech

SEPTEMBER 24, 2021

Our previous tech blog Packaging award-winning shows with award-winning technology detailed our packaging technology deployed on the streaming side. Writable MezzFS As described in a previous blog post, MezzFS is a tool developed by Netflix that allows cloud storage objects to be mounted as local files via FUSE.

Cloud

Cloud Bytes Cloud Storage Media

Patching the PostgreSQL JDBC Driver

Zalando Engineering

NOVEMBER 8, 2023

Introduction This blog post describes a recent contribution from Zalando to the Postgres JDBC driver to address a long-standing issue with the driver’s integration with Postgres’ logical replication that resulted in runaway Write-Ahead Log (WAL) growth. However as you may imagine, this blog post concerns a path that is anything but happy.

PostgreSQL

PostgreSQL Java Database Bytes

Monitoring Cloudera DataFlow Deployments With Prometheus and Grafana

Cloudera

JANUARY 17, 2024

In this blog we will dive into how CDF-PC’s support for NiFi reporting tasks can be used to monitor key metrics in Prometheus and Grafana. By using component_name and “Hello World Prometheus,” we’re monitoring the bytes received aggregated by the entire process group and therefore the flow. Select the nifi_amount_bytes_received metric.

Bytes

Bytes Architecture Building Designing

How to Stream JSON Data Using Server-Sent Events and FastAPI in Python over HTTP?

Workfall

SEPTEMBER 26, 2023

In this blog, we will cover: What are Server-Sent Events? We’re taking in 16 bytes of data at a time from the stream. This function will provide basic units of data in the form of raw bytes. These bytes can then be converted into a readable JSON format. appeared first on The Workfall Blog.
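
As a rough illustration of reading a stream 16 bytes at a time and turning the raw bytes into JSON, the sketch below buffers fixed-size chunks and parses each complete Server-Sent Event. The function names are hypothetical, an in-memory `io.BytesIO` stands in for the HTTP response, and the parser is simplified: it assumes every `data:` line carries one complete JSON document.

```python
import io
import json


def iter_chunks(stream, size=16):
    """Yield raw byte chunks of at most `size` bytes until the stream ends."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            break
        yield chunk


def iter_json_events(stream, size=16):
    """Buffer chunks and yield one parsed JSON object per SSE data line."""
    buffer = b""
    for chunk in iter_chunks(stream, size):
        buffer += chunk
        # SSE events are separated by a blank line.
        while b"\n\n" in buffer:
            raw_event, buffer = buffer.split(b"\n\n", 1)
            for line in raw_event.split(b"\n"):
                if line.startswith(b"data:"):
                    payload = line[len(b"data:"):].strip()
                    yield json.loads(payload.decode("utf-8"))


stream = io.BytesIO(b'data: {"id": 1}\n\ndata: {"id": 2, "msg": "hello"}\n\n')
for event in iter_json_events(stream):
    print(event)
```

Because events rarely align with 16-byte boundaries, the buffer is what makes the small chunk size safe: nothing is parsed until a full event has arrived.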

Python

Python Bytes Coding Project

Deploying Kafka Streams and KSQL with Gradle – Part 2: Managing KSQL Implementations

Confluent

MAY 29, 2019

If you want to follow along and execute all the commands included in this blog post (and the next), you can check out this GitHub repository, which also includes the necessary Docker Compose functionality for running a compatible KSQL and Confluent Platform environment using the recently released Confluent 5.2.1. Sample repository.

Kafka

Kafka Management Bytes SQL

Staying in the Zone: How DoorDash used a service mesh to manage data transfer, reducing hops and cloud spend

DoorDash Engineering

JANUARY 16, 2024

In this blog post, we describe the journey DoorDash took using a service mesh to realize data transfer cost savings without sacrificing service quality. This led us to use a number of observability tools, including VPC flow logs, eBPF agent metrics, and Envoy networking bytes metrics to rectify the situation.

Bytes

Bytes Cloud Management PostgreSQL

BPFAgent: eBPF for Monitoring at DoorDash

DoorDash Engineering

AUGUST 15, 2023

For a more detailed introduction to BPF portability and CO-RE, see Andrii Nakryiko’s blog post on the subject. We also have an unmarshalling function to convert the raw bytes from the kernel into our structure. The post BPFAgent: eBPF for Monitoring at DoorDash appeared first on DoorDash Engineering Blog.

Bytes

Bytes PostgreSQL Coding Database

Postgres Aurora DB major version upgrade with minimal downtime

Lyft Engineering

MARCH 11, 2024

This blog would be of immense help to understand what happens under the hood with AWS blue/green deployment! The diff_bytes is 0 now! We now need to reset sequences in, which we accomplished with the following script: [link] This ensures that the sequence starts from the last entry of the individual tables.

Bytes

Bytes PostgreSQL AWS Database

The Rise of Unstructured Data

Cloudera

NOVEMBER 15, 2021

This blog discusses quantifications, types, and implications of data. The International Data Corporation (IDC) estimates that by 2025 the sum of all data in the world will be in the order of 175 Zettabytes (one Zettabyte is 10^21 bytes). The post The Rise of Unstructured Data appeared first on Cloudera Blog.

Unstructured Data

Unstructured Data Pipeline-centric Database-centric Entertainment

Optimizing Hive on Tez Performance

Cloudera

MAY 9, 2022

Refer to the YARN – The Capacity Scheduler blog to understand these configuration settings.) This can be tuned using the user limit factor of the YARN queue (refer to the details in the Capacity Scheduler blog). Tez determines the reducers automatically based on the data (number of bytes) to be processed.

Bytes

Bytes SQL Professional Services Utilities

How to Navigate the Costs of Legacy SIEMS with Snowflake

Snowflake

APRIL 18, 2024

This blog post explores how Snowflake can help with this challenge. In the cloud, computing can be measured in various ways, like bytes scanned or CPU cycles. But what if security teams didn’t have to make tradeoffs? Detection and investigation processing: Security teams depend on detection rules to find important events automatically.

Data Lake

Data Lake Data Ingestion Bytes Cloud Computing

Introducing Netflix TimeSeries Data Abstraction Layer

Netflix Tech

OCTOBER 8, 2024

In previous blog posts, we introduced the Key-Value Data Abstraction Layer and the Data Gateway Platform, both of which are integral to Netflix’s data architecture. We may go into detail on this subject in one of our future blog posts. The next section describes how this is achieved.

Bytes

Bytes Datasets Metadata Data

Getting Started with Rust and Apache Kafka

Confluent

OCTOBER 24, 2019

I’ve written an event sourcing bank simulation in Clojure (a Lisp built for Java virtual machines or JVMs) called open-bank-mark, which you are welcome to read about in my previous blog post explaining the story behind this open source example. Make sure it is indeed an ID and that the Value matches the expected type Fixed, with 16 bytes.

Kafka

Kafka Java Banking Bytes

Building a Simple CRUD web application and image store using Cloudera Operational Database and Flask

Cloudera

OCTOBER 6, 2020

In this blog, I will demonstrate how COD can easily be used as a backend system to store data and images for a simple web application. The post Building a Simple CRUD web application and image store using Cloudera Operational Database and Flask appeared first on Cloudera Blog.

Database

Database Building Bytes NoSQL

Data Engineering Weekly #151

Data Engineering Weekly

DECEMBER 3, 2023

Github writes an excellent blog to capture the current state of the LLM integration architecture. The blog is an excellent read to understand late-arriving data, backfilling, and incremental processing complications. I experienced similar drawbacks to what Lyft is talking about in Druid. Rebalancing, the awkward middle child.

Data Engineering

Data Engineering Data Engineer Engineering Bytes

Packaging award-winning shows with award-winning technology

Netflix Tech

FEBRUARY 25, 2021

By Cyril Concolato. Introduction: In previous blog posts, our colleagues at Netflix have explained how 4K video streams are optimized, how even legacy video streams are improved and more recently how new audio codecs can provide better aural experiences to our members. Figure 1: Simplified

Technology

Technology Bytes Media Entertainment

Solving Espresso’s scalability and performance challenges to support our member base

LinkedIn Engineering

SEPTEMBER 7, 2023

Espresso System Overview Figure 1 is a high-level overview of the Espresso ecosystem, which includes the online operation section of Espresso (the main focus of this blog post). Improvements to Encode/Decode performance This section focuses on the performance improvements we made when converting bytes to Http objects and vice versa.

Bytes

Bytes Transportation Utilities Java

A Glimpse into the Redesigned Goku-Ingestor vNext at Pinterest

Pinterest Engineering

NOVEMBER 28, 2023

P_young = S_eden / R_alloc, where P_young is the period between young GC, S_eden is the size of Eden and R_alloc is the rate of memory allocations (bytes per second). To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site.
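
A small worked example of the young-GC relation quoted above (period = Eden size / allocation rate). The function name and the input numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not figures from the Pinterest post.

```python
def young_gc_period_seconds(eden_bytes, alloc_bytes_per_second):
    """Estimate the time between young GCs as Eden size / allocation rate."""
    if alloc_bytes_per_second <= 0:
        raise ValueError("allocation rate must be positive")
    return eden_bytes / alloc_bytes_per_second


GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical service: 4 GiB Eden, allocating 512 MiB per second.
period = young_gc_period_seconds(4 * GIB, 512 * MIB)
print(f"young GC roughly every {period:.1f} s")
```

Under this relation, halving the allocation rate or doubling Eden both double the period between young collections.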

Kafka

Kafka Bytes Architecture Software Engineer

Practical API Design at Netflix, Part 1: Using Protobuf FieldMask

Netflix Tech

SEPTEMBER 3, 2021

If a consumer is only interested in production titles and format, they can set a FieldMask with paths “title” and “format”: [link] Masking fields Please note, even though code samples in this blog post are written in Java, demonstrated concepts apply to any other language supported by protocol buffers. Field names are not included.
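
The original post's samples are in Java; as a language-neutral illustration of what a field mask with paths "title" and "format" does, here is a minimal Python sketch. Plain dictionaries stand in for protobuf messages, and both `apply_field_mask` and the sample `production` record are hypothetical names invented for this example rather than part of any protobuf API.

```python
def apply_field_mask(message, paths):
    """Return a copy of `message` holding only the fields named by `paths`.

    Paths may be dotted (for example "schedule.start") to select nested fields.
    Paths that do not exist in the message are skipped.
    """
    masked = {}
    for path in paths:
        parts = path.split(".")
        # Walk the source message to find the requested value.
        value = message
        found = True
        for part in parts:
            if isinstance(value, dict) and part in value:
                value = value[part]
            else:
                found = False
                break
        if not found:
            continue
        # Recreate the same nesting in the masked result.
        target = masked
        for part in parts[:-1]:
            target = target.setdefault(part, {})
        target[parts[-1]] = value
    return masked


production = {
    "title": "Example Show",
    "format": "SERIES",
    "schedule": {"start": "2021-09-03", "end": "2021-12-01"},
}
print(apply_field_mask(production, ["title", "format"]))
```

The caller pays only for the fields it names, which is the point of the mask: unlisted fields such as `schedule` never need to be fetched or serialized.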

Designing

Designing Java Bytes Utilities

Apache Kafka Deployments and Systems Reliability – Part 1

Cloudera

SEPTEMBER 20, 2021

In this blog series, we will discuss each of these deployments and the deployment choices made along with how they impact reliability. The post Apache Kafka Deployments and Systems Reliability – Part 1 appeared first on Cloudera Blog. There are many ways that Apache Kafka has been deployed in the field.

Kafka

Kafka Systems Utilities Bytes

Netflix Drive

Netflix Tech

MAY 5, 2021

We will cover the different namespaces of Netflix Drive in more detail in a subsequent blog post. Data Store Characteristics Netflix Drive relies on a data store that allows streaming bytes into files/objects persisted on the storage media. The transfer mechanism for transport of bytes is a function of the data store.

Metadata

Metadata Bytes Media Cloud Storage

Carbon Emissions of End-User Devices: Part One - SWD Method by David Rees

Scott Logic

APRIL 5, 2024

Introduction This series of blog posts discusses the methods of estimating carbon emissions of end-user devices. After intending to write a single blog post, the research journey prompted me to reconsider how to present this to an audience. js is a JavaScript library that returns an estimated CO2e value for a web page.

Bytes

Bytes Systems Designing Data Storage

The Stream Processing Model Behind Google Cloud Dataflow

Towards Data Science

APRIL 30, 2024

This blog post is my note after reading the paper: The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. In the rest of this blog, we will see how Google enables this contribution. See you next blog!

Google Cloud

Google Cloud Process Cloud Lambda Architecture

Google BigQuery: A Game-Changing Data Warehousing Solution

ProjectPro

JANUARY 24, 2023

This blog is your comprehensive guide to Google BigQuery, its architecture, and a beginner-friendly tutorial on how to use Google BigQuery for your data warehousing activities. This blog presents a detailed overview of Google BigQuery and its architecture. Due to this, combining and contrasting the STRING and BYTE types is impossible.

Bytes

Bytes Google Cloud Data Warehouse Cloud Storage

Data Engineering Weekly #117

Data Engineering Weekly

FEBRUARY 5, 2023

The ML for large-scale production systems highlights the improvement made from the existing heuristic in the YouTube cache replacement algorithm with a new hybrid algorithm that combines a simple heuristic with a learned model, improving the byte miss ratio at the peak by ~9%. The blog talks about four types of architecture.

Data Engineering

Data Engineering Data Engineer Engineering Food

Apache Ozone Fault Injection Framework

Cloudera

AUGUST 14, 2020

The target could be a particular Node (network endpoint), a file-system, a directory, a data-file or a byte-offset range within a given data-file. The post Apache Ozone Fault Injection Framework appeared first on Cloudera Blog. A typical flow control for Apache Ozone using this Fault Injection Framework looks like this:

Hadoop

Hadoop Bytes Metadata Programming Language

Streaming Big Data Files from Cloud Storage

Towards Data Science

JANUARY 26, 2023

Check out this informative blog for more details on how S5cmd works and its significant performance advantages. Here we show how to download specific byte-ranges of the file using the Boto3 get_object data streaming API. CPU cores and TCP connections). The S5cmd concurrency flag allows for controlling the download speed.
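
A minimal sketch of downloading specific byte ranges with the Boto3 `get_object` call mentioned in this excerpt. The helper names, bucket and key are placeholders, and the chunking scheme is an assumption for illustration; only `get_object` with its `Range` argument comes from Boto3 itself. The download helper needs the third-party `boto3` package and valid AWS credentials, so it is imported lazily and not invoked here.

```python
def byte_ranges(total_size, chunk_size):
    """Yield inclusive (start, end) byte offsets that cover the whole object."""
    for start in range(0, total_size, chunk_size):
        yield start, min(start + chunk_size, total_size) - 1


def range_header(start, end):
    """Format an HTTP Range value such as 'bytes=0-1023' (end is inclusive)."""
    return f"bytes={start}-{end}"


def download_range(bucket, key, start, end):
    """Fetch one byte range of an S3 object and return it as bytes."""
    import boto3  # third-party dependency, imported only when actually used

    s3 = boto3.client("s3")
    response = s3.get_object(Bucket=bucket, Key=key, Range=range_header(start, end))
    return response["Body"].read()


# Plan the requests for a hypothetical 10-byte object in 4-byte pieces.
for start, end in byte_ranges(10, 4):
    print(range_header(start, end))
```

Splitting the object into independent ranges is what allows several requests to run in parallel, at the cost of one HTTP round trip per range.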

Cloud Storage

Cloud Storage Big Data Cloud AWS

Collaboration is Key to Reducing Pain and Finding Value in Data

Cloudera

OCTOBER 6, 2020

This is a guest blog post, authored by John Zantey, Director and Co-founder, Qabsu. Globally, there are quintillions of bytes of data being generated and collected, every day. The post Collaboration is Key to Reducing Pain and Finding Value in Data appeared first on Cloudera Blog. collect, enrich, report, serve, and predict).

Bytes

Bytes Education Cloud Data

Deploying Kafka Streams and KSQL with Gradle – Part 3: KSQL User-Defined Functions and Kafka Streams

Confluent

JULY 10, 2019

The repository’s README contains a bit more detail, but in a nutshell, we check out the repo and then use Gradle to initiate docker-compose : git clone [link] cd kafka-examples git checkout confluent-blog./gradlew jar Zip file size: 5849 bytes, number of entries: 5. jar Zip file size: 11405084 bytes, number of entries: 7422.

Kafka

Kafka Java Bytes SQL

HDFS Data Encryption at Rest on Cloudera Data Platform

Cloudera

APRIL 23, 2021

sent 11,286 bytes received 172 bytes 2,546.22 However, we can continue without enabling TLS for the purpose of this blog. The post HDFS Data Encryption at Rest on Cloudera Data Platform appeared first on Cloudera Blog. [root@ccycloud-4 ~]# rsync -zav --exclude.ssl /var/lib/keytrustee/.keytrustee keytrustee ccycloud-3.cdpvcb.root.hwx.site:/var/lib/keytrustee/.

MySQL

MySQL Java Bytes Data

Data Quality + Data Lineage = ???

Datakin

SEPTEMBER 2, 2021

Written by Peter Hicks on Sep 2, 2021. In a prior life, I dwelled in the day-to-day cycles of an e-commerce platform. In previous blog posts, we’ve talked about the importance of understanding data lineage from debugging to privacy and governance.

Bytes

Bytes Food Datasets Data Pipeline

Optimization Strategies for Iceberg Tables

Cloudera

FEBRUARY 14, 2024

This blog discusses a few problems that you might encounter with Iceberg tables and offers strategies on how to optimize them in each of those scenarios. In a later blog, we will go into details about how to take advantage of the time travel feature. The rest of the blog will go into this in more detail. Opening files is costly.

Bytes

Bytes Metadata Data Lake SQL

Processing medical images at scale on the cloud

Tweag

APRIL 19, 2023

In this blog post, I will explain the underlying technical challenges and share the solution that we helped implement at kaiko.ai, a MedTech startup in Amsterdam that is building a Data Platform to support AI research in hospitals. A solution is to read the bytes that we need when we need them directly from Blob Storage.

Medical

Medical Process Cloud Bytes
