Python and Java still lead in programming language interest, though both declined (-5% and -13%) while Rust gained traction (+13%); not sure the two are related, though. He listed the four most difficult data integration tasks: from mutable data to IT migrations, everything adds complexity to ingestion systems.
Summary A data lakehouse is intended to combine the benefits of data lakes (cost-effective, scalable storage and compute) and data warehouses (a user-friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality.
Obviously Benoit prefers Kestra, at the expense of writing YAML and running a Java application. Arrow does a lot of the heavy lifting for data operations. New Apache Arrow engines — Arrow has become one of the most used libraries for building in-memory engines.
By Anupom Syam Background At Netflix, our current data warehouse contains hundreds of petabytes of data stored in AWS S3, and each day we ingest and create additional petabytes. Some of the optimizations are prerequisites for a high-performance data warehouse.
In this blog, we will share with you in detail how Cloudera integrates core compute engines, including Apache Hive and Apache Impala, in Cloudera Data Warehouse with Iceberg. We will publish follow-up blogs for other data services. Try Cloudera Data Warehouse (CDW) by signing up for a 60-day trial, or test drive CDP.
Snowflake was founded in 2012 around its data warehouse product, which is still its core offering. Databricks was founded in 2013 by Spark co-creator researchers out of academia; Spark became a top-level Apache project in 2014. You could write the same pipeline in Java, in Scala, in Python, in SQL, etc. What's Iceberg?
Data volume and velocity, governance, structure, and regulatory requirements have all evolved and continue to evolve. Despite these limitations, data warehouses, introduced in the late 1980s based on ideas developed even earlier, remain in widespread use today for certain business intelligence and data analysis applications.
This data engineering skillset typically consists of Java or Scala programming skills paired with deep DevOps acumen. The result is that streaming data tends to be “locked away” from everyone but a small few, and the data engineering team is highly overworked and backlogged. A rare breed.
Both traditional and AI data engineers should be fluent in SQL for managing structured data, but AI data engineers should be proficient in NoSQL databases as well for unstructured data management. Data Storage Solutions As we all know, data can be stored in a variety of ways.
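To make the SQL-vs-NoSQL distinction concrete, here is a minimal sketch (the table, documents, and field names are all illustrative) contrasting structured rows in a relational store with schemaless JSON documents of the kind a document database would hold:

```python
import json
import sqlite3

# Structured data: a relational table with a fixed schema, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada', 'UK'), (2, 'Grace', 'US')")
us_users = conn.execute("SELECT name FROM users WHERE country = 'US'").fetchall()

# Unstructured data: schemaless JSON documents, each free to have its own shape,
# as a NoSQL document store would keep them.
documents = [
    {"user_id": 1, "event": "login", "meta": {"device": "mobile"}},
    {"user_id": 2, "event": "purchase", "items": ["book", "pen"]},
]
serialized = [json.dumps(doc) for doc in documents]

print(us_users)  # [('Grace',)]
```

The point is that the relational side enforces one schema up front, while the document side lets each record carry whatever fields it needs.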
How to reduce warehouse costs? — Hugo proposes 7 hacks to optimise data warehouse costs. Scrape & analyse football data — Benoit nicely puts into perspective how to use Kestra, Malloy and DuckDB to analyse data. teej/titan — Titan is a Python library to manage data warehouse infrastructure.
Agent systems powered by LLMs are already transforming how we code and interact with data. I converted a Java streaming platform into Rust, completing the task faster and gaining valuable insights into Rust's intricacies. These systems provided centralized data storage and processing at the cost of agility.
Data engineering inherits from years of data practices in US big companies. Hadoop initially led the way with Big Data and distributed computing on-premise, to finally land on the Modern Data Stack — in the cloud — with a data warehouse at the center. My advice on this point is to learn from others.
What we need is: an openness to support a wide range of streaming ingest sources, including NiFi, Spark Streaming, and Flink, as well as APIs for languages like C++, Java, and Python. The ability to support not just “insert” type data changes, but insert + update patterns as well, to accommodate both new data and changing data.
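The insert + update pattern is essentially an upsert. As a minimal sketch — using SQLite's `ON CONFLICT` clause and a hypothetical `readings` table keyed by sensor, not any particular streaming engine's API — it looks like this:

```python
import sqlite3

# Hypothetical target table keyed by sensor_id; each incoming record either
# inserts a new row (new data) or updates the existing one (changing data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (sensor_id TEXT PRIMARY KEY, value REAL, updated_at TEXT)"
)

def upsert(sensor_id, value, ts):
    # SQLite's ON CONFLICT clause handles both cases in one statement; engines
    # like Flink or Spark Streaming express the same idea as upsert sinks.
    conn.execute(
        """INSERT INTO readings (sensor_id, value, updated_at) VALUES (?, ?, ?)
           ON CONFLICT(sensor_id) DO UPDATE SET value = excluded.value,
                                                updated_at = excluded.updated_at""",
        (sensor_id, value, ts),
    )

upsert("s1", 20.5, "2024-01-01T00:00")  # new data -> insert
upsert("s1", 21.0, "2024-01-01T00:05")  # changing data -> update in place
rows = conn.execute("SELECT sensor_id, value FROM readings").fetchall()
print(rows)  # [('s1', 21.0)]
```

The key design point is that the writer does not need to know whether a key already exists; the sink resolves insert-vs-update itself.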
Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. RudderStack’s smart customer data pipeline is warehouse-first.
Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows.
Links: Database Refactoring Website, Book, Thoughtworks, Martin Fowler, Agile Software Development, XP (Extreme Programming), Continuous Integration, The Book, Wikipedia, Test-First Development, DDL (Data Definition Language), DML (Data Manipulation Language), DevOps, Flyway, Liquibase, DBMaintain, Hibernate, SQLAlchemy, ORM (Object Relational Mapper), ODM (Object Document (..)
In the early days, data was the foundation to support basic operations and learn how to achieve operational excellence. Over time, data became the driver for strategic decision-making and innovation. Our journey began by building a strong Master Data Foundation , which laid the groundwork for our first generation of systems.
The Apache Iceberg project continues developing an implementation of the Iceberg specification in the form of a Java library. Several compute engines, such as Impala, Hive, Spark, and Trino, support querying data in the Iceberg table format by adopting this Java library provided by the Apache Iceberg project.
What is a typical workflow for someone using Compilerworks to manage their data lineage? How does Compilerworks simplify the process of migrating between data warehouses and processing platforms?
You work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline, but what happens when it gets to the data warehouse? Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics.
Contrast this with the skills honed over decades for gaining access, building data warehouses, performing ETL, and creating reports and/or applications using Structured Query Language (SQL). A rare breed. Benefits of Streaming Data for Business Owners: what does all this mean for those in business leadership roles?
Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Go to dataengineeringpodcast.com/ascend and sign up for a free trial.
How Mixpanel delivers funnels up to 7x faster than the data warehouse — the Mixpanel team is proud to say that they have better performance than Snowflake. It's written in Java and it does what other orchestrators are already doing. I think it will unlock a lot of use cases in BigQuery. Curious to see if it will pick up.
The past decades of enterprise data platform architectures can be summarized in 69 words. First generation: expensive, proprietary enterprise data warehouse and business intelligence platforms maintained by a specialized team drowning in technical debt. Data professionals are not perfectly interchangeable.
A data engineer's integral task is building and maintaining data infrastructure — the system managing the flow of data from its source to its destination. This typically includes setting up two processes: an ETL pipeline, which moves data, and data storage (typically a data warehouse), where it's kept.
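Those two processes can be sketched in a few lines. This is a toy illustration of the ETL-plus-storage split described above — the `orders` table and the extract/transform/load helpers are all hypothetical, with an in-memory SQLite database standing in for the warehouse:

```python
import sqlite3

def extract():
    # In practice this would read from an API, log files, or an OLTP database.
    return [{"id": 1, "amount": "20.00"}, {"id": 2, "amount": "5.00"}]

def transform(records):
    # Cast types and shape rows so the data is analysis-ready.
    return [(r["id"], float(r["amount"])) for r in records]

def load(rows, conn):
    # The storage side: persist transformed rows into the warehouse table.
    conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(transform(extract()), warehouse)
total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 25.0
```

Real pipelines add scheduling, retries, and incremental loads on top, but the extract → transform → load structure is the same.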
Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams.
You have full control over your data, and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. How do you maintain feature parity between the Python and Java integrations?
Data warehouses are optimized for batched writes and complex analytical queries. Between those use cases there are varying levels of support for fast reads on quickly changing data.
In order to quickly identify if and how two data systems are out of sync, Gleb Mezhanskiy and Simon Eskildsen partnered to create the open source data-diff utility. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability.
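The core idea behind diffing two data systems can be sketched in a few lines: compare tables by a cheap aggregate checksum first, and only fall back to row-level comparison on a mismatch. This is a toy illustration of the concept, not how the data-diff tool itself is implemented:

```python
import hashlib

def table_checksum(rows):
    # One digest over all rows: if the two sides match, no row work is needed.
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(repr(row).encode())
    return h.hexdigest()

def diff(source_rows, target_rows):
    if table_checksum(source_rows) == table_checksum(target_rows):
        return []  # in sync: skip the expensive row-level pass entirely
    source, target = set(source_rows), set(target_rows)
    return sorted(("missing_in_target", r) for r in source - target) + \
           sorted(("missing_in_source", r) for r in target - source)

src = [(1, "a"), (2, "b"), (3, "c")]
dst = [(1, "a"), (2, "B")]
print(diff(src, dst))
```

Production tools refine this by checksumming segments of the key range recursively, so only the segments that differ are ever compared row by row.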
Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java.
Summary The "data lakehouse" architecture balances the scalability and flexibility of data lakes with the ease of use and transaction support of data warehouses.
Spark provides an interactive shell that can be used for ad hoc data analysis, as well as APIs for programming in Java, Python, and Scala. NoSQL databases are designed for scalability and flexibility, making them well suited for storing big data. The two most popular data warehouse systems are Teradata and Oracle Exadata.
The CDC events are passed on to the Data Mesh enrichment processor, which issues GraphQL queries to Studio Edge to enrich the data. Once the data has landed in the Iceberg tables in the Netflix data warehouse, it can be used for ad hoc or scheduled querying and reporting. Currently the Iceberg sink is append-only.
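The flow described above — CDC event, enrichment lookup, append-only sink — can be sketched schematically. All names here are illustrative: a plain dictionary stands in for the GraphQL call to Studio Edge, and a list stands in for the Iceberg sink:

```python
# Stand-in for the enrichment source queried by the processor.
ENRICHMENT_SOURCE = {"movie-1": {"title": "Stranger Things"}}

def enrich(cdc_event):
    # The enrichment processor joins the raw change event with entity metadata.
    extra = ENRICHMENT_SOURCE.get(cdc_event["entity_id"], {})
    return {**cdc_event, **extra}

sink = []  # append-only sink: updates arrive as new rows, never in-place edits

def write(event):
    sink.append(enrich(event))

write({"op": "insert", "entity_id": "movie-1", "version": 1})
write({"op": "update", "entity_id": "movie-1", "version": 2})
print(len(sink))  # 2 rows: the update is appended rather than applied in place
```

An append-only sink like this pushes the work of resolving the latest version of each entity to read time (or to a downstream compaction job), which is why the append-only limitation is worth calling out.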
In the second blog of the Universal Data Distribution blog series, we explored how Cloudera DataFlow for the Public Cloud (CDF-PC) can help you implement use cases like data lakehouse and data warehouse ingest, cybersecurity, and log optimization, as well as IoT and streaming data collection.
A degree in computer science, software engineering, or a similar subject is often required of data engineers. They have extensive knowledge of databases, data warehousing, and programming languages like Python or Java. Also, data engineers are well-versed in distributed systems, cloud computing, and data modeling.
Summary The optimal format for storage and retrieval of data is dependent on how it is going to be used. For analytical systems there are decades of investment in data warehouses and various modeling techniques.