This blog post expands on that insightful conversation, offering a critical look at Iceberg's potential and the hurdles organizations face when adopting it. Data ingestion tools often create numerous small files, which can degrade performance during query execution. What are your data governance and security requirements?
DE Zoomcamp 2.2.1 – Introduction to Workflow Orchestration. Following last week's blog, we move to data ingestion. We already had a script that downloaded a CSV file, processed the data, and pushed it to a Postgres database. This week, we think about our data ingestion design.
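A minimal sketch of that kind of script, assuming pandas and SQLAlchemy; the CSV URL, column name, table name, and connection string are hypothetical placeholders, not the ones used in the course.

```python
# Minimal sketch of a CSV-to-Postgres ingestion step (hypothetical names throughout).
import pandas as pd
from sqlalchemy import create_engine

CSV_URL = "https://example.com/trips.csv"  # hypothetical source file
engine = create_engine("postgresql://user:password@localhost:5432/ny_taxi")

df = pd.read_csv(CSV_URL)                                        # download and parse
df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"])    # light processing
df.to_sql("trips", con=engine, if_exists="append", index=False)  # push to Postgres
```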
Data Collection/Ingestion: The next component in the data pipeline is the ingestion layer, which is responsible for collecting and bringing data into the pipeline. By efficiently handling data ingestion, this component sets the stage for effective data processing and analysis.
In this blog post, we show how Rockset’s Smart Schema feature lets developers use real-time SQL queries to extract meaningful insights from raw semi-structured data ingested without a predefined schema. In NoSQL systems, data is strongly typed but dynamically so.
Data Vault as a practice does not stipulate how you transform your data, only that you follow the same standards to populate business vault link and satellite tables as you would to populate raw vault link and satellite tables. Feature engineering: Data is transformed to support ML model training. ML workflow, ubr.to/3EJHjvm
Two popular approaches that have emerged in recent years are data warehouses and big data. While both deal with large datasets, when it comes to data warehouse vs. big data, they have different focuses and offer distinct advantages. Data warehousing offers several advantages.
An enterprise looking to streamline its entire end-to-end analytics lifecycle may implement a comprehensive solution incorporating best practices from each approach, starting with robust data ingestion (DataOps) through optimized model training and deployment (MLOps). Better data observability equals better data quality.
With Upsolver SQLake, you build a pipeline for data in motion simply by writing a SQL query that defines your transformation. The blog covers the European Commission’s updated version of the Standard Contractual Clauses (EU SCCs) and how to prepare for the privacy requirements they introduce. Kudos to the author and the Atlassian team.
A combination of structured and semi-structured data can be used for analysis and loaded into the cloud database without needing to be transformed into a fixed relational schema first. The Data Load Accelerator delivers the solution described above. Here’s a detailed look at the architecture of Snowflake.
At Rockset, we work hard to build developer tools (as well as APIs and SDKs) that allow you to easily consume semi-structured data using SQL and run sub-second queries on real-time data. In the community, we’ll be sharing topics related to product releases, blogs, events, memes, and more.
By employing robust data modeling techniques, businesses can unlock the true value of their data lake and transform it into a strategic asset. With many data modeling methodologies and processes available, choosing the right approach can be daunting. Want to learn more about data governance?
In this blog, we will cover some of the most recently launched improvements for the Snowflake platform. For example, ingest performance: we improved the ingest performance of both JSON and Parquet files with case-insensitive data by up to 25%.
However, transforming data into a product so that it can deliver outsized business value requires more than just a mission statement; it requires a solid foundation of technical capabilities and a truly data-centric culture. This multitude of sources often causes a dispersed, complex, and poorly structured data landscape.
Despite these limitations, data warehouses, introduced in the late 1980s based on ideas developed even earlier, remain in widespread use today for certain business intelligence and data analysis applications. While data warehouses are still in use, their use cases are limited because they only support structured data.
In the previous blog posts in this series, we introduced the Netflix Media Database (NMDB) and its salient “Media Document” data model. A fundamental requirement for any lasting data system is that it should scale along with the growth of the business applications it wishes to serve.
You have complex, semi-structured data: nested JSON or XML, for instance, containing mixed types, sparse fields, and null values. It's messy, you don't understand how it's structured, and new fields appear every so often. This enables Rockset to generate a Smart Schema on the data.
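As a rough illustration of the problem (and explicitly not Rockset's implementation), the sketch below infers a schema from messy JSON documents by recording, for every nested field path, each type actually observed, nulls included.

```python
# Sketch: infer field paths and observed types from messy, semi-structured JSON.
from collections import defaultdict

def observe(doc, schema, prefix=""):
    """Record the observed Python type of every (possibly nested) field."""
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            observe(value, schema, path)
        else:
            schema[path].add(type(value).__name__)

docs = [
    {"id": 1, "tags": ["a", "b"], "meta": {"score": 3.2}},
    {"id": "2", "meta": {"score": None}},  # mixed types, sparse fields, nulls
]

schema = defaultdict(set)
for doc in docs:
    observe(doc, schema)

for path, types in schema.items():
    print(path, sorted(types))
# id ['int', 'str']
# tags ['list']
# meta.score ['NoneType', 'float']
```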
Today’s data landscape is characterized by exponentially increasing volumes of data, comprising a variety of structured, unstructured, and semi-structured data types originating from an expanding number of disparate data sources located on-premises, in the cloud, and at the edge. Data orchestration.
Data professionals who work with raw data, like data engineers, data analysts, machine learning scientists, and machine learning engineers, also play a crucial role in any data science project. Of these professions, this blog will discuss the data engineering job role.
Documents in MongoDB can also have complex structures. Data is stored as JSON documents that can contain nested objects and arrays, which add further intricacies when building analytical queries on the data, such as accessing nested properties and exploding arrays to analyze individual elements.
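For illustration only, here is a hedged pymongo sketch: dot notation reaches a nested property, and a $unwind stage explodes an array so each element can be aggregated individually. The database, collection, and field names are hypothetical.

```python
# Sketch: analytical query over nested MongoDB documents (hypothetical names).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

pipeline = [
    {"$unwind": "$items"},                        # explode the array: one doc per element
    {"$group": {
        "_id": "$items.sku",                      # access a nested property via dot notation
        "total_qty": {"$sum": "$items.quantity"},
    }},
    {"$sort": {"total_qty": -1}},
]

for row in orders.aggregate(pipeline):
    print(row["_id"], row["total_qty"])
```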
A single car connected to the Internet with a telematics device plugged in generates and transmits 25 gigabytes of data hourly at a near-constant velocity. And most of this data has to be handled in real-time or near real-time. Variety is the vector showing the diversity of Big Data. Big Data analytics processes and tools.
Data Pipelines: Snowpipe Streaming – public preview. While data generated in real time is valuable, it is more valuable when paired with historical data that helps provide context. Read our blog to learn more. Learn more about what these new functions are and how they work in our recent blog post.
It provides a flexible data model that can handle different types of data, including unstructured and semi-structured data. Key features: flexible data modeling, high scalability, support for real-time analytics. 4. Key features: instant elasticity, support for semi-structured data, built-in data security. 5.
The solution combines Cloudera Enterprise, the scalable distributed platform for big data, machine learning, and analytics, with riskCanvas, the financial crime software suite from Booz Allen Hamilton. It supports a variety of storage engines that can handle raw files, structured data (tables), and unstructured data.
This is the fifth post in a series by Rockset's CTO and Co-founder Dhruba Borthakur on Designing the Next Generation of Data Systems for Real-Time Analytics. We'll be publishing more posts in the series in the near future, so subscribe to our blog so you don't miss them! This keeps the data intact.
Data pipelines are a significant part of the big data domain, and every professional working or willing to work in this field must have extensive knowledge of them. In broader terms, two types of data -- structured and unstructured data -- flow through a data pipeline.
Here’s What You Need to Know About PySpark: This blog will take you through the basics of PySpark, the PySpark architecture, and a few popular PySpark libraries, among other things. Finally, you'll find a list of PySpark projects to help you gain hands-on experience and land an ideal job in Data Science or Big Data.
If you're looking to break into the exciting field of big data or advance your big data career, being well-prepared for big data interview questions is essential. Get ready to expand your knowledge and take your big data career to the next level! But the concern is - how do you become a big data professional?
This interconnected approach enables teams to create, manage, and automate data pipelines with ease and minimal intervention. In contrast, traditional data pipelines often require significant manual effort to integrate various external tools for data ingestion, transfer, and analysis.
If you’re collecting structured or semi-structured data that works well with PostgreSQL, offloading read operations to PostgreSQL is a great way to avoid impacting the performance of your primary MongoDB database. Like PostgreSQL, Rockset also supports full-featured SQL, including joins.
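A rough sketch of what such an offloaded read might look like against PostgreSQL, using psycopg2 and a join; connection details, table names, and columns are hypothetical.

```python
# Sketch: serve an analytical read (with a join) from PostgreSQL instead of MongoDB.
import psycopg2

conn = psycopg2.connect("dbname=analytics user=reporter password=secret host=localhost")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT c.name, COUNT(o.id) AS order_count
        FROM customers c
        JOIN orders o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY order_count DESC
        LIMIT 10;
        """
    )
    for name, order_count in cur.fetchall():
        print(name, order_count)
```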
Together, they empower developers to build performant internal tools, such as customer 360 and logistics monitoring apps, by solely using data APIs and pre-built UI components. In this blog, we’ll be building a customer 360 app using Rockset and Retool. For this blog, we’ll be using the customer support tool template.
Table of Contents: 20 Open Source Big Data Projects To Contribute; How to Contribute to Open Source Big Data Projects? There are thousands of open-source projects in action today. This blog will walk through the most popular and fascinating open source big data projects.
PySpark SQL is a structured data library for Spark. PySpark SQL, in contrast to the PySpark RDD API, offers additional detail about the data structure and operations. A DataFrame is an immutable distributed columnar data collection.
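A small sketch of PySpark SQL at work: build a DataFrame, register it as a temporary view, and query it with SQL; transformations return a new DataFrame rather than mutating the original. Table and column names are illustrative.

```python
# Sketch: structured data with PySpark SQL and DataFrames.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-sql-sketch").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 30").show()

# DataFrames are immutable: filter() returns a new DataFrame.
older = df.filter(df.age > 30)
older.show()
```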
Demands on the cloud data warehouse are also evolving to require it to become more of an all-in-one platform for an organization’s analytics needs. Enter Snowflake: the Snowflake Data Cloud is one of the most popular and powerful CDW providers.
Ace your big data interview by adding some unique and exciting Big Data projects to your portfolio. This blog lists over 20 big data projects you can work on to showcase your big data skills and gain hands-on experience in big data tools and technologies. are examples of semi-structured data.
If you are unsure, be vocal about your thought process and the way you are thinking – take inspiration from the examples below and explain the answer to the interviewer through your learnings and experiences from data science and machine learning projects. Data Engineering Pipelines: Data is everything.
This instant personalization is built on a well-structured data strategy that combines three key types of data: First-Party Data: This is data directly collected from your owned channels, such as your website and mobile apps. A unified view of customer behavior is key to delivering truly personalized experiences.