Sun.May 28, 2023

article thumbnail

A Roadmap To Bootstrapping The Data Team At Your Startup

Data Engineering Podcast

Summary Building a data team is hard in any circumstance, but at a startup it can be even more challenging. The requirements are fluid, you probably don't have a lot of existing data talent to manage the hiring and onboarding, and there is a need to move fast. Ghalib Suleiman has been on both sides of this equation and joins the show to share his hard-won wisdom about how to start and grow a data team in the early days of company growth.

Data Lake 162
article thumbnail

Fast String Processing with Polars?—?Scam Emails Dataset

Towards Data Science

Clean, process and tokenise texts in milliseconds using in-built Polars string expressions Continue reading on Towards Data Science »

Insiders

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

article thumbnail

Debezium Serialization with Avro and Apicurio Registry Simplified: A Comprehensive Guide 101

Hevo

Organizations use Kafka and Debezium to track real-time changes in databases and stream them to different applications. But often, due to a colossal amount of messages in Kafka topics, it becomes challenging to serialize these messages. Every message in Kafka’s topic has a key and value.

Kafka 52
article thumbnail

Data Engineering Weekly #132

Data Engineering Weekly

Data Engineering Weekly Is Brought to You by RudderStack RudderStack provides data pipelines that make collecting data from every application, website, and SaaS platform easy, then activating it in your warehouse and business tools. Sign up free to test out the tool today. Editor’s Note: DEW featured in AirByte’s State of the Data & Slack’s usage of Kafka DEW has been recognized as the number one individually run data newsletter in the industry, according to the latest AirB

article thumbnail

Apache Airflow® Crash Course: From 0 to Running your Pipeline in the Cloud

With over 30 million monthly downloads, Apache Airflow is the tool of choice for programmatically authoring, scheduling, and monitoring data pipelines. Airflow enables you to define workflows as Python code, allowing for dynamic and scalable pipelines suitable to any use case from ETL/ELT to running ML/AI operations in production. This introductory tutorial provides a crash course for writing and deploying your first Airflow pipeline.