March, 2021

Building a Data Engineering Project in 20 Minutes

Simon Späti

This post focuses on practical data pipelines with examples from web-scraping real-estate listings, uploading them to S3 with MinIO, Spark and Delta Lake, adding some Data Science magic with Jupyter Notebooks, ingesting into the data warehouse Apache Druid, visualising dashboards with Superset and managing everything with Dagster. The goal is to touch on common data engineering challenges and to use promising new technologies, tools or frameworks, most of which I wrote about in Business Intelligence

Apache Kafka Made Simple: A First Glimpse of a Kafka Without ZooKeeper

Confluent

At the heart of Apache Kafka® sits the log—a simple data structure that uses sequential operations that work symbiotically with the underlying hardware. Efficient disk buffering and CPU cache usage, […].

Kafka 145

How to Host a Virtual Global Data Science Hackathon

Teradata

Learn how best to host a virtual hackathon, or any virtual event, with these tips and tricks from our Teradata team. Read more.

Toward a Data Mesh (part 2) : Architecture & Technologies

François Nguyen

Just an illustration – not the truth and you certainly can do it with other technologies. TL;DR: After setting up and organizing the teams, we are describing 4 topics to make data mesh a reality: the self-serve platform based on a serverless philosophy (life is too short to do provisioning); the building of data products (as code), where we are building data workflows, not data pipelines; the promotion of data domains, where the metadata on the data life cycle is as important as your data. The old dat

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.

Remote Workstations for the Discerning Artists

Netflix Tech

By Michelle Brenner. Netflix is poised to become the world’s most prolific producer of visual effects and original animated content. To meet that demand, we need to attract the world’s best artistic talent. Artists like to work at places where they can create groundbreaking entertainment instead of worrying about getting access to the software or source files they need.

Top Three Requirements for Data Flows

Cloudera

Data flows are an integral part of every modern enterprise. No matter whether they move data from one operational system to another to power a business process or fuel central data warehouses with the latest data for near-real-time reporting, life without them would be full of manual, tedious and error-prone data modification and copying tasks. At Cloudera, we’re helping our customers implement data flows on-premises and in the public cloud using Apache NiFi, a core component of Cloudera DataFl

Cloud 120

More Trending

Under the Hood of Real-Time Analytics with Apache Kafka and Pinot

Confluent

Real-time analytics has become the need of the hour for modern internet companies. The ability to derive internal insights around business metrics, user growth and adoption as well as security […].

Kafka 144

CFO Analytics: What Is It and Why Should You Care?

Teradata

Finance-driven analytics might be the largest untapped opportunity for organizations & a catalyst for driving business value & strategic vision. But, what exactly is CFO analytics?

IT 119

Towards a Data Mesh (part 1) : Data Domains and Teams Topologies.

François Nguyen

Just an illustration – not the truth and we will pivot if it does not work. I discovered Zhamak Dehghani’s first article about Data Mesh in August 2020. Thanks to YouTube, you have the live illustration in this video with even more context and explanations. And then, you have this second video that is an introduction to her second article (December 2020).

ConsoleMe: A Central Control Plane for AWS Permissions and Access

Netflix Tech

By Curtis Castrapel, Patrick Sanders, and Hee Won Kim. At AWS re:Invent 2020, we open sourced two new tools for managing multi-account AWS permissions and access. We’re very excited to bring you ConsoleMe (pronounced: kuhn-soul-mee), and its CLI utility, Weep (pun intended)! If you missed the talk, check it out here.

AWS 105

Apache Airflow®: The Ultimate Guide to DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

Congratulations to our 2021 Partner Award Winners

Cloudera

At our Partner Sales Kickoff, we announced the winners of the 2021 Cloudera Partner Awards. These six awards recognize Cloudera partners who are dedicated to enabling customers to do more with their data by leveraging the power of an enterprise data cloud. Thank you to this year’s winners for their partnership in helping our joint customers drive value from their data in the hybrid cloud.

Data Quality Management For The Whole Team With Soda Data

Data Engineering Podcast

Summary: Data quality is at the top of everyone’s mind recently, but getting it right is as challenging as ever. One of the contributing factors is the number of people who are involved in the process and the potential impact on the business if something goes wrong. In this episode Maarten Masschelein and Tom Baeyens share the work they are doing at Soda to bring everyone on board to make your data clean and reliable.

Monitoring Your Event Streams: Integrating Confluent with Prometheus and Grafana

Confluent

Self-managing a highly scalable distributed system with Apache Kafka® at its core is not an easy feat. That’s why operators prefer tooling such as Confluent Control Center for administering and […].

Kafka 131

Enterprise Data Operating Systems in the Cloud: Necessary, But Not Sufficient

Teradata

Getting your Cloud data architecture right starts with understanding which data products you need, the roles they perform, & the functional & non-functional characteristics that those roles demand.

Cloud 110

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

How to trigger a spark job from AWS Lambda

Start Data Engineering

Outline: Event driven pipelines; Lambda function to trigger spark jobs; Setup and run; Monitoring and logging; Teardown; Conclusion; Further reading; References. Event driven pipelines: Event driven systems represent a software design pattern where logic is executed in response to an event. This event can be a file creation on S3, a new database row, an API call, etc.
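
As a rough illustration of the pattern this teaser describes, the sketch below shows a Lambda-style handler reacting to an S3 file-creation event. It is not taken from the linked article: `submit_spark_job` is a hypothetical placeholder for whatever call would actually start the Spark job, and the job name `spark-etl` is an assumption. Only the shape of the S3 event payload (`Records`, `s3.bucket.name`, `s3.object.key`, with URL-encoded keys) follows the standard S3 notification format.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs found in an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record.get("s3")
        if s3_info is None:
            continue  # skip records that did not come from S3
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded, e.g. spaces become "+".
        key = unquote_plus(s3_info["object"]["key"])
        objects.append((bucket, key))
    return objects


def submit_spark_job(bucket, key):
    """Hypothetical placeholder: a real pipeline would start a Spark job here.

    It only builds and returns the request it would send.
    """
    return {"job": "spark-etl", "input": f"s3://{bucket}/{key}"}


def lambda_handler(event, context):
    """Entry point: submit one (placeholder) Spark job per new S3 object."""
    return [submit_spark_job(bucket, key) for bucket, key in extract_s3_objects(event)]
```

Keeping the event parsing separate from the job submission makes the handler easy to test locally with a hand-written event dictionary, without any cloud access.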

AWS 100

The Netflix Cosmos Platform

Netflix Tech

Orchestrated Functions as a Microservice, by Frank San Miguel on behalf of the Cosmos team. Introduction: Cosmos is a computing platform that combines the best aspects of microservices with asynchronous workflows and serverless functions. Its sweet spot is applications that involve resource-intensive algorithms coordinated via complex, hierarchical workflows that last anywhere from minutes to years.

Media 94

Data governance beyond SDX: Adding third party assets to Apache Atlas

Cloudera

Governance and the sustainable handling of data is a critical success factor in virtually all organizations. While Cloudera Data Platform (CDP) already supports the entire data lifecycle from ‘Edge to AI’, we at Cloudera are fully aware that enterprises have more systems outside of CDP. It is crucial to keep CDP from becoming the next silo in your IT landscape.

Real World Change Data Capture At Datacoral

Data Engineering Podcast

Summary: The world of business is becoming increasingly dependent on information that is accurate up to the minute. For analytical systems, the only way to provide this reliably is by implementing change data capture (CDC). Unfortunately, this is a non-trivial undertaking, particularly for teams that don’t have extensive experience working with streaming data and complex distributed systems.

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

How to Tune RocksDB for Your Kafka Streams Application

Confluent

Apache Kafka ships with Kafka Streams, a powerful yet lightweight client library for Java and Scala to implement highly scalable and elastic applications and microservices that process and analyze data […].

Kafka 131

Enhancing Customer Experience with Every Journey

Teradata

Big Tech giants dominate by using data to improve product & experience. The auto industry can emulate this by analyzing data to improve customer experience & guide individual choices.

Data 95

Reverse ETL with dbt and Grouparoo

Grouparoo

Teams are centralizing their data in their data warehouse by loading data in and transforming it as necessary. Increasingly, we are seeing teams turn to dbt to do this transforming. The idea is to write *.sql files that, when run in the right order, create useful rollup tables or materialized views of the data. We've been asked by teams using dbt how Grouparoo can then sync their data to their cloud-based apps.
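
To make "run in the right order" concrete, here is a minimal sketch of dependency ordering using Python's standard-library `graphlib`. It is only an illustration of the idea, not dbt's or Grouparoo's implementation, and the model names (`stg_orders`, `customer_rollup`, and so on) are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical models: each key maps to the set of models it reads from.
model_dependencies = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_by_customer": {"stg_orders", "stg_customers"},
    "customer_rollup": {"orders_by_customer"},
}


def run_order(dependencies):
    """Return model names ordered so every model comes after its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for model in run_order(model_dependencies):
        print(f"would run {model}.sql")
```

A rollup table such as the hypothetical `customer_rollup` would then always be built after the tables it reads from, which is the property a sync tool downstream relies on.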

A Day in the Life of an Experimentation and Causal Inference Scientist @ Netflix

Netflix Tech

Stephanie Lane, Wenjing Zheng, Mihir Tendulkar. Source credit: Netflix. Within the rapid expansion of data-related roles in the last decade, the title Data Scientist has emerged as an umbrella term for myriad skills and areas of business focus. What does this title mean within a given company, or even within a given industry? It can be hard to know from the outside.

Prepare Now: 2025's Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

Cloudera Data Platform extends Hybrid Cloud vision support by supporting Google Cloud

Cloudera

CDP Public Cloud is now available on Google Cloud. The addition of support for Google Cloud enables Cloudera to deliver on its promise to offer its enterprise data platform at a global scale. CDP Public Cloud is already available on Amazon Web Services and Microsoft Azure. With the addition of Google Cloud, we deliver on our vision of providing a hybrid and multi-cloud architecture to support our customers’ analytics needs regardless of deployment platform.

Managing The DoorDash Data Platform

Data Engineering Podcast

Summary: The team at DoorDash has a complex set of optimization challenges to deal with using data that they collect from a multi-sided marketplace. In order to handle the volume and variety of information that they use to run and improve the business, the data team has to build a platform that analysts and data scientists can use in a self-service manner.

To Pull or to Push Your Data with Kafka Connect? That Is the Question.

Confluent

Today, every company is a data company. There are many different data pipeline, integration, and ingestion tools in the market, but before you can feed your data analytics needs, data […].

Kafka 126

Don’t Just Collect Vehicle Data – Monetize It!

Teradata

As the auto sector transforms, vehicle data is becoming one of the most important sources of insight. But if it is left in fragmented silos, it quickly becomes a cost & delivers little value.

IT 64

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

How the Open Edge Is Driving Digital Transformation

DataKitchen

The post How the Open Edge Is Driving Digital Transformation first appeared on DataKitchen.

Production Media Management: Transforming Media Workflows by leveraging the Cloud

Netflix Tech

Written by Anton Margoline, Avinash Dathathri, Devang Shah and Murthy Parthasarathi. Credit to Netflix Studio’s Product, Design, Content Hub Engineering teams along with all of the supporting partner and platform teams. In this post, we will share a behind-the-scenes look at how Netflix delivers technology and infrastructure to help production crews create and exchange media during production and post production stages.

Media 72

CDP Endpoint Gateway provides Secure Access to CDP Public Cloud Services running in private networks

Cloudera

Cloudera Data Platform (CDP) Public Cloud allows users to deploy analytic workloads into their cloud accounts. These workloads cover the entire data lifecycle and are managed from a central multi-cloud Cloudera Control Plane. CDP provides the flexibility to deploy these resources into public or private subnets. Nearly unanimously, we’ve seen customers deploy their workloads to private subnets.

Leave Your Data Where It Is And Automate Feature Extraction With Molecula

Data Engineering Podcast

Summary: A majority of the time spent in data engineering is copying data between systems to make the information available for different purposes. This introduces challenges such as keeping information synchronized, managing schema evolution, and building transformations to match the expectations of the destination systems. H.O. Maycotte was faced with these same challenges but at a massive scale, leading him to question if there is a better way.

IT 100

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.