Top Data Engineering Digest Data Schemas Data Cleanse Content for January, 2021

January, 2021

How to update millions of records in MySQL?

Start Data Engineering

JANUARY 30, 2021

When updating a large number of records in an OLTP database such as MySQL, you have to be mindful of locking the records. While those records are locked, other transactions on your database cannot edit (update or delete) them.
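One common way to limit how long rows stay locked is to split a large update into smaller batches and commit after each one. The sketch below is a minimal illustration of that idea, not the linked article's code. It assumes a hypothetical `orders` table with an integer primary key `id` and a `status` column, and it uses Python's built-in `sqlite3` module as a stand-in so it runs without a MySQL server. SQLite's locking differs from MySQL's row-level locking, so only the batching pattern carries over.

```python
import sqlite3


def update_in_batches(conn, batch_size=1000):
    """Mark pending rows as done, one primary-key range at a time."""
    (max_id,) = conn.execute("SELECT COALESCE(MAX(id), 0) FROM orders").fetchone()
    updated = 0
    start = 1
    while start <= max_id:
        end = start + batch_size - 1
        cur = conn.execute(
            "UPDATE orders SET status = 'done' "
            "WHERE id BETWEEN ? AND ? AND status = 'pending'",
            (start, end),
        )
        # Committing after each batch ends the transaction, so locks are
        # held only for the duration of one batch.
        conn.commit()
        updated += cur.rowcount
        start = end + 1
    return updated


# Hypothetical demo data: 2500 pending rows in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO orders (status) VALUES (?)", [("pending",)] * 2500)
conn.commit()

total = update_in_batches(conn, batch_size=1000)
remaining = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE status = 'pending'"
).fetchone()[0]
```

Walking the primary key in fixed ranges keeps each statement's scope predictable. The batch size here is arbitrary and would need tuning against a real workload.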

MySQL

MySQL Database

Why do you need Agile Software Development for your Data Use Case ?

François Nguyen

JANUARY 29, 2021

The agile part of DataOps. In a previous article, we defined DataOps as a “combination of tools and methods inspired by Agile, DevOps and Lean Manufacturing” (thanks to DataKitchen for this definition). Let’s focus on the agile part and why it is so relevant for your data use cases. “Data is like a box of chocolates, you never know what you’re gonna get.” This is about the nature of data: you cannot guess what will be the content and the quality of your data sources

Manufacturing

Manufacturing Data Analysis Data Building

Trending Sources

Job conversion possibilities within Data Science

Team Data Science

JANUARY 14, 2021

Data science encompasses a range of fields, like data analysis, machine learning, statistics, computer science, infrastructure, and data architecture. Looking at how businesses are transforming on a day-to-day basis, we may infer that some data science jobs will be in high demand within the next ten years. There is a strong need for experts who understand the market demands, who can formulate a data-driven approach, and who can then execute it.

Data Science

Data Science Computer Science Data Engineering Data Engineer

Confluent and Microsoft Announce Strategic Alliance

Confluent

JANUARY 26, 2021

Today, I am thrilled to announce a new strategic alliance with Microsoft to enable a seamless, integrated experience between Confluent Cloud and the Azure platform. This represents a significant milestone […].

Cloud

Cloud Programming

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Business Intelligence

Building a Machine Learning Application With Cloudera Data Science Workbench And Operational Database, Part 1: The Set-Up & Basics

Cloudera

JANUARY 6, 2021

Introduction. Python is used extensively among Data Engineers and Data Scientists to solve all sorts of problems from ETL/ELT pipelines to building machine learning models. Apache HBase is an effective data storage system for many workflows but accessing this data specifically through Python can be a struggle. For data professionals that want to make use of data stored in HBase the recent upstream project “hbase-connectors” can be used with PySpark for basic operations.

Machine Learning

Machine Learning Data Science Database Building

Improving Population Health Through Citizen 360

Teradata

JANUARY 5, 2021

By leveraging data to create a 360-degree view of their citizenry, government agencies can create more optimal experiences & improve outcomes such as closing the tax gap or improving quality of care.

Government

Government IT Data

How to unit test sql transforms in dbt

Start Data Engineering

JANUARY 16, 2021

With the recent advancements in data warehouses and tools like dbt, most transformations (the T of ELT) are being done directly in the data warehouse. While this provides a lot of functionality out of the box, it gets tricky when you want to test your SQL code locally before deploying to production.

SQL

SQL Data Warehouse Coding IT

More Trending

The very strange way of doing Data Quality at Airbnb

François Nguyen

JANUARY 23, 2021

Or why you should have a look at Data Observability! This article covers the second part of how Airbnb manages data quality: “Part 2 — A New Gold Standard”. The first part can be found here, and it covered good principles about roles and responsibilities. The second part is really about how they do it and all the steps needed to obtain a “certification”. They named it Midas, after the famous king who could turn everything into gold (with a not-so-good ending).

Finance

Finance Certification Data Pipeline Data

Skills you should have as a Data Engineer

Team Data Science

JANUARY 8, 2021

Big Data has become the dominant innovation in all high-performing companies. Notable businesses today focus their decision-making capabilities on knowledge gained from the study of big data. Big Data is a collection of large data sets, particularly from new sources, providing an array of possibilities for those who want to work with data and are enthusiastic about unraveling trends in rows of new, unstructured data.

Data Engineering

Data Engineering Data Engineer Engineering Unstructured Data

Helpful Tools for Apache Kafka Developers

Confluent

JANUARY 20, 2021

Apache Kafka® is at the core of a large ecosystem that includes powerful components, such as Kafka Connect and Kafka Streams. This ecosystem also includes many tools and utilities that […].

Kafka

Kafka Utilities

Making It Easier To Stick B2B Data Integration Pipelines Together With Hotglue

Data Engineering Podcast

JANUARY 25, 2021

Summary: Businesses often need to be able to ingest data from their customers in order to power the services that they provide. Each new source that they need to integrate with brings another custom set of ETL tasks that they need to maintain. In order to reduce the friction involved in supporting new data transformations, David Molot and Hassan Syyid built the Hotglue platform.

Data Integration

Data Integration IT Python BI

Prepare Now: 2025s Must-Know Trends For Product And Data Leaders

Speaker: Jay Allardyce, Deepak Vittal, and Terrence Sheflin

As we look ahead to 2025, business intelligence and data analytics are set to play pivotal roles in shaping success. Organizations are already starting to face a host of transformative trends as the year comes to a close, including the integration of AI in data analytics, an increased emphasis on real-time data insights, and the growing importance of user experience in BI solutions.

Data

Building a Machine Learning Application With Cloudera Data Science Workbench And Operational Database, Part 3: Productionization of ML models

Cloudera

JANUARY 20, 2021

In this last installment, we’ll discuss a demo application that uses PySpark.ML to make a classification model based off of training data stored in both Cloudera’s Operational Database (powered by Apache HBase) and Apache HDFS. Afterwards, this model is then scored and served through a simple Web Application. For more context, this demo is based on concepts discussed in this blog post How to deploy ML models to production.

Machine Learning

Machine Learning Database Data Science Building

How to Backfill a SQL query using Apache Airflow

Start Data Engineering

JANUARY 6, 2021

Backfilling refers to any process that involves modifying or adding new data to existing records in a dataset. This is a common use case in data engineering. For example, a change in business logic may need to be applied to an already processed dataset.
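In practice, a backfill often amounts to re-running a date-parameterized job once for each day in a past range. The following is a minimal sketch of that loop in plain Python. It does not use Airflow's API, and the names `execution_dates`, `backfill`, and the stand-in job are hypothetical.

```python
from datetime import date, timedelta


def execution_dates(start, end):
    """Yield every calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def backfill(run_for_day, start, end):
    """Re-run a date-parameterized job for each day in the range."""
    return [run_for_day(day) for day in execution_dates(start, end)]


# Stand-in job: a real one would re-run a SQL query for the given day.
results = backfill(
    lambda day: f"processed {day.isoformat()}",
    date(2021, 1, 1),
    date(2021, 1, 3),
)
```

Keeping the job idempotent for a given day (for example, by overwriting that day's output rather than appending to it) makes it safe to re-run the same range more than once.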

SQL

SQL Datasets Data Engineering Data Engineer

The last (but not least) “ops” you need for your data: DataGovOps

François Nguyen

JANUARY 18, 2021

To finish the trilogy (DataOps, MLOps), let’s talk about DataGovOps, or how you can support your Data Governance initiative. The origin of the term: DataKitchen. We must give credit to Chris Bergh and his team at DataKitchen. You should visit their website; you will find incredibly good stuff there. This article was published in October 2020 with this title: “Data Governance as Code”. The idea behind it is that you should “actively promotes the safe use of data with automation

Data Governance

Data Governance Metadata Government Data Pipeline

Data-driven 2021: Predictions for a new year in data, analytics and AI

DataKitchen

JANUARY 4, 2021

The post Data-driven 2021: Predictions for a new year in data, analytics and AI first appeared on DataKitchen.

Data Analytics

Data Analytics Data

How to Drive Cost Savings, Efficiency Gains, and Sustainability Wins with MES

Speaker: Nikhil Joshi, Founder & President of Snic Solutions

Is your manufacturing operation reaching its efficiency potential? A Manufacturing Execution System (MES) could be the game-changer, helping you reduce waste, cut costs, and lower your carbon footprint. Join Nikhil Joshi, Founder & President of Snic Solutions, in this value-packed webinar as he breaks down how MES can drive operational excellence and sustainability.

Manufacturing

Property Based Testing Confluent Server Storage for Fun and Safety

Confluent

JANUARY 12, 2021

Confluent uses property-based testing to test various aspects of Confluent Server’s Tiered Storage feature. Tiered Storage shifts data from expensive local broker disks to cheaper, scalable object storage, thereby reducing […].

Data

Using Your Data Warehouse As The Source Of Truth For Customer Data With Hightouch

Data Engineering Podcast

JANUARY 18, 2021

Summary The data warehouse has become the central component of the modern data stack. Building on this pattern, the team at Hightouch have created a platform that synchronizes information about your customers out to third party systems for use by marketing and sales teams. In this episode Tejas Manohar explains the benefits of sourcing customer data from one location for all of your organization to use, the technical challenges of synchronizing the data to external systems with varying APIs, and

Data Warehouse

Data Warehouse BI Data Data Engineering

Digital Transformation is a Data Journey From Edge to Insight

Cloudera

JANUARY 20, 2021

Digital transformation is a hot topic for all markets and industries as it’s delivering value with explosive growth rates. Consider that Manufacturing’s Industrial Internet of Things (IIoT) was valued at $161b with an impressive 25% growth rate, that the Connected Car market will be valued at $225b by 2027 with a 17% growth rate, or that in the first three months of 2020, retailers realized ten years of digital sales penetration.

Manufacturing

Manufacturing Data Warehouse Kafka Retail

How to do Change Data Capture (CDC), using Singer

Start Data Engineering

JANUARY 1, 2021

Change data capture is a software design pattern used to track every change (update, insert, delete) to the data in a database. In most databases these types of changes are added to an append-only log (Binlog in MySQL, Write Ahead Log in PostgreSQL).
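To make the pattern concrete, the sketch below replays an append-only list of change events against a target table held in a dictionary. It is an illustration only: the `(operation, key, row)` event shape and the `apply_changes` function are hypothetical and do not reflect Singer's message format or the actual binlog or write-ahead log encodings.

```python
def apply_changes(log, target=None):
    """Replay change events in log order to rebuild the table state."""
    target = {} if target is None else target
    for op, key, row in log:
        if op in ("insert", "update"):
            target[key] = row      # upsert the latest version of the row
        elif op == "delete":
            target.pop(key, None)  # remove the row if it is present
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# Hypothetical change log, in the order the source database produced it.
change_log = [
    ("insert", 1, {"name": "a"}),
    ("insert", 2, {"name": "b"}),
    ("update", 1, {"name": "a2"}),
    ("delete", 2, None),
]

replica = apply_changes(change_log)
```

Because the log is ordered and append-only, replaying it from the start always produces the same final state, which is what lets a destination stay in sync with its source.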

PostgreSQL

PostgreSQL MySQL Database Data

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

When developing a Gen AI application, one of the most significant challenges is improving accuracy. This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. 💥 Anindo Banerjea is here to showcase his significant experience building AI/ML SaaS applications as he walks us through the current problems his company, Civio, is solving.

Systems

Optimizing the Aural Experience on Android Devices with xHE-AAC

Netflix Tech

JANUARY 22, 2021

By Phill Williams and Vijay Gondi. At Netflix, we are passionate about delivering great audio to our members. We began streaming 5.1 channel surround sound in 2010, Dolby Atmos in 2017, and adaptive bitrate audio in 2019. Continuing in this tradition, we are proud to announce that Netflix now streams Extended HE-AAC with MPEG-D DRC (xHE-AAC) to compatible Android Mobile devices (Android 9 and newer).

Metadata

Metadata Programming Algorithm Media

The Business Case for DataOps

DataKitchen

JANUARY 6, 2021

Savvy executives maximize the value of every budgeted dollar. Decisions to invest in new tools and methods must be backed up with a strong business case. As data professionals, we know the value and impact of DataOps: streamlining analytics workflows, reducing errors, and improving data operations transparency. Being able to quantify the value and impact helps leadership understand the return on past investments and supports alignment with future enterprise DataOps transformation initiatives.

Pharmaceutical

Pharmaceutical Consulting Utilities Programming

Implementing mTLS and Securing Apache Kafka at Zendesk

Confluent

JANUARY 7, 2021

At Zendesk, Apache Kafka® is one of our foundational services for distributing events among different internal systems. We have pods, which can be thought of as isolated cloud environments where […].

Kafka

Kafka Cloud Systems

Enabling Version Controlled Data Collaboration With TerminusDB

Data Engineering Podcast

JANUARY 11, 2021

Summary: As data professionals we have a number of tools available for storing, processing, and analyzing data. We also have tools for collaborating on software and analysis, but collaborating on data is still an underserved capability. Gavin Mendel-Gleason encountered this problem firsthand while working on the Seshat databank, leading him to create TerminusDB and TerminusHub.

PostgreSQL

PostgreSQL Python Computer Science Data Lake

The Ultimate Guide To Data-Driven Construction: Optimize Projects, Reduce Risks, & Boost Innovation

Speaker: Donna Laquidara-Carr, PhD, LEED AP, Industry Insights Research Director at Dodge Construction Network

In today’s construction market, owners, construction managers, and contractors must navigate increasing challenges, from cost management to project delays. Fortunately, digital tools now offer valuable insights to help mitigate these risks. However, the sheer volume of tools and the complexity of leveraging their data effectively can be daunting. That’s where data-driven construction comes in.

Project

New Applied ML Research: Few-shot Text Classification

Cloudera

JANUARY 7, 2021

Text classification is a ubiquitous capability with a wealth of use cases. For example, recommendation systems rely on properly classifying text content such as news articles or product descriptions in order to provide users with the most relevant information. Classifying user-generated content allows for more nuanced sentiment analysis. And in the world of e-commerce, assigning product descriptions to the most fitting product category ensures quality control.

Machine Learning

Machine Learning Algorithm Deep Learning Designing

Big Data in Retail & CPG Requires a Scalpel, Not an Axe

Teradata

JANUARY 24, 2021

To satisfy the evolving demands of customers, Chief Commercial Officers need to wield big data in Retail & CPG using a precise scalpel rather than a blunt axe.

Retail

Retail Big Data Data

How To Convert 100% of Your Proofs of Concept into Happy Customers

Monte Carlo

JANUARY 28, 2021

I’ve never done Sales before. Most of my professional history was spent working for a Customer Success company, Gainsight, where I provided the data, insights, and tools for our customer-facing teams. At Gainsight, my bonus was tied to 1) verified customer outcomes, 2) renewal rate, and 3) customer advocacy. I learned to be maniacally customer-focused.

BI Machine Learning Engineering Data

Do You Need a DataOps Dojo?

DataKitchen

JANUARY 20, 2021

As DataOps activity takes root within an enterprise, managers face the question of whether to build centralized or decentralized DataOps capabilities. Centralizing analytics brings it under control, but granting analysts free rein is necessary to foster innovation and stay competitive. The beauty of DataOps is that you don’t have to choose between centralization and freedom.

Education

Education Coding Project Engineering

Driving Responsible Innovation: How to Navigate AI Governance & Data Privacy

Speaker: Aindra Misra, Senior Manager, Product Management (Data, ML, and Cloud Infrastructure) at BILL

Join us for an insightful webinar that explores the critical intersection of data privacy and AI governance. In today’s rapidly evolving tech landscape, building robust governance frameworks is essential to fostering innovation while staying compliant with regulations. Our expert speaker, Aindra Misra, will guide you through best practices for ensuring data protection while leveraging AI capabilities.

Government

Apache Kafka Meets Table Football

Confluent

JANUARY 29, 2021

What happens when you let engineers go wild building an application to track the score of a table football game? This blog post shares SPOUD’s story of engineering a simple […].

Kafka

Kafka Engineering Building

Bringing Feature Stores and MLOps to the Enterprise at Tecton

Data Engineering Podcast

JANUARY 4, 2021

Summary: As more organizations are gaining experience with data management and incorporating analytics into their decision making, their next move is to adopt machine learning. In order to make those efforts sustainable, the core capability they need is for data scientists and analysts to be able to build and deploy features in a self-service manner.

Python

Python Machine Learning Computer Science Data Lake

Fostering community to help drive cultural change

Cloudera

JANUARY 18, 2021

2020 put on full display how humanity shows up in times of hardship. We saw everything from street celebrations to usher weary medical personnel home after long days fighting to save lives to places like food banks receiving more donations and volunteers than ever before. Some communities were harder hit than others, and we’ve seen the same in the global workplace.

Food

Food Medical Banking Programming

Six Crucial Refinements to Conventional Wisdom About Data Strategy

Teradata

JANUARY 28, 2021

This post shines a spotlight on the difference between what you already know & the more specific guidance you really need to create a successful data strategy.

Data

What Is Entity Resolution? How It Works & Why It Matters

Sometimes referred to as data matching or fuzzy matching, entity resolution is critical for data quality, analytics, graph visualization and AI. Learn what entity resolution is, why it matters, how it works and its benefits. Advanced entity resolution using AI is crucial because it efficiently and easily solves many of today’s data quality and analytics problems.
