What Is Data Lineage And Why Use Data Lineage?

by Alex Noah — 2 years ago in Development 6 min. read

Introduction

Do you spend too much time chasing down and fixing bugs in your C-level dashboards? Do different departments struggle to agree on the data that is used throughout the company? Are you having trouble assessing the potential impact of a planned migration?

Data lineage may be the solution to your data quality problems. A data lineage system improves data visibility and traceability across the entire data stack. It also makes it simpler to communicate about the data your organization relies on.

But wait! What exactly is data lineage?

What is Data Lineage?

Data lineage shows how data flows through various systems and transformations. In modern data stacks, data is not stored only in application databases.

It flows from one application to the next, and finally to data warehouses, where it can be transformed and consumed by any number of downstream applications.
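One way to picture this flow is as a directed graph in which each dataset points back to the datasets it was built from. The sketch below is only an illustration: the dataset names and the `UPSTREAM` mapping are hypothetical, and a real lineage system would store far more detail.

```python
from collections import deque

# Hypothetical lineage graph: each dataset maps to the datasets it is built from.
UPSTREAM = {
    "revenue_dashboard": ["warehouse.daily_revenue"],
    "warehouse.daily_revenue": ["warehouse.orders_clean"],
    "warehouse.orders_clean": ["app_db.orders", "app_db.customers"],
    "app_db.orders": [],
    "app_db.customers": [],
}


def trace_sources(dataset, upstream=UPSTREAM):
    """Return every dataset that feeds `dataset`, nearest first (breadth-first)."""
    seen, order = set(), []
    queue = deque(upstream.get(dataset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(upstream.get(node, []))
    return order


print(trace_sources("revenue_dashboard"))
```

Walking the graph from the dashboard back to the application database is the basic operation that the rest of this article builds on.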

This data flow gives each system access to the data in a form that suits it. Source applications can optimize for read/write transaction performance, while reporting clients get denormalized data that is easy to query.

This convenience comes at the cost of visibility and traceability. After the data leaves the source database, it is subject to any number of transformations.

This transformation layer can obscure where the data originally came from. Many reporting teams struggle to determine the source of their data or to identify the correct data to use in a report. They might ask the application team to clarify the situation.

The application team may answer that the data doesn’t exist, because the names used to describe a piece of data changed during the transformation.

Solving bugs and problems takes longer and requires the involvement of three teams: the application team, the reporting team, and the data warehouse team. The data team typically takes on the task of finding the root cause of the problem. They then have to dig through version control to track it down and fix it. This also slows down the development of new reports.

Data lineage resolves these problems. Let’s talk about how.

Data Lineage: Why?

A data lineage system lets you have your cake and eat it too. You get separation of roles, the performance of a data warehouse, and a clear understanding of data across all of your systems and teams.

With clear data understanding and traceability, you can trace data throughout the system. For example, you can confirm that no personally identifiable information (PII) is being exported from the system and consumed in places where it shouldn’t be.
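A PII check of this kind can be sketched as a walk over column-level lineage edges. Everything named below (the columns, the `ANONYMIZED` set, the `ALLOWED` set) is a hypothetical example, and the sketch assumes that someone has already marked which source columns contain PII and which transformations remove it.

```python
# Hypothetical column-level edges: source column -> columns derived from it.
DOWNSTREAM = {
    "app_db.users.email": [
        "warehouse.users.email_hash",
        "warehouse.marketing_export.email",
    ],
    "app_db.users.signup_date": ["warehouse.users.signup_month"],
    "warehouse.marketing_export.email": ["crm_sync.contacts.email"],
}
PII_SOURCES = {"app_db.users.email"}           # columns known to hold PII
ANONYMIZED = {"warehouse.users.email_hash"}    # assumed to no longer carry PII
ALLOWED = {"app_db.users.email"}               # places where PII may live


def pii_reach(sources, downstream, anonymized):
    """Return every column that PII can reach, stopping at anonymized columns."""
    tainted = set()
    stack = list(sources)
    while stack:
        column = stack.pop()
        if column in tainted or column in anonymized:
            continue
        tainted.add(column)
        stack.extend(downstream.get(column, []))
    return tainted


violations = sorted(pii_reach(PII_SOURCES, DOWNSTREAM, ANONYMIZED) - ALLOWED)
print(violations)
```

Any column left in `violations` is a place where PII arrives without being explicitly allowed, which is exactly what a reviewer would want to look at first.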

You can also see which data is being used downstream and assess the impact of possible changes or migrations. In addition, you can identify unused data, which makes it easy to clean up columns or tables.
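Finding cleanup candidates is a simple set difference once you know which columns each consumer reads. The column names and the `READS` mapping below are made up for illustration, and the sketch assumes the lineage data is complete.

```python
# Hypothetical inventory of warehouse columns.
WAREHOUSE_COLUMNS = {
    "orders.id",
    "orders.total",
    "orders.legacy_flag",
    "orders.coupon_code",
}

# Hypothetical lineage data: consumer -> columns it reads.
READS = {
    "revenue_dashboard": {"orders.id", "orders.total"},
    "finance_export": {"orders.total", "orders.coupon_code"},
}


def unused_columns(columns, reads):
    """Return columns that no downstream consumer reads, sorted by name."""
    used = set().union(*reads.values())
    return sorted(columns - used)


print(unused_columns(WAREHOUSE_COLUMNS, READS))
```

A column that shows up here is only a candidate for removal. If a consumer is missing from the lineage data, its columns will look unused even though they are not.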

Data lineage systems improve communication and reduce incident response time by increasing data understanding.

The data lineage system eliminates confusion about the source of data in reports and makes it easy for all parties to understand where it came from. This system speeds up the resolution of errors as well as new development.

We now know why data lineage matters for modern data stacks. Let’s take a look at the different types of data lineage systems.

Different types of data lineage

There are two major types of data lineage systems: Active and Passive.

An active data lineage system is called “active” because you have to build it yourself. You do this either by programming the necessary source and transformation information into your system or by tagging the data with the appropriate metadata.
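As a rough illustration of the "programming it in" approach, the sketch below records lineage by hand each time a transformation runs. The `records_lineage` decorator and the `LINEAGE_LOG` list are hypothetical names invented for this example. They are not part of Apache Atlas or any other tool.

```python
import functools

# Hypothetical in-memory store; a real system would persist this somewhere.
LINEAGE_LOG = []


def records_lineage(inputs, output):
    """Decorator that logs which datasets a transformation reads and writes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            LINEAGE_LOG.append(
                {"transform": func.__name__, "inputs": list(inputs), "output": output}
            )
            return result
        return wrapper
    return decorator


@records_lineage(inputs=["app_db.orders"], output="warehouse.daily_revenue")
def build_daily_revenue(orders):
    """Sum order amounts per day."""
    totals = {}
    for order in orders:
        totals[order["day"]] = totals.get(order["day"], 0) + order["amount"]
    return totals


build_daily_revenue([
    {"day": "2021-01-01", "amount": 10},
    {"day": "2021-01-01", "amount": 5},
])
print(LINEAGE_LOG)
```

The declared inputs and output are only as accurate as the person who wrote them. If the function later starts reading another table and nobody updates the decorator, the lineage silently goes stale.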

Apache Atlas is an example of such an active system. An active data lineage system that is properly configured can give you traceability of your data down to the smallest detail.

These benefits require constant maintenance and updating. This can add complexity to your overall data infrastructure, and can also be time-consuming.

The alternative is a passive system, which attempts to work out the lineage of your data on its own. Passive systems examine the data coming out of the data warehouse.

A passive system uses pattern recognition to identify where the data comes from and how it is being transformed. This can be useful for simple data sets and simple transformations. However, it can produce inaccurate results.
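To see why pattern recognition can mislead, consider a deliberately naive sketch that guesses a column's source by comparing values. The Jaccard similarity used here is just one possible heuristic chosen for illustration, not a description of how any particular product works, and all column names are hypothetical.

```python
def jaccard(a, b):
    """Share of distinct values that two columns have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def guess_sources(target_values, candidates, threshold=0.5):
    """Return candidate columns whose values overlap enough, best match first."""
    scores = {name: jaccard(target_values, values) for name, values in candidates.items()}
    matches = [name for name, score in scores.items() if score >= threshold]
    return sorted(matches, key=lambda name: -scores[name])


report_customer_id = [1, 2, 3, 4]
candidates = {
    "app_db.customers.id": [1, 2, 3, 4, 5],
    "app_db.orders.quantity": [1, 2, 3],      # unrelated, but the values overlap
    "app_db.orders.id": [100, 101, 102],
}
print(guess_sources(report_customer_id, candidates))
```

The unrelated `quantity` column passes the threshold simply because small integers overlap, which is the kind of false positive that makes value-based inference unreliable on anything but simple data.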

Another type of passive data lineage system is the parsing-based system. It generates lineage data by reverse-engineering your data warehouse.

A parsing-based system allows you to see exactly where your data is coming from and what it is being used for. Datafold is an example of this type of system. Datafold analyzes all SQL code within your data warehouse and generates lineage graphs.
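The idea behind parsing-based lineage can be shown with a toy example that reads a SQL statement and extracts the table it writes and the tables it reads. This is a minimal sketch using regular expressions, not how Datafold or any production tool works. Real tools use a full SQL parser, and this version would be fooled by subqueries, CTEs, or expressions such as `EXTRACT(day FROM ...)`.

```python
import re

# Matches the table being written by CREATE TABLE/VIEW ... or INSERT INTO ...
TARGET_RE = re.compile(
    r"\b(?:CREATE\s+(?:OR\s+REPLACE\s+)?(?:TABLE|VIEW)|INSERT\s+INTO)\s+([\w.]+)",
    re.IGNORECASE,
)
# Matches table names that directly follow FROM or JOIN.
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(sql):
    """Return (target_table, sorted source tables), or None if nothing is written."""
    target = TARGET_RE.search(sql)
    if target is None:
        return None
    sources = sorted(set(SOURCE_RE.findall(sql)))
    return target.group(1), sources


sql = """
CREATE TABLE warehouse.daily_revenue AS
SELECT o.order_date, SUM(o.amount) AS revenue
FROM app_db.orders AS o
JOIN app_db.customers AS c ON c.id = o.customer_id
GROUP BY o.order_date
"""
print(table_lineage(sql))
```

Because the lineage is derived from the code that actually runs in the warehouse, nobody has to remember to update it by hand when a query changes.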

This lineage is much more detailed than table-level and allows you to see which column a piece of data was sourced from, and where it was consumed.

This detail allows for quicker outage response and faster troubleshooting, and it reduces the number of breaking changes that reach production.

Datafold has many integrations. It is easy to use and accessible via the Datafold API. As long as it supports your data warehouse and related systems, a parsing-based data lineage system is the best choice in terms of implementation and maintenance.

That all sounds great, but how does it affect your day-to-day work? Let’s take a look.

How can Data Lineage ensure day-to-day data quality?

A data lineage system provides visibility and traceability that is better than ever. Three clear benefits can be seen in your day-to-day operations.

It improves your team’s response time. Finding the root cause of an error in a report no longer takes hours or the cooperation of multiple teams. With complete visibility of the data flow across your entire data stack, errors can be quickly identified and corrected.

It allows the creation and maintenance of a common vocabulary. When the reporting team discusses a view, the application team understands what it is and where it comes from.

The application team can see what data has been aggregated to create the dashboard that informs company outlook and decisions.

Over time, terminology discrepancies can be reduced or eliminated, which allows for better communication throughout the company.

The data lineage system allows teams to quickly and efficiently predict the potential effects of changes or migrations. Data schema changes and migrations can be planned with confidence, and it is easy to track the impact of any change on downstream parties and notify them.
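Planning a change then amounts to walking the lineage graph downstream from the column you want to modify and grouping what you find by owner. The sketch below assumes a hypothetical `DOWNSTREAM` edge map and `OWNERS` lookup. In practice both would come from your lineage system and your data catalog.

```python
# Hypothetical lineage edges: asset -> assets that consume it.
DOWNSTREAM = {
    "warehouse.orders.total": ["warehouse.daily_revenue.revenue"],
    "warehouse.daily_revenue.revenue": ["revenue_dashboard", "finance_export"],
}

# Hypothetical ownership lookup: asset -> team to notify.
OWNERS = {
    "warehouse.daily_revenue.revenue": "data-team",
    "revenue_dashboard": "reporting-team",
    "finance_export": "finance-team",
}


def impact_of_change(column, downstream=DOWNSTREAM, owners=OWNERS):
    """Return {owner: [affected assets]} for everything downstream of `column`."""
    affected = set()
    stack = list(downstream.get(column, []))
    while stack:
        node = stack.pop()
        if node not in affected:
            affected.add(node)
            stack.extend(downstream.get(node, []))
    notify = {}
    for asset in sorted(affected):
        notify.setdefault(owners.get(asset, "unknown"), []).append(asset)
    return notify


print(impact_of_change("warehouse.orders.total"))
```

The result is a ready-made notification list: each team sees only the assets it owns that would be touched by the change.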

Wrap-up

This article explains what data lineage is and why you might use it. We also explain the various types of data lineage available, as well as how data lineage can help improve data quality every day. The addition of a data-lineage system to your data stack will increase transparency and reduce headaches for the entire organization.

Alex Noah

Alex is senior editor of The Next Tech. He studied International Communication Management at the Hague University of Applied Sciences.

Copyright © 2018 – The Next Tech. All Rights Reserved.