It is estimated that a staggering 463 exabytes of data will be created each day by 2025, according to an article from Finances Online. This raises the question: how do we organize and harness the power of such a large and ever-growing pool?

One methodology that can help tame this ever-growing tangle of data is data lineage.

In this article, you will learn what data lineage is, how it works, and why it is important. The article also covers how data lineage differs from data governance and data provenance, common data lineage techniques, examples of data lineage in use, steps for getting started, and what to look for in tools that support data lineage best practices within a business.

What Is Data Lineage?

An article from IBM defines data lineage as "the process of tracking the flow of data over time, providing a clear understanding of where it originated, how it has changed, and its ultimate destination." [1] In short, data changes are tracked and transactional metadata is stored to build a historical record of the data as it changes state, moves between locations, triggers other data to be created or moved, and more.

How Does It Work?

There are many tools and methods for tracking data lineage, but the basic principles are the same. Data lineage works by tracking metadata, which an article by TechTerms.com describes as "data that describes other data." [2]

Examples of metadata include:

  • the date a piece of data was created
  • the date a piece of data was modified
  • the user that created or modified the data

The metadata associated with the primary data is tracked as the primary data reaches set markers along the way through a data pipeline or workflow.
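To make the idea of markers concrete, here is a minimal sketch in Python. The `LineageEvent` and `LineageLog` names, the checkpoint names, and the `etl_service` user are all assumptions made for illustration, not part of any particular lineage tool. Each time a dataset reaches a checkpoint, the log stores who touched it and when.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageEvent:
    """One metadata record captured when data reaches a checkpoint."""
    dataset: str
    checkpoint: str
    user: str
    recorded_at: datetime


@dataclass
class LineageLog:
    """Append-only history of lineage events (hypothetical helper)."""
    events: list = field(default_factory=list)

    def record(self, dataset: str, checkpoint: str, user: str) -> LineageEvent:
        # Capture the metadata at the moment the data passes the marker.
        event = LineageEvent(dataset, checkpoint, user, datetime.now(timezone.utc))
        self.events.append(event)
        return event

    def history(self, dataset: str) -> list:
        """Return checkpoint names for one dataset, in the order they were recorded."""
        return [e.checkpoint for e in self.events if e.dataset == dataset]


log = LineageLog()
log.record("orders", "extracted", user="etl_service")
log.record("orders", "cleaned", user="etl_service")
log.record("orders", "loaded", user="etl_service")
print(log.history("orders"))  # ['extracted', 'cleaned', 'loaded']
```

A real system would likely persist these events to a database or metadata catalog rather than keep them in memory.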

Why Is Data Lineage Important?

In today's data-driven world, data lineage is hard to do without.

Practicing data lineage:

  • makes searching stored data much easier
  • allows organizations to see gaps in process efficiencies
  • provides the opportunity for proactively identifying errors in data

Without proper data lineage tools and techniques, important insights stay hidden in the flood of data that the average business creates each day.

Types of Data Lineage

There are several ways to categorize data lineage. A few of these categorization factors include the documentation method, stakeholder requirements, and techniques or frameworks used to derive the lineage.

Business Lineage

This type of data lineage tracks data that is important for running a business. The focus here is only on the metadata behind business data that is most relevant for tracking business transactions and making business decisions. Often, business lineage metadata is summarized to help business analysts or decision makers understand where gaps in processes can be improved or interactions can be leveraged to grow the business.

In other words, business lineage helps leaders and managers make decisions about a business.

Technical Lineage

Technical lineage focuses on tracing data as it transforms and moves through technical operations like ETL pipelines. Because of the technical nature of this type of data lineage, this is typically not suitable for the average business person to review and glean insights from. Often, business lineage can be summaries of technical lineage data that are made to be more digestible for the average business professional.

This type of data lineage helps data management professionals know things like where data came from, what systems the data passed through or originated from, and what actions the systems applied to the data.

Data Lineage vs Data Governance & Provenance

What Is Data Governance?

Data governance can be defined as systems, processes, or work instructions that govern the creation and management of data. Data governance is all about ensuring data is accurate and readily available for use.

What Is Data Provenance?

Data provenance refers specifically to tracing the origin of data. It also typically covers tracking data through its lifecycle to identify how it has been altered along the way. To understand how data has been altered, you must understand where and how it began.

Main Similarities And Differences

Data governance and data provenance share many similarities as concepts, but they serve different purposes.

Both data governance and data provenance are important for establishing robust data lineage analysis protocols. In addition, both concepts are essential for maintaining proper data security and data quality over time.

However, data governance and data provenance drive good data quality and security in different ways. For example, data governance drives how data is tracked while data provenance refers to what data is tracked.

Data Lineage Techniques

Data Tagging

Data tagging is the practice of attaching a tag to data as it moves through a pipeline or workflow. The data lineage system follows the tag as the data or dataset flows through the pipeline and uses it to capture metadata about the tagged data, such as user information and the date and time it was altered.
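As a rough sketch of data tagging in Python, the example below attaches a tag to a record and logs each step that touches it. The `_lineage` key, the helper functions, and the job names are hypothetical and chosen only for illustration.

```python
import uuid
from datetime import datetime, timezone


def tag_record(record: dict, user: str) -> dict:
    """Attach a lineage tag (hypothetical `_lineage` key) to a record."""
    tagged = dict(record)
    tagged["_lineage"] = {"tag_id": str(uuid.uuid4()), "steps": []}
    return note_step(tagged, "created", user)


def note_step(record: dict, step: str, user: str) -> dict:
    """Return a copy of the record with one more step logged under its tag."""
    lineage = record["_lineage"]
    entry = {"step": step, "user": user, "at": datetime.now(timezone.utc).isoformat()}
    updated = dict(record)
    updated["_lineage"] = {"tag_id": lineage["tag_id"], "steps": lineage["steps"] + [entry]}
    return updated


def normalize_email(record: dict, user: str) -> dict:
    """Example pipeline step: transform the data, then log the step on the tag."""
    changed = dict(record)
    changed["email"] = record["email"].strip().lower()
    return note_step(changed, "normalize_email", user)


row = tag_record({"email": "  Ada@Example.COM "}, user="ingest_job")
row = normalize_email(row, user="cleaning_job")
print([s["step"] for s in row["_lineage"]["steps"]])  # ['created', 'normalize_email']
```

Because the tag travels with the record, any later step can read the full history without consulting the code that produced it.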

Parsing-Based Lineage

Parsing-based data lineage techniques rely on the ability of a person or system to break down the logic used to process data to capture changes to data or datasets as they pass through pipelines or workflows. 

While this does not typically require an additional piece of metadata be applied to data as it moves through processes and systems, it does require an advanced understanding or system that can parse the coded transactions.
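Here is a deliberately simplified Python sketch of the parsing-based idea. It reads a SQL statement and pulls out the target table and the source tables. The regular expressions and table names are assumptions for illustration; a production system would use a full SQL parser to handle subqueries, CTEs, quoting, and other cases these patterns miss.

```python
import re

# Simplified patterns; they only cover plain INSERT ... SELECT ... FROM/JOIN statements.
TARGET_RE = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def parse_lineage(sql: str) -> dict:
    """Extract the target table and source tables from one INSERT ... SELECT statement."""
    target = TARGET_RE.search(sql)
    sources = sorted(set(SOURCE_RE.findall(sql)))
    return {"target": target.group(1) if target else None, "sources": sources}


sql = """
INSERT INTO reporting.daily_sales
SELECT o.order_date, SUM(o.amount)
FROM raw.orders o
JOIN raw.customers c ON c.id = o.customer_id
GROUP BY o.order_date
"""
print(parse_lineage(sql))
# {'target': 'reporting.daily_sales', 'sources': ['raw.customers', 'raw.orders']}
```

Notice that no extra metadata was attached to the data itself. The lineage comes entirely from reading the transformation logic.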

Pattern-Based Lineage

Pattern-based lineage focuses on data movement patterns to determine data lineage rather than a tag or parsing code. This type of data lineage technique can be simpler to implement than others because it typically uses metadata that is already available to build a picture of data lineage. However, pattern-based lineage techniques might miss more complex relationships between data if they don't fit into the specified pattern.
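The sketch below illustrates one possible pattern in Python: guessing column-level lineage from how much the values in two tables overlap. The function name, the 0.8 threshold, and the sample tables are assumptions for illustration only.

```python
def infer_column_lineage(source: dict, target: dict, min_overlap: float = 0.8) -> list:
    """Guess which source column feeds each target column by value overlap.

    `source` and `target` map column names to lists of values.
    """
    links = []
    for t_name, t_values in target.items():
        t_set = set(t_values)
        if not t_set:
            continue
        for s_name, s_values in source.items():
            # Share of distinct target values that also appear in the source column.
            overlap = len(t_set & set(s_values)) / len(t_set)
            if overlap >= min_overlap:
                links.append((s_name, t_name, round(overlap, 2)))
    return links


crm = {"cust_id": [1, 2, 3, 4], "email": ["a@x.io", "b@x.io", "c@x.io", "d@x.io"]}
warehouse = {
    "customer_key": [1, 2, 3, 4],
    "email_upper": ["A@X.IO", "B@X.IO", "C@X.IO", "D@X.IO"],
}
print(infer_column_lineage(crm, warehouse))  # [('cust_id', 'customer_key', 1.0)]
```

The renamed `customer_key` column is matched, but `email_upper` is not, because its values were transformed along the way. This is the kind of relationship a pattern-based approach can miss.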

Examples Of Data Lineage Use

Examples of data lineage use include: 

Data Migration

Data lineage can be very useful when planning a data migration. For example, lineage information can identify outdated or duplicated data so that it is removed from the migration pipeline rather than moved to the new system.

Lineage information can also be used to group related data together as it is migrated to a new system. Both of these uses serve to improve data quality and system performance.
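As a small Python sketch of the first idea, the function below uses lineage metadata to keep only the newest copy of each record and to drop records that have not been modified since a cutoff date. The field names (`source_id`, `system`, `modified`) and the cutoff are hypothetical.

```python
from datetime import date


def select_for_migration(records: list, cutoff: date) -> list:
    """Keep the newest copy of each record and drop anything last modified before `cutoff`."""
    newest = {}
    for rec in records:
        key = rec["source_id"]
        if key not in newest or rec["modified"] > newest[key]["modified"]:
            newest[key] = rec
    return [r for r in newest.values() if r["modified"] >= cutoff]


records = [
    {"source_id": "A1", "system": "crm", "modified": date(2023, 5, 1)},
    {"source_id": "A1", "system": "billing", "modified": date(2024, 2, 10)},  # newer duplicate
    {"source_id": "B7", "system": "crm", "modified": date(2019, 1, 15)},      # outdated
]
kept = select_for_migration(records, cutoff=date(2022, 1, 1))
print([(r["source_id"], r["system"]) for r in kept])  # [('A1', 'billing')]
```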

Data Modeling

Data modeling is an activity where visual representations of data and their relationships are created for analysis. Data lineage can be used to help define data dependencies across systems or ETL pipelines. 

Because data lineage tracks data at regular intervals through its lifecycle, it is a quality foundation for building accurate data models that businesses can use to make important data management and process decisions.
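One way to picture this in Python: if lineage records which assets each asset is built from, the standard library's `graphlib` module can turn those dependencies into a valid build order. The table names here are made up for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical lineage: each asset maps to the assets it is built from.
BUILT_FROM = {
    "staging.orders": {"raw.orders"},
    "staging.customers": {"raw.customers"},
    "reporting.daily_sales": {"staging.orders", "staging.customers"},
}

# static_order() yields every asset so that sources come before anything derived from them.
build_order = list(TopologicalSorter(BUILT_FROM).static_order())
print(build_order)
```

The exact order of unrelated assets can vary, but every source will appear before the assets that depend on it.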

Impact Analysis

Impact analysis involves reviewing a situation or data set to evaluate or predict the impact of business changes. For example, data lineage reporting can be used to track the impact of data errors across an organization.

If an analyst knows where the data came from and which reports or processes the data touched, then they have a good starting place to report the total exposure across the organization.
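A minimal Python sketch of this kind of analysis follows. It assumes lineage has already been captured as a graph of which assets feed which, then walks the graph to list everything downstream of a bad dataset. The graph contents and the function name are hypothetical.

```python
from collections import deque

# Hypothetical lineage graph: each key feeds the assets listed in its value.
DOWNSTREAM = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["reporting.daily_sales", "ml.churn_features"],
    "reporting.daily_sales": ["dashboard.revenue"],
    "ml.churn_features": [],
    "dashboard.revenue": [],
}


def impacted_assets(graph: dict, start: str) -> list:
    """Return every asset reachable downstream of `start`, in breadth-first order."""
    seen = {start}
    queue = deque([start])
    impacted = []
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


print(impacted_assets(DOWNSTREAM, "staging.orders"))
# ['reporting.daily_sales', 'ml.churn_features', 'dashboard.revenue']
```

If an error is found in `staging.orders`, this list is the starting point for reporting total exposure.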

Compliance

Data lineage can be a big component of validating compliance with policies and regulations. Many governments and agencies have recognized that consumer and business data can be vulnerable to attack or neglect if not properly regulated. Data lineage provides a vehicle for reporting on compliance activities, such as informing consumers when and where their data is stored, or providing evidence that consumer notification regulations such as the California Consumer Privacy Act (CCPA) are being followed.

How To Get Started In Data Lineage

Getting started with data lineage is easier with a few guiding practices in place.

Best Practices

Here are some best practices that can help you along your data lineage journey:

  • Link data lineage to real business requirements. Tracking data for the sake of tracking data doesn't really provide measurable value to a business or organization. It is important to tie data lineage activities to a real world problem to ensure organizations are getting the most out of their monetary and time investments.
  • Thoroughly document and map your data flows. It is important to understand both business process flows as well as technical process flows when starting out with data lineage. This documentation will be your foundation for accurately tracing data as it flows from system to system.
  • Involve business leaders and key stakeholders early. Getting business leaders on board and excited about data lineage early on is important to secure funding for data lineage tools and human resources to properly implement data lineage.
  • Look ahead before locking yourself into a tool or process. Shifting a business culture to practice proper data lineage and begin reaping the huge benefits can be difficult. This means it is important to look ahead at the organization's future goals and systems roadmap to consider what is to come when building processes, documentation, and tools to drive data lineage.

Choosing The Best Data Lineage Tool For Your Business

There are many factors to consider when choosing a tool to manage your data lineage. Here are some of the most important:

  • Cost vs. Benefit - Like most business investments, there is a cost element to consider when seeking access to a data lineage tool. There is a wide range of tools at a wide range of price points. It is important to choose one that offers the best value for your business.
  • Scalability - A data lineage tool is an investment in an organization's future. This means it is important to look ahead at business and system development plans when choosing a data lineage tool to be sure it can scale with your business.
  • Team Communication - Communicating the insights gleaned from data lineage quickly and clearly is an important consideration when choosing a data lineage tool. After all, if a business cannot act quickly on insights then it is hard to reap the full benefits of the investment in the tool.
  • Integration - Understanding current systems and how potential data lineage tools integrate with those systems is essential when choosing the right data lineage tool. If integrations with current systems are not smooth, the business runs the risk of poor data quality and unnecessary effort.

What's Next On Your Data Lineage Journey?

Now that we have laid out a thorough overview of data lineage and its importance to businesses and individuals in today's society, you probably want to learn how to get started in this essential field. There are a variety of avenues you can take to get started including enforcing data lineage best practices, implementing data lineage tools in your business, or seeking out additional education or courses on the topic of data lineage.

For additional information and help on your next project, check out our philosophy at Capella. Our low-code platform allows us to build enterprise quality software without reinventing the wheel. Combine your data on your people and technologies in a single workflow application. 

Sources:

  1. What is data lineage? IBM. (n.d.). Retrieved from https://www.ibm.com/topics/data-lineage
  2. Metadata. Metadata Definition. (n.d.). Retrieved from https://techterms.com/definition/metadata

Rasheed Rabata

Is a solution and ROI-driven CTO, consultant, and system integrator with experience in deploying data integrations, Data Hubs, Master Data Management, Data Quality, and Data Warehousing solutions. He has a passion for solving complex data problems. His career experience showcases his drive to deliver software and timely solutions for business needs.