Feature Request: SQLMash Lineage Integration with Data Catalog #3565

Open
alefbt opened this issue Dec 24, 2024 · 0 comments
Labels: Feature (adds new functionality)

alefbt commented Dec 24, 2024

This feature request proposes integrating SQLMash with a data catalog, such as Open Metadata Catalog (OMC), to track and manage data lineage. This integration will enable users to visualize the origin and flow of data within their SQL workflows, improving data transparency and facilitating impact analysis.

Background

Data lineage is the record of how data is transformed and moved through a system. Tracking it shows where data originates, what it depends on, and what it affects downstream. dbt, a popular data transformation tool, already integrates with OMC, which ingests lineage information from dbt's manifest.json file. This feature request aims to bring similar functionality to SQLMash.

Benefits

  • Improved Data Transparency: Users can easily understand the origin and flow of data within their SQL workflows, fostering trust and facilitating data governance.
  • Enhanced Impact Analysis: By visualizing data lineage, users can assess the impact of changes upstream on downstream tables and identify potential issues before they occur.
  • Simplified Debugging: Lineage information can streamline debugging efforts by helping users pinpoint the source of errors in data pipelines.
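To make the impact-analysis benefit concrete, here is a minimal sketch of how downstream tables could be found once lineage is available as (upstream, downstream) pairs. The table names and the `downstream` helper are hypothetical and only illustrate the idea; they are not part of SQLMash or OMC.

```python
from collections import defaultdict, deque


def downstream(edges, start):
    """Return every table reachable from `start` by following lineage edges."""
    children = defaultdict(list)
    for upstream_table, downstream_table in edges:
        children[upstream_table].append(downstream_table)

    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical lineage: each pair reads "left feeds right".
example_edges = [
    ("raw.orders", "stg_orders"),
    ("stg_orders", "orders_daily"),
    ("stg_orders", "revenue_report"),
    ("raw.customers", "stg_customers"),
]

affected = downstream(example_edges, "raw.orders")
```

With this example, a change to `raw.orders` would flag `stg_orders`, `orders_daily`, and `revenue_report` for review, while `stg_customers` is unaffected.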

Minimum Viable Product (MVP)

At a minimum, SQLMash should be able to:

  • Ingest Lineage Information from dbt manifest.json: Parse the dbt manifest.json file to extract lineage information, including source tables, transformations applied, and destination tables.
  • Store Lineage Data: Develop a mechanism to store the extracted lineage data within SQLMash or integrate with an external data catalog like OMC.
  • Visualize Lineage: Implement a user interface component to visualize the lineage graph, allowing users to explore data flows and dependencies.
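The ingestion step above could be sketched roughly as follows. This assumes the usual dbt manifest layout, in which each entry under `nodes` lists its parents under `depends_on.nodes`; the function names and the sample manifest are hypothetical, and storing or visualizing the result is left out.

```python
import json


def load_manifest(path):
    """Read a dbt manifest.json file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def lineage_edges(manifest):
    """Return sorted (upstream, downstream) pairs found in a manifest dict."""
    edges = []
    for node_id, node in manifest.get("nodes", {}).items():
        # Assumption: parents are listed under depends_on.nodes.
        for parent_id in node.get("depends_on", {}).get("nodes", []):
            edges.append((parent_id, node_id))
    return sorted(edges)


# Hand-written stand-in for a real manifest.json, for illustration only.
sample_manifest = {
    "nodes": {
        "model.shop.stg_orders": {
            "depends_on": {"nodes": ["source.shop.raw.orders"]}
        },
        "model.shop.orders_daily": {
            "depends_on": {"nodes": ["model.shop.stg_orders"]}
        },
    }
}

edges = lineage_edges(sample_manifest)
```

In practice the dict would come from `load_manifest("target/manifest.json")` or wherever the dbt project writes its artifacts; the resulting pairs are what a catalog such as OMC, or a lineage view inside SQLMash, would need to receive.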

Future Considerations

Following a successful MVP implementation, future enhancements could include:

  • Support for Additional Data Sources: Expand lineage tracking capabilities to encompass various data sources beyond dbt, including databases and APIs.
  • Lineage Transformation Tracking: Capture the lineage of data transformations within SQLMash workflows, providing a more comprehensive view of data flow.
  • Alerting and Monitoring: Develop functionalities to monitor data lineage and generate alerts for potential issues or changes in upstream data.

Conclusion

Integrating SQLMash with a data catalog for lineage tracking offers significant advantages for data governance, impact analysis, and debugging. By implementing the proposed features, SQLMash can empower users to gain a deeper understanding of their data pipelines and ensure data quality and consistency.

georgesittas added the Improvement and Feature labels and removed the Improvement label on Dec 24, 2024.