
# cless-blend

WARNING: As of June 2022, only create/change code in the Pipelines folder of this repo.

This repo contains the code used to build the data pipelines that connect the various data sources required by the CLESSN to the datamarts providing the datasets needed for research or visualization.

The CLESSN data platform is composed of scripts that move data across the internet, file storage space, and databases to make it ready for analytics.

The scripts are the active components of the pipelines (compute); the storage space and databases are the passive components (storage).

## Extractors, loaders and refiners: the active components of ETL pipelines

The current methodology consists of data pipelines made of data extractors, data loaders and data refiners. Each component moves data in turn:

- from its original source to the data lake or files blob storage
- from the data lake or files blob storage to the data warehouse
- from the data warehouse to the datamarts
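
For illustration, here is a minimal sketch of the three roles above in Python. The URLs, file paths, and the `party`/`votes` columns are hypothetical placeholders, not the actual CLESSN components:

```python
import requests
import pandas as pd

def extract(source_url: str, lake_path: str) -> None:
    """Extractor: copy raw data from its original source into the data lake as-is."""
    raw = requests.get(source_url, timeout=30).content
    with open(lake_path, "wb") as f:  # the lake keeps items in their raw format
        f.write(raw)

def load(lake_path: str) -> pd.DataFrame:
    """Loader: parse a raw lake item into a tabular form for the warehouse."""
    return pd.read_csv(lake_path)  # the warehouse holds only tabular representations

def refine(warehouse_table: pd.DataFrame) -> pd.DataFrame:
    """Refiner: derive a research-ready datamart from warehouse tables."""
    return (warehouse_table
            .dropna()
            .groupby("party", as_index=False)["votes"]
            .sum())
```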

Side note: there are exceptions in which an extractor is not needed in a pipeline. For instance, a researcher could obtain raw data directly as a CSV or PDF file (a university paper, a political party's election program, survey responses, etc.) and store it manually in the data lake or files blob storage. In that case, the researcher plays the role of the extractor.

Note that this ETL methodology has only recently been implemented at the CLESSN. Prior to that, web scrapers were developed that often combined all three steps (extraction, storage and refining) in one.

## Data lake, files blob storage, data tables: the passive components of ETL pipelines

The data lake contains items in their raw format. The data warehouse and datamarts contain only tabular representations of those items.

To conduct their research or data visualization projects, researchers consume data only from the datamarts they produce. They produce datamarts by writing refiners that feed from the data warehouse.
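
As a sketch of that flow, the example below uses SQLite files as stand-ins for the real warehouse and datamart databases; the `election_results` table and its columns are invented for illustration:

```python
import sqlite3
import pandas as pd

# SQLite files stand in for the actual warehouse and datamart databases.
warehouse = sqlite3.connect("warehouse.db")
datamart = sqlite3.connect("datamart.db")

# The refiner reads from warehouse tables...
df = pd.read_sql("SELECT date, party, votes FROM election_results", warehouse)

# ...derives the dataset a research project needs...
summary = df.groupby(["date", "party"], as_index=False)["votes"].sum()

# ...and writes it to a datamart table that the project then consumes.
summary.to_sql("votes_by_party", datamart, if_exists="replace", index=False)
```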

As much as possible, a datamart should serve multiple purposes that require data of the same nature. However, it would be a mistake to squeeze too much data from the warehouse into one datamart solely to avoid writing a refiner.

Researchers will eventually be able to make some datamarts public in order to share them with the community.

## Development methodology

It is important to respect the CLESSN development methodology and environment requirements when creating active and passive components of a data pipeline. See the Pipelines folder for more details.

Here's a view of the CLESSN data platform:

*(diagram of the CLESSN data platform)*