DIGITAL - History Events (Banno)
Introduction
The goal of this pipeline is to make the History Events dataset available in Google BigQuery, where it can be queried through a SQL interface for business purposes and to drive downstream business processes.
Scenarios
The History Events dataset is a single data file in CSV format sourced from an Azure Storage bucket on a predetermined cadence. The current requirement dictates that this file be copied to a Google Cloud Storage bucket, imported into BigQuery, and exposed as a native table that provides a SQL interface over the data to drive downstream business processes.
This pipeline enables three scenarios:
- Initial load: The data file is loaded for the first time
- Incremental load: Subsequent data file loads after the initial load
- Historical load / backfill: Existing data is cleaned up and the dataset is reloaded from the beginning of time (see the sketch after this list)
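The backfill scenario implies clearing the existing native table before the historical reload. The following is a minimal sketch of that cleanup step only; the project, dataset, and table names are placeholders and are not taken from the pipeline's actual configuration.

```python
from google.cloud import bigquery

PROJECT_ID = "my-gcp-project"                      # placeholder project (assumption)
TABLE_ID = f"{PROJECT_ID}.history_events.events"   # placeholder dataset/table (assumption)


def prepare_backfill() -> None:
    """Clear the existing native table so a historical reload starts from a clean slate."""
    client = bigquery.Client(project=PROJECT_ID)
    # TRUNCATE removes all rows but keeps the table, its schema, and its permissions,
    # so subsequent transfer runs can append the full history from the beginning of time.
    client.query(f"TRUNCATE TABLE `{TABLE_ID}`").result()


if __name__ == "__main__":
    prepare_backfill()
```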
Pipelines
The pipeline consists of two Google Data Transfer jobs. The first job transfers the data file from the source Azure Storage bucket into the destination Google Cloud Storage bucket, and the second job picks up the data file from the Google Cloud Storage bucket, imports it into Google BigQuery, and exposes it as a native table. After the data file is imported, the copies of the data file in the Azure Storage bucket and the Google Cloud Storage bucket are both retained for later archival and audit purposes.

After the initial data file import into BigQuery, subsequent imports only append to the existing table; there is no additional work to de-duplicate data or purge older entries. Per the current requirements, the two jobs are time-driven and are not triggered by events such as the availability of the source file or the completion of the first job.
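The document refers to both jobs generically as Google Data Transfer jobs. As an illustration only, assuming the first job (Azure Storage to Google Cloud Storage) is implemented with the Storage Transfer Service, a sketch of its configuration could look like the following; the project, storage account, container, SAS token, bucket, and start date values are all placeholders.

```python
from google.cloud import storage_transfer


def create_azure_to_gcs_job(
    project_id: str,
    azure_storage_account: str,
    azure_sas_token: str,
    source_container: str,
    sink_bucket: str,
) -> storage_transfer.TransferJob:
    """Create a scheduled Azure Storage -> GCS copy job (sketch with placeholder values)."""
    client = storage_transfer.StorageTransferServiceClient()
    transfer_job = {
        "project_id": project_id,
        "description": "History Events: copy CSV from Azure Storage to GCS",
        "status": storage_transfer.TransferJob.Status.ENABLED,
        # Time-driven: the service runs the job on its own schedule; it is not
        # triggered by the arrival of the source file.
        "schedule": {
            "schedule_start_date": {"year": 2024, "month": 1, "day": 1},  # placeholder start date
            "repeat_interval": {"seconds": 3600},  # hourly cadence
        },
        "transfer_spec": {
            "azure_blob_storage_data_source": {
                "storage_account": azure_storage_account,
                "azure_credentials": {"sas_token": azure_sas_token},
                "container": source_container,
            },
            "gcs_data_sink": {"bucket_name": sink_bucket},
            # No delete-after-transfer options: the source copy is retained for
            # archival and audit purposes.
        },
    }
    return client.create_transfer_job({"transfer_job": transfer_job})
```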
Schedule
Initially, the data import process is expected to run hourly; however, the pipeline can be changed to any schedule as needed in the future.
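For the second job (Google Cloud Storage to BigQuery), a sketch using the BigQuery Data Transfer Service with the Cloud Storage data source is shown below. The bucket path, dataset, and table names are placeholders, and the parameter names follow the Cloud Storage transfer documentation rather than this pipeline's actual configuration; treat them as assumptions.

```python
from google.cloud import bigquery_datatransfer


def create_gcs_to_bq_transfer(
    project_id: str,
    dataset_id: str,
    bucket: str,
) -> bigquery_datatransfer.TransferConfig:
    """Create an hourly GCS -> BigQuery load config (sketch with placeholder values)."""
    client = bigquery_datatransfer.DataTransferServiceClient()
    transfer_config = bigquery_datatransfer.TransferConfig(
        destination_dataset_id=dataset_id,
        display_name="History Events: load CSV from GCS into BigQuery",
        data_source_id="google_cloud_storage",
        params={
            "data_path_template": f"gs://{bucket}/history-events/*.csv",  # placeholder path
            "destination_table_name_template": "events",                  # placeholder table
            "file_format": "CSV",
            "skip_leading_rows": "1",
            # APPEND: subsequent runs only add rows; no de-duplication or purging.
            "write_disposition": "APPEND",
        },
        # Time-driven hourly cadence, matching the initial requirement.
        schedule="every 1 hours",
    )
    return client.create_transfer_config(
        parent=f"projects/{project_id}",
        transfer_config=transfer_config,
    )
```

In this sketch the schedule string is the only piece that would need to change to move the pipeline off the hourly cadence.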