Payments-Smartsight
Introduction
The goal of this pipeline is to make the Payments dataset available in Google BigQuery, where it can be queried through a SQL interface for business analysis and to drive downstream business processes.
Scenarios
The Payments dataset comprises a single data file for each Jack Henry Payments solution. The data files are in Parquet format and are sourced from an Amazon S3 bucket on a monthly cadence. Today, each file must be copied to a Google Cloud Storage bucket, imported into BigQuery, and exposed as a native table to provide a SQL interface for the data.
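For illustration, the sketch below performs the import step with the BigQuery Python client, loading a Parquet file from Cloud Storage into a native table. The project, dataset, table, and bucket names are hypothetical placeholders, not the real resources behind this pipeline.

```python
# Minimal sketch: import a Parquet file from Cloud Storage into BigQuery
# as a native table. All resource names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")  # hypothetical project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    # Append on each run; BigQuery creates the table on the first load
    # because create_disposition defaults to CREATE_IF_NEEDED.
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://example-payments-bucket/payments/*.parquet",  # hypothetical path
    "example-project.payments_dataset.payments",        # hypothetical table
    job_config=job_config,
)
load_job.result()  # block until the load job completes
```

Because the load appends and creates the table if it does not already exist, the same job shape covers all three scenarios below, differing only in which monthly file it points at.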
This pipeline enables three scenarios:
- Initial load: A user loads the data files for the first time.
- Incremental load: A user performs subsequent data file loads to update the data from the initial load.
- Historical load/backfill: The user loads historical data to update or “clean up” existing data from a prior load, or to provide data for a period preceding the data imported in prior loads.
Pipelines
The pipeline consists of two Google Data Transfer jobs. The first job transfers the data file from the source Amazon S3 bucket into the destination Google Cloud Storage bucket. The second job picks up the data file from the Google Cloud Storage bucket and imports it into BigQuery, exposing it as a native table. After the data file is imported, the copies of the data file in the Amazon S3 bucket are purged. After the initial import into BigQuery, subsequent imports only append to the existing table; there is no additional work to de-duplicate data or purge older entries. Per the current requirements, the two jobs are time-driven rather than triggered by events such as the arrival of the source file or the completion of the first job.
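The document does not name the underlying services, so the following sketch assumes the first job is implemented with the Storage Transfer Service Python client; all bucket names and credentials are placeholders. The source-deletion option shown is one possible way to realize the purge of the S3 copies, although it deletes objects after the copy rather than after the BigQuery import.

```python
# Sketch of the first job, assuming it uses Storage Transfer Service.
# All resource names and credentials are placeholders.
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()

transfer_job = client.create_transfer_job(
    storage_transfer.CreateTransferJobRequest(
        transfer_job={
            "project_id": "example-project",
            "description": "Monthly Payments Parquet copy from S3 to GCS",
            "status": storage_transfer.TransferJob.Status.ENABLED,
            # A time-driven schedule would be attached here via the
            # "schedule" field; see the Schedule section below.
            "transfer_spec": {
                "aws_s3_data_source": {
                    "bucket_name": "example-payments-s3-bucket",
                    "aws_access_key": {
                        "access_key_id": "EXAMPLE_ACCESS_KEY_ID",
                        "secret_access_key": "EXAMPLE_SECRET",
                    },
                },
                "gcs_data_sink": {"bucket_name": "example-payments-bucket"},
                # Approximates the purge step by deleting the S3 copies
                # once transferred (an assumption; the pipeline purges
                # them after the BigQuery import).
                "transfer_options": {
                    "delete_objects_from_source_after_transfer": True
                },
            },
        }
    )
)
print(f"Created transfer job: {transfer_job.name}")
```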
Schedule
Initially, the data import process is expected to run once a month, but the pipeline is extensible and may be changed to run on any schedule as needed.
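Assuming the second job is a BigQuery Data Transfer Service config, the monthly cadence could be expressed as a custom schedule string, as sketched below; the dataset, path, parameter keys, and schedule string are illustrative rather than confirmed details of this pipeline.

```python
# Sketch: a monthly schedule on a BigQuery Data Transfer Service config
# that loads Parquet files from Cloud Storage. Names and parameter values
# are assumptions for illustration.
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="payments_dataset",  # hypothetical dataset
    display_name="Monthly Payments import",
    data_source_id="google_cloud_storage",
    params={
        "data_path_template": "gs://example-payments-bucket/payments/*.parquet",
        "destination_table_name_template": "payments",
        "file_format": "PARQUET",
        "write_disposition": "APPEND",  # append-only, matching the pipeline
    },
    # Runs on the first day of each month; any supported schedule string
    # (e.g. "every 24 hours") could be substituted later.
    schedule="1 of month 00:00",
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path("example-project"),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")
```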