Thursday, July 4, 2024

Automate large-scale data validation using Amazon EMR and Apache Griffin

Many enterprises are migrating their on-premises data stores to the AWS Cloud. During data migration, a key requirement is to validate all the data that has been moved from source to target. This data validation is a critical step, and if not done correctly, may result in the failure of the entire project. However, developing custom solutions to determine migration accuracy by comparing the data between the source and target can often be time-consuming.

In this post, we walk through a step-by-step process to validate large datasets after migration using a configuration-based tool built with Amazon EMR and the Apache Griffin open source library. Griffin is an open source data quality solution for big data, which supports both batch and streaming mode.

In today's data-driven landscape, where organizations deal with petabytes of data, the need for automated data validation frameworks has become increasingly critical. Manual validation processes are not only time-consuming but also prone to errors, especially when dealing with vast volumes of data. Automated data validation frameworks offer a streamlined solution by efficiently comparing large datasets, identifying discrepancies, and ensuring data accuracy at scale. With such frameworks, organizations can save valuable time and resources while maintaining confidence in the integrity of their data, thereby enabling informed decision-making and enhancing overall operational efficiency.

The following are standout features of this framework:

  • Uses a configuration-driven framework
  • Offers plug-and-play functionality for seamless integration
  • Conducts count comparison to identify any disparities
  • Implements robust data validation procedures
  • Ensures data quality through systematic checks
  • Provides access to a file containing mismatched records for in-depth analysis
  • Generates comprehensive reports for insights and monitoring purposes

Solution overview

This solution uses the following services:

  • Amazon Simple Storage Service (Amazon S3) or Hadoop Distributed File System (HDFS) as the source and target.
  • Amazon EMR to run the PySpark script. We use a Python wrapper on top of Griffin to validate data between Hadoop tables created over HDFS or Amazon S3.
  • AWS Glue to catalog the technical table, which stores the results of the Griffin job.
  • Amazon Athena to query the output table to verify the results.

We use tables that store the count for each source and target table and also create files that show the difference of records between source and target.

The following diagram illustrates the solution architecture.

Architecture_Diagram

In the depicted architecture and our typical data lake use case, our data either resides in Amazon S3 or is migrated from on premises to Amazon S3 using replication tools such as AWS DataSync or AWS Database Migration Service (AWS DMS). Although this solution is designed to work seamlessly with both Hive Metastore and the AWS Glue Data Catalog, we use the Data Catalog as our example in this post.

This framework operates within Amazon EMR, automatically running scheduled tasks daily, as per the defined frequency. It generates and publishes reports in Amazon S3, which are then accessible via Athena. A notable feature of this framework is its capability to detect count mismatches and data discrepancies, in addition to producing a file in Amazon S3 containing the full records that didn't match, facilitating further analysis.

In this example, we use three tables in an on-premises database to validate between source and target: balance_sheet, covid, and survery_financial_report.

Prerequisites

Before getting started, make sure you have the following prerequisites:

Deploy the solution

To make it easy for you to get started, we have created a CloudFormation template that automatically configures and deploys the solution for you. Complete the following steps:

  1. Create an S3 bucket in your AWS account called bdb-3070-griffin-datavalidation-blog-${AWS::AccountId}-${AWS::Region} (provide your AWS account ID and AWS Region); see the CLI sketch after this list.
  2. Unzip the following file to your local system.
  3. After unzipping the file to your local system, change <bucket name> to the one you created in your account (bdb-3070-griffin-datavalidation-blog-${AWS::AccountId}-${AWS::Region}) in the following files:
    1. bootstrap-bdb-3070-datavalidation.sh
    2. Validation_Metrics_Athena_tables.hql
    3. datavalidation/totalcount/totalcount_input.txt
    4. datavalidation/accuracy/accuracy_input.txt
  4. Upload all the folders and files in your local folder to your S3 bucket:
    aws s3 cp . s3://<bucket_name>/ --recursive

  5. Run the following CloudFormation template in your account.
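
If you prefer to perform steps 1 and 4 from the AWS CLI, the following is a minimal sketch. It assumes the example account ID 111122223333 and the us-east-1 Region, so substitute your own values, and run it from the folder where you unzipped the files.

# Create the bucket (substitute your account ID and Region)
aws s3 mb s3://bdb-3070-griffin-datavalidation-blog-111122223333-us-east-1 --region us-east-1

# Upload the unzipped folders and files from the current directory
aws s3 cp . s3://bdb-3070-griffin-datavalidation-blog-111122223333-us-east-1/ --recursive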

The CloudFormation template creates a database called griffin_datavalidation_blog and an AWS Glue crawler called griffin_data_validation_blog on top of the data folder in the .zip file.

  1. Choose Next.
    Cloudformation_template_1
  2. Choose Next again.
  3. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  4. Choose Create stack.

You can view the stack outputs on the AWS Management Console or by using the following AWS CLI command:

aws cloudformation describe-stacks --stack-name <stack-name> --region us-east-1 --query Stacks[0].Outputs

  1. Run the AWS Glue crawler and verify that six tables have been created in the Data Catalog (see the CLI sketch after this list).
  2. Run the following CloudFormation template in your account.
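
As a quick check, you can start the crawler and list the resulting tables from the AWS CLI. The following is a minimal sketch; it uses the crawler and database names created by the first CloudFormation template.

# Start the crawler created by the first CloudFormation template
aws glue start-crawler --name griffin_data_validation_blog

# After the crawler finishes, confirm that six tables exist in the database
aws glue get-tables --database-name griffin_datavalidation_blog --query 'TableList[].Name'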

The second CloudFormation template creates an EMR cluster with a bootstrap script to copy Griffin-related JARs and artifacts. It also runs three EMR steps:

  • Create two Athena tables and two Athena views to see the validation matrix produced by the Griffin framework
  • Run count validation for all three tables to compare the source and target tables
  • Run record-level and column-level validations for all three tables to compare the source and target tables
  1. For SubnetID, enter your subnet ID.
  2. Choose Next.
    Cloudformation_template_2
  3. Choose Next again.
  4. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  5. Choose Create stack.

You can view the stack outputs on the console or by using the following AWS CLI command:

aws cloudformation describe-stacks --stack-name <stack-name> --region us-east-1 --query Stacks[0].Outputs

It takes approximately 5 minutes for the deployment to complete. When the stack is complete, you should see the EMRCluster resource launched and available in your account.

When the EMR cluster is launched, it runs the following steps as part of the post-cluster launch:

  • Bootstrap action – Installs the Griffin JAR file and directories for this framework. It also downloads sample data files to use in the next step.
  • Athena_Table_Creation – Creates tables in Athena to read the result reports.
  • Count_Validation – Runs the job to compare the data count between the source and target data from the Data Catalog table and stores the results in an S3 bucket, which will be read via an Athena table.
  • Accuracy – Runs the job to compare the data rows between the source and target data from the Data Catalog table and stores the results in an S3 bucket, which will be read via an Athena table.

Athena_table

When the EMR steps are complete, your table comparison is done and ready to view in Athena automatically. No manual intervention is needed for validation.
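
If you want to confirm from the AWS CLI that the EMR steps finished successfully, the following is a minimal sketch; replace the cluster ID placeholder with the EMRCluster value from the stack outputs.

# List the EMR steps and their states (expect COMPLETED for all of them)
aws emr list-steps --cluster-id <your-emr-cluster-id> --query 'Steps[].{Name:Name,State:Status.State}'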

Validate data with Python Griffin

When your EMR cluster is ready and all the jobs are complete, the count validation and data validation are done. The results have been saved in Amazon S3, and the Athena tables are already created on top of them. You can query the Athena tables to view the results, as shown in the following screenshots, or submit the same queries from the AWS CLI, as in the sketch that follows.
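
The following is a minimal sketch of submitting one of these queries with the Athena CLI. The table name summary_table, the database, and the output location are placeholders for illustration only; substitute the actual table or view names created by Validation_Metrics_Athena_tables.hql, the database they were created in, and an S3 location you own.

# Submit a query against the validation results (table name is a placeholder)
aws athena start-query-execution \
  --query-string "SELECT * FROM summary_table" \
  --query-execution-context Database=griffin_datavalidation_blog \
  --result-configuration OutputLocation=s3://<your-query-results-bucket>/athena-results/

# Fetch the results using the QueryExecutionId returned by the previous command
aws athena get-query-results --query-execution-id <query-execution-id>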

The following screenshot shows the count results for all tables.

Summary_table

The following screenshot shows the data accuracy results for all tables.

Detailed_view

The following screenshot shows the files created for each table with mismatched records. Individual folders are generated for each table directly by the job.

mismatched_records

Every table folder contains a directory for each day the job is run.

S3_path_mismatched

Within that specific date folder, a file named __missRecords contains the records that don't match.

S3_path_mismatched_2

The following screenshot shows the contents of the __missRecords file.

__missRecords
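
To inspect the mismatched records without the console, you can browse and stream the file from the AWS CLI. The following is a minimal sketch; the bucket, table, and date prefixes are placeholders for the folder structure described above, so copy the exact path from the Amazon S3 console.

# Browse the per-table, per-date folders that hold the mismatch output
aws s3 ls s3://<bucket_name>/<table_folder>/<date_folder>/

# Stream the __missRecords file to your terminal for a quick look
aws s3 cp s3://<bucket_name>/<table_folder>/<date_folder>/__missRecords -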

Clean up

To avoid incurring additional charges, complete the following steps to clean up your resources when you're done with the solution (see the CLI sketch after this list):

  1. Delete the AWS Glue database griffin_datavalidation_blog and drop the database griffin_datavalidation_blog cascade.
  2. Delete the prefixes and objects you created from the bucket bdb-3070-griffin-datavalidation-blog-${AWS::AccountId}-${AWS::Region}.
  3. Delete the CloudFormation stack, which removes your additional resources.
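
The following is a minimal CLI sketch of these cleanup steps, assuming the bucket and database names used earlier; because this post uses the Data Catalog, deleting the AWS Glue database also drops its tables.

# Delete the Glue database and its tables
aws glue delete-database --name griffin_datavalidation_blog

# Remove the objects you uploaded and the objects the jobs created (substitute your account ID and Region)
aws s3 rm s3://bdb-3070-griffin-datavalidation-blog-111122223333-us-east-1 --recursive

# Delete the CloudFormation stacks (repeat for each stack you deployed)
aws cloudformation delete-stack --stack-name <stack-name>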

Conclusion

This post showed how you can use Python Griffin to accelerate the post-migration data validation process. Python Griffin helps you calculate count and row- and column-level validation, identifying mismatched records without writing any code.

For more information about data quality use cases, refer to Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog and AWS Glue Data Quality.


About the Authors

Dipal Mahajan serves as a Lead Consultant at Amazon Web Services, providing expert guidance to global clients on building highly secure, scalable, reliable, and cost-efficient cloud applications. With a wealth of experience in software development, architecture, and analytics across diverse sectors such as finance, telecom, retail, and healthcare, he brings invaluable insights to his role. Beyond the professional sphere, Dipal enjoys exploring new destinations, having already visited 14 of the 30 countries on his wish list.

Akhil is a Lead Consultant at AWS Professional Services. He helps customers design and build scalable data analytics solutions and migrate data pipelines and data warehouses to AWS. In his spare time, he loves traveling, playing games, and watching movies.

Ramesh Raghupathy is a Senior Data Architect with WWCO ProServe at AWS. He works with AWS customers to architect, deploy, and migrate to data warehouses and data lakes on the AWS Cloud. While not at work, Ramesh enjoys traveling, spending time with family, and yoga.
