Wednesday, July 3, 2024

Scale AWS Glue jobs by optimizing IP address consumption and expanding network capacity using a private NAT gateway

As businesses expand, the demand for IP addresses within the corporate network often exceeds the supply. An organization's network is commonly designed with some anticipation of future requirements, but as enterprises evolve, their information technology (IT) needs surpass the previously designed network. Companies may find themselves challenged to manage the limited pool of IP addresses.

For data engineering workloads, when AWS Glue is used in such a constrained network configuration, your team may sometimes face hurdles running many jobs concurrently. This happens because you may not have enough IP addresses to support the required connections to databases. To overcome this shortage, the team may get more IP addresses from your corporate network pool. These obtained IP addresses can be unique (non-overlapping) or overlapping, when the IP addresses are reused in your corporate network.

When you use overlapping IP addresses, you need additional network management to establish connectivity. Networking solutions can include options such as private Network Address Translation (NAT) gateways, AWS PrivateLink, or self-managed NAT appliances to translate IP addresses.

In this post, we discuss two strategies to scale AWS Glue jobs:

  1. Optimizing IP address consumption by right-sizing Data Processing Units (DPUs), using the AWS Glue Auto Scaling feature, and fine-tuning the jobs.
  2. Expanding network capacity using an additional non-routable Classless Inter-Domain Routing (CIDR) range with a private NAT gateway.

Before we dive deep into these solutions, let's understand how AWS Glue uses elastic network interfaces (ENIs) to establish connectivity. To enable access to data stores inside a VPC, you need to create an AWS Glue connection that is attached to your VPC. When an AWS Glue job runs in your VPC, the job creates an ENI inside the configured VPC for each data connection, and that ENI uses an IP address in the specified VPC. These ENIs are short-lived and stay active until the job is complete.
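
For reference, the following is a minimal boto3 sketch of creating such a VPC-attached JDBC connection; the connection name, subnet, security group, and endpoint are illustrative placeholders, not values from this walkthrough.

import boto3

glue = boto3.client("glue")

# Create a JDBC connection attached to a VPC subnet. AWS Glue places one ENI
# per worker in this subnet when a job uses the connection.
glue.create_connection(
    ConnectionInput={
        "Name": "mysql-vpc-connection",  # illustrative name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://<hostname>:3306/srcdb",
            "USERNAME": "admin",
            "PASSWORD": "<password>",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",  # where ENIs are created
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)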

Now let's look at the first solution, which explains optimizing AWS Glue IP address consumption.

Strategies for efficient IP address consumption

In AWS Glue, the number of workers a job uses determines the count of IP addresses used from your VPC subnet. This is because each worker requires one IP address that maps to one ENI. When you don't have enough of a CIDR range allocated to the AWS Glue subnet, you may observe IP address exhaustion errors. The following are some best practices to optimize AWS Glue IP address consumption:

  • Right-sizing the job's DPUs – AWS Glue is a distributed processing engine. It works efficiently when it can run tasks in parallel. If a job has more than the required DPUs, it doesn't always run quicker, so finding the right number of DPUs makes sure you use IP addresses optimally. By building observability into the system and analyzing job performance, you can get insights into ENI consumption trends and then configure the appropriate capacity on the job. For more details, refer to Monitoring for DPU capacity planning. The Spark UI is a helpful tool to monitor AWS Glue jobs' worker utilization. For more details, refer to Monitoring jobs using the Apache Spark web UI.
  • AWS Glue Auto Scaling – It's often difficult to predict a job's capacity requirements upfront. Enabling the AWS Glue Auto Scaling feature offloads some of this responsibility to AWS. At runtime, based on the workload requirements, the job automatically scales worker nodes up to the defined maximum configuration. If there is no additional need, AWS Glue doesn't overprovision workers, thereby saving resources and reducing cost. The Auto Scaling feature is available in AWS Glue 3.0 and later. For more information, refer to Introducing AWS Glue Auto Scaling: Automatically resize serverless computing resources for lower cost with optimized Apache Spark. (A configuration sketch follows this list.)
  • Job-level optimization – Identify job-level optimizations by using AWS Glue job metrics, and apply best practices from Best practices for performance tuning AWS Glue for Apache Spark jobs.
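
As a minimal sketch of the first two practices, the following boto3 call caps a job at 10 G.1X workers and turns on Auto Scaling via the documented --enable-auto-scaling job argument; the job name, role, and script location are placeholders.

import boto3

glue = boto3.client("glue")

# Cap the job at 10 workers and let Auto Scaling (Glue 3.0+) provision
# ENI-consuming workers only when the workload needs them.
glue.update_job(
    JobName="my-glue-job",  # placeholder
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/my-glue-role",  # placeholder
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://my-bucket/scripts/my-glue-job.py",
        },
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": 10,  # the upper bound for Auto Scaling
        "DefaultArguments": {"--enable-auto-scaling": "true"},
    },
)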

Next, let's look at the second solution, which elaborates on network capacity expansion.

Solutions for network size (IP address) expansion

In this section, we discuss two possible solutions to expand network size in more detail.

Expand VPC CIDR ranges with routable addresses

One solution is to add more private IPv4 CIDR ranges from RFC 1918 to your VPC. Theoretically, each AWS account can be assigned some or all of these IP address CIDRs. Your IP Address Management (IPAM) team often manages the allocation of IP addresses that each business unit can use from RFC 1918 to avoid overlapping IP addresses across multiple AWS accounts or business units. If your current routable IP address quota allocated by the IPAM team isn't sufficient, you can request more.

If your IPAM team issues you an additional non-overlapping CIDR range, you can either add it as a secondary CIDR to your existing VPC or create a new VPC with it. If you're planning to create a new VPC, you can interconnect the VPCs via VPC peering or AWS Transit Gateway.
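
The secondary-CIDR option is a single API call; the following boto3 sketch associates an illustrative range and carves a subnet out of it (IDs and ranges are placeholders).

import boto3

ec2 = boto3.client("ec2")

# Associate an additional non-overlapping RFC 1918 range, issued by the IPAM
# team, as a secondary CIDR on the existing VPC.
ec2.associate_vpc_cidr_block(
    VpcId="vpc-0123456789abcdef0",
    CidrBlock="10.1.0.0/16",
)

# Create a subnet in the new range for AWS Glue ENIs.
ec2.create_subnet(
    VpcId="vpc-0123456789abcdef0",
    CidrBlock="10.1.0.0/20",
    AvailabilityZone="us-east-1a",
)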

If this additional capacity is sufficient to run all your jobs within the defined timeframe, it's a simple and cost-effective solution. Otherwise, you can consider adopting overlapping IP addresses with a private NAT gateway, as described in the following section. With that solution, you must use Transit Gateway to connect VPCs, because VPC peering isn't possible when two VPCs have overlapping CIDR ranges.

Configure a non-routable CIDR with a private NAT gateway

As described in the AWS whitepaper Building a Scalable and Secure Multi-VPC AWS Network Infrastructure, you can expand your network capacity by creating a non-routable IP address subnet and using a private NAT gateway that is located in a routable IP address space (non-overlapping) to route traffic. A private NAT gateway translates and routes traffic between non-routable IP addresses and routable IP addresses. The following diagram demonstrates the solution in the context of AWS Glue.

High level architecture

As you can see in the preceding diagram, VPC A (ETL) has two CIDR ranges attached. The smaller CIDR range, 172.33.0.0/24, is routable because it's not reused anywhere, whereas the larger CIDR range, 100.64.0.0/16, is non-routable because it's reused in the database VPC.

In VPC B (Database), we've hosted two databases in the routable subnets 172.30.0.0/26 and 172.30.0.64/26. These two subnets are in two separate Availability Zones for high availability. We also have two additional unused subnets, 100.64.0.0/24 and 100.64.1.0/24, to simulate a non-routable setup.

You can choose the size of the non-routable CIDR range based on your capacity requirements. Because you can reuse IP addresses, you can create a very large subnet as needed. For example, a CIDR mask of /16 gives you approximately 65,000 IPv4 addresses. You can work with your network engineering team to size the subnets.
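
To make the sizing concrete, here is a small Python sketch using the standard-library ipaddress module; note that AWS reserves the first four and the last address in every subnet, which the last line accounts for.

import ipaddress

# A /16 contains 2^16 = 65,536 addresses in total.
cidr = ipaddress.ip_network("100.64.0.0/16")
print(cidr.num_addresses)            # 65536

# Carved into /24 subnets for AWS Glue ENIs:
subnets = list(cidr.subnets(new_prefix=24))
print(len(subnets))                  # 256 subnets
print(subnets[0].num_addresses - 5)  # 251 usable per subnet on AWS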

In short, you can configure AWS Glue jobs to use both routable and non-routable subnets in your VPC to maximize the available IP address pool.

Now let's understand how AWS Glue ENIs that are in a non-routable subnet communicate with data sources in another VPC.

Call flow

The data flow for the use case demonstrated here is as follows (referring to the numbered steps in the figure above):

  1. When an AWS Glue job needs to access a data source, it first uses the AWS Glue connection on the job and creates the ENIs in the non-routable subnet 100.64.0.0/24 in VPC A. Later, AWS Glue uses the database connection configuration and attempts to connect to the database in VPC B, 172.30.0.0/24.
  2. As per the route table VPCA-Non-Routable-RouteTable, the destination 172.30.0.0/24 is configured for a private NAT gateway. The request is sent to the NAT gateway, which then translates the source IP address from a non-routable IP address to a routable IP address. Traffic is then sent to the transit gateway attachment in VPC A because it's associated with the VPCA-Routable-RouteTable route table in VPC A. (The key route entries are sketched after this list.)
  3. Transit Gateway uses the 172.30.0.0/24 route and sends the traffic to the VPC B transit gateway attachment.
  4. The transit gateway ENI in VPC B uses VPC B's local route to connect to the database endpoint and query the data.
  5. When the query is complete, the response is sent back to VPC A. The response traffic is routed to the transit gateway attachment in VPC B, then Transit Gateway uses the 172.33.0.0/24 route and sends the traffic to the VPC A transit gateway attachment.
  6. The transit gateway ENI in VPC A uses the local route to forward the traffic to the private NAT gateway, which translates the destination IP address to that of the ENIs in the non-routable subnet.
  7. Finally, the AWS Glue job receives the data and continues processing.
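
The route entries that drive steps 2 and 3 boil down to two create_route calls; the following boto3 sketch shows the idea with placeholder resource IDs (the CloudFormation stack creates the real ones).

import boto3

ec2 = boto3.client("ec2")

# In VPCA-Non-Routable-RouteTable: send database-bound traffic through the
# private NAT gateway so its source address becomes routable.
ec2.create_route(
    RouteTableId="rtb-0aaaaaaaaaaaaaaa0",   # placeholder
    DestinationCidrBlock="172.30.0.0/24",
    NatGatewayId="nat-0bbbbbbbbbbbbbbb0",   # private NAT gateway, placeholder
)

# In VPCA-Routable-RouteTable: forward the translated traffic to the
# transit gateway, which carries it to VPC B.
ec2.create_route(
    RouteTableId="rtb-0ccccccccccccccc0",      # placeholder
    DestinationCidrBlock="172.30.0.0/24",
    TransitGatewayId="tgw-0ddddddddddddddd0",  # placeholder
)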

The private NAT gateway solution is an option if you need additional IP addresses when you can't obtain them from a routable network in your organization. As with every additional service, there is an additional cost incurred, and this trade-off is necessary to meet your goals. Refer to the NAT Gateway pricing section on the Amazon VPC pricing page for more information.

Prerequisites

To complete the walkthrough of the private NAT gateway solution, you need the following:

Deploy the solution

To implement the solution, complete the following steps:

  1. Sign in to your AWS Management Console.
  2. Deploy the solution by choosing Launch stack. The stack defaults to us-east-1; you can select your desired Region.
  3. Choose Next, then specify the stack details. You can retain the prepopulated default values for the input parameters or change them as needed.
  4. For DatabaseUserPassword, enter an alphanumeric password of your choice and make sure to note it down for later use.
  5. For S3BucketName, enter a unique Amazon Simple Storage Service (Amazon S3) bucket name. This bucket stores the AWS Glue job script that will be copied from an AWS public code repository.
  6. Choose Next.
  7. Leave the default values and choose Next again.
  8. Review the details, acknowledge the creation of IAM resources, and choose Submit to start the deployment.

You can monitor the events to see resources being created on the AWS CloudFormation console. It may take around 20 minutes for the stack resources to be created.

After the stack creation is complete, go to the Outputs tab on the AWS CloudFormation console and note the following values for later use (or retrieve them programmatically, as shown after the list):

  • DBSource
  • DBTarget
  • SourceCrawler
  • TargetCrawler
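
If you prefer the SDK, a short boto3 sketch to read the same outputs:

import boto3

cfn = boto3.client("cloudformation")

# Read the stack outputs instead of copying them from the console.
stack = cfn.describe_stacks(StackName="AWSGluePrivateNATStack")["Stacks"][0]
outputs = {o["OutputKey"]: o["OutputValue"] for o in stack["Outputs"]}
print(outputs["DBSource"], outputs["DBTarget"])
print(outputs["SourceCrawler"], outputs["TargetCrawler"])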

Connect to an AWS Cloud9 instance

Next, we need to prepare the source and target Amazon RDS for MySQL tables using an AWS Cloud9 instance. Complete the following steps:

  1. On the AWS Cloud9 console, locate the aws-glue-cloud9 environment.
  2. In the Cloud9 IDE column, choose Open to launch your AWS Cloud9 instance in a new web browser.

Prepare the source MySQL table

Complete the following steps to prepare your source table:

  1. From the AWS Cloud9 terminal, install the MySQL client using the following command: sudo yum update -y && sudo yum install -y mysql
  2. Connect to the source database using the following command. Replace the source hostname with the DBSource value you captured earlier. When prompted, enter the database password that you specified during the stack creation. mysql -h <Source Hostname> -P 3306 -u admin -p
  3. Run the following scripts to create the source emp table and load the test data:
    -- connect to source database
    USE srcdb;
    -- Drop emp table if it exists
    DROP TABLE IF EXISTS emp;
    -- Create the emp table
    CREATE TABLE emp (empid INT AUTO_INCREMENT,
                      ename VARCHAR(100) NOT NULL,
                      edept VARCHAR(100) NOT NULL,
                      PRIMARY KEY (empid));
    -- Create a stored procedure to load sample records into emp table
    DELIMITER $$
    CREATE PROCEDURE sp_load_emp_source_data()
    BEGIN
    DECLARE empid INT;
    DECLARE ename VARCHAR(100);
    DECLARE edept VARCHAR(50);
    DECLARE cnt INT DEFAULT 1; -- Initialize counter to 1 to auto-increment the PK
    DECLARE rec_count INT DEFAULT 1000; -- Initialize sample record counter
    TRUNCATE TABLE emp; -- Truncate the emp table
    WHILE cnt <= rec_count DO -- Loop and load the specified number of sample records
    SET ename = CONCAT('Employee_', FLOOR(RAND() * 100) + 1); -- Generate a random employee name
    SET edept = CONCAT('Dept_', FLOOR(RAND() * 100) + 1); -- Generate a random employee department
    -- Insert a record with auto-incrementing empid
    INSERT INTO emp (ename, edept) VALUES (ename, edept);
    -- Increment the counter for the next record
    SET cnt = cnt + 1;
    END WHILE;
    COMMIT;
    END$$
    DELIMITER ;
    -- Call the above stored procedure to load sample records into emp table
    CALL sp_load_emp_source_data();
  4. Check the source emp table's record count using the following SQL query (you need this at a later step for verification): select count(*) from emp;
  5. Run the following command to exit the MySQL client utility and return to the AWS Cloud9 instance's terminal: quit;

Prepare the target MySQL table

Complete the following steps to prepare the target table:

  1. Connect to the target database using the following command. Replace the target hostname with the DBTarget value you captured earlier. When prompted, enter the database password that you specified during the stack creation. mysql -h <Target Hostname> -P 3306 -u admin -p
  2. Run the following scripts to create the target emp table. This table will be loaded by the AWS Glue job in a later step.
    -- connect to the target database
    USE targetdb;
    -- Drop emp table if it exists
    DROP TABLE IF EXISTS emp;
    -- Create the emp table
    CREATE TABLE emp (empid INT AUTO_INCREMENT,
                      ename VARCHAR(100) NOT NULL,
                      edept VARCHAR(100) NOT NULL,
                      PRIMARY KEY (empid)
    );

Verify the networking setup (optional)

The following steps are useful for understanding the NAT gateway, route table, and transit gateway configurations of the private NAT gateway solution. These components were created during the CloudFormation stack creation.

  1. On the Amazon VPC console, navigate to the Virtual private cloud section and locate NAT gateways.
  2. Search for the NAT gateway named Glue-OverlappingCIDR-NATGW and explore it further. As you can see in the following screenshot, the NAT gateway was created in VPC A (ETL) on the routable subnet.
  3. In the navigation pane, navigate to Route tables under the Virtual private cloud section.
  4. Search for VPCA-Non-Routable-RouteTable and explore it further. You can see that the route table is configured to translate traffic from the overlapping CIDR using the NAT gateway.
  5. In the navigation pane, navigate to the Transit gateways section and choose Transit gateway attachments. Enter VPC- in the search box and locate the two newly created transit gateway attachments.
  6. You can explore these attachments further to learn their configurations.
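
You can also perform the same checks from the SDK; a minimal sketch, assuming the resource names above are set as Name tags by the stack:

import boto3

ec2 = boto3.client("ec2")

# Confirm the NAT gateway is private and sits in the routable subnet.
natgws = ec2.describe_nat_gateways(
    Filters=[{"Name": "tag:Name", "Values": ["Glue-OverlappingCIDR-NATGW"]}]
)["NatGateways"]
for gw in natgws:
    print(gw["NatGatewayId"], gw["ConnectivityType"], gw["SubnetId"])

# Inspect the routes in the non-routable route table.
rtb = ec2.describe_route_tables(
    Filters=[{"Name": "tag:Name", "Values": ["VPCA-Non-Routable-RouteTable"]}]
)["RouteTables"][0]
for route in rtb["Routes"]:
    print(route.get("DestinationCidrBlock"), route.get("NatGatewayId"))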

Run the AWS Glue crawlers

Complete the following steps to run the AWS Glue crawlers that are required to catalog the source and target emp tables. This is a prerequisite step for running the AWS Glue job.

  1. On the AWS Glue console, under the Data Catalog section in the navigation pane, choose Crawlers.
  2. Locate the source and target crawlers that you noted earlier.
  3. Select these crawlers and choose Run to create the respective AWS Glue Data Catalog tables.
  4. You can monitor the AWS Glue crawlers for successful completion. It may take around 3–4 minutes for both crawlers to complete. When they're done, the last run status of each crawler changes to Succeeded, and you can also see that two AWS Glue Data Catalog tables were created from this run.
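
Equivalently, you can start and wait for the crawlers with boto3; the crawler names below are placeholders for the SourceCrawler and TargetCrawler stack outputs.

import time

import boto3

glue = boto3.client("glue")
crawlers = ["<SourceCrawler>", "<TargetCrawler>"]  # from the stack outputs

# Start both crawlers, then poll until each returns to the READY state.
for name in crawlers:
    glue.start_crawler(Name=name)

for name in crawlers:
    while glue.get_crawler(Name=name)["Crawler"]["State"] != "READY":
        time.sleep(15)
    print(name, "completed")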

Run the AWS Glue ETL job

After you set up the tables and complete the prerequisite steps, you're ready to run the AWS Glue job that you created using the CloudFormation template. This job connects to the source RDS for MySQL database through the private NAT gateway solution, extracts the data, and loads it into the target RDS for MySQL database. To run the AWS Glue job, complete the following steps:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Choose the job glue-private-nat-job.
  3. Choose Run to start it.
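
Alternatively, a short boto3 sketch to start the run and poll it to a terminal state:

import time

import boto3

glue = boto3.client("glue")

# Start the job and wait for it to finish.
run_id = glue.start_job_run(JobName="glue-private-nat-job")["JobRunId"]
while True:
    state = glue.get_job_run(JobName="glue-private-nat-job", RunId=run_id)[
        "JobRun"
    ]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        print("Final state:", state)
        break
    time.sleep(30)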

The following is the PySpark script for this ETL job:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node AWS Glue Data Catalog (source)
AWSGlueDataCatalog_node = glueContext.create_dynamic_frame.from_catalog(
    database="glue_cat_db_source",
    table_name="srcdb_emp",
    transformation_ctx="AWSGlueDataCatalog_node",
)

# Script generated for node Change Schema
ChangeSchema_node = ApplyMapping.apply(
    frame=AWSGlueDataCatalog_node,
    mappings=[
        ("empid", "int", "empid", "int"),
        ("ename", "string", "ename", "string"),
        ("edept", "string", "edept", "string"),
    ],
    transformation_ctx="ChangeSchema_node",
)

# Script generated for node AWS Glue Data Catalog (target)
AWSGlueDataCatalog_sink_node = glueContext.write_dynamic_frame.from_catalog(
    frame=ChangeSchema_node,
    database="glue_cat_db_target",
    table_name="targetdb_emp",
    transformation_ctx="AWSGlueDataCatalog_sink_node",
)

job.commit()

Based on the job's DPU configuration, AWS Glue creates a set of ENIs in the non-routable subnet that's configured on the AWS Glue connection. You can monitor these ENIs on the Network Interfaces page of the Amazon Elastic Compute Cloud (Amazon EC2) console.

The following screenshot shows the 10 ENIs that were created for the job run to match the requested number of workers configured in the job parameters. As expected, the ENIs were created in the non-routable subnet of VPC A, enabling scalability of IP addresses. After the job is complete, these ENIs are automatically released by AWS Glue.
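
To list the same ENIs from the SDK, you can filter by the non-routable subnet; the subnet ID is a placeholder, and the description filter assumes the usual "Attached to Glue" prefix that Glue-managed ENIs carry.

import boto3

ec2 = boto3.client("ec2")

# List the ENIs AWS Glue created in the non-routable subnet for the job run.
enis = ec2.describe_network_interfaces(
    Filters=[
        {"Name": "subnet-id", "Values": ["subnet-0123456789abcdef0"]},  # placeholder
        {"Name": "description", "Values": ["Attached to Glue*"]},
    ]
)["NetworkInterfaces"]
for eni in enis:
    print(eni["NetworkInterfaceId"], eni["PrivateIpAddress"], eni["Status"])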

While the AWS Glue job is running, you can monitor its status. Upon successful completion, the job's status changes to Succeeded.

Verify the results

After the AWS Glue job is complete, connect to the target MySQL database and verify that the target record count matches the source. You can use the following SQL query in the AWS Cloud9 terminal:

USE targetdb;
SELECT count(*) FROM emp;

Finally, exit the MySQL client utility using the following command and return to the AWS Cloud9 terminal: quit;

You can now confirm that AWS Glue has successfully completed a job to load data into a target database using IP addresses from a non-routable subnet. This concludes the end-to-end testing of the private NAT gateway solution.

Clean up

To avoid incurring future charges, delete the resources created via the CloudFormation stack by completing the following steps:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select the stack AWSGluePrivateNATStack.
  3. Choose Delete to delete the stack. When prompted, confirm the stack deletion.

Conclusion

In this post, we demonstrated how to scale AWS Glue jobs by optimizing IP address consumption and expanding your network capacity by using a private NAT gateway solution. This two-fold approach helps you get unblocked in an environment that has IP address capacity constraints. The options discussed in the AWS Glue IP address optimization section are complementary to the IP address expansion solutions, and you can build on both iteratively to mature your data platform.

Learn more about AWS Glue job optimization techniques from Monitor and optimize cost on AWS Glue for Apache Spark and Best practices to scale Apache Spark jobs and partition data with AWS Glue.


About the authors

Sushanth Kothapally is a Solutions Architect at Amazon Web Services supporting Automotive and Manufacturing customers. He is passionate about designing technology solutions to meet business goals and has a keen interest in serverless and event-driven architectures.

Senthil Kamala Rathinam is a Solutions Architect at Amazon Web Services specializing in Data and Analytics. He is passionate about helping customers design and build modern data platforms. In his free time, Senthil loves to spend time with his family and play badminton.
