Last week, we announced the general availability of the integration between Amazon DataZone and AWS Lake Formation hybrid access mode. In this post, we share how this new feature helps you simplify the way you use Amazon DataZone to enable secure and governed sharing of your data in the AWS Glue Data Catalog. We also delve into how data producers can share their AWS Glue tables through Amazon DataZone without needing to register them in Lake Formation first.
Overview of the Amazon DataZone integration with Lake Formation hybrid access mode
Amazon DataZone is a fully managed data management service to catalog, discover, analyze, share, and govern data between data producers and consumers in your organization. With Amazon DataZone, data producers populate the business data catalog with data assets from data sources such as the AWS Glue Data Catalog and Amazon Redshift. They also enrich their assets with business context to make them straightforward for data consumers to understand. After the data is available in the catalog, data consumers such as analysts and data scientists can search and access this data by requesting subscriptions. When the request is approved, Amazon DataZone can automatically provision access to the data by managing permissions in Lake Formation or Amazon Redshift so that the data consumer can start querying the data using tools such as Amazon Athena or Amazon Redshift.
To manage access to data in the AWS Glue Data Catalog, Amazon DataZone uses Lake Formation. Previously, if you wanted to use Amazon DataZone to manage access to your data in the AWS Glue Data Catalog, you had to onboard your data to Lake Formation first. Now, the integration of Amazon DataZone and Lake Formation hybrid access mode simplifies how you get started with Amazon DataZone by removing the need to onboard your data to Lake Formation first.
Lake Formation hybrid access mode allows you to start managing permissions on your AWS Glue databases and tables through Lake Formation, while continuing to maintain any existing AWS Identity and Access Management (IAM) permissions on these tables and databases. Lake Formation hybrid access mode supports two permission pathways to the same Data Catalog databases and tables:
- In the first pathway, Lake Formation allows you to select specific principals (opt-in principals) and grant them Lake Formation permissions to access databases and tables by opting in
- The second pathway allows all other principals (that aren’t added as opt-in principals) to access these resources through the IAM principal policies for Amazon Simple Storage Service (Amazon S3) and AWS Glue actions
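The two pathways can be pictured with a small toy model. This is only an illustrative sketch, not how Lake Formation is implemented: the `HybridResource` class and the `permission_pathway` function are hypothetical names introduced here, and the role ARNs are made up.

```python
from dataclasses import dataclass, field


@dataclass
class HybridResource:
    """Toy model of a Data Catalog table registered in hybrid access mode."""
    name: str
    # Principals that were explicitly opted in to Lake Formation permissions.
    opt_in_principals: set = field(default_factory=set)


def permission_pathway(resource: HybridResource, principal_arn: str) -> str:
    """Return which pathway governs this principal's access to the resource."""
    if principal_arn in resource.opt_in_principals:
        # First pathway: opt-in principals use Lake Formation permissions.
        return "lake_formation"
    # Second pathway: everyone else keeps using IAM policies for S3 and Glue.
    return "iam"


sales = HybridResource(
    name="sales",
    opt_in_principals={"arn:aws:iam::111122223333:role/finance-consumer"},
)
print(permission_pathway(sales, "arn:aws:iam::111122223333:role/finance-consumer"))
print(permission_pathway(sales, "arn:aws:iam::111122223333:role/sales-pipeline"))
```

The point of the model is that both answers can be true for the same table at the same time, which is what lets existing IAM-based workloads keep running unchanged.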
With the integration between Amazon DataZone and Lake Formation hybrid access mode, if you have tables in the AWS Glue Data Catalog that are managed through IAM-based policies, you can publish these tables directly to Amazon DataZone, without registering them in Lake Formation. Amazon DataZone registers the location of these tables in Lake Formation using hybrid access mode, which allows managing permissions on AWS Glue tables through Lake Formation, while continuing to maintain any existing IAM permissions.
Amazon DataZone enables you to publish any type of asset in the business data catalog. For some of these assets, Amazon DataZone can automatically manage access grants. These assets are called managed assets, and include Lake Formation-managed Data Catalog tables and Amazon Redshift tables and views. Prior to this integration, you had to complete the following steps before Amazon DataZone could treat the published Data Catalog table as a managed asset:
- Identify the Amazon S3 location associated with the Data Catalog table.
- Register the Amazon S3 location with Lake Formation in hybrid access mode using a role with appropriate permissions.
- Publish the table metadata to the Amazon DataZone business data catalog.
The following diagram illustrates this workflow.
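For readers who scripted the first two manual steps, the following is a minimal sketch of what that could look like with the AWS SDK for Python. The helper names, bucket, and role ARN are hypothetical. The helpers only build the request; the commented lines at the end show where the real AWS Glue `get_table` and Lake Formation `register_resource` calls would go.

```python
def table_s3_location(get_table_response: dict) -> str:
    """Step 1: read the S3 location from an AWS Glue get_table response."""
    return get_table_response["Table"]["StorageDescriptor"]["Location"]


def s3_uri_to_arn(s3_uri: str) -> str:
    """Convert s3://bucket/prefix/ into the ARN form Lake Formation expects."""
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {s3_uri}")
    return "arn:aws:s3:::" + s3_uri[len("s3://"):].rstrip("/")


def build_register_request(s3_uri: str, role_arn: str) -> dict:
    """Step 2: parameters for registering the location in hybrid access mode."""
    return {
        "ResourceArn": s3_uri_to_arn(s3_uri),
        "RoleArn": role_arn,            # role with access to the location
        "HybridAccessEnabled": True,    # keep existing IAM permissions working
    }


# Usage against a real account (requires boto3 and credentials):
#   import boto3
#   table = boto3.client("glue").get_table(DatabaseName="tickitdb", Name="sales")
#   request = build_register_request(table_s3_location(table), role_arn)
#   boto3.client("lakeformation").register_resource(**request)
```

With the new integration, Amazon DataZone performs this registration for you, so a script like this is no longer a prerequisite for publishing.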
With Amazon DataZone’s integration with Lake Formation hybrid access mode, you can simply publish your AWS Glue tables to Amazon DataZone and delegate the registration of the Amazon S3 location and the addition of an opt-in principal in Lake Formation to Amazon DataZone. The administrator of an AWS account can enable the data location registration setting under the DefaultDataLake blueprint on the Amazon DataZone console. Now, a data owner or publisher can publish their AWS Glue table (managed through IAM permissions) to Amazon DataZone without the extra setup steps. When a data consumer subscribes to this table, Amazon DataZone registers the Amazon S3 locations of the table in hybrid access mode, adds the data consumer’s IAM role as an opt-in principal, and grants access to the same IAM role by managing permissions on the table through Lake Formation. This makes sure that IAM permissions on the table can coexist with newly granted Lake Formation permissions, without disrupting any existing workflows. The following diagram illustrates this workflow.
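To make the opt-in and grant steps concrete, here is a hedged sketch of the equivalent Lake Formation requests. Amazon DataZone issues these on your behalf, and the exact permissions it grants are not stated in this post, so the `SELECT` and `DESCRIBE` list below is an assumption. The helper names and ARNs are hypothetical.

```python
def table_resource(database: str, table: str) -> dict:
    """Lake Formation resource descriptor for a single Data Catalog table."""
    return {"Table": {"DatabaseName": database, "Name": table}}


def build_opt_in_request(role_arn: str, database: str, table: str) -> dict:
    """Parameters that add the consumer role as an opt-in principal."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": table_resource(database, table),
    }


def build_grant_request(role_arn: str, database: str, table: str,
                        permissions=("SELECT", "DESCRIBE")) -> dict:
    """Parameters that grant Lake Formation permissions to the same role.

    The default permission list is an assumption made for illustration.
    """
    request = build_opt_in_request(role_arn, database, table)
    request["Permissions"] = list(permissions)
    return request


# Usage against a real account (requires boto3 and credentials):
#   import boto3
#   lf = boto3.client("lakeformation")
#   lf.create_lake_formation_opt_in(**build_opt_in_request(role, "tickitdb", "sales"))
#   lf.grant_permissions(**build_grant_request(role, "tickitdb", "sales"))
```

Only the opted-in role is affected by the grant. Principals that were not opted in continue to reach the table through their IAM policies.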
Solution overview
To demonstrate this new capability, we use a sample customer scenario where the finance team wants to access data owned by the sales team for financial analysis and reporting. The sales team has a pipeline that creates a dataset containing valuable information about ticket sales, popular events, venues, and seasons. We call it the tickit dataset. The sales team stores this dataset in Amazon S3 and registers it in a database in the Data Catalog. Access to this table is currently managed through IAM-based permissions. However, the sales team wants to publish this table to Amazon DataZone to facilitate secure and governed data sharing with the finance team.
The steps to configure this solution are as follows:
- The Amazon DataZone administrator enables the data lake location registration setting in Amazon DataZone to automatically register the Amazon S3 location of the AWS Glue tables in Lake Formation hybrid access mode.
- After the hybrid access mode integration is enabled in Amazon DataZone, the finance team requests a subscription to the sales data asset. The asset shows up as a managed asset, which means Amazon DataZone can manage access to this asset even if the Amazon S3 location of this asset isn’t registered in Lake Formation.
- The sales team is notified of a subscription request raised by the finance team. They review and approve the access request. After the request is approved, Amazon DataZone fulfills the subscription request by managing permissions in Lake Formation. It registers the Amazon S3 location of the subscribed table in Lake Formation hybrid access mode.
- The finance team gains access to the sales dataset required for their financial reports. They can go to their Amazon DataZone environment and start running queries using Athena against their subscribed dataset.
Prerequisites
To follow the steps in this post, you need an AWS account. If you don’t have an account, you can create one. In addition, you must have the following resources configured in your account:
- An S3 bucket
- An AWS Glue database and crawler
- IAM roles for different personas and services
- An Amazon DataZone domain and project
- An Amazon DataZone environment profile and environment
- An Amazon DataZone data source
If you don’t have these resources already configured, you can create them by deploying the following AWS CloudFormation stack:
- Choose Launch Stack to deploy a CloudFormation template.
- Complete the steps to deploy the template and leave all settings as default.
- Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.
After the CloudFormation deployment is complete, you can log in to the Amazon DataZone portal and manually trigger a data source run. This pulls any new or modified metadata from the source and updates the associated assets in the inventory. This data source has been configured to automatically publish the data assets to the catalog.
- On the Amazon DataZone console, choose View domains.
Make sure you are logged in with the same role that you used to deploy the CloudFormation stack, and verify that you are in the same AWS Region.
- Find the domain blog_dz_domain, then choose Open data portal.
- Choose Browse all projects and choose Sales producer project.
- On the Data tab, choose Data sources in the navigation pane.
- Locate and choose the data source that you want to run.
This opens the data source details page.
- Choose the options menu (three vertical dots) next to tickit_datasource and choose Run.
The data source status changes to Running as Amazon DataZone updates the asset metadata.
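If you prefer to trigger the run programmatically rather than through the portal, the following sketch shows one way to do it with the Amazon DataZone API. The function names are hypothetical, and the domain and project identifiers are values you would look up in your own account. For brevity, only the first page of `list_data_sources` results is inspected.

```python
def find_data_source_id(items: list, name: str) -> str:
    """Pick the data source ID with the given name from list_data_sources items."""
    for item in items:
        if item["name"] == name:
            return item["dataSourceId"]
    raise LookupError(f"data source not found: {name}")


def run_data_source(client, domain_id: str, project_id: str, name: str) -> dict:
    """Look up a data source by name and start a run for it.

    `client` is expected to be boto3.client("datazone"); pagination of
    list_data_sources is not handled in this sketch.
    """
    page = client.list_data_sources(
        domainIdentifier=domain_id, projectIdentifier=project_id
    )
    data_source_id = find_data_source_id(page["items"], name)
    return client.start_data_source_run(
        domainIdentifier=domain_id, dataSourceIdentifier=data_source_id
    )


# Usage against a real account (requires boto3 and credentials):
#   import boto3
#   run = run_data_source(boto3.client("datazone"),
#                         "<domain-id>", "<project-id>", "tickit_datasource")
```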
Enable hybrid mode integration in Amazon DataZone
In this step, the Amazon DataZone administrator goes through the process of enabling the Amazon DataZone integration with Lake Formation hybrid access mode. Complete the following steps:
- On a separate browser tab, open the Amazon DataZone console.
Verify that you are in the same Region where you deployed the CloudFormation template.
- Choose View domains.
- Choose the domain created by AWS CloudFormation, blog_dz_domain.
- Scroll down on the domain details page and choose the Blueprints tab.
A blueprint defines what AWS tools and services can be used with the data assets published in Amazon DataZone. The DefaultDataLake blueprint is enabled as part of the CloudFormation stack deployment. This blueprint enables you to create and query AWS Glue tables using Athena. For the steps to enable this in your own deployments, refer to Enable built-in blueprints in the AWS account that owns the Amazon DataZone domain.
- Choose the DefaultDataLake blueprint.
- On the Provisioning tab, choose Edit.
- Select Enable Amazon DataZone to register S3 locations using AWS Lake Formation hybrid access mode.
You have the option of excluding specific Amazon S3 locations if you don’t want Amazon DataZone to automatically register them in Lake Formation hybrid access mode.
- Choose Save changes.
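The console setting above appears to correspond to the `provisioningConfigurations` parameter of the Amazon DataZone `PutEnvironmentBlueprintConfiguration` API. That mapping is our assumption, not something this post confirms. The sketch below builds such a request; the function name, role ARN, and excluded location are hypothetical, and because a put call likely overwrites the stored configuration, you would probably also need to pass any existing values (such as the provisioning and manage-access role ARNs) that you want to keep.

```python
def build_hybrid_mode_config(domain_id: str, blueprint_id: str, regions: list,
                             registration_role_arn: str,
                             excluded_s3_locations=()) -> dict:
    """Build parameters that turn on S3 location registration for a blueprint."""
    lake_formation = {"locationRegistrationRole": registration_role_arn}
    if excluded_s3_locations:
        # Locations Amazon DataZone should not register automatically.
        lake_formation["locationRegistrationExcludeS3Locations"] = list(
            excluded_s3_locations
        )
    return {
        "domainIdentifier": domain_id,
        "environmentBlueprintIdentifier": blueprint_id,
        "enabledRegions": list(regions),
        "provisioningConfigurations": [
            {"lakeFormationConfiguration": lake_formation}
        ],
    }


# Usage against a real account (requires boto3 and credentials):
#   import boto3
#   params = build_hybrid_mode_config("<domain-id>", "<blueprint-id>",
#                                     ["us-east-1"], "<registration-role-arn>")
#   boto3.client("datazone").put_environment_blueprint_configuration(**params)
```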
Request access
In this step, you log in to Amazon DataZone as the finance team, search for the sales data asset, and subscribe to it. Complete the following steps:
- Return to your Amazon DataZone data portal browser tab.
- Switch to the finance consumer project by choosing the dropdown menu next to the project name and choosing Finance consumer project.
From this step onwards, you take on the persona of a finance user looking to subscribe to a data asset published in the previous step.
- In the search bar, search for and choose the sales data asset.
- Choose Subscribe.
The asset shows up as a managed asset. This means that Amazon DataZone can grant access to this data asset to the finance team’s project by managing the permissions in Lake Formation.
- Enter a reason for the access request and choose Subscribe.
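The same subscription request can be raised through the Amazon DataZone `CreateSubscriptionRequest` API. The following is a minimal sketch with a hypothetical helper name; the listing and project identifiers are placeholders you would retrieve from your own domain.

```python
def build_subscription_request(domain_id: str, listing_id: str,
                               consumer_project_id: str, reason: str) -> dict:
    """Parameters for requesting access to a published asset listing."""
    if not reason.strip():
        raise ValueError("a reason for the access request is required")
    return {
        "domainIdentifier": domain_id,
        "requestReason": reason,
        # The published asset (listing) being requested.
        "subscribedListings": [{"identifier": listing_id}],
        # The project that will receive access, here the finance project.
        "subscribedPrincipals": [{"project": {"identifier": consumer_project_id}}],
    }


# Usage against a real account (requires boto3 and credentials):
#   import boto3
#   params = build_subscription_request("<domain-id>", "<listing-id>",
#                                       "<finance-project-id>",
#                                       "Needed for financial reporting")
#   boto3.client("datazone").create_subscription_request(**params)
```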
Approve access request
The sales team gets a notification that an access request from the finance team has been submitted. To approve the request, complete the following steps:
- Choose the dropdown menu next to the project name and choose Sales producer project.
You now assume the persona of the sales team, who are the owners and stewards of the sales data assets.
- Choose the notification icon at the top-right corner of the DataZone portal.
- Choose the Subscription Request Created task.
- Grant access to the sales data asset to the finance team and choose Approve.
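Approval can also be automated. The sketch below lists pending requests for the producer project and accepts each one through the Amazon DataZone API. The function name is hypothetical, only the first page of results is handled, and blanket approval is shown purely for illustration; in practice you would review each request first.

```python
def approve_pending_requests(client, domain_id: str, approver_project_id: str,
                             comment: str) -> list:
    """Accept every pending subscription request for the approver project.

    `client` is expected to be boto3.client("datazone"). Returns the IDs of
    the requests that were accepted.
    """
    page = client.list_subscription_requests(
        domainIdentifier=domain_id,
        approverProjectId=approver_project_id,
        status="PENDING",
    )
    accepted = []
    for request in page["items"]:
        client.accept_subscription_request(
            domainIdentifier=domain_id,
            identifier=request["id"],
            decisionComment=comment,
        )
        accepted.append(request["id"])
    return accepted


# Usage against a real account (requires boto3 and credentials):
#   import boto3
#   approve_pending_requests(boto3.client("datazone"), "<domain-id>",
#                            "<sales-project-id>", "Approved for reporting")
```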
Analyze the data
The finance team has now been granted access to the sales data, and this dataset has been added to their Amazon DataZone environment. They can access the environment and query the sales dataset with Athena, along with any other datasets they currently own. Complete the following steps:
- On the dropdown menu, choose Finance consumer project.
On the right pane of the project overview screen, you can find a list of active environments available for use.
- Choose the Amazon DataZone environment finance_dz_environment.
- In the navigation pane, under Data assets, choose Subscribed.
- Verify that your environment now has access to the sales data.
It may take a few minutes for the data asset to be automatically added to your environment.
- Choose the new tab icon for Query data.
A new tab opens with the Athena query editor.
- For Database, choose finance_consumer_db_tickitdb-<suffix>.
This database will contain your subscribed data assets.
- Generate a preview of the sales table by choosing the options menu (three vertical dots) and choosing Preview table.
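The table preview amounts to a short `SELECT` with a row limit. As a rough sketch, the following helpers build such a query and the matching Athena `StartQueryExecution` parameters. The helper names and the workgroup value are hypothetical, and `<suffix>` stands for the suffix generated in your own environment.

```python
def preview_sql(database: str, table: str, limit: int = 10) -> str:
    """Build a preview query with quoted identifiers and a row limit."""
    for identifier in (database, table):
        if '"' in identifier:
            raise ValueError(f"unsupported identifier: {identifier}")
    return f'SELECT * FROM "{database}"."{table}" LIMIT {int(limit)}'


def build_athena_request(database: str, table: str, workgroup: str) -> dict:
    """Parameters for running the preview query in a given Athena workgroup."""
    return {
        "QueryString": preview_sql(database, table),
        "QueryExecutionContext": {"Database": database},
        "WorkGroup": workgroup,  # the workgroup of your DataZone environment
    }


# Usage against a real account (requires boto3 and credentials):
#   import boto3
#   params = build_athena_request("finance_consumer_db_tickitdb-<suffix>",
#                                 "sales", "<environment-workgroup>")
#   boto3.client("athena").start_query_execution(**params)
```

The double quotes around the database name matter here because the generated name contains a hyphen.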
Clean up
To clean up your resources, complete the following steps:
- Switch back to the administrator role you used to deploy the CloudFormation stack.
- On the Amazon DataZone console, delete the projects used in this post. This will delete most project-related objects like data assets and environments.
- On the AWS CloudFormation console, delete the stack you deployed at the beginning of this post.
- On the Amazon S3 console, delete the S3 buckets containing the tickit dataset.
- On the Lake Formation console, delete the Lake Formation admins registered by Amazon DataZone.
- On the Lake Formation console, delete tables and databases created by Amazon DataZone.
Conclusion
In this post, we discussed how the integration between Amazon DataZone and Lake Formation hybrid access mode simplifies the process to start using Amazon DataZone for end-to-end governance of your data in the AWS Glue Data Catalog. This integration helps you bypass the manual steps of onboarding to Lake Formation before you can start using Amazon DataZone.
For more information on how to get started with Amazon DataZone, refer to the Getting started guide. Check out the YouTube playlist for some of the latest demos of Amazon DataZone and short descriptions of the capabilities available. For more information about Amazon DataZone, see How Amazon DataZone helps customers find value in oceans of data.
About the Authors
Utkarsh Mittal is a Senior Technical Product Manager for Amazon DataZone at AWS. He is passionate about building innovative products that simplify customers’ end-to-end analytics journeys. Outside of the tech world, Utkarsh loves to play music, with drums being his latest endeavor.
Praveen Kumar is a Principal Analytics Solution Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-centered services. His areas of interest are serverless technology, modern cloud data warehouses, streaming, and generative AI applications.
Paul Villena is a Senior Analytics Solutions Architect at AWS with expertise in building modern data and analytics solutions to drive business value. He works with customers to help them harness the power of the cloud. His areas of interest are infrastructure as code, serverless technologies, and coding in Python.