Thursday, July 4, 2024

How VMware Tanzu CloudHealth migrated from self-managed Kafka to Amazon MSK

This is a post co-written with Rivlin Pereira and Vaibhav Pandey from Tanzu CloudHealth (VMware by Broadcom).

VMware Tanzu CloudHealth is the cloud cost management platform of choice for more than 20,000 organizations worldwide, who rely on it to optimize and govern their largest and most complex multi-cloud environments. In this post, we discuss how the VMware Tanzu CloudHealth DevOps team migrated their self-managed Apache Kafka workloads (running version 2.0) to Amazon Managed Streaming for Apache Kafka (Amazon MSK) running version 2.6.2. We discuss the system architectures, deployment pipelines, topic creation, observability, access control, topic migration, and all the issues we faced with the existing infrastructure, along with how and why we migrated to the new Kafka setup and some lessons learned.

Kafka cluster overview

In the fast-evolving landscape of distributed systems, VMware Tanzu CloudHealth's next-generation microservices platform relies on Kafka as its messaging backbone. For us, Kafka's high-performance distributed log system excels at handling massive data streams, making it indispensable for seamless communication. Serving as a distributed log system, Kafka efficiently captures and stores diverse logs, from HTTP server access logs to security event audit logs.

Kafka's versatility shines in supporting key messaging patterns, treating messages as basic logs or structured key-value stores. Dynamic partitioning and consistent ordering ensure efficient message organization. Kafka's unwavering reliability aligns with our commitment to data integrity.

The integration of Ruby services with Kafka is streamlined through the Karafka library, which acts as a higher-level wrapper. Our services in other language stacks use similar wrappers. Kafka's robust debugging features and administrative commands play a pivotal role in ensuring smooth operations and infrastructure health.

Kafka as an architectural pillar

In VMware Tanzu CloudHealth's next-generation microservices platform, Kafka emerges as a critical architectural pillar. Its ability to handle high data rates, support diverse messaging patterns, and guarantee message delivery aligns seamlessly with our operational needs. As we continue to innovate and scale, Kafka remains a steadfast partner, enabling us to build a resilient and efficient infrastructure.

Why we migrated to Amazon MSK

For us, migrating to Amazon MSK came down to three key decision points:

  • Simplified technical operations – Running Kafka on self-managed infrastructure was operational overhead for us. We hadn't updated Kafka from version 2.0.0 in a while, and Kafka brokers were going down in production, causing issues with topics going offline. We also had to run scripts manually to increase replication factors and rebalance leaders, which was additional manual effort.
  • Deprecated legacy pipelines and simplified permissions – We were looking to move away from our existing pipelines written in Ansible for creating Kafka topics on the cluster. We also had a cumbersome process for giving team members access to Kafka machines in staging and production, and we wanted to simplify this.
  • Cost, patching, and support – Because Apache Zookeeper is fully managed and patched by AWS, moving to Amazon MSK was going to save us time and money. In addition, we discovered that running the same type of brokers on Amazon MSK was cheaper than running them ourselves on Amazon Elastic Compute Cloud (Amazon EC2). Combined with the fact that AWS applies security patches to the brokers for us, migrating to Amazon MSK was an easy decision. This also meant the team was freed up to work on other important things. Finally, getting enterprise support from AWS was also critical in our final decision to move to a managed solution.

How we migrated to Amazon MSK

With the key drivers identified, we moved ahead with a proposed design to migrate our existing self-managed Kafka to Amazon MSK. We performed the following pre-migration steps before the actual implementation:

  • Assessment:
    • Conducted a meticulous assessment of the existing EC2 Kafka cluster, understanding its configurations and dependencies
    • Verified Kafka version compatibility with Amazon MSK
  • Amazon MSK setup with Terraform (see the sketch after this list)
  • Network configuration:
    • Ensured seamless network connectivity between the EC2 Kafka and MSK clusters, fine-tuning security groups and firewall settings
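
The post doesn't include the team's actual modules, so the following is only a minimal sketch of what an MSK cluster definition in Terraform can look like; the cluster name, instance size, broker count, and variable names are all illustrative placeholders:

    # Minimal Amazon MSK cluster definition in Terraform (illustrative
    # values only, not the team's actual configuration).
    resource "aws_msk_cluster" "this" {
      cluster_name           = "cloudhealth-kafka"   # hypothetical name
      kafka_version          = "2.6.2"
      number_of_broker_nodes = 3                     # a multiple of the AZ count

      broker_node_group_info {
        instance_type   = "kafka.m5.large"           # placeholder size
        client_subnets  = var.private_subnet_ids     # assumed input variable
        security_groups = [var.kafka_sg_id]          # assumed input variable

        storage_info {
          ebs_storage_info {
            volume_size = 1000                       # GiB per broker (placeholder)
          }
        }
      }
    }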

After the pre-migration steps, we implemented the following for the new design:

  • Automated deployment, upgrade, and topic creation pipelines for MSK clusters:
    • In the new setup, we wanted automated deployments and upgrades of the MSK clusters in a repeatable fashion using an IaC tool, so we created custom Terraform modules for MSK cluster deployments as well as upgrades. These modules were called from a Jenkins pipeline for the automated deployments and upgrades. For Kafka topic creation, we had been using an Ansible-based home-grown pipeline, which wasn't stable and led to a lot of complaints from dev teams. As a result, we evaluated options for deployments to Kubernetes clusters and used the Strimzi Topic Operator to create topics on the MSK clusters. Topic creation was automated using Jenkins pipelines, which dev teams could use to self-serve (see the first sketch after this list).
  • Better observability for clusters:
    • The old Kafka clusters didn't have good observability; we only had alerts on Kafka broker disk size. With Amazon MSK, we took advantage of open monitoring with Prometheus. We stood up a standalone Prometheus server that scraped metrics from the MSK clusters and sent them to our internal observability tool. Thanks to the improved observability, we were able to set up robust alerting for Amazon MSK, which wasn't possible with our old setup (the second sketch after this list shows the relevant settings).
  • Improved COGS and better compute infrastructure:
    • For our old Kafka infrastructure, we had to pay for managing Kafka and Zookeeper instances, plus any additional broker storage and data transfer costs. With the move to Amazon MSK, because Zookeeper is fully managed by AWS, we only pay for Kafka nodes, broker storage, and data transfer. As a result, in the final Amazon MSK setup for production, we saved not only on infrastructure costs but also on operational costs.
  • Simplified operations and enhanced security:
    • With the move to Amazon MSK, we didn't have to manage any Zookeeper instances, and broker security patching was taken care of by AWS for us.
    • Cluster upgrades became simpler with the move to Amazon MSK; it's a straightforward process to initiate from the Amazon MSK console.
    • With Amazon MSK, we got automatic broker storage scaling out of the box. As a result, we didn't have to worry about brokers running out of disk space, which added stability to the MSK cluster.
    • We also got additional security for the cluster because Amazon MSK supports encryption at rest by default, and various options for encryption in transit are available. For more information, refer to Data protection in Amazon Managed Streaming for Apache Kafka.
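
For illustration, a topic under the Strimzi Topic Operator is defined as a KafkaTopic custom resource. The team applied these through Jenkins pipelines; to stay in the same Terraform notation used elsewhere in this post, the sketch below expresses one via the kubernetes_manifest resource, with the topic name, namespace, cluster label, and settings all hypothetical:

    # A Strimzi KafkaTopic custom resource, applied here with Terraform's
    # kubernetes_manifest resource (illustrative values throughout).
    resource "kubernetes_manifest" "billing_events_topic" {
      manifest = {
        apiVersion = "kafka.strimzi.io/v1beta2"
        kind       = "KafkaTopic"
        metadata = {
          name      = "billing-events"              # hypothetical topic name
          namespace = "kafka"                       # namespace the operator watches
          labels = {
            "strimzi.io/cluster" = "msk-cluster"    # operator's cluster label
          }
        }
        spec = {
          partitions = 12
          replicas   = 3
          config = {
            "retention.ms" = "604800000"            # 7-day retention
          }
        }
      }
    }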

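Likewise, open monitoring and encryption are configured on the cluster resource itself. A minimal sketch of the relevant blocks, assuming they are added inside the aws_msk_cluster resource shown earlier (the KMS key variable is a placeholder):

    # Prometheus open monitoring and encryption settings, added inside
    # the aws_msk_cluster resource (sketch with placeholder inputs).
    open_monitoring {
      prometheus {
        jmx_exporter {
          enabled_in_broker = true   # exposes broker JMX metrics for scraping
        }
        node_exporter {
          enabled_in_broker = true   # exposes host-level metrics for scraping
        }
      }
    }

    encryption_info {
      encryption_at_rest_kms_key_arn = var.kms_key_arn   # assumed input variable
      encryption_in_transit {
        client_broker = "TLS"    # TLS-only client-broker traffic
        in_cluster    = true     # encrypt traffic between brokers
      }
    }
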
During our pre-migration steps, we validated the setup in the staging environment before moving ahead with production.

Kafka topic migration strategy

With the MSK cluster setup complete, we performed a data migration of Kafka topics from the old cluster running on Amazon EC2 to the new MSK cluster. To achieve this, we performed the following steps:

  • Set up MirrorMaker with Terraform – We used Terraform to orchestrate the deployment of a 15-node MirrorMaker cluster, which demonstrated scalability and flexibility: the number of nodes could be adjusted to the migration's concurrent replication needs (see the sketch after this list).
  • Implement a concurrent replication strategy – We implemented a concurrent replication strategy with 15 MirrorMaker nodes to expedite the migration process. Our Terraform-driven approach contributed to cost optimization by efficiently managing resources during the migration and ensured the reliability and consistency of the MSK and MirrorMaker clusters. It also showcased how the chosen setup accelerates data transfer, optimizing both time and resources.
  • Migrate data – We successfully migrated 2 TB of data in a remarkably short timeframe, minimizing downtime and showcasing the efficiency of the concurrent replication strategy.
  • Set up post-migration monitoring – We implemented robust monitoring and alerting during the migration, which contributed to a smooth process by identifying and addressing issues promptly.
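
The post doesn't share the MirrorMaker module itself. As a sketch of the idea, a Terraform count on an EC2 instance resource is one way to size such a fleet; the AMI, instance type, bootstrap template, and variables below are all assumptions:

    # Illustrative MirrorMaker fleet sized by a single count (not the
    # team's actual module; all inputs are placeholders).
    resource "aws_instance" "mirror_maker" {
      count                  = 15                    # concurrent replication nodes
      ami                    = var.kafka_ami_id      # assumed AMI with Kafka tooling
      instance_type          = "m5.xlarge"           # placeholder size
      subnet_id              = var.private_subnet_ids[count.index % length(var.private_subnet_ids)]
      vpc_security_group_ids = [var.kafka_sg_id]

      # Hypothetical bootstrap: start MirrorMaker replicating from the
      # EC2 Kafka cluster to the MSK cluster.
      user_data = templatefile("${path.module}/mirror-maker.sh.tpl", {
        source_bootstrap = var.ec2_kafka_bootstrap
        target_bootstrap = var.msk_bootstrap
      })

      tags = {
        Name = "mirror-maker-${count.index}"
      }
    }

With a layout like this, scaling the fleet up or down for concurrent replication is a one-line change to the count followed by a re-apply.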

The following diagram illustrates the architecture after the topic migration was complete.
(Diagram: MirrorMaker setup)

Challenges and lessons learned

Embarking on a migration journey, especially with large datasets, is often accompanied by unforeseen challenges. In this section, we delve into the challenges encountered during the migration of topics from EC2 Kafka to Amazon MSK using MirrorMaker, and share valuable insights and solutions that shaped the success of our migration.

Challenge 1: Offset discrepancies

One of the challenges we encountered was a mismatch in topic offsets between the source and destination clusters, even with offset synchronization enabled in MirrorMaker. The lesson learned here was that offset values don't necessarily have to be identical; as long as offset sync is enabled, consumers resume from the correct position in each topic.

We addressed this problem by using a custom tool to run tests on consumer groups, confirming that the translated offsets were either smaller or had caught up, indicating they were in sync as far as MirrorMaker was concerned.

Challenge 2: Slow data migration

The migration process hit a bottleneck: data transfer was slower than anticipated, especially with a substantial 2 TB dataset. Despite a 20-node MirrorMaker cluster, the speed was insufficient.

To overcome this, the team strategically grouped MirrorMaker nodes by unique port numbers. Groups of five MirrorMaker nodes, each with a distinct port, significantly boosted throughput, allowing us to migrate the data within hours instead of days.

Challenge 3: Lack of detailed process documentation

Navigating the uncharted territory of migrating large datasets using MirrorMaker highlighted the absence of detailed documentation for such scenarios.

Through trial and error, the team crafted an IaC module using Terraform. This module streamlined the entire cluster creation process with optimized settings, enabling a seamless start to the migration within minutes, as sketched below.
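
Under those assumptions, consuming such a module reduces to a short, repeatable call; the module path, name, and inputs below are hypothetical:

    # Hypothetical invocation of the home-grown IaC module: one call
    # stands up a tuned MirrorMaker fleet for a migration run.
    module "mirror_maker_fleet" {
      source           = "./modules/mirror-maker"   # assumed local module path
      node_count       = 15
      source_bootstrap = var.ec2_kafka_bootstrap
      target_bootstrap = var.msk_bootstrap
      # Tuned replication settings found by trial and error would be
      # exposed as additional module inputs here.
    }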

Final setup and next steps

Thanks to the move to Amazon MSK, our final setup after topic migration looked like the following diagram.
(Diagram: final architecture after the migration to Amazon MSK)
We are also considering future enhancements to this setup.

Conclusion

In this post, we discussed how VMware Tanzu CloudHealth migrated its existing Amazon EC2-based Kafka infrastructure to Amazon MSK. We walked you through the new architecture, the deployment and topic creation pipelines, the improvements to observability and access control, the topic migration challenges, and the issues we faced with the old infrastructure, along with how and why we migrated to the new Amazon MSK setup. We also covered the advantages Amazon MSK gave us, the final architecture we achieved with this migration, and the lessons learned.

For us, the interplay of offset synchronization, strategic node grouping, and infrastructure as code proved pivotal in overcoming obstacles and ensuring a successful migration from Amazon EC2 Kafka to Amazon MSK. This post serves as a testament to the power of adaptability and innovation in meeting migration challenges, offering insights for others navigating a similar path.

If you're running self-managed Kafka on AWS, we encourage you to try the managed Kafka offering, Amazon MSK.


About the Authors

Rivlin Pereira is a Staff DevOps Engineer in the VMware Tanzu Division. He is very passionate about Kubernetes and works on the CloudHealth platform, building and operating cloud solutions that are scalable, reliable, and cost effective.

Vaibhav Pandey, a Staff Software Engineer at Broadcom, is a key contributor to the development of cloud computing solutions. Specializing in architecting and engineering data storage layers, he is passionate about building and scaling SaaS applications for optimal performance.

Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data, analytics, and AI/ML at Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He has helped customers in various industry verticals, including healthcare, medical devices, life sciences, retail, asset management, automotive insurance, residential REITs, agriculture, title insurance, supply chain, document management, and real estate.

Todd McGrath is a data streaming specialist at Amazon Web Services, where he advises customers on their streaming strategies, integration, architecture, and solutions. On the personal side, he enjoys watching and supporting his three kids in their preferred activities as well as following his own pursuits such as fishing, pickleball, ice hockey, and happy hour with family and friends on pontoon boats. Connect with him on LinkedIn.

Satya Pattanaik is a Sr. Solutions Architect at AWS. He has been helping ISVs build scalable and resilient applications on the AWS Cloud. Prior to joining AWS, he played a significant role in enterprise segments, driving their growth and success. Outside of work, he spends time learning how to cook a flavorful BBQ and trying out new recipes.
