Thursday, November 21, 2024

Nexthink scales to trillions of events per day with Amazon MSK

Real-time data streaming and event processing present scalability and management challenges. AWS offers a broad selection of managed real-time data streaming services to effortlessly run these workloads at any scale.

In this post, Nexthink shares how Amazon Managed Streaming for Apache Kafka (Amazon MSK) empowered them to achieve massive scale in event processing. Experiencing business hyper-growth, Nexthink migrated to AWS to overcome the scaling limitations of on-premises solutions. With Amazon MSK, Nexthink now seamlessly processes trillions of events per day, reaching over 5 GB per second of aggregated throughput.

In the following sections, Nexthink introduces their product and the need for scalability. They then highlight the challenges of their legacy on-premises application and present their transition to a cloud-native software as a service (SaaS) architecture powered by Amazon MSK. Finally, Nexthink details the benefits achieved by adopting Amazon MSK.

Nexthink’s need to scale

Nexthink is the leader in digital employee experience (DeX). The company is shaping the future of work by providing IT leaders and C-levels with insights into employees’ daily technology experiences at the device and application level. This allows IT to evolve from reactive problem-solving to proactive optimization.

The Nexthink Infinity platform combines analytics, monitoring, automation, and more to manage the employee digital experience. By collecting device and application events, processing them in real time, and storing them, our platform analyzes data to solve problems and boost experiences for over 15 million employees across five continents.

In just 3 years, Nexthink’s business grew tenfold, and with the introduction of more real-time data our application had to scale from processing 200 MB per second to 5 GB per second and trillions of events daily. To enable this growth, we modernized our application from an on-premises single-tenant monolith to a cloud-based scalable SaaS solution powered by Amazon MSK.

The next sections detail our modernization journey, including the challenges we faced and the benefits we realized with our new cloud-native, AWS-based architecture.

The on-premises solution and its challenges

Let’s first explore our previous on-premises solution, Nexthink V6, before examining how Amazon MSK addressed its challenges. The following diagram illustrates its architecture.

Nexthink v6

V6 was made up of two monolithic, single-tenant Java and C++ applications that were tightly coupled. The portal was a backend-for-frontend Java application, and the core engine was an in-house C++ in-memory database application that also handled device connections, data ingestion, aggregation, and querying. Bundling all these functions together made the engine difficult to manage and improve.

V6 also lacked scalability. It initially supported 10,000 devices, but some new tenants had over 300,000. We reacted by deploying multiple V6 engines per tenant, which increased complexity and cost, hampered user experience, and delayed time to market. This also led to longer proof of concept and onboarding cycles, which hurt the business.

Furthermore, the absence of a streaming platform like Kafka created dependencies between teams through tight HTTP/gRPC coupling. Teams also couldn’t access real-time events before ingestion into the database, which limited feature development. We lacked a data buffer as well, risking data loss during outages. Such constraints impeded innovation and increased risks.

In summary, although the V6 system served its initial purpose, reinventing it with cloud-native technologies became imperative to enhance scalability and reliability and to foster innovation by our engineering and product teams.

Transitioning to a cloud-native architecture with Amazon MSK

To achieve our modernization goals, after thorough research and iterations, we implemented an event-driven microservices design on Amazon Elastic Kubernetes Service (Amazon EKS), using Kafka on Amazon MSK for distributed event storage and streaming.

Our transition from the V6 on-premises solution to the cloud-native platform was phased over four iterations:

  • Phase 1 – We lifted and shifted from on premises to virtual machines in the cloud, reducing operational complexities and accelerating proof of concept cycles while transparently migrating customers.
  • Phase 2 – We extended the cloud architecture by implementing new product features with microservices and self-managed Kafka on Kubernetes. However, operating Kafka clusters ourselves proved overly difficult, leading us to Phase 3.
  • Phase 3 – We switched from self-managed Kafka to Amazon MSK, improving stability and reducing operational costs. We realized that managing Kafka wasn’t our core competency or differentiator, and the overhead was high. Amazon MSK enabled us to focus on our core application, freeing us from the burden of undifferentiated Kafka management.
  • Phase 4 – Finally, we eliminated all legacy components, completing the transition to a fully cloud-native SaaS platform. This journey of learning and transformation took 3 years.

Today, after our successful transition, we use Amazon MSK for two key functions:

  • Real-time data ingestion and processing of trillions of daily events from over 15 million devices worldwide, as illustrated in the following figure.

Nexthink Architecture Ingestion

  • Enabling an event-driven system that decouples data producers and consumers, as depicted in the following figure.

Nexthink Architecture Event Driven

To further enhance our scalability and resilience, we adopted a cell-based architecture using the wide availability of Amazon MSK across AWS Regions. We currently operate over 10 cells, each representing an independent regional deployment of our SaaS solution. This cell-based approach minimizes the area of impact in case of issues, addresses data residency requirements, and enables horizontal scaling across AWS Regions, as illustrated in the following figure.

Nexthink Architecture Cells
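
To make the cell idea concrete, here is a minimal Python sketch of how a tenant could be pinned to a cell inside a required Region. It is an illustration only, not Nexthink’s implementation: the `Cell` type, the cell names, the Regions, and the `assign_cell` function are all hypothetical.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    name: str
    region: str


# Hypothetical cells; names and Regions are illustrative only.
CELLS = (
    Cell("cell-eu-1", "eu-central-1"),
    Cell("cell-us-1", "us-east-1"),
    Cell("cell-us-2", "us-east-1"),
)


def assign_cell(tenant_id: str, required_region: str, cells=CELLS) -> Cell:
    """Pick a cell for a tenant without ever leaving the required Region."""
    candidates = sorted(
        (c for c in cells if c.region == required_region),
        key=lambda c: c.name,
    )
    if not candidates:
        raise ValueError(f"no cell available in {required_region}")
    # CRC32 is stable across processes, unlike Python's built-in hash().
    index = zlib.crc32(tenant_id.encode("utf-8")) % len(candidates)
    return candidates[index]
```

Because a tenant maps to exactly one cell, an incident in one cell affects only the tenants assigned to it, and the Region filter keeps tenant data where residency rules require it.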

Benefits of Amazon MSK

Amazon MSK has been crucial in enabling our event-driven design. In this section, we outline the main benefits we gained from its adoption.

Improved data resilience

In our new architecture, data from devices is pushed directly to Kafka topics in Amazon MSK, which provides high availability and resilience. This makes sure that events can be safely received and stored at any time. Our services consuming this data inherit the same resilience from Amazon MSK. If our backend ingestion services face disruptions, no event is lost, because Kafka retains all published messages. When our services resume, they seamlessly continue processing from where they left off, thanks to Kafka’s delivery semantics, which allow processing messages exactly-once, at-least-once, or at-most-once based on application needs.
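
The resume behavior can be illustrated with a small in-memory Python simulation. This is a sketch under stated assumptions, not Kafka client code: `TopicLog` stands in for a single partition and `Consumer` for a service that commits its offset only after a record has been handled, which corresponds to at-least-once processing. Both class names are hypothetical.

```python
class TopicLog:
    """Append-only in-memory stand-in for a single Kafka partition."""

    def __init__(self):
        self.records = []

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # offset of the new record


class Consumer:
    """Commits an offset only after the handler succeeded (at-least-once)."""

    def __init__(self, log):
        self.log = log
        self.committed = 0  # next offset to process after a restart

    def run(self, handler):
        while self.committed < len(self.log.records):
            # If the handler raises, the offset is not committed, so the
            # same record is delivered again when the consumer resumes.
            handler(self.log.records[self.committed])
            self.committed += 1
```

If the handler fails on a record, `committed` stays at that record’s offset. Records appended to the log during the outage are retained, and a later call to `run` continues from the committed offset, so nothing is skipped.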

Amazon MSK enables us to tailor the data retention duration to our specific requirements, ranging from seconds to unlimited duration. This flexibility grants uninterrupted data availability to our application, which wasn’t possible with our previous architecture. Furthermore, to safeguard data integrity in the event of processing errors or corruption, Kafka enabled us to implement a data replay mechanism, ensuring data consistency and reliability.
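
As a rough illustration of how retention and replay interact, the following Python sketch models a log with time-based retention. It is a simplification with hypothetical names (`RetainedLog`, `expire`, `replay_from`): real Kafka expires whole log segments rather than individual records, and retention is set through topic configuration such as `retention.ms`, where `-1` means unlimited.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    offset: int
    timestamp_ms: int
    value: str


class RetainedLog:
    """In-memory log with time-based retention (None means unlimited)."""

    def __init__(self, retention_ms=None):
        self.retention_ms = retention_ms
        self.records = []
        self._next_offset = 0

    def append(self, timestamp_ms, value):
        self.records.append(Record(self._next_offset, timestamp_ms, value))
        self._next_offset += 1

    def expire(self, now_ms):
        """Drop records older than the retention window."""
        if self.retention_ms is None:
            return
        cutoff = now_ms - self.retention_ms
        self.records = [r for r in self.records if r.timestamp_ms >= cutoff]

    def replay_from(self, timestamp_ms):
        """Return every retained record at or after the given timestamp."""
        return [r for r in self.records if r.timestamp_ms >= timestamp_ms]
```

The sketch shows why the retention window bounds what can be replayed: a record that has expired is no longer available to reprocess, so the window has to be at least as long as the time needed to detect and correct a processing error.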

Organizational scaling

By adopting an event-driven architecture with Amazon MSK, we decomposed our monolithic application into loosely coupled, stateless microservices communicating asynchronously via Kafka topics. This approach enabled our engineering organization to scale rapidly from just 4–5 teams in 2019 to over 40 teams and approximately 350 engineers today.

The loose coupling between event publishers and subscribers empowered teams to focus on distinct domains, such as data ingestion, identification services, and data lakes. Teams could develop solutions independently within their domains, communicating through Kafka topics without tight coupling. This architecture accelerated feature development by minimizing the risk of new features impacting existing ones. Teams could efficiently consume events published by others, offering new capabilities more rapidly while reducing cross-team dependencies.

The following figure illustrates the seamless workflow of adding new domains to our system.

Adding domains

Furthermore, the event-driven design allowed teams to build stateless services that could seamlessly auto scale based on MSK metrics like messages per second. This event-driven scalability eliminated the need for extensive capacity planning and manual scaling efforts, freeing up development time.
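
A scaling rule driven by messages per second can be sketched in a few lines of Python. This is a hypothetical example, not the rule Nexthink uses: the function name, the per-replica capacity, and the replica bounds are all assumptions chosen for illustration.

```python
import math


def desired_replicas(messages_per_second, per_replica_capacity,
                     min_replicas=1, max_replicas=50):
    """Return how many service replicas are needed for the observed rate.

    per_replica_capacity is the assumed number of messages per second that
    one replica can process; the result is clamped to the given bounds.
    """
    if per_replica_capacity <= 0:
        raise ValueError("per_replica_capacity must be positive")
    needed = math.ceil(messages_per_second / per_replica_capacity)
    return max(min_replicas, min(max_replicas, needed))
```

For example, `desired_replicas(25_000, 10_000)` returns 3. When the replicas form a single Kafka consumer group, a sensible value for `max_replicas` is the topic’s partition count, because consumers beyond that number would sit idle.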

By using an event-driven microservices architecture on Amazon MSK, we achieved organizational agility, enhanced scalability, and accelerated innovation while minimizing operational overhead.

Seamless infrastructure scaling

Nexthink’s business grew tenfold in 3 years, and many new capabilities were added to the product, leading to a substantial increase in traffic from 200 MB per second to 5 GB per second. This exponential data growth was enabled by the robust scalability of Amazon MSK. Achieving such scale with an on-premises solution would have been challenging and expensive, if not infeasible.

Attempting to self-manage Kafka imposed unnecessary operational overhead without providing business value. Running it with just 5% of today’s traffic was already complex and required two engineers. At today’s volumes, we estimated needing 6–10 dedicated staff, increasing costs and diverting resources away from core priorities.
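
To put these figures in perspective, the short Python calculation below derives a few quantities from the numbers quoted in this post. It assumes decimal units (1 GB = 1,000 MB, 1 TB = 1,000,000 MB) and a sustained rate over a full day, which is a simplification; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

throughput_before_mb_s = 200    # MB per second before the migration
throughput_today_mb_s = 5_000   # 5 GB per second, in decimal megabytes

# Traffic growth factor between the two throughput figures.
growth_factor = throughput_today_mb_s / throughput_before_mb_s

# Daily volume if today's rate were sustained for 24 hours.
daily_volume_tb = throughput_today_mb_s * SECONDS_PER_DAY / 1_000_000

# Throughput handled when Kafka was self-managed (5% of today's traffic).
self_managed_mb_s = throughput_today_mb_s * 5 // 100

print(growth_factor, daily_volume_tb, self_managed_mb_s)
```

Under these assumptions, traffic grew 25-fold, a sustained 5 GB per second corresponds to 432 TB per day, and the self-managed clusters were handling about 250 MB per second when they already required two engineers.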

Real-time capabilities

By channeling all our data through Amazon MSK, we enabled real-time processing of events. This unlocked capabilities like real-time alerts, event-driven triggers, and webhooks that were previously unattainable. As such, Amazon MSK was instrumental in facilitating our event-driven architecture and powering impactful innovations.

Secure data access

Transitioning to our new architecture, we met our security and data integrity goals. With Kafka ACLs, we enforced strict access controls, allowing consumers and producers to interact only with authorized topics. We based these granular data access controls on criteria like data type, domain, and team.
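
The effect of topic-level ACLs can be modeled with a short Python sketch. It is a simplified illustration, not Kafka’s authorizer: it covers only allow rules with literal or prefixed topic patterns and denies everything else. The `Acl` type, the `is_allowed` function, and the principal and topic names in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Acl:
    principal: str          # for example "User:ingestion-service"
    operation: str          # "READ" or "WRITE"
    topic: str              # topic name, or a prefix when prefixed is True
    prefixed: bool = False


def is_allowed(acls, principal, operation, topic):
    """Return True only if some ACL grants this principal the operation."""
    for acl in acls:
        if acl.principal != principal or acl.operation != operation:
            continue
        if acl.prefixed and topic.startswith(acl.topic):
            return True
        if not acl.prefixed and topic == acl.topic:
            return True
    return False  # nothing matched: deny by default
```

With a rule such as `Acl("User:ingestion-service", "WRITE", "ingestion.", prefixed=True)`, the ingestion service may write to any topic whose name starts with `ingestion.`, while reads, other topics, and other principals are refused unless a separate rule grants them.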

To securely scale decentralized management of topics, we introduced proprietary Kubernetes Custom Resource Definitions (CRDs). These CRDs enabled teams to independently manage their own topics, settings, and ACLs without compromising security.

Amazon MSK encryption made sure that the data remained encrypted at rest and in transit. We also introduced a Bring Your Own Key (BYOK) option, allowing application-level encryption with customer keys for all single-tenant and multi-tenant topics.

Enhanced observability

Amazon MSK gave us great visibility into our data flows. The out-of-the-box Amazon CloudWatch metrics let us see the amount and types of data flowing through each topic and cluster. This helped us quantify the usage of our product features by tracking data volumes at the topic level. The Amazon MSK operational metrics enabled easy monitoring and right-sizing of clusters and brokers. Overall, the rich observability of Amazon MSK facilitated data-driven decisions about architecture and product features.

Conclusion

Nexthink’s journey from an on-premises monolith to a cloud SaaS was streamlined by using Amazon MSK, a fully managed Kafka service. Amazon MSK allowed us to scale seamlessly while benefiting from enterprise-grade reliability and security. By offloading Kafka management to AWS, we could stay focused on our core business and innovate faster.

Going forward, we plan to further improve performance, costs, and scalability by adopting Amazon MSK capabilities such as tiered storage and AWS Graviton-based EC2 instance types.

We’re also working closely with the Amazon MSK team to prepare for upcoming service features. Rapidly adopting new capabilities will help us remain at the forefront of innovation while continuing to grow our business.

To learn more about how Nexthink uses AWS to serve its global customer base, explore the Nexthink on AWS case study. Additionally, discover other customer success stories with Amazon MSK by visiting the Amazon MSK blog category.


About the Authors

Moe Haidar is a principal engineer and special projects lead in the CTO office of Nexthink. He has been involved with AWS since 2018 and is a key contributor to the cloud transformation of the Nexthink platform to AWS. His focus is on product and technology incubation and architecture, but he also loves doing hands-on activities to keep his knowledge of technologies sharp and up to date. He still contributes heavily to the code base and likes to tackle complex problems.
Simone Pomata is a Senior Solutions Architect at AWS. He has worked enthusiastically in the tech industry for more than 10 years. At AWS, he helps customers succeed in building new technologies every day.
Magdalena Gargas is a Solutions Architect passionate about technology and solving customer challenges. At AWS, she works mostly with software companies, helping them innovate in the cloud. She participates in industry events, sharing insights and contributing to the advancement of the containerization field.
