Apache Airflow is one of the world's most popular open source tools for building and managing data pipelines, with around 16 million downloads per month. Those users will find several compelling new features that help them move data quickly and accurately in version 2.8, which was released Monday by the Apache Software Foundation.
Apache Airflow was originally created at Airbnb in 2014 as a workflow management platform for data engineering. Since becoming a top-level project at the Apache Software Foundation in 2019, it has emerged as a core part of a stack of open source data tools, alongside projects like Apache Spark, Ray, dbt, and Apache Kafka.
The project's strongest asset is its flexibility: it lets Python developers create data pipelines as directed acyclic graphs (DAGs) that accomplish a wide variety of tasks across 1,500 data sources and sinks. However, all that flexibility often comes at the cost of increased complexity. Configuring new data pipelines previously required developers to have a degree of familiarity with the product, and to know, for example, exactly which operators to use to accomplish a specific task.
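For readers unfamiliar with the model, a pipeline is declared as ordinary Python code. The sketch below uses Airflow's TaskFlow API; the DAG name, tasks, and schedule are illustrative, and running it requires an Airflow installation:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def example_pipeline():
    # Tasks are plain Python functions; passing the return value of
    # extract() into load() is what creates the DAG edge between them.
    @task
    def extract():
        return [1, 2, 3]

    @task
    def load(rows):
        print(f"loaded {len(rows)} rows")

    load(extract())


example_pipeline()
```

Knowing which prebuilt operator to reach for inside such a DAG is where the complexity has historically crept in.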
With version 2.8, data pipeline connections to object stores become much simpler to build thanks to the new Airflow ObjectStore, which implements an abstraction layer atop the DAGs. Julian LaNeve, CTO of Astronomer, the commercial entity behind the open source project, explains:
“Before 2.8, if you wanted to write a file to S3 versus Azure Blob Storage versus your local file disk, you were using different providers in Airflow, specific integrations, and that meant that the code looks different,” LaNeve says. “That wasn't the right level of abstraction. This ObjectStore is starting to change that.
“Instead of writing custom code to go interact with AWS S3 or GCS or Microsoft Azure Blob Storage, the code looks the same,” he continues. “You import this ObjectStorage module that's given to you by Airflow, and you can treat it like a normal file. So you can copy it places, you can list files and directories, you can write to it, and you can read from it.”
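In code, that looks roughly like the following, a sketch based on the `ObjectStoragePath` interface that ships with Airflow 2.8. The bucket, file name, and connection ID are placeholders, and the same calls would apply to a `gs://`, `azure://`, or `file://` URL:

```python
from airflow.io.path import ObjectStoragePath

# One class covers S3, GCS, Azure Blob Storage, local files, etc.;
# only the URL scheme (and the provider package installed) changes.
base = ObjectStoragePath("s3://my-bucket/reports/", conn_id="aws_default")

# Treat the remote object like a normal file.
report = base / "daily.csv"
with report.open("w") as f:
    f.write("date,total\n2024-01-01,42\n")

with report.open("r") as f:
    print(f.read())

# List "directory" contents much as you would with pathlib.
for obj in base.iterdir():
    print(obj)
```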
Airflow has never been especially opinionated about how developers should build their data pipelines, a product of its historic flexibility, LaNeve says. With the ObjectStore in 2.8, the product is starting to offer an easier path for building data pipelines, but without the added complexity.
“It also fixes this paradigm in Airflow that we call transfer operators,” LaNeve says. “So there's an operator, or prebuilt task, to take data from S3 to Snowflake. There's a separate one to take data from S3 to Redshift. There's a separate one to take data from GCS to Redshift. So you kind of have to know where Airflow does and where Airflow doesn't support these things, and you end up with this many-to-many pattern, where the number of transfer operators, or prebuilt tasks in Airflow, becomes very large because there's no abstraction to this.”
With the ObjectStore, you don't have to know the name of the specific operator you want to use or how to configure it. You simply tell Airflow that you want to move data from point A to point B, and the product will figure out how to do it. “It just makes that process much easier,” LaNeve says. “Adding this abstraction we think will help quite a bit.”
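Under that model, a transfer that previously needed a dedicated S3-to-GCS operator reduces to a copy between two paths. Again a sketch, with placeholder bucket names and connection IDs:

```python
from airflow.io.path import ObjectStoragePath

src = ObjectStoragePath("s3://my-bucket/raw/events.json",
                        conn_id="aws_default")
dst = ObjectStoragePath("gs://my-other-bucket/raw/events.json",
                        conn_id="google_cloud_default")

# One call replaces a scheme-specific transfer operator; Airflow
# selects the right storage backends from the URL schemes.
src.copy(dst)
```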
Airflow 2.8 also brings new features that heighten data awareness. Specifically, a new listener hook in Airflow allows users to get alerts or run custom code whenever a given dataset is updated or changed.
“For example, if an administrator wants to be alerted or notified every time your datasets are changing or the dependencies on them are changing, you can now set that up,” LaNeve tells Datanami. “You write one piece of custom code to send that alert to you, however you'd like it to, and Airflow can now run that code essentially every time those datasets change.”
The dependencies in data pipelines can get quite complex, and administrators can easily get overwhelmed trying to track them manually. With the automated alerts generated by the new listener hook in Airflow 2.8, admins can start to push back on that complexity by building data awareness into the product itself.
“One use case, for example, that we think will get a lot of use is: anytime a dataset has changed, send me a Slack message. That way, you build up a feed of who's modifying datasets and what those changes look like,” LaNeve says. “Some of our customers will run hundreds of deployments, tens of thousands of pipelines, so to understand all of those dependencies and make sure that you are aware of changes to the dependencies you care about can be quite complex. This makes it a lot easier to do.”
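A dataset listener is registered through an ordinary Airflow plugin. The sketch below shows the Slack-alert use case LaNeve describes, using the `on_dataset_changed` hook added in the 2.8 line; the plugin name and `send_slack_message` helper are hypothetical stand-ins for your own notifier:

```python
# listeners.py
from airflow.datasets import Dataset
from airflow.listeners import hookimpl


def send_slack_message(text: str) -> None:
    # Placeholder: wire up a Slack webhook or provider hook here.
    print(text)


@hookimpl
def on_dataset_changed(dataset: Dataset):
    # Called by Airflow whenever a task updates a dataset.
    send_slack_message(f"Dataset changed: {dataset.uri}")


# plugin.py
from airflow.plugins_manager import AirflowPlugin

import listeners


class DatasetAlertsPlugin(AirflowPlugin):
    name = "dataset_alerts"
    listeners = [listeners]
```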
The last of the big three new features in Airflow 2.8 is an enhancement to how the product generates and stores the logs used for debugging problems in data pipelines.
Airflow is itself a complicated piece of software that relies on a collection of six or seven underlying components, including a database, a scheduler, worker nodes, and more. That's one of the reasons that uptake of Astronomer's hosted SaaS version of Airflow, called Astro, has increased by 200% over the past 12 months (although the company still sells enterprise software that customers can install and run on-prem).
“Previously, each of those six or seven components would write logs to different locations,” LaNeve explains. “That means that, if you're running a task, you'll see those task logs that are specific to the worker, but sometimes that task will fail for reasons outside of that worker. Maybe something happened in the scheduler or the database.
“And so we've added the ability to forward the logs from those other components to your task,” he continues, “so that if your task fails, when you're debugging it, instead of six or seven different types of logs…you can now just go to one place and see everything that could be relevant.”
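This forwarding is governed by Airflow's configuration. The snippet below assumes the `enable_task_context_logger` option described alongside the 2.8 release; treat the exact option name as an assumption and verify it against the configuration reference for your version:

```ini
# airflow.cfg -- assumed option name; confirm against the 2.8 docs
[logging]
enable_task_context_logger = True
```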
These three features, and more, are generally available now in Airflow version 2.8. They're also available in Astro and the enterprise version of Airflow sold by Astronomer. For more information, check out this blog on Airflow 2.8 by Kenten Danas, Astronomer's manager of developer relations.
Related Items:
Airflow Available as a New Managed Service Called Astro
Apache Airflow to Power Google's New Workflow Service
8 New Big Data Projects To Watch