Cloud Native Data Pipelines
Several weeks ago, I had the opportunity to attend KubeCon + CloudNativeCon Europe 2019 in Barcelona. The first thing that became apparent was the sheer size of the event, almost to the extent that it was overwhelming! The number of start-ups involved and the buy-in from incumbents both strongly suggest that the cloud native ecosystem is here to stay. As Kubernetes enters its 5th year, we wanted to understand what exactly cloud native is, and how it will change the landscape of data pipelines for scientists and engineers.
What is Cloud Native?
A recent podcast with Joe Beda from Heptio provides a fantastic definition: cloud is essentially running API-driven, self-service, elastic systems that are managed by somebody else. Cloud native is simply the tools and techniques that are optimised to take advantage of the cloud. By being cloud native, organisations can move their software, processes and mindset to the cloud without even having to be in the public cloud itself. To date, Kubernetes has been the central platform in the cloud native movement. It provides abstractions, via an API, that allow resources to be deployed on any underlying infrastructure. This is a vision we share as we continue to drive for scalable and portable data pipelines with Nextflow. At its heart, the executor abstraction provided by Nextflow allows users to switch between local, cluster or cloud deployment with a single command line flag, in a somewhat cloud native way.
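As a sketch of what this executor abstraction looks like in practice, a single `nextflow.config` can define configuration profiles that retarget the same pipeline at different backends (the profile and queue names below are hypothetical examples, not part of any real deployment):

```groovy
// nextflow.config — hypothetical profiles switching execution backends
profiles {
  standard {
    process.executor = 'local'            // run tasks on the local machine
  }
  cluster {
    process.executor = 'slurm'            // submit tasks to a SLURM scheduler
    process.queue    = 'long'             // hypothetical queue name
  }
  awsbatch {
    process.executor = 'awsbatch'         // submit tasks as AWS Batch jobs
    process.queue    = 'my-batch-queue'   // hypothetical Batch queue
    workDir          = 's3://my-bucket/work' // shared work dir on S3
  }
}
```

The same pipeline can then be moved between environments with one flag, e.g. `nextflow run main.nf -profile awsbatch`, without touching the workflow logic itself.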
How does it apply to data pipelines?
We think this is a pivotal moment for HPC in the cloud. A willingness from users is being met with the realised promise of public cloud providers. Each of us now has access to the ideal definition of a supercomputer: an unlimited number of latest-gen cores, a multitude of storage types and ultra-fast networking. We now have access to bespoke accelerators such as the FPGAs used in Illumina’s DRAGEN pipelines, which were previously out of reach for the vast majority of scientists. But we still want to access these resources in ways that remind us of how we are used to working.
It’s all a matter of abstraction
If the definition of cloud is using systems managed by somebody else, then the level of the service lies along a spectrum of abstractions. From interacting with bare metal to submitting a job through Galaxy’s GUI, we all have a different idea of fun on a Saturday night. Running a Nextflow pipeline is a high-level abstraction with regards to data workflows: the staging of data between environments, the orchestration of tasks and the deployment across infrastructure are transparent and don’t need to be provisioned by the user. The earliest Nextflow deployments in the cloud used Apache Ignite to grow and shrink EC2 instances on demand. The development of AWS Batch and the Google Pipelines API removed the need to provision individual instances. Tasks can be submitted through the job APIs, and resources subsequently grow and shrink elastically with the workload. This approach is cloud native in principle and has become by far the most common deployment method among our customers. Yet other users wish to have full control of the underlying service, deploying their own Kubernetes cluster and using the Nextflow K8s executor to launch tasks as pods. No matter which approach, one thing we continue to see is the demand for flexibility in deploying data pipelines in the cloud.
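For the user-managed Kubernetes route, a minimal configuration sketch might look like the following, where Nextflow submits each task as a pod via the `k8s` executor (the namespace and claim name are hypothetical placeholders):

```groovy
// nextflow.config — hypothetical setup for a user-managed K8s cluster
process.executor = 'k8s'                 // launch each task as a Kubernetes pod
k8s {
  namespace        = 'nextflow'          // hypothetical namespace for pipeline pods
  storageClaimName = 'nf-workdir-pvc'    // hypothetical PVC backing the shared work dir
  storageMountPath = '/workspace'        // where the claim is mounted inside pods
}
```

With a configuration along these lines, pipelines can be launched in-cluster with the `nextflow kuberun` command, which runs the workflow driver itself as a pod alongside the task pods.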
Our take on the future
We don’t see this trend changing any time soon. As the life sciences face up to the challenges of exploding genomic and health records, data are being siloed in what feels like increasingly esoteric ways. Coupled with a tough regulatory environment, we see a demand for highly heterogeneous deployment solutions. On the surface, these demands appear to align with many of the answers provided by the cloud native computing community. However, we remain platform agnostic. Sector-specific open standards, such as those being developed within GA4GH, including the task execution service TES and workflow execution service WES, may help to solve some of these challenges. Critically though, it is the decoupling of workflow logic from execution environment that is key. Overall, we are highly optimistic that solutions to the problems of portable computation are emerging, and that Nextflow is proving to be a strong contender for revealing a way forward.
There is a clear need for portable workflows and for hybrid computing that enables transparent multi-cloud and platform deployment. For the majority of our customers, the more managed and optimised the service, the better. But by leveraging the powerful execution engines provided by Nextflow, we are now building software for the complete spectrum of services and abstraction levels. We recently released a product into private beta that allows for the monitoring and optimisation of data pipelines across infrastructure. It is built from day one to espouse cloud native ideals, from laptop execution to serverless. As we continue to refine it for public release over the next months, we hope to offer Nextflow users the full spectrum of deployment options and help them join the cloud native tribe.
This week, Paolo Di Tommaso (@paoloditommaso) and I (@evanfloden) will be in Frankfurt for ISC ‘19. Drop by the AWS booth to see how Seqera Labs is helping to bring genomics to the cloud, or alternatively reach out to us on Twitter.