Over the past decade, developers, researchers, and the community have successfully built tens of thousands of data applications using Spark. Since then, use cases and requirements of data applications have evolved: Today, every application, from web services that run in application servers, interactive environments such as notebooks and IDEs, to phones and edge devices such as smart home devices, want to leverage the power of data.

However, Spark's driver architecture is monolithic, running client applications on top of a scheduler, optimizer and analyzer. This architecture makes it hard to address these new requirements: there is no built-in capability to remotely connect to a Spark cluster from languages other than SQL.

Spark Connect introduces a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from everywhere. It can be embedded in modern data applications, in IDEs, Notebooks and programming languages.

This talk highlights how simple it is to connect to Spark using Spark Connect from any data applications or IDEs. We will do a deep dive into the architecture of Spark Connect and give an outlook of how the community can participate in the extension of Spark Connect for new programming languages and frameworks - to bring the power of Spark everywhere.

Martin Grund

Affiliation: Databricks

Martin Grund is a Senior Staff Software Engineer and Tech Lead at Databricks working at the intersection between query processing, data governance and security. He's currently leading the design and engineering efforts for Spark Connect as part of Apache Spark. Martin has previously led the engineering for Amazon Redshift Spectrum and worked on Cloudera Impala. He holds a PhD in computer science from the Hasso-Plattner-Institute in Germany.