Zingg is thrilled to announce the kickoff of our Community Events, starting September 26! This exciting series of events is designed to bring our rapidly expanding community together—now 650 members strong—to discuss the latest breakthroughs in data science, data, and open-source technologies.
We’re launching with a bang by hosting Sem Sinchenko—a Zingg Contributor, OSS aficionado, and PPMC at Apache GraphAr. Sem will take us on a deep dive into one of the most revolutionary changes in distributed computing: Spark-Connect.
Whether you’re a seasoned data engineer or just dipping your toes into the world of big data, this session is bound to 'spark' fresh ideas!
📅 Date: September 26, 2024
🕒 Time: 4:00 PM UTC
📍 Virtual Event: Reserve your spot
Sem will explore the Spark architecture, providing insights into query execution and diving into the Spark-Connect. He’ll explain how Spark-Connect untangles the complexities of the old architecture, opening new doors for PySpark users, and demonstrate how to create a Spark-Connect plugin for JVM libraries using low-level Spark APIs.
If you’ve worked with Apache Spark, you already know how powerful it is when handling large-scale data processing. However, even with all its power, Spark’s tightly coupled architecture presents challenges, especially for flexible, language-agnostic solutions. Spark Connect aims to address the following issues in Apache Spark.
The traditional Spark architecture relies on a monolithic driver that runs client applications directly on top of the scheduler, optimizer, and analyzer. This design complicates the execution of applications, especially in interactive environments like notebooks and IDEs, where a long-running Spark session is required. If the client machine running the driver fails or is stopped, it results in the abrupt termination of all associated Spark jobs, leading to significant disruptions.
Since all applications run on the same driver, any application that consumes excessive memory can lead to critical exceptions, such as out-of-memory errors, which can crash the entire Spark cluster. This lack of isolation means that one poorly performing application can impact all other users sharing the same cluster.
The tight coupling between client applications and the Spark driver complicates upgrades. Users must ensure compatibility between their applications and the version of Spark being used, which can hinder the adoption of new features and security improvements.
Managing resources effectively within a Spark cluster is challenging due to its distributed nature. Users often face difficulties in configuring memory settings for drivers and executors, which can lead to performance bottlenecks or excessive resource consumption. The traditional architecture does not provide adequate tools for monitoring and managing these resources effectively across multiple applications running simultaneously.
Debugging Spark applications can be cumbersome because users often lack direct access to the driver process for troubleshooting. This limitation makes it difficult to identify and resolve issues when they arise, particularly in complex distributed environments where errors may occur across different nodes.
The requirement for the Spark driver to have full library dependencies creates potential conflicts with user-added libraries. If an application uses a library version that differs from what Spark requires, it can lead to compatibility issues that complicate development and deployment.
Introduced in Apache Spark 3.4, Spark Connect enables remote connectivity to Spark clusters through a decoupled client-server architecture, allowing applications to interact with Spark from various environments - such as IDEs, notebooks, and different programming languages, using the DataFrame API.
The Spark Connect client library simplifies Spark application development by providing a thin, language-agnostic API that can be integrated into various platforms. It builds on Spark's DataFrame API, using unresolved logical plans as a protocol between the client and Spark driver.
The client library translates DataFrame operations into unresolved logical query plans encoded with protocol buffers and sends them to the server via the gRPC framework. On the Spark Server, these plans are converted into Spark’s logical plan operators, initiating the standard Spark execution process. The results are then streamed back to the client as Apache Arrow-encoded row batches.
The new architecture of Spark Connect addresses several multi-tenant operational challenges:
1. Stability: Applications with high memory usage now affect only their own environment, running in separate processes. Users can manage their dependencies on the client side without worrying about conflicts with the Spark driver.
2. Upgradability: The Spark driver can be updated independently of applications, allowing for performance and security enhancements. Applications remain forward-compatible as long as server-side RPC definitions are backward-compatible.
3. Debuggability and Observability: Spark Connect supports interactive debugging from your preferred IDE during development. Applications can also be monitored using their framework’s native metrics and logging libraries.
At Zingg, we harness Spark for distributing the entity resolution workflows by learning blocking rules for matching and preventing a cartesian join on record attributes. Zingg's Python API leverages the earlier Py4J based design. We are now moving to Spark Connect, and it will be a great time for our users and data folks to learn about Spark Connect in this session.
We hope to see you all!