0:00
Sonal:
Everyone, I think some of us have actually been in touch, and we are trying to get regular with the Zingg events, so I'm really thrilled that you've joined in today.
I would love to get a quick round of introductions, and then I'll pass over to Sem, who will be conducting the session today.
Okay, Jason, would you like to introduce yourself? I'm sorry, my power went off, so I'm not visible right now.
1:00
Jason:
Oh yeah, sure. My name is Jason. I'm a software engineer at the ACLU. We looked into Zingg for entity resolution last year, and I've been looking into Spark Connect, so I've been interested in that and wanted to get some more info on it.
1:25
Sonal:
Yeah, so welcome, Jason!
Uh, Pavle, how about you? Did I get your name correct?
Pavle:
Yeah, yeah, it's correct. I'm from Belgrade, and I work in MLOps on Databricks, mostly on the DevOps side. So yeah, I'm interested in Spark Connect.
Sonal:
Okay, welcome! I hope this is useful to you. Uh, how about you, Tejas?
Tejas, can you hear us?
Tejas:
Yes, can you hear me?
So, I basically work in a similar domain as entity resolution.
So curious to know, like, how this field works. That's why I joined.
2:30
Sonal:
Okay, so today's session, just for the record, is not that much about Zingg but more about Spark Connect and what we are doing with it; we'll cover what we plan to do with Spark Connect and Zingg in the next session.
But I hope you'll still find this useful, and anytime you want to talk entity resolution, just let us know. We have a Slack community of 650-plus people, so do join in if you're interested.
Yeah, how about you, Jeevraj?
3:03
Jeevraj:
Hi, I'm a full-stack developer; I code mostly in JavaScript. I came across entity resolution through the term "fuzzy search" at my last company, where we faced a few problems with duplicate records that we weren't able to filter out.
I'm interested in learning more about it. I haven't fully tried it yet. I'd like to learn how I would build an API and integrate it, automating things. I've seen the documentation, that I need to run Docker and so on, but I wasn't that comfortable with it, because as a full-stack developer I'm more comfortable with endpoints: what payload I should send and what response I get back. That's my mindset. So I was thinking about whether I could build something like that, sending a payload and getting a response. That was my interest, basically.
4:21
Sonal:
Yeah, and how about you, Abhishek?
Abhishek:
Hello, hello, are you able to hear me? Okay, hi Sonal, how are you?
Sonal:
Yeah, good, nice to meet you!
Abhishek:
Yeah, nice to meet you too! I joined a little late, about 10 minutes, rushing to the office. I've tried Zingg as well; I just followed the documentation and was able to run it.
But I ran into some issues. I think the documentation on the website is for the 0.3 version, and the recent Zingg that I downloaded is 0.4, right? Also, we are just in the evaluation phase right now, evaluating different things we can use for better entity resolution across different entities. One thing we want to know: suppose we just want to identify the duplicates. I think we get a unique cluster ID for all the duplicate records in the dataset, right?
And my other question is: how can we confirm that no data is going outside of our environment? What are the different things that are getting tracked? Those are the things I'm interested in right now.
5:56
Sonal:
So I think I'll defer the Zingg discussion to after the session. I can stay behind and answer specific questions. For this session, we have a guest with us, Sem, and we are looking forward to a Spark Connect session today.
But I can stay back and we can talk about your specific issues, so let's just wait for the session to be over for that. Sania, Nitish, do you want to go next? Sania and Nitish are part of the Zingg team, so let's have a quick round of introductions from both of them.
6:34
Nitish:
Yeah, hello everyone! I'm working as a senior software engineer at Zingg. I've been here for the last three months, and it's pretty cool algorithmic work that we're doing here.
Maybe, as Sonal said, we'll arrange another session some other time to explain how things work end to end, but today's session will be more focused on Spark Connect with Sem.
So, yeah, thank you.
Sania:
Yeah, so I'm Sania and I've joined as a software developer. It's been three months for me as well, it's been awesome working at Zingg, and I'm excited to hear about Spark Connect and its functionality.
7:26
Sonal:
Okay, so, thanks everyone, and welcome to this close group.
I'm very excited to welcome Sem. Sem and I have been working and chatting for quite a while now, almost five to six months I think, just one developer to another, going over open-source things, and what has bonded us very closely is the Spark Connect work that Sem has been doing.
But besides the work that Sem has been doing for Zingg, he's also a major open-source enthusiast. He's starting up an organization completely devoted to open-source, and he has been releasing a lot of small plugins as well as doing very deep work with GraphAr and the Deequ library for data quality. So I'll leave it to Sem to discuss some of the work that he's been doing and then talk about Spark Connect. We're very thrilled to have him here, and I'm really looking forward to an informative session today. So, thank you, Sem. Over to you.
8:45
Sem:
Thank you for the introduction. Uh, let me share my screen. Uh, do you see my screen?
Okay, so my name is Sem. I am a software engineer, a data engineer in my day job, but I'm here not as a representative of my company but just as an open-source enthusiast. I like open-source, I like to contribute to open-source, and I'm working on the Zingg open-source part, especially on a new Python API. Today I would like to share with you what I know about the new Spark Connect, the new decoupled architecture, and what is most important for us, for Zingg and for most of the open-source projects in the Spark ecosystem.
9:46
I would like to tell you a little about the Spark Connect plugin system, but before we get there, we need to take a small step back and remember how Spark works under the hood. Of course, this is a little simplified, but it should be enough for this presentation. When you write your SQL query or DataFrame code, there is actually no data flow and no computation happening, because Spark follows a lazy execution model. It is the opposite of the so-called eager model in Pandas: if you work with a Pandas DataFrame and apply, for example, a filter, select, or group-by operation, you get results immediately, because Pandas does the computation right away. Spark works differently, since it follows a lazy execution model: when you write your SQL query or your DataFrame operation, there is no immediate computation.
10:48
Okay, so when you write SQL queries or create DataFrame operations, Spark creates a so-called logical plan. The logical plan is simply an ordered set, or graph, of operations, and there is no computation at all at this point. Spark performs analysis and optimization on the logical plan, and then physical planning. The actual computation and data processing only happen at the final step, when Spark has selected the most optimal physical plan it was able to find. Code generation and physical execution happen at that last step, and only if you call so-called actions, because in Spark we have actions and transformations. Any call to a transformation just adds one more node to the logical plan, so there is no computation at that point. During analysis, Spark resolves column names and table names to actual data files, checks data types, applies optimization rules, generates a number of physical plans, and ultimately chooses the most optimal one.
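For illustration, a minimal sketch of the lazy model in PySpark (the file path and column names here are placeholders):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Transformations only add nodes to the logical plan; nothing runs yet.
    df = (
        spark.read.parquet("/tmp/events.parquet")    # placeholder path
        .filter(F.col("status") == "active")         # transformation
        .groupBy("country")                          # transformation
        .count()                                     # still a transformation (an aggregation)
    )

    # Only an action such as show(), count(), or write() triggers analysis,
    # optimization, physical planning, and the actual execution on the cluster.
    df.show()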
12:07
For example, if we create a simple PySpark application that reads a Parquet file, renames one column, groups by another column, and computes two aggregates, an average and a max, and then we call explain(), we can see both the parsed (initial) logical plan created directly from our DataFrame program and the optimized final physical plan. We can already see some benefits of the lazy execution model: this table contains five columns, but from the code and the plan, Spark realizes that only two columns are actually needed, and it applies this optimization to read only those two columns. This happens implicitly, without our direct input, which is very advantageous because Spark can apply numerous optimizations. This is the main benefit of the lazy execution model.
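Roughly, the application described above could look like this (the path and column names are assumptions), with explain() printing the parsed logical plan and the selected physical plan:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    df = (
        spark.read.parquet("/data/table.parquet")        # assumed path; the table has five columns
        .withColumnRenamed("col_a", "renamed_a")         # rename one column
        .groupBy("group_col")                            # group by another column
        .agg(F.avg("value").alias("avg_value"),          # aggregate 1: average
             F.max("value").alias("max_value"))          # aggregate 2: max
    )

    # extended=True shows the parsed, analyzed, and optimized logical plans plus the
    # physical plan; column pruning means only the two needed columns are read.
    df.explain(extended=True)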
13:08
Now, let's discuss Spark "classic", or the current way Spark operates. When we run a PySpark application, we have an application layer, where our Python code runs, and we have a driver, which runs in the Java Virtual Machine. The interaction between PySpark and the driver happens via a so-called Java bridge (Py4J), a mechanism for calling Java classes and methods from Python. The driver handles parsing, optimization, and scheduling, and communicates with workers, which typically reside in a Kubernetes cluster, a YARN cluster, or a similar environment. The application sends the driver an SQL query or plan and receives the results.
14:07
Now, with Spark Connect, we see a significant change. Spark Connect creates a separate layer between the driver and the application, which allows the driver to run remotely on another server. The Java driver and the Spark Connect API can be hosted on a different server, a Spark server, for example. This is useful because, for instance, if I want to create a web application that my manager can use, they would be able to fill in some forms, click one button, and an actual Spark application using those parameters could run, without the web application needing direct access to Spark "classic".
In Spark classic, I need to have all the Spark JARs, all the Java pieces, and all the dependencies on the same server where my backend is running, and that's not easy, because PySpark, for example, contains about 600 megabytes of JARs, and Spark has thousands of dependencies. It's a very big and complex project. And if you want to work with Spark from Python, you have the pain of working not with Spark but with Py4J: you get cryptic errors like "class not found exception", or you're using the wrong Java version or the wrong JAVA_HOME, and if you're not familiar with the Java world, it can be very hard to debug.
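With Spark Connect, the client code is just a remote session; here the host is a placeholder and 15002 is the default Spark Connect port:

    from pyspark.sql import SparkSession

    # The client needs only the (thin) PySpark package; the JVM, the JARs, and the
    # Spark configuration all live on the remote Spark Connect server.
    spark = (
        SparkSession.builder
        .remote("sc://spark-server.example.com:15002")   # placeholder host
        .getOrCreate()
    )

    df = spark.range(10).filter("id % 2 = 0")   # builds a logical plan on the client
    print(df.count())                           # the plan goes over gRPC; results return as Arrow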
15:30
If you are using Spark from Java, via the Spark Java API, you may easily run into dependency hell, because Spark has thousands of dependencies pinned to specific versions, and adding other dependencies to your application can easily lead to unresolvable conflicts where you just cannot build your project, which is very painful. And one more thing: as you remember, there are no actual operations on data up to that moment. All we have before execution is the logical plan, that set of operations, and this is exactly the thing that is sent from our application to the Spark Connect API and back. And when we request the data, it is sent back as an Apache Arrow stream.
Apache Arrow is a means of intercommunication between programming languages, a programming-language-agnostic format for data, and it's very cool because of what it enables. At the moment we have PySpark, SparkR, Spark Scala, and Spark Java, but with Spark Connect you can have Spark Rust, Spark C#, and who knows what else, because all you need to do is implement the fairly simple logic of creating gRPC requests, and Spark itself already defines all the protocol messages needed to implement it.
16:59
So, what are the benefits of Spark Connect? No JAVA_HOME, no class-not-found exceptions, no dependency hell, no huge 600-megabyte Python libraries, no need to have all the JARs and all the configuration on your local machine. You can submit your task from anywhere. Another strong benefit: if you want multiple people working on the same driver, it is almost impossible to achieve total isolation of user contexts, because of how Java works and how Spark classic was designed. But with Spark Connect you can achieve shared isolation, and that's the approach used in Databricks.
Ah, I forgot to mention that today you already have some vendor-provided Spark Connect offerings. The first one is Databricks Connect, which is the same Spark Connect under the hood. Another one is EMR Serverless, which is again the same Spark Connect. So yeah, we get a lot of benefits from Spark Connect.
18:03
But there are also some concerns. The first is that you cannot call low-level APIs like the RDD API, because the RDD, from now on, is a physical execution layer of Spark and part of the so-called internal API. That's not a big problem for me because, in my understanding, end users of Apache Spark shouldn't be calling RDD APIs anyway: everything a user should want to do can be achieved with the DataFrame API. There's no need to go down to the level of RDDs, because with RDDs you don't get all these optimizations and you have to take care of everything yourself; it's a really hard story. But another big concern is that there's no JVM on the client, which may sound controversial because it's a benefit and a concern at the same time. The problem with having no JVM is that there is no Java bridge either, and let's talk a little about why that is so important.
19:08
Let's take a small look at how PySpark itself is built. If you go and look at the PySpark source code, you will see that it is mostly not about logic, not about creating plans or optimizations or anything like that; it's just bindings, calls through the Java bridge. This is an example of the __init__ method of the SparkSession in PySpark classic. You can see that, under the hood, there is a Java SparkContext object that is actually a reference to the SparkContext instance in the Java Virtual Machine, and there is an underscore-JVM attribute that references the Java bridge itself. In classic development, developers can easily add bindings to their own Java classes, because the Java bridge has access to all the JARs and Java packages on the classpath. So all you need is to add a JAR to your classpath.
And if you've worked with Zingg, maybe you've sometimes faced this error, a "class not found exception". It means you just forgot to add the JAR with the Java class, and Python knows nothing about it. But why is this so important for the PySpark ecosystem? Because the biggest part of the PySpark ecosystem, I mean third-party packages for data quality, entity resolution, machine learning, and so on, is written by developers who are very familiar with Spark itself or are actually contributors to Spark. And of course, if you are a contributor to PySpark and you're used to using the Java bridge for everything, you will build your third-party Spark package in the same way. And that's a big problem.
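A minimal sketch of that classic Py4J-binding pattern (the Java class name here is hypothetical; it only resolves if the corresponding JAR is on the driver's classpath):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Classic PySpark only: _jvm is the Py4J gateway into the driver's JVM.
    jvm = spark.sparkContext._jvm

    # Hypothetical third-party Java class; if its JAR is missing from the classpath,
    # this is where the familiar "class not found"-style failures show up.
    java_obj = jvm.com.example.thirdparty.SomeJavaClass()
    result = java_obj.someMethod("argument")    # runs inside the JVM, not in Python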
21:03
And Zingg, for example, is built in exactly the same way. If you open the current version of the Zingg Python code, it's just bindings to Java. You create an instance of zingg.spark.client.pipe.spark_pipe(), which is actually a Java class instance, and when you call a method on it, the actual method in the JVM gets called. You can get a reference to Java objects, and that is how, at the moment, the Zingg API is built. And not only Zingg: if you take a look at the Spark ecosystem, you will find a lot of projects that are built in exactly this way. For example, Deequ is a very well-known library for data quality, and PyDeequ is built as just Py4J bindings to Deequ. I already mentioned Zingg.
22:04
Apache GraphAr is a project where I implemented Py4J bindings myself, about 6,000 lines of Python code just to realize it, and today I need to rewrite it again for Spark Connect. SynapseML is a very famous library for machine learning applications; you may have heard of it by its old name, MMLSpark. GraphFrames is a library for working with graphs from Python, and it is built on Py4J in the same way. All of these libraries will not work with Spark Connect, simply because there is no JVM. I had a couple of talks with the Spark core devs behind the Spark Connect idea, and I asked them: "Guys, why did you not provide anything like Py4J?" But actually, there are more problems with Py4J than we might think, a lot of underlying problems with isolation and security, and the answer from the Spark devs was that while it is possible to reimplement Py4J, no one will do it, so we need to live with a new reality. What is this new reality? The new reality is Spark Connect's plugin system, and it's actually quite a cool thing. The only problem is that you need to rewrite your PySpark application in a new way.
23:27
So what can we do with Spark Connect plugins? There are three interfaces at the moment, only three, but I expect there will be more, that allow you to define your own plugin for Spark Connect. And Spark Connect is only about protobuf messages and gRPC calls. The first plugin type is the command plugin; it returns void (or None in Python), and it's for cases where you want to run something, for example update a configuration. There are a lot of configurations in Spark, and you might want to run a command and update one, such as specifying a Hive catalog name or your own configuration, like a Zingg catalog name or something like that. Commands are about sending a message without expecting an answer back.
The second kind of plugin is the expression plugin. It's about columns: for example, if you want to write your own logic to transform a column in Java or Scala and expose it through the low-level Python API, with dunder methods and all those things, you can do it and register your column expression as an expression plugin. To be honest, I don't use this one because it's quite a rare case. The most important one for me is the relation plugin. It's a plugin where you send a message to Spark and get a DataFrame back, although, because it's Spark Connect, you're actually getting not the DataFrame but a logical plan that will be serialized. What's really cool is that developers of such a plugin don't need to worry about creating gRPC requests, implementing all the attach/reattach logic, or managing Arrow streams, because Spark handles all of that itself. Spark will search through all the registered plugins, and if it finds a plugin that can handle the specific message you've defined, it will use that plugin and take care of all the gRPC, Arrow, and other underlying work by itself.
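Conceptually, the extension mechanism rests on protobuf's Any type: the client packs a custom message into the relation's extension field, and a server-side plugin first checks whether it recognizes the packed type. A rough sketch of that check in Python, using a hypothetical generated message class (my_plugin_pb2 and MyCustomRelation are assumptions, standing in for whatever protoc generates from your .proto file):

    from google.protobuf import any_pb2

    # Hypothetical module and message generated by protoc from a custom .proto file.
    from my_plugin_pb2 import MyCustomRelation

    # Client side: pack the custom message into an Any; Spark Connect carries it
    # inside the "extension" field of its Relation message.
    msg = MyCustomRelation(table_name="customers")   # hypothetical field
    packed = any_pb2.Any()
    packed.Pack(msg)

    # Server side (shown in Python purely for illustration; the real plugin runs on the JVM):
    # a relation plugin checks whether the Any holds a message type it understands.
    if packed.Is(MyCustomRelation.DESCRIPTOR):
        unpacked = MyCustomRelation()
        packed.Unpack(unpacked)
        print("this plugin handles the message:", unpacked.table_name)
    else:
        # Declining the message lets Spark try the next registered plugin.
        print("not my message type; another plugin should handle it")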
25:35
That's really cool. Now let's take a look at a real-world example of plugins that work. I did this recently, as I already mentioned, with the Deequ library, and I really like this library; I think it's very cool. So, what is Deequ? Deequ is a very cool, quite low-level Scala library that's used to build data quality; actually, it's a data quality engine rather than a data quality tool. There is a data quality tool built on top of Deequ, called PyDeequ, but PyDeequ is built via Py4J bindings, and it doesn't work in a Spark Connect environment. So, for educational purposes, I decided to create Spark Connect plugins for Deequ.
What did I do? I defined all the protobuf messages required to describe the main structures of Deequ. I created a server plugin that can handle the message, parse the protobuf, map the deserialized protobuf to actual Deequ structures, run the Deequ job, collect the result, and send it back to the client, to the user, as a DataFrame. I also created a PySpark API on top of the Python code generated from the protobuf messages, because working with protobuf directly is quite a specific thing, and it's best practice not to force users to use the generated protobuf code directly.
27:24
So what do I have now? I have a Python API. Via this Python API, I can create Deequ structures. Spark itself takes care of creating the gRPC request from my protobuf message, sending it to the Spark Connect server, handling the attach/reattach logic, passing the message to the plugin that I registered via a simple configuration, running the job, and sending the result back. And how much work was this? About 350 lines of protobuf code; it's a simple example. Deequ is a data quality library, so you can run different kinds of data quality checks: compute histograms, counts, distinct counts, correlations, and all those things. For each possible Deequ structure, I defined my own protobuf message. I did the same for the analyzers that I used in Deequ, and the same for anomaly detection strategies, some helper classes, check configurations, and the whole verification suite.
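The user-facing layer on top of the generated code can stay very thin. A hedged sketch of that idea (the dq_connect_pb2 module, its message names, and its fields are invented for illustration, not the actual plugin's proto):

    from dataclasses import dataclass

    # Hypothetical classes generated by protoc from the plugin's .proto definitions.
    from dq_connect_pb2 import VerificationSuite, CompletenessAnalyzer


    @dataclass
    class Completeness:
        """Friendly wrapper so users never touch the generated protobuf classes."""
        column: str

        def to_proto(self) -> "CompletenessAnalyzer":
            return CompletenessAnalyzer(column=self.column)


    def build_suite(analyzers) -> "VerificationSuite":
        # Translate the friendly Python objects into the protobuf message that the
        # server-side plugin knows how to turn into real Deequ structures.
        return VerificationSuite(analyzers=[a.to_proto() for a in analyzers])


    suite_msg = build_suite([Completeness("customer_id")])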
28:30
I created a parser in Scala, about 500 to 600 lines of Scala code, that just maps the protobuf messages to Deequ structures. As you can see, this is a protobuf message, and this is the code generated from it. I just check each message with pattern matching. For example, if my message is the approximate-count-distinct case, I need to create an approximate count distinct analyzer in Deequ, which is an instance of a Deequ class, and I just pass in all the parameters from the protobuf message. So the protobuf messages don't contain any business logic; they only contain the arguments for the Deequ classes, all the information needed to create instances of them.
Yeah, it's quite boring pattern matching, case, case, case, but only about 600 lines of Scala code, and to be honest, I used LLMs to generate it, because generating bindings from existing protobuf messages is a perfect case for any LLM tooling. And the final piece is the implementation of the relation plugin itself.
30:00
So, what I did here is create a method. This method expects a message in the form of Any, and inside my code I need to check that the message is of the class I expect; if it's not, I return empty, and in that case Spark will try to find another plugin that can handle this kind of message. I parse the DataFrame, though again it's not a DataFrame, it's the logical plan passed from my client. I create the Deequ suite from the message using my parser, I run Deequ, I generate the resulting DataFrame for my API, and I send it back. But again, not in the form of a DataFrame; in the form of a logical plan, because a DataFrame is not serializable but a logical plan is, so you can send it from one side to the other. And that's it. I tested it, and it works fine.
And after implementing all of this, I realized how nice this approach is, because you can actually do the same thing for Spark classic: with the same already-defined protobuf messages, instead of a relation logical plan you just pass a DataFrame via Py4J, via the Java bridge, and you build the same suite you created from the protobuf, passed as bytes. And that's very easy and very cool, because it's a single call to the Java bridge instead of calling every method through the Java bridge. Working with Py4J can be really painful, especially if you're working with Scala. For example, if you want to have some fun, try figuring out how to pass an Option[Long] from Python to Scala via Py4J. In my understanding, it's absolutely impossible. It's funny, but it's impossible. So that's it.
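A hedged sketch of that single-call pattern for Spark classic (the Java-side ProtoRunner class and its run method are hypothetical, as is dq_connect_pb2; the point is that the protobuf message crosses the bridge exactly once, as bytes):

    from pyspark.sql import SparkSession
    from dq_connect_pb2 import VerificationSuite   # hypothetical generated message

    spark = SparkSession.builder.getOrCreate()

    suite_msg = VerificationSuite()                       # same message used for Spark Connect
    payload = bytearray(suite_msg.SerializeToString())    # protobuf -> bytes for Py4J

    # A single Java-bridge call: the hypothetical Scala/Java side deserializes the bytes
    # with the same parser used by the Spark Connect plugin, runs the job, and returns
    # a Java DataFrame that can be wrapped back into a Python DataFrame.
    jvm = spark.sparkContext._jvm
    result_jdf = jvm.com.example.dq.ProtoRunner.run(spark._jsparkSession, payload)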
32:07
What's next? Finally, we're at the point where we're ready to discuss Zingg. I'm working on a new PySpark API for Zingg. I've already defined protobuf messages for most of the Zingg structures. We've already had a discussion with the team, I created a server plugin, and it mostly works. Now we are here, testing the pipes with different kinds of messages and corner cases. When that's done, the last thing we need to do is create a simple, user-friendly Python API, because working directly with the code generated from protobuf can be a little tricky and tedious, and you can easily build a user-friendly Python API on top of it with data classes, Pydantic, and other modern Python tools.
That's mostly all I wanted to share today. Thank you for your attention. Feel free to contact me via email if you have any questions or ideas, or if you want to collaborate on any of the projects. Thank you.
33:18
Sonal:
Sem, this was very informative and very deep, so thank you for this, and thank you for all the hard work you've been doing on a lot of open-source projects, including Zingg. I do have a few questions, but before that, does anybody else have questions they would like to ask? Okay, I'll go first then. In terms of this API, which is Python-bound: there are so many Spark connectors that Zingg specifically uses, for JDBC connections, for NoSQL connections to, let's say, Cassandra, or Elastic, or Snowflake, the typical DataFrame APIs. In the new world of Spark Connect, how does that work?
34:28
Sem:
It works in the same way. For example, say you want to connect and read your DataFrame from Snowflake; no problem at all. You just add the needed JAR to your Spark server, not to the client. You send, via a gRPC message, the information needed to connect, for example the connection URI, some token, or something like that, and the Spark server will call Snowflake, read the Snowflake data source internally, and return the logical plan for it, because Snowflake implements the Data Source V2 API, as I remember, and you can read a Snowflake table via Spark. It's just about creating a logical plan with Snowflake as the source, and then you do your operations. So in my understanding, there's no fundamental difference. The only difference is that the call happens not from your application but from the server, so you need to take care of your server's connectivity to Snowflake or any other data source you want to use, and you need to make sure the classpath on your server contains the JAR that implements the data source. But it works the same way. The Deequ story, for example, is the same: I add the Deequ JAR to the Spark server, because Deequ is obviously not part of the Spark distribution, and I call those low-level RDD things from Deequ on the Spark server. My client doesn't need to care about any of this, because the interaction is via gRPC.
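From the client's point of view the code does not change; only the server needs the connector JAR on its classpath. A hedged sketch (connection options are abbreviated, values are placeholders, and the exact format name depends on the Snowflake connector version):

    from pyspark.sql import SparkSession

    # Thin client; the Snowflake Spark connector JAR lives on the Spark Connect server.
    spark = (
        SparkSession.builder
        .remote("sc://spark-server.example.com:15002")   # placeholder host
        .getOrCreate()
    )

    df = (
        spark.read.format("snowflake")                            # format provided by the connector
        .option("sfURL", "myaccount.snowflakecomputing.com")      # placeholder connection options
        .option("sfUser", "user")
        .option("sfPassword", "***")
        .option("sfDatabase", "SALES")
        .option("dbtable", "CUSTOMERS")
        .load()
    )

    df.show()   # the server reads Snowflake; results stream back to the client as Arrow batches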
36:13
Sonal:
So just to be clear: on the client side, it's just the Python installation, the Python package of, let's say, Zingg or Deequ, and it's only on the server side of Spark where we have the actual Deequ JARs, or whichever JARs we need to care about. They are part of the Spark server installation, part of the running driver, so that when the plan gets translated, the relation plugin (or the appropriate plugin) is available to handle those messages. Is my understanding correct?
36:50
Sem:
Right, yes. You only need to add your JARs to the server. Actually, at the moment there is no separate Python-only distribution, because even if you install the PySpark Connect client, you download the whole of Spark, but they are actively working on this. There was a vote on the Spark mailing list recently, and it passed, about creating a Java-free PySpark distribution, so we should expect a thin PySpark client rather than a 600-megabyte Python library with all the JARs inside. It will be essentially pure Python that works with almost no dependencies, only protobuf, gRPC, and maybe some PyArrow code, and that's it. For me it's very cool, because nothing fundamentally changes: all the heavy lifting happens on the server, and for users it's even a simplification.
37:44
Sonal:
Okay, and my second question: some of our users come from a data science background, and some of them have asked me for R bindings rather than the Python API, a Zingg R API. I assume that, with all the work you've done on protobuf, it's just a matter of writing the R client, because those messages are already defined?
38:17
Sem:
That's a complex question. The first thing I want to say is that SparkR itself is deprecated; the vote about deprecating SparkR passed recently on the Spark mailing list. So new distributions of Spark won't contain SparkR; they'll contain only Spark Java, Spark Scala, and PySpark. Instead, it is strongly recommended to use the sparklyr library from the R ecosystem. I recently took a first look at how it works under the hood, and I was surprised to find that sparklyr works by having R call PySpark. So it seems to me that you don't even need to generate R code from protobuf; I'm not even sure that's possible, because R is not supported by the protoc compiler. There are some third-party tools, and you can try to find a protobuf compiler that emits R code, but I'm not sure such a thing exists. But you can generate Python code, Python classes, because Python is a first-class citizen in the protoc compiler, and you can ship those Python libraries alongside the sparklyr setup. Then all you need is to create very simple R APIs that call the underlying PySpark. So, in my understanding, it should be absolutely enough to create the PySpark bindings and very tiny R bindings on top of those PySpark bindings, because sparklyr itself works by calling PySpark.
40:13
Sonal:
Okay, that’s fine then. Okay, so that’s a new one to me. So, I’ll probably put that as a project somewhere. We’ll probably investigate this a bit more, but this seems cool. Right? I think we should be able to leverage the work that we have then on Zingg Python and go from there.
40:32
Sem:
Also, there are actively maintained projects at the moment around Spark Connect Go, and Go is a first-class citizen in the protobuf compiler, so you can generate protobuf classes for Go. It's quite raw but mostly working already. I also know the people behind the Spark Connect C# client, and there is a Spark Connect Rust client as well. So in theory you can support all of these using only a single set of protobuf messages and a single Scala or Java plugin on the server; all you need is to generate the corresponding messages and maybe create an API on top to simplify users' lives. About sparklyr, I will send you links to what I found via direct message.
41:24
Nitish:
Yeah, I have one high-level question. So basically, Spark Connect is kind of a wrapper or gateway in front of the Spark cluster, right? An end user is only concerned with firing requests at Spark Connect; they need not worry about the Zingg or Spark that is sitting behind it, right? And the next question: once a request comes to Spark Connect, does Spark Connect then interact with both the Spark core JARs in the Spark cluster and the Zingg JAR, the code inside Zingg core as well? That was my question: the end user fires requests at Spark Connect, and then Spark Connect has to interact with both the Zingg JARs and the Spark JARs residing within the Spark cluster, right?
42:33
Sem:
Yes, exactly. When you send a gRPC request to the Spark Connect server and that request is of the extension type, because all these plugins are registered as extensions, Spark checks the configuration. There is a specific configuration, something like spark.connect.extensions.relation.classes, where you register the class of your plugin as an extension. Again, it's a one-time configuration of the server. Spark then checks all the registered plugins until it finds one that is able to parse the specific message you sent. If Spark cannot find such a plugin, it sends back a message like, "Sorry, there is no registered plugin to handle a message of kind [class name]." So if your customer forgets to register the Zingg plugin on the Spark Connect server, they will get an exception like, "Sorry, a message of kind Zingg message cannot be handled because there is no plugin." And this can also be handled on the client side: in your Python code, you can check the error returned from Spark Connect and, in that case, provide an even more explicit message, like "We're sorry, but you need to add this JAR and this configuration to the Spark server." That's how it works.
44:15
Nitish:
So, say at one point we created a JAR and stored it in the Spark cluster. The next time, when we add some feature to our product and create a new JAR out of it, we would store the new Zingg build JAR in the Spark cluster; but do we then need to make changes in the plugin, or on the Spark Connect side, as well? If we're adding features or changing some core functionality, that's what I want to know.
44:48
Sem:
Yes, of course. If you release a new version of your library, you need to deliver the new version of the JAR to the Spark server. So yes, the JAR with the Zingg logic, and also with the Zingg Connect plugin, should be on the classpath of the Spark Connect server, and that's it. The class you created for handling the gRPC message has to be registered as a Spark extension, but that's the simplest part; I'll show you how to do it. It's really easy; it's just a couple of lines of configuration.
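For reference, a sketch of what that couple of lines of server-side configuration might look like (the JAR path and plugin class name are placeholders, and the exact configuration key is worth double-checking against your Spark version):

    # spark-defaults.conf on the machine running the Spark Connect server
    spark.jars                                  /path/to/zingg-with-connect-plugin.jar
    spark.connect.extensions.relation.classes   com.example.zingg.ZinggConnectPlugin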
45:47
Amnay:
Yes, hello, hi. Can you hear me? Thank you so much for this very interesting presentation. As a data engineer, I was just wondering whether Spark Connect offers any benefits for ETL workloads, especially on cloud-based platforms like Databricks, Synapse, or Fabric, for example?
46:13
Sem:
Yes, I already mentioned shared isolation; at the moment there is something like a shared isolation mode in Databricks, and it is achieved with Spark Connect. Without Spark Connect it's really hard to achieve full isolation, but Spark Connect is designed to provide multi-user support: you have a single server, multiple customers can connect to it, and the server works as a gateway. This, for example, is what's used under the hood of Databricks Serverless or EMR Serverless. From a developer's point of view, the startup time of a classic Databricks cluster is about 3 to 5 minutes, because it needs to find instances in AWS or Google Cloud, run scripts, set up Spark, and set up all the JARs and connectivity. With the serverless version, the startup time is a matter of seconds, so you get a ready-to-use endpoint, which is quite cool. Also, if you're a developer writing Scala, there is a very nice project at the moment around creating a Scala 3 client, because if you're working the gRPC way, you don't need to care about all the Spark classes, and it becomes quite simple to create a Scala 3 Spark client. That's very exciting; I'm waiting for that project's release. But even if you're working with Scala 2 or Java, with Spark Connect you don't need all of Spark's dependencies, so it's much easier to maintain your own dependencies and avoid dependency hell. Spark has thousands of dependencies, and if you try to build even a simple application with its own dependencies, it's not an easy task to reconcile all of them. So I see a lot of benefits, to be honest.
48:32
Sonal:
Thank you. Any other questions? Anybody? Okay, cool. So, I guess this was it. Thank you so much. I hope you enjoyed giving this talk as much as we enjoyed listening to you. It’s been a great learning experience for me personally as well as for this group today. Thank you so much for joining us.
Sem:
Thank you. I will send my slides to you; please share them in the chat.
Sonal:
We'll share the recording as well as the slides; we will publish them on our YouTube channel and on our website. Thank you once again. I'm staying back in case anybody has questions around Zingg; I'm happy to answer them right away. Let's stop the recording.