Zingg With Databricks - Lucas Bilbro, Lead Solutions Architect, Databricks

October 10, 2024

00:00:17
Sonal:
Okay, so, thrilled to actually meet Luke again and very pleased to invite him for this community meetup. Luke and I have been collaborating since, I think, just the very early days of Zingg open source. Luke is a lead solution architect at Databricks, where he has helped develop solution accelerators for Customer 360 and product matching. He’s given us a ton of feedback, really helped improve the product, and brings a lot of entity resolution expertise from his prior job history with Tamr. So, in all, it’s just been a pleasure working with Luke on various fronts, and I’m very excited to see what notebooks he has for us today. So, over to you, Luke, and thank you so much for coming.

Luke:
Thanks, Sonal, and thanks, everybody, for joining. I’m happy to be here and to present some stuff to everyone today. So, yeah, as Sonal said, my name is Luke; I’m at Databricks. I’ve been here almost five years now. I’m a solutions architect, which, for those who know the other titles, is like a solution engineer, pre-sales engineer, or field engineer kind of role. I’m technically in the sales org, but I’m on the technical side of things. So, I’m the person who talks to people who are trying to figure out if Databricks is a fit for them, or maybe they’ve already used it and need some help. I love to work on a couple of problem areas—I really like performance tuning, diving deep into Delta and the different types of storage formats, and how to really optimize workloads. And then, of course, fuzzy matching and entity resolution are interesting to me as well.
Originally, I was in academia—if you go back far enough, I was a physicist for a while—but I ran away from academia about 12 or 15 years ago. I cut my teeth on data at Ab Initio, an ETL company in Boston, kind of like Informatica in some ways. I was there for four or five years, learned about parallel processing, and helped customers solve interesting problems there.

00:02:45
Luke:
I remember one day a question came in from a customer at Ab Initio that involved trying to match data that didn’t really match, and I remember thinking, “Oh, that’s kind of a cool problem.” Probably like a lot of people on the line or watching this, you tried solving it yourself the first time, maybe with blocking, though you probably didn’t even know it was called blocking at the time. I worked on it for a week or two with the customer, and eventually someone at Ab Initio tapped me on the shoulder and said, “Hey, you should have come to us first.” It turned out there was already a product called Correlate being developed there. That was my introduction to more advanced approaches to fuzzy matching. I did that for a couple of years, and I liked it so much that I went to a company called Tamr, an MDM startup in the Boston area. I switched to the post-sale services side, implementing projects, eventually leading a team, and started learning more about the stewardship side. So, I understood the engineering side from Ab Initio, and then at Tamr I started learning about stewardship—golden records, the need for cluster IDs, overrides, and rules-based approaches. That was great.

00:04:25
After a couple of years, I started getting the itch—I was really intrigued by Spark. Spark was something that Tamr used under the hood for its fuzzy matching engine, but I didn’t get to play with it much. I’d see stack traces and have no idea what they meant, thinking, “Why’d this fail?” I got really interested in Spark, so I decided to go to Databricks. I’ve been here now for five years, and by 2021, I started hearing about Zingg. I had tried to build my own framework at Databricks just as a way to learn Spark and do my own thing, and then I started hearing a little about Zingg. I started looking at it, testing it, and thought it was pretty cool. That brings me to my relationship with Zingg, Sonal, and the team here. So, that’s my intro, and I’d love to walk you through some stuff we’ve put together and share a little bit of the story. But Sonal, if you wanted to chime in with any thoughts, that’s how I remember it—it’s been three or four years that we’ve been working together. Does that sound right?

00:05:35
Sonal:
Well, I forgot to mention the Ab Initio story, but that’s an even richer entity resolution background. One thing I found charming in your intro was that all the stack traces led you to Spark, which is kind of funny and interesting. ER is obviously a tough problem, and as a physicist, surely you’re drawn to some of the tougher ones. So now, over to you to walk us through—we’re just looking forward to that.

00:06:10
Luke:
Yeah, let me go right to the screen and start showing some stuff here. I'm still happy to take questions if anyone has any; feel free to interrupt me—I don’t mind. But I’ll tell you a little more history real quick. I don’t have too much more, but around November 2021, I think, is probably when Sonal and I started corresponding for the first time. I don’t even remember what we talked about back then, but I remember trying to imagine some kind of partnership, whether official or unofficial, and it didn’t take too long for us to get to something pretty cool. In August 2022, we put out the first blog and solution accelerator along with two of my colleagues, Mimi and Brian. This was our first attempt, and it got a lot of attention—it was really great. The story was pretty interesting. You can actually tell this is pre-ChatGPT because there aren’t a lot of bullets and numbering, which nowadays you see a lot in blogs, and you can usually tell someone used ChatGPT to write it.

00:07:19
But I think it really resonated and got a lot of attention. I will say, though, that the solution accelerator, which if you were to follow these links would take you to the notebooks, was really complicated in retrospect. Maybe we were trying to be a little too complete. We really tried to imagine an entire production workflow—how you would implement it, all the scaffolding and everything you’d need to build to actually have something that works. In particular, we covered how you need to prepare data, the difference between an initial workflow of deduplicating a large amount of data and then the incremental workflow of deduplicating smaller chunks of data as they arrive daily or more frequently. So, while it still has good stuff in it, I think it was maybe not accessible to everyone because of just the sheer amount of detail.

00:08:20
That was the first one, and almost exactly a year later, we released another blog, this time on products. This was right around when Zingg version 0.4 was being released, and it was being developed very aggressively by the Zingg team. The two new things about this one, which you can see in the little picture here, are, first, that it includes images. While this isn’t as interesting, it’s important to some people working on part mastering, where they have images as part of their data and want to use image similarity to help distinguish potential parts from each other. The Zingg team somehow got this feature in pretty quickly, and we were able to demonstrate a concept of how it might be used in a pipeline.

00:09:12
But the second thing to note, if you’re not seeing it yet, is the UI. There’s a box with data in it, comparing item IDs and names, with images and buttons like “Uncertain,” “Match,” and “No Match.” This was inspired by a notebook I had been developing for years, just working on it behind the scenes. Whereas the earlier blogs leaned towards completeness, I really wanted a notebook that was simple, conveyed the message clearly, and was distributable. That’s what I want to walk through for the rest of the time, or at least most of it. I now call this my “Zingg POC” notebook. It’s had several names, and I’ve shared it on the Zingg Slack channel or sent earlier iterations of it. This notebook is something I send out almost every week now to potential or existing customers at Databricks—basically anyone who wants to see it.
The goal here wasn’t to be complete but to provide something that someone could work through in a week, do a couple of iterations of matches, and feel like they’d answered some questions about how Zingg might work on their data. I’ll pause for a second in case there are any questions, but my goal today is to walk through the notebook. I’ll speak less on the Zingg specifics and more on why it’s oriented this way—the background notes and behind-the-scenes ideas. Any questions or thoughts, feel free to let me know; otherwise, I’ll keep going.

00:11:22
If you've looked at some of the notebooks on the Zingg GitHub, you’ll likely notice some overlap. Honestly, I don’t remember who created them first—I remember Sonal had a very early example of using Zingg and Databricks, which initially inspired me. I then built my own version and sent it back to her, and she maybe updated hers, and so on. At some point, they’ve overlapped quite a lot, so I’m not sure who wrote what originally, but you’ll probably see some similarities between them, along with some new stuff.
It starts by breaking down the POC into five steps. Really, only the first four are common to every POC, while step five is something someone might need in a very specific case. The first four steps are installing Zingg, discussing the data you’re going to use, configuring Zingg, and running Zingg; the fifth is the link phase, covered at the end.
If I’m ever in a rush, like if I only have 10 minutes or even five to demo to someone who’s potentially interested, I almost always skip straight to step four. Step four is the interesting part—it’s the active learning loop. I often try to emphasize to my audience just how unique it is to have an open-source package that’s scalable (running on Spark), performs probabilistic matching, and has an active learning cycle built into it. As far as I know, there isn’t another package that does this. There might be others out there now, but none that I’m aware of.
I explain that there’s a three-part iteration loop: finding what I call “interesting pairs” to label (sometimes edge cases), labeling those pairs, and then saving them. Then you make a decision: do I move forward, or do I go back and repeat steps one through three? This iteration loop is the smallest, tightest loop in the process. I’ll sometimes just skip straight to that when presenting the actual notebook.

00:13:50
Now, a couple of important things to talk about right here at the top. Compatibility is something that’s top of mind for me, and I think for Sonal as well. It’s not the easiest story, unfortunately, and it’s gotten a little trickier recently with some decisions made by Databricks and how that has left some partners that have used Databricks and Spark in the past. But hopefully there’s a path forward. In short, if you’re interested in using Zingg 0.3.4, I think your best bet is Databricks Runtime 12.2. It’s a long-term support release, which means it has three years of support; we’ll keep it supported, updated, and patched.
For Zingg 0.4.0, right now the best fit is 14.2. Unfortunately, 14.2 is not a long-term support version; that would be 14.3. But Databricks implemented some pretty significant changes in 14.3 for third-party Spark vendors, so we’re still trying to figure out what to do there. Thankfully, it’s going to be good for at least another six months or so, and hopefully there’ll be something in the future that gives us a longer-term runway.
For installing the Zingg libraries, I walk them through it and try to make it as easy as possible, just to make sure it’s very clear. Often the people I talk to know GitHub but don’t really know how to interact with it; they don’t know how to download a JAR file, or download a GZIP file and find the JAR inside, so I try to make all of that as easy as possible.

00:15:25
Get to the starting data. In my example, I use the North Carolina voter data from the University of Leipzig, I believe. I do mention some limitations; I try to be very honest and open, and every time I find something new, I add more to this notebook. While I believe the Enterprise version of Zingg is compatible with Databricks Unity Catalog, the open-source version is not directly compatible with Unity, so in my demo I still use the Hive Metastore, which is what we had back in the day before Unity.
For those who are familiar with Databricks, this is not a sales pitch for Databricks, so I’m not going to get into all that. But there are some compatibility issues, and I try to make it very clear what does and doesn’t work. In this case, I’m looking at North Carolina voters, and I try to point out the likely transcription errors and mistakes in the data. I’ve had some thoughts about blanking out some of the data to simulate real data better; nobody has data that’s 100% populated, right? There’s always going to be missing data. I haven’t gotten around to that yet.
Then we go through the basics of installing and configuring Zingg: pip install to get the Python library in place, setting up your model directory, or your Zingg base directory, and then the model that you’re going to work on, with some text to explain what we’re doing here.
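For reference, a minimal sketch of that setup with the open-source Python API; the model name and directory here are placeholders, not the notebook’s actual values:

```python
# Assumes `pip install zingg` has been run and the Zingg JAR is attached to the cluster
from zingg.client import Arguments

args = Arguments()
args.setModelId("ncvoters_poc")   # placeholder model name
args.setZinggDir("/tmp/zingg")    # placeholder base directory where models are stored
```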

00:16:54
Defining the input is usually the next place that takes a little discussion with anyone I’m presenting this to. In this demo I’m using a single dataset, and my goal is to build a model and cluster all the data, so the idea is to throw all the records into a big bag, shake it all up, and then extract all the matches and put them into clusters.
A lot of times I get asked, “Well, I have more than one source.” That might have been the very first thing I ever asked Sonal, actually, probably in 2021: what do I do if I have more than one source? There’s really nothing stopping you. I have the setup here: if you have more than one source, you need more than one input pipe, and here’s the method for doing it, as sketched below. But there is the requirement that input sources must have the same schema, so usually I’ll talk to customers about unifying their schema with a little upfront processing.
The terminology I’ve been using recently is to imagine Zingg picking up at the silver layer of the medallion architecture, if you’re familiar with the Databricks view of data processing as bronze, silver, and gold layers. I don’t think Zingg picks up from bronze; you probably want a couple of steps where you’ve ingested the raw data and normalized the schema, plus a couple of placeholder steps where you could come back and clean things up a bit. But at some point you’ve unified the schema, maybe that’s your silver layer, and now you can point all of your sources at Zingg.
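As a sketch of that multi-source setup, continuing the `args` object from above; the pipe names, format, and paths are placeholders:

```python
from zingg.pipes import Pipe

# Two sources that have already been normalized to one schema (the "silver layer" idea)
crm = Pipe("crm", "delta")
crm.addProperty("path", "/silver/crm")   # placeholder path
web = Pipe("web", "delta")
web.addProperty("path", "/silver/web")   # placeholder path

# Multiple input pipes are fine as long as the schemas match
args.setData(crm, web)
```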

00:18:30
The next step is to define the output: you just create another pipe, this time an output pipe. I’m using Delta as both input and output, and then we can move on to some more interesting stuff.
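The output side is symmetric; a sketch with a placeholder path:

```python
# Output pipe, mirroring the input pipes; the path is a placeholder
out = Pipe("matched", "delta")
out.addProperty("path", "/gold/matched")
args.setOutput(out)
```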
I do enjoy going through this step; it’s usually one that gets a lot of attention. I explain how we’re going to match the data on name, suburb, and postcode. We’re going to pass the record ID through but not use it, and if we wanted to change that postcode from a fuzzy match to a pin code match, we totally can, because, as we all know, the docs have a nice list of different match types. This usually gets some internal “oohs” and “aahs” that you have some flexibility here.
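A sketch of those field definitions; the column names follow the public North Carolina voter dataset, and the match types are from the Zingg docs:

```python
from zingg.pipes import FieldDefinition, MatchType

givenname = FieldDefinition("givenname", "string", MatchType.FUZZY)
surname   = FieldDefinition("surname", "string", MatchType.FUZZY)
suburb    = FieldDefinition("suburb", "string", MatchType.FUZZY)
postcode  = FieldDefinition("postcode", "string", MatchType.FUZZY)   # could swap to MatchType.PINCODE
recid     = FieldDefinition("recid", "string", MatchType.DONT_USE)   # carried through, not matched

# Order matters: put better-populated, more discerning fields first (see the discussion below)
args.setFieldDefinition([givenname, surname, suburb, postcode, recid])
```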
I do have a note here, and maybe this is a good time to ask Sonal if there’s an update: I remember, in an engagement, finding that the order in which we specified the columns to Zingg was important. Zingg seemed to give more preference to fields earlier in the list than later. I’m not sure if that’s still true; I might need to change the note if not, but that was something we had observed, so I added it to the notebook.

00:19:54
Sonal:
Yeah, your observation was right. This happens because the blocking tree is learned: we iterate over all the fields, and if it finds a good blocking function or combination on the first few attributes, that’s where it stops. Blocking by definition isn’t meant to be exact; it just has to split up the number of comparisons so the computation doesn’t blow up. That’s still how the code works today. I have thought about doing some kind of random shuffle over the fields and seeing how that performs, so that’s something we can try.
One of the next things I’m considering for an upcoming Zingg release is to ship some of these experiments and let people try them and give us feedback on whether they work better or worse for them. That will take community help: telling us whether the blocking learns better with minor tweaks like the one Luke mentioned—not a whole new algorithm—and whether that works better in some scenarios. As of now, our recommendation is that if you have fields that are well populated and discerning, like an SSN, which is very likely to be unique and a clear match across your records, you want to put them first.

00:21:45
Luke:
Yeah, good. I love the idea of crowdsourcing feedback on ideas. I’ll say that in pre-sales I always think about a good way to spin this, so I tell people this is a tuning knob; it’s a feature. You’re always asked, “How do I tell Zingg that one field is more important than another?” and I say, “Thankfully, Zingg has a feature for that: you just put them in order.” But I think it’s great to continue experimenting and crowdsourcing that.

Sonal:
I think before we move on, Mayur is asking a question: “Any limitations on the number of columns provided in a list?”

Luke:
Well, I would guess, and Sonal can correct me if I’m wrong, that in general, the more fields you give, the more compute power you’ll need, and probably the more training data you’ll need. So when customers ask me this, and I’m basing this on prior experience at other companies as well, I recommend starting simple. Don’t just throw a hundred columns at Zingg and ask it to figure out what to use. I say: if a person were going to look at a couple of records and decide if they’re the same, what would be the most important signals to that person? The person doesn’t have to tell me exactly what the similarities need to be, but they’d probably look at, say, these seven or eight columns. I would say start with those.
And I think the more columns you put in, the more training data you should anticipate needing to get good accuracy. But Sonal, what do you think?

00:23:22
Sonal:
Yeah, I would agree there, Luke. There is no limitation per se, Mayur, on the number or size of columns. But my experience looking at many datasets concurs with what Luke is saying: it’s mostly 7 to 10, or at most 12, columns that most people need to figure out whether two records are matches or not. And the general rule is to look at the most valuable columns, the ones that are best populated and the ones that aren’t all the same. If your entire population is in the state of California, you probably don’t want to add ‘state’ as one of the columns to match on.

Luke:
Yeah, distinguishing columns. Well-populated, distinguishable columns: things that give a good signal that two records are the same, or different.
Okay. Well, I have a couple more important parts to show here. One is a new take I have on the performance settings. The Zingg docs have always talked about numPartitions and labelDataSampleSize. numPartitions is used wherever there’s a distributable part of Zingg, which is mostly the training and matching steps. The idea is to set it to a factor of 20 or 30 times the number of cores you’re using. It sounds a little complicated, but honestly it’s not too bad: look at the cluster you’re using, take the number of nodes, multiply by the number of cores per node to get your total core count, and multiply that by 20 or 30. That gives you a good number to go with. That one is not as hard to communicate.
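As a worked example of that rule of thumb, continuing the `args` object from the setup sketch; the cluster shape here is hypothetical:

```python
# Hypothetical cluster: 4 workers x 8 cores, using the low end of the 20-30x factor
nodes, cores_per_node, factor = 4, 8, 20

args.setNumPartitions(nodes * cores_per_node * factor)  # 4 * 8 * 20 = 640 partitions
```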

00:25:15
Luke:
This one seems to trip people up every time I talk to them, and I used to just say to set it to a small number. But I’ve come up with a new rule of thumb that I’ve been communicating more often: if you took the entire amount of data coming into Zingg and multiplied it by this number, it should come out to about 100,000 records. That’s completely experimental; it’s just what I’ve seen. If your total data size is only about 100,000 records, you don’t really need to set this to less than one, in my experience. If you’re doing about a million records, maybe set it to 0.1; if you’re doing 10 million records, maybe 0.01. In each case, Zingg will take that sample fraction times the total input and generate a set of data to build its blocking from, and about 100,000 seems to be a good number.
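The same rule as a tiny calculation; the 100,000 target is Luke’s experimental rule of thumb, not an official guideline, and the input size here is hypothetical:

```python
# Rule of thumb: sample fraction * total input rows ~ 100,000
total_records = 10_000_000                   # hypothetical input size
sample = min(1.0, 100_000 / total_records)   # 0.01 for 10 million rows

args.setLabelDataSampleSize(sample)
```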
So that’s how I communicate it now, and it’s been going over a lot better; I’m not sure if that helps anyone in the audience today, but it seems to be about right. I have places to set those numbers, and I have some helper functions you’re totally free to look at, including some stuff that uses IPython widgets, but I usually hide the code just because it’s not very pretty; I’m not really a developer, I just pretend to be one. And then we finally get into the fun stuff, which is the running step. Sonal, if any questions come up, please stop me, because I can’t see the chat. Again, the loop has three steps. First, we want to find the interesting pairs to label: let Zingg find the edge cases, so you run findTrainingData. I make sure to tell customers and anyone doing a POC that it’s really important for this loop to be fast. You want to get through the entire loop—part one, part two, part three, and back to part one—in less than 10 minutes, because you’re going to do it a bunch of times, probably 10 or so, and you don’t want it to take hours. That was a mistake I made early on, but with an appropriately sized labelDataSampleSize this goes pretty quick.
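A minimal sketch of driving that loop from the Python API, continuing the `args` object from above; the phase names follow the Zingg docs:

```python
from zingg.client import ClientOptions, ZinggWithSpark

# Part 1: have Zingg propose interesting, edge-case pairs to label
options = ClientOptions([ClientOptions.PHASE, "findTrainingData"])
ZinggWithSpark(args, options).initAndExecute()

# Parts 2 and 3: label the ~15-20 proposed pairs and save the labels.
# From a terminal this is the interactive "label" phase; in the notebook it is
# replaced by Luke's ipywidgets helper, which isn't reproduced here.
options = ClientOptions([ClientOptions.PHASE, "label"])
ZinggWithSpark(args, options).initAndExecute()

# Decision point: with roughly 30-40 matches and 30-40 non-matches labeled,
# move on to train/match; otherwise loop back to part 1.
```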

00:27:40
I just need to run this again and roll back up here. You run part two, and part two uses ipywidgets to build a kind of quasi-labeling interface. It looks very similar to the one in the blog I showed earlier, so you can start going through and asking people to label for you, or label yourself. Mary Burris in Alexander and Mary Burris in Albemarle, I’ll say no. Charlie Ward in Elizabeth City and Charles Ward in Clayton, you can say no. I won’t go through all of them, but there are going to be some matches here: Dyrell Hudle in Elizabeth City, but with all sorts of mistakes—that’s a good one, so you can label that one a match.
This becomes a real simple interface that you can either hand to a labeler, who then feels like they’re having input into training the model, or just label through real quick. If you’ve used Zingg, you know you get about 15 to 20 pairs each time you run it, and I really like the idea of running it, labeling those, and saving those labels, so only doing 15 or 20 at a time. I frequently get the question, “I’d rather just label 100 or 200 at a time.” I tell people, okay, you can do that, but I think you’re not taking full advantage of the active learning loop, because each time you run part one again, Zingg, from my understanding, is going to find new edge cases; it tries to be more intelligent the second time around. If you’re willing to label multiple smaller chunks, I believe you should get to a better blocking strategy faster, with fewer labels.

00:29:30
So that’s usually how I communicate it to customers: really embrace the small training iteration loop, lean into the fact that each pass should take just a couple of minutes, and within two hours you could have a couple hundred really high-quality labels. You just go through a few times, and once you’ve labeled, I don’t know, 30 or 40 matches—that used to be the rule of thumb back in the day, around 30 or 40 matches and around 30 or 40 non-matches—you can move on to the next step. Any questions or comments? I think we’re reaching the end of the notebook; I have maybe one or two more things to show, and then an optional notebook after that in case anyone’s interested, but a very short one. Okay, please write in questions, and Sonal will stop me if anyone has any thoughts or questions here.

00:30:22
The next step, obviously, is to run the train and match steps, or the combined trainMatch. Train is where Zingg learns from all the labels and builds a binary classifier—or, I believe, a classification tree—to start understanding whether things are matches or not. The match step is where Zingg uses that model to actually deduplicate and cluster, with transitive links, down to how all the data is related. You run that, and it can take a while; we’re properly taking advantage of the Spark cluster here. I noticed just this morning that Sonal has put up some pretty good numbers on how long it might take, or how big a cluster you might need, for different data sizes, up to 80 million records or so. That’s really where you want Spark: this is the first big step where we’re doing really big distributed processing, and Spark can shine here. And then we get to look at the results.
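Running those phases follows the same pattern as before; a sketch, continuing the same `args`:

```python
from zingg.client import ClientOptions, ZinggWithSpark

# "train" builds the classifier from the saved labels; "match" applies it and
# clusters records with transitive links. "trainMatch" runs the two back to back.
options = ClientOptions([ClientOptions.PHASE, "trainMatch"])
ZinggWithSpark(args, options).initAndExecute()
```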

00:31:23
I usually get a lot of questions here, a lot of interest in what these numbers mean, and I don’t talk about them that much, to be honest. I know what they mean, but for the purposes of a POC, I say to really focus on the clustering and look at how things are related. We could talk about the scores today, but today isn’t really about educating on Zingg; it’s about showing the workflow. I also noticed just today that there’s this generateDocs option, so I threw that in this morning. I think it’s super cool: it shows all the labels people have put in, in case you want to review the labels and understand something about how Zingg is scoring things. In full disclosure, I haven’t given too much thought to this step yet.
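That option follows the same phase pattern; a sketch:

```python
# Optional: generate a reviewable summary of the labeled pairs
options = ClientOptions([ClientOptions.PHASE, "generateDocs"])
ZinggWithSpark(args, options).initAndExecute()
```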

00:32:16
So that is the main notebook. Down here I talk about the link phase, if someone wants to do more of that incremental approach, or if they’re really interested in linking data rather than deduplication. In this step I take the exact same data, break it into a 1% and a 99% sample, and match one against the other using the existing model. That’s all I’ll say on this one, but you’re free to extend it however you see fit and start thinking about how it works. That’s the end of this notebook. I feel very confident that customers and anyone using this kind of interface can get to a pretty good result in just a couple of days, maybe a week, and of course there are many ways to go from here. But it’s 12:40 and I know we don’t have forever, so I’ll stop there.
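A sketch of that 1%/99% linking exercise with PySpark; the table name and paths are placeholders, and `spark` is the session predefined in Databricks notebooks:

```python
from zingg.client import ClientOptions, ZinggWithSpark
from zingg.pipes import Pipe

# Split the same table into a 1% and a 99% sample
df = spark.read.table("nc_voters")   # placeholder table name
small, large = df.randomSplit([0.01, 0.99], seed=42)
small.write.format("delta").mode("overwrite").save("/tmp/link/small")
large.write.format("delta").mode("overwrite").save("/tmp/link/large")

# The link phase takes the two datasets as two input pipes and matches across them
a = Pipe("small", "delta"); a.addProperty("path", "/tmp/link/small")
b = Pipe("large", "delta"); b.addProperty("path", "/tmp/link/large")
args.setData(a, b)

options = ClientOptions([ClientOptions.PHASE, "link"])
ZinggWithSpark(args, options).initAndExecute()
```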
I’ll highlight that if anyone is interested, I have another notebook I’ve been working on, specifically on understanding how clusters change with time: for example, when new data comes in, when data gets deleted, or when IDs change. If you have a million or 10 million of these and only 0.1% of them change in some fashion, how do you identify those changes? Maybe the changes are good, maybe they’re bad; you don’t know. So I have an approach I’ve been working on, which I still consider a beta. The whole point is a function called generateClusterDiffs that takes as input two different cluster mappings, maybe the before and the after, plus two IDs, the record identifier and the cluster identifier, and then reports five different potential outcomes as a tuple of five DataFrames. Those are here, and I’ll stop with these five.

00:34:30
Records that exist only in the first cluster mapping, so maybe records that got deleted; records that exist only in the second cluster mapping, so maybe records that are new; clusters that are identical, meaning they have the exact same records but maybe the cluster ID has changed for some reason; clusters from set one that have split into more than one cluster in set two; and the inverse, clusters in set two that are a combination of more than one cluster from set one. You might imagine there were previously two clusters that both represented the Luke Bilbro entity, and then a new record comes in that’s the linchpin linking those two clusters together, so the second time you cluster, they all collapse into one. That would be the fifth case here, and in theory you could have the inverse as well.
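Luke doesn’t show his implementation, but a hypothetical PySpark sketch of those five outputs under the assumptions he describes—two DataFrames with a record ID and a cluster ID; the function and column names are assumed, and his actual code may differ:

```python
from pyspark.sql import DataFrame, functions as F

def generate_cluster_diffs(before: DataFrame, after: DataFrame,
                           rec_id: str = "recid", cluster_id: str = "z_cluster"):
    """Compare two record->cluster mappings; return the five outcome DataFrames."""
    # 1. Records present only in the first mapping (likely deletes)
    deleted = before.join(after, rec_id, "left_anti")
    # 2. Records present only in the second mapping (likely new records)
    added = after.join(before, rec_id, "left_anti")

    # Per-cluster membership signature: the sorted list of record IDs
    def signature(df, out_col):
        return (df.groupBy(cluster_id)
                  .agg(F.sort_array(F.collect_list(rec_id)).alias("members"))
                  .withColumnRenamed(cluster_id, out_col))

    # 3. Clusters with identical membership (even if the cluster ID changed)
    identical = (signature(before, "before_cluster")
                 .join(signature(after, "after_cluster"), "members")
                 .drop("members"))

    # Crosswalk: for each record, its cluster before vs. after
    crosswalk = (before.withColumnRenamed(cluster_id, "before_cluster")
                 .join(after.withColumnRenamed(cluster_id, "after_cluster"), rec_id))

    # 4. Before-clusters whose records split across more than one after-cluster
    split = (crosswalk.groupBy("before_cluster")
             .agg(F.countDistinct("after_cluster").alias("n_after"))
             .filter("n_after > 1"))
    # 5. After-clusters that combine more than one before-cluster
    merged = (crosswalk.groupBy("after_cluster")
              .agg(F.countDistinct("before_cluster").alias("n_before"))
              .filter("n_before > 1"))

    return deleted, added, identical, split, merged
```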
So that’s this sort of thing. I have some examples, and this notebook is something I’m happy to share as well. But as promised, I’m going to stop and open it up for discussion. I’m happy to chat about things, and if anybody has Databricks questions, I’m happy to answer those, but again, I’m not trying to sell anybody on Databricks today. So I’ll stop there, stop sharing, and open up the floor back to you, Sonal.

00:35:52
Sonal:
Yeah, thank you so much for this really nice Zingg program. We’ve come a long way from the days we were running spark-submit jobs with the Zingg JAR to a full-fledged Python API, where a full entity resolution program is no more than 100 lines now. Kudos to you for clearing up so many things and making it as simple as it is, so thank you for that. Abhishek has one question here: records having a match will have the same z_cluster, correct me if I’m wrong?

Abhishek:
Yeah, because I saw three columns there, and you said the column the user is mostly interested in is the z_cluster. So technically, in the final output, I can group by the cluster ID and get all the duplicates that have been identified, correct?

00:37:05
Luke:
Okay, that’s correct, as long as you’re willing to accept that there’s a transitivity taking place. If A matches B and B matches C, then A, B, and C will all be given the same ID, even if A and C didn’t match directly. That’s what’s often referred to as transitive closure: maybe A and C weren’t explicitly identified as a match, but transitively they were, which is often what you want in this situation. But yes, grouping by that ID is how you would go build a golden record or something like that. And then there’s another question. Abhishek, yes, of course you can have access; it’s completely open. I assume everybody is on the Zingg Slack, so I’ll post both notebooks there again; Sonal can tell me which channel is best. I’ll also put my email in the chat, so if anyone needs to reach out to me for whatever reason, I can send them to you that way too. This is not meant to be any sort of private thing. It’s completely open, and you’re welcome to use it. There’s an open Databricks license with it, but our open license basically says: have fun with it, just don’t say we cheated you or something.
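The grouping Abhishek describes is a one-liner against the match output; z_cluster is Zingg’s output column, while the path and recid column are placeholders from the earlier sketches:

```python
from pyspark.sql import functions as F

matches = spark.read.format("delta").load("/gold/matched")   # placeholder output path

# Group by cluster ID; clusters with more than one record are the duplicate groups
dup_groups = (matches.groupBy("z_cluster")
              .agg(F.count("*").alias("size"),
                   F.collect_list("recid").alias("members"))
              .filter("size > 1"))
```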


00:38:22
Mayur:
Hey, yes, thanks Sonal. Hey Luke, really great presentation. There are a couple of things. You mentioned that when we iterate on the data, that might change the clusters, right? So is there any particular number of records, or does it totally depend on the data? If we’re running our clusters on completely new data, do you see a lot of changes in the clusters? How does this work? Maybe this is kind of a black box for me.

Luke:
Yeah, you know, it’s hard to answer that in general, but I can tell you that in practice there are some common patterns. The daily changes that come in, the way data changes with time, are mostly new records. There will be updates applied to existing records, but that’s usually a smaller amount, and there will be some deletes. But new records are by far and away the most common type of change to these systems, probably 10 to 1: ten times as many new records as updates. And new records don’t really change the clustering that much; they rarely change clustering other than creating new clusters for records that have never been seen before. So of those five types of output I showed in the cluster diff—records that disappeared, records that appeared, clusters that stayed the same, and the two ways clusters change—the most common by far is the one for a new record showing up. That is without a doubt the most common, and it’s actually pretty easy to handle: it either generates a new cluster because you’ve never seen it before, or it fits into an existing cluster. I would say 90% of changes probably fit into that bucket. The other 10% are a mix; it really depends on how your data changes. Updates can change things, or maybe a month or two later you retrain the model and it gets better: you’ve labeled better data or pulled out some bad labels from before, and your clustering has gotten smarter, and that can change your clustering by a couple of percent. But if you have a couple million records, that could be tens of thousands of clusters changing each time. So it’s significant; it’s technically human scale, but it is not easy human scale. I think it’s good to automate the ones you can: an add, automate it; deletes, automate it; same cluster with just a different cluster ID, automate it. And then the clusters that have actually changed, the thousand or so each week, push those downstream to a process that handles them in a more manual way.

00:41:26
Mayur:
So is there any standard practice to evaluate the models, like measuring the accuracy?

Luke:
Yeah, Sonal and I were actually just chatting about this this morning; we’ve talked about it a lot. Precision and recall are the most common approaches, but in this type of system there are actually two forms. Because a human is labeling data and you’re building clusters at the end, the Zingg model can be built and then, in theory—I don’t think this exists yet, but we were just chatting about it—you could ask how accurate Zingg was at reproducing the labels, look for false positives and false negatives on those, and that would give you a precision and recall score. But that precision and recall is not the same as precision and recall on the clusters. The final output, as we saw, was those z_cluster IDs and records, so how do you measure precision and recall of clusters? It’s an interesting question. I do have some approaches, and I think I put them in the solution accelerator; I’ll double-check and post them in the Zingg Slack channel later. In general, though, I think there are two generic approaches. One is ground truth: if someone has accumulated known-true data, you can compare against it and get an accuracy. The other is a random sampling of the results: maybe randomly select 100 clusters and have someone review those 100 clusters manually, and then you can do a statistical inference of the overall statistics for the global population. Those are the two generic approaches people use, but applying them in practice can be pretty complicated.
Hopefully that made sense; there were a lot of words there.
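The random-sampling idea can be sketched in a few lines; the sample size of 100 follows Luke’s example, while the path is a placeholder:

```python
from pyspark.sql import functions as F

matches = spark.read.format("delta").load("/gold/matched")   # placeholder output path

# Randomly pick 100 cluster IDs and pull their full membership for manual review
sampled = (matches.select("z_cluster").distinct()
           .orderBy(F.rand(seed=42)).limit(100))
review_set = matches.join(sampled, "z_cluster")
# A reviewer marks each sampled cluster correct or incorrect; the observed error
# rate gives a statistical estimate of cluster precision for the whole population.
```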

00:43:37
Mayur:
Yeah, I got it. So with Databricks, how can we standardize this in a job flow? With new incremental data, are there any documents or steps available?

Luke:
Yeah, so I worry about that one, because that’s what we tried to do in that first story, which I’ll link again in the chat. We really thought hard about how you would actually build an entire workflow, with an initial step and an incremental step, and we talk explicitly about all the stuff that’s going to happen. We talk about how you would use the link step and perform the record linking, and then we go even further; I think there are some pictures in there. We talk about all the steps that might be needed to incorporate those incremental matches. Honestly, it’s a good thing to read through to understand the steps, but I do think it’s almost too in-depth for a POC. That’s why I’ve shifted to this view: a POC is the same for everybody. An incremental approach is probably going to use the link phase of Zingg, but the way it gets implemented is so specific to the opportunity, the customer, and their use cases and requirements that I don’t know how to generalize it in a way that works. But if you wanted to see my best approach, it would be that one. That’s where we thought for months and months about the many edge cases you’d need to consider in the initial and incremental workflows, and that’s where we put it all.

00:45:25
Sonal:
I’d like to add something here for people who are seriously looking at an incremental flow. That’s something we’ve actually built on the Enterprise side as a production workflow, because a one-time match or a one-time link is fine: you can have varying datasets, and Zingg open source gives you a lot of power and ability to really master your data. But if you want a stream of updated records, deleted records, and newly added records, and you want to ensure cluster IDs are preserved in the right way—when clusters merge, the largest cluster’s ID gets picked up; when clusters split, the z_cluster IDs are assigned in a way that makes sense based on the size of the cluster, and so on—that takes more.
Those are things we’ve added as part of Zingg Enterprise, so you may want to take a look at that. The solution notebooks that Luke and we have worked on are there so you can walk through all these phases and get to results quickly. But if your need is an always-updated set of clusters, golden records coming in and streaming out to different applications, with matching running incrementally every hour or two, then Zingg Enterprise is something you might want to look at.

00:46:08
Luke:
That makes a lot of sense from my point of view. I think what we learned was that it was probably a little too challenging to do on our own, just cobbling together open source, so putting that managed layer together makes a lot of sense. That’s probably what I would actually say: if you need that regularity, go to Sonal; she’s the expert, she knows how to do it right, and that’s not where I am.

Sonal:
Yeah, and just between this close group, honestly, when I wrote Zingg, I thought it would be the toughest thing I would ever write in my life, and I figured I’d open source it to help other people, because it was really tough to put together. But when I wrote the incremental piece, it was far more complex than anything else we’ve done in the open source. So yes, I would agree with Luke there: it is definitely a complex piece to keep those clusters managed.

00:48:12
Luke:
Yeah, just some inside baseball from my side: when I was at Tamr, they had to build an entire second product to do incremental matching, which was a huge headache. So yeah, it’s hard. It is hard.
I think we’re probably at the end, so I just wanted to thank you all for the questions and attention. I’d be happy to chat with anyone who has specific questions, and I’ll make sure all of my stuff gets put onto the Zingg Slack. My email’s in the chat; feel free to reach out directly. I won’t bite, I promise. Happy to talk about open-source stuff.

Sonal:
Yes, it was great fun seeing this, and as I said earlier, we’ve come a long way from the early spark-submit jobs we were running to getting to work with Databricks, and this looks really cool. I really hope a lot more people will be tempted to try it and see the real benefits on their own data.
Right, then thank you so much, Luke, for this very engaging, open discussion. I’m sure we’re all waiting to hear from you about the sharing of the notebooks, and I hope some people will reach out to you directly or to me, whichever works for people. Looking forward to having more of these discussions.

Luke:
Sounds good. It was a lot of fun. Great progress. Alright everyone, have a good rest of the week. Bye.
