Training Entity Resolution Models - Sonal Goyal

November 6, 2024

00:00:20
Sonal:
So thank you for joining in from, you know, diverse places in the world and also for spending time and having an interest in Zingg. The topic today that I wanted to kind of run by is just how to maximize, you know, the accuracy and performance that you can get out of Zingg because I believe that's probably the most important thing that all of us are after. And I would like to cover some of the key internals within Zingg and what kind of configuration and what kind of training actually impacts the accuracy and the performance that you get. Feel free to stop me at any stage because it's a small group and would love to take your questions as we go along.

00:01:10
So the goal, I think, while building Zingg models, and also for us as the key developers behind Zingg, has been to ensure that the matching happens in a performant way. The problem of entity resolution, by its very nature, is a compute-intensive problem. And the goal behind all the training and all the configuration that users do is to optimize two main things. One is to have really performant matching jobs, jobs that finish within the time in which you want your results. And second is that they are as accurate as possible, with a balance between the precision and the recall.

00:02:05
So there are two aspects of accuracy that Zingg optimizes on. One is precision. What I mean by precision is, when Zingg says that two records match with each other, how accurate is that for the user? And that is important because you want to trust the results that you're getting. Second is, of all the problem space that you have, is Zingg able to give you all the pairs and all the matches that exist in that problem space? So if there were, you know, 100 pairs of records that would have matched with each other, has Zingg given you all of them? Has it given you 90% of that, 80% of that? Statistically it's called recall, and Zingg internally balances between precision and recall. So that's essentially the internal aim, the optimization that we do, and all of this is done seamlessly. You would probably not even be thinking about this while you're configuring Zingg, but these are core aspects that the Zingg algorithms kind of train and tune on.
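
To put numbers on those two terms, here is a minimal sketch of the standard statistical definitions of precision and recall (these are the textbook formulas, not Zingg internals):

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """precision = of the pairs predicted as matches, how many were real;
    recall = of the real matching pairs, how many were found."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Example: 90 correct pairs returned, 10 wrong pairs returned, 10 real pairs missed
print(precision_recall(90, 10, 10))  # (0.9, 0.9)
```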

00:03:35
Now coming to, you know, time. Time is obviously one of the most critical factors when we think about entity resolution. And the reason is that there are absolutely no unique keys on which we can actually join these records together. So the space of computation, the space of comparison, is really, really huge. Let's say you have 10,000 records; you want to compare those 10,000 records against each other without any unique join key, and that is very difficult. Because the comparisons are so huge, Zingg internally has something called a blocking model, and its goal is to quickly bring the near-similar records together. So in a way, it is a broad-based clustering, very custom to the problem that is at hand. And this is also learned while you are labeling your data sets.

00:04:38
So during the findTrainingData phase, while you are doing the labeling, the blocking model is also being learned. Blocking is learned entirely from the records that you mark as matches, not from the records that you mark as non-matches. On the record pairs that you mark as matches, across all the fields, we apply different kinds of hash functions. Some of them will be like: when you're marking records, do the first two characters mostly match for a particular field, let's say first name? If that's a good way to bring those near-similar records together, Zingg would actually block on it.

00:05:34
So let's say, you know, you have records across multiple cities, and in all likelihood you don't want a record to match records which are across different cities. Zingg would learn that from the training, because you would always be marking records which belong to, let's say, San Francisco as matches, but San Francisco and Fremont would not be marked as a match. And that's how Zingg would actually learn that whenever the user says two records are a match, the state or the city is always a match, and it's probably a good key to block on. But it's not only equality checks; we actually apply very fine-grained hash functions, and it's actually a tree that we create out of them. So it goes over all the fields that you define as part of the blocking, as part of the field definitions. All the fields that you mark as "don't use" are not used in the blocking, but all the other fields on which you plan to run any kind of matching are considered in the blocking.
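
To make the intuition concrete, here is a deliberately simplified, hypothetical sketch of hash-based blocking. It is not Zingg's learned blocking tree; it just illustrates how a cheap hash function (first two characters of a field) collapses the comparison space so that only near-similar records are compared:

```python
from itertools import combinations
from collections import defaultdict

records = [
    {"id": 1, "fname": "Timothy",   "city": "San Francisco"},
    {"id": 2, "fname": "Timothy",   "city": "San Francisco"},
    {"id": 3, "fname": "Tim",       "city": "San Francisco"},
    {"id": 4, "fname": "Alexandra", "city": "Fremont"},
]

# Hypothetical blocking key: first two characters of fname, lowercased.
# Zingg learns which hash functions to use (and composes them into a tree)
# from the pairs you label as matches; this is only the intuition.
def block_key(rec):
    return rec["fname"][:2].lower()

blocks = defaultdict(list)
for rec in records:
    blocks[block_key(rec)].append(rec)

# Only records sharing a block key are compared, instead of all n*(n-1)/2 pairs.
candidate_pairs = [pair for recs in blocks.values() for pair in combinations(recs, 2)]
print(len(candidate_pairs))  # 3 comparisons instead of 6
```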

00:06:44
Now the second aspect for Zingg is similarity prediction, which is essentially that if you have two records which are in the same block, are they a match or are they not a match? And internally we use a logistic regression classifier for that. The features are created based on the match types from the field definitions that you select, and together these two important aspects of Zingg help you balance between the accuracy and the performance, the precision and the recall.
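
As an illustration of that second stage, here is a toy sketch of a logistic regression classifier over pairwise similarity features. The feature names and the use of scikit-learn are my own assumptions for illustration; Zingg derives its features from the match types you configure and trains on the pairs you label:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is one candidate pair from a block; each column is a similarity
# feature derived from a field's match type (e.g. name edit-distance similarity,
# city equality). Labels come from the pairs you marked as match / not a match.
X_train = np.array([
    [0.95, 1.0],   # very similar names, same city  -> match
    [0.90, 1.0],
    [0.20, 1.0],   # different names, same city     -> not a match
    [0.85, 0.0],   # similar names, different city  -> not a match
])
y_train = np.array([1, 1, 0, 0])

clf = LogisticRegression().fit(X_train, y_train)

# Score a new candidate pair: probability that it is a match.
print(clf.predict_proba([[0.92, 1.0]])[0, 1])
```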

00:07:26
Now, really, what knobs does a user have here? A key aspect while building Zingg was that, you know, a user is potentially doing a ton of other data transformations. They are putting their data pipelines together. They're focusing on the business functions, and we don't want them to be bothered about algorithms or weights or thresholds. We want it to be as simple an interface as possible: say yes, no, or can't say, and based on that, let Zingg learn. So that was, in a way, a design principle; our approach has been that the user should not have to worry about tuning Zingg for performance or for accuracy. It should come to the user out of the box. But still, obviously, Zingg's results are very, very dependent on how you're using it. The knobs available to the user are very intuitive to use. They are not entity-resolution-specific knobs that you need to worry about. You don't need to pick up a new skill, or even machine learning skills, or even Spark skills.
What you do need to know deeply about—and these are things that you probably already know—are the things that, with a bit more awareness of their impact, will help you derive maximum mileage out of your Zingg deployments. Things like your own data, which you already know; which field definitions you choose; how you configure the match types; and the number of labels that you actually provide. Essentially these are the main things that impact both the performance and accuracy of Zingg models. Now let's look at each of them one by one.
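
To ground what those knobs look like in practice, here is a minimal sketch using the Zingg Python API. The field names, schema, and file paths are made up, and the exact class and setter names should be checked against the docs for your Zingg version:

```python
from zingg.client import *   # Arguments, ClientOptions, FieldDefinition, MatchType, ZinggWithSpark
from zingg.pipes import *    # CsvPipe

args = Arguments()

# The knobs discussed above: which fields to use and how each should be matched.
fname  = FieldDefinition("fname", "string", MatchType.FUZZY)
lname  = FieldDefinition("lname", "string", MatchType.FUZZY)
city   = FieldDefinition("city", "string", MatchType.FUZZY)
rec_id = FieldDefinition("rec_id", "string", MatchType.DONT_USE)  # carried to output, never matched
args.setFieldDefinition([fname, lname, city, rec_id])

args.setModelId("100")
args.setZinggDir("/tmp/zinggModels")
args.setNumPartitions(4)
args.setLabelDataSampleSize(0.5)

schema = "rec_id string, fname string, lname string, city string"
args.setData(CsvPipe("customersIn", "/data/customers.csv", schema))
args.setOutput(CsvPipe("customersOut", "/tmp/customersMatched"))

# First run findTrainingData and label a few rounds, then train and match.
options = ClientOptions([ClientOptions.PHASE, "findTrainingData"])
ZinggWithSpark(args, options).initAndExecute()
```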

00:09:30
So essentially, I think most users already have an understanding of what their data looks like: which fields are well populated, which fields do not make sense, what kind of variations exist in the data. Some level of understanding is definitely needed. And I think most users, by the time they start looking at Zingg, are already aware of the problems in their data. They kind of know that some fields have common words, that there's some level of data cleansing they can already do, that the data is already denormalized. So to that extent, I think most people are aware of what their data is like and which fields are worth considering to put into Zingg.

00:10:30
Now, coming to choosing the field definitions, this is something I just wanted to highlight. I think most people do it right, but the idea here is that even if you have, say, 15 columns, we need to choose the smallest number of distinctive and well-populated fields for matching. I say well-populated because if those fields do not have data, if a majority of them are null, there's absolutely no signal that Zingg can derive out of them. They could be treated as a match or a non-match based on how you configure it, but then it's not really very helpful. However, if you have a field which is mostly seventy-ish percent populated and you think that adding it to your model really helps, those are fields that you should definitely look at.
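
One quick way to check how well-populated and how distinctive candidate fields are, as a simple pandas sketch; the file path and the "state" column are hypothetical, and you would do the equivalent in Spark for large data:

```python
import pandas as pd

df = pd.read_csv("/data/customers.csv")  # hypothetical path

# Fraction of non-null, non-empty values per column; fields well below ~70%
# populated usually carry too little signal to be worth a match type.
populated = df.replace("", pd.NA).notna().mean().sort_values(ascending=False)
print(populated)

# Distinctiveness: a field with one dominant value (e.g. state == "CA" everywhere)
# adds little; check the share of the most common value.
print(df["state"].value_counts(normalize=True).head())
```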

00:11:37
Another way to look at those fields is: is this field distinctive enough? If all my data is in the state of California, the state is not distinctive. It doesn't matter for Zingg, and you probably don't even care about putting it in the field definitions with a match type; you should probably use "don't use" for those kinds of fields. Now, sometimes, because Zingg is an ML-based or AI-based platform, people assume that it will be able to figure out that M is equivalent to male and 1 is equivalent to male. But that level of intelligence is really more on the data preparation side, and that's not something that we do—at least not as of now. So if you have data like that, it probably makes sense to standardize it. Also, as of now, Zingg is still non-semantic to some extent: a doctor or a physician, or a lawyer and an attorney, may mean the same thing to a human, but to Zingg, because it's still doing string edit distance and so on, they don't mean the same thing yet. However, if you have semantic needs, you could actually put in embeddings for those corresponding fields and use the array type, which is a type in Zingg, and Zingg would be able to actually match them.
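
If you do have semantic fields like doctor versus physician, here is a rough sketch of the preprocessing being described: generate an embedding per value and feed it to Zingg as an array-typed column. The sentence-transformers model, the column names, and the exact Zingg array data type string are assumptions; check the field definition docs for the array type supported by your version:

```python
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.read_csv("/data/customers.csv")  # hypothetical input

# Encode the free-text occupation field into a dense vector so that
# "doctor" and "physician" end up close in embedding space.
model = SentenceTransformer("all-MiniLM-L6-v2")
df["occupation_vec"] = [vec.tolist() for vec in model.encode(df["occupation"].fillna("").tolist())]

# Write the enriched data back out; in the Zingg field definitions, the
# occupation_vec column would then be declared with Zingg's array type
# (see the Zingg docs for the exact dataType string) instead of string.
df.to_parquet("/data/customers_with_embeddings.parquet")
```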

00:13:22
Now, when we look at choosing the field definitions, I think the biggest thing here is the match types. What does a match type internally translate to? Internally, match types translate into the features, the machine learning features for the classifier, and one match type will translate into one or many features depending on what kind of match type it is. So if you look at fuzzy, fuzzy says that you have variations in the field, possibly typos or abbreviations, and you still want to go ahead and match those records. Please remember that matching happens not only column-wise; the entire row in its entirety is considered for a match. So even if one column is matching exactly or not matching exactly, that alone, by definition, will not be the only test. The entire record in its entirety will be looked at, and only then will Zingg decide whether it is a match or a non-match. So if you choose a fuzzy match type, internally it will translate to two to three features. It can be applied to strings, integers, doubles, dates, and it will be tolerant to the variations in those fields. And as you mark your records, as you label your records, it will learn what extent of variation you are happy with, and that is where it learns the right value for the feature weight, or the column weight, that it needs to apply. So sometimes people ask us whether there are some weights or some field settings that they can change. You don't need to really bother about that. As you are labeling with the match types that you have chosen, Zingg will automatically pick up which fields are important to you and what level of variation is acceptable to you, and that is pretty much all you need to be bothered about.

00:15:35
So, we have all these match types. "Don't use" is something that we really get people to use for IDs and other fields that you want to appear in your output but that should not contribute to any kind of matching. We have some specialized match types like email and pin code. The null or blank one is very interesting: if you have, let's say, a column which is around seventy-ish percent populated, you still want to put it into Zingg. By default, if two records have null values in a column, Zingg is going to treat that column as a match. But if you want to indicate separately that this was a null, Zingg would record that, and it will start becoming intelligent about the kind of null matching that you're doing, and you can get different results. Sometimes you want to ignore results where one of the attributes is null, and you can get those kinds of results using null or blank. It is powerful, but it needs more training compared to not using that feature.
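
A small sketch of what those specialized match types look like as field definitions in the Python API; the column names are made up, and the exact MatchType names should be verified against your version's docs:

```python
from zingg.client import *   # FieldDefinition, MatchType

customer_id = FieldDefinition("customer_id", "string", MatchType.DONT_USE)  # output pass-through, never matched
email       = FieldDefinition("email", "string", MatchType.EMAIL)           # specialized email matching
postcode    = FieldDefinition("postcode", "string", MatchType.PINCODE)      # specialized postal-code matching
# A ~70%-populated column: NULL_OR_BLANK tells Zingg to learn from how you label
# pairs where one side is blank, instead of silently treating null-null as equal.
phone       = FieldDefinition("phone", "string", MatchType.NULL_OR_BLANK)
```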

00:16:43
So, I think the idea here is that, obviously, the more match types you have per field, the more features get created—and just to add to that, you can actually add more than one match type for a column. So you can have, let's say, fuzzy plus numeric, where you say: I am worried about variations in the field, but I also want Zingg to pay specific attention to the numbers that appear in this column, pick out those numbers, compare them as well, and make sure that how I'm labeling is how the matching is reflected. So you can have multiple match types for a column, and accordingly, multiple features will get created. That's something you can definitely try if you want specific kinds of matching to happen. The match types I have in front of me right now—numeric, numeric with units, only-alphabets exact, only-alphabets fuzzy—force Zingg to create custom features specifically extracting data out of that column. As you label, it will learn that kind of pattern and make sure that it matches accordingly.
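
For example, a sketch of stacking two match types on one column, assuming your version's FieldDefinition accepts multiple match types as the docs describe; the column name is hypothetical:

```python
from zingg.client import *   # FieldDefinition, MatchType

# Fuzzy on the whole string, plus a feature that extracts and compares the
# numbers embedded in the column (house number, unit, etc.); each match type
# contributes its own feature(s) to the classifier, so more labeling is needed.
street = FieldDefinition("street_address", "string", MatchType.FUZZY, MatchType.NUMERIC)
```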

00:17:58
Now, a thing to keep in mind is that the more match types we have, the more labeling we will need, because we are creating more features and Zingg needs to learn them. If we've labeled two records as matches and Zingg is actually deriving multiple features out of those columns, it needs more variation in the training, so you probably need to add more labeling compared to what you would do otherwise. Similarly, more match types will impact the performance—not drastically, but there will definitely be a hit. If you have two to three match types on all your columns, that is definitely going to have a hit on your performance. So choose wisely, because essentially it means that Zingg is processing more.

00:18:58
Now, a very simple rule for labeling: sometimes, when we look at the data, we intrinsically know that this person has moved to a different city or a different address, and yes, they are matches to me, and Zingg would probably figure it out. But treat the first Timothy Chen pair as a match only when you're sure that the second Alexandra Hutton pair, which also has different addresses, can also be treated as a match. The labeling has to be consistent. If you're okay with addresses being different for the same person name, it's okay to label them as matches. But if it's just human knowledge, if that signal doesn't exist in the data, there is no way for any application to learn it, and your results will not be consistent. So that's just something you need to keep in mind.

00:20:06
Now, how do labels help us? I mentioned already that the blocking is very dependent on how many records you label. You will see that there is a line in the logs which says: this is the number of comparisons that Zingg is making. As you label more, this number of comparisons will reduce. So anytime you feel that your performance is not good, a very likely scenario is that there isn't enough training data and the blocking model has not been accurate. You can look at that line in the logs, and if the number of comparisons is really huge, you probably need to reduce it by adding more training data.
The second aspect to labeling is making sure field combinations are covered. As you label in Zingg, you will see that it starts showing you pairs where, let's say, the first name is different, or the first name is the same but the last names are different and the states are nearly the same, and so on. So it will show you different field combinations, and when you're labeling, just try to make sure that, in totality, those kinds of combinations are covered—that it has been able to look at some variations in state, some variations in your other fields, to create the right model. Our general rule of thumb is that if you get to 40 to 50 matching records, you will probably have covered most of the field combinations. But in case you feel that hasn't happened, you can actually force-feed pairs into Zingg through training samples, or you can just label a bit more of the data, making sure that some other aspects of your data are also covered.
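
If you do want to force-feed known pairs, here is a rough sketch of supplying your own training samples alongside the labeled ones. The setTrainingSamples setter and the expected file layout (pairs sharing a cluster id plus a match flag) are assumptions to verify against the training-samples documentation for your Zingg version:

```python
from zingg.client import *   # Arguments
from zingg.pipes import *    # CsvPipe

args = Arguments()
# ... model id, zinggDir, field definitions and data configured as usual ...

# Hypothetical file of hand-picked pairs: rows sharing a z_cluster value form a
# pair, and z_isMatch marks whether that pair is a match (1) or not (0).
schema = "z_cluster string, z_isMatch int, fname string, lname string, city string"
extra = CsvPipe("extraTraining", "/data/extra_training_pairs.csv", schema)

# Assumed setter name; check the Arguments API in your Zingg version.
args.setTrainingSamples(extra)
```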

00:22:10
Now, some exciting things that we're working on, just to improve usability. We've been working on a couple of things based on some of the deployments that we have done over the past few months and some of the user feedback that we have got. One of the aspects is that if there are unbalanced training samples manually added by a user, the match jobs can start failing, and the user doesn't really know why Zingg is suddenly not performing—it's doing a lot of disk writes, jobs are failing. So we are actually going to give you knobs to verify that your blocking model is well trained. There will be a way in which you can figure out whether your data is well spread across the blocks, rather than some blocks being really huge and some blocks being very small. So that is something that we are planning to expose to help with debugging some of the Zingg jobs.
Another aspect that we are working on—and these are honestly very experimental at this stage as we speak—is that, as I mentioned, the job of the findTrainingData and labeling phases is to learn a lot from the matching records. We also learn from the non-matches, but we learn more from the matches. And we're trying to optimize the flow of findTrainingData and labeling so that we can show you more matching records than non-matching records. Because right now, what happens is that we start by trying to show you 10 matches and 10 non-matches, and mostly in the initial rounds you end up with just the non-matches, and it takes a lot more cycles. So that's one more thing that we're trying as an experiment. If it works, it would be nice for you to be able to iterate quickly on a model and get to a functional Zingg deployment faster. So this is again an experiment to alter the samples that we show you, leaning more towards probable matches to get to actual matching records faster, so that you can label more quickly than you currently can and get to a deployment faster.

00:25:00
Another aspect that we are also working on is improving the blocking tree, the way we look at the fields, to make sure that we can cover more of the columns within one single run so that your models can converge faster and findTrainingData and labeling take less time. All of these, as we speak, are just experiments that we are working on, and we'll keep you updated on how they go. I think the biggest challenge for us is that we don't own any data or any datasets. We have some customer datasets that we've applied some of these techniques on, and they worked—we got good results on those. But it's very difficult for us to say generically that they will work for most datasets, and that's why I say they're still experiments. I would love it if we could try them with some of the community users, if anyone is interested in helping us with that. It would be much faster and easier to roll out.
And just to add to that, in the coming release we are also adding a lot of blocking tree optimizations—this is something we've already done in the enterprise product, but we are bringing it to open source—so that when you're running findTrainingData, it runs much faster and you can converge to models much more easily. So this is pretty much what I wanted to talk about today. I would love to hear your questions.

00:27:08
Aniello Guarino:
Hi, Sonal. Thanks very much for what you just showed; it was really interesting. My question is actually one that was already raised with you on chat. I was thinking about quality metrics like the precision and recall that you mentioned earlier. We had some questions from the team about whether there is a way to look at the internal quality metrics used by the model, to evaluate the results faster and better for us. Because currently, as you mentioned, this is not a feature that will be available for the community version; it would only be available for the enterprise, which is fine. And as far as I can tell from the earlier tests, it seems like the community version, the open-source version, is actually fine for our needs. But the team was worried about the fact that there is no way for us to look at the quality metrics and quickly evaluate our results, rather than simply going and checking the output. Because, let's say we've got millions of contacts and we need to deduplicate them, it would be inconvenient for us, each and every time we retrain the model, to go and check some samples of the data. So we need a faster, quicker way to see "oh, this is actually what we want" by looking at precision versus recall or some quality metrics, for example the column importance scores, which at the moment we don't have access to. So I guess the question here is: do you plan to release this feature to the community version as well, or do you see it as a feature that will only be available with the enterprise version?

00:29:10
Sonal:
OK. So thanks for the question. Yeah, precision and recall are something we obviously care about a lot. But what we think—and this is what we've seen on almost all the deployments—is that just having the model accuracy, just reporting on what the model is learning from the training data, is not helpful. You're giving, what, 40 samples, 50 samples to the model. Those stats don't translate into production match output stats. The reason is that your data is obviously changing; you're bringing in a fresh set of data, you're linking. What you've trained the model on is only a very small proxy for what actually happens on the output. And if we knew on the output what we were missing, where we were not accurate, we would have actually solved it. So it is, in some ways, a tricky problem. How do we do that? On the enterprise side of the product, we've added all the hooks, because it has a lot more built in terms of the framework. So there we are able to expose what features it is generating and how they play out in combination. If three records match with each other, but only two of them genuinely matched and one was a transitive match—it only matched with one of the records—a user is able to figure that out.

00:30:55
So those are things that, on the enterprise side, we've been able to add just because of how the code is structured. I'll definitely take your suggestion and think about ways in which we can do something for the open source. For now, if you have a golden data set, you can compare your results against that for the model and benchmark against it every time. So if you're changing the model, you can see that with your previous configuration it gave you, let's say, 80% accuracy, but when you change your match types it goes up to 85%, so that's the better model to choose. That's one thing I see people doing on the open-source community side. And I think that's a good first step: have some kind of golden data set or truth set that you can benchmark against every time you put in a new model.
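
A rough sketch of that benchmarking idea: compare Zingg's output clusters against a hand-curated truth set at the pair level. The file paths and the truth-set column names are assumptions for illustration (z_cluster is Zingg's output cluster column; rec_id and true_cluster are hypothetical):

```python
from itertools import combinations
import pandas as pd

def cluster_pairs(df, id_col, cluster_col):
    """Expand a clustering into the set of record-id pairs it implies."""
    pairs = set()
    for _, group in df.groupby(cluster_col):
        for a, b in combinations(sorted(group[id_col]), 2):
            pairs.add((a, b))
    return pairs

predicted = cluster_pairs(pd.read_csv("/tmp/zingg_output.csv"), "rec_id", "z_cluster")
truth     = cluster_pairs(pd.read_csv("/data/golden_set.csv"), "rec_id", "true_cluster")

tp = len(predicted & truth)
precision = tp / len(predicted) if predicted else 0.0   # of predicted pairs, how many are real
recall    = tp / len(truth) if truth else 0.0           # of real pairs, how many were found
print(f"precision={precision:.2f} recall={recall:.2f}")
```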

00:32:02
Aniello Guarino:
OK, yeah, thanks. That's fair enough. I've got another question about the link phase. We've been testing the linking phase for a while, because we need an incremental way of finding dupes: a delta data set against an already deduped historical data set. As far as I understand, the incremental delta also needs to be duplicate-free, right? So that step is needed before finding the incremental matches, which is the other phase that we've been using. It seems like we first need to find the matches within the delta, then remove the dupes from it, and then run the link phase. Is that correct?

Sonal:
The link phase, as I remember, would actually do a best match of the second data set against records in the first data set. So if there are three matches it finds in the first data set, it will link against all three. I don't think we have a restriction on them being duplicate-free or something.

Aniello Guarino:
OK, that's good to know. Because in the documentation, I've seen that you mention the data set needs to be duplicate-free. But maybe I'm wrong.

Sonal:
So the best use of link is obviously that those two data sets are duplicate-free and you want to map one against the other—that's always been the idea in terms of implementation. I may be wrong here because it's been a while since we looked at link; mostly we work on the runIncremental phase in the enterprise product, and that works differently. So if you see something different from the documentation, let us know. We'll check and get back.
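
For reference, a minimal sketch of invoking the link phase through the Python API against a previously trained model. The field names, paths, and the idea of passing both datasets together to setData are assumptions; verify the exact way to supply the two sources against the link documentation for your Zingg version:

```python
from zingg.client import *   # Arguments, ClientOptions, FieldDefinition, MatchType, ZinggWithSpark
from zingg.pipes import *    # CsvPipe

args = Arguments()
args.setModelId("100")                 # same model and field definitions used for training
args.setZinggDir("/tmp/zinggModels")
args.setFieldDefinition([
    FieldDefinition("rec_id", "string", MatchType.DONT_USE),
    FieldDefinition("fname", "string", MatchType.FUZZY),
    FieldDefinition("lname", "string", MatchType.FUZZY),
    FieldDefinition("city", "string", MatchType.FUZZY),
])

schema = "rec_id string, fname string, lname string, city string"
# The two sources to be mapped against each other.
historical = CsvPipe("historical", "/data/deduped_history.csv", schema)
delta      = CsvPipe("delta", "/data/incoming_delta.csv", schema)
args.setData(historical, delta)
args.setOutput(CsvPipe("linked", "/tmp/linkedOutput"))

ZinggWithSpark(args, ClientOptions([ClientOptions.PHASE, "link"])).initAndExecute()
```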

Aniello Guarino:
Yeah, yeah. OK. So I'll run some tests. I actually already ran some tests with some dupes in both data sets, and it was producing some results, but I thought maybe it would be best if we just removed the dupes upfront and then ran the link phase. But if you're saying there's technically no difference, then I will probably avoid running an intermediate step like removing the dupes first, because that would add complexity and increase costs as well. That's fine. And also, to be fair, I was expecting the result to have the same number of records as the historical data set plus the records from the delta. But I realize that's actually not the case, because we might have multiple duplicated records coming from both data sets, so the result won't necessarily be the number of historical records plus the number of incremental records. But that's maybe something else.

Sonal:
So link, yes, link would only show you the matches. It was not meant as a... it's supposed to find the links, right?

Aniello Guarino:
So if link only outputs the matches, then I will need to post-process the results, because our requirement would be to have the original data set plus the delta as the output of the model itself. So if link doesn't do that, then obviously it needs to be post-processed. But that's good to know at least. Yeah, thank you.

00:35:50
Sonal:
Sure. Any other questions, Ron? How are you?

Ron Sweeney:
I'm doing well. Thank you. Appreciate this.

Sonal:
Do you have any questions, Ron? How are things going with you?

Ron Sweeney:
I do not. I've been listening and kind of looking at some of the things that have been updated lately, but I do not have any questions. Thanks.

Sonal:
So any feedback, anything that you want to share—especially, I think, for the people here today: if there are data sets that you think we could use for some of the features that we're building, or if there's any feedback on the findTrainingData and labeling phases, please do let us know. If you've had trouble with accuracy or performance with Zingg, or if your jobs have failed, those are things we're very actively working on right now, and we would really love to get some of your feedback there.

00:37:25
OK, cool. So I think this is pretty much it from my side. You know how to find us on Slack and GitHub. If you like the project, if you think it's valuable to you in some way, if you want to do a case study or just help the community with some use cases or documentation, it'd be a great help to us. In any case, thank you so much for joining, and I hope you have a great run with Zingg. Just let us know how things are going. Yeah. Cool. Thank you. Thank you so much. Bye-bye.

 
