Resolving entities through Zingg and consuming the output in a graph database is the right approach.
As organizations collect more data and harness additional in-house and cloud tools, trusted information about customers, suppliers, partners, and other entities becomes increasingly fragmented. Without a unified view of core business entities, analytics suffers. If our customer data sits in silos, we cannot quantify customer lifetime value, new customers added per quarter, average repeat orders per customer, and other critical customer engagement metrics. This leads to a poor understanding of our business and our customers, which erodes brand value. Customers expect personalized offers and recommendations, but if our base data is not integrated, we cannot delight them or discover opportunities to cross-sell and upsell.
This problem of siloed data is not restricted to customer information for revenue purposes alone. Compliance departments need clear entity definitions of suppliers to make sure they are not dealing with people or organizations on sanctions lists. Anti-money laundering and KYC activities begin with establishing a single source of truth.
Identifying records that lack unique identifiers but belong to the same entity is known as entity resolution. On the surface, entity resolution looks like an easy problem to solve: it is intuitive for a human to spot records with variations, so surely a computer program can do even better! Maybe we have some common identifier across different systems and records that we can leverage to unify the data? Unfortunately, even with trusted identifiers like email, people use work, personal, school, and other addresses, so identifiers alone do not solve the problem completely.
Due to the different ways in which our systems capture and record information, most of our real-world data looks like this.
As we see in the sample records above, no single attribute matches exactly across all the records. This example has just three records and four attributes; imagine crafting similarity rules in a programming language like Python or SQL to match records like these at real-world scale!
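To make the point concrete, here is a hypothetical trio of records in the spirit of the sample above (the names, numbers, and addresses are invented for illustration). A naive exact-match check finds no attribute shared by all three, even though they clearly describe one customer:

```python
# Three hypothetical records for the same real-world customer, captured
# differently by three source systems (illustrative data, not from the post).
records = [
    {"name": "Thomas Wilson", "phone": "555-0199",   "city": "Austin",    "address": "64 Crest Rd"},
    {"name": "T. Wilson",     "phone": "(555) 0199", "city": "Austin TX", "address": "64 Crest Road"},
    {"name": "Tom Wilson",    "phone": "555 0199",   "city": "Austin",    "address": "64 Crest Rd."},
]

# Collect the attributes whose values agree exactly across every record.
shared = {
    field for field in records[0]
    if len({r[field] for r in records}) == 1
}
print(shared)  # → set(): not one attribute survives an exact-match join
```

Any rule set that matches these records has to tolerate initials, abbreviations, punctuation, and embedded state names all at once, which is exactly what makes hand-written similarity logic painful.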
Graph databases, with their inbuilt linkage patterns, are a natural fit for disambiguating records and resolving entities. TigerGraph, a leading graph database, is a powerful tool for entity resolution. As outlined in the TigerGraph blog post, we can build a graph schema for the above three records by defining five types of vertices: one for the actual customer, and one each for the four attributes (name, address, phone, and city). By representing the attributes as vertices and each entity-has-attribute relationship as an edge, we can translate the three records into the following graph.
We can then define similarity metrics like cosine or Jaccard to link names, phone numbers, cities, and addresses. When two vertices match, we add an edge between them, and then group similar entities together using connected components.
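The attribute-linking step can be sketched in a few lines: compute Jaccard similarity over character trigrams, add an edge for every pair above a threshold, and collapse the result with a union-find connected-components pass. The names and the 0.3 threshold below are illustrative choices, not values from the TigerGraph post:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over character trigrams of two strings."""
    grams = lambda s: {s[i:i + 3] for i in range(len(s) - 2)}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def connected_components(n, edges):
    """Group vertex ids 0..n-1 linked by `edges`, via union-find."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())

names = ["Thomas Wilson", "T Wilson", "Tom Wilson", "Alice Ray"]
# Link every pair whose trigram Jaccard similarity clears a hand-picked threshold.
edges = [(i, j) for i in range(len(names)) for j in range(i + 1, len(names))
         if jaccard(names[i], names[j]) > 0.3]
print(connected_components(len(names), edges))  # the three Wilsons group together
```

Even this toy version hints at the trouble ahead: the pair loop is quadratic in the number of records, and the threshold must be tuned by hand per attribute.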
Due to the transitive nature of the relationships, this approach is far better than running queries in a relational database. However, it still leaves us with a lot of work: defining similarity criteria between attributes, combining multiple attributes, and linking them all to deduce entity relationships. Pair-wise similarity of attributes is computationally prohibitive, and even with distributed graphs we run into scalability challenges. What if we could resolve entities more easily and then load the graph into TigerGraph for our KYC, AML, and Customer 360 scenarios?
One way to solve the above challenges is to use a specialized open-source entity resolution framework like Zingg. This frees us from the intricacies of entity resolution, leaving more time to work on our inferences. (Full disclosure: I am the author of Zingg.)
A typical workflow in resolving entities with Zingg looks like this.
1. We build a configuration JSON specifying our input and output data locations and the fields we want to use for matching.
2. We train Zingg through its interactive learner, which picks out representative sample pairs for us to mark as matches or non-matches. Labeling 30–40 pairs through the labeler is typically enough for Zingg to automatically figure out the similarity thresholds and the weight of each attribute.
3. We run our data through Zingg and get the resolved entities. We can reuse the models created in step 2 with newer and incremental datasets.
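As a sketch of step 1, a Zingg configuration might look roughly like the following. The structure is abridged from the Zingg documentation and the field values (file locations, model id, sample size) are placeholders; consult the docs for the exact schema of your Zingg version:

```json
{
  "fieldDefinition": [
    { "fieldName": "name",    "fields": "name",    "dataType": "string", "matchType": "fuzzy" },
    { "fieldName": "address", "fields": "address", "dataType": "string", "matchType": "fuzzy" },
    { "fieldName": "phone",   "fields": "phone",   "dataType": "string", "matchType": "fuzzy" },
    { "fieldName": "city",    "fields": "city",    "dataType": "string", "matchType": "fuzzy" }
  ],
  "data":   [ { "name": "customers", "format": "csv" } ],
  "output": [ { "name": "resolved",  "format": "csv" } ],
  "labelDataSampleSize": 0.5,
  "modelId": 100,
  "zinggDir": "models"
}
```

Steps 2 and 3 then correspond to running Zingg's findTrainingData, label, train, and match phases against this configuration.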
The output of Zingg, when loaded to TigerGraph, looks like this:
The matching vertices are joined by an edge with a probability score. We can now bring in our transactional data for further analysis of the resolved entity graph in TigerGraph.
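To illustrate the hand-off, the sketch below turns match output into an edge list ready for loading into TigerGraph. The rows and the `z_cluster`/`z_minScore` column names follow Zingg's documented output convention, but treat this as an illustrative sketch with invented data rather than a verbatim schema:

```python
import csv
import io
import itertools

# Hypothetical Zingg match output: original attributes plus a cluster id
# and match-confidence scores (data invented for illustration).
zingg_output = io.StringIO("""\
id,name,z_cluster,z_minScore
r1,Thomas Wilson,0,0.91
r2,T Wilson,0,0.91
r3,Tom Wilson,0,0.93
r4,Alice Ray,1,1.0
""")

rows = list(csv.DictReader(zingg_output))  # already sorted by z_cluster

# Emit one "same_as" edge per pair of records sharing a cluster, carrying the
# lower of the two minimum scores as a conservative edge probability.
edges = []
for _, group in itertools.groupby(rows, key=lambda r: r["z_cluster"]):
    for a, b in itertools.combinations(list(group), 2):
        score = min(float(a["z_minScore"]), float(b["z_minScore"]))
        edges.append((a["id"], b["id"], score))
print(edges)  # three scored edges among r1, r2, r3; r4 stands alone
```

From here, the edge list can be bulk-loaded through a TigerGraph loading job, after which transactional data can be joined onto the resolved entities.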
This approach is better than a graph-alone approach for a few reasons: Zingg learns similarity thresholds and attribute weights from a handful of labeled pairs instead of hand-crafted rules, it avoids exhaustive pair-wise comparison of attributes and so scales to large datasets, and its trained models can be reused on newer and incremental data. Thus, all the hard work of entity resolution can be offloaded.
By resolving entities through Zingg and leveraging them in TigerGraph, we combine the best of both worlds: easy and scalable entity resolution through Zingg, and further analysis of the graph in TigerGraph. This leaves us far more time to work on our core business needs, as the resolved entities can now serve both our analytics and our operational workloads.