Microsoft Fabric is a unified platform for data analytics that integrates storage, compute, and data orchestration into a seamless ecosystem. Combining Fabric’s powerful capabilities with Zingg’s entity resolution expertise felt like a natural progression for our team. However, as with any integration, this journey brought its share of triumphs, roadblocks, and valuable learnings.
Let us walk you through our experience—our initial expectations, the challenges we encountered, how we tackled them, and the takeaways that made this process so rewarding.
Why Microsoft Fabric?
Microsoft Fabric offers a single SaaS platform for data integration, engineering, analytics, and governance. With its unified compute and storage, Fabric simplifies data workflows and eliminates silos across data lakes, warehouses, and real-time streams. This is particularly useful for teams who want to run end-to-end data analytics without constantly switching between different tools or systems.
Some standout features we’ve come to appreciate include OneLake as a single shared data lake, a responsive notebook interface for Spark development, and unified compute and storage that keeps data integration, engineering, analytics, and governance in one place.
Given Microsoft's deep reach across organizations of all sizes, we started getting queries about Zingg's ability to unify entities directly on OneLake. The need to find duplicate records, deduplicate datasets, and stitch together fragmented records only grows as data comes together in one place.
Our goal was clear: make Zingg run seamlessly on Fabric and leverage entity resolution capabilities directly within Fabric. Sounds simple enough, right? Well, we soon discovered that it is the journey of integration that makes things truly interesting.
Setting Up Open Source Python APIs of Zingg on Fabric
The first part of the process involved setting up a proper environment in Fabric. We uploaded our Zingg JAR file and the entity resolution Python notebook to run the workflow. Fabric’s notebook interface worked well with our Python code, which was reassuring from the get-go.
However, we quickly realized that file management worked a little differently in Fabric. After uploading the required files, we discovered that they needed to be published to make them accessible to the environment. Without publishing, the system simply couldn’t find the files when we called them in the notebook. This was one of those subtle things that taught us to carefully follow Fabric’s file lifecycle.
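As a sanity check after publishing, a quick listing from the notebook confirms whether the environment can actually see the files. This is a minimal sketch, assuming the JAR lives in a hypothetical Files/zingg folder on the attached lakehouse; whether your runtime exposes the newer notebookutils helper or the older mssparkutils alias depends on your Fabric setup:

```python
# List files on the attached lakehouse to confirm the uploaded Zingg JAR
# and notebook resources are visible after publishing.
# "Files/zingg" is a placeholder folder name -- adjust to wherever you uploaded the JAR.
try:
    files = notebookutils.fs.ls("Files/zingg")   # newer Fabric notebook utility
except NameError:
    files = mssparkutils.fs.ls("Files/zingg")    # older alias, still present in many runtimes

for f in files:
    print(f.name, f.size)
```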
Troubleshooting Authentication and Permissions
As we progressed, we ran into authentication issues while connecting to storage accounts. Fabric’s security measures are robust, which is a good thing, but it meant we had to configure access carefully. This turned out to be a two-part process: first, granting the right permissions on the storage account, and second, validating that the storage account and workspace IDs in our configuration were correct.
There were moments when seemingly small errors, such as incorrect workspace or artifact names, threw us off track. For example, Fabric returned 400 errors indicating that “workspace IDs or artifact names” weren’t valid. These required careful debugging and rechecking of our configuration.
We eventually resolved these authentication roadblocks through a combination of Fabric documentation, trial and error, and collaborative problem-solving.
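One common pattern for this kind of storage authentication, and a useful way to verify permissions and account IDs in isolation, is to set the standard ABFS OAuth properties on the Spark session before reading anything. The sketch below is illustrative rather than our exact configuration: the storage account, container, tenant, and service principal values are placeholders, and in practice the client secret should come from a key vault rather than being hard-coded:

```python
# Configure OAuth access to an ADLS Gen2 account via a service principal.
# <storage-account>, <container>, <tenant-id>, <client-id>, <client-secret> are placeholders.
account = "<storage-account>.dfs.core.windows.net"

spark.conf.set(f"fs.azure.account.auth.type.{account}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}", "<client-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}", "<client-secret>")
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# A quick read verifies both the permissions and the account ID in one shot.
df = spark.read.parquet(f"abfss://<container>@{account}/path/to/data")
df.printSchema()
```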
Customizing Zingg Entity Resolution for Fabric
Zingg is flexible and powerful, but we needed to tweak some aspects to align it with Fabric’s environment. This included modifying Zingg’s JAR file and adjusting our existing Python notebooks. There were initial hiccups around session management—our notebooks occasionally failed to connect to the right Spark sessions.
To solve this, we made targeted updates to ensure that the Spark environment aligned with Fabric’s runtime requirements. This process required rigorous testing to validate that Zingg’s entity resolution pipelines ran efficiently within Fabric.
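One simple check that helps here is to reuse the Spark session Fabric already provides rather than creating a new one, and to confirm the runtime matches what the Zingg JAR was built against. Below is a minimal sketch of that check; the spark.jars inspection assumes the JAR was attached through the environment or session configuration:

```python
from pyspark.sql import SparkSession

# Fabric notebooks come with a pre-created Spark session; reuse it rather than
# building a second one, which is a common source of session mismatches.
spark = SparkSession.getActiveSession() or SparkSession.builder.getOrCreate()

# Confirm the runtime's Spark version lines up with the Zingg build.
print("Spark version:", spark.version)

# Check whether the Zingg JAR is visible to the driver.
print("Attached jars:", spark.sparkContext.getConf().get("spark.jars", "(none set)"))
```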
The Journey of Testing Data Matching and Learning
Once we had Zingg up and running, the next phase involved extensive testing. We ran multiple iterations of the entity resolution pipeline to validate performance, troubleshoot bugs, and ensure seamless integration. Along the way, we documented everything—each issue we encountered and each workaround we discovered—so we could create a step-by-step guide for running Zingg on Fabric.
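To give a sense of what a test iteration can look like, here is a trimmed-down notebook cell built on Zingg’s open source Python API. It is a sketch rather than our production pipeline: the field definitions, model ID, and lakehouse paths are placeholders, and exact class names and setters can differ between Zingg versions:

```python
from zingg.client import *
from zingg.pipes import *

# Describe the model and where Zingg should keep its state.
args = Arguments()
args.setModelId("100")                       # placeholder model ID
args.setZinggDir("Files/zingg/models")       # placeholder lakehouse path
args.setNumPartitions(4)
args.setLabelDataSampleSize(0.4)

# Which fields to compare, and how.
fname = FieldDefinition("fname", "string", MatchType.FUZZY)
lname = FieldDefinition("lname", "string", MatchType.FUZZY)
email = FieldDefinition("email", "string", MatchType.EXACT)
args.setFieldDefinition([fname, lname, email])

# Input and output pipes -- CSV on the lakehouse, purely as an example.
in_pipe = Pipe("customers_in", "csv")
in_pipe.addProperty("location", "Files/zingg/input/customers.csv")
in_pipe.addProperty("header", "true")
out_pipe = Pipe("customers_out", "csv")
out_pipe.addProperty("location", "Files/zingg/output")
args.setData(in_pipe)
args.setOutput(out_pipe)

# Run one phase per iteration: findTrainingData -> label -> train -> match.
options = ClientOptions([ClientOptions.PHASE, "match"])
client = ZinggWithSpark(args, options)
client.initAndExecute()
```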
This guide isn’t just for us; it’s a resource that will help other teams set up Zingg with fewer roadblocks.
Through testing, we also gained a deeper understanding of Fabric’s capabilities. Its notebook interface proved to be user-friendly and responsive, making it easy to monitor and fine-tune our Zingg workflows.
Reflections on the Process
The integration process has been a rewarding mix of challenges and victories. Microsoft Fabric truly is a powerful platform for data engineering and analytics, offering a level of simplicity and unification that makes it a joy to work with.
At the same time, learning to navigate its workflows, file management quirks, and security protocols has been an important part of the journey. The platform is still evolving, and as we continue working with Fabric, we’re confident that many of these rough edges will become smoother over time.
For us, this experience has been about much more than integrating two tools—it’s been about learning, adapting, and pushing boundaries. Making Zingg work on Fabric has given us valuable insights into both platforms, and we’re excited about what this combination can achieve.
What’s Next?
As we continue refining our setup, we are focusing on performance optimization and further testing. Our step-by-step guide is shaping up well, and we are confident it will serve as a useful resource for anyone looking to run entity and identity resolution on Fabric.
The journey doesn’t stop here. Fabric has opened up new possibilities for streamlining entity resolution, and we are excited to explore what’s next.
If you’re working on something similar, we’d love to hear about your experiences. Let’s learn, build, and innovate together!