Bringing together data from different sources such as websites, offline stores, and CRMs creates a common challenge: duplicate records. Without a shared way to identify customers, businesses end up with fragmented and inconsistent data, making it hard to get a clear picture of their audience. This problem has long been a hurdle for organizations, but with tools like Zingg and Apache Iceberg, it’s now easier to detect and merge duplicates, ensuring clean and reliable customer data.
Why Zingg?
Zingg is an open-source entity resolution tool designed for the modern data stack. By connecting to diverse data sources such as Databricks, Snowflake, Cassandra, AWS S3, and Azure, Zingg ensures simple integration and supports multiple formats like Parquet, JSON, and CSV. Its ability to deduplicate and unify records across data silos empowers organizations to create consistent and accurate views of customers, suppliers, and more.
Why Iceberg?
Apache Iceberg is a high-performance table format designed for large-scale analytics. It simplifies data management by supporting schema evolution, time travel, and efficient querying without manual partition management. Iceberg's ability to handle petabyte-scale datasets makes it a perfect fit for modern data lakes.
The Synergy of Zingg and Iceberg
Combining Zingg's powerful entity resolution with Iceberg's robust table architecture brings transformative benefits: deduplicated, consistent entity data stored in an open, scalable format with schema evolution and time travel built in.
Here is a step-by-step guide to running Zingg on Docker with Snowflake Iceberg tables.
Setting Up Zingg
Before setting up Zingg, we need to create a config.json file.
The config.json file is the primary configuration file used to set up Zingg's matching and output processing.
1. Field Definitions (fieldDefinition)
This section defines the fields involved in the matching process, their properties, and how they are used:
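A minimal sketch of this section, assuming a customer dataset with hypothetical fields fname, lname, and email (the matchType values shown are standard Zingg match types):

```json
"fieldDefinition": [
    { "fieldName": "fname", "matchType": "fuzzy", "fields": "fname", "dataType": "string" },
    { "fieldName": "lname", "matchType": "fuzzy", "fields": "lname", "dataType": "string" },
    { "fieldName": "email", "matchType": "email", "fields": "email", "dataType": "string" }
]
```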
2. Output Configuration (output)
Defines how and where the results of Zingg's matching process are stored:
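For Snowflake, the output is defined as a Zingg pipe that writes through the Spark Snowflake connector. A sketch with placeholder connection values (substitute your own account details; the table name is hypothetical):

```json
"output": [
    {
        "name": "unifiedCustomers",
        "format": "net.snowflake.spark.snowflake",
        "props": {
            "sfUrl": "<account>.snowflakecomputing.com",
            "sfUser": "<user>",
            "sfPassword": "<password>",
            "sfDatabase": "<database>",
            "sfSchema": "<schema>",
            "sfWarehouse": "<warehouse>",
            "dbtable": "UNIFIED_CUSTOMERS"
        }
    }
]
```

Here dbtable points at the Iceberg output table, which we create in Snowflake ahead of time (see Output Details below).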
3. Input Data Source (data)
Specifies the source data configuration:
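The data pipe follows the same structure as the output pipe, pointing at the source table instead; a sketch, again with placeholder values:

```json
"data": [
    {
        "name": "customers",
        "format": "net.snowflake.spark.snowflake",
        "props": {
            "sfUrl": "<account>.snowflakecomputing.com",
            "sfUser": "<user>",
            "sfPassword": "<password>",
            "sfDatabase": "<database>",
            "sfSchema": "<schema>",
            "sfWarehouse": "<warehouse>",
            "dbtable": "RAW_CUSTOMERS"
        }
    }
]
```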
Props Configuration
To integrate with Snowflake, we specify the required JAR files in a configuration file named props.conf. This ensures the correct dependencies are loaded when Zingg interacts with Snowflake.
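A sketch of props.conf, assuming the Snowflake JDBC driver and Spark connector JARs have been downloaded into the directory we will mount into the container (file names and versions are illustrative; use versions matching your Spark and Scala build):

```
spark.jars=/zingg-config/snowflake-jdbc-3.14.4.jar,/zingg-config/spark-snowflake_2.12-2.12.0-spark_3.4.jar
```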
Pulling the Zingg Docker Image
The Zingg image can be pulled from Docker Hub using the following command:
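```shell
docker pull zingg/zingg:0.4.0
```

The 0.4.0 tag is illustrative; check Docker Hub for the latest release.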
Starting the Docker Container
Mount the required directories and start the Zingg container:
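```shell
# /path/to/zingg-config is a hypothetical host directory holding
# config.json, props.conf, and the Snowflake JARs; it is mounted
# as /zingg-config inside the container.
docker run -v /path/to/zingg-config:/zingg-config -it zingg/zingg:0.4.0 bash
```

The commands in the phases below assume this /zingg-config mount.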
Workflow Phases
Find and Label Phase
During the findTrainingData phase, Zingg identifies sample record pairs, which users then label as matches or non-matches. These labeled samples are used to train Zingg's machine-learning model.
Run the findAndLabel phase using the following command:
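```shell
# Run from the Zingg installation directory inside the container;
# paths assume the /zingg-config mount shown earlier.
./scripts/zingg.sh --phase findAndLabel --conf /zingg-config/config.json --properties-file /zingg-config/props.conf
```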
Labeling Details
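Zingg's interactive labeler presents candidate pairs one at a time on the console; you typically respond with 0 for a non-match, 1 for a match, or 2 if unsure. The labeled pairs are saved under Zingg's model directory and feed the training phase.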
Training Phase
In the training phase, the labeled data is used to train Zingg's machine-learning models.
Run the training phase with:
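```shell
./scripts/zingg.sh --phase train --conf /zingg-config/config.json --properties-file /zingg-config/props.conf
```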
Matching Phase
The trained models are then used to predict duplicate or matching records within the dataset. Zingg generates an output table that includes both the original columns and additional metadata.
Run the matching phase using:
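```shell
./scripts/zingg.sh --phase match --conf /zingg-config/config.json --properties-file /zingg-config/props.conf
```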
Output Details
To store the output in an external Iceberg table, ensure the table is created in advance. The output includes the original columns plus Zingg's metadata columns: z_cluster, which groups records predicted to belong to the same entity, and z_minScore and z_maxScore, which capture the lowest and highest match scores of a record against the other records in its cluster.
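A sketch of pre-creating the output as a Snowflake-managed Iceberg table (the table name, columns, external volume, and base location are all placeholders for your environment):

```sql
CREATE ICEBERG TABLE unified_customers (
    fname      STRING,
    lname      STRING,
    email      STRING,
    z_cluster  STRING,   -- z_cluster's type may vary by Zingg version; match it to Zingg's output
    z_minScore DOUBLE,
    z_maxScore DOUBLE
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'my_iceberg_volume'
BASE_LOCATION = 'unified_customers/';
```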
Final Notes
This approach combines Zingg's intelligent deduplication and unification with Iceberg's ability to handle large datasets, making data easier to manage and clean. Docker provides a portable environment for running Zingg, while Snowflake ensures the data is stored and processed securely. Together, these technologies offer a modern, flexible, and easy-to-maintain setup that improves data quality, keeps data pipelines ready for future needs, and supports better data-driven decisions.