Step-by-Step Guide: Running Zingg with Snowflake Iceberg Tables

Bringing together data from different sources like websites, offline stores, and CRMs often creates a common challenge: duplicate records. Without a shared way to identify customers, businesses end up with fragmented and inconsistent data, making it hard to get a clear picture of their audience. This problem has long been a hurdle for organizations, but with tools like Zingg and Apache Iceberg, it's now easier to detect and merge duplicates, ensuring clean and reliable customer data.

Why Zingg?

Zingg is an open-source entity resolution tool designed for modern data stacks. By connecting to diverse data sources such as Databricks, Snowflake, Cassandra, AWS S3, and Azure, Zingg ensures simple integration and supports multiple formats like Parquet, JSON, and CSV. Its ability to deduplicate and unify records across data silos empowers organizations to create consistent and accurate data views of customers, suppliers, and more.

Why Apache Iceberg?

Apache Iceberg is a high-performance table format designed for large-scale analytics. It simplifies data management by supporting schema evolution, time travel, and efficient querying without manual partition management. Iceberg's ability to handle petabyte-scale datasets makes it a perfect fit for modern data lakes.

The Synergy of Zingg and Iceberg

Combining Zingg’s entity resolution with Iceberg’s robust data architecture delivers major advantages:

• Unified Data Mastering

Zingg performs intelligent entity resolution on Iceberg tables, deduplicating records to provide consistent entity views.

• Scalability & Performance

Iceberg delivers fast, optimized query execution, while Zingg's ML algorithms handle efficient matching at scale.

• Easy Integration

Zingg integrates directly with Iceberg tables—no complex transformations required.

• Enhanced Data Governance

You get data lineage, schema evolution, and trustworthy master data in one workflow.

• Future-Ready Architecture

The combination ensures scalable, adaptable pipelines ready for evolving data needs.

Step-by-Step Guide: Running Zingg on Docker with Snowflake Iceberg Tables

We will use Azure for storing the Iceberg data, with Snowflake managing the Iceberg tables. You can also use other services like AWS.

Setting Up Iceberg Tables on Snowflake

1. Create Azure Blob Storage Account

  • Select Primary service > Azure Blob storage
  • Keep the rest of the settings as default
  • Note the Storage Account name – this will be used later in the Snowflake setup
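
If you prefer the command line to the portal, a roughly equivalent storage account can be created with the Azure CLI; the resource group, region, and account name below are placeholders:

az storage account create \
  --name <storage-account-name> \
  --resource-group <resource-group> \
  --location <region> \
  --sku Standard_LRS \
  --kind StorageV2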


2. Create Storage Container

  • Navigate to Storage Accounts > Data storage > Containers
  • Create a new container
  • Note the container name – this will also be used in the Snowflake setup
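The container can likewise be created with the Azure CLI; the names are placeholders and should match the values you note down:

az storage container create \
  --name <container-name> \
  --account-name <storage-account-name> \
  --auth-mode login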


3. Setup Snowflake

  • Create a new Warehouse or use the default COMPUTE_WH warehouse
  • Create a Database for Iceberg
  • Create External Volume
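As a rough sketch, the warehouse, database, and external volume can be created with SQL like the following. The database name and tenant ID are placeholders, the external volume name azure_zingg matches the one verified later, and STORAGE_BASE_URL points at the storage account and container created above:

CREATE WAREHOUSE IF NOT EXISTS COMPUTE_WH;
CREATE DATABASE IF NOT EXISTS ICEBERG_DB;

CREATE OR REPLACE EXTERNAL VOLUME azure_zingg
  STORAGE_LOCATIONS = (
    (
      NAME = 'azure-zingg'
      STORAGE_PROVIDER = 'AZURE'
      STORAGE_BASE_URL = 'azure://<storage-account-name>.blob.core.windows.net/<container-name>/'
      AZURE_TENANT_ID = '<azure-tenant-id>'
    )
  );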


4. Connect External Volume to Azure

  • Run the command: DESC EXTERNAL VOLUME <volume_name>
  • In the returned property values, open the AZURE_CONSENT_URL in a browser to authorize Snowflake to access the Azure storage account
  • Note the portion of the AZURE_MULTI_TENANT_APP_NAME value that precedes the underscore
  • Once consent is granted, assign a role to that application (the AZURE_MULTI_TENANT_APP_NAME value before the underscore) on the storage account via Access control (IAM) → Add role assignment
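Putting these steps together, the flow looks roughly like this (the exact property names in the DESC output may vary by Snowflake version, so check what your account returns):

DESC EXTERNAL VOLUME azure_zingg;

-- In the STORAGE_LOCATION_1 property value, look for:
--   AZURE_CONSENT_URL            -> open in a browser and accept the consent prompt
--   AZURE_MULTI_TENANT_APP_NAME  -> keep only the part before the underscore; grant that
--                                   application a role on the storage account (typically
--                                   Storage Blob Data Contributor) via Access control (IAM)
--                                   -> Add role assignment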


5. Verify Connectivity

SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('azure_zingg');



6. Create Iceberg Table

Create your Iceberg table in Snowflake with the appropriate schema.

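A minimal sketch of a Snowflake-managed Iceberg table, assuming a simple customer schema similar to Zingg's example dataset (adjust the columns to your own data):

CREATE OR REPLACE ICEBERG TABLE CUSTOMERS (
  ID    STRING,
  FNAME STRING,
  LNAME STRING,
  ADD1  STRING,
  CITY  STRING,
  STATE STRING,
  DOB   STRING,
  SSN   STRING
)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'azure_zingg'
  BASE_LOCATION = 'customers/';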


7. Load Data into the Iceberg Table

In a real pipeline this data would arrive from an external source; here we simply load a small sample dataset from test.csv.

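One way to load the sample file is through an internal stage; the stage name is a placeholder and the target table is the one sketched above:

CREATE STAGE IF NOT EXISTS ZINGG_STAGE;

-- upload test.csv from your machine, e.g. with SnowSQL:
--   PUT file://test.csv @ZINGG_STAGE;

COPY INTO CUSTOMERS
  FROM @ZINGG_STAGE
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);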


Setting Up Zingg

Before setting up Zingg, you need to create a config.json file. This is the primary configuration file used to set up Zingg's matching and output processing; a sketch of the full file appears after the field descriptions below.

Configuration File Structure

1. Field Definitions (fieldDefinition)

This section defines the fields involved in the matching process, their properties, and how they are used:

  • fieldName: The name of the field being configured
  • matchType: Specifies the type of matching for the field:
    • fuzzy: Allows partial matches or minor differences (e.g., "Jon" vs. "John")
    • exact: Requires an exact match of the values
    • dont_use: Excludes the field from matching
  • fields: Refers to the field name in the dataset
  • dataType: Specifies the data type of the field, typically string for textual data

2. Output Configuration (output)

Defines how and where the results of Zingg's matching process are stored:

  • name: Logical name for the output
  • format: Specifies the output format; in this case, Snowflake's connector for Spark (net.snowflake.spark.snowflake)
  • props: Contains Snowflake connection details

3. Input Data Source (data)

Specifies the source data configuration:

  • name: Logical name for the input dataset
  • format: Input format, similar to the output format
  • props: Connection details, similar to the output configuration
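To make the structure concrete, here is a trimmed-down sketch of what such a config.json could look like. The field names follow the sample customer schema used earlier, the Snowflake connection values are placeholders, and the exact option spellings (for example how dataType is quoted, or keys such as labelDataSampleSize and modelId) should be checked against the Zingg documentation for your version:

{
  "fieldDefinition": [
    { "fieldName": "FNAME", "fields": "FNAME", "dataType": "string", "matchType": "fuzzy" },
    { "fieldName": "LNAME", "fields": "LNAME", "dataType": "string", "matchType": "fuzzy" },
    { "fieldName": "SSN",   "fields": "SSN",   "dataType": "string", "matchType": "exact" },
    { "fieldName": "ID",    "fields": "ID",    "dataType": "string", "matchType": "dont_use" }
  ],
  "output": [
    {
      "name": "output",
      "format": "net.snowflake.spark.snowflake",
      "props": {
        "sfUrl": "<account>.snowflakecomputing.com",
        "sfUser": "<user>",
        "sfPassword": "<password>",
        "sfDatabase": "<database>",
        "sfSchema": "<schema>",
        "sfWarehouse": "COMPUTE_WH",
        "dbtable": "UNIFIED_CUSTOMERS"
      }
    }
  ],
  "data": [
    {
      "name": "customers",
      "format": "net.snowflake.spark.snowflake",
      "props": {
        "sfUrl": "<account>.snowflakecomputing.com",
        "sfUser": "<user>",
        "sfPassword": "<password>",
        "sfDatabase": "<database>",
        "sfSchema": "<schema>",
        "sfWarehouse": "COMPUTE_WH",
        "dbtable": "CUSTOMERS"
      }
    }
  ],
  "labelDataSampleSize": 0.5,
  "numPartitions": 4,
  "modelId": 100,
  "zinggDir": "/zingg/models"
}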


Props Configuration

To integrate with Snowflake, specify the required JAR files in a configuration file named props.conf. This ensures the correct dependencies are loaded when Zingg interacts with Snowflake.

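At its simplest, this file just points Spark at the Snowflake JDBC driver and the Snowflake Spark connector JARs. The paths and versions below are placeholders and must match your Spark and Scala versions:

spark.jars=/path/to/snowflake-jdbc-<version>.jar,/path/to/spark-snowflake_2.12-<version>.jar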


Working with Docker

Pulling the Zingg Docker Image

The Zingg image can be pulled from Docker Hub using the following command.

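A sketch of the pull command; replace the tag with the Zingg release you want to run (for example, a tag such as 0.4.0):

docker pull zingg/zingg:<release-tag>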

Starting the Docker Container

Mount the required directories and start the Zingg container:

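A sketch of the run command, assuming config.json, props.conf, and a models directory live in the current directory; the mount points inside the container are illustrative and should follow the Zingg documentation for your image version:

docker run -it \
  -v $(pwd)/config.json:/zingg/config/config.json \
  -v $(pwd)/props.conf:/zingg/config/props.conf \
  -v $(pwd)/models:/zingg/models \
  zingg/zingg:<release-tag> bash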


Zingg Workflow Phases

Phase 1: Find and Label

Zingg selects sample record pairs during the findTrainingData phase, which users then label as matches or non-matches; these labeled samples are used to train Zingg's machine-learning model. The findAndLabel phase combines finding and labeling into a single run.
Run the findAndLabel phase using the following command:

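A sketch of the command; the exact script location and config path depend on the image version and on how the directories were mounted:

zingg.sh --phase findAndLabel --conf /zingg/config/config.json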

Labeling Details:

The interactive learner minimizes user effort by selecting representative pairs. Once the training-data job finishes, the pairs are presented for manual labeling as matches or non-matches.

Phase 2: Training

In the training phase, the labeled data is used to train Zingg's machine-learning models. Run the training phase with:

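Following the same pattern as before (paths depend on your mounts):

zingg.sh --phase train --conf /zingg/config/config.json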


Phase 3: Matching

The trained models are then used to predict duplicate or matching records within the dataset. Zingg generates an output table that includes both the original columns and additional metadata.
Run the matching phase using:

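Again assuming the same mounted config path:

zingg.sh --phase match --conf /zingg/config/config.json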


Understanding the Output

To store the output in an external Iceberg table, ensure the table is created in advance. The output includes:

  • Source Table Columns: Preserved in their original format
  • Z_CLUSTER: A unique identifier assigned to matching records, grouping duplicates
  • Z_MINSCORE: The lowest similarity score with which a record matched other records in its cluster
  • Z_MAXSCORE: The highest similarity score with which a record matched other records in its cluster
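Once the results land in the output table, a quick way to inspect the clusters is to sort by Z_CLUSTER; the table and column names below follow the placeholder configuration sketched earlier:

SELECT Z_CLUSTER, Z_MINSCORE, Z_MAXSCORE, FNAME, LNAME, CITY
FROM UNIFIED_CUSTOMERS
ORDER BY Z_CLUSTER, Z_MAXSCORE DESC;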

Final Notes

This approach combines Zingg's smart data deduplication and unification with Iceberg's ability to handle large datasets, making it easier to manage and clean up data. By using Docker, you get a portable environment to run Zingg, and Snowflake ensures your data is stored and processed securely.

Explore the Zingg GitHub repository to dig deeper into entity resolution and adapt this setup to your own datasets.

These technologies offer a modern, flexible, and easy-to-maintain setup that helps businesses manage their data more efficiently. This solution not only improves data quality but also ensures your data pipelines are ready for future needs, making it simpler to make better data-driven decisions.