Step-by-Step Guide: Running Zingg with Snowflake Iceberg Tables

Bringing together data from different sources like websites, offline stores, and CRMs often creates a common challenge: duplicate records. Without a shared way to identify customers, businesses end up with fragmented and inconsistent data, making it hard to get a clear picture of their audience. This problem has long been a hurdle for organizations, but with tools like Zingg and Apache Iceberg, it’s now easier to detect and merge duplicates, ensuring clean and reliable customer data.

Why Zingg?

Zingg is an open-source entity resolution tool designed for the modern data stack. By connecting to diverse data sources such as Databricks, Snowflake, Cassandra, AWS S3, and Azure, Zingg offers simple integration and supports multiple formats like Parquet, JSON, and CSV. Its ability to deduplicate and unify records across data silos empowers organizations to create consistent and accurate views of customers, suppliers, and more.

Why Iceberg?

Apache Iceberg is a high-performance table format designed for large-scale analytics. It simplifies data management by supporting schema evolution, time travel, and efficient querying without manual partition management. Iceberg's ability to handle petabyte-scale datasets makes it a perfect fit for modern data lakes.

The Synergy of Zingg and Iceberg

Combining Zingg's powerful entity resolution with Iceberg's robust data architecture brings transformative benefits:

  1. Unified Data Mastering: Zingg enables intelligent entity resolution, deduplicating records from Iceberg tables and ensuring a unified view of data entities across the organization.
  2. Scalability and Performance: Iceberg's optimized querying and Zingg's machine learning algorithms ensure efficient processing of large datasets, reducing latency and improving performance.
  3. Easy Integration: Zingg's support for Iceberg's table format allows organizations to leverage their existing Iceberg-based data lakes directly, without complex transformations.
  4. Enhanced Data Governance: Together, Zingg and Iceberg ensure better data governance by providing lineage, schema evolution, and robust data mastering capabilities.
  5. Future-Ready Architecture: This integration future-proofs data pipelines by combining Zingg's entity resolution with Iceberg's scalability, enabling businesses to adapt to evolving data needs.

Here is a step-by-step guide to running Zingg on Docker with Snowflake Iceberg tables:

  • We will use Azure to store the Iceberg data; Snowflake will manage the Iceberg tables.
  • Other cloud providers, such as AWS, can be used as well.

Setting Up Iceberg Tables on Snowflake

  • Create an Azure Blob Storage account:
    • Select Primary service > Azure Blob Storage; the remaining settings can be kept at their defaults.
    • Note the storage account name; it will be used later in the Snowflake setup.

  • Create a Storage Container:
    • Under Storage Accounts > Data storage > Containers, create a new container; this container name will also be used in the Snowflake setup.

  • Set up Snowflake:
    • Create a new warehouse (or use the default COMPUTE_WH warehouse) and a database for the Iceberg tables.
    • Create an External Volume, as shown in the sketch below.
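A minimal sketch of the CREATE EXTERNAL VOLUME statement for Azure; the storage account, container, and tenant ID are placeholders to replace with your own values:

```sql
-- Create an external volume backed by the Azure container created earlier.
-- <storage_account>, <container>, and <azure_tenant_id> are placeholders.
CREATE OR REPLACE EXTERNAL VOLUME azure_zingg
  STORAGE_LOCATIONS = (
    (
      NAME = 'azure_zingg_location'
      STORAGE_PROVIDER = 'AZURE'
      STORAGE_BASE_URL = 'azure://<storage_account>.blob.core.windows.net/<container>'
      AZURE_TENANT_ID = '<azure_tenant_id>'
    )
  );
```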


  • Connect the External Volume to Azure:
  1. Run the command DESC EXTERNAL VOLUME <volume_name>.
  2. Under the property values, find and open the URL in AZURE_CONSENT_URL to authorize Snowflake to use Azure storage.
  3. Note down the AZURE_MULTI_TENANT_APP_NAME value before the underscore.


  4. Once authorized, grant access in Azure: under the storage account, go to Access control (IAM) > Add role assignment and assign an appropriate role (for example, Storage Blob Data Contributor) to the application named by the AZURE_MULTI_TENANT_APP_NAME value before the underscore.

  5. Verify connectivity with SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('azure_zingg');


  • Create the Iceberg table:
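A sketch of creating a Snowflake-managed Iceberg table on the external volume; the table name and columns are illustrative assumptions for this walkthrough:

```sql
-- A Snowflake-managed Iceberg table stored on the external volume.
CREATE OR REPLACE ICEBERG TABLE CUSTOMERS (
  ID STRING,
  FNAME STRING,
  LNAME STRING,
  SSN STRING
)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'azure_zingg'
  BASE_LOCATION = 'customers';
```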

  • Load data into the Iceberg table. Normally this would be done from an external source, but here we simply load a small dataset from test.csv:
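One way to load the file, assuming test.csv is uploaded to a named stage (via SnowSQL's PUT or the Snowsight UI) and that your account supports COPY INTO for Iceberg tables:

```sql
-- Stage for the sample CSV; skip the header row on load.
CREATE OR REPLACE STAGE zingg_stage
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- From SnowSQL: PUT file:///path/to/test.csv @zingg_stage;

-- Copy the staged rows into the Iceberg table.
COPY INTO CUSTOMERS FROM @zingg_stage;
```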

Setup Zingg

Before setting up Zingg, we need to create a config.json file.

The config.json file is the primary configuration file used to set up Zingg's matching and output processing.

1. Field Definitions (fieldDefinition)

This section defines the fields involved in the matching process, their properties, and how they are used:

  • fieldName: The name of the field being configured.
  • matchType: Specifies the type of matching for the field:
    • fuzzy: Allows partial matches or minor differences (e.g., "Jon" vs. "John").
    • exact: Requires an exact match of the values.
    • dont_use: Excludes the field from matching.
  • fields: Refers to the field name in the dataset.
  • dataType: Specifies the data type of the field, typically string for textual data.

2. Output Configuration (output)

Defines how and where the results of Zingg's matching process are stored:

  • name: Logical name for the output.
  • format: Specifies the output format; in this case, Snowflake's connector for Spark (net.snowflake.spark.snowflake).
  • props: Contains the Snowflake connection details.

3. Input Data Source (data)

Specifies the source data configuration:

  • name: Logical name for the input dataset.
  • format: Input format, similar to the output format.
  • props: Connection details, similar to the output configuration.
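Putting the three sections together, a minimal sketch of config.json; all connection values are placeholders, the table names follow the earlier steps, and exact keys may vary with your Zingg version:

```json
{
  "fieldDefinition": [
    { "fieldName": "FNAME", "fields": "FNAME", "dataType": "string", "matchType": "fuzzy" },
    { "fieldName": "LNAME", "fields": "LNAME", "dataType": "string", "matchType": "fuzzy" },
    { "fieldName": "SSN",   "fields": "SSN",   "dataType": "string", "matchType": "exact" },
    { "fieldName": "ID",    "fields": "ID",    "dataType": "string", "matchType": "dont_use" }
  ],
  "output": [
    {
      "name": "unifiedCustomers",
      "format": "net.snowflake.spark.snowflake",
      "props": {
        "sfUrl": "<account>.snowflakecomputing.com",
        "sfUser": "<user>",
        "sfPassword": "<password>",
        "sfDatabase": "<database>",
        "sfSchema": "<schema>",
        "sfWarehouse": "COMPUTE_WH",
        "dbtable": "CUSTOMERS_UNIFIED"
      }
    }
  ],
  "data": [
    {
      "name": "customers",
      "format": "net.snowflake.spark.snowflake",
      "props": {
        "sfUrl": "<account>.snowflakecomputing.com",
        "sfUser": "<user>",
        "sfPassword": "<password>",
        "sfDatabase": "<database>",
        "sfSchema": "<schema>",
        "sfWarehouse": "COMPUTE_WH",
        "dbtable": "CUSTOMERS"
      }
    }
  ],
  "labelDataSampleSize": 0.5,
  "numPartitions": 4,
  "modelId": 100,
  "zinggDir": "/zingg/models"
}
```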

Props Configuration

To integrate with Snowflake, we specify the required JAR files in a configuration file named props.conf. This ensures the correct dependencies are loaded when Zingg interacts with Snowflake.

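A hypothetical props.conf; the jar names and versions are assumptions and must match your Spark and Scala versions:

```
# Snowflake JDBC driver and Spark connector (versions are illustrative)
spark.jars=/zingg/snowflake-jdbc-3.13.30.jar,/zingg/spark-snowflake_2.12-2.12.0-spark_3.3.jar
```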

Pulling the Zingg Docker Image

The Zingg image can be pulled from Docker Hub using the following command:

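For example, pulling a specific release (substitute the tag you need):

```
docker pull zingg/zingg:0.4.0
```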

Starting the Docker Container

Mount the required directories and start the Zingg container:

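A typical invocation, assuming config.json and props.conf sit in a local zingg-config directory and trained models should persist in a local models directory:

```
docker run -it \
  -v $PWD/zingg-config:/zingg-config \
  -v $PWD/models:/zingg/models \
  zingg/zingg:0.4.0 \
  bash
```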

Workflow Phases

Find and Label Phase

Zingg identifies sample record pairs during the findTrainingData phase, which users label as matches or non-matches. These labeled samples are then used to train Zingg's machine-learning model.

Run the findAndLabel phase using the following command:

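From inside the container, something like the following; the paths assume the mounts above, and --properties-file passes the Snowflake jars configured in props.conf:

```
./scripts/zingg.sh --phase findAndLabel \
  --conf /zingg-config/config.json \
  --properties-file /zingg-config/props.conf
```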

Labeling Details

  • The interactive learner minimizes user effort by selecting representative pairs.
  • Once the job finishes, proceed to the next step, where pairs are manually labeled.

Training Phase

In the training phase, the labeled data is used to train Zingg's machine-learning models.

Run the training phase with:

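With the same assumed paths as before:

```
./scripts/zingg.sh --phase train \
  --conf /zingg-config/config.json \
  --properties-file /zingg-config/props.conf
```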

Matching Phase

The trained models are then used to predict duplicate or matching records within the dataset. Zingg generates an output table that includes both the original columns and additional metadata.

Run the matching phase using:

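Again with the same assumed paths:

```
./scripts/zingg.sh --phase match \
  --conf /zingg-config/config.json \
  --properties-file /zingg-config/props.conf
```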

Output Details

To store the output in an external Iceberg table, ensure the table is created in advance. The output includes:

  • Source Table Columns: Preserved in their original format.
  • Z_CLUSTER: A unique identifier assigned to matching records, grouping duplicates.
  • Z_MINSCORE: Indicates the minimum similarity score for any record in a cluster.
  • Z_MAXSCORE: Indicates the maximum similarity score for any record in a cluster.
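Since the match phase writes to an existing table, a sketch of creating the output Iceberg table in advance; the name CUSTOMERS_UNIFIED and the Z_* column types are assumptions:

```sql
-- Output table: the original columns plus Zingg's cluster metadata.
CREATE OR REPLACE ICEBERG TABLE CUSTOMERS_UNIFIED (
  ID STRING,
  FNAME STRING,
  LNAME STRING,
  SSN STRING,
  Z_CLUSTER STRING,
  Z_MINSCORE DOUBLE,
  Z_MAXSCORE DOUBLE
)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'azure_zingg'
  BASE_LOCATION = 'customers_unified';
```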

Final Notes

This approach combines Zingg's smart data deduplication and unification with Iceberg's ability to handle large datasets, making it easier to manage and clean up data. Docker provides a portable environment for running Zingg, and Snowflake ensures your data is stored and processed securely. Together, these technologies offer a modern, flexible, and easy-to-maintain setup that helps businesses manage their data more efficiently. This solution not only improves data quality but also keeps your data pipelines ready for future needs, supporting better data-driven decisions.