Redica Systems is a data and analytics company focused primarily on the life sciences sector, serving pharmaceutical and medical technology (medtech) clients. It analyzes data from external sources such as health agencies, inspection reports, and regulatory guidelines, and provides solutions for quality improvement, compliance, and regulatory processes. With expertise in handling both structured and unstructured data, it enables customers to derive meaningful insights and make informed decisions in a complex and highly regulated life sciences industry.
Redica Systems' primary challenge lies in unifying structured and unstructured data from external agencies, including inspection reports and regulatory documents that often lack standardized identifiers. With global variations in data formats, incomplete records, and the need to deduplicate and cluster entities like manufacturing sites and investigators, the process demands scalable solutions that can handle large datasets while ensuring high accuracy, automation, and the flexibility to address edge cases.
Cheers. First of all, thank you for having me here. My name is Arijit Saha, though I also go by AJ; a lot of people in the industry will recognize me by that name. Currently, I am the Chief Technology Officer at Redica Systems. We are a data and analytics company in the life sciences vertical, catering primarily to pharmaceutical and medtech customers. My background over the last two decades has been mostly in data and AI. I started as a software developer building web applications, gradually moving into databases and the BI world. At that time, of course, there was no such terminology as data engineering, and none of today's AI buzz. I was lucky enough to get early exposure to a lot of machine learning, AI, and data problems in big enterprises. The last decade or so has been mostly in startups. At my previous startup, I started as a principal engineer and eventually took up the CTO role. For the past two and a half years, I've been with Redica Systems as the CTO.
Let me start with Redica Systems. As I mentioned, we are in the life sciences vertical, focusing on helping improve quality and compliance in this space. A lot of the data we deal with comes from external agencies, like health agencies, and third-party sources. Some of it is publicly available, while some is procured through proprietary means. The nature of this data is not like a typical enterprise system: we do not control how the data is produced. However, we need to bring insights and meaning out of that data to help our customers make better decisions about product quality in pharma and stay compliant with rules, regulations, and other requirements. Prior to Redica Systems, my background was primarily in structured data analytics and machine learning. But at Redica Systems, I faced the challenge of dealing with both structured and unstructured data. We handle a lot of textual data, such as documents from agencies, inspection reports, rules, regulations, and guidelines. To make this data more meaningful for our customers, we add structure on top of it. This helps with search and enhances the data so that we can pull specific information out of the documents.

One significant challenge is that similar information often comes from multiple sources, making identity resolution a first-class problem. It is a core issue for our business. However, we did not aim to develop an algorithm from scratch. That's when we got exposure to Zingg and started working with it, at a much earlier stage. This was two and a half years back, when you had just started out with the open-source community edition. We have business entities like manufacturing sites, distribution centers, laboratories, investigators from different agencies, and documents. Our task is not just to deduplicate but to create a golden source of data for our customers. We then assign a unique identifier, the Redica ID, to each entity. Many of our customers face similar issues: even though they're not dealing with this exact data source, within their own internal systems they often have multiple records for the same entity due to acquisitions or organic growth. One of the challenges is the lack of a global standard for identifiers; health agencies in different countries use different IDs. Our goal is to use the Redica ID as a global identifier to unify these objects into golden records, helping customers manage and track risk around them.
One of the problems we solve is around inspection intelligence. If you are part of a pharma company's quality team, you want to be prepared for any inspections or audits and stay ahead in terms of quality; you cannot be reactive. We provide these kinds of signals about your peers, your competitors, or even your suppliers. That brings in another component of the product, which is supplier or vendor intelligence. To do risk assessments, it's really important that we resolve identities so the risk is assigned to the right set of records. The third product revolves around regulatory monitoring: what changes are happening with rules, regulations, and standards, and how do they impact you? We help our customers stay compliant by surfacing intelligence related to that. We have also started looking at new domains like medical devices. We just started on that journey with post-market and pre-market intelligence. For example, before submitting a new device for approval, or after submission, how do you monitor for recalls or any adverse events?
Again, a lot of this is a combination of structured and unstructured data. Here, identity resolution links the entities to each other and creates a layer not just for search but also for analytics, aggregates, and key metrics. That, in short, is what we are dealing with.
We think of it as three layers of processing. First, we have a set of pre-processing steps. Then, for certain entities, we use Zingg with a machine learning approach for identity resolution. Finally, we have some post-processing to handle additional domain rules and bring everything together. Pre-processing mostly involves basic data cleaning and tasks like address normalization. The site entity I mentioned previously is typically a physical place, though there are nuances; it can also be a website or even a person's name. Generally, it has a name and an address. When you are dealing with addresses, especially global addresses, you realize they have numerous variations in how they are written. Not all countries follow similar standards, and in some cases, sources may not even provide full addresses. We perform address parsing and cleaning beforehand so the address can be used as a parameter in Zingg. Apart from that, we do some basic checks, like string similarity, and use them to achieve an initial deduplication.
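To make that pre-processing pass concrete, here is a minimal sketch of address normalization followed by a cheap string-similarity dedupe, in the spirit of what is described above. It is illustrative only: the column names (site_name, address, country), the threshold, and the use of pandas and rapidfuzz are assumptions, not Redica Systems' actual implementation.

```python
# Illustrative pre-processing pass: normalize names and addresses, then
# collapse the obvious near-duplicates with a string-similarity check so
# that only the harder cases flow into Zingg. Column names, the threshold,
# and the libraries used are assumptions for this sketch.
import re
import pandas as pd
from rapidfuzz import fuzz

ABBREVIATIONS = {r"\bst\b": "street", r"\brd\b": "road", r"\bblvd\b": "boulevard"}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and expand a few common abbreviations."""
    text = re.sub(r"[^a-z0-9\s]", " ", str(text).lower())
    for pattern, replacement in ABBREVIATIONS.items():
        text = re.sub(pattern, replacement, text)
    return re.sub(r"\s+", " ", text).strip()

def cheap_dedupe(df: pd.DataFrame, threshold: int = 95) -> pd.DataFrame:
    """Drop records whose normalized name + address are near-identical.

    Records are blocked by country so pairwise comparison stays tractable;
    anything ambiguous is deliberately left for the ML-based step.
    """
    df = df.assign(key=df["site_name"].map(normalize) + " " + df["address"].map(normalize))
    keep = []
    for _, block in df.groupby("country"):
        kept_keys = []
        for idx, key in zip(block.index, block["key"]):
            if any(fuzz.token_sort_ratio(key, prev) >= threshold for prev in kept_keys):
                continue  # near-duplicate of a record we already kept
            kept_keys.append(key)
            keep.append(idx)
    return df.loc[keep].drop(columns="key")
```

A pairwise check like this only works because it runs inside small per-country blocks and removes only the very obvious duplicates; the remaining, genuinely ambiguous records are exactly what the Zingg model is left to resolve.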
For example, in the case of sites, we deal with over 10 million records. Through pre-processing and deduplication, we can bring this number down to 1 million. However, this 1 million is the hardest part to solve with generic rule-based methods. That's when we rely on Zingg to further reduce it to around 350,000 or 400,000 clusters. After that, certain domain-specific rules come into play, which are difficult to handle in a standard way. These are very specific rules of the form "if this happens, then do these things". Keeping the Redica ID immutable is another crucial step. These tasks are handled in the post-processing stage, which further reduces the number to around 330,000 sites. So our journey is: 10 million to 1 million, 1 million to 400k, and 400k to 330k. It is still not perfect; some edge cases occur occasionally. A big aspect of this data processing is the human-in-the-loop element. Reviews and the ability to correct issues are essential. Our data platform integrates automation with a human-in-the-loop workflow, which allows for corrections, especially for edge cases that are extremely difficult to automate.
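To illustrate the Redica ID immutability step described here, the sketch below shows one common way to keep a cluster's identifier stable across runs: carry forward the existing ID that most of a new cluster's members already had, and mint a fresh ID only for clusters with no history. Zingg's match output includes a cluster identifier (z_cluster), but the other column names, the majority-vote rule, and the pandas approach are assumptions for illustration rather than the production post-processing logic.

```python
# Sketch of Redica ID carry-forward across runs: map each newly computed
# cluster to an existing ID where possible so identifiers stay immutable.
# Column names and the majority-vote rule are assumptions for this sketch.
import uuid
import pandas as pd

def assign_redica_ids(new_clusters: pd.DataFrame, previous_ids: pd.DataFrame) -> pd.DataFrame:
    """new_clusters: record_id, z_cluster (this run's Zingg cluster label).
    previous_ids: record_id, redica_id (last run's assignments).
    Returns one redica_id per z_cluster."""
    merged = new_clusters.merge(previous_ids, on="record_id", how="left")

    def resolve(ids: pd.Series) -> str:
        known = ids.dropna()
        if known.empty:
            # Entirely new entity: mint a fresh identifier that is never reused.
            return f"RDC-{uuid.uuid4().hex[:12].upper()}"
        # Majority vote keeps the ID stable even if a few records switched clusters.
        return known.mode().iloc[0]

    return (
        merged.groupby("z_cluster")["redica_id"]
        .apply(resolve)
        .rename("redica_id")
        .reset_index()
    )
```

In practice a rule like this also needs tie-breaking (for example, preferring the oldest ID) and an audit trail of merges and splits, which is where the human-in-the-loop review becomes essential.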
This is broadly the approach we are following for identity resolution. Some objects are simple enough that we don't need the machine learning approach; basic deduplication with domain-specific business rules is sufficient. However, as we deal with more data and enter new domains, we are encountering more entities where the approach used for sites becomes necessary. In such cases, a machine learning or AI-based approach is often better. It's not just about handling volume; it's also about identifying very complicated patterns.
I'll talk about our general data platform stack. At the center of everything, we use Snowflake as our data warehouse. I'd say it's kind of a lakehouse architecture, but mostly focused on the data warehouse side. We are on AWS, so we use S3 extensively, especially as an object store; a lot of documents and other files are stored on S3. For transformations, we use dbt. Additionally, we use other analytics tools like Sigma, which sits on top of Snowflake and provides a spreadsheet-like interface, giving us the ability to interact with the data effectively.
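For a flavor of how pieces like this fit together, here is a hedged sketch of moving entity-resolution output that has landed in S3 into Snowflake, where dbt models and the BI tools can build on it. The stage, table, and connection details are invented for illustration; the real pipeline almost certainly differs.

```python
# Illustrative load step: copy resolved-entity Parquet files from an S3
# external stage into a Snowflake table. Stage, table, and connection
# names are assumptions for this sketch, not the actual pipeline.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="TRANSFORM_WH",
    database="ANALYTICS",
    schema="ENTITY_RESOLUTION",
)
cur = conn.cursor()
try:
    cur.execute("""
        COPY INTO resolved_sites
        FROM @zingg_output_stage/sites/
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """)
finally:
    cur.close()
    conn.close()
```

From there, dbt models on top of the loaded table handle the downstream transformations, and tools like Sigma query those models rather than the raw load.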
One aspect of building this platform from scratch was human observability right from the start. Exposing data to the business as early as possible helps the technical teams understand the business nuances of the data. It also enables us to take corrective action much earlier in the lifecycle, rather than waiting for bugs to appear at a later stage. We also use another visualization tool, GoodData, which is more for embedded analytics. Our application was initially built using PHP, but we are now transitioning to a Python-based back end and a React front end. That's the high-level overview of our tech stack.
It was actually very quick. In 2022, we rebuilt the entire data platform in nine months. Within that time, we spent a couple of sprints on Zingg, just to set it up and quickly get it running for sites. We went through the documentation quickly and reached out to you with certain questions, implementing Zingg as we went. Out of the box, the benefit we saw was good enough for us to get started. Over time, we have taken a few more cycles to optimize it, especially around pre-processing and post-processing. In general, our Zingg implementation has been based on the knowledge we initially got from your documentation. That's why I'm curious to revisit things and see some of the latest innovations you've been working on, especially regarding your enterprise product. Maybe there are things we're missing, but for now, I'd say the accuracy has been good enough. As I said, I don't want to put an exact number on it, but compared to the previous approach, it's obviously much better, probably more than 90% better. However, there are still some edge cases that are very difficult to optimize, and we will probably always struggle with those. Once you reach 90+% accuracy, the last few percentage points, the last 1-2%, are the most difficult to optimize. That's what we want to focus on next year: see what other improvements we can make and bring in other entities where we're still using rule-based systems for identity resolution.
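For readers who haven't seen a Zingg configuration, the sketch below follows the structure of the open-source project's published Python examples: field definitions, a model directory, input and output pipes, and a phase to run. The field names, paths, and numbers are illustrative assumptions, not Redica Systems' actual setup, and the exact API can vary between Zingg versions.

```python
# Illustrative Zingg (community edition) job definition, modeled on the
# project's published Python examples. Field names, match types, paths,
# sample sizes, and the model id are assumptions for this sketch.
from zingg.client import *
from zingg.pipes import *

args = Arguments()
args.setFieldDefinition([
    FieldDefinition("site_name", "string", MatchType.FUZZY),
    FieldDefinition("address", "string", MatchType.FUZZY),
    FieldDefinition("country", "string", MatchType.EXACT),
])
args.setModelId("100")
args.setZinggDir("/models")
args.setNumPartitions(16)
args.setLabelDataSampleSize(0.1)

schema = "record_id string, site_name string, address string, country string"
args.setData(CsvPipe("sites_in", "s3://my-bucket/sites/cleaned/", schema))
args.setOutput(CsvPipe("sites_out", "s3://my-bucket/sites/resolved/"))

# Phases run separately: findTrainingData and label to train the model
# interactively, then match to cluster the full data set.
options = ClientOptions([ClientOptions.PHASE, "match"])
Zingg(args, options).initAndExecute()
```

A script like this is typically submitted through Zingg's runner on a Spark cluster (for example, zingg.sh on EMR) rather than run as a plain Python process.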
Currently, we process the data twice a week. At the moment, we do full processing for entity resolution: the entire batch is processed, and then, as part of post-processing, we keep the Redica ID the same for existing clusters. We are looking at options to move toward incremental processing going forward, but for now, it's twice a week and the entire data set goes through the process. Earlier, it used to take a very long time, almost 5-6 hours to process just one entity. Now, we have brought the entire pipeline, including the Zingg processes, down to 40-45 minutes. Our data doesn't change frequently, but we want to introduce incremental capabilities, especially event-driven processing. The way the data arrives is out of our control, but when something happens or an event occurs, we want to process it quickly and make it available, rather than waiting for a scheduled batch process. These are some of the improvements the team is currently focused on.
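To make the event-driven idea concrete, here is one hedged sketch on the AWS setup described in this conversation: an S3 notification triggers a Lambda function that submits a Zingg match step to the existing EMR cluster instead of waiting for the twice-weekly batch. The cluster ID, paths, and step arguments are assumptions for illustration only.

```python
# Sketch of an event-driven trigger: when a new source file lands in S3,
# a Lambda submits the identity-resolution step to EMR rather than waiting
# for the scheduled batch. Cluster id, paths, and arguments are assumptions.
import boto3

emr = boto3.client("emr")

def handler(event, context):
    """AWS Lambda entry point for S3 ObjectCreated notifications."""
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        emr.add_job_flow_steps(
            JobFlowId="j-XXXXXXXXXXXX",  # assumption: a long-running EMR cluster
            Steps=[{
                "Name": f"zingg-match:{key}",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "bash", "-c",
                        "/opt/zingg/scripts/zingg.sh --phase match "
                        "--conf /opt/zingg/config/sites.json",
                    ],
                },
            }],
        )
```

The harder part, as noted above, is making the resolution itself incremental so that one new record does not force a full re-clustering; the trigger mechanism is the easy half.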
First of all, thank you for setting this up. At that time, it seemed early, but I liked the approach, and the community was very helpful. My team reached out many times and got assistance from both the community and you, and you were always very approachable. That made it an easier decision for us to adopt Zingg. I've been following the exciting journey you're on, with all the new developments, including the Zingg Enterprise version. Since you now have a native approach to run Zingg on Snowflake using Zingg Enterprise, we are very interested. Right now, we use Amazon EMR (Elastic MapReduce) to run the Zingg community version.
I am looking forward to our collaboration and seeing what else we can do with Zingg to improve overall data quality. As I mentioned, the Redica ID is a key component of our data and a core value proposition for our customers. We are very serious about it and want to ensure we're doing everything possible to establish it as a global standard. The Redica ID, in turn, helps the broader community because it's ultimately about patient health, better medicines, and similar outcomes. I like that purpose and how the data delivers value, not just as a business, but for the greater good of the world. In that sense, identity resolution plays a significant role in solving this data puzzle.
I’m happy to chat about Zingg anytime. I'm such a big fan. So, thank you.