Intelligence | March 14, 2026 | 11 min read

Entity Resolution: Connecting the Dots Across Data Sources

Fuzzy matching, confidence scoring, and graph-based deduplication at scale

Prismatic Engineering

Prismatic Platform

Why Entity Resolution Matters


When investigating a company, data arrives from dozens of sources: the Czech

ARES registry returns "Navigara s.r.o.", the EU sanctions list references

"NAVIGARA SRO", and a WHOIS record shows "Navigara, s.r.o." as the domain

registrant. Are these the same entity? Entity resolution answers this question

systematically across millions of records.


Without proper entity resolution, investigations produce fragmented profiles

with duplicate entries, missed connections, and unreliable risk scores. The

Prismatic entity resolution system addresses this with a multi-stage pipeline

combining fuzzy matching, confidence scoring, and graph-based analysis.


Fuzzy Matching Strategies


The first stage applies multiple string similarity algorithms in parallel.

Each algorithm captures different aspects of name similarity:


Jaro-Winkler distance works well for short strings and handles common

typos. It gives higher weight to matching prefixes, which is useful for

company names that share a root but differ in legal suffixes.


Levenshtein distance counts the minimum edits needed to transform one

string into another. It handles insertions, deletions, and substitutions

but struggles with transpositions.


Token-based comparison splits names into tokens, normalizes each one

(removing legal suffixes like s.r.o., a.s., GmbH, Ltd), and computes the

Jaccard similarity of the token sets. This handles word reordering and

partial matches effectively.


Phonetic matching using a Czech-aware Soundex variant catches names

that sound similar but are spelled differently, which is common when

names are transliterated from Cyrillic or other scripts.
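Two of the strategies above, edit distance and token-set comparison, fit in a few lines. This is an illustrative Python sketch, not the production pipeline; the suffix list and helper names are invented for the example:

```python
import re

# Illustrative subset of legal-form suffixes; the real list is much longer.
LEGAL_SUFFIXES = {"sro", "as", "gmbh", "ltd", "llc", "inc"}

def levenshtein(a: str, b: str) -> int:
    """Minimum insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def tokens(name: str) -> set[str]:
    """Lowercase, strip punctuation, and drop legal-form suffixes."""
    words = re.findall(r"[a-z0-9]+", name.lower().replace(".", ""))
    return {w for w in words if w not in LEGAL_SUFFIXES}

def jaccard(name_a: str, name_b: str) -> float:
    """Jaccard similarity of the normalized token sets."""
    ta, tb = tokens(name_a), tokens(name_b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

After suffix normalization, the three registry spellings from the introduction collapse to the same token set, which is exactly the word-reordering and legal-suffix robustness the token-based strategy is for.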


Confidence Scoring with Nabla


Raw similarity scores from fuzzy matching are transformed into calibrated

confidence values using the Nabla epistemic framework. Nabla distinguishes

between two types of uncertainty:


Epistemic uncertainty reflects our lack of knowledge. When we have only

a company name and no IČO (the Czech company identification number), the uncertainty is epistemic and

can be reduced by fetching additional identifiers.


Aleatoric uncertainty reflects inherent randomness in the data. Two

genuinely different companies may have similar names, and no amount of

additional data will change that. The system must accept this baseline

uncertainty.


The Nabla confidence score combines both uncertainty types into a single

value between 0.0 and 1.0, with calibration ensuring that a 0.8 confidence

means roughly 80% of matches at that level are correct.
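Nabla's internals are beyond the scope of this post, but the calibration property can be illustrated with a simple binned estimator: raw scores are bucketed, and each bucket reports the empirical precision observed on labelled pairs. The function names and fixed-width binning are assumptions of this sketch, not the framework's actual design:

```python
def fit_calibration(scored_pairs, num_bins=10):
    """Estimate per-bin precision from (raw_score, is_true_match) pairs."""
    hits = [0] * num_bins
    totals = [0] * num_bins
    for score, label in scored_pairs:
        b = min(int(score * num_bins), num_bins - 1)
        totals[b] += 1
        hits[b] += int(label)
    return [h / t if t else 0.0 for h, t in zip(hits, totals)]

def calibrated(raw_score, bin_precisions):
    """Replace a raw similarity with the empirical precision of its bin."""
    b = min(int(raw_score * len(bin_precisions)), len(bin_precisions) - 1)
    return bin_precisions[b]
```

Under this scheme a calibrated 0.8 literally means that about 80% of historically reviewed matches in that score range were confirmed correct, which is the property described above.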


Relationship Graphs with KuzuDB


Once entities are resolved, their relationships are stored in KuzuDB, an

embedded graph database. Nodes represent entities (persons, companies,

domains) and edges represent relationships (owns, directs, registered_at,

associated_with).


Graph queries enable powerful investigative patterns:



// Find all companies within 2 hops of a sanctioned entity
MATCH path = (s:Entity {sanctioned: true})-[*1..2]-(c:Entity {type: 'company'})
RETURN c.name, c.ico, length(path) AS distance
ORDER BY distance ASC
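Outside the database, the same bounded-hop pattern is just a breadth-first search cut off at a maximum depth. A minimal Python sketch over a toy adjacency map (the entity names and graph are invented for illustration):

```python
from collections import deque

def within_hops(adjacency, start, max_hops):
    """Return {node: distance} for every node within max_hops of start."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances

# Toy graph: a sanctioned entity linked to companies through a holding company.
graph = {
    "sanctioned_co": ["holding_co"],
    "holding_co": ["sanctioned_co", "navigara", "shelf_co"],
    "navigara": ["holding_co"],
    "shelf_co": ["holding_co"],
}
```

The database version wins at scale because the traversal runs next to the data with indexes on both node and edge tables, but the semantics are the same.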


The graph structure also supports community detection algorithms that

identify clusters of related entities, revealing corporate structures

and beneficial ownership chains that are invisible in tabular data.


Deduplication Strategies


Entity resolution is not a one-time operation. As new data arrives, the

system must continuously merge duplicates while preserving provenance.

The deduplication pipeline uses a three-stage approach:


Blocking reduces the comparison space by grouping entities that share

a common attribute (same country, similar name prefix, overlapping address).

Without blocking, comparing N entities requires O(N²) comparisons.


Pairwise comparison applies the full fuzzy matching pipeline to each

pair within a block. Pairs exceeding the confidence threshold are marked

as potential duplicates.


Transitive closure merges chains of duplicates. If A matches B and

B matches C, all three are merged into a single canonical entity even

if A and C would not have matched directly.
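The three stages can be sketched end to end with a union-find structure supplying the transitive closure. Everything here, the normalization rule, the prefix block key, and the equality matcher, is a simplified stand-in for the real pipeline:

```python
import re
from collections import defaultdict

LEGAL_FORMS = {"sro", "as", "gmbh", "ltd"}  # illustrative subset

def norm(name: str) -> str:
    """Lowercase, strip punctuation, drop legal-form suffixes."""
    words = re.findall(r"[a-z0-9]+", name.lower().replace(".", ""))
    return " ".join(w for w in words if w not in LEGAL_FORMS)

class UnionFind:
    """Tracks merged clusters; find() follows parents with path halving."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def dedupe(names, is_match):
    # Stage 1: blocking on a cheap key (first 3 chars of the normalized name).
    blocks = defaultdict(list)
    for n in names:
        blocks[norm(n)[:3]].append(n)
    # Stage 2: pairwise comparison within each block only.
    uf = UnionFind()
    for members in blocks.values():
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                if is_match(a, b):
                    uf.union(a, b)
    # Stage 3: transitive closure falls out of the union-find roots.
    clusters = defaultdict(set)
    for n in names:
        clusters[uf.find(n)].add(n)
    return list(clusters.values())
```

Because merges go through union-find rather than direct pair links, a chain A–B–C collapses into one cluster even when A and C never meet in a pairwise comparison.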


Practical Considerations


Entity resolution at scale requires careful attention to performance.

The blocking stage reduces comparisons from O(n²) to roughly O(n),

but the constant factor matters when processing hundreds of thousands of

records. The system uses ETS tables for intermediate results and streams

comparisons through Task.async_stream/3 with bounded concurrency to

avoid overwhelming the BEAM scheduler.
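The production pipeline runs on the BEAM (Task.async_stream/3 over ETS-backed intermediates); as a rough cross-language analog, the same bounded-concurrency streaming shape looks like this in Python, where the pool size and scoring stub are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def score_pair(pair):
    """Stub for the full fuzzy-matching pipeline; returns (a, b, score)."""
    a, b = pair
    return (a, b, 1.0 if a.lower() == b.lower() else 0.0)

def stream_comparisons(pairs, max_workers=8):
    """Score candidate pairs with a bounded worker pool, yielding in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        yield from pool.map(score_pair, pairs)
```

Bounding the pool matters for the same reason bounding Task.async_stream's concurrency does: the comparison stage is CPU-heavy, and unbounded fan-out starves the scheduler instead of speeding anything up.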


False positive rates are monitored continuously. When the system merges

two entities, it records the decision with full provenance so that analysts

can review and override incorrect merges. This feedback loop improves the

matching algorithms over time.


Tags

entity-resolution fuzzy-matching kuzudb nabla graph deduplication