Entity Resolution: Connecting the Dots Across Data Sources
Fuzzy matching, confidence scoring, and graph-based deduplication at scale
Prismatic Engineering
Why Entity Resolution Matters
When investigating a company, data arrives from dozens of sources: the Czech
ARES registry returns "Navigara s.r.o.", the EU sanctions list references
"NAVIGARA SRO", and a WHOIS record shows "Navigara, s.r.o." as the domain
registrant. Are these the same entity? Entity resolution answers this question
systematically across millions of records.
Without proper entity resolution, investigations produce fragmented profiles
with duplicate entries, missed connections, and unreliable risk scores. The
Prismatic entity resolution system addresses this with a multi-stage pipeline
combining fuzzy matching, confidence scoring, and graph-based analysis.
Fuzzy Matching Strategies
The first stage applies multiple string similarity algorithms in parallel.
Each algorithm captures different aspects of name similarity:
Jaro-Winkler distance works well for short strings and handles common
typos. It gives higher weight to matching prefixes, which is useful for
company names that share a root but differ in legal suffixes.
Levenshtein distance counts the minimum edits needed to transform one
string into another. It handles insertions, deletions, and substitutions
but struggles with transpositions.
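The edit distance above is the classic dynamic-programming computation. A minimal Python sketch (illustrative, not the production implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))  # row 0: distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]
```

Note that a transposition such as "Navigara" vs "Naviagra" costs two edits under this metric, which is one reason the pipeline never relies on Levenshtein alone.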
Token-based comparison splits names into tokens, normalizes each one
(removing legal suffixes like s.r.o., a.s., GmbH, Ltd), and computes the
Jaccard similarity of the token sets. This handles word reordering and
partial matches effectively.
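A sketch of this token-based comparison in Python (the suffix list here is abbreviated for illustration; a real normalizer covers many more legal forms):

```python
import re

LEGAL_SUFFIXES = {"sro", "as", "gmbh", "ltd", "llc", "spol"}  # abbreviated list

def tokens(name: str) -> set[str]:
    """Lowercase, collapse dotted abbreviations ("s.r.o." -> "sro"),
    strip remaining punctuation, and drop legal-form suffixes."""
    cleaned = name.lower().replace(".", "")
    words = re.sub(r"[^\w\s]", " ", cleaned).split()
    return {w for w in words if w not in LEGAL_SUFFIXES}

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the normalized token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

Under this normalization, "Navigara s.r.o.", "NAVIGARA SRO", and "Navigara, s.r.o." all reduce to the same token set and score 1.0.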
Phonetic matching using a Czech-aware Soundex variant catches names
that sound similar but are spelled differently, which is common when
names are transliterated from Cyrillic or other scripts.
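The Czech-aware variant is internal to the system, but the classic English Soundex conveys the flavor of the approach: names are reduced to a letter-plus-digits code so that similar-sounding spellings collide.

```python
def soundex(name: str) -> str:
    """Classic (English) Soundex, shown for illustration only; the
    production system uses a Czech-aware variant with different
    letter groupings."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    out, prev = letters[0].upper(), codes.get(letters[0], "")
    for c in letters[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            out += code
        if c not in "hw":  # h/w are transparent; vowels reset the run
            prev = code
    return (out + "000")[:4]
```

"Robert" and "Rupert" both encode to R163, which is the kind of collision that catches transliteration variants.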
Confidence Scoring with Nabla
Raw similarity scores from fuzzy matching are transformed into calibrated
confidence values using the Nabla epistemic framework. Nabla distinguishes
between two types of uncertainty:
Epistemic uncertainty reflects our lack of knowledge. When we have only
a company name and no ICO (the Czech company identification number), the
uncertainty is epistemic and can be reduced by fetching additional identifiers.
Aleatoric uncertainty reflects inherent randomness in the data. Two
genuinely different companies may have similar names, and no amount of
additional data will change that. The system must accept this baseline
uncertainty.
The Nabla confidence score combines both uncertainty types into a single
value between 0.0 and 1.0, with calibration ensuring that a 0.8 confidence
means roughly 80% of matches at that level are correct.
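Nabla's internals aren't shown here, but the calibration property is checkable empirically: bucket historical match decisions by predicted confidence and compare each bucket's prediction to its observed precision. A hypothetical sketch (the `labelled_matches` input is illustrative, not a real API):

```python
from collections import defaultdict

def calibration_report(labelled_matches, bucket_width=0.1):
    """labelled_matches: iterable of (predicted_confidence, was_correct).
    Returns {bucket_lower_bound: observed_fraction_correct}, so a
    well-calibrated 0.8 bucket should show roughly 0.8."""
    buckets = defaultdict(list)
    n_buckets = int(1 / bucket_width)
    for conf, correct in labelled_matches:
        # clamp 1.0 into the top bucket
        lower = min(int(conf / bucket_width), n_buckets - 1) * bucket_width
        buckets[round(lower, 1)].append(correct)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}
```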
Relationship Graphs with KuzuDB
Once entities are resolved, their relationships are stored in KuzuDB, an
embedded graph database. Nodes represent entities (persons, companies,
domains) and edges represent relationships (owns, directs, registered_at,
associated_with).
Graph queries enable powerful investigative patterns:
-- Find all companies within 2 hops of a sanctioned entity
MATCH (s:Entity {sanctioned: true})-[r*1..2]-(c:Entity {type: 'company'})
RETURN c.name, c.ico, length(r) AS distance
ORDER BY distance ASC
The graph structure also supports community detection algorithms that
identify clusters of related entities, revealing corporate structures
and beneficial ownership chains that are invisible in tabular data.
Deduplication Strategies
Entity resolution is not a one-time operation. As new data arrives, the
system must continuously merge duplicates while preserving provenance.
The deduplication pipeline uses a three-stage approach:
Blocking reduces the comparison space by grouping entities that share
a common attribute (same country, similar name prefix, overlapping address).
Without blocking, comparing N entities requires N(N-1)/2 pairwise
comparisons, which is infeasible at scale.
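A sketch of blocking on a composite key (country plus a short name prefix; the key choice is illustrative):

```python
from collections import defaultdict
from itertools import combinations

def block_key(entity: dict) -> tuple:
    """Group entities that share a country and a 4-char name prefix."""
    return (entity["country"], entity["name"].lower()[:4])

def candidate_pairs(entities):
    """Yield only within-block pairs instead of all N(N-1)/2 pairs."""
    blocks = defaultdict(list)
    for e in entities:
        blocks[block_key(e)].append(e)
    for group in blocks.values():
        yield from combinations(group, 2)
```

For three entities where only the two Czech records share a block, this emits one candidate pair instead of three.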
Pairwise comparison applies the full fuzzy matching pipeline to each
pair within a block. Pairs exceeding the confidence threshold are marked
as potential duplicates.
Transitive closure merges chains of duplicates. If A matches B and
B matches C, all three are merged into a single canonical entity even
if A and C would not have matched directly.
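Transitive closure over pairwise matches is a union-find (disjoint-set) problem. A minimal sketch:

```python
def merge_clusters(match_pairs):
    """Union-find: given pairs judged to be duplicates, return the
    resulting clusters, each as a frozenset of entity ids."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in match_pairs:
        parent[find(a)] = find(b)  # union the two roots

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return {frozenset(c) for c in clusters.values()}
```

Given the matches (A, B) and (B, C), all three ids land in one cluster even though A and C were never compared directly.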
Practical Considerations
Entity resolution at scale requires careful attention to performance.
The blocking stage reduces comparisons from O(n²) to roughly O(n),
but the constant factor matters when processing hundreds of thousands of
records. The system uses ETS tables for intermediate results and streams
comparisons through Task.async_stream/3 with bounded concurrency to
avoid overwhelming the BEAM scheduler.
False positive rates are monitored continuously. When the system merges
two entities, it records the decision with full provenance so that analysts
can review and override incorrect merges. This feedback loop improves the
matching algorithms over time.