“Information is the new oil” has become a popular refrain, and like oil, crude data must be refined before it can be consumed. In other words, having big data serves no purpose unless the data is good enough to be useful. With the potential for mismatches, duplication, and other quality threats when ingesting data from disparate sources, ensuring the accuracy and quality of data is more important than ever.
This is where big data meets Master Data Management (MDM). Working on the principle of “better safe than sorry,” MDM users can apply data matching techniques to resolve some data quality conflicts. These techniques enable users to determine which data is “most likely” to be correct and, if not perfect, at least at a “fit for purpose” level of quality. This post discusses two matching techniques, Deterministic Matching and Probabilistic, or “Fuzzy,” Matching, in the context of big data.
Describing Deterministic and Probabilistic Matching
Deterministic Matching compares given identifiers on two data records, applies weighting calculations, and determines a precise match, generating a score that indicates whether the records match or not. In contrast, Probabilistic Matching takes into account the relative closeness of the data and the context of the data records, and assigns each identifier a weighted score for the likelihood that the records match.
While Deterministic Matching generates exact match and non-match relevancy scores, Probabilistic Matching engines calculate only a composite weight based on the scores of the individual identifiers.
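To make the contrast concrete, here is a minimal Python sketch of the two approaches on hypothetical customer records. The field names, weights, and the idea of an acceptance threshold are illustrative assumptions, not taken from any particular MDM product: deterministic matching yields a binary outcome on exact identifiers, while probabilistic matching sums weighted agreements into a composite score.

```python
def deterministic_match(a: dict, b: dict, fields=("ssn",)) -> bool:
    """Exact comparison on agreed identifiers: a binary match / no-match decision."""
    return all(f in a and f in b and a[f] == b[f] for f in fields)

def probabilistic_match(a: dict, b: dict, weights=None) -> float:
    """Composite weight: each identifier contributes its weight when it agrees."""
    weights = weights or {"last_name": 0.4, "zip": 0.25, "birth_year": 0.35}
    return sum(w for f, w in weights.items() if a.get(f) and a.get(f) == b.get(f))

rec1 = {"ssn": "123-45-6789", "last_name": "Smith", "zip": "07030", "birth_year": "1980"}
rec2 = {"ssn": "123-45-6788", "last_name": "Smith", "zip": "07030", "birth_year": "1980"}

deterministic_match(rec1, rec2)   # False: one digit differs in the SSN
probabilistic_match(rec1, rec2)   # 1.0: all weighted identifiers agree
```

Note that the probabilistic score on its own decides nothing; the user chooses the threshold (say, accept at 0.6 and above) and thereby chooses how much risk of a false match to tolerate.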
Pros and Cons for Deterministic and Probabilistic Matching
Deterministic Matching is more vulnerable to false-positive and false-negative outcomes, while Probabilistic Matching is more resilient due to its embedded concept of “likelihood,” which shifts the tolerance-level risk to the user, who decides what level of plausibility will be accepted as a match.
Deterministic Matching can be prohibitive when working with big data if you are looking for exact matches across a large data set within a realistic time frame. However, there are techniques and algorithms, such as hashing, that can apply Deterministic Matching to big data fairly effectively, depending on the domain.
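As a sketch of the hashing idea (the field names and normalization rule here are assumptions for illustration), each record can be reduced to a hash of its normalized match fields. Finding exact matches in a large data set then becomes one bucket lookup per record instead of a pairwise scan of all record combinations.

```python
import hashlib
from collections import defaultdict

def record_key(record: dict, fields=("first_name", "last_name", "dob")) -> str:
    """Normalize and hash the match fields so comparison is a single lookup."""
    normalized = "|".join(str(record.get(f, "")).strip().lower() for f in fields)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(records):
    """Group records by hash key; each bucket holds deterministic matches."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[record_key(rec)].append(rec)
    return buckets

records = [
    {"first_name": "Ana", "last_name": "Diaz", "dob": "1990-02-01"},
    {"first_name": " ANA ", "last_name": "diaz", "dob": "1990-02-01"},  # same person, messy casing
    {"first_name": "Ben", "last_name": "Diaz", "dob": "1988-07-12"},
]
groups = deduplicate(records)  # two buckets: the first two records share a key
```

Note that this only works for exact matches after normalization; a single typo in any hashed field lands the record in a different bucket, which is precisely the brittleness that motivates the probabilistic techniques below.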
Probabilistic Matching also has its potential pitfalls: algorithms sample the data set rather than scanning all possible values, matching functions become more complex and time-consuming, and the number of false-positive matches increases.
Another point to consider is ease of implementation. Deterministic Matching tends to be fairly easy and straightforward to define and implement, while Probabilistic Matching is generally more complex. While Deterministic Matching can employ basic application-provided matching functions, Probabilistic Matching usually applies a phonetic or statistical approach. Probabilistic Matching also incorporates an index for each match to denote the relevance (likelihood) of the match.
Probabilistic Matching is commonly implemented differently depending on the search domain. Domain-specific approaches include the use of Soundex functions, locality-sensitive hashing (LSH) with carefully chosen hash functions, phonetic algorithms that simplify words based on their pronunciation, and approximate matching using graph theory.
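As one example of the phonetic approach, the classic American Soundex algorithm reduces a name to its first letter plus three digits, so variant spellings that sound alike receive the same code. This is a compact sketch of the standard rules, not the implementation of any specific MDM tool:

```python
def soundex(word: str) -> str:
    """Classic American Soundex: first letter + three digits, e.g. 'Robert' -> 'R163'."""
    word = word.upper()
    # Map consonants to digit classes; vowels, Y, H, W are left unmapped.
    digits = word.translate(str.maketrans("BFPVCGJKQSXZDTLMNR", "111122222222334556"))
    first = word[0]
    prev = digits[0] if digits[0].isdigit() else ""
    out = []
    for ch, d in zip(word[1:], digits[1:]):
        if ch in "HW":
            continue  # H and W are transparent: they do not separate duplicate codes
        if d.isdigit():
            if d != prev:  # collapse adjacent identical codes
                out.append(d)
            prev = d
        else:
            prev = ""  # a vowel resets the previous code, so a repeated code can reappear
    return (first + "".join(out) + "000")[:4]

soundex("Robert")    # 'R163'
soundex("Rupert")    # 'R163' -- same code as Robert
soundex("Ashcraft")  # 'A261'
```

In practice such codes would be computed once per record and used as a blocking or bucketing key, with a finer-grained (and more expensive) comparison applied only within each bucket.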
When deciding on a matching technique, it’s important to keep in mind that there is not necessarily a right or wrong choice between Deterministic and Probabilistic Matching. The selection of one technique over the other depends on several factors, including the characteristics of the data, the size of the data, budget and time constraints, and the accepted tolerance for false positives and negatives. In our experience at Knowledgent, a combination of these techniques and others is often needed when working with big data.
When do you use Deterministic Matching and Probabilistic Matching? Share your thoughts in the comments!