ChoiceMaker Technology Overview

Approximate Record Matching

Approximate record matching is employed when information is not always identified by a reliable unique key. Record-matching tasks can be broken down into three main categories:

Duplicate record removal or linkage: The same person, business, or thing is present more than once in a database. Duplicate records are removed or linked together
Database linkage: Two databases are linked or merged. This might occur, for instance, because of a corporate merger, to build a data warehouse, or to prevent duplication of effort by storing information common to multiple databases in a single enterprise-wide database.
Approximate database search: Search a database for records similar to an input record. Similarly, prevent users from adding duplicate records to the database by providing a real-time check of whether a record entered on a user entry screen is present in the database.

In all of these cases, the basic problem is essentially the same. Given an input or query record, search a target database for record(s) that denote the same thing (e.g., the same person, product, or company).

Matching Process Overview

Advanced approximate record matching systems generally perform matching as a two-step process, as illustrated in Figure 1:

Query record. A query record is sent to the matching engine.
Blocking. The engine searches the target database for records that are possible matches to the query record. The objective at this stage is to retrieve all possible matches and not too many non-matches.
Many possible matches. This is the set of records returned by blocking which are possible matches to the query record.
Decision making. For each possible match, the matching engine determines the probability that the record denotes the same thing as the query record. Possible matches are sorted into matches, potential matches, and non-matches based on two user-defined thresholds. I.e., any record matching the query record with a probability higher than the “match threshold” is declared a “match”.

Blocking

In the field of approximate record matching, a “blocking” step refers to a fast matching algorithm primarily used as the first step of a larger record matching system. The goal of a blocking step is to attempt to find all possible matches to an input query record rather than to aim for precision in determining which record is the correct match. Blocking thus aims for maximum “recall” or “sensitivity”, possibly at the expense of achieving high “precision” or “specificity”. In effect, blocking is a sort of “is it in the ballpark?” test that can be used to narrow down the number of records that need to be processed by the higher precision, but more computationally intensive “scoring” stage. In this second stage, ChoiceMaker uses the computationally intensive maximum entropy (ME) based model discussed below. The ME model tests every record returned by the blocking algorithm against the query record to see if they match.

Blocking provides a huge performance improvement, because when deduplicating a database of n records, there are approximately n2 /2 pairs of records in the database. Hence, when n is of any magnitude, it would be time-prohibitive for the system to attempt to examine all pairs of records.

Traditional blocking methods are generally based on an ad hoc judgment of the usefulness of different fields in a matching decision. For instance, a healthcare site might use Medicaid and medical record number matches as blocking characteristics— meaning that any records matching on those two fields would be passed on to the second-stage matching algorithm. Also commonly used are matches on birthday and first name, birthday and last name, birthday and Soundex code of last name, etc.

This traditional approach can work reasonably well, but its ad hoc nature places a limitation on the portability of any system built around it. It also has a problem of generating too many false negative responses (i.e., records that should have been linked, but were not). The quality of the blocking routine is important to the ability of the system to minimize the number of false negatives since pairs that are not seen as possible matches in the blocking phase will be missed even if the second-stage matching algorithm’s decision-making engine would have assigned them a high probability of match. At the same time, the system has to carefully manage tradeoffs between false negatives and run-time performance. If the blocking algorithm is too liberal in passing along hypothetical matches, system run-time may exceed the user's tolerance.

ChoiceMaker has developed a patent-pending approach to the blocking problem. This technique automatically builds a query that is as liberal as possible in efficiently retrieving records that match on individual fields or sets of fields while avoiding selection criteria that are predicted to return more than the maximum number of records. The ability to do blocking without extensive manual setup greatly simplifies the deployment of a record matching system. In addition, the algorithm has the advantage that response time is relatively predictable and independent of the data in the query record.

In contrast to the traditional approaches discussed above which block on fixed combinations of keys, ChoiceMaker’s automated blocking algorithm (ABA) dynamically constructs a set of database SELECT statements which we call “blocking sets.” Each blocking set is built to be as liberal as possible in retrieving records that match the query record on individual fields or sets of fields while not containing SELECT statements that are predicted to return more than a user-defined maximum number of records, m. The records retrieved are those retrieved by the union of the SELECT statements.

Machine Learning and Model Development

ChoiceMaker’s record matching models are built around a set of 50 – 200 “clues” (more commonly known in the AI community as “features”), which indicate a “match” or “differ” decision. Each clue reads the pair of records in question—generally the query record and a record from the database and determines whether a particular indicator of a match or non-match decision is present in this pair of records. If the clue’s criteria for predicting a match or non-match are met, then we say that the clue is “active” on the pair of records. Some sample clues include:

Are the first names the same and are they common, uncommon, or very rare?
Do the last names have the same phoneticization according to Soundex [6] or similar techniques?
Is the date of birth different?

These clues are written in ChoiceMaker’s ClueMaker® programming language described in “Building a Model.”

ChoiceMaker’s ML approach [3] constructs a record matching model that outputs the probability that a pair of records represents the same entity. The model is trained on a set of pairs of records that have been tagged as a “match”, “differ”, or “hold (unsure)”. Although we have experimented with other ML techniques, ChoiceMaker currently uses maximum entropy modeling (ME) [1] [2] in all of its production models. ME requires that each clue predict a decision (a “future” in the AI literature), in this case the predictions are “match” and “differ”. The ME training process then assigns each clue a weight, which is a positive real number indicating the relative predictive strength of the clue. At run time, the weights of all active clues are combined to form a probability of match by the following formula:

where MatchProduct is defined as the product of the weights of all clues predicting “match” for the pair and DifferProduct is the product of all clues predicting “differ” for the pair. To give a simple example, consider the example in the table below where we are matching the search record “Jim Conner” to the potential match record “Jim Connor.” Note here that “C560” is the phonetic code for both Conner and Connor according to Soundex. In this example, clues 1 and 3 activate predicting a “match” decision whereas clue 2 activates predicting a “differ” decision.

Putting this into the above formula gives us a probability of 79% that “Jim Conner” and “Jim Connor” are the same person:

MatchProduct = 1.5 * 5.5 = 8.25
DifferProduct = 2.2 = 2.2
Probability = 8.25 / (8.25 + 2.2) = 79%

Remember that the weights 1.5, 2.2, and 5.5 were determined during training by the maximum-entropy engine in an attempt to produce for as many examples as possible a decision that is consistent with the human marking. Note that the model most likely contains many additional clues which did not fire on this pair of records, such as “last names match,” but which would have fired if their conditions were met. Relative to other machine learning techniques, maximum entropy has proven to be a good choice for this problem because it is fast at run time and can assign accurate weights to multiple overlapping and interacting clues.

Testing

After developing clues in ClueMaker for our record matching model and training it on hand-marked data using ME, we are ready to test the model on a separate corpus of hand-marked data on which the model has not been trained. This testing on unseen data is performed on data tagged by ChoiceMaker Technologies during the development process and then is performed by the client on data tagged by the client for a final test prior to deployment. This final test is designed to be a scientific evaluation of the system, to ensure that the record matching model has not been tuned to the peculiarities of the training and development test data. Our development process is illustrated in the Figure below.

ChoiceMaker recommends that its clients evaluate the system by first determining what false-positive and false-negative error rates they can accept, where a false-positive is a record-pair identified as a match by ChoiceMaker which is human tagged as “differ” and a false-negative is a pair human tagged as “match” which ChoiceMaker identifies as “differ”

	Human Decision
Differ	Match
ChoiceMaker Decision	Differ	Correct	False Negative
Match	False Positive	Correct

Different clients may have different relative tolerances for false-positives and false-negatives. For instance, in a children’s immunization database, falsely identifying two individuals as the same person could be very serious because if, say, “John” was vaccinated but “Jon” was not, then John’s vaccine would appear on Jon’s record and Jon would not be vaccinated. On the other hand, in a counter-terrorism situation, it is more important not to miss any possibly matching suspects, even at the expense of falsely identifying non-terrorists as requiring investigative follow up.

Given the client’s stated tolerance for false-positive and false negative responses, the client can test the model using our ChoiceMaker Analyzer tool, which determines the “match” and “differ” thresholds (see section 1.1) which will yield no more than the desired error rates. The client can then evaluate the quality of the model based on the percentage of record-pairs that fall between the match and differ thresholds. Since the client will need to evaluate these records by hand, lower percentages are clearly desired.

If the percentage of record-pairs in the “false positive”, “false negative”, and “needs human review” buckets are acceptable to the client, then the model is ready to be deployed in production. On the other hand, if the accuracy is unsatisfactory, the client can use ChoiceMaker Analyzer to better understand the record-pairs that the system is classifying incorrectly. For example, if there are patterns of errors which suggest the need for additional clues or changes to existing clues, then the client can modify these clues and perform additional testing.

System Integration

ChoiceMaker has a flexible architecture which is easily integrated into almost any common system architecture. ChoiceMaker is a 100% Java application, allowing it to run on any platform supporting Java 1.4, including Windows, Unix, Linux and all other major operating systems. ChoiceMaker currently runs against Oracle, Postgres, and MS SQL-Server databases, with other database implementations planned when required. As noted below, when operating in batch mode, ChoiceMaker also takes XML or comma separated (.csv) files as input. Since it is designed from the ground up to be integrated into a larger system. ChoiceMaker offers both Enterprise Java Bean (EJB) and Web Services (SOAP) interfaces. For clients preferring a batch mode, command line interface, this is available too.

As mentioned above in Section 2, “Blocking”, ChoiceMaker offers two logical interfaces: real-time and batch. In the realtime version of the algorithm, ChoiceMaker’s inputs and outputs are as follows:

Input: A query record, comprising the identifying information about the person or entity the user is seeking in the database
Output: A list of record ID’s in the database which are possible matches to the query record.

For each ID, we return ChoiceMaker’s decision (either “match” or “hold/possible match”) and the probability that the ID matches the query record. In addition, depending on user requirements, we can also return a parsed and standardized version of the input record. For instance, we can parse a person’s name into first, middle, and last names and we can parse the address into house number, street name, apartment number, city, state, zip code, and country.

ChoiceMaker’s batch interface inputs a large number of query records all at once. These records may be input as one of the following:

An Oracle or MS SQL Server database
An XML file
A delimited flat file, such as a comma separated value (.csv) file which can be exported from Microsoft Excel

For each record in the batch, ChoiceMaker outputs the same information as the real-time version. Note that the batch interface detects matches both among the query records and between the query records and the database records. If there is no database, then ChoiceMaker simply deduplicates the query records.

As noted above, the batch interface provides very high-speed throughput, but does not provide a real-time response. It does, however, provide a Web Services or EJB API allowing the client’s system to monitor and control the progress of the batch process.

The batch interface also offers a “transitivity engine” which transforms ChoiceMaker’s pairwise match or hold decisions into user-defined database actions for each input record. These database actions, for a given record X, might include” Load X as a new record”, “Link X with record Y”, or “Place X in the queue of records to be human reviewed”.

A simple example of the transitivity engine’s usefulness is that it allows the user to specify what they want to do when ChoiceMaker detects that records A and B match and records B and C match, but ChoiceMaker cannot verify that A and C match. Another example is the following: assume that records 1 and 2 are records in a database presumed to be free of duplicates. What should be done when ChoiceMaker matches record A to both 1 and 2? These matches would seem to imply that 1 and 2 are duplicates, which contradicts the assumption that the database is free of duplicates. In each of these cases, the user might specify to ChoiceMaker’s transitivity engine that the records in question should be linked, that they should be assumed to be non-matches, or that they should be held for human review. The user also has the option of specifying different actions for the transitivity engine based on the strength of ChoiceMaker’s predictions. For instance, if A is a high probability match to both B and C, but B and C are just low probability matches to each other, the user can specify to the transitivity engine that B and C should be declared a match rather than held for human review.

Learn why ChoiceMaker is the best choice for personally identifying information record matching