API documentation
Load or initialize a data model.
# initialize from a settings file
deduper = dedupe.Dedupe('my_learned_settings')
or
# initialize from a defined set of fields
fields = {
    'Site name': {'type': 'String'},
    'Address': {'type': 'String'},
    'Zip': {'type': 'String', 'Has Missing': True},
    'Phone': {'type': 'String', 'Has Missing': True},
}
deduper = dedupe.Dedupe(fields)
init
A field definition or a file location for a settings file. Settings files are typically generated by saving the settings learned in a previous session. For details about this file, see the writeSettings method.
A field definition is a dictionary where the keys are the fields that will be used for training a model and the values are the field specifications.
Field types include:
- String
- Custom
- LatLong
- Set
- Interaction
- Categorical
- Source
A 'String' type field must have as its key the name of a field as it appears in the data dictionary, and a 'type' declaration, e.g.
{'Phone': {'type': 'String'}}
The String type expects field values to be strings. Missing data should be represented as an empty string, ''.
String types are compared using affine gap string distance.
A 'Custom' type field must have as its key the name of a field as it appears in the data dictionary, a 'type' declaration, and a 'comparator' declaration. The comparator must be a function that takes two field values and returns a number or numpy.nan (not a number, appropriate when a distance is not well defined, as when one of the fields is missing).
Example custom comparator:
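import numpy

def sameOrNotComparator(field_1, field_2):
    # Compare only when both values are present (non-empty)
    if field_1 and field_2:
        if field_1 == field_2:
            return 1
        else:
            return 0
    else:
        # The distance is undefined when either value is missing
        return numpy.nan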
Field definition:
{'Zip': {'type': 'Custom',
         'comparator': sameOrNotComparator}}
A 'LatLong' type field must have as its key the name of a field as it appears in the data dictionary, and a 'type' declaration. LatLong fields are compared using the Haversine formula.
The values of a 'LatLong' type field must be tuples of floats corresponding to a latitude and a longitude. Missing data should be represented by a tuple of zeros, (0.0, 0.0).
{'Location': {'type': 'LatLong'}}
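For readers unfamiliar with the Haversine formula, the following is a minimal standalone sketch of the great-circle distance between two (latitude, longitude) tuples in degrees. The function name haversine_km, the choice of kilometers, and the Earth radius constant are assumptions made for illustration; dedupe's own comparator may use different units or scaling.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption for this sketch


def haversine_km(point_1, point_2):
    """Great-circle distance in km between two (lat, long) tuples in degrees."""
    lat_1, lon_1 = map(math.radians, point_1)
    lat_2, lon_2 = map(math.radians, point_2)
    d_lat = lat_2 - lat_1
    d_lon = lon_2 - lon_1
    # Haversine of the central angle between the two points
    a = (math.sin(d_lat / 2) ** 2
         + math.cos(lat_1) * math.cos(lat_2) * math.sin(d_lon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Two points on the equator, half a world apart
half_circumference = haversine_km((0.0, 0.0), (0.0, 180.0))
```

Note that a record with missing coordinates, encoded as (0.0, 0.0), is a valid point on the globe to this function, so a real comparator would need to treat that value specially.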
A 'Set' type field must have as its key the name of a field as it appears in the data dictionary, and a 'type' declaration. Set fields are compared using the Jaccard index. Missing data is not implemented for this field type.
{'Co-authors': {'type': 'Set'}}
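As a reminder of what the Jaccard index measures, here is a small sketch: the size of the intersection divided by the size of the union. The function name jaccard_index and the handling of two empty sets are assumptions for illustration, and whether dedupe feeds the model this similarity or a distance derived from it is not stated here.

```python
def jaccard_index(set_1, set_2):
    """Size of the intersection divided by the size of the union."""
    set_1, set_2 = set(set_1), set(set_2)
    union = set_1 | set_2
    if not union:
        # Both sets empty: the index is undefined (assumed convention)
        return float('nan')
    return float(len(set_1 & set_2)) / len(union)


# Hypothetical co-author sets for two records
overlap = jaccard_index(['smith', 'jones'], ['jones', 'lee'])
```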
An 'Interaction' type field can have as its key any name you choose, a 'type' declaration, and an 'Interaction Fields' declaration. An interaction field multiplies the values of the declared fields.
The 'Interaction Fields' must be a sequence of names of other fields you have defined in your field definition.
{'Name': {'type': 'String'},
 'Zip': {'type': 'Custom',
         'comparator': sameOrNotComparator},
 'Name-Zip': {'type': 'Interaction',
              'Interaction Fields': ['Name', 'Zip']}}
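To illustrate what "multiplies the values of the declared fields" means, here is a minimal sketch. The helper interaction_value and the example distance values are hypothetical; they only show the arithmetic, not how dedupe stores field distances internally.

```python
def interaction_value(field_distances, interaction_fields):
    """Multiply together the comparison values of the named fields."""
    product = 1.0
    for name in interaction_fields:
        product *= field_distances[name]
    return product


# Hypothetical per-field comparison values for one pair of records
distances = {'Name': 0.5, 'Zip': 1.0}
name_zip = interaction_value(distances, ['Name', 'Zip'])
```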
Categorical variables are useful when you are dealing with qualitatively different types of things. For example, you may have data on businesses and you find that taxi cab businesses tend to have very similar names but law firms don't. Categorical variables would let you indicate whether two records are both taxi companies, both law firms, or one of each.
Dedupe would represent these three possibilities using two dummy variables:
taxi-taxi      0 0
lawyer-lawyer  1 0
taxi-lawyer    0 1
A categorical field declaration must include a list of all the different strings that you want to treat as different categories.
So if your data looks like this:
'Name'          'Business Type'
AAA Taxi        taxi
AA1 Taxi        taxi
Hindelbert Esq  lawyer
You would create a definition like:
{'Business Type': {'type': 'Categorical',
                   'Categories': ['taxi', 'lawyer']}}
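The dummy-variable table above can be reproduced with a short sketch. The helper names categorical_dummies and encode_pair are hypothetical, and the ordering of the dummy columns is chosen to match the taxi/lawyer table; dedupe's internal ordering may differ.

```python
from itertools import combinations


def categorical_dummies(categories):
    """Map each unordered pair of categories to a tuple of dummy variables.

    The first same-category pair is the all-zero baseline.
    """
    same = [(c, c) for c in categories]
    mixed = list(combinations(categories, 2))
    pair_types = same + mixed
    n_dummies = len(pair_types) - 1
    encoding = {}
    for i, pair in enumerate(pair_types):
        row = [0] * n_dummies
        if i > 0:
            row[i - 1] = 1
        encoding[frozenset(pair)] = tuple(row)
    return encoding


def encode_pair(encoding, value_1, value_2):
    """Look up the dummy variables for a pair of category values."""
    return encoding[frozenset((value_1, value_2))]


encoding = categorical_dummies(['taxi', 'lawyer'])
taxi_lawyer = encode_pair(encoding, 'lawyer', 'taxi')
```

Using a frozenset as the lookup key makes the encoding independent of the order in which the two records are compared.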
Different data sources usually vary in how many duplicates they contain and in the patterns that make a pair of records likely to be duplicates. If you are trying to link records from more than one data set, it can be useful to take these differences into account.
If your data has a field that indicates its source, something like
'Name'         'Source'
John Adams     Campaign Contributions
John Q. Adams  Lobbyist Registration
John F. Adams  Lobbyist Registration
You can take these sources into account with the following field definition:
{'Source': {'type': 'Source',
            'Categories': ['Campaign Contributions', 'Lobbyist Registration']}}
Dedupe will create a categorical variable for the source and then cross-interact it with all the other variables. This has the effect of letting dedupe learn three different models at once. Let's say that we had defined another variable called Name. Then our total model would have the following fields:
Bias
Name
Source
Source:Name
different sources
different sources:Name
Bias + Name would predict the probability that a pair of records were duplicates if both records were from Campaign Contributions.
Bias + Source + Name + Source:Name would predict the probability that a pair of records were duplicates if both records were from Lobbyist Registration.
Bias + different sources + Name + different sources:Name would predict the probability that a pair of records were duplicates if one record was from each of the two sources.
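The three cases can be summarized with a small sketch that lists which model terms are active for a given pair of sources. The function active_terms and its baseline argument are hypothetical and only restate the three cases described above; they are not part of dedupe.

```python
def active_terms(source_1, source_2, baseline='Campaign Contributions'):
    """Return the model terms that contribute for a pair of record sources."""
    terms = ['Bias', 'Name']
    if source_1 != source_2:
        # One record from each source
        terms += ['different sources', 'different sources:Name']
    elif source_1 != baseline:
        # Both records from the non-baseline source
        terms += ['Source', 'Source:Name']
    return terms


both_lobbyist = active_terms('Lobbyist Registration', 'Lobbyist Registration')
```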
If a field has missing data, you can set 'Has Missing' : True in the field definition. This creates a new, additional field representing whether the data was present or not and zeros out the missing data. If there is missing data but you did not declare 'Has Missing' : True, then the missing data will simply be zeroed out.
If you define an interaction with a field that you declared to have missing data, then 'Has Missing' : True will also be set for the Interaction field.
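A rough sketch of the behavior described above, assuming a missing comparison shows up as NaN. The helper handle_missing is hypothetical, and the convention that the indicator is 1.0 when data is present and 0.0 when it is missing is an assumption made for illustration.

```python
import math


def handle_missing(distance, has_missing=False):
    """Zero out a missing distance and optionally add a presence indicator."""
    missing = isinstance(distance, float) and math.isnan(distance)
    value = 0.0 if missing else distance
    if has_missing:
        # Assumed convention: 1.0 means the data was present
        return [value, 0.0 if missing else 1.0]
    return [value]


with_indicator = handle_missing(float('nan'), has_missing=True)
without_indicator = handle_missing(float('nan'))
```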
Longer example of a field definition:
fields = {'name': {'type': 'String'},
          'address': {'type': 'String'},
          'city': {'type': 'String'},
          'zip': {'type': 'Custom', 'comparator': sameOrNotComparator},
          'cuisine': {'type': 'String', 'Has Missing': True},
          'name-address': {'type': 'Interaction', 'Interaction Fields': ['name', 'city']}
          }
Learn field weights from a file of labeled examples or a round of interactive labeling.
See our CSV and MySQL examples for methods of creating a data dictionary, data_d. To create a data sample, see the dataSample documentation.
# given data_d, a dictionary of frozendicts, grab a sample
# see CSV example
data_sample = dedupe.dataSample(data_d, 150000)
# load training data from an existing file
deduper.train(data_sample, 'my_training')
or
# given data_d, a dictionary of frozendicts, grab a sample
data_sample = dedupe.dataSample(data_d, 150000)
# train with active learning and human input
deduper.train(data_sample, dedupe.training.consoleLabel)
data_sample
A sample of record pairs.
training_source
Either a path to a file of labeled examples or a labeling function.
In the sample of record pairs, each element is a tuple of two records. Each record is, in turn, a tuple of the record's key and a record dictionary.
In the record dictionary, the keys are the names of the record fields and the values are the field values.
For example, a data_sample with only one pair of records:
[
  (
    (854, {'city': 'san francisco',
           'address': '300 de haro st.',
           'name': "sally's cafe & bakery",
           'cuisine': 'american'}),
    (855, {'city': 'san francisco',
           'address': '1328 18th st.',
           'name': 'san francisco bbq',
           'cuisine': 'thai'})
  )
]
The labeling function will be used to do active learning. The function will be supplied a list of examples that the learner is the most 'curious' about, that is, examples where we are most uncertain about how they should be labeled. The labeling function will label these, and based upon what we learn from these examples, the labeling function will be supplied with new examples that the learner is now most curious about. This will continue until the labeling function signals that it is done labeling.
The labeling function must be a function that takes two arguments. The first argument is a sequence of pairs of records. The second argument is the data model.
The labeling function must return two outputs. The function must return a dictionary of labeled pairs and a finished flag.
The dictionary of labeled pairs must have two keys, 1 and 0, corresponding to record pairs that are duplicates or nonduplicates respectively. The values of the dictionary must be sequences of record pairs, like the sequence that was passed in.
The 'finished' flag should take the value False for active learning to continue, and the value True to stop active learning.
i.e.
def labelFunction(record_pairs, data_model):
    ...
    return (labeled_pairs, finished)
For a working example, see consoleLabel in training.
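As a non-interactive illustration of this interface, here is a minimal sketch of a labeling function that marks a pair as a duplicate only when the two names match exactly and stops after one round. The function exact_name_label is hypothetical, and it assumes each record in a pair is a (key, record dictionary) tuple like the data_sample example above; the pairs dedupe actually passes in may be shaped differently.

```python
def exact_name_label(record_pairs, data_model):
    """Label pairs with identical names as duplicates; stop after one round."""
    labeled_pairs = {0: [], 1: []}
    for pair in record_pairs:
        # Assumed shape: ((key_1, record_1), (key_2, record_2))
        (key_1, record_1), (key_2, record_2) = pair
        label = 1 if record_1['name'] == record_2['name'] else 0
        labeled_pairs[label].append(pair)
    finished = True  # tell the learner that labeling is done
    return labeled_pairs, finished


pairs = [
    ((854, {'name': "sally's cafe & bakery", 'city': 'san francisco'}),
     (855, {'name': 'san francisco bbq', 'city': 'san francisco'})),
]
# data_model is unused by this sketch, so None stands in for it here
labeled, finished = exact_name_label(pairs, None)
```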
Labeled example files are typically generated by saving the examples labeled in a previous session. For details about this file, see the writeTraining method.
Returns a function that takes in a record dictionary and returns a list of blocking keys for the record. We will learn the best blocking predicates if we don't have them already.
blocker = deduper.blockingFunction()
ppc
Limits the Proportion of Pairs Covered that we allow a predicate to cover. If a predicate puts together a fraction of possible pairs greater than the ppc, that predicate will be removed from consideration.
As the size of the data increases, the user will generally want to reduce ppc.
ppc should be a value between 0.0 and 1.0.
uncovered_dupes
The number of true dupe pairs in our training data that we can accept will not be put into any block. If true duplicates are never in the same block, we will never compare them, and may never declare them to be duplicates.
However, requiring that we cover every single true dupe pair may mean that we have to use blocks that put together many, many distinct pairs that we'll have to compare as well, which is expensive.
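To make ppc and uncovered_dupes more concrete, the sketch below computes both quantities for a single toy predicate. The helper covered_pairs, the first_letter predicate, the records, and the known_dupes labels are all hypothetical and invented for illustration; dedupe's predicate learning is more involved than this.

```python
from collections import defaultdict
from itertools import combinations


def covered_pairs(records, predicate):
    """Return the set of record-ID pairs that share at least one blocking key."""
    blocks = defaultdict(set)
    for record_id, record in records.items():
        for key in predicate(record):
            blocks[key].add(record_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def first_letter(record):
    # Hypothetical predicate: block on the first character of the name
    return [record['name'][:1]]


records = {
    1: {'name': 'aaa taxi'},
    2: {'name': 'aa1 taxi'},
    3: {'name': 'hindelbert esq'},
    4: {'name': 'esq hindelbert'},
}
known_dupes = {(1, 2), (3, 4)}  # illustrative training labels, not real data

covered = covered_pairs(records, first_letter)
total_pairs = len(records) * (len(records) - 1) // 2
pairs_covered = len(covered) / float(total_pairs)  # compare against ppc
uncovered = len(known_dupes - covered)             # compare against uncovered_dupes
```

Here the predicate covers only a small share of all possible pairs, which keeps comparisons cheap, but it misses the duplicate pair whose names start with different letters. That is the tradeoff the two parameters control.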
Returns the threshold that maximizes the expected F score, a weighted harmonic mean of precision and recall, for a sample of blocked data.
threshold = deduper.goodThreshold(blocked_data, recall_weight=2)
blocks
Sequence of tuples of records, where each tuple is a set of records covered by a blocking predicate.
recall_weight
Sets the tradeoff between precision and recall. For example, if you care twice as much about recall as you do precision, set recall_weight to 2.
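One way such a threshold could be chosen is sketched below: rank the estimated duplicate probabilities, treat their running sum as the expected number of true duplicates, and pick the cutoff with the best weighted F score. The function good_threshold_sketch and its input (a flat list of pair probabilities) are assumptions for illustration, not a description of what goodThreshold does internally.

```python
def good_threshold_sketch(probabilities, recall_weight=2):
    """Pick the probability cutoff that maximizes the expected F score."""
    ranked = sorted(probabilities, reverse=True)
    total_expected = sum(ranked)
    if not ranked or total_expected == 0:
        return None
    beta_sq = recall_weight ** 2
    best_score, best_threshold = -1.0, None
    expected_dupes = 0.0
    for n_predicted, p in enumerate(ranked, start=1):
        # Expected true duplicates among the n_predicted highest-scoring pairs
        expected_dupes += p
        precision = expected_dupes / n_predicted
        recall = expected_dupes / total_expected
        score = ((1 + beta_sq) * precision * recall
                 / (beta_sq * precision + recall))
        if score > best_score:
            best_score, best_threshold = score, p
    return best_threshold


balanced = good_threshold_sketch([0.9, 0.8, 0.1], recall_weight=2)
recall_heavy = good_threshold_sketch([0.9, 0.8, 0.1], recall_weight=10)
```

With these toy numbers, weighting recall much more heavily pushes the chosen cutoff down to the lowest-scoring pair, so that more pairs are accepted.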
Partitions blocked data and returns a list of clusters, where each cluster is a tuple of record IDs.
clustered_dupes = deduper.duplicateClusters(blocked_data, threshold)
blocks
Sequence of tuples of records, where each tuple is a set of records covered by a blocking predicate.
threshold
Number between 0 and 1 (default is .5). We will only consider record pairs as duplicates if their estimated duplicate likelihood is greater than the threshold.
Lowering the number will increase recall, raising it will increase precision.
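To show how the threshold shapes the output, here is a simplified sketch that links every pair scoring above the threshold and returns the connected groups as tuples of record IDs. It uses plain connected components (union-find) as a stand-in; the clustering method dedupe actually applies may differ, and the function cluster_pairs and its scored_pairs input are hypothetical.

```python
def cluster_pairs(scored_pairs, threshold=0.5):
    """Group record IDs linked by pairs scoring above the threshold."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (id_1, id_2), score in scored_pairs:
        if score > threshold:
            root_1, root_2 = find(id_1), find(id_2)
            if root_1 != root_2:
                parent[root_2] = root_1

    clusters = {}
    for record_id in list(parent):
        clusters.setdefault(find(record_id), []).append(record_id)
    return sorted(tuple(sorted(members)) for members in clusters.values())


# Hypothetical (pair, estimated duplicate likelihood) scores
scored = [((1, 2), 0.9), ((2, 3), 0.7), ((4, 5), 0.2)]
default_clusters = cluster_pairs(scored, threshold=0.5)
strict_clusters = cluster_pairs(scored, threshold=0.8)
```

Raising the threshold from 0.5 to 0.8 drops the weaker link between records 2 and 3, so the cluster shrinks: fewer pairs are declared duplicates, trading recall for precision.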
Write a settings file that contains the data model and predicates.
deduper.writeSettings('my_learned_settings')
file_name
Path to file.
Write the labeled examples to a JSON file.
deduper.writeTraining('my_training')
file_name
Path to a json file.
Randomly sample pairs of records from a data dictionary.
data_sample = dedupe.dataSample(data_d, 150000)
data_d
A dictionary-like object indexed by record ID where the values are dictionaries representing records.
sample_size
Number of record tuples to return. 150,000 is typically a good size.
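The shape of the returned sample can be illustrated with a naive sketch that enumerates every pair and draws a random subset. The function sample_pairs and its seed argument are hypothetical; enumerating all pairs is only feasible for small data, so this shows the output format rather than how dataSample is implemented.

```python
import random
from itertools import combinations


def sample_pairs(data_d, sample_size, seed=None):
    """Draw up to sample_size random pairs of (key, record) tuples."""
    rng = random.Random(seed)
    records = list(data_d.items())
    # Naive: materializes every possible pair, fine only for small inputs
    all_pairs = list(combinations(records, 2))
    if sample_size >= len(all_pairs):
        return all_pairs
    return rng.sample(all_pairs, sample_size)


# Toy data dictionary indexed by record ID
toy_data = {
    854: {'name': "sally's cafe & bakery", 'city': 'san francisco'},
    855: {'name': 'san francisco bbq', 'city': 'san francisco'},
    856: {'name': 'sf bbq', 'city': 'san francisco'},
}
toy_sample = sample_pairs(toy_data, 2, seed=0)
```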
Takes in a data dictionary and a blockingFunction and returns blocks of data to be compared.
blocked_data = dedupe.blockData(data_d, blocker)
data_d
A dictionary-like object indexed by record ID where the values are dictionaries representing records.
blocker
A blockingFunction object.
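Conceptually, blocking groups records that share a blocking key so that only records within a group are compared. The sketch below shows that idea with a hypothetical block_data function and a hypothetical zip_blocker; it assumes the blocker returns a list of keys per record, as described for blockingFunction, and is not dedupe's implementation.

```python
from collections import defaultdict


def block_data(data_d, blocker):
    """Group (record_id, record) tuples that share a blocking key."""
    blocks = defaultdict(list)
    for record_id, record in data_d.items():
        for key in blocker(record):
            blocks[key].append((record_id, record))
    # A block with a single record yields no pairs to compare
    return [tuple(members) for members in blocks.values() if len(members) > 1]


def zip_blocker(record):
    # Hypothetical blocker: one key per record, based on the zip code
    return ['zip:' + record['zip']]


toy_records = {
    1: {'name': 'aaa taxi', 'zip': '60601'},
    2: {'name': 'aa1 taxi', 'zip': '60601'},
    3: {'name': 'hindelbert esq', 'zip': '60614'},
}
toy_blocks = block_data(toy_records, zip_blocker)
```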