API documentation
- Defining a model
- sample
- uncertainPairs
- markPairs
- train
- threshold
- match
- writeTraining
- readTraining
- writeSettings
- blocker
- thresholdBlocks
- matchBlocks
Class for active learning deduplication. Use deduplication when you have data that can contain multiple records that all refer to the same entity. Active learning lets a user teach dedupe how to classify pairs of records as distinct or duplicates, so that it can efficiently learn how to match records.
For larger datasets, you will need to use the thresholdBlocks and matchBlocks methods. These methods require you to create blocks of records. For Dedupe, each block should be a dictionary of records. Each block consists of all the records that share a particular predicate, as output by the blocker method of Dedupe.
Within a block, the dictionary should consist of records from the data, with the keys being record ids and the values being the records.
> data = {'A1' : {'name' : 'howard'},
         'B1' : {'name' : 'howie'}}
...
> from collections import defaultdict
> blocks = defaultdict(dict)
>
> for block_key, record_id in deduper.blocker(data.items()) :
>     blocks[block_key].update({record_id : data[record_id]})
>
> blocked_data = blocks.values()
> print blocked_data
[{'A1' : {'name' : 'howard'},
  'B1' : {'name' : 'howie'}}]
Initialize a Dedupe object with a field definition
# initialize from a defined set of fields
fields = {
'Site name': {'type': 'String'},
'Address': {'type': 'String'},
'Zip': {'type': 'String', 'Has Missing':True},
'Phone': {'type': 'String', 'Has Missing':True},
}
deduper = dedupe.Dedupe(fields)
or
deduper = dedupe.Dedupe(fields, data_sample)
data_sample
is an optional argument that we'll discuss fully below
num_processes
should be the number of processes to use for parallel processing, defaults to 1
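For example, initializing with both optional arguments described above might look like the following sketch (the keyword name for the process count follows the description here; check your installed version's signature):
deduper = dedupe.Dedupe(fields, data_sample, num_processes=4)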
A field definition is a dictionary where the keys are the fields that will be used for training a model and the values are the field specifications.
Field types include
- String
- Custom
- LatLong
- Set
- Interaction
A 'String' type field must have as its key the name of a field as it appears in the data dictionary, along with a 'type' declaration,
e.g. {'Phone': {'type': 'String'}}
The string type expects fields to be of class string. Missing data should be represented as an empty string ''
String types are compared using affine gap string distance.
A 'Custom' type field must have as its key the name of a field as it appears in the data dictionary, a 'type' declaration, and a 'comparator' declaration. The comparator must be a function that takes two field values and returns a number or numpy.nan (not a number, appropriate when a distance is not well defined, as when one of the fields is missing).
Example custom comparator:
import numpy

def sameOrNotComparator(field_1, field_2):
    if field_1 and field_2:
        if field_1 == field_2:
            return 1
        else:
            return 0
    else:
        return numpy.nan
Field definition:
{'Zip': {'type': 'Custom',
'comparator' : sameOrNotComparator}}
A 'LatLong' type field must have as its key the name of a field as it appears in the data dictionary and a 'type' declaration. LatLong fields are compared using the Haversine Formula.
A 'LatLong' type field must consist of tuples of floats corresponding to a latitude and a longitude. If data is missing, this should be represented by a tuple of 0s (0.0, 0.0)
{'Location': {'type': 'LatLong'}}
A 'Set' type field must have as its key the name of a field as it appears in the data dictionary and a 'type' declaration. Set fields are compared using the Jaccard index. Missing data is not implemented for this field type.
{'Co-authors': {'type': 'Set'}}
An interaction type field can have as its key any name you choose, a 'type' declaration, and an 'Interaction Fields' declaration. An interaction field multiplies the values of the declared fields.
The 'Interaction Fields' must be a sequence of names of other fields you have defined in your field definition.
Interactions are good when the effect of two predictors is not simply additive.
{'Name' : {'type': 'String'},
'Zip' : {'type': 'Custom',
'comparator' : sameOrNotComparator},
'Name-Zip' : {'type': 'Interaction',
'Interaction Fields' : ['Name', 'Zip']}}
Categorical variables are useful when you are dealing with qualitatively different types of things. For example, you may have data on businesses and you find that taxi cab businesses tend to have very similar names but law firms don't. Categorical variables would let you indicate whether two records are both taxi companies, both law firms, or one of each.
Dedupe would represent these three possibilities using two dummy variables:
taxi-taxi 0 0
lawyer-lawyer 1 0
taxi-lawyer 0 1
A categorical field declaration must include a list of all the different strings that you want to treat as different categories.
So if your data looks like this:
'Name' 'Business Type'
AAA Taxi taxi
AA1 Taxi taxi
Hindelbert Esq lawyer
You would create a definition like:
{'Business Type' : {'type': 'Categorical',
'Categories' : ['taxi', 'lawyer']}}
Usually different data sources vary in how many duplicates are contained within them and the patterns that make two pairs of records likely to be duplicates. If you are trying to link records from more than one data set, it can be useful to take these differences into account.
If your data has a field that indicates its source, something like
'Name' 'Source'
John Adams Campaign Contributions
John Q. Adams Lobbyist Registration
John F. Adams Lobbyist Registration
You can take these sources into account with the following field definition:
{'Source' : {'type': 'Source',
'Categories' : ['Campaign Contributions', 'Lobbyist Registration']}}
Dedupe will create a categorical variable for the source and then cross-interact it with all the other variables. This has the effect of letting dedupe learn three different models at once. Let's say that we had defined another variable called name. Then our total model would have the following fields
- bias
- Name
- Source
- Source:Name
- different sources
- different sources:Name
bias + Name would predict the probability that a pair of records were duplicates if both records were from Campaign Contributions.
bias + Source + Name + Source:Name would predict the probability that a pair of records were duplicates if both records were from Lobbyist Registration.
bias + different sources + Name + different sources:Name would predict the probability that a pair of records were duplicates if one record was from each of the two sources.
If a field has missing data, you can set 'Has Missing' : True in the field definition. This creates a new, additional field representing whether the data was present or not and zeros out the missing data. If there is missing data but you did not declare 'Has Missing' : True, then the missing data will simply be zeroed out.
If you define an interaction with a field that you declared to have missing data, then 'Has Missing' : True will also be set for the Interaction field.
Longer example of a field definition:
fields = {'name' : {'type' : 'String'},
          'address' : {'type' : 'String'},
          'city' : {'type' : 'String'},
          'zip' : {'type' : 'Custom', 'comparator' : sameOrNotComparator},
          'cuisine' : {'type' : 'String', 'Has Missing': True},
          'name-address' : {'type' : 'Interaction', 'Interaction Fields' : ['name', 'address']}
          }
If you did not initialize the Dedupe object with a data_sample, you will need to call this method to take a random sample of your data to be used for training.
data_sample = deduper.sample(data_d, 150000)
data
A dictionary-like object indexed by record ID where the values are dictionaries representing records.
sample_size
Number of record tuples to return. Defaults to 150,000.
If you can't use this method because of the size of your data (see the MySQL example), you'll need to initialize Dedupe with a sample you draw yourself:
deduper = dedupe.Dedupe(field_definition, data_sample)
data_sample
should be a sequence of tuples, where each tuple contains a pair of records, and each record
is a dictionary-like object that contains the field names you declared in field_definitions as keys.
For example, a data_sample with only one pair of records,
[
(
(854, {'city': 'san francisco',
'address': '300 de haro st.',
'name': "sally's cafe & bakery",
'cuisine': 'american'}),
(855, {'city': 'san francisco',
'address': '1328 18th st.',
'name': 'san francisco bbq',
'cuisine': 'thai'})
)
]
Returns a list of pairs of records from the sample of record pairs that Dedupe is most curious to have labeled.
> pair = deduper.uncertainPairs()
> print pair
[({'name' : 'Georgie Porgie'}, {'name' : 'Georgette Porgette'})]
This method is mainly useful for building a user interface for training a matching model.
Add user-labeled pairs of records to the training data and update the matching model.
labeled_examples
must be a dictionary with two keys, match
and distinct.
The values are lists of pairs of records.
labeled_examples = {'match' : [],
                    'distinct' : [({'name' : 'Georgie Porgie'}, {'name' : 'Georgette Porgette'})]
                    }
deduper.markPairs(labeled_examples)
This method is useful for building a user interface for training a matching model or for adding training data from an existing source.
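For example, a minimal labeling loop built from uncertainPairs and markPairs might look like the following sketch. Here get_label_from_user is a hypothetical function standing in for whatever interface you build; it should return True if the user judges the pair to be a duplicate.
# label ten uncertain pairs, for illustration
for _ in range(10):
    record_pair = deduper.uncertainPairs()[0]
    # get_label_from_user is hypothetical: replace it with your own UI
    if get_label_from_user(record_pair):
        deduper.markPairs({'match' : [record_pair], 'distinct' : []})
    else:
        deduper.markPairs({'match' : [], 'distinct' : [record_pair]})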
Learn the final pairwise classifier and blocking rules. Requires that adequate training data has already been provided.
ppc
Limits the Proportion of Pairs Covered that we allow a
predicate to cover. If a predicate puts together a fraction
of possible pairs greater than the ppc, that predicate will
be removed from consideration.
As the size of the data increases, the user will generally want to reduce ppc.
ppc should be a value between 0.0 and 1.0
uncovered_dupes
The number of true duplicate pairs in our training data that we can accept will not be put into any block. If true duplicates are never in the same block, we will never compare them, and may never declare them to be duplicates.
However, requiring that we cover every single true duplicate pair may mean that we have to use blocks that put together many, many distinct pairs that we'll have to expensively compare as well.
deduper.train()
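If you need to tune the blocking behavior, both parameters described above can be passed explicitly. A sketch with illustrative values (the defaults depend on your installed version):
deduper.train(ppc=0.001, uncovered_dupes=5)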
Returns the threshold that maximizes the expected F score, a weighted average of precision and recall for a sample of data.
data
is a dictionary of records, where the keys are record_ids
and the values are dictionaries with the keys being
field names
recall_weight
sets the tradeoff between precision and
recall. I.e. if you care twice as much about
recall as you do precision, set recall_weight
to 2.
> threshold = deduper.threshold(data, recall_weight=2)
> print threshold
0.21
Identifies records that all refer to the same entity. Returns tuples of record ids, where the record_ids within each tuple should refer to the same entity.
This method should only be used for small to moderately sized datasets. For larger data, use matchBlocks.
data
is a dictionary of records, where the keys are record_ids
and the values are dictionaries with the keys being
field names
threshold
is a number between 0 and 1 (default is .5). We will consider
records as potential duplicates if the predicted probability
of being a duplicate is above the threshold.
Lowering the number will increase recall, raising it will increase precision
> duplicates = deduper.match(data, threshold=0.5)
> print duplicates
[(3,6,7), (2,10), ..., (11,14)]
Write to a json file that contains labeled examples.
file_name
Path to a json file.
deduper.writeTraining('./my_training.json')
Read training data from a previously saved training data file.
training_source
is the path of the training data file
deduper.readTraining('./my_training.json')
Write a settings file that contains the data model and predicates
file_name
Path to file.
deduper.writeSettings('my_learned_settings')
Generate the predicates for records. Yields tuples of (predicate, record_id).
data
A dictionary-like object indexed by record ID where the values are dictionaries representing records.
> blocked_ids = deduper.blocker(data)
> print list(blocked_ids)
[('foo:1', 1), ..., ('bar:1', 100)]
Returns the threshold that maximizes the expected F score, a weighted average of precision and recall for a sample of blocked data.
threshold = deduper.thresholdBlocks(blocked_data, recall_weight=2)
blocks
Sequence of tuples of records, where each
tuple is a set of records covered by a blocking
predicate.
recall_weight
Sets the tradeoff between precision and
recall. I.e. if you care twice as much about
recall as you do precision, set recall_weight
to 2.
Partitions blocked data and returns a list of clusters, where each cluster is a tuple of record ids
clustered_dupes = deduper.matchBlocks(blocked_data, threshold)
blocks
Sequence of tuples of records, where each
tuple is a set of records covered by a blocking
predicate.
threshold
Number between 0 and 1 (default is .5). We will only consider record pairs as duplicates if their estimated duplicate likelihood is greater than the threshold.
Lowering the number will increase recall; raising it will increase precision.
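Putting the pieces together for a larger dataset, a sketch that stitches the earlier blocker example to thresholdBlocks and matchBlocks:
from collections import defaultdict

# build blocks keyed by blocking predicate, as in the blocker example above
blocks = defaultdict(dict)
for block_key, record_id in deduper.blocker(data.items()):
    blocks[block_key][record_id] = data[record_id]
blocked_data = blocks.values()

# choose a threshold from the blocked data, then cluster
threshold = deduper.thresholdBlocks(blocked_data, recall_weight=2)
clustered_dupes = deduper.matchBlocks(blocked_data, threshold)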
Class for deduplication using saved settings. Use deduplication when you have data that can contain multiple records that all refer to the same entity. If you have already trained dedupe, you can load the saved settings with StaticDedupe.
Initialize a StaticDedupe object with saved settings.
settings_file
should be the path to a settings file produced by the writeSettings method of a previous, active Dedupe object.
num_processes
should be the number of processes to use for parallel processing, defaults to 1
deduper = StaticDedupe('my_settings_file')
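A sketch of the full round trip, assuming StaticDedupe is exposed on the dedupe module alongside Dedupe: write the settings from a trained Dedupe object, then reload them in a later session and match as usual.
# in the training session
deduper.writeSettings('my_settings_file')

# in a later session, no retraining needed
deduper = dedupe.StaticDedupe('my_settings_file')
duplicates = deduper.match(data, threshold=0.5)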
Class for active learning record linkage.
Use RecordLink when you have two datasets that you want to merge. Each dataset, individually, should contain no duplicates. A record from the first dataset can match one and only one record from the second dataset and vice versa. A record from the first dataset need not match any record from the second dataset and vice versa.
For larger datasets, you will need to use the thresholdBlocks and matchBlocks methods. These methods require you to create blocks of records. For RecordLink, each block should be a pair of dictionaries of records. Each block consists of all the records that share a particular predicate, as output by the blocker method of RecordLink.
Within a block, the first dictionary should consist of records from the first dataset, with the keys being record ids and the values being the records. The second dictionary should consist of records from the second dataset.
> data_1 = {'A1' : {'name' : 'howard'}}
> data_2 = {'B1' : {'name' : 'howie'}}
...
> blocks = defaultdict(lambda : ({}, {}))
>
> for block_key, record_id in linker.blocker(data_1.items()) :
> blocks[block_key][0].update({record_id : data_1[record_id]})
> for block_key, record_id in linker.blocker(data_2.items()) :
> if block_key in blocks :
> blocks[block_key][1].update({record_id : data_2[record_id]})
>
> blocked_data = blocks.values()
> print blocked_data
[({'A1' : {'name' : 'howard'}}, {'B1' : {'name' : 'howie'}})]
We assume that the fields you want to compare across datasets have the same field name.
Draws a random sample of combinations of records from the first and second datasets, and initializes active learning with this sample
data_1
is a dictionary of records from first dataset, where the keys
are record_ids and the values are dictionaries with the keys being
field names.
data_2
is dictionary of records from second dataset, same form as
data_1
sample_size
is the size of the sample to draw
linker.sample(data_1, data_2, 150000)
Returns the threshold that maximizes the expected F score, a weighted average of precision and recall for a sample of data.
data_1
is a dictionary of records from first dataset, where the keys
are record_ids and the values are dictionaries with the keys being
field names
data_2
is a dictionary of records from second dataset, same form as
data_1
recall_weight
sets the tradeoff between precision and
recall. I.e. if you care twice as much about recall as you do
precision, set recall_weight to 2.
> threshold = linker.threshold(data_1, data_2, recall_weight=2)
> print threshold
0.21
Identifies pairs of records that refer to the same entity. Returns tuples of record ids, where both record_ids within a tuple should refer to the same entity.
This method should only be used for small to moderately sized datasets. For larger data, use matchBlocks.
data_1
is a dictionary of records from first dataset, where the keys
are record_ids and the values are dictionaries with the keys being
field names
data_2
is a dictionary of records from second dataset, same form as
data_1
threshold
is a number between 0 and 1 (default is .5). We will consider
records as potential duplicates if the predicted
probability of being a duplicate is above the threshold.
Lowering the number will increase recall, raising it will increase precision
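A sketch of a match call against two datasets; as described above, the result is a sequence of tuples of record ids, one id from each dataset per tuple:
links = linker.match(data_1, data_2, threshold=0.5)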
Class for record linkage using saved settings. If you have already trained a record linkage instance, you can load the saved settings with StaticRecordLink.
Train a matcher instance (Dedupe or RecordLink) from the command line.
> deduper = dedupe.Dedupe(fields, data_sample)
> dedupe.consoleLabel(deduper)
Construct training data for consumption by the RecordLink
markPairs
method from already linked datasets.
data_1
is a dictionary of records from first dataset, where the keys
are record_ids and the values are dictionaries with the keys being
field names
data_2
is a dictionary of records from second dataset, same form as
data_1
common_key
is the name of the record field that uniquely identifies
a match
training_size
is the rough limit of the number of training examples,
defaults to 50000
Every match must be identified by the sharing of a common key. This function assumes that if two records do not share a common key then they are distinct records.
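A sketch of feeding such training data to a linker, assuming this helper is exposed as dedupe.trainingDataLink and that 'unique_id' is a hypothetical field shared by both datasets:
# 'unique_id' is a hypothetical common key present in both data_1 and data_2
training_pairs = dedupe.trainingDataLink(data_1, data_2,
                                         common_key='unique_id',
                                         training_size=50000)
linker.markPairs(training_pairs)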
Construct training data for consumption by the Dedupe
markPairs
method from an already deduplicated dataset.
data
is a dictionary of records, where the keys are record_ids and
the values are dictionaries with the keys being
field names
common_key
is the name of the record field that uniquely identifies
a match
training_size
is the rough limit of the number of training examples,
defaults to 50000
Every match must be identified by the sharing of a common key. This function assumes that if two records do not share a common key then they are distinct records.
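A sketch of the deduplication counterpart, assuming this helper is exposed as dedupe.trainingDataDedupe and that 'unique_id' is a hypothetical field identifying true matches:
# 'unique_id' is a hypothetical common key identifying records that match
training_pairs = dedupe.trainingDataDedupe(data, common_key='unique_id',
                                           training_size=50000)
deduper.markPairs(training_pairs)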