Combine Entities
SemTK provides support for combining entities as part of an overall entity-resolution pipeline. The logic for deciding which entities are equal is left to domain-specific tools. SemTK provides Combine Entities functionality to:
- store relationships showing identical entities
- combine the entities
SemTK provides a simple model in EntityResolution.sadl and EntityResolution.owl.
An instance of the SameAs class can be ingested to declare two other instances the same. It contains the properties:
- target which indicates the "main" instance
- duplicate which indicates a "duplicate" instance
An application that performs entity resolution may ingest instances of SameAs using normal SemTK ingestion tools. The class may be extended with a subclass containing additional information. Be aware that the SameAs instance, including any such additional information, will be deleted during the combining process. The properties may be extended to sub-properties for clarity.
The target's type must be the same as, or a subclass of, the duplicate instance's type (subclass*, i.e. zero or more subclass steps). Otherwise an error occurs.
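The type rule above can be illustrated with a minimal sketch. This is not SemTK code: the function name `is_subclass_star` and the `superclasses` dictionary (class name to list of direct superclasses) are assumptions made for illustration only.

```python
def is_subclass_star(cls, ancestor, superclasses):
    """Return True if cls equals ancestor or reaches it through
    zero or more direct-superclass steps (hypothetical helper)."""
    stack, seen = [cls], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue  # guard against loops in the class hierarchy
        seen.add(current)
        stack.extend(superclasses.get(current, ()))
    return False


# Hypothetical hierarchy: SubDD_Req is a subclass of REQUIREMENT.
superclasses = {"SubDD_Req": ["REQUIREMENT"]}

# A SubDD_Req target with a REQUIREMENT duplicate satisfies the rule.
target_ok = is_subclass_star("SubDD_Req", "REQUIREMENT", superclasses)
# A REQUIREMENT target with a SubDD_Req duplicate does not.
target_bad = is_subclass_star("REQUIREMENT", "SubDD_Req", superclasses)
```

The check is called with the target's type first and the duplicate's type second, so a more specific target may absorb a more general duplicate but not the reverse.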
SameAs relationships may be chained, but these are considered errors:
- An object is a duplicate to two targets
- A chain of SameAs relationships is circular
- Cardinality violations where a SameAs instance does not have exactly 1 target and 1 duplicate
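As a rough illustration of these three error conditions, the sketch below checks a set of SameAs records held in plain Python data. The function name `find_same_as_errors`, the input layout (SameAs id mapped to a list of targets and a list of duplicates), and the error labels are all assumptions for this example. SemTK's own checks may be implemented differently.

```python
from collections import defaultdict


def find_same_as_errors(same_as):
    """same_as: dict of SameAs id -> (list of targets, list of duplicates).
    Returns a list of (error_kind, offending_id) tuples (hypothetical format)."""
    errors = []
    targets_of = defaultdict(set)  # duplicate -> set of its targets

    # Cardinality: each SameAs needs exactly 1 target and 1 duplicate.
    for sid, (targets, duplicates) in same_as.items():
        if len(targets) != 1 or len(duplicates) != 1:
            errors.append(("cardinality", sid))
            continue
        targets_of[duplicates[0]].add(targets[0])

    # An object may not be a duplicate to two different targets.
    for dup, tgts in targets_of.items():
        if len(tgts) > 1:
            errors.append(("multiple_targets", dup))

    # Circular chains: follow duplicate -> target links from each start.
    next_of = {d: next(iter(t)) for d, t in targets_of.items() if len(t) == 1}
    for start in next_of:
        seen = {start}
        node = next_of[start]
        while node in next_of and node not in seen:
            seen.add(node)
            node = next_of[node]
        if node == start:
            errors.append(("circular", start))
    return errors
```

A simple chain such as B into A and C into B produces no errors, while B into A together with A into B reports both instances as circular.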
Combining entities is currently accessed through semtk-python3.combine_entities_in_conn() or the REST API endpoint /nodeGroupExecution/dispatchCombineEntitiesInConn.
Combining entities occurs in passes, where each pass consists of all SameAs whose duplicate is not also a target of another SameAs. The passes continue until no more SameAs meeting this criterion are found. If any additional SameAs still exist, they will be reported as an error, as they must violate cardinality or chaining rules.
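The pass ordering can be sketched as follows. This only models the selection rule described above on in-memory data. The function name `plan_passes` and the input layout (SameAs id mapped to a (target, duplicate) pair) are assumptions for illustration, and the sketch assumes cardinality has already been checked.

```python
def plan_passes(same_as):
    """same_as: dict of SameAs id -> (target, duplicate).
    Returns (passes, leftover): a list of passes, each a sorted list of
    SameAs ids, plus the ids that could never be scheduled."""
    remaining = dict(same_as)
    passes = []
    while remaining:
        # Instances still acting as a target of some unprocessed SameAs.
        targets = {target for target, _ in remaining.values()}
        # A SameAs is ready when its duplicate is not such a target.
        ready = sorted(
            sid for sid, (_, dup) in remaining.items() if dup not in targets
        )
        if not ready:
            break  # whatever is left is reported as an error
        passes.append(ready)
        for sid in ready:
            del remaining[sid]
    return passes, sorted(remaining)


# Chain: C is a duplicate of B, and B is a duplicate of A.
chain = {"s1": ("A", "B"), "s2": ("B", "C")}
chain_passes, chain_leftover = plan_passes(chain)
```

In the chain example, s2 is processed first because its duplicate C is not a target of anything, and s1 follows in the second pass once B is no longer waiting to absorb C. A circular pair never becomes ready and ends up in the leftover list.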
For each SameAs the combination process is:
- Delete the duplicate instance's type relationships
- Delete any triple from duplicate where adding it to target would violate a cardinality constraint. For example, if the class has property "name" with cardinality 1 and both duplicate and target have names, the duplicate's name is deleted and the target's name is retained
- Copy all remaining triples where the subject or object is duplicate such that the subject or object is now target, e.g. any triple that only occurs for one or the other of target and duplicate, or where multiple triples with the same predicate are allowed
- Delete those triples where the subject or object is duplicate
- Delete any triples containing the SameAs instance as the subject
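The steps above can be sketched on a plain set of (subject, predicate, object) tuples. This is an illustration under stated assumptions, not the SemTK implementation: the function name `combine`, the `max_one` parameter (predicates limited to a single value), the string form of the type predicate, and the instance names in the example are all hypothetical.

```python
RDF_TYPE = "rdf:type"  # assumed shorthand for the type predicate


def combine(triples, same_as_id, target, duplicate, max_one=frozenset()):
    """Merge duplicate into target, following the five steps in order."""
    result = set(triples)

    # 1. Delete the duplicate instance's type relationships.
    result -= {t for t in result if t[0] == duplicate and t[1] == RDF_TYPE}

    # 2. Delete duplicate triples that would break a max-cardinality-1
    #    predicate because the target already has a value for it.
    target_preds = {p for (s, p, _) in result if s == target}
    result -= {
        (s, p, o) for (s, p, o) in result
        if s == duplicate and p in max_one and p in target_preds
    }

    # 3. Copy remaining triples that mention duplicate so they mention target.
    def swap(node):
        return target if node == duplicate else node

    touching = {(s, p, o) for (s, p, o) in result if duplicate in (s, o)}
    result |= {(swap(s), p, swap(o)) for (s, p, o) in touching}

    # 4. Delete the original triples that mention duplicate.
    result -= touching

    # 5. Delete triples whose subject is the SameAs instance.
    result -= {t for t in result if t[0] == same_as_id}
    return result


# Hypothetical data: req1b duplicates req1; identifier has cardinality 1.
triples = {
    ("req1", RDF_TYPE, "SubDD_Req"),
    ("req1", "identifier", "REQ-1"),
    ("req1b", RDF_TYPE, "REQUIREMENT"),
    ("req1b", "identifier", "Req 1"),
    ("req1b", "wasImpactedBy", "change7"),
    ("sameAs1", RDF_TYPE, "SameAs"),
    ("sameAs1", "target", "req1"),
    ("sameAs1", "duplicate", "req1b"),
}
combined = combine(triples, "sameAs1", "req1", "req1b", max_one={"identifier"})
```

After the call, `combined` holds only three triples: the type and identifier of req1, plus the wasImpactedBy relationship now attached to req1. The triple copied from the SameAs instance in step 3 is cleaned up by step 5, which is why the steps are kept in the listed order.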
Consider this example of two entities connected by a SameAs, and consider that:
- cardinality of identifier is one
- SubDD_Req is a subclass of REQUIREMENT
The duplicate version of the instance was created with the base class REQUIREMENT and an identifier string that uses different punctuation, abbreviations, etc.
After combining entities, there is a single instance of the SubDD_Req subclass:
- all three dataInsertedBy relationships are retained
- the wasImpactedBy relationship is moved to the target instance
- the duplicate identifier is removed, as combining it would violate the cardinality of 1
- the duplicate (super)type is removed
- the SameAs instance is removed