Replies: 5 comments
-
Hey, thank you for the thoroughly written issue.

Regarding problem 1: I don't think the Human node will need to be defined in multiple places. The GSuiteUser-to-Human link makes that connection if `GSuiteUser.email == Human.email`. Let me see if I can roughly sketch out what the model will look like:

```python
@dataclass(frozen=True)
class GSuiteUser(CartographyNodeSchema):
    label: str = 'GSuiteUser'
    properties: ... omitted ...
    sub_resource_relationship: ... omitted ...
    other_relationships: Optional[OtherRelationships] = OtherRelationships(
        GSuiteUserToHumanRel(),
    )


@dataclass(frozen=True)
class GSuiteUserToHumanRel(CartographyRelSchema):
    target_node_label: str = 'Human'
    target_node_matcher: TargetNodeMatcher = make_target_node_matcher({'email': PropertyRef('HumanEmail')})
    direction: LinkDirection = LinkDirection.INWARD
    rel_label: str = "IDENTITY_GSUITE"
    properties: ... omitted ...
```

Loading it would then be straightforward with `from cartography.client.core.tx import load`:
```python
load(
    neo4j_session,
    GSuiteUser(),
    data,
    lastupdated=update_tag,
    ... other params ...
)
```

as long as the data looks like

```python
data = [
    {
        # ... all the fields that a GSuiteUser has.
        'HumanEmail': 'homer@example.com',
    },
    {
        # ... all the fields that a GSuiteUser has.
        'HumanEmail': 'marge@example.com',
    },
    {
        # ... all the fields that a GSuiteUser has.
        # This still works even if some items in the data don't
        # have a `HumanEmail` defined because Neo4j subqueries let us do an
        # optional match, and this lets us use a single generated query
        # to ingest data in the same list with varying relationships.
    },
]
```

For a concrete example, see this integration test: https://github.com/lyft/cartography/blob/81902b23fa80e4ba5332ba00b4477e3a556d5eb7/tests/integration/cartography/graph/test_querybuilder_rel_subsets.py#L11-L18

The Human-to-GitHubUser link isn't present in OSS cartography, but modeling it would look similar to the above.

Regarding this point,
If field values can change between intel sources, it might make sense to prefix the field name on a given node. With

```python
SIMPLE_NODE_MISSING_PROPS = [
    {
        'Id': 'SimpleNode1',
        'property1': 'The',
    },
    {
        'Id': 'SimpleNode2',
        'property2': 'Fox',
    },
]

load(neo4j_session, SimpleNodeSchema(), SIMPLE_NODE_MISSING_PROPS, lastupdated=1)
```

the result I get with `match (n:SimpleNode) return n.id, n.property1, n.property2;` is that all the nulls get treated as properties that don't exist on the node, which is exactly what we want. I'll add this as a formal test to the code though.

Regarding problem 2
Automatically deleting objects created by another intel module is absolutely something we intend to avoid. We should try to make it so that the autocleanup can smartly delete only the objects we want. It's getting a bit late where I am so I'll give more of a hand-wavy explanation of my last few thoughts:
Thanks again for writing this up and informing the design of the data model. To summarize, I think the new data model addresses the concerns of problem 1, and on problem 2 I agree there is a potential issue. Again, this is early stages and we are figuring this out as quickly as possible; I think things will make more sense as we continue to put them together. I'll fix this and make things smooth for your Lastpass change.
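As a footnote to the missing-property behavior discussed above, here is a minimal pure-Python sketch of what the formal test could assert. The helper names (`normalize_batch`, `materialized_properties`) are hypothetical illustrations of the idea, not cartography or Neo4j APIs:

```python
from typing import Any, Dict, List


def normalize_batch(data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Give every row the same key set; absent keys become None (Neo4j null).

    This mirrors how a single generated query can consume rows with varying
    keys: missing parameters simply arrive as null.
    """
    all_keys = {key for row in data for key in row}
    return [{key: row.get(key) for key in sorted(all_keys)} for row in data]


def materialized_properties(row: Dict[str, Any]) -> Dict[str, Any]:
    """Model Neo4j's behavior: null-valued properties are not stored on the node."""
    return {key: value for key, value in row.items() if value is not None}


rows = normalize_batch([
    {'Id': 'SimpleNode1', 'property1': 'The'},
    {'Id': 'SimpleNode2', 'property2': 'Fox'},
])

# Both rows now share one shape, so one generated query can ingest them both...
assert set(rows[0]) == set(rows[1]) == {'Id', 'property1', 'property2'}
# ...and the null-valued properties simply don't exist on the resulting nodes.
assert materialized_properties(rows[1]) == {'Id': 'SimpleNode2', 'property2': 'Fox'}
```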
-
I decided to do a longer write up to explain more of the background and the "why" behind the data model: https://docs.google.com/document/d/1HI_EUgXd55affTNznEj80aY3vNVWlnWSLjwyxJ8nDpQ/edit#. It's long but I figured it's complicated enough that it needs to be documented in some way.
-
Thank you for this amazing reply and documentation. I will try to go deeper into this new schema and will open new issues if needed.
-
Going to transfer this to a discussion since there may be additional schema topics to discuss.
-
Issue #1210 is relevant - I'll draft a fix PR shortly. Hopefully this will unblock the refactors for the rest of the project. Then we can create issues to perform the refactors and hopefully everything will be smooth.
-
Example:

```
(h)-[r:IDENTITY_GSUITE]->(user)
(h)-[r:IDENTITY_GITHUB]->(user)
```

I'm stuck with the CartographySchema implementation:

Problem 1 - multiple definitions

- Even with the schema reorganization (#1096), the Human node must be redefined several times, which could lead to inconsistent fields.
- Adding a cross_intel/common/wellknown list of Nodes to this schema would lead to a PropertyRef issue (fields can change between intel sources).
- Suggestion: add a notion of "constraint/schema" verified by the CI which checks the consistency of the schema (the nodes are defined several times, but consistency is guaranteed).
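The suggested CI consistency check could be sketched roughly as follows. This is a hypothetical illustration (not an existing cartography feature) that operates on plain sets of property names collected from each intel module's definition of a node label:

```python
from typing import Dict, List, Set


def check_schema_consistency(definitions: Dict[str, List[Set[str]]]) -> List[str]:
    """Return error messages for node labels whose multiple definitions disagree.

    `definitions` maps a node label (e.g. 'Human') to the property-name sets
    declared by each intel module that defines that label.
    """
    errors = []
    for label, prop_sets in definitions.items():
        reference = prop_sets[0]
        for idx, props in enumerate(prop_sets[1:], start=1):
            if props != reference:
                # The symmetric difference shows exactly which fields disagree.
                errors.append(
                    f"{label}: definition {idx} declares {sorted(props ^ reference)} "
                    f"inconsistently with definition 0"
                )
    return errors


# Two modules define Human identically -> the check passes.
assert check_schema_consistency({'Human': [{'id', 'email'}, {'id', 'email'}]}) == []
# A third module adds an extra field -> the CI check flags it.
assert check_schema_consistency({'Human': [{'id', 'email'}, {'id', 'email', 'name'}]}) != []
```

Run against every module's schema definitions, a check like this would let nodes stay defined in several places while the CI guarantees consistency, as the suggestion describes.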
Problem 2 - auto clean

- In my usage there is not one execution of the script but several (they can be triggered by Terraform CIs, crons, etc.).
- With a CartographySchemaNode, autoclean will erase the Humans created by another intel module.
- Example: GitHub will update a small subset of Humans, but GoogleSuite may have updated more Human nodes shortly before, and those will be purged by autoclean on the "GitHub execution".
- Suggestion: add a CLI parameter autoclean_max_age (0 by default) and purge where lastupdated < NOW - autoclean_max_age instead of lastupdated != NOW.
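The suggested autoclean_max_age behavior can be sketched as a simple purge predicate. This is an illustrative sketch of the proposal, not cartography's actual cleanup implementation, and it assumes timestamps are epoch seconds:

```python
def should_purge(lastupdated: int, now: int, autoclean_max_age: int) -> bool:
    """Purge nodes whose lastupdated is older than NOW - autoclean_max_age,
    rather than everything whose lastupdated differs from the current run's tag."""
    return lastupdated < now - autoclean_max_age


now = 1_700_000_000

# A node last touched two hours ago is purged when the max age is one hour...
assert should_purge(now - 7200, now, autoclean_max_age=3600)
# ...but a node refreshed ten minutes ago by another intel module survives,
# even though that module's update tag differs from the current run's.
assert not should_purge(now - 600, now, autoclean_max_age=3600)
# With the default autoclean_max_age=0, only nodes from the current run are kept.
assert not should_purge(now, now, autoclean_max_age=0)
```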