-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Food for thought: SymbolId can carry a pointer to the owning SymbolTable #3054
Comments
Sounds like a waste of memory (2x for the AST) and potentially on disk too if you persist that field along the IDs. |
Is memory increase really a concern here? Even for a reasonable sized design, this wouldn't be any substantial increase. Currently, VObject is 48 bytes of which 16 bytes is two SymbolIds. It adds another 16 bytes to each instance, rounding it to 64 bytes each instance. Runtime memory lost here is easily gained by not having to clone SymbolIds from one SymbolTable to other. |
"I can't envision a scenario where something like this would interfere with multithreading and/or multiprocessing since these are read operations and SymbolTable is guaranteeing location persistence." When parsing in multithreading, all symbol tables (one per file) are populated in parallel with no synchronization. Moreover a symbol table per file is what enables partial recompile, where only files that changed gets recompiled and error messages for those files get updated. Finally all files get recombined and symbols merged. |
What would be the motiviation to do so ?
|
That's one reason.
I have considered having the index plus the string_view. @alaindargelas seems to be concerned about adding 8 bytes to the implementation and this would double that to additional 16 bytes. I don't have a good justification to defend such a change. The other reason (apart from the error-avoidance) why I think a index + SymbolTable pointer works better is it avoids having to duplicate symbols in multiple SymbolTables. Treat SymbolId as a pinned (immovable) string and as long as the lifetime of the SymbolTable is longer than the contained symbols itself, all this duplication can be avoided. There's a lot of code where a Symbol is copied from one SymbolTable to other and that to me seems very redundant. This doesn't mean there's only one SymbolTable (contrary to what @alaindargelas suggested in his comment above) across the board. There are just as many SymbolTable instances as there currently are but they will be smaller in size. |
I am not proposing single table. The number of SymbolTable remains the same as it is today. Except that if I have a SymbolId instance, I don't have to know which SymbolTable it belong to also - the SymbolId instance holds that information. In a collection of SymbolId' each instance can be pinned in a different SymbolTable and that would just work. At the moment, to create such a collection, all the SymbolIds have to belong to the same SymbolTable.
How does this process break because of the proposed change? |
On a second though - This is likely to have a performance impact. At the moment, |
What I mean is to compare the address of the string, as it is uniqified in the symbol table. So integer or pointer comparison has the same cost. |
That might work. With any of these solutions, we could also potentially save on memory in Two questions arise for any of these solutions to work -
|
I'm not following all the details, but here is a hard constraint: Some answers: The symbol tables always outlive the strings they represent (Else it is a bug). |
One side affect or unintended behavior change. Consider two symbol ids that both say resolve to bad symbols. In the current implementation their native ids would be different and so they would be considered not equal. But if the SymbolId was a wrapper on the string (or some representation of it) both would be pointing to the same bad raw symbol making them equal. I don't know if this difference matter but worth at least pointing it out since it is different from current behavior. |
Opened a discussion #3228 |
@hzeller @alaindargelas I guess, neither of you are a fan of this approach. I haven't got any feedback on the discussion about PathId using this strategy here #3228. If we aren't making any progress on this I would recommend it closing this issue with your thoughts. |
I still think we should have a the filesystem keep track of its own IDs, so e.g. have its own SymbolTable. |
In the past @alaindargelas has had reservations against using synchronization primitives especially for SymbolTable. Also, it does get a bit complicated with batch jobs since they run on the same process instance sharing the same FileSystem (and with it owning the SymbolTable, it will be bloated). I don't think the current solution is bad (may be the class name does exactly fit the logic behind it). But, I am open to suggestions on improving the existing solution, for both SymbolId & PathId. |
SymbolId in itself is just a number and to resolve it correctly it has to know which instance of SymbolTable it needs to use. This is very error prone and mistakes aren't very obvious. (I actually ended up introducing a bug and running the processes ended up creating files and directories in folders that were hard to get rid of because of special characters in them).
If SymbolId were to carry a pointer against which to resolve, it will eliminate lot of these issues. Basically, treating the SymbolId more like a handle then as an id.
symbolTable->getSymbol(symbolId)
one would usesymbolId.value()
I can't envision a scenario where something like this would interfere with multithreading and/or multiprocessing since these are read operations and SymbolTable is guaranteeing location persistence.
@alaindargelas @hzeller Thoughts??
The text was updated successfully, but these errors were encountered: