Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hash function for AffineScalarFunc class #189

Closed
wants to merge 20 commits into from

Conversation

NelDav
Copy link

@NelDav NelDav commented Jan 22, 2024

closes #186

MichaelTiemannOSC and others added 11 commits July 5, 2023 19:44
Add __hash__ so it works with Pandas, Pint, and Pint-Pandas

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Tweak hash function so that when x==y, hash(x)==hash(y)

Test case:

```
import uncertainties
from uncertainties import ufloat

u = ufloat(1.23, 2.34)
v = ufloat(1.23, 2.34)
print(f"u{u} == v{v}: {u==v}")
print(f"hash(u){hash(u)} == hash(v){hash(v)}: {hash(u)==hash(v)}")
print(f"u{u} == (u+u)/2{(u+u)/2}: {u==(u+u)/2}")
print(f"hash(u){hash(u)} == hash((u+u)/2){hash((u+u)/2)}: {hash(u)==hash((u+u)/2)}")
```

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Ensure that we have a linear_combo attribute before trying to look up its keys.  Also re-indent so our conditional test is not too long, and add test case to test_uncertainties.

Now test_uncertainties passes all 31 tests original tests plus the new one, and test_umath passes all 9 tests.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Call `expand` directly if self._linear_part is not yet expanded.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Attempt to work around old version of nose that doesn't work with later versions of Python.  Will it work?

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Attempt replacing `nose` (which has not been regularly supported for years) with `pytest` for better CI/CD experience.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Try running nosetests directly, rather than via setuptools, which trips up nose-1.3.7 for python >= 3.6.

See nose-devs/nose#873

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Install `future` to satisfy `builtins` for Python 2.7.

Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
@NelDav NelDav changed the title feat: Add hash function for AffineScalarFunc class Add hash function for AffineScalarFunc class Jan 22, 2024
@NelDav
Copy link
Author

NelDav commented Jan 23, 2024

Duplicate of #170

@MichaelTiemannOSC
Copy link

The fact that the changes pass the tests is obviously a great first step. Looks like the next step is ensuring they work for all the versions of Pandas that uncertainties needs to support, not just the latest version.

@andrewgsavage
Copy link
Contributor

Could you get this running again? It would be good to get this into a future release.

@NelDav
Copy link
Author

NelDav commented Apr 2, 2024

Hi @andrewgsavage, nice to hear from you. I already were afraid, that this library is dead. Glad to see that a new maintainer was found :)

I merged the master branch into this one and hope, that this will fix the pipeline.

@wshanks
Copy link
Collaborator

wshanks commented Apr 2, 2024

The Python data model requires that objects that compare equal have the same value for __hash__. Since __eq__ compares values, using id() in __hash__ is not going to be sufficient. I think it might work to use

hash((self._nominal_value, frozenset((v._nominal_value, coeff) for v, coeff in self.derivatives)))

but I have not thought through edge cases carefully (__hash__ gets difficult when there are implicit calculations happening that could lead to floating point numbers rounding differently for different code paths of what should be the same object).

@NelDav
Copy link
Author

NelDav commented Apr 3, 2024

Hi @wshanks, yes yo are right, your solution will lead to same hashes for equal objects. However, it is more or less the same as the solution of this PR but in one line :)
It also generates the same problems when executing the unittests.

Btw, the __hash__ algorithm of this PR leads to same hashes for same values as well.
The reason, that the tests fail is very strange.
I try to explain in the next comment.

@NelDav
Copy link
Author

NelDav commented Apr 3, 2024

I checked the unittests and I really do not understand what's wrong.
I have some Python knowledge, but it is not good enough, to understand what's going on.
Maybe somebody else can explain.

I just analyzed the first failing test so far.
The problem is a key error in this line (test_uncertainties.py: L189):

fixed_deriv_value = func_approx.derivatives[arg]

This line is executed many time before, however for the function frac_part_modf, the key error occurs.
When checking the derivatives, we see tat there is one entry (the value is random for each execution):

{-0.3416063953213695+/-0: 1.0}

When comparing arg with the derivatives, we see, that the it is the same:

>>> print(arg == list(func_approx.derivatives.keys())[0])
True

We can also see, that those are identical objects:

>>> print(arg is list(func_approx.derivatives.keys())[0])
True

Accordingly, when checking if arg is in the list, we get True:

>>> print(arg in list(func_approx.derivatives.keys()))
True

What's really strange is, when we do not convert the keys view into a list , in will return False:

>>> print(arg in func_approx.derivatives.keys())
False

I also compared the hashes, of arg and the firs item of func_approx.derivatives.
However, since they are both the same object, the hash is also the same:

>>> print(hash(arg))
3674758567180263205
>>> print(hash(list(func_approx.derivatives.keys())[0]))
3674758567180263205

So, why does arg in list(func_approx.derivatives.keys()) return True, but arg in func_approx.derivatives.keys() False?
I think if we manage to solve this problem, the key error will be gone.

Copy link
Collaborator

@wshanks wshanks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my earlier comment was misguided because I was thinking about Variable like a data container and not considering that ufloat(1, 1) != ufloat(1, 1) because the two objects represent independent variables. We should keep using id in the hash and not _nominal_value for the Variables.

if hasattr(self, '_linear_part') and hasattr(self, '_nominal_value'):
return super().__hash__()
else:
return id(self)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test failure is coming from this part. __hash__ has to give the same value every time. To avoid the error, one option is to change __hash__ here to:

return hash((self._nominal_value, (id(self),), (1.,)))

and update __init__ to set self._nominal_value before the {self: 1.} dict is created. __setstate__ also needs to be updated to make sure _nominal_value is set before it creates the self-referential dict (but see my other comment about wanting to avoid tuples).

It is tempting to try to remove the self-reference but I have seen another case in a symbolic algebra library where a set up similar to this one was the best option.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are totally right, the hash must be the same every time. Not sure, how I could miss that.


ids = [id(d) for d in self.derivatives.keys()]
values_hash = tuple(self.derivatives.values())
return hash((self._nominal_value, tuple(ids), values_hash))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a case to test_hash like

z_xy = x + y
z_yx = y + x
assert z_xy == z_yx
assert hash(z_xy) == hash(z_yx)

That case fails for me with this implementation but the implementation is dependent on the iteration order of derivatives which depends on in the insertion order, which can vary depending on the order in which operations are performed. Changing the tuples to frozenset or sorting the values before casting to tuple could remove the order dependence.

Also, we could define __hash__ on LinearCombination and use that here instead of constructing it from derivatives directly. Perhaps there is no use for that though.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth trying implementations that cover obvious use cases (such as z_xy == z_yx) and using assertions to raise if we've gone past the limit of our assumptions.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm the problem of a forzenset might be, that the elements of a set must be unique right?
The ids are unique, but the values might not.
I will try to create a test for that...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the example:

a = ufloat(1.23, 2.34)
b = ufloat(1.23, 2.34)
c = ufloat(1.23, 2.34)

x = AffineScalarFunc(1, LinearCombination({a:1, b:2, c:1}))
y = AffineScalarFunc(1, LinearCombination({a:1, b:2, c:2}))
assert x != y                   # Valid
assert hash(x) != hash(y) # Both hashes are identical :(

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using sorted tuples now, to avoid this problem.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm the problem of a forzenset might be, that the elements of a set must be unique right?

This is not a blocking problem for the implementation of the __hash__. It is okay if unequal objects hash to the same value. Using frozenset would make it a little more likely to get hash collisions, but only in the case of multiple error terms with exactly the same value. Still, sorting should work just as well for the purpose of producing a deterministic hash. In the case of giant calculations with many error terms maybe sorting is slower than keeping a set, but in practice usually only a handful of variables are involved and sorting should be about as quick as creating a set.

@MichaelTiemannOSC
Copy link

I checked the unittests and I really do not understand what's wrong. I have some Python knowledge, but it is not good enough, to understand what's going on. Maybe somebody else can explain.
[...]
So, why does arg in list(func_approx.derivatives.keys()) return True, but arg in func_approx.derivatives.keys() False? I think if we manage to solve this problem, the key error will be gone.

When func_approx.derivatives.keys() is evaluated in order to compute the list, the affine scalar function is evaluated and derivatives computed. When func_approx.derivatives.keys() is referenced to check for key membership, it is only a view into the dictionary (a lazy iterable). Because derivatives are evaluated in a lazy fashion (so that correlated variables remain correlated), the test arg in func_approx.derivatives.keys() is actually looking across the list of unrealized computations, and not finding the equivalence of a realized computation with a potential--but unrealized--value.

@NelDav
Copy link
Author

NelDav commented Apr 4, 2024

ufloat(1, 1) != ufloat(1, 1) because the two objects represent independent variables. We should keep using id in the hash and not _nominal_value for the Variables.

I agree, the only problem would be, that assert ((2*x+x)/3)==x will be True but assert hash((2*x+x)/3)==hash(x) won't be.
However, I think that is no problem right?

If it is a problem, we could stick to return hash(frozenset([self._nominal_value, frozenset([id(self)]), frozenset([1.])])).
hash(ufloat(1,1)) != hash(uflaot(1,1)) would be valid anyway, because id(self) is part of the hash calculation.

@MichaelTiemannOSC
Copy link

As stated above, the Python data model requires that a == b requires that hash(a) == hash(b), so it would be a problem if hash(a) != hash(b) when a == b.

This was my solution from last year--an injection of hash functions into uncertainties in my own program's space:

def _AffineScalarFunc__hash__(self):
    if not self._linear_part.expanded():
        self._linear_part.expand()
    combo = tuple(iter(self._linear_part.linear_combo.items()))
    if len(combo) > 1 or combo[0][1] != 1.0:
        return hash(combo)
    # The unique value that comes from a unique variable (which it also hashes to)                                                   
    return id(combo[0][0])


def _Variable__hash__(self):
    # All Variable objects are by definition independent                                                                             
    # variables, so they never compare equal; therefore, their                                                                       
    # id() are allowed to differ                                                                                                     
    # (http://docs.python.org/reference/datamodel.html#object.__hash__):                                                             

    # Also, since the _linear_part of a variable is based on self, we can use                                                        
    # that as a hash (uniqueness of self), which allows us to also                                                                   
    # preserve the invariance that x == y implies hash(x) == hash(y)                                                                 
    if hasattr(self, "_linear_part"):
        if hasattr(self._linear_part, "linear_combo") and self in iter(self._linear_part.linear_combo.keys()):
            return id(tuple(iter(self._linear_part.linear_combo.keys()))[0])
        return hash(self._linear_part)
    else:
        return id(self)


try:
    import uncertainties

    uncertainties.AffineScalarFunc.__hash__ = _AffineScalarFunc__hash__
    uncertainties.Variable.__hash__ = _Variable__hash__
    [...]

@NelDav
Copy link
Author

NelDav commented Apr 4, 2024

Yes, that's true, the data model has two conditions for hashes:

An object is hashable if it has a hash value which never changes during its lifetime

Hashable objects which compare equal must have the same hash value.

However, as long as the object is mutable, and the hash depends on any of the mutable properties, we cannot guarantee, that the hash does not change during lifetime right?
Though, the only possibility to guarantee a constant hash during lifetime is to use a immutable value to calculate the hash.
However, if the hash does not change during lifetime, but the properties, __eq__ depends on can change, we cannot guarantee that two hashes are equal every time, __eq__ returns True.

Summarized, I think we have to violate one of the conditions.
The question is, which one do we want to violate?

PS: The data model recommends to not implement a the hash function for mutable objects at all:

If a class defines mutable objects and implements an eq() method, it should not implement hash(), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).

Currently, I think we should violate the second condition for Variable and the first condition for AffineScalarFunction.
That would mean, returning hash(id(self)) for Variable and calculating the hash via nominal value and linear part for AffineScalarFunction.

I pushed a new proposal.
If anyone has further ideas for a better solution, please let me know.
The proposal is far from perfect.

@andrewgsavage
Copy link
Contributor

@MichaelTiemannOSC Should we know what "EA" means?

pandas ExtensionArray. An array of objects for use with pandas. Does not necessarily need to use a numpy array to hold data.

@wshanks
Copy link
Collaborator

wshanks commented Apr 5, 2024

Also, I don't know that the example adds too much here, but I want to note that I had a similar discussion several months ago about a very similar data structure in another project here.

@newville
Copy link
Member

newville commented Apr 5, 2024

@MichaelTiemannOSC Thanks for the clarification. I am not very fluent in pandas. I think propagating uncertainties on complex numbers is a fine idea. I'm not sure I would call it "easy", but it should be doable. I have a problem domain where it would very useful.

@MichaelTiemannOSC
Copy link

@MichaelTiemannOSC Thanks for the clarification. I am not very fluent in pandas. I think propagating uncertainties on complex numbers is a fine idea. I'm not sure I would call it "easy", but it should be doable. I have a problem domain where it would very useful.

Nothing like a new feature to help motivate continued efforts on maintenance ;-)

@jagerber48
Copy link
Contributor

So what is the story with the python data model. My understanding is that an object should be immutable and have an __eq__ function defined if we plan to add a hash. It sounds like the objects we are trying to add hash to, Variable and AffineScalarFunc I think, are both mutable? Or what exactly is the story here?

It sounds like we are planning to break the recommendations of the data model. I would prefer not to do that unless there is an awfully good reason to do so.

Again, I haven't dug into this as much as I'd like, or followed this discussion as closely as I'd like, but it sounds like the issue is that the uncertainties objects are technically immutable because they do some sort of lazy evaluation? And this lazy evaluation is for performance reasons?

@NelDav
Copy link
Author

NelDav commented Apr 5, 2024

My understanding is that an object should be immutable and have an eq function defined if we plan to add a hash.

@jagerber48 I think, that is not completely correct.

The data model defines the following:

The only required property is that objects which compare equal have the same hash value; it is advised to mix together the hash values of the components of the object that also play a part in comparison of objects by packing them into a tuple and hashing the tuple.
-- source

That is exactly, what we did in this PR.
Additionally, the data model mentions this:

If a class defines mutable objects and implements an eq() method, it should not implement hash(), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).
-- source

Yes, in general they recommend to not implement __hash__ if the object defines mutable objects and implements __eq__, which is the case for AffineScalarFunction.
However, they also tell us the reason, why we should not implement __hash__ in this scenario.
And the reason is, that the hash must be immutable.
As long as we can guarantee, that the hash is immutable, we do not have to care about the mutability of the object itself.
And since the hash is calculated from the _linear_part and _nominal_value, which are both immutable, the hash must be immutable as well.

@newville
Copy link
Member

newville commented Apr 6, 2024

So, yeah why isn't Variable.__hash__ not

      def __hash__(self):  I guess include the tag?
           return hash((self.nominal_value, self.std_dev, self.tag))

Like, this:

class Thing:
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def __eq__(self, other):
        return (self.a == other.a  and  self.b == other.b)

    def __hash__(self):
        return hash((self.a, self.b))

thing1 = Thing(1, 3)
thing2 = Thing(7,  'word')

print("Test 1 ", thing1 == thing2, hash(thing1))
# False

# mutate thing 2

thing2.a = 1
thing2.b = 3
print("Test 2: ", thing1 == thing2, hash(thing2),)
# True

@wshanks
Copy link
Collaborator

wshanks commented Apr 6, 2024

I am repeating what @NelDav said a little but just to re-emphasize: the data model requires that objects that compare equal have the same hash and that the hash never changes. I think considering the case of dict keys is helpful since that is the main use case. When inserting into a dictionary, an object is hashed and then the dict is checked to see if that hash (or a truncated part of it dependent on the size of the dict) is already present. If the hash is already present, then the two matching objects are checked for equality to see if the object is already present or if it is just a hash collision. So both __hash__ and __eq__ are involved and you do not want the hash value to change after inserting an object into a dict because then you won't find it on the next look up.

AffineScalarFunc.__eq__ uses this function (which Variable does not override):

def eq_on_aff_funcs(self, y_with_uncert):
"""
__eq__ operator, assuming that both self and y_with_uncert are
AffineScalarFunc objects.
"""
difference = self - y_with_uncert
# Only an exact zero difference means that self and y are
# equal numerically:
return not(difference._nominal_value or difference.std_dev)

which casts both quantities to AffineScalarFunc (in case one is a Variable or plain float) and then takes the difference and checks that the nominal value and standard deviation are 0.

It is hard to point to how subtraction is defined because there is a lot of machinery for setting up the basic math methods, but basically AffineScalarFunc.__sub__ is set to uncertainties.core.wrap(float.__sub__, [lambda x, y: 1, lambda x, y: -1]). For the nominal value float.__sub__(self, other) is called (self and other are the original self and other cast AffineScalarFunc). For the standard deviation, the error propagation is performed, combining the .derivatives of self and other into a new dict we can call combined_derivatives by copying self.derivatives and then iterating over other.derivatives.items() and for any Variable key already in combined_derivatives subtracting the value, otherwise inserting the Variable with -value into the dict. Then the new standard deviation is sqrt(sum(variable.std_dev**2 * coefficient**2 for variable, coefficient in combined_derivatives)). Ignoring edge cases for a moment, the standard deviation is 0 only when all the coefficients in the sum are 0 and this happens only the Variables involved are the same for self and other and the derivative coefficients are exactly the same (otherwise there would be a non-zero term in this sum of squares for the Variable in self that did not have its derivative exactly subtracted by other). So what we see is tha the conditions for equality involve nominal_value, the identity of the Variables, and the coefficients of the derivatives of the variables. The conditions do not place restrictions on the std_dev of the Variables. So these are the quantities we have been including in the hash -- nominal_value and (Variable, derivative coefficient) pairs. So std_dev and tag are not required to calculate equality and so are not needed for the hash (ignoring the edge cases discussed next). Additionally, the documentation talks about the benefits of Variable.std_dev being mutable, so it definitely can not be included in the hash with that as documented behavior.

I had not thought about the edge cases closely before writing this last paragraph. There are a couple:

  1. Case where the std_dev of a Variable is 0:

    a, b = ufloat(1, 0), ufloat(1, 0)
    a == b  # True
    hash(a) != hash(b)  # Data model violation!

    This model violation is actually present in the current code, separate from this PR. The only ways I see around it are not allowing std_dev to be 0, or not allowing std_dev to be mutable (in which case std_dev could be included in the hash).

  2. Case where the derivative of a Variable is 0:

    from uncertainties import ufloat
    from uncertainties.umath import cos
    a = ufloat(0, 0.1)
    b = ufloat(1, 0.1)
    x = 2 * b
    y = 2 * b * cos(a)
    x == y  # True
    hash(x) != hash(y)  # Data model violation!

    This violation is added in this PR since x and y are AffineScalarFuncs and this PR adds AffineScalarFunc.__hash__. This case could be worked around by excluding Variables with 0 derivative from the hash.

I am not sure what to do about the std_dev==0 edge case. On the one hand, it doesn't seem that bad to forbid it. If there is no error, you could just use a plain float. On the other hand, the documentation highlights the benefits of changing std_dev to recalculate error and one thing a user might want to do is check the error when one error term is zeroed out. Perhaps __eq__ could be modified to raise an exception when one of the std_devs is 0.

@wshanks
Copy link
Collaborator

wshanks commented Apr 6, 2024

Also, I don't know that the example adds too much here, but I want to note that I had a similar discussion several months ago about a very similar data structure in another project here.

Thinking about the other similar case I mentioned, one difference is that instead of using the object id() for the hash it generates a uuid in __init__ and uses that. That is important for preserving identity across pickling. Consider the following:

import pickle
from uncertainties import ufloat


b = ufloat(1, 0.1)
x = 2 * b
x2, b2 = pickle.loads(pickle.dumps((x, b)))
b3 = pickle.loads(pickle.dumps(b))
x3 = pickle.loads(pickle.dumps(x))
x == 2 * b  # True
x2 == 2 * b2  # True
x3 == 2 * b3  # False!

The last line is false because b3 and x3 are pickled separately and the b-like Variable internal to x3 loses it reference to b3. x2 keeps the reference by being pickled and unpickled together with b2. After figuring out what to do in this PR, I think we want to switch Variable to using a stable id in its hash and equality check.

@NelDav
Copy link
Author

NelDav commented Apr 6, 2024

TLDR:
I think x3 == 2 * b3 # False is the correct behavior.
Since the hashes will also be unequal (hash(x3) == hash(2 * b3) # False), we are totally fine.


The last line is false because b3 and x3 are pickled separately and the b-like Variable internal to x3 loses it reference to b3. x2 keeps the reference by being pickled and unpickled together with b2.

That's true, however the hash will also be unequal. Which means, that it is conform with the python data model right?
The question is, is this behavior of __eq__ correct or not right?

In general, I think there are two questions, that we have to clarify:

  1. What is equality in our case? - Currently, two different Variable objects are always unequal. That means, for Variables, comparing via == is the same as comparing via is. I think this is totally fine and one of the important boundary conditions.
    Two different AffinceScalarFunctions however, can be equal. They are equal if and only if the _nominal_values are equal and if the expanded version of _linear_part is identical meaning, it contains the same references and the for each reference, the same derivation is given right?

  2. When unpickling the same Variable two times, should both objects be equal? - I think, they should not. There is a unit test, that explicitly checks if a Variable is unequal to the unpickled version of it self. (test)
    If that is a condition, why should two unpickled versions of the same Variable be equal?

If we decide, that two unpickled versions of the same object must be unequal, it follows, that if you unpickle a AffineScalarFunction object and a dependent Variable object with two separated pickle.loads calls, the AffineScalarFunction will not depend on the Variable, but on a copy of this Variable.

I personally think, the __eq__ method mixes some definitions.
It sometimes checks identity and in some situations, it checks for equality.
I think, the best solution would be to implement the __eq__ function as described in 1. above.
Additionally, we might want to add a further function, that checks for equality. (just checking if nominal value and std_dev of two AffineScalarFunction-objects are equal) We could call this function compare.

@NelDav
Copy link
Author

NelDav commented Apr 6, 2024

@wshanks Thank you very much for your detailed explanation. It is very informative.

I had not thought about the edge cases closely before writing this last paragraph. There are a couple:

  1. Case where the std_dev of a Variable is 0:

From my perspective, this is not an issue of the hash, but of the __eq__ method. As explained in the comment above, two Variable objects should always be unequal. A Variable object is only equal to itself.

  1. Case where the derivative of a Variable is 0:

This is a little strange to me. Probably it is because of my lack of knowledge about uncertainties.
Why should x equal y in the first place?

@wshanks
Copy link
Collaborator

wshanks commented Apr 6, 2024

@NelDav For your next to last comment, I created #218 since the PR discussion is long and the issue I raised is a somewhat separate concern about pickling and equality.

@NelDav
Copy link
Author

NelDav commented Apr 6, 2024

Thanks, I totally agree that this should be discussed in in another issue.

uncertainties/core.py Outdated Show resolved Hide resolved
@wshanks
Copy link
Collaborator

wshanks commented Apr 6, 2024

  1. Case where the std_dev of a Variable is 0:

From my perspective, this is not an issue of the hash, but of the __eq__ method. As explained in the comment above, two Variable objects should always be unequal. A Variable object is only equal to itself.

That's true of Variables as long as they have finite standard deviation. The current implementation provides that AffineScalarFuncs are equal when x - y has nominal value and standard deviation both equal to 0. Two Variables with the same nominal value will only be equal if both their standard deviations are 0. Otherwise the standard deviations will add as root squares and have to be non-zero. A Variable with zero standard deviation is essentially a float. That is why I am open to the idea of handling that case specially, like raising an exception if trying to check equality of two AffineScalarFuncs when one of them has a Variable with zero standard deviation. Perhaps another option would be to change __eq__ to be x.nominal_value == y.nominal_value and x.derivatives == y.derivatives. This definition allows for x - y == 0 +/- 0 but x != y when x or y has a Variable with 0 standard deviation, but maybe that is okay. It is like saying "x - y == 0 +/- 0 for all values of standard deviation of the independent variables". It still seems to make sense intuitively to me since a Variable's standard deviation is allowed to be changed at any time.

  1. Case where the derivative of a Variable is 0:

This is a little strange to me. Probably it is because of my lack of knowledge about uncertainties. Why should x equal y in the first place?

y = x * cos(a). For the example I gave, a.nominal_value = 0 so cos(a).nominal_value = 1. The derivative of cos(a) is -sin(a) so it is 0 for a = 0. So for that specific value, y.nominal_value == x.nominal_value and the derivative with respect to a is 0 so y is insensitive to error in a and so y's error is correlated with x's and y - x == 0 +/- 0. If we went with the suggestion above of __eq__ being x.nominal_value == y.nominal_value and x.derivatives == y.derivatives then x != y here. That is a little bit more unexpected in this case. We might want to extend the suggested definition to comparing derivatives after excluding any entries with a derivative coefficient of 0.

@NelDav
Copy link
Author

NelDav commented Apr 6, 2024

Perhaps another option would be to change eq to be x.nominal_value == y.nominal_value and x.derivatives == y.derivatives

That was also my first idea. And I think it would be a nice solution.
I will test it and implement a version which excludes derivatives equal to 0.

@NelDav
Copy link
Author

NelDav commented Apr 6, 2024

I just tested the new __eq__ definition and there are some problems when comparing ufloats with normal floats.
For example inside the power_wrt_ref function:

assert op(ufloat(-1, 0), 9) == ref_op(-1, 9)

In this case, the new definition will return false, because the derivatives of the left side will be {-1.0+/-0: 9.0} and the one of the right side is empty {}.
I think it is strange, that it is empty, so I step into the conversion from float to ufloat (core.py L267-269)
There, a AffineScalarFunction is generated from the float. But the linear part does not refer to it self.
I think it would be better to generate a Variable, which by default refers to itself.
However, when generating a Variable instead, the derivatives would clearly be {self:1.0}.
But that is not equal to the derivatives of the left side variable.

So, I think there is a problem as soon as we are checking for uncertainties with a std_dev of 0.
I thought about this for a while now and I do not know what behavior is the best.
On one hand, I think it is a bit strange, that pow(ufloat(1,0), 2) == pow(ufloat(1,0),2) == pow(1,2) will compare equal, but pow(ufloat(1,0.1), 2) == pow(ufloat(1,0.1),2) will not.
On the other hand, it seems fine that a ufloat without std_dev is equal to a normal float.

But maybe we can move this discussion to #218 as well. It seems to be a more general problem of __eq__ and not of the hash it self.

@MichaelTiemannOSC
Copy link

Comparing the above (uncertainties with error of zero) with complex vs. real behavior:

>>> xx = complex(3+0j)
>>> xx
(3+0j)
>>> yy = 3.0
>>> xx == yy
True
>>> hash(xx) == hash(yy)
True

I think that an uncertain number that happens to be certain (like a complex number that happens to be real) should honor equality relationships. Can a variable's value or an affine scalar function's value ever change from certain (error == 0) to uncertain? I can see the error term going from undefined to defined--that's not a problem.

@NelDav
Copy link
Author

NelDav commented Apr 6, 2024

Yes, I think your comparison with complex vs. real fit quite well.
But is it equal? I mean, you cannot really say if it is equal or not right?
If you have 2±undefined, that can be 2±0.3 or 2±3 it must not be 2±0.
So, why should 2±undefined equal 2±0?

@wshanks
Copy link
Collaborator

wshanks commented Apr 6, 2024

@MichaelTiemannOSC I was going to raise a similar point. Note the following:

from uncertainties import ufloat

ufloat(1.0, 0) == 1.0  # True
hash(ufloat(1.0, 0)) == hash(1.0)  # False, data model violation!

So current uncertainties code is violating the Python data model. I don't see any way to avoid that with the current implementation of uncertainties's features. We can't enforce that a Variable with 0 standard deviation hashes to the same value as a plain float while also allowing the documented behavior that the std_dev attribute on a Variable can be changed at any time. I agree that I think we need to move to a discussion about what AffineScalarFunc.__eq__ currently does, what it is intended to do, and what would be best for it to do in the future. Some options we have touched on above include:

  1. Do not allow std_dev to be mutated after creation of Variable. Perhaps the feature of allowing recalculating error for a different std_dev value could be provided by a helper function rather than supporting mutating the value on a live Variable.
  2. Require __eq__ to raise an exception when std_dev is 0 on a Variable. If std_dev can be mutated, we can't allow it be 0 and still compare by standard deviation of the difference since different AffineScalarFunc could compare equal when std_devs are 0 but not when they are switched to finite values.
  3. Make Variable not inherit from AffineScalarFunc and not be something that supports math operations and just be an internal bookkeeping device of AffineScalarFunc. Then we could avoid considering edge cases of Variable.__eq__ and not support hashing AffineScalarFunc.

Besides these options and others, we need to consider what level of interoperability there should be between AffineScalarFunc and float (and other numeric types outside of uncertainties) (like in the data model violation demonstrated above).

@wshanks
Copy link
Collaborator

wshanks commented Apr 6, 2024

@jagerber48 I meant to mention earlier that I think this blog post does a nice job exploring some of the edge cases that come up when a class does not follow the Python data model.

@newville
Copy link
Member

newville commented Apr 6, 2024

-1 on merging. Too many overlapping changes and conflicting opinions. Please start over.

@NelDav
Copy link
Author

NelDav commented Apr 6, 2024

Too many overlapping changes

I do not understand. There are changes in two files, with in total only 117 new lines and 11 removed lines.
The current master branch is two commits ahead.
One of those commits removes two includes. The other one moved some test files into a sub directory named tests.
Resolving the conflicts should be pretty easy.

@wshanks
Copy link
Collaborator

wshanks commented Apr 6, 2024

@NelDav I agree that this PR should be closed for now. It would be better to come to a resolution at least among @MichaelTiemannOSC, myself, and you (just selecting those three as the ones that seem most interested seeing something like the current PR merged) before presenting a new proposed change in a new pull request. Given that in the recent course of discussion we have discovered violations of the Python data model with the code currently in the master branch of uncertainties, I think we need to think through how things should be working before adding more behavior related to __hash__ and __eq__. I have found the recent discussion helpful but perhaps it is too high volume for the main repo. We could continue perhaps in a fork (a Discussion here and ask those not interested to unsubscribe) and come back when there is consensus amongst us (the parties interested AffineScalarFunc.__hash__ being added). Having a PR here gives the impression of a change ready for consideration and until at least a few of us are in agreement I don't think we should impose on the wider group to follow the exploratory discussion.

@NelDav
Copy link
Author

NelDav commented Apr 6, 2024

It would be better to come to a resolution at least among @MichaelTiemannOSC, myself, and you (just selecting those three as the ones that seem most interested seeing something like the current PR merged) before presenting a new proposed

Where should we discuss such things, if we do not do that in a PR?

Given that in the recent course of discussion we have discovered violations of the Python data model with the code currently in the master branch of uncertainties, I think we need to think through how things should be working before adding more behavior related to __hash__ and __eq__

Yes, we have discovered some violations of the Python data model inside the current master branch.
But I disagree, that we have to figure out how to fix those violations.
The hash proposed in this PR (and this PR is only regarding the hash) fits to the current master and does (probably) not introduce further data model violations, beside the violations that already exist.
Fixing the __eq__ method was never part of this and should be addressed in a completely new PR.

and until at least a few of us are in agreement I don't think we should impose on the wider group to follow the exploratory discussion

Is there any disagreement between us at the moment? - For me, it feels like all disagreements between @MichaelTiemannOSC, you and me have been resolved.
If you think there are some disagreements, just open some issues in my fork.

@NelDav
Copy link
Author

NelDav commented Apr 6, 2024

I will close this PR now, for further questions or propositions regarding the hash, please create an issue at my fork: https://github.com/NelDav/uncertainties

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pandas sort_values with multiple columns does not work for AffineScalarFunc
6 participants