-
-
Notifications
You must be signed in to change notification settings - Fork 11.8k
MAINT: Simplify array setstate by using general deallocation code #30414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
6dc78ef to
e79f832
Compare
mhvk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, tests pass now on this refactoring to bring deallocation attempts together. I added some comments in-line to help review.
| if (_buffer_info_free(fa->_buffer_info, (PyObject *)self) < 0) { | ||
| PyErr_WriteUnraisable(NULL); | ||
| if (fa->_buffer_info) { | ||
| if (_buffer_info_free(fa->_buffer_info, (PyObject *)self) < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part shows both the advantage and disadvantage of combining the deallocation from two different parts:
- advantage:
_buffer_info_freewas not called in the__setstate__path. Perhaps it can never be set? Or just introduced later? Better safe than sorry! - disadvantage: bit spaghetti-codey to signal an error return only on one path. Also, we now have to make sure we clear any member that we deallocated (the assignment to NULL just below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it can never be set? Or just introduced later?
Both are true, since __setstate__ only makes sense on a fresh array so a buffer cannot have been exported for it (I honestly don't think __setstate__ is a good choice for our types, but that is a different issue).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you re-factor this and remove the unraisable propagation? This function has no reason to handle it, it can just propagate errors normally (it has be careful to be OK with an in-flight error, but it is, I think).
You can handle the PyErr_WriteUnraisable inside dealloc where it is obvious.
We could put the array_dealloc string into the static string data to avoid that error handling/NULL path, but it is also OK as it is (avoiding the NULL was there for older PyPy versions which is irrelevant.)
| fa->_buffer_info = NULL; | ||
| } | ||
|
|
||
| if (fa->weakreflist != NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should only be done for real dealloc (there is a guard against calling it when ref count != 0 in the code).
| " Required call to PyArray_ResolveWritebackIfCopy or " | ||
| "PyArray_DiscardWritebackIfCopy is missing."; | ||
| /* | ||
| * prevent reaching 0 twice and thus recursing into dealloc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand at all why this was there -- if there is a piece of code that tries to defref self again, it is a bug, and should be exposed. Given that tests pass without it, it doesn't seem to be necessary (any more?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is probably never needed, but please don't remove it. If we remove it, I would rather just make the whole pass a full error, since the warning has been around forever (yes, it's not a raised error unfortunately, as it it will be printed).
The problem this protects against is if PyArray_ResolveWritebackIfCopy calls Py_INCREC(self) for any reason at all, things would blow up.
And while it may be that there is no path that will do an incref, it seems opaque enough to follow that leaking a global reference here is immaterial.
(Yeah, I am basically saying that I don't want to audit/promise that PyArray_ResolveWritebackIfCopy won't cause an INCREF.)
| } | ||
|
|
||
| if ((fa->flags & NPY_ARRAY_OWNDATA) && fa->data) { | ||
| else if ((fa->flags & NPY_ARRAY_OWNDATA) && fa->data) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
else if since one cannot own data and have a base too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that is correct the update-if-copy path above has both set.
| retval = WARN_IN_DEALLOC(PyExc_RuntimeWarning, msg, unraisable); | ||
| assert(retval == 0); // must be true if unraisable. | ||
| } | ||
| // Guess at malloc/free ??? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not directly relevant here, but this seems pretty insane... I guess for deallocation one has little choice...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not all that insane, it allows downstream C extensions to create NumPy arrays in C that own their data but use an existing allocation.
It might be nice if those created a little capsule that owned the data instead, but that is a pretty big ABI break and honestly, I don't think this path is that bad.
| } | ||
|
|
||
| /* | ||
| * Reassigning fa->descr messes with the reallocation strategy, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay, this all becomes irrelevant!
| "impossible dimension while unpickling array"); | ||
| return NULL; | ||
| } | ||
| if (dimensions[i] == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unlike in ctors, here the code never replaced 0 by 1, so there is no need to separately track whether the result is empty: multiplication by 0 will just ensure that.
| Py_DECREF(rawdata); | ||
| rawdata = tmp; | ||
| if (tmp == NULL) { | ||
| Py_SETREF(rawdata, PyUnicode_AsLatin1String(rawdata)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small simplification
| } | ||
| } | ||
|
|
||
| if ((PyArray_FLAGS(self) & NPY_ARRAY_OWNDATA)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bigger cleanup. Note that this misses cleaning up _buffer_info, or the case where there is a base. Furthermore, it just clears WRITEBACKIFCOPY, while if that is set, presumably something should happen. (It may well be that none of those can happen, but better safe than sorry...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, none of this is remotely possible. If you call __setstate__ on anything but a freshly created array during pickling, you are in trouble anyway.
| if (num == 0) { | ||
| num = 1; | ||
| } | ||
| /* Store the handler in case the default is modified */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice to avoid this strange dance with mem_handler.
seberg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of comments, although by the time I reached the end of this, I am wondering if we can't just tighten error checks rather than taking into account things like a mutated data handler?
To be clear, I am not sure that this will simplify things quite enough to be worthwhile, but as such I like the idea of __setstate__ being restrictive.
The fact is, we just created that array with _reconstruct, unfortunately, that might have caused a call to __array_finalize__ which can do some crazy things (like reshaping), I don't think it can do many crazy things, such as:
- The data handler must be the default.
- The
OWNDATAflag cannot have been unset.
I suppose technically, it could in theory do buffer exports, but it cannot have writebackifcopy (in a reasonable way).
Although, I would be willing to gamble on the buffer export thing and help that one person who it might hit to deal with it.
| if (_buffer_info_free(fa->_buffer_info, (PyObject *)self) < 0) { | ||
| PyErr_WriteUnraisable(NULL); | ||
| if (fa->_buffer_info) { | ||
| if (_buffer_info_free(fa->_buffer_info, (PyObject *)self) < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it can never be set? Or just introduced later?
Both are true, since __setstate__ only makes sense on a fresh array so a buffer cannot have been exported for it (I honestly don't think __setstate__ is a good choice for our types, but that is a different issue).
| " Required call to PyArray_ResolveWritebackIfCopy or " | ||
| "PyArray_DiscardWritebackIfCopy is missing."; | ||
| /* | ||
| * prevent reaching 0 twice and thus recursing into dealloc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is probably never needed, but please don't remove it. If we remove it, I would rather just make the whole pass a full error, since the warning has been around forever (yes, it's not a raised error unfortunately, as it it will be printed).
The problem this protects against is if PyArray_ResolveWritebackIfCopy calls Py_INCREC(self) for any reason at all, things would blow up.
And while it may be that there is no path that will do an incref, it seems opaque enough to follow that leaking a global reference here is immaterial.
(Yeah, I am basically saying that I don't want to audit/promise that PyArray_ResolveWritebackIfCopy won't cause an INCREF.)
| } | ||
|
|
||
| if ((fa->flags & NPY_ARRAY_OWNDATA) && fa->data) { | ||
| else if ((fa->flags & NPY_ARRAY_OWNDATA) && fa->data) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that is correct the update-if-copy path above has both set.
| retval = WARN_IN_DEALLOC(PyExc_RuntimeWarning, msg, unraisable); | ||
| assert(retval == 0); // must be true if unraisable. | ||
| } | ||
| // Guess at malloc/free ??? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not all that insane, it allows downstream C extensions to create NumPy arrays in C that own their data but use an existing allocation.
It might be nice if those created a little capsule that owned the data instead, but that is a pretty big ABI break and honestly, I don't think this path is that bad.
| if (_buffer_info_free(fa->_buffer_info, (PyObject *)self) < 0) { | ||
| PyErr_WriteUnraisable(NULL); | ||
| if (fa->_buffer_info) { | ||
| if (_buffer_info_free(fa->_buffer_info, (PyObject *)self) < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you re-factor this and remove the unraisable propagation? This function has no reason to handle it, it can just propagate errors normally (it has be careful to be OK with an in-flight error, but it is, I think).
You can handle the PyErr_WriteUnraisable inside dealloc where it is obvious.
We could put the array_dealloc string into the static string data to avoid that error handling/NULL path, but it is also OK as it is (avoiding the NULL was there for older PyPy versions which is irrelevant.)
| * If unraisable, always succeeds. | ||
| */ | ||
| static int | ||
| _dealloc_all_but_self(PyArrayObject *self, npy_bool unraisable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if can find a nicer name "clear" might make sense. But we only use it twice, so doesn't matter much.
(clear is also a bit clearer on the fact that I think it now has to be setup to be OK to call twice due to error handling in __setitem__.)
| /* must match allocation in PyArray_NewFromDescr */ | ||
| npy_free_cache_dim(fa->dimensions, 2 * fa->nd); | ||
| Py_DECREF(fa->descr); | ||
| if (fa->dimensions != NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't npy_free_cache_dims include the != NULL check like typical free functions?
| } | ||
| } | ||
| overflowed = npy_mul_sizes_with_overflow( | ||
| &nbytes, nbytes, PyArray_ITEMSIZE(self)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmm, the order here may be there intentionally for when itemsize is zero, we still need to limit the number of elements in the array to not overflow.
(For that path, even the "replace with 1" logic might be strictly speaking needed; iterations over a subset of axes shouldn't be able to cause overflows. All of this should be possible when unpickling an array created on a 64bit system on a 32bit one.)
| } | ||
| } | ||
|
|
||
| if ((PyArray_FLAGS(self) & NPY_ARRAY_OWNDATA)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, none of this is remotely possible. If you call __setstate__ on anything but a freshly created array during pickling, you are in trouble anyway.
| * Get rid of everything on self, and then populate with pickle data. | ||
| */ | ||
| if (dealloc_all_but_self(self) < 0) { | ||
| return NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, we assume that the only place where __setitem__ is called is unpickling, which means that the only thing that can happen to self on error return is that it is deleted (at this point, the array is completely broken!).
Of course, this can't actually fail in that context, because all of the paths that might fail are actually unreachable.
In preparation for (perhaps) having the array not necessarily allocate separate memory for sizes+strides and data, this simplifies one of the places where
ndarrayinstance hacking takes place, viz.,__setstate__. Here, one has to replace all bits from initialization of a new instance by pickle with the data actually stored in the pickle. Since that replacement requires what most of what deallocation does, with the exception of deallocatingselfitself, this PR splits off that bit inarrayobjectand uses it inmethods.As part of the above, I made a few other simplifications in
__setstate__(first commit).