ENH: StringDType scalar type
#28196
base: main
Conversation
This will help a lot for static-typing! Anyway, I'd be happy to help out with the stubs.
And as for the name, maybe one that follows the existing `DType`/scalar naming pattern; only a handful of current scalar types break it:

```python
>>> n_match = n_total = 0
... for dtype_name in dir(np.dtypes):
...     dtype = getattr(np.dtypes, dtype_name)
...     if not isinstance(dtype, type) or not issubclass(dtype, np.dtype):
...         continue
...
...     sctype = dtype.type
...     sctype_name = sctype.__name__
...
...     sctype_prefix_expect = dtype_name.removesuffix("DType").lower()
...     matches = sctype_name.startswith(sctype_prefix_expect)
...     print("T:" if matches else "F:", dtype_name.ljust(16), sctype_name)
...     n_total += 1
...     n_match += matches
... print(f"{n_match} / {n_total}")
...
T: BoolDType        bool
F: ByteDType        int8
T: BytesDType       bytes_
T: CLongDoubleDType clongdouble
T: Complex128DType  complex128
T: Complex64DType   complex64
T: DateTime64DType  datetime64
T: Float16DType     float16
T: Float32DType     float32
T: Float64DType     float64
T: Int16DType       int16
T: Int32DType       int32
T: Int64DType       int64
T: Int8DType        int8
T: IntDType         int32
F: LongDType        int64
T: LongDoubleDType  longdouble
T: LongLongDType    longlong
T: ObjectDType      object_
F: ShortDType       int16
T: StrDType         str_
F: StringDType      str
T: TimeDelta64DType timedelta64
F: UByteDType       uint8
T: UInt16DType      uint16
T: UInt32DType      uint32
T: UInt64DType      uint64
T: UInt8DType       uint8
T: UIntDType        uint32
F: ULongDType       uint64
T: ULongLongDType   ulonglong
F: UShortDType      uint16
T: VoidDType        void
26 / 33
```

Alternatively, maybe a name derived from the dtype's `name` attribute:

```python
>>> np.dtypes.StringDType().name
'StringDType128'
```

But my assumption here is that this is platform independent, and I'm not 100% certain whether that's actually the case.
No, this is a bad idea IMO. Right now the arena can grow but there isn't a hook to shrink it, so this choice would lead to a memory leak. At least with the existing StringDType design, once you delete the array, the dtype associated with the array will also be deleted (assuming no remaining references owned by a user script). But if it's a global arena there's no way for users to clear it.

What we might consider is that when you create a scalar from an array (e.g. via indexing), the scalar keeps a reference to the array's data rather than copying it. That said, I have no idea if that could cause some annoying memory leaks. I'm not sure people necessarily expect that keeping alive a scalar created from an array leads to the original array remaining alive. You could imagine scenarios where people create a really big numpy array of strings, get a single value, and then delete the array and are very surprised to find that the arena associated with the original array still exists.

My long term goal is to eventually replace the unicode and bytes legacy DTypes (e.g. completely remove them from NumPy) in favor of a fixed-width mode for StringDType. So, for example, if you wanted to create an array of 10-byte wide UTF-8 encoded strings, you could do so with a fixed-width StringDType instance.

I bring all that up because, given all those future plans, the complexities involved in trying to avoid copies while keeping the data wrapped by the scalar non-local to the scalar might just cause headaches in the future. By far the simplest choice is for the string scalar to own a copy of the full encoded string it wraps.
It's not; it's … IMO. That said, …
BTW, I hate to say it, but this change is big enough, and it's a revision of what's in NEP 55, that it might make sense to go through and update that NEP to talk about the scalar. It would also help to guide discussion reviewing the PR to have a formal design document to look through. Ping @mhvk as well.
I didn't expect anyone to jump on this. I need to think about introducing a scalar a bit more (yes, it solves many issues, but if you were to get rid of the default conversions, then this would be great). FWIW, I am pretty sure I would lean towards not making a "numpy scalar", but rather just a unicode subclass (although unlike a plain subclass, operations on it would probably have to return a subclass, so even that is tedious). As far as I can tell, this doesn't even make it a subclass, which seems unfortunate? I suppose this basically just supports what Nathan said: it would be good to have a short write-down of the pros and cons.
How about …?
You mean PyUnicode_Type, right? Just don't want to conflate it with numpy's unicode scalar. Out of curiosity, why not subtype both?
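There is precedent for subtyping both: the existing fixed-width string scalar `np.str_` already inherits from the Python `str` and from the NumPy scalar hierarchy at the same time:

```python
import numpy as np

# np.str_ is simultaneously a Python str and a NumPy scalar ...
assert issubclass(np.str_, str)
assert issubclass(np.str_, np.generic)

s = np.str_("abc")
# ... so plain str methods work on it,
assert s.upper() == "ABC"
# ... and so do NumPy scalar attributes like .dtype.
assert s.dtype.kind == "U"
```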
Thanks for pinging. This looks like a nice contribution, but I agree with @ngoldbaum that it would help to think through exactly what the goals are. There's a part of me that still really wishes numpy had no scalar types at all, i.e., only array scalars, as having both is overall rather confusing, with many of the array methods being present on the scalars but not working as expected. So, it will be good to be really clear about the problem that needs to be solved, and then see what the best solution is.
Yeah, sorry... sometimes it seeps through that I remember Python 2, I guess :).
Suggestion for a goal: the …
You mean a new super-class, I guess? Yes, we definitely need a … But there is a real difficulty, because I would actually like to (in parallel) talk about eventually transitioning to things like … The reason to do that earlier is that it is much nicer to start forcing users to use it if it has been around for a while (plus, it's useful!).
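For reference, the existing abstract scalar hierarchy that a new string scalar would presumably slot into looks like this (checked against NumPy 2.x):

```python
import numpy as np

# The current chain for the fixed-width string scalar:
# np.str_ -> np.character -> np.flexible -> np.generic
assert issubclass(np.str_, np.character)
assert issubclass(np.character, np.flexible)
assert issubclass(np.flexible, np.generic)
```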
Would that also allow users to create custom scalar types themselves in Python? But implementation details aside, I agree that from an OOP perspective, a "pure" abstract class seems like a good idea here.
To be honest, I usually think of numpy scalars as read-only 0-d arrays. And for the most part, that's actually the case, as their Jaccard index is a solid 90%. If you compare the attributes of `np.ndarray` and `np.generic`:

```python
>>> import numpy as np
>>> dir_trivial = set(dir(object))
>>> dir_array = set(dir(np.ndarray)) - dir_trivial
>>> dir_scalar = set(dir(np.generic)) - dir_trivial
>>> len(dir_array & dir_scalar)
112
>>> len(dir_array | dir_scalar)
146
>>> dir_irrelevant = {'__imatmul__', '__len__', '__itruediv__', '__ilshift__', '__class_getitem__', '__iadd__', '__irshift__', 'mT', '__delitem__', '__isub__', '__ior__', '__ipow__', '__ixor__', '__iter__', '__iand__', '__index__', '__imul__', '__rmatmul__', '__matmul__', '__complex__', '__imod__', '__ifloordiv__'}
>>> len((dir_array | dir_scalar) - dir_irrelevant)
124
>>> len(dir_array & dir_scalar) / _
0.9032258064516129
```

Calculating the similarity between … So using something like:

```python
from typing import Self, final

@final
class StringV(generic[str], builtins.str):
    dtype: StringDType
    def __new__(
        cls,
        data: ScalarOrArrayLikeString,  # fictional
        /,
        dtype: DTypeLikeString | Undefined = undefined,  # fictional
    ) -> NDArray[Self]:
        return array(data, dtype=dtype or "T")
```

I'm not saying that this is what should be used or something; my point is that a simple workaround like this could already solve the majority of the current type-unsafety issues, i.e. reduce the possible ways in which you can accidentally shoot yourself in the foot. But of course, it'd be a lot better to have an actual scalar type for 100% type-safety, just like all other dtypes.
Yes, exactly. I can understand that this may not be great for typing. But for typing, custom DTypes are basically impossible anyway, no?
Yes, I meant the identity (I thought `type` …).
I have mixed feelings on this, in part because this notion doesn't work at all for "object" dtype and, as you said, neither for strings (they behave like Python strings, and you can't square being a Python string with being an ndarray).1 So if you aim for a 100% Jaccard index, maybe it would make sense to do what I also mentioned somewhere in the NEPs, I think: have …

(I admit, if we change to 0-D array in means 0-D array out -- which we could in principle -- it will still be annoying to mostly preserve scalar in, scalar out.)
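The "read-only 0-d array" intuition can be made concrete: a scalar mirrors most of the 0-d array interface but rejects mutation, for example:

```python
import numpy as np

x = np.array(1.0)        # genuine 0-d array: mutable in place
x[()] = 2.0
assert x[()] == 2.0

s = np.float64(1.0)      # scalar: same ndim/shape interface, but immutable
assert s.ndim == 0 and s.shape == ()
try:
    s[()] = 2.0          # item assignment is rejected on scalars
except TypeError:
    pass
else:
    raise AssertionError("expected TypeError")
```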
As long as the promotion rules are fixed and the Liskov Substitution Principle isn't violated, it should be no problem 👌🏻
Something similar is already the case:

```python
>>> import numpy as np
>>> np.float64 == np.dtype(np.float64)
True
```

Although this could be implemented using just …
Yea, that's one of the exceptions to this. Another one would be …
A 100% Jaccard index would require it to be a …
I don't really know how that would be implemented. But from a typing perspective, it would indeed have …
Since I work with mypy on a daily basis, and therefore have to wait 10s after each …
That's exactly why in numtype, optype, and scipy-stubs, I consider 0-d arrays to be scalars (in input positions) 👌🏻
Ah... it should have been …
Addresses #28165

Hi @ngoldbaum,

This is a WIP workspace for the `StringDType` scalar feature. Some comments/questions:

- Should there be a single `StringDType` instance for the scalar type? E.g. associated with the scalar type object, if possible. Then all scalars would live in a single arena. In this draft I think that a new dtype instance (and so a new arena) is created for each scalar creation (I call `PyArray_DescrFromType(NPY_VSTRING)` to get a dtype in the scalar allocation).
- I could store a `const char *` directly in the scalar, but I wanted to try out the `StringDType` C API.
- I named the scalar `np.vstr`, but let's consider other options: `np.vstring`, `np.static_string`, `np.stat_str` (this also applies to all function names that I drafted here).
- Mixed operations like `np.array(["a", "b"], dtype=np.vstr) + np.vstr("c")` work out of the box, but I'm not sure if correctly (copyless?) underneath.
- `class TestVStringScalar` shows current usage examples.