
Conversation

@mtsokol (Member) commented Jan 20, 2025

Addresses #28165

Hi @ngoldbaum,

This is a WIP workspace for the StringDType scalar feature. Some comments/questions:

  • Does it make sense to have a single, global StringDType instance for the scalar type? E.g. associated with the scalar type object, if possible. Then all scalars would live in a single arena. In this draft I think a new dtype instance (and so a new arena) is created for each scalar creation (I call PyArray_DescrFromType(NPY_VSTRING) to get a dtype during scalar allocation).
  • I remember you mentioned that we can just have const char * directly in the scalar but I wanted to try out StringDType C API.
  • I named the scalar np.vstr but let's consider other options: np.vstring, np.static_string, np.stat_str (this also applies to all function names that I drafted here).
  • I see that ufuncs, e.g. np.array(["a", "b"], dtype=np.vstr) + np.vstr("c"), work out of the box, but I'm not sure whether they behave correctly (copylessly?) underneath.
  • class TestVStringScalar shows current usage examples (see the sketch just below).
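
For illustration, roughly what the drafted usage looks like (np.vstr is this draft's working name, and the exact reprs are my assumption, not confirmed output):

>>> import numpy as np
>>> s = np.vstr("c")
>>> isinstance(s, np.vstr)
True
>>> np.array(["a", "b"], dtype=np.vstr) + s
array(['ac', 'bc'], dtype=StringDType())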

@mtsokol self-assigned this Jan 20, 2025
@mtsokol force-pushed the stringdtype-scalar branch from ed2c9c0 to 5beaac4 (January 20, 2025 15:37)
@jorenham (Member)

This will help a lot with static typing! Currently _: dtype[generic] = StringDType() is rejected by type-checkers (i.e. builtins.str isn't assignable to generic), and this will fix that 👌🏻.
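
A minimal reproduction of that rejection, assuming the current stubs (where StringDType is declared as a dtype over builtins.str):

import numpy as np
from numpy.dtypes import StringDType

# builtins.str is not a subtype of np.generic, so type-checkers flag this
# assignment today (it runs fine at runtime; the annotation is the problem):
d: np.dtype[np.generic] = StringDType()  # error: incompatible assignment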

Anyway, I'd be happy to help out with the stubs for vstr & co.; so let me know if you need any :)

@jorenham (Member) commented Jan 20, 2025

And as for the name, maybe string or string_? It's a bit closer to StringDType. It's also more in line with the str_/StrDType and bool/BoolDType naming scheme, which currently holds for 26/33 of the dtypes:

>>> n_match = n_total = 0
... for dtype_name in dir(np.dtypes):
...     dtype = getattr(np.dtypes, dtype_name)
...     if not isinstance(dtype, type) or not issubclass(dtype, np.dtype):
...         continue
...     
...     sctype = dtype.type
...     sctype_name = sctype.__name__
...     
...     sctype_prefix_expect = dtype_name.removesuffix("DType").lower()
...     matches = sctype_name.startswith(sctype_prefix_expect)
...     print("T:" if matches else "F:", dtype_name.ljust(16), sctype_name)
...     n_total += 1
...     n_match += matches
... print(f"{n_match} / {n_total}")
... 
T: BoolDType        bool
F: ByteDType        int8
T: BytesDType       bytes_
T: CLongDoubleDType clongdouble
T: Complex128DType  complex128
T: Complex64DType   complex64
T: DateTime64DType  datetime64
T: Float16DType     float16
T: Float32DType     float32
T: Float64DType     float64
T: Int16DType       int16
T: Int32DType       int32
T: Int64DType       int64
T: Int8DType        int8
T: IntDType         int32
F: LongDType        int64
T: LongDoubleDType  longdouble
T: LongLongDType    longlong
T: ObjectDType      object_
F: ShortDType       int16
T: StrDType         str_
F: StringDType      str
T: TimeDelta64DType timedelta64
F: UByteDType       uint8
T: UInt16DType      uint16
T: UInt32DType      uint32
T: UInt64DType      uint64
T: UInt8DType       uint8
T: UIntDType        uint32
F: ULongDType       uint64
T: ULongLongDType   ulonglong
F: UShortDType      uint16
T: VoidDType        void
26 / 33

Alternatively, maybe string128 is also worth considering? Because

>>> np.dtypes.StringDType().name
'StringDType128'

But my assumption here is that this is platform independent, and I'm not 100% certain whether that's actually the case.

@ngoldbaum (Member)

Does it make sense to have a single, global StringDType instance for the scalar type? E.g. associated with the scalar type object, if possible. Then all scalars would live in a single arena. In this draft I think a new dtype instance (and so a new arena) is created for each scalar creation (I call PyArray_DescrFromType(NPY_VSTRING) to get a dtype during scalar allocation).

No, this is a bad idea IMO. Right now the arena can grow but there isn't a hook to shrink it, so this choice would lead to a memory leak. At least with the existing StringDType design, once you delete the array, the dtype associated with the array will also be deleted (assuming no remaining references owned by a user script). But if it's a global arena, there's no way for users to clear it.

What we might consider is that when you create a scalar from e.g. __getitem__ in an array, the scalar retains a strong reference to the DType instance and the scalar wraps a packed string.

That said, I have no idea if that could cause some annoying memory leaks. I'm not sure people necessarily expect that keeping alive a scalar created from an array keeps the original array's storage alive. You could imagine scenarios where people create a really big numpy array of strings, get a single value, delete the array, and then are very surprised to find that the arena associated with the original array still exists.
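
To make that scenario concrete, a sketch under the hypothetical scalar-holds-a-dtype-reference design (this is not current NumPy behavior, and many_strings is a placeholder):

big = np.array(many_strings, dtype=np.dtypes.StringDType())  # large arena
s = big[0]   # scalar keeps a strong reference to the dtype instance
del big      # user expects the memory to be reclaimed...
# ...but the arena stays alive for as long as `s` does.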

My long-term goal is to eventually replace the unicode and bytes legacy DTypes (i.e. completely remove them from NumPy) in favor of a fixed-width mode for StringDType. So, for example, if you wanted to create an array of 10-byte wide UTF-8 encoded strings, you could do np.empty(shape, dtype=T[10; UTF8]) or something along those lines. T[UCS4] would then be identical to the old unicode legacy dtype, so users could straightforwardly migrate; in a major version change we could make StringDType the default DType for bytes and strings, eventually completely removing the old legacy DTypes (or making them deprecated aliases for the new fixed-width mode).

I bring all that up because, given those future plans, the complexity involved in trying to avoid copies by keeping the data wrapped by the scalar non-local to the scalar might just cause headaches in the future.

By far the simplest choice is for the string scalar to own a copy of the full encoded string it wraps.

But my assumption here is that this is platform independent, and I'm not 100% certain whether that's actually the case.

It's not: it's StringDType64 on a 32-bit build. Also, IMO we shouldn't be wedded to this numeric suffix, especially if we decide to change the on-disk representation of packed strings.

IMO np.str_ should go away in the long run (UCS-4 strings are a terrible default) and we may eventually want np.str or np.str_ for this string scalar in NumPy 3.0.

That said, np.string is OK in a vacuum but it's a little unfortunate that np.strings already exists and those names are so similar. I'd probably slightly lean towards np.vstr or np.vstring for now.

@ngoldbaum (Member)

BTW, I hate to say it, but this change is big enough, and enough of a revision of what's in NEP 55, that it might make sense to go through and update that NEP to talk about the scalar. It would also help guide the review discussion on this PR to have a formal design document to look through.

Ping @mhvk as well.

@seberg (Member) commented Jan 20, 2025

I didn't expect anyone to jump on this. I need to think about introducing a scalar a bit more (yes, it solves many issues, but if you were to get rid of the default conversions, then this would be great).

FWIW, I am pretty sure I would lean towards not making a "numpy scalar", but rather just a unicode subclass (although operations on it would probably have to return the subclass, so even that is tedious). As far as I can tell, this draft doesn't even make it a subclass, which seems unfortunate?
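
(For reference, the tedium with a plain str subclass: operations fall back to builtins.str unless every method is overridden.)

>>> class MyStr(str): ...
...
>>> type(MyStr("a") + MyStr("b"))
<class 'str'>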

I suppose this basically just supports what Nathan said: It would be good to have a short write-down of the pros and cons.

@jorenham (Member)

That said, np.string is OK in a vacuum but it's a little unfortunate that np.strings already exists and those names are so similar. I'd probably slightly lean towards np.vstr or np.vstring for now.

How about stringv then? That way it'd still be compatible with the naming scheme I described in #28196 (comment)

@ngoldbaum (Member)

but rather just a unicode subclass

You mean PyUnicode_Type, right? Just don't want to conflate with numpy's unicode scalar.

Out of curiosity, why not subtype both np.generic and str?
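
(np.str_ is precedent here; it already subtypes both:)

>>> np.str_.__mro__
(<class 'numpy.str_'>, <class 'str'>, <class 'numpy.character'>,
 <class 'numpy.flexible'>, <class 'numpy.generic'>, <class 'object'>)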

@mhvk (Contributor) commented Jan 20, 2025

Thanks for pinging. This looks like a nice contribution, but I agree with @ngoldbaum that it would help to think through exactly what the goals are.

There's a part of me that still really wishes numpy had no scalar types at all, i.e., only array scalars, as having both is overall rather confusing, with many of the array methods being present on the scalars but not working as expected. So, it will be good to be really clear about the problem that needs to be solved, and then see what the best solution is.

@seberg (Member) commented Jan 20, 2025

You mean PyUnicode_Type, right?

Yeah, sorry... sometimes it seeps through that I remember Python2, I guess :).

@jorenham linked an issue Jan 28, 2025 that may be closed by this pull request
@jorenham (Member) commented Mar 19, 2025

Suggestion for a goal:

The dtype.type (read-only) attribute is documented and annotated as type[numpy.generic]. The only exception to this rule is dtypes.StringDType, whose type attribute is type[builtins.str]. This violates the Liskov Substitution Principle (LSP) (wiki), and is therefore type-unsafe, making it a likely source of bugs. By introducing a new scalar type that is a subclass of both numpy.generic and str, StringDType can be made to conform to the LSP in a backwards-compatible manner.
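
The exception is easy to see in a REPL:

>>> import numpy as np
>>> np.dtype(np.float64).type
<class 'numpy.float64'>
>>> np.dtypes.StringDType().type
<class 'str'>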

@seberg (Member) commented Mar 19, 2025

You mean a new super-class, I guess? Yes, we definitely need a numpy.abstract_scalar ABC class (because numpy.generic isn't an ABC but pretends to be one).

But there is a remaining difficulty: type(np.dtype(dtype)) == type(dtype) won't be true for StringDType right now. That needs to be dealt with (or lived with for now; I am not sure I believe returning a custom type won't create as many annoyances as it solves).


I would actually like to (in parallel) talk about eventually transitioning here, things like:

with np.dtypes.default_string("T"):
    ...

with np.dtypes.default_string(like=arr):
    ...

new_str = np.dtypes.default_string()
np.array(["mystr"], dtype=new_str)  # or maybe force the with...

The reason to do that earlier is that it is much nicer to start pushing users towards it once it has been around for a while (plus, it's useful!). After it has been around for a while, you might be able to do a FutureWarning on it.
(To be clear, I would see this happening in maybe 2-3 years, likely together with calling it a 3.0 release, although maybe this time with less breakage.)

@jorenham (Member)

You mean a new super-class, I guess?

Hmm, that's not what I had in mind actually; I was thinking of a class StringV(str, np.generic): ..., similar to np.str_.

Yes, we definitely need a numpy.abstract_scalar ABC class (because numpy.generic isn't an ABC but pretends to be one).

Would that also allow users to create custom scalar types themselves in Python?
Personally, I'm not a big fan of the abc stuff, mostly because of the type-unsafe register classmethod, but also because of the metaclass magic, which can break your code in unexpected ways. But as an alternative to abc.ABC, there's also the classic raise NotImplementedError pattern that does the trick 🤷🏻.
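
(A minimal sketch of that classic pattern, using the hypothetical abstract_scalar name from above:)

class abstract_scalar:
    # Pure abstract base: refuses direct instantiation, no metaclass involved.
    def __new__(cls, *args, **kwargs):
        if cls is abstract_scalar:
            raise NotImplementedError("instantiate a concrete scalar type instead")
        return super().__new__(cls)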

But implementation details aside, I agree that from an OOP perspective, a "pure" abstract class seems like a good idea here.

But there is a remaining difficulty: type(np.dtype(dtype)) == type(dtype) won't be true for StringDType right now. That needs to be dealt with (or lived with for now; I am not sure I believe returning a custom type won't create as many annoyances as it solves).

To be honest, I think that type(np.dtype(dtype)) == type(dtype) should be true only if type(np.dtype(dtype)) is type(dtype). I know it's convenient and widely used, so we probably won't be able to change it. But I think it blurs the lines between dtypes and scalar-types more than needed.


I usually think of numpy scalars as read-only 0-d arrays. And for the most part, that's actually the case, as their Jaccard Index is a solid 90%.

If you compare the dir() of ndarray and generic, you'll find that they have 112 overlapping non-trivial members, and 146 unique members combined, giving a Jaccard index of 112/146 ≈ 0.77. If you also discount the __i{op}__ methods (they are rudimentary from a typing perspective) and exclude ndarray methods like __len__ and __matmul__ that raise when 0-d, you end up with a Jaccard index of 112/124 ≈ 0.90.

>>> import numpy as np
>>> dir_trivial = set(dir(object))
>>> dir_array = set(dir(np.ndarray)) - dir_trivial
>>> dir_scalar = set(dir(np.generic)) - dir_trivial
>>> len(dir_array & dir_scalar)
112
>>> len(dir_array | dir_scalar)
146
>>> dir_irrelevant = {'__imatmul__', '__len__', '__itruediv__', '__ilshift__', '__class_getitem__', '__iadd__', '__irshift__', 'mT', '__delitem__', '__isub__', '__ior__', '__ipow__', '__ixor__', '__iter__', '__iand__', '__index__', '__imul__', '__rmatmul__', '__matmul__', '__complex__', '__imod__', '__ifloordiv__'}
>>> len((dir_array | dir_scalar) - dir_irrelevant)
124
>>> len(dir_array & dir_scalar) / _
0.9032258064516129

Calculating the similarity between np.generic and builtins.str in the same way gives a Jaccard index of ~0.0463, i.e. ~4.63% similar.
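
(Continuing the session above, that comparison is; the exact value will vary by Python/NumPy version, with ~0.0463 being the figure quoted here:)

>>> dir_str = set(dir(str)) - dir_trivial
>>> len(dir_scalar & dir_str) / len(dir_scalar | dir_str)
0.046...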

So using builtins.str as the scalar type is significantly more type-unsafe than using a 0-d array of StringDType would be. Here's a trivial implementation of such a string scalar-type:

# NOTE: this is a typing sketch, not runnable code; the names marked
# "fictional" below (and the generic[str] subscript) don't exist today.
import builtins
from typing import Self, final

from numpy import array, generic
from numpy.dtypes import StringDType
from numpy.typing import NDArray

@final
class StringV(generic[str], builtins.str):
    dtype: StringDType

    def __new__(
        cls,
        data: ScalarOrArrayLikeString,  # fictional
        /,
        dtype: DTypeLikeString | Undefined = undefined,  # fictional
    ) -> NDArray[Self]:
        # Construction simply defers to a 0-d array with the
        # variable-width string dtype ("T").
        return array(data, dtype=dtype or "T")
I'm not saying that this is what should be used or something; my point is that a simple workaround like this could already solve the majority of the current type-unsafety issues, i.e. reduce the possible ways in which you can accidentally shoot yourself in the foot.

But of course, it'd be a lot better to have an actual scalar type for 100% type-safety, just like all other dtypes.

@seberg (Member) commented Mar 22, 2025

Would that also allow users to create custom scalar types themselves in Python?

Yes, exactly. I can understand that this may not be great for typing. But for typing, custom DTypes are basically impossible anyway, no?

To be honest, I think that type(np.dtype(dtype)) == type(dtype) should be true only if type(np.dtype(dtype)) is type(dtype)

Yes, I meant the identity (I thought == on types is the same as is, but I guess crazy metaclassing could override that).

I usually think of numpy scalars as read-only 0-d arrays. And for the most part, that's actually the case, as their Jaccard Index is a solid 90%.

I have mixed feelings on this, in part because this notion doesn't work at all for "object" dtype and, as you said, neither for strings (they behave like Python strings, and you can't square being a Python string with being an ndarray).¹

So if you aim for a 100% Jaccard index, maybe it would make sense to do what I also mentioned somewhere in the NEPs, I think: have StringDType tell NumPy to always return 0-D arrays, except for arr.item()?
Of course, if you really want scalars to exist just for speed, then that is a different angle...

(I admit, if we change to "0-D array in means 0-D array out", which we could in principle, it will still be annoying to mostly preserve "scalar in, scalar out".)

Footnotes

  1. Historically, I think scalars just came from "make 0-D arrays faster" (I am not sure), although their immutability and choice for object/strings don't really make sense in that regard.
    So they are chimera-monsters... Neither quite 0-D arrays, nor quite scalars. But why does it bother us so much if their Jaccard index isn't 100%?
    The problem IMO isn't that scalars are inconvenient, they are maybe mildly inconvenient for implementation. It is mainly that NumPy converts 0-D arrays to scalars everywhere... so scalars sneak up on you like ninjas in places where any sane logic (== any sane typing stub) would expect an array.

@jorenham (Member) commented Mar 22, 2025

Yes, exactly. I can understand that this may not be great for typing. But for typing, custom DTypes are basically impossible anyway, no?

As long as the promotion rules are fixed and the Liskov Substitution Principle isn't violated, it should be no problem 👌🏻

Yes, I meant the identity (I thought == on types is the same as is, but I guess crazy metaclassing could override that).

Something similar is already the case:

>>> import numpy as np
>>> np.float64 == np.dtype(np.float64)
True

Although this could be implemented using just dtype.__eq__, so without crazy metaclasses.
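
A sketch of how __eq__ alone could produce that behavior (illustrative only, not NumPy's actual implementation):

import numpy as np

class _DTypeSketch:
    def __init__(self, scalar_type):
        self.type = scalar_type

    def __eq__(self, other):
        # Coerce the other operand (scalar type, string, ...) to a dtype first.
        # np.float64 == _DTypeSketch(np.float64) then evaluates to True via
        # the reflected __eq__, with no metaclass involved.
        try:
            other = np.dtype(other)
        except TypeError:
            return NotImplemented
        return self.type is other.type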

I have mixed feelings on this, in part because this notion doesn't work at all for "object" dtype and, as you said, neither for strings (they behave like Python strings, and you can't square being a Python string with being an ndarray).

Yea, that's one of the exceptions to this. Another one would be np.void, which is the only scalar-type that isn't always 0-d (even though the stubs don't reflect this yet).

So if you aim for a 100% Jaccard index, maybe it would make sense to do what I also mentioned somewhere in the NEPs, I think:

A 100% Jaccard index would require it to be an np.generic subtype. But maybe it can be both an np.generic and an np.ndarray?

Have StringDType tell NumPy to always return 0-D arrays, except for arr.item()?

I don't really know how that would be implemented. But from a typing perspective, it would indeed have StringDType.type: type[StringV] and StringV().item: () -> str.
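
(In stub form, roughly; all of these names and the subclassing are fictional:)

import numpy as np

class StringV(str, np.generic):
    def item(self) -> str: ...

class StringDType(np.dtype[StringV]):
    @property
    def type(self) -> type[StringV]: ...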

Of course, if you really want scalars to exist just for speed, then that is a different angle...

Since I work with mypy on a daily basis, and therefore have to wait 10s after each ctrl+s for my IDE to unfreeze, I doubt I'll even notice the difference 😅

The problem IMO isn't that scalars are inconvenient, they are maybe mildly inconvenient for implementation. It is mainly that NumPy converts 0-D arrays to scalars everywhere... so scalars sneak up on you like ninjas in places where any sane logic (== any sane typing stub) would expect an array.

That's exactly why in numtype, optype, and scipy-stubs, I consider 0-d arrays to be scalars (in input positions) 👌🏻

@seberg (Member) commented Mar 22, 2025

Ah... it should have been type(np.dtype(dtype.type)) is type(np.dtype) (removing one level of type makes sense, but is not quite guaranteed)


Successfully merging this pull request may close these issues.

ENH: add a StringDType scalar type that wraps a UTF-8 string
