area-vm
Use area-vm for VM related issues, including code coverage, FFI, and the AOT and JIT backends.
A lower priority bug or feature request
triaged
Issue has been triaged by sub team
type-performance
Issue relates to performance or code size
Currently the Dart VM represents strings internally as either one-byte (Latin-1) or two-byte (WTF-16) arrays of characters. The choice of WTF-16 stems from Dart's original design goal of being Web compatible: native JavaScript strings are WTF-16. This choice, however, comes with additional overhead when operating in predominantly UTF-8 based environments, because passing strings between Dart and such an environment requires conversions at the boundary.
I would like us to evaluate the possibility of switching our string representation to be purely UTF-8 based, which would unlock the possibility of passing strings directly across the boundary between Dart and native code.
This issue serves as a starting point for discussion and a place to track this investigation. It does not present a concrete design; it only tries to capture various points of interest that the final design should cover.
Ideally we would like to address the following sources of overhead:
- Creating a `String` from a `Uint8Array` or `Int8Array` could simply create a view into this typed array (after validating that the contents are valid UTF-8).
- Creating a `String` from a `Pointer<Int8>` representing a native UTF-8 encoded string could simply create an external `String` pointing to these UTF-8 bytes (after validating that the contents are valid UTF-8).
- Writing a `String` into an output which expects UTF-8 bytes could be a simple copy of bytes.
- Passing a `String` to native code could either be a simple copy of bytes or direct access to the underlying bytes in leaf FFI calls.
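For reference, the round trip that these points aim to avoid can be sketched with `dart:convert` alone. This is a minimal illustration, not a measurement: the helper name `roundTrip` is hypothetical, and the comments describe the copies implied by the current representation as described above.

```dart
import 'dart:convert';
import 'dart:typed_data';

/// Hypothetical helper: decodes UTF-8 bytes into a String and encodes
/// the String back into UTF-8 bytes.
List<int> roundTrip(Uint8List incoming) {
  // Validates the bytes and builds a new String object in the VM's
  // internal (one-byte or two-byte) representation.
  final String s = utf8.decode(incoming);
  // Transcodes the String again and allocates a fresh byte list.
  return utf8.encode(s);
}

void main() {
  // "hi é" encoded as UTF-8.
  final incoming = Uint8List.fromList([0x68, 0x69, 0x20, 0xC3, 0xA9]);
  final outgoing = roundTrip(incoming);
  // Same contents, but a different object: the bytes were copied.
  print(identical(incoming, outgoing));
  print(outgoing);
}
```

With a UTF-8 based representation, both steps could in principle reduce to validation plus a view or a plain byte copy.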
Dealing with APIs which expose WTF-16 encoding
Many `String` APIs operate in terms of indices into the WTF-16 representation of the string and return WTF-16 encodings. They will continue to work just fine as long as the underlying UTF-8 string is Latin-1; however, we would need to figure out how to support strings which contain multibyte encodings. We could either transcode the UTF-8 string into WTF-16 on the fly when the user does something that requires the WTF-16 representation, or we could adopt an approach similar to Swift, which allocates a breadcrumb data structure on the side to facilitate WTF-16 based access.
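To make the breadcrumb idea concrete, here is a rough sketch of a side table that maps WTF-16 indices to UTF-8 byte offsets. It is not Swift's actual implementation or a proposed VM design: the class `Utf16Breadcrumbs`, its `stride` parameter, and its methods are hypothetical names, and the code assumes the bytes are already validated UTF-8.

```dart
import 'dart:convert';
import 'dart:typed_data';

/// Hypothetical side table: remembers the byte offset of a scalar roughly
/// every [stride] UTF-16 code units, so index lookups scan a short span.
class Utf16Breadcrumbs {
  final Uint8List bytes; // assumed to be valid UTF-8
  final int stride;
  final List<int> _crumbUtf16 = [];
  final List<int> _crumbByte = [];
  late final int length; // length in UTF-16 code units

  Utf16Breadcrumbs(this.bytes, {this.stride = 4}) {
    var utf16Index = 0, offset = 0, nextCrumb = 0;
    while (offset < bytes.length) {
      if (utf16Index >= nextCrumb) {
        _crumbUtf16.add(utf16Index);
        _crumbByte.add(offset);
        nextCrumb = utf16Index + stride;
      }
      final len = _sequenceLength(bytes[offset]);
      utf16Index += len == 4 ? 2 : 1; // 4-byte sequences need a surrogate pair
      offset += len;
    }
    length = utf16Index;
  }

  /// Returns the UTF-16 code unit at [index], like `String.codeUnitAt`.
  int codeUnitAt(int index) {
    if (index < 0 || index >= length) {
      throw RangeError.range(index, 0, length - 1, 'index');
    }
    // Binary search for the last crumb at or before [index].
    var lo = 0, hi = _crumbUtf16.length - 1;
    while (lo < hi) {
      final mid = (lo + hi + 1) >> 1;
      if (_crumbUtf16[mid] <= index) {
        lo = mid;
      } else {
        hi = mid - 1;
      }
    }
    var utf16Index = _crumbUtf16[lo];
    var offset = _crumbByte[lo];
    while (true) {
      final len = _sequenceLength(bytes[offset]);
      final units = len == 4 ? 2 : 1;
      if (index < utf16Index + units) {
        final scalar = _decode(offset, len);
        if (units == 1) return scalar;
        final v = scalar - 0x10000;
        return index == utf16Index ? 0xD800 + (v >> 10) : 0xDC00 + (v & 0x3FF);
      }
      utf16Index += units;
      offset += len;
    }
  }

  static int _sequenceLength(int lead) =>
      lead < 0x80 ? 1 : (lead < 0xE0 ? 2 : (lead < 0xF0 ? 3 : 4));

  int _decode(int offset, int len) {
    if (len == 1) return bytes[offset];
    var scalar = bytes[offset] & (0xFF >> (len + 1));
    for (var i = 1; i < len; i++) {
      scalar = (scalar << 6) | (bytes[offset + i] & 0x3F);
    }
    return scalar;
  }
}

void main() {
  const s = 'a€😀b';
  final crumbs = Utf16Breadcrumbs(Uint8List.fromList(utf8.encode(s)), stride: 2);
  for (var i = 0; i < s.length; i++) {
    print(crumbs.codeUnitAt(i) == s.codeUnitAt(i));
  }
}
```

The stride trades memory for lookup cost: a smaller stride means more crumbs but shorter forward scans. A real design would also need to decide when the table is built (eagerly or on first indexed access).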
RegExp
We are using a port of the irregexp engine. This engine was originally developed for V8 and is optimised for strings with a fixed character width. We would either need a replacement which can operate on UTF-8 strings directly, or we would need to inflate non-Latin1 UTF-8 strings into WTF-16 and run RegExp on that.
Unifying `String` and `TypedData`
It would be interesting to explore the possibility of unifying the `String` and `TypedData` representations, as a `String` is essentially a sequence of bytes with some additional semantics. Currently, converting between these types requires copying bytes, but it would be interesting to consider a design where a `String` could be a view into `TypedData` and vice versa.
There are some concerns about the danger of accidental mutability here, but the same concerns already apply to external strings, so we could probably ignore them. The recently added unmodifiable typed data types allow us to do this in a safer way, though restricting backing store sharing.
Zero termination
For the purposes of interoperability, we should consider enforcing zero-termination of strings when we allocate the storage ourselves, because this would allow us to pass string storage directly to native code that expects zero-terminated strings.
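As a rough illustration of the layout this implies, the sketch below allocates one extra zero byte after the UTF-8 payload and exposes the content as a view that excludes the terminator. The helper name `allocateZeroTerminated` is hypothetical, and ordinary typed data stands in for VM-internal string storage.

```dart
import 'dart:convert';
import 'dart:typed_data';

/// Hypothetical helper: UTF-8 payload followed by a single 0 byte.
Uint8List allocateZeroTerminated(String s) {
  final encoded = utf8.encode(s);
  // Uint8List is zero-initialized, so the last byte is the terminator.
  final storage = Uint8List(encoded.length + 1);
  storage.setRange(0, encoded.length, encoded);
  return storage;
}

void main() {
  final storage = allocateZeroTerminated('héllo');
  // The logical string content is a view that excludes the terminator.
  final content = Uint8List.sublistView(storage, 0, storage.length - 1);
  print(storage.last);
  print(utf8.decode(content));
}
```

One caveat such a design would have to address: a Dart string may itself contain U+0000, which encodes as a 0 byte, so native code that relies on the terminator would see such a string as truncated.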
Externalization
It is worth considering whether we should factor the possibility of string externalization into the implementation. We used to have a capability to externalise individual strings by copying their bytes out of the VM heap and mutating the `String` object. Unfortunately, the implementation was very error prone because it violated a fundamental invariant: objects can't change their class/layout after they have been allocated.
We could consider whether some less intrusive form of externalization could be incorporated into the design of the new string representation. Potentially, we could allocate all string payloads larger than a certain size on special pages, where they could be pinned, instead of using normal heap allocation. Support for pinned string payloads would allow passing them to native code without any copying.
String encodings and different native environments
While UTF-8 is essentially becoming the de facto standard for most environments, UTF-16/WTF-16 is still used in some environments we care about:
- `NSString` is UTF-16 encoded.
- Windows APIs use UTF-16.
- Java strings are WTF-16.
- JavaScript strings are WTF-16, which is relevant for both JavaScript and Wasm (if `dart2wasm` starts using native JS strings via `stringref`).
/cc @mkustermann @a-siva @rmacnak-google @alexmarkov @lrhn @aam
You might want to expose access to the UTF-8 bytes as well, so that some operations can work directly on them without having to go through the WTF-16 API. For example, `package:characters` may want to be able to iterate directly on the UTF-8.
But I guess "converting" the string to an unmodifiable UTF-8 byte array should be sufficient.
The WTF-8 problem also means that `s1 + s2` is non-trivial. It has to check for a trailing lead surrogate and a leading tail surrogate, and if it finds those, re-encode those two WTF-8 sequences into a new one.
(And if you split a UTF-8 encoded Dart string at a UTF-16 boundary inside a surrogate pair, you also have to re-encode both sides. Good times!)
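The concatenation case can be sketched as follows. This is only an illustration of the check described above, assuming both inputs are well-formed WTF-8 in which a lone surrogate is stored as a three-byte sequence starting with `0xED`; the function name `concatWtf8` is hypothetical.

```dart
import 'dart:convert';
import 'dart:typed_data';

/// Hypothetical helper: concatenates two WTF-8 byte sequences, merging a
/// trailing lead surrogate in [a] with a leading tail surrogate in [b].
Uint8List concatWtf8(Uint8List a, Uint8List b) {
  final n = a.length;
  // Lead surrogates U+D800..U+DBFF encode as ED A0..AF xx.
  final endsWithLead = n >= 3 && a[n - 3] == 0xED && (a[n - 2] & 0xF0) == 0xA0;
  // Tail surrogates U+DC00..U+DFFF encode as ED B0..BF xx.
  final startsWithTail =
      b.length >= 3 && b[0] == 0xED && (b[1] & 0xF0) == 0xB0;
  if (!(endsWithLead && startsWithTail)) {
    // Common case: plain byte concatenation.
    return Uint8List(n + b.length)
      ..setRange(0, n, a)
      ..setRange(n, n + b.length, b);
  }
  // Decode both surrogates and combine them into one code point.
  final hi = 0xD000 | ((a[n - 2] & 0x3F) << 6) | (a[n - 1] & 0x3F);
  final lo = 0xD000 | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F);
  final cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
  // Two 3-byte sequences become one 4-byte sequence: 2 bytes shorter.
  final out = Uint8List(n + b.length - 2);
  out.setRange(0, n - 3, a);
  out[n - 3] = 0xF0 | (cp >> 18);
  out[n - 2] = 0x80 | ((cp >> 12) & 0x3F);
  out[n - 1] = 0x80 | ((cp >> 6) & 0x3F);
  out[n] = 0x80 | (cp & 0x3F);
  out.setRange(n + 1, out.length, b, 3);
  return out;
}

void main() {
  // 'a' followed by the lone lead surrogate U+D83D.
  final s1 = Uint8List.fromList([0x61, 0xED, 0xA0, 0xBD]);
  // The lone tail surrogate U+DE00 followed by 'b'.
  final s2 = Uint8List.fromList([0xED, 0xB8, 0x80, 0x62]);
  final joined = concatWtf8(s1, s2);
  print(joined);
  print(utf8.decode(joined) == 'a\u{1F600}b');
}
```

The split case is the mirror image: cutting inside a four-byte sequence would require emitting a three-byte lead surrogate on one side and a three-byte tail surrogate on the other.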