Labels: area-vm (Use area-vm for VM related issues, including code coverage, FFI, and the AOT and JIT backends); a lower priority bug or feature request; triaged (Issue has been triaged by sub team); type-performance (Issue relates to performance or code size)

Currently the Dart VM represents strings internally as either one-byte (Latin-1) or two-byte (WTF-16) arrays of characters. The choice of WTF-16 stems from Dart's original design goal of being Web compatible: native JavaScript strings are WTF-16. This choice, however, adds overhead when operating in predominantly UTF-8 based environments, because passing strings between Dart and such an environment requires conversions at the boundary.

I would like us to evaluate the possibility of switching our string representation to be purely UTF-8 based, which would unlock the possibility of directly passing them across the boundary between the Dart and native code.

This issue serves as a starting point for discussion and a place to track this investigation. It does not present a concrete design; it only tries to capture various points of interest that the final design should cover.

Ideally we would like to address the following sources of overhead:

  • Creating a String from a Uint8Array or Int8Array could simply create a view into this typed array (after validating the contents are valid UTF-8).
  • Creating a String from a Pointer<Int8> representing a native UTF-8 encoded string could simply create an external String pointing to these UTF-8 bytes (after validating that the contents are valid UTF-8).
  • Writing a String into an output which expects UTF-8 bytes could be a simple byte copy.
  • Passing a String to native code could either be a simple byte copy or direct access to the underlying bytes in leaf FFI calls.
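
The items above all rely on validating bytes once and then sharing them instead of copying. A minimal Python sketch of that idea follows; `Utf8View` is a hypothetical name used only for illustration, not a Dart VM type, and the validation step here still makes a temporary copy that a real implementation would avoid.

```python
class Utf8View:
    """Hypothetical string-like view over bytes known to be valid UTF-8."""

    def __init__(self, data):
        view = memoryview(data)
        # Validation pass: a strict decode raises UnicodeDecodeError on
        # malformed input. The temporary copy exists only in this sketch.
        bytes(view).decode("utf-8")
        # Keep the view, not a copy: it shares storage with `data`.
        self._view = view

    def utf8_bytes(self):
        """Expose the underlying bytes without copying them."""
        return self._view


source = bytearray("naïve".encode("utf-8"))
s = Utf8View(source)
```

After construction, writing the string to a UTF-8 sink only needs the bytes returned by `utf8_bytes()`; no transcoding step is involved.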
    Dealing with APIs which expose WTF-16 encoding

    Many String APIs operate in terms of indices into the WTF-16 representation of the string and return WTF-16 code units. They will continue to work just fine as long as the underlying UTF-8 string is ASCII-only (every character is a single byte); however, we would need to figure out how to support strings which contain multibyte encodings. We could either transcode the UTF-8 string into WTF-16 on the fly when the user does something that requires the WTF-16 representation, or adopt an approach similar to Swift, which allocates a breadcrumb data structure on the side to facilitate WTF-16 based access.
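
To make the breadcrumb idea concrete, here is a rough Python sketch of a side table that maps WTF-16 code unit offsets to UTF-8 byte offsets. It is only in the spirit of the approach mentioned above and is not Swift's actual implementation; the `Breadcrumbs` class, its `stride` parameter, and the helper `_scalar_len` are hypothetical, and the input is assumed to be well-formed UTF-8.

```python
import bisect


def _scalar_len(lead: int) -> int:
    """Byte length of the UTF-8 sequence that starts with byte `lead`."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4


class Breadcrumbs:
    """Hypothetical side table: UTF-16 code unit offset -> UTF-8 byte offset."""

    def __init__(self, utf8: bytes, stride: int = 32):
        self.utf8 = utf8
        self.u16_offsets = []   # UTF-16 offset recorded at each breadcrumb
        self.byte_offsets = []  # matching UTF-8 byte offset
        u16 = byte = 0
        next_crumb = 0
        while byte < len(utf8):
            if u16 >= next_crumb:
                self.u16_offsets.append(u16)
                self.byte_offsets.append(byte)
                next_crumb += stride
            n = _scalar_len(utf8[byte])
            # 4-byte UTF-8 sequences become a surrogate pair in UTF-16.
            u16 += 2 if n == 4 else 1
            byte += n
        self.length_u16 = u16

    def byte_offset(self, u16_index: int) -> int:
        """Byte offset of the character containing code unit `u16_index`."""
        if not 0 <= u16_index < self.length_u16:
            raise IndexError(u16_index)
        # Jump to the nearest breadcrumb at or before the index...
        k = bisect.bisect_right(self.u16_offsets, u16_index) - 1
        u16, byte = self.u16_offsets[k], self.byte_offsets[k]
        # ...then scan forward over at most roughly `stride` code units.
        while True:
            n = _scalar_len(self.utf8[byte])
            width = 2 if n == 4 else 1
            if u16 + width > u16_index:
                return byte
            u16 += width
            byte += n


crumbs = Breadcrumbs("aé€😀b".encode("utf-8"), stride=2)
offsets = [crumbs.byte_offset(i) for i in range(crumbs.length_u16)]
```

The stride trades memory for lookup time: a larger stride stores fewer breadcrumbs but scans more bytes per indexed access. Both code units of a surrogate pair resolve to the same byte offset, so an API like `codeUnitAt` would still need to synthesise the lead or trail surrogate from the 4-byte sequence.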

    RegExp

    We are using a port of the irregexp engine. This engine was originally developed for V8 and is optimised for fixed character width strings. We would need either a replacement that can operate on UTF-8 strings directly, or to inflate non-Latin-1 UTF-8 strings into WTF-16 and run RegExp on that.
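
The inflation option can be sketched in Python as below. The function names are hypothetical, and the sketch assumes well-formed UTF-8 input (lone surrogates, which Dart strings can contain, are not handled here). It picks a fixed-width subject for a fixed-width engine: one byte per character when everything fits in Latin-1, two-byte code units otherwise.

```python
def is_latin1_utf8(utf8: bytes) -> bool:
    """True if valid UTF-8 `utf8` only encodes code points <= U+00FF."""
    # In valid UTF-8, lead bytes 0xC2 and 0xC3 cover U+0080..U+00FF and
    # continuation bytes are below 0xC0, so any byte >= 0xC4 signals a
    # code point above Latin-1.
    return all(b < 0xC4 for b in utf8)


def subject_for_regexp(utf8: bytes):
    """Return (kind, fixed-width bytes) for a fixed-width regexp engine."""
    text = utf8.decode("utf-8")
    if is_latin1_utf8(utf8):
        return "one-byte", text.encode("latin-1")
    return "two-byte", text.encode("utf-16-le")


kind_a, subject_a = subject_for_regexp("café".encode("utf-8"))
kind_b, subject_b = subject_for_regexp("€".encode("utf-8"))
```

The Latin-1 check is a single pass over the bytes, so the cost of this option is dominated by the inflation itself, which would be paid on every match unless the inflated form is cached.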

    Unifying String and TypedData

    It would be interesting to explore the possibility of unifying the String and TypedData representations, as a String is essentially a sequence of bytes with some additional semantics. Currently converting between these types requires copying bytes, but it would be interesting to consider a design where a String could be a view into TypedData and vice versa.

    There are some concerns about the danger of accidental mutability here, but the same concerns already apply to external strings, so we could probably ignore them. The recently added unmodifiable typed data types allow us to do this in a safer way, though restricting backing store sharing.
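
The mutability concern can be illustrated with a small Python analogy (Python 3.8+ for `memoryview.toreadonly()`). This is not Dart code and the variable names are hypothetical; it only shows that a read-only view blocks writes through the view itself while the owner of the backing store can still change, or even corrupt, what the "string" contains.

```python
def is_valid_utf8(view) -> bool:
    """Check whether the bytes currently visible through `view` are UTF-8."""
    try:
        bytes(view).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


backing = bytearray("héllo".encode("utf-8"))

# Hypothetical "string as a view": shares storage with `backing`, no copy.
string_view = memoryview(backing).toreadonly()

# Writes through the read-only view are rejected...
try:
    string_view[0] = ord("j")
    write_rejected = False
except TypeError:
    write_rejected = True

# ...but the owner of the backing store can still change the contents,
backing[0] = ord("j")
observed = bytes(string_view).decode("utf-8")

# and can break UTF-8 validity by overwriting part of a multi-byte sequence.
backing[1] = 0x20
valid_after_corruption = is_valid_utf8(string_view)
```

Any design that shares a backing store therefore has to decide whether validity is checked once at creation or must hold for the lifetime of the string.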

    Zero termination

    For the purposes of interoperability we should consider enforcing zero-termination of strings when we allocate the storage ourselves, because this would allow us to pass string storage directly to native code that expects zero-terminated strings.
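
A small Python sketch using `ctypes` shows why the extra terminator byte matters to a C-style consumer. The `allocate_payload` helper is hypothetical; `ctypes.create_string_buffer` and `ctypes.string_at` stand in for the allocator and the native reader respectively.

```python
import ctypes


def allocate_payload(text: str):
    """Hypothetical allocator: UTF-8 bytes plus one trailing NUL byte."""
    data = text.encode("utf-8")
    # create_string_buffer(bytes) allocates len(bytes) + 1 bytes, so the
    # payload always ends with a NUL terminator.
    return ctypes.create_string_buffer(data)


payload = allocate_payload("héllo")

# A C-style consumer only receives an address and scans for the terminator;
# string_at(address) without a size does exactly that.
seen_by_native = ctypes.string_at(ctypes.addressof(payload))

# Caveat of this sketch: a string containing U+0000 looks truncated to
# such a consumer, because the scan stops at the first zero byte.
with_nul = allocate_payload("a\x00b")
truncated = ctypes.string_at(ctypes.addressof(with_nul))
```

The cost is one extra byte per allocated payload, in exchange for handing the storage to native code without building a terminated copy first.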

    Externalization

    It is worth considering whether we should factor the possibility of string externalization into the implementation. We used to have the capability to externalise individual strings by copying their bytes out of the VM heap and mutating the String object. The implementation was unfortunately very error-prone because it violated a fundamental invariant: objects can't change their class/layout after they have been allocated.

    We could consider whether some less intrusive form of externalization could be incorporated into the design of the new string representation. Potentially we could allocate all string payloads larger than a certain size on special pages, where they could be pinned, instead of using normal heap allocation. Support for pinned string payloads would allow passing them to native code without any copying.

    String encodings and different native environments

    While UTF-8 is essentially becoming the de facto standard for most environments, UTF-16/WTF-16 is still used in some environments we care about:

  • NSString is UTF-16 encoded.
  • Windows APIs use UTF-16.
  • Java strings are WTF-16.
  • JavaScript strings are WTF-16, which is relevant for both JavaScript and Wasm (if dart2wasm starts using native JS strings via stringref).

    /cc @mkustermann @a-siva @rmacnak-google @alexmarkov @lrhn @aam

    You might want to expose access to the UTF-8 bytes as well, so that some operations can work directly on them without having to go through the WTF-16 API. For example, package:characters may want to be able to iterate directly over the UTF-8 bytes.

    But I guess "converting" the string to an unmodifiable UTF-8 byte array should be sufficient.

    The WTF-8 problem also means that s1 + s2 is non-trivial. It has to check for a trailing lead surrogate and a leading tail surrogate, and if it finds those, re-encode those two WTF-8 sequences into a new one.
    (And if you split a UTF-8 encoded Dart string at a UTF-16 boundary inside a surrogate pair, you also have to re-encode both sides. Good times!)
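
The concatenation case can be sketched in Python as follows. The function names are hypothetical and both inputs are assumed to be well-formed WTF-8; lone surrogates are produced here with Python's `surrogatepass` error handler, which encodes them as the 3-byte sequences `ED A0..AF xx` (lead) and `ED B0..BF xx` (tail).

```python
def _decode3(seq: bytes) -> int:
    """Decode one 3-byte UTF-8/WTF-8 sequence into its code point."""
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)


def wtf8_concat(a: bytes, b: bytes) -> bytes:
    """Concatenate two WTF-8 strings, merging a split surrogate pair."""
    if len(a) >= 3 and len(b) >= 3:
        tail, head = a[-3:], b[:3]
        ends_with_lead = tail[0] == 0xED and 0xA0 <= tail[1] <= 0xAF
        starts_with_tail = head[0] == 0xED and 0xB0 <= head[1] <= 0xBF
        if ends_with_lead and starts_with_tail:
            hi, lo = _decode3(tail), _decode3(head)
            code_point = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
            # Replace the two 3-byte sequences with one 4-byte sequence.
            return a[:-3] + chr(code_point).encode("utf-8") + b[3:]
    # Common case: plain byte concatenation.
    return a + b


lead = "x\ud83d".encode("utf-8", "surrogatepass")
trail = "\ude00y".encode("utf-8", "surrogatepass")
joined = wtf8_concat(lead, trail)
```

The check only inspects three bytes on each side, so the common case stays a plain byte copy; the re-encoding path shortens the result by two bytes compared with naive concatenation.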
