area-vm triaged type-performance

Currently the Dart VM represents strings internally as either one-byte (Latin-1) or two-byte (WTF-16) arrays of characters. The choice of WTF-16 stems from Dart's original design goal of being Web compatible: native JavaScript strings are WTF-16. This choice, however, comes with additional overhead when operating in predominantly UTF-8 based environments, because passing strings between Dart and such an environment requires conversions at the boundary.

I would like us to evaluate the possibility of switching our string representation to be purely UTF-8 based, which would unlock the possibility of passing strings directly across the boundary between Dart and native code.

This issue serves as a starting point for discussion and a place to track this investigation. It does not present a concrete design; it just tries to capture various points of interest that the final design should cover.

Ideally we would like to address the following sources of overhead:

- Creating a `String` from a `Uint8Array`/`Int8Array` could simply create a view into this typed array (after validating that the contents are valid UTF-8).
- Creating a `String` from a `Pointer<Int8>` representing a native UTF-8 encoded string could simply create an external `String` pointing to these UTF-8 bytes (after validating that the contents are valid UTF-8).
- Writing a `String` into an output which expects UTF-8 bytes could be a simple copying of bytes.
- Passing a `String` to native code could either be a simple copying of bytes or direct access to the underlying bytes in leaf FFI calls.

Dealing with APIs which expose WTF-16 encoding

Many String APIs operate in terms of indices into the WTF-16 representation of the string and return WTF-16 encodings. They will continue to work just fine as long as the underlying UTF-8 string is Latin-1; however, we would need to figure out how to support strings which contain multibyte encodings. We could either transcode the UTF-8 string into WTF-16 on the fly when the user does something that requires a WTF-16 representation, or we could adopt an approach similar to Swift's, which allocates a breadcrumb data structure on the side to facilitate WTF-16 based access.
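To make the breadcrumb idea concrete, here is a minimal sketch of a side table that maps WTF-16 indices to UTF-8 byte offsets. It is written in Python purely for illustration and is not Dart VM or Swift code: the class name `BreadcrumbString`, the method `code_unit_at`, and the stride value are all hypothetical, and the sketch assumes well-formed UTF-8 (no lone surrogates).

```python
def _seq_len(lead_byte):
    """Length in bytes of the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:
        return 1
    if lead_byte < 0xE0:
        return 2
    if lead_byte < 0xF0:
        return 3
    return 4


class BreadcrumbString:
    """UTF-8 payload plus a sparse table for UTF-16 style indexing (hypothetical)."""

    def __init__(self, text, stride=4):
        # A real implementation would pick a much larger stride; 4 keeps the demo small.
        self.data = text.encode("utf-8")
        self.stride = stride
        # crumbs[k] = (byte offset, UTF-16 index) of the scalar value that
        # contains UTF-16 index k * stride.
        self.crumbs = []
        byte_pos = 0
        u16_pos = 0
        while byte_pos < len(self.data):
            n = _seq_len(self.data[byte_pos])
            units = 2 if n == 4 else 1  # 4-byte sequences become a surrogate pair
            while len(self.crumbs) * stride < u16_pos + units:
                self.crumbs.append((byte_pos, u16_pos))
            byte_pos += n
            u16_pos += units
        self.length = u16_pos  # length in UTF-16 code units

    def code_unit_at(self, index):
        """Return the UTF-16 code unit at `index`, scanning at most ~stride units."""
        if not 0 <= index < self.length:
            raise IndexError(index)
        byte_pos, u16_pos = self.crumbs[index // self.stride]
        while True:
            n = _seq_len(self.data[byte_pos])
            units = 2 if n == 4 else 1
            if index < u16_pos + units:
                break
            byte_pos += n
            u16_pos += units
        cp = ord(self.data[byte_pos:byte_pos + n].decode("utf-8"))
        if units == 1:
            return cp
        cp -= 0x10000
        if index == u16_pos:
            return 0xD800 + (cp >> 10)   # lead surrogate half
        return 0xDC00 + (cp & 0x3FF)     # trail surrogate half


s = BreadcrumbString("a\u00e9\u20ac\U0001F600b")
print(s.length)                 # 6 UTF-16 code units
print(hex(s.code_unit_at(3)))   # 0xd83d, the lead surrogate of U+1F600
```

The table costs memory proportional to length divided by stride, and only strings that actually receive index-based access would need to allocate it.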

RegExp

We are using a port of the irregexp engine. This engine was originally developed for V8 and is optimised for fixed character width strings. We would either need a replacement which can operate on UTF-8 strings directly, or we would need to inflate non-Latin-1 UTF-8 strings into WTF-16 and run RegExp on that.

Unifying `String` and `TypedData`

It would be interesting to explore the possibility of unifying the `String` and `TypedData` representations, as `String` is essentially a sequence of bytes with some additional semantics. Currently converting between these types requires copying bytes, but it would be interesting to consider a design where `String` could be a view into `TypedData` and vice versa.

There are some concerns about dangers of accidental mutability here - but the same concerns already apply to external strings so we could probably ignore them. Recently added unmodifiable typed data types allow us to do it in a safer way, though restricting backing store sharing

Zero termination

For interoperability purposes we should consider enforcing zero-termination of strings when we allocate storage ourselves, because this would allow us to pass string storage directly to native code that expects zero-terminated strings.

Externalization

It is worth considering if we should factor the possibility of string externalization into the implementation. We used to have a capability to externalise individual strings by copying their bytes out of the VM heap and mutating the String object. The implementation unfortunately was very error prone because it violated a fundamental invariant that objects can't change their class/layout after they were allocated.

We could consider whether some less intrusive form of externalization could be incorporated in the design of the new string representation. Potentially we could allocate all string payloads larger than a certain size on special pages, where they could be pinned, instead of using normal heap allocation. Having support for pinned string payloads would allow passing them to native code without any copying.

String encodings and different native environments

While UTF-8 is essentially becoming the de facto standard for most environments, UTF-16/WTF-16 is still used in some environments we care about:

- `NSString` is UTF-16 encoded.
- Windows APIs are using UTF-16.
- Java strings are WTF-16.
- JavaScript strings are WTF-16, which is relevant for both JavaScript and Wasm (if `dart2wasm` starts using native JS strings via `stringref`).

/cc @mkustermann @a-siva @rmacnak-google @alexmarkov @lrhn @aam

You might want to expose access to the UTF-8 bytes as well, so that some operations can work directly on them without having to go through the WTF-16 API. For example package:characters may want to be able to iterate directly on the UTF-8.

But I guess "converting" the string to an unmodifiable UTF-8 byte array should be sufficient.

The WTF-8 problem also means that `s1 + s2` is non-trivial. It has to check for a trailing lead surrogate and a leading tail surrogate, and if it finds those, re-encode those two WTF-8 sequences into a new one.
(And if you split a UTF-8 encoded Dart string at a UTF-16 boundary inside a surrogate pair, you also have to re-encode both sides. Good times!)
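The concatenation and splitting cases can be sketched at the byte level as follows. This is an illustrative Python sketch under the assumption that both inputs are well-formed WTF-8, where an unpaired surrogate is stored as a three-byte sequence starting with `0xED`. The function names `wtf8_concat` and `wtf8_split_pair` are hypothetical and do not correspond to any existing Dart or VM API.

```python
def _is_lead_surrogate(seq):
    """True if seq is the 3-byte WTF-8 encoding of U+D800..U+DBFF."""
    return (len(seq) == 3 and seq[0] == 0xED
            and 0xA0 <= seq[1] <= 0xAF and 0x80 <= seq[2] <= 0xBF)


def _is_trail_surrogate(seq):
    """True if seq is the 3-byte WTF-8 encoding of U+DC00..U+DFFF."""
    return (len(seq) == 3 and seq[0] == 0xED
            and 0xB0 <= seq[1] <= 0xBF and 0x80 <= seq[2] <= 0xBF)


def _decode3(seq):
    """Decode a 3-byte sequence into its code point."""
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)


def wtf8_concat(left, right):
    """Concatenate two WTF-8 byte strings, merging a surrogate pair at the seam."""
    tail, head = left[-3:], right[:3]
    if _is_lead_surrogate(tail) and _is_trail_surrogate(head):
        hi, lo = _decode3(tail), _decode3(head)
        cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
        # Replace the two 3-byte surrogates with one 4-byte sequence.
        return left[:-3] + chr(cp).encode("utf-8") + right[3:]
    return left + right


def wtf8_split_pair(seq4):
    """Split one 4-byte sequence into (lead, trail) 3-byte surrogate encodings."""
    if len(seq4) != 4:
        raise ValueError("expected a 4-byte UTF-8 sequence")
    cp = ord(seq4.decode("utf-8")) - 0x10000
    hi = 0xD800 + (cp >> 10)
    lo = 0xDC00 + (cp & 0x3FF)
    # "surrogatepass" lets Python emit the generalized 3-byte form for a surrogate.
    return (chr(hi).encode("utf-8", "surrogatepass"),
            chr(lo).encode("utf-8", "surrogatepass"))


left = "a\ud83d".encode("utf-8", "surrogatepass")   # ends with a lead surrogate
right = "\ude00b".encode("utf-8", "surrogatepass")  # starts with a trail surrogate
joined = wtf8_concat(left, right)
print(joined == "a\U0001F600b".encode("utf-8"))     # True: the seam was re-encoded
```

In this sketch the common case costs only a check of the three bytes on either side of the seam, but the result is one byte shorter than the sum of the input lengths whenever a pair is merged, so the output size cannot be computed from the lengths alone.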
