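Today I came across this article on Bilibili:
Rust WebAssembly性能的真相 (The Truth About Rust WebAssembly Performance)
The first comment under it sums things up as: "Rust wasm is no longer slower than many JS frameworks. The main performance problem today is not the lack of direct access to the DOM API. It is that JS strings are encoded as UTF-16 while Rust uses UTF-8, and the re-encoding needed on every string conversion is a major source of the performance loss." So if a string has to be converted to UTF-16, or processed as Unicode characters, there really is a cost. But if every character is guaranteed to be ASCII, the string can simply be processed byte by byte.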
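To make the re-encoding cost concrete, here is a minimal sketch of converting a Rust string to UTF-16 code units with an ASCII fast path. The helper name `to_utf16` is hypothetical and only for illustration. It is not part of wasm-bindgen or any other binding layer, and real bindings may move strings across the boundary differently.

```rust
/// Convert a Rust `&str` (UTF-8) into UTF-16 code units.
/// `to_utf16` is a hypothetical helper used only for illustration.
fn to_utf16(s: &str) -> Vec<u16> {
    if s.is_ascii() {
        // ASCII fast path: every byte maps to exactly one UTF-16 code unit,
        // so widening each byte is enough and no decoding is needed.
        s.bytes().map(u16::from).collect()
    } else {
        // General path: decode UTF-8 into code points and re-encode them
        // as UTF-16 (code points above U+FFFF become surrogate pairs).
        s.encode_utf16().collect()
    }
}
```

For example, `to_utf16("abc")` just widens three bytes, while `to_utf16("😀")` has to decode a four-byte sequence and emit a surrogate pair.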
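munpf: Sorry, I was feeling dizzy earlier, didn't read the question carefully, and misunderstood it. I remember an article that raised the issue you describe, though I forget which one. The gist, as far as I recall, was that passing strings between Rust and JS takes a lot of time.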
office-windows11: Take a look at the Rust source code: getting a char out of a string is a fairly involved process.
lib/rustlib/src/rust/library/core/src/str/validations.rs
pub unsafe fn next_code_point<'a, I: Iterator<Item = &'a u8>>(bytes: &mut I) -> Option<u32> {
    // Decode UTF-8
    let x = *bytes.next()?;
    if x < 128 {
        return Some(x as u32);
    }

    // Multibyte case follows
    // Decode from a byte combination out of: [[[x y] z] w]
    // NOTE: Performance is sensitive to the exact formulation here
    let init = utf8_first_byte(x, 2);
    // SAFETY: `bytes` produces an UTF-8-like string,
    // so the iterator must produce a value here.
    let y = unsafe { *bytes.next().unwrap_unchecked() };
    let mut ch = utf8_acc_cont_byte(init, y);
    if x >= 0xE0 {
        // [[x y z] w] case
        // 5th bit in 0xE0 .. 0xEF is always clear, so `init` is still valid
        // SAFETY: `bytes` produces an UTF-8-like string,
        // so the iterator must produce a value here.
        let z = unsafe { *bytes.next().unwrap_unchecked() };
        let y_z = utf8_acc_cont_byte((y & CONT_MASK) as u32, z);
        ch = init << 12 | y_z;
        if x >= 0xF0 {
            // [x y z w] case
            // use only the lower 3 bits of `init`
            // SAFETY: `bytes` produces an UTF-8-like string,
            // so the iterator must produce a value here.
            let w = unsafe { *bytes.next().unwrap_unchecked() };
            ch = (init & 7) << 18 | utf8_acc_cont_byte(y_z, w);
        }
    }

    Some(ch)
}
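The quoted function relies on private helpers (`utf8_first_byte`, `utf8_acc_cont_byte`, `CONT_MASK`) that are not shown in the post. The following is a self-contained sketch of the same decoding steps in safe Rust. The helper bodies are written here as assumptions about what those private items do, and `next_code_point_checked` and `decode_all` are hypothetical names that do not exist in the standard library.

```rust
/// Mask selecting the six payload bits of a UTF-8 continuation byte
/// (assumed to match the private constant of the same name).
const CONT_MASK: u8 = 0b0011_1111;

/// Payload bits of the first byte of a multibyte sequence (assumed definition).
fn utf8_first_byte(byte: u8, width: u32) -> u32 {
    (byte & (0x7F >> width)) as u32
}

/// Shift the accumulator left by six bits and append the payload of a
/// continuation byte (assumed definition).
fn utf8_acc_cont_byte(ch: u32, byte: u8) -> u32 {
    (ch << 6) | (byte & CONT_MASK) as u32
}

/// Safe variant of the quoted function: returns `None` at end of input or
/// when a sequence is cut short. Like the original, it does not check that
/// the bytes are well-formed UTF-8.
fn next_code_point_checked<'a, I: Iterator<Item = &'a u8>>(bytes: &mut I) -> Option<u32> {
    let x = *bytes.next()?;
    if x < 128 {
        // ASCII: the byte is the code point.
        return Some(x as u32);
    }
    let init = utf8_first_byte(x, 2);
    let y = *bytes.next()?;
    let mut ch = utf8_acc_cont_byte(init, y);
    if x >= 0xE0 {
        // Three-byte sequence.
        let z = *bytes.next()?;
        let y_z = utf8_acc_cont_byte((y & CONT_MASK) as u32, z);
        ch = init << 12 | y_z;
        if x >= 0xF0 {
            // Four-byte sequence: only the low three bits of `init` are payload.
            let w = *bytes.next()?;
            ch = (init & 7) << 18 | utf8_acc_cont_byte(y_z, w);
        }
    }
    Some(ch)
}

/// Decode every code point of a string with the sketch above.
fn decode_all(s: &str) -> Vec<u32> {
    let mut it = s.as_bytes().iter();
    let mut out = Vec::new();
    while let Some(cp) = next_code_point_checked(&mut it) {
        out.push(cp);
    }
    out
}
```

The ASCII case returns after a single comparison. The extra work only happens for bytes of 128 and above, which is why ASCII-only text is cheap to handle.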
Strings in C#, Java, and JavaScript are arrays of UTF-16 code units,
while strings in newer languages such as Rust, Go, and Swift are UTF-8 byte arrays.
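As a small illustration of how the two encodings differ in size, this sketch counts the UTF-8 bytes, UTF-16 code units, and Unicode code points of the same string. `unit_counts` is a hypothetical helper name. The three methods it calls are standard `str` methods.

```rust
/// Returns (UTF-8 bytes, UTF-16 code units, Unicode code points) for `s`.
/// `unit_counts` is a hypothetical helper used only for illustration.
fn unit_counts(s: &str) -> (usize, usize, usize) {
    (
        s.len(),                  // length in UTF-8 bytes
        s.encode_utf16().count(), // length in UTF-16 code units
        s.chars().count(),        // number of code points
    )
}
```

For ASCII text all three numbers are equal. For a character such as "中" the UTF-8 form takes three bytes while the UTF-16 form takes one code unit.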
munpf: Just iterate over the bytes.
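A minimal sketch of what iterating over the bytes can look like, next to the `char`-based equivalent. Both function names are hypothetical. The byte version is safe for ASCII needles even when the text contains non-ASCII characters, because every byte of a multibyte UTF-8 sequence is 0x80 or above and so can never equal an ASCII byte.

```rust
/// Count occurrences of an ASCII byte by scanning the raw UTF-8 bytes.
/// No decoding takes place. `count_ascii_byte` is a hypothetical helper.
fn count_ascii_byte(s: &str, needle: u8) -> usize {
    debug_assert!(needle.is_ascii());
    s.as_bytes().iter().filter(|&&b| b == needle).count()
}

/// The same count done per `char`, which decodes each UTF-8 sequence first.
/// `count_char` is a hypothetical helper.
fn count_char(s: &str, needle: char) -> usize {
    s.chars().filter(|&c| c == needle).count()
}
```

Both `count_ascii_byte("a,b,中,😀", b',')` and `count_char("a,b,中,😀", ',')` count the same three commas, but only the second one goes through the code point decoder.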