I am interested to know how fast some of my models run on the CPUs of a Pixel 3 phone. I am a moderately experienced pytorch programmer and linux user, but I have zero experience with android. I am not looking to build an app right now; I just want to know how fast my model runs on this particular phone.

The TensorFlow repo has this barebones android test thingy for timing the latency of a neural net of your choice on an android phone in TensorFlow: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/benchmark/android/README.md

Has anyone made anything similar for pytorch?

We have a binary to do this that can run on your android phone using adb.

To build,

./scripts/build_android.sh \
-DBUILD_BINARY=ON \
-DBUILD_CAFFE2_MOBILE=OFF \
-DCMAKE_PREFIX_PATH=$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())') \
-DPYTHON_EXECUTABLE=$(python -c 'import sys; print(sys.executable)')

To run the binary, push it to the device using adb and run the following command:
./speed_benchmark_torch --model=model.pt --input_dims="1,3,224,224" --input_type=float --warmup=10 --iter 10 --report_pep true

It should be possible; we already output the total network latency in a format that FAI-PEP accepts.
The flow should be similar to the existing caffe2 mobile flow, but using the speed_benchmark_torch binary instead.

Thanks! Is there a certain NDK version that is preferred? I know in TensorFlow, they like using old NDK versions for some reason.

Also, do we need the Android SDK to be visible to PyTorch anywhere?

The instructions for speed_benchmark_torch worked for me on the first try!

If anyone else wants to try this on a Pixel 3 android phone, here is the setup that worked for me:

# in bash shell
cd pytorch  # where I have my `git clone` of pytorch
export ANDROID_ABI=arm64-v8a
export ANDROID_NDK=/path/to/Android/Sdk/ndk/21.0.6113669/
./scripts/build_android.sh \
-DBUILD_BINARY=ON \
-DBUILD_CAFFE2_MOBILE=OFF \
-DCMAKE_PREFIX_PATH=$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())') \
-DPYTHON_EXECUTABLE=$(python -c 'import sys; print(sys.executable)')
# speed_benchmark_torch appears in pytorch/build_android/install/bin/speed_benchmark_torch

Next, I followed these instructions to export a resnet18 torchscript model:

# in python
import torch
import torchvision
model = torchvision.models.resnet18(pretrained=True)
model.eval()
example = torch.rand(1, 3, 224, 224)
traced_script_module = torch.jit.trace(model, example)
traced_script_module.save("resnet18.pt")
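
Optionally, before pushing the file to the phone, it's worth a quick sanity check that the traced module matches the eager model on the example input. A minimal check, continuing from the snippet above (the 1e-4 tolerance is just an arbitrary choice):

# in python
with torch.no_grad():
    eager_out = model(example)
    traced_out = traced_script_module(example)
max_diff = (eager_out - traced_out).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
assert max_diff < 1e-4, "traced model diverges from eager model"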

Then, I put the files onto the android device

# in bash shell on linux host computer that's plugged into Pixel 3 phone
adb shell mkdir /data/local/tmp/pt
adb push build_android/install/bin/speed_benchmark_torch /data/local/tmp/pt
adb push resnet18.pt /data/local/tmp/pt

And finally, I run it on the android device:

# in bash shell on linux host computer that's plugged into Pixel 3 phone
adb shell  /data/local/tmp/pt/speed_benchmark_torch \
--model  /data/local/tmp/pt/resnet18.pt --input_dims="1,3,224,224" \
--input_type=float --warmup=5 --iter 20

It prints:

Starting benchmark.
Running warmup runs.
Main runs.
Main run finished. Milliseconds per iter: 188.382. Iters per second: 5.30836

Pretty good! I believe resnet18 is about 4 gflop (that is, 2 gmac) per frame, so (4 gflop) / (188 ms) ≈ 21 gflop/s. Not bad for ARM CPUs! (At least I assume it’s executing on the ARM CPUs and not any GPUs or other accelerators.)
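
For anyone who wants to redo that back-of-the-envelope arithmetic, here is a tiny sketch; note the 4-gflop-per-frame figure is my rough estimate, not something the benchmark reports:

# in python
flops_per_frame = 4e9      # rough estimate for resnet18 at 224x224 (assumption)
ms_per_iter = 188.382      # "Milliseconds per iter" reported by speed_benchmark_torch
gflops_per_sec = flops_per_frame / (ms_per_iter / 1000) / 1e9
print(f"{gflops_per_sec:.1f} gflop/s")  # ~21.2 gflop/s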

Also, this whole process took me about 25 minutes, and everything worked on the first try. I use pytorch day-to-day, but I have very little experience with android, and this was also my first time using torchscript, so I’m surprised and impressed that it was so straightforward.

This thread is very useful and I’m trying to get this working. I can’t get past the build_android.sh step without hitting a bunch of errors. You can view my CMakeError.log here. Does anyone know what’s going on? Alternatively, if someone could link me their speed_benchmark_torch executable, that might also work.

@solvingPuzzles I’m on Ubuntu 18.04. I actually just got it working. I had to run
git submodule update --init --recursive
within the pytorch clone, and then run the

./scripts/build_android.sh \
-DBUILD_BINARY=ON \
-DBUILD_CAFFE2_MOBILE=OFF \
-DCMAKE_PREFIX_PATH=$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())') \
-DPYTHON_EXECUTABLE=$(python -c 'import sys; print(sys.executable)')

command with sudo -E. I also had to run the python commands with sudo so it could actually write the .pt file.

If anyone else runs into this error

abort_message: assertion "terminating with uncaught exception of type c10::Error: PytorchStreamReader failed locating file bytecode.pkl: file not found ()

just save the model using _save_for_lite_interpreter like this:
traced_script_module._save_for_lite_interpreter("resnet18.pt")
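
For reference, a self-contained version of that export looks roughly like this; the optimize_for_mobile pass is optional and just my assumption of a typical mobile export, not something the fix above strictly requires:

# in python
import torch
import torchvision
from torch.utils.mobile_optimizer import optimize_for_mobile

model = torchvision.models.resnet18(pretrained=True)
model.eval()
example = torch.rand(1, 3, 224, 224)
traced_script_module = torch.jit.trace(model, example)
# optional: apply mobile-oriented graph optimizations before saving
optimized = optimize_for_mobile(traced_script_module)
# writes the bytecode format the lite interpreter expects,
# which avoids the bytecode.pkl error quoted above
optimized._save_for_lite_interpreter("resnet18.pt")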

It worked for me ^^

Hi, I am trying to run a quantized model trained via QAT with the qconfig set to qnnpack, but this seems to give an error.

terminating with uncaught exception of type c10::NotImplementedError: Could not run 'quantized::linear' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'quantized::linear' is only available for these backends: [QuantizedCPU, BackendSelect, Functionalize, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta].

But if I run a quantized model without the qnnpack setting, it runs fine. How is this happening, given that mobile chips are arm64, which should require the qnnpack configuration?
p.s.: using a Pixel 7 Pro device with a Tensor G2 processor

This error means that the input tensor to the op is somehow coming from unquantized ops. You will need to provide a repro.
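
For comparison, a minimal QAT-to-mobile flow that keeps the inputs to the quantized ops quantized looks roughly like this; the module and shapes are made up for illustration, and the key parts are the QuantStub/DeQuantStub pair plus the qnnpack qconfig and engine:

# in python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub/DeQuantStub mark where tensors enter/leave the quantized domain,
        # so quantized::linear never sees a plain float CPU tensor
        self.quant = torch.ao.quantization.QuantStub()
        self.fc = nn.Linear(16, 4)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)   # float -> quantized
        x = self.fc(x)      # becomes quantized::linear after convert()
        return self.dequant(x)

torch.backends.quantized.engine = "qnnpack"  # use the qnnpack kernels (arm64)
model = TinyNet()
model.train()
model.qconfig = torch.ao.quantization.get_default_qat_qconfig("qnnpack")
prepared = torch.ao.quantization.prepare_qat(model)

# ... QAT fine-tuning loop would go here ...

prepared.eval()
quantized = torch.ao.quantization.convert(prepared)

example = torch.rand(1, 16)
traced = torch.jit.trace(quantized, example)
traced._save_for_lite_interpreter("tiny_qat_qnnpack.pt")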

cc @jerryzh168