If you’ve ever tried to debug a PyTorch program on an ARM64 system using Valgrind, you might have stumbled on something really odd: “Why does it take so long?”. And if you’re like us, you would probably try to run it locally, on a Raspberry Pi, to see what’s going on… And the madness begins!

TL;DR, as you probably figured out from the title of this post, it’s a counter-intuitive experience: the more cores your machine has, the slower your (torch) code seems to run under Valgrind. Shouldn’t more cores mean more speed? Let’s dive into why that’s not always the case ;)

The background Link to heading

In an effort to improve our testing infrastructure for vAccel and make it more robust, we started cleaning up our examples, unifying the build & test scripts, and adding more elaborate test cases for both the library and the plugins. Valgrind provides quite a decent experience for this, especially for catching multi-arch errors, memory leaks and dangling pointers (something quite common when writing in C :D).

The issue Link to heading

While adding the Valgrind mode of execution in our tests for the vAccel plugins, we noticed something really weird in the Torch case. The test was taking forever!

Figure 1: Build & Test run on amd64

Specifically, while the equivalent amd64 run was taking roughly 4 and a half minutes (Figure 1), the arm64 run was taking nearly an hour (53 minutes) – see Figure 2.

Figure 2: Why is it taking sooo long?

Debugging Link to heading

The first thing that came to mind was that there was something wrong with our infrastructure. We run self-hosted GitHub runners, with custom container images that support the relevant software components we need for each plugin/case. We run those on our infra, a set of VMs running on top of diverse low-end bare-metal machines, both amd64 and arm64. The arm64 runners run on a couple of Jetson AGX Orins, with 8 cores and 32GB of RAM.

And what’s the first thing to try (especially when debugging on arm64)? A Raspberry Pi, of course!

So getting the runner container image on a Raspberry Pi 5, with 8GB of RAM, spinning up the container, and building the library and the plugin all took roughly 10 minutes. And we were ready for the test:

# ninja run-examples-valgrind -C build-container
ninja: Entering directory `build-container'
[0/1] Running external command run-examples-valgrind (wrapped by meson to set env)
Arch is 64bit      : true
[snipped]
Running examples with plugin 'libvaccel-torch.so'
+ valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --errors-for-leak-kinds=all --max-stackframe=3150000 --keep-debuginfo=yes --error-exitcode=1 --suppressions=/home/ananos/develop/vaccel-plugin-torch/scripts/common/config/valgrind.supp --main-stacksize=33554432 --max-stackframe=4000000 --suppressions=/home/ananos/develop/vaccel-plugin-torch/scripts/config/valgrind.supp /home/runner/artifacts/bin/torch_inference /home/runner/artifacts/share/vaccel/images/example.jpg https://s3.nbfc.io/torch/mobilenet.pt /home/runner/artifacts/share/vaccel/labels/imagenet.txt
==371== Memcheck, a memory error detector
==371== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==371== Using Valgrind-3.25.1 and LibVEX; rerun with -h for copyright info
==371== Command: /home/runner/artifacts/bin/torch_inference /home/runner/artifacts/share/vaccel/images/example.jpg https://s3.nbfc.io/torch/mobilenet.pt /home/runner/artifacts/share/vaccel/labels/imagenet.txt
==371==
2025.07.10-20:48:01.91 - <debug> Initializing vAccel
2025.07.10-20:48:01.93 - <info> vAccel 0.7.1-9-b175578f
2025.07.10-20:48:01.93 - <debug> Config:
2025.07.10-20:48:01.93 - <debug>   plugins = libvaccel-torch.so
2025.07.10-20:48:01.93 - <debug>   log_level = debug
2025.07.10-20:48:01.93 - <debug>   log_file = (null)
2025.07.10-20:48:01.93 - <debug>   profiling_enabled = false
2025.07.10-20:48:01.93 - <debug>   version_ignore = false
2025.07.10-20:48:01.94 - <debug> Created top-level rundir: /run/user/0/vaccel/ZpNkGT
2025.07.10-20:48:47.87 - <info> Registered plugin torch 0.2.1-3-0b1978fb
[snipped]
2025.07.10-20:48:48.07 - <debug> Downloading https://s3.nbfc.io/torch/mobilenet.pt
2025.07.10-20:48:53.18 - <debug> Downloaded: 2.4 KB of 13.7 MB (17.2%) | Speed: 474.96 KB/sec
2025.07.10-20:48:54.93 - <debug> Downloaded: 13.7 MB of 13.7 MB (100.0%) | Speed: 2.01 MB/sec
2025.07.10-20:48:54.95 - <debug> Download completed successfully
2025.07.10-20:48:55.04 - <debug> session:1 Registered resource 1
2025.07.10-20:48:56.37 - <debug> session:1 Looking for plugin implementing torch_jitload_forward operation
2025.07.10-20:48:56.37 - <debug> Returning func from hint plugin torch
[snipped]
CUDA not available, running in CPU mode
Success!
Result Tensor :
Output tensor => type:7 nr_dims:2
size: 4000 B
Prediction: banana
[snipped]
==371== HEAP SUMMARY:
==371==     in use at exit: 339,636 bytes in 3,300 blocks
==371==   total heap usage: 1,779,929 allocs, 1,776,629 frees, 405,074,676 bytes allocated
==371==
==371== LEAK SUMMARY:
==371==    definitely lost: 0 bytes in 0 blocks
==371==    indirectly lost: 0 bytes in 0 blocks
==371==      possibly lost: 0 bytes in 0 blocks
==371==    still reachable: 0 bytes in 0 blocks
==371==         suppressed: 339,636 bytes in 3,300 blocks
==371==
==371== For lists of detected and suppressed errors, rerun with: -s
==371== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3160 from 3160)
+ valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --errors-for-leak-kinds=all --max-stackframe=3150000 --keep-debuginfo=yes --error-exitcode=1 --suppressions=/home/ananos/develop/vaccel-plugin-torch/scripts/common/config/valgrind.supp --main-stacksize=33554432 --max-stackframe=4000000 --suppressions=/home/ananos/develop/vaccel-plugin-torch/scripts/config/valgrind.supp /home/runner/artifacts/bin/classify /home/runner/artifacts/share/vaccel/images/example.jpg 1 https://s3.nbfc.io/torch/mobilenet.pt
==376== Memcheck, a memory error detector
==376== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==376== Using Valgrind-3.25.1 and LibVEX; rerun with -h for copyright info
==376== Command: /home/runner/artifacts/bin/classify /home/runner/artifacts/share/vaccel/images/example.jpg 1 https://s3.nbfc.io/torch/mobilenet.pt
==376==
2025.07.10-20:54:37.78 - <debug> Initializing vAccel
2025.07.10-20:54:37.80 - <info> vAccel 0.7.1-9-b175578f
2025.07.10-20:54:37.80 - <debug> Config:
2025.07.10-20:54:37.80 - <debug>   plugins = libvaccel-torch.so
2025.07.10-20:54:37.80 - <debug>   log_level = debug
2025.07.10-20:54:37.80 - <debug>   log_file = (null)
[snipped]
2025.07.10-20:55:30.78 - <debug> Found implementation in torch plugin
2025.07.10-20:55:30.78 - <debug> [torch] Loading model from /run/user/0/vaccel/zazTtc/resource.1/mobilenet.pt
CUDA not available, running in CPU mode
2025.07.10-21:01:14.77 - <debug> [torch] Prediction: banana
classification tags: banana
[snipped]
2025.07.10-21:01:23.92 - <debug> Unregistered plugin torch
==376==
==376== HEAP SUMMARY:
==376==     in use at exit: 341,280 bytes in 3,304 blocks
==376==   total heap usage: 3,167,523 allocs, 3,164,219 frees, 534,094,402 bytes allocated
==376==
==376== LEAK SUMMARY:
==376==    definitely lost: 0 bytes in 0 blocks
==376==    indirectly lost: 0 bytes in 0 blocks
==376==      possibly lost: 0 bytes in 0 blocks
==376==    still reachable: 0 bytes in 0 blocks
==376==         suppressed: 341,280 bytes in 3,304 blocks
==376==
==376== For lists of detected and suppressed errors, rerun with: -s
==376== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3161 from 3161)
+ set +x

Note: we’ll talk about the suppressions a bit later.

The test took roughly 13 minutes. At this point, we were scratching our heads. Why would a high-end Jetson Orin, with way more cores and RAM, perform so much worse under Valgrind than a humble Raspberry Pi? Time to dig deeper into what’s really going on under the hood…

The Surprise Link to heading

When the results came in, the numbers were still striking: the same Valgrind-wrapped Torch test that took almost an hour on our Jetson Orin finished in just 13 minutes on the Raspberry Pi. The Pi, with far less RAM and CPU muscle, still managed to outperform the Orin by a wide margin under these specific conditions.

This result was the definition of counter-intuitive. Everything we know about hardware says the Orin should wipe the floor with the Pi. Yet, here we were, staring at the Pi’s prompt, wondering if we’d missed something obvious.

Digging Deeper: What’s Really Happening? Link to heading

So, what’s going on? Why does a high-end, multi-core ARM system get crushed by a humble Pi in this scenario? The answer lies at the intersection of Valgrind, multi-threaded workloads, and the quirks of the ARM64 ecosystem.

Thread Count: The Double-Edged Sword Link to heading

Modern CPUs, especially high-end ARM chips like the Orin, have lots of cores, and frameworks like PyTorch are eager to use them all. By default, PyTorch will spawn as many threads as it thinks your system can handle, aiming for maximum parallelism.

But Valgrind, which works by instrumenting every memory access and serializing thread activity to catch bugs (only one guest thread ever runs at a time on its synthetic CPU), doesn’t scale gracefully with thread count. In fact:

  • Each additional thread multiplies Valgrind’s overhead. More threads mean more context switches, more synchronization, and more internal bookkeeping.
  • On platforms where Valgrind’s threading support is less mature (like aarch64), this overhead can balloon out of control.
  • On the Raspberry Pi, with its modest core count, PyTorch only spawns a handful of threads. But on the Orin, with many more cores, PyTorch ramps up the thread count and Valgrind’s overhead explodes (the sketch after this list shows one way to observe this).
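To make the serialization point concrete, here’s a minimal sketch of how we’d observe it; the binary and arguments are placeholders from the runs above, and any Valgrind-wrapped libtorch program will behave the same way:

# Start a Valgrind-wrapped inference in the background (placeholder args):
valgrind ./torch_inference example.jpg mobilenet.pt imagenet.txt &
pid=$!
sleep 60   # give libtorch time to spin up its worker pools

# The thread count tracks the cores PyTorch detects (much higher on the Orin):
ls "/proc/$pid/task" | wc -l

# ...yet total CPU hovers around a single core, because Memcheck only ever
# runs one guest thread at a time:
ps -o nlwp,%cpu -p "$pid"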

The ARM64 Valgrind Quirk Link to heading

The arm64 port of Valgrind is still catching up to its amd64 sibling in terms of optimizations and robustness. Some operations, especially those involving threads and memory, are simply slower to emulate and track on arm64. This compounds the thread explosion problem, making high-core-count systems paradoxically slower under Valgrind.

Dealing with library suppressions on arm64 with Valgrind Link to heading

When running applications that rely on specific libraries under Valgrind on arm64 systems, developers frequently encounter a barrage of memory-related warnings and errors. Many of these issues are not actual bugs in your code, but rather artifacts of how these libraries manage memory internally, or limitations in Valgrind’s emulation on such architectures.

For instance, OpenSSL is known for its custom memory management strategies. It often allocates memory statically or uses platform-specific tricks, which can confuse Valgrind’s memory checker. For example, you might see reports of “still reachable” memory or even “definitely lost” memory at program exit.

In reality, much of this memory is intentionally held for the lifetime of the process—such as global tables or the state for the random number generator. These are not leaks in the conventional sense, but Valgrind will still flag them, especially if you run with strict leak checking enabled.

On arm64 platforms, the situation can be further complicated. Valgrind may not fully emulate every instruction used by the specific library. This can lead to false positives, such as uninitialized value warnings, or even more dramatic errors like SIGILL (illegal instruction) if Valgrind encounters an unsupported operation.

It’s not uncommon to see a flood of warnings that are, in practice, harmless or simply not actionable unless you’re developing for that specific library itself.

To manage this noise and focus on real issues in our application, we use Valgrind’s suppression mechanism. Suppression files allow us to tell Valgrind to ignore specific known issues, so we can zero in on genuine bugs in our own code.

Suppression entries are typically matched by library object names, so on arm64 we use patterns like obj:/usr/lib/aarch64-linux-gnu/libssh.so* or, more portably, obj:*libc10*.so* and obj:*libtorch*.so*.

An example suppression snippet (valgrind.supp) looks like the following:

[...]
{
   suppress_libtorch_leaks
   Memcheck:Leak
   match-leak-kinds: reachable,possible
   ...
   obj:*libtorch*.so*
}
{
   suppress_libtorch_overlaps
   Memcheck:Overlap
   ...
   obj:*libtorch*.so*
}
[...]

It’s important to note that not all problems can be suppressed away. For example, if Valgrind encounters a truly unsupported instruction and throws a SIGILL, a suppression file won’t help; you may need to update Valgrind or avoid that code path. Still, for the majority of benign memory warnings from OpenSSL or Torch, well-crafted suppressions keep our Valgrind output manageable and meaningful.
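As for building the suppression file itself, entries don’t have to be written by hand: Memcheck’s --gen-suppressions=all flag prints a ready-to-paste suppression block for every error it reports, so a single noisy run can seed the .supp file. A minimal sketch (binary and paths are placeholders):

# One-off noisy run to harvest candidate suppression blocks:
valgrind --leak-check=full --show-leak-kinds=all \
         --gen-suppressions=all --log-file=valgrind-raw.log \
         ./torch_inference example.jpg mobilenet.pt imagenet.txt
# Suppression blocks are printed between braces; extract them, give them
# descriptive names, and keep only the ones matching third-party objects:
awk '/^{/,/^}/' valgrind-raw.log > candidate.supp

We then widen the obj: frames with wildcards (as in the snippet above) so the entries survive library version bumps.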

Debug Symbol Overhead Link to heading

Another factor: large binaries with lots of debug symbols (common in deep learning stacks) can cause Valgrind to spend an inordinate amount of time just parsing and managing symbol information. The more complex the binary and its dependencies, the longer the startup and runtime overhead; again, this is amplified on arm64.
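To get a rough feel for this cost, one illustrative check is to see how much of the shared objects Valgrind has to parse is debug data; the libtorch path below is an assumption, so point it at your actual install:

# How much of the object is .debug_* data Valgrind must parse at startup?
# (path is an assumption; adjust to your libtorch install)
readelf -S /usr/lib/libtorch.so | grep '\.debug'
# Comparing against a debug-stripped copy gauges the symbol overhead
# (we keep symbols in CI, though: leak reports without them are useless):
cp /usr/lib/libtorch.so /tmp/libtorch-stripped.so
strip --strip-debug /tmp/libtorch-stripped.so
ls -lh /usr/lib/libtorch.so /tmp/libtorch-stripped.so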

Lessons Learned (and What You Can Do) Link to heading

Limit Thread Count: When running under Valgrind, explicitly set PyTorch to use a single thread (e.g. with OMP_NUM_THREADS=1). This alone can make a world of difference.
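A minimal sketch of what that looks like in practice: OMP_NUM_THREADS is the knob that mattered for our libtorch build, while the MKL/OpenBLAS variables are belt-and-braces that may be no-ops depending on which BLAS backend yours was built against (binary and arguments are placeholders; in our tests the real invocation is wrapped by the meson target):

# Cap the worker pools before handing the binary to Valgrind.
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
valgrind --leak-check=full --error-exitcode=1 \
         --suppressions=valgrind.supp \
         ./torch_inference example.jpg mobilenet.pt imagenet.txt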

Test Small: Use the smallest possible model and dataset for Valgrind runs. Save the big workloads for native or lighter-weight profiling tools.

Expect the Unexpected: Don’t assume that “bigger is better” when debugging with Valgrind – sometimes, less really is more!

Profile Performance Separately: Use Valgrind for correctness and bug-hunting, not for benchmarking or performance profiling.
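When we do care about timing, we reach for a lighter tool instead; a quick sketch with Linux perf (standard software events, placeholder binary):

# Timing questions go to perf, not Valgrind:
perf stat -e task-clock,context-switches,cpu-migrations \
    ./torch_inference example.jpg mobilenet.pt imagenet.txt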

And here’s the full output of the test, on a runner VM on the Jetson Orin, taking less than 6 minutes:

$ ninja run-examples-valgrind -C build
ninja: Entering directory `build'
[0/1] Running external command run-examples-valgrind (wrapped by meson to set env)
Arch is 64bit      : true
Default config dir : /home/ananos/vaccel-plugin-torch/scripts/common/config
Package            : vaccel-torch
Package config dir : /home/ananos/vaccel-plugin-torch/scripts/config
Package lib dir    : /home/ananos/vaccel-plugin-torch/build/src
vAccel prefix      : /home/runner/artifacts
vAccel lib dir     : /home/runner/artifacts/lib/aarch64-linux-gnu
vAccel bin dir     : /home/runner/artifacts/bin
vAccel share dir   : /home/runner/artifacts/share/vaccel


Running examples with plugin 'libvaccel-torch.so'
+ eval valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --errors-for-leak-kinds=all --max-stackframe=3150000 --keep-debuginfo=yes --error-exitcode=1 --suppressions=/home/ananos/vaccel-plugin-torch/scripts/common/config/valgrind.supp --main-stacksize=33554432 --max-stackframe=4000000 --fair-sched=no --suppressions=/home/ananos/vaccel-plugin-torch/scripts/config/valgrind.supp /home/runner/artifacts/bin/torch_inference /home/runner/artifacts/share/vaccel/images/example.jpg https://s3.nbfc.io/torch/mobilenet.pt /home/runner/artifacts/share/vaccel/labels/imagenet.txt
+ valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --errors-for-leak-kinds=all --max-stackframe=3150000 --keep-debuginfo=yes --error-exitcode=1 --suppressions=/home/ananos/vaccel-plugin-torch/scripts/common/config/valgrind.supp --main-stacksize=33554432 --max-stackframe=4000000 --fair-sched=no --suppressions=/home/ananos/vaccel-plugin-torch/scripts/config/valgrind.supp /home/runner/artifacts/bin/torch_inference /home/runner/artifacts/share/vaccel/images/example.jpg https://s3.nbfc.io/torch/mobilenet.pt /home/runner/artifacts/share/vaccel/labels/imagenet.txt
==1655== Memcheck, a memory error detector
==1655== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==1655== Using Valgrind-3.25.1 and LibVEX; rerun with -h for copyright info
==1655== Command: /home/runner/artifacts/bin/torch_inference /home/runner/artifacts/share/vaccel/images/example.jpg https://s3.nbfc.io/torch/mobilenet.pt /home/runner/artifacts/share/vaccel/labels/imagenet.txt
==1655==
2025.07.10-20:06:28.83 - <debug> Initializing vAccel
2025.07.10-20:06:28.85 - <info> vAccel 0.7.1-9-b175578f
2025.07.10-20:06:28.86 - <debug> Config:
2025.07.10-20:06:28.86 - <debug>   plugins = libvaccel-torch.so
2025.07.10-20:06:28.86 - <debug>   log_level = debug
2025.07.10-20:06:28.86 - <debug>   log_file = (null)
2025.07.10-20:06:28.86 - <debug>   profiling_enabled = false
2025.07.10-20:06:28.86 - <debug>   version_ignore = false
2025.07.10-20:06:28.87 - <debug> Created top-level rundir: /run/user/1000/vaccel/P01ae4
2025.07.10-20:07:27.35 - <info> Registered plugin torch 0.2.1-3-0b1978fb
2025.07.10-20:07:27.35 - <debug> Registered op torch_jitload_forward from plugin torch
2025.07.10-20:07:27.35 - <debug> Registered op torch_sgemm from plugin torch
2025.07.10-20:07:27.35 - <debug> Registered op image_classify from plugin torch
2025.07.10-20:07:27.35 - <debug> Loaded plugin torch from libvaccel-torch.so
2025.07.10-20:07:27.39 - <debug> Initialized resource 1
Initialized model resource 1
2025.07.10-20:07:27.39 - <debug> New rundir for session 1: /run/user/1000/vaccel/P01ae4/session.1
2025.07.10-20:07:27.39 - <debug> Initialized session 1
Initialized vAccel session 1
2025.07.10-20:07:27.40 - <debug> New rundir for resource 1: /run/user/1000/vaccel/P01ae4/resource.1
2025.07.10-20:07:27.62 - <debug> Downloading https://s3.nbfc.io/torch/mobilenet.pt
2025.07.10-20:07:33.90 - <debug> Downloaded: 555.7 KB of 13.7 MB (4.0%) | Speed: 88.84 KB/sec
2025.07.10-20:07:36.78 - <debug> Downloaded: 13.7 MB of 13.7 MB (100.0%) | Speed: 1.50 MB/sec
2025.07.10-20:07:36.80 - <debug> Download completed successfully
2025.07.10-20:07:36.94 - <debug> session:1 Registered resource 1
2025.07.10-20:07:38.16 - <debug> session:1 Looking for plugin implementing torch_jitload_forward operation
2025.07.10-20:07:38.16 - <debug> Returning func from hint plugin torch
2025.07.10-20:07:38.16 - <debug> Found implementation in torch plugin
2025.07.10-20:07:38.16 - <debug> [torch] session:1 Jitload & Forward Process
2025.07.10-20:07:38.16 - <debug> [torch] Model: /run/user/1000/vaccel/P01ae4/resource.1/mobilenet.pt
2025.07.10-20:07:38.17 - <debug> [torch] Loading model from /run/user/1000/vaccel/P01ae4/resource.1/mobilenet.pt
CUDA not available, running in CPU mode
Success!
Result Tensor :
Output tensor => type:7 nr_dims:2
size: 4000 B
Prediction: banana
2025.07.10-20:08:39.93 - <debug> session:1 Unregistered resource 1
2025.07.10-20:08:39.94 - <debug> Released session 1
2025.07.10-20:08:39.94 - <debug> Removing file /run/user/1000/vaccel/P01ae4/resource.1/mobilenet.pt
2025.07.10-20:08:39.95 - <debug> Released resource 1
2025.07.10-20:08:48.91 - <debug> Cleaning up vAccel
2025.07.10-20:08:48.91 - <debug> Cleaning up sessions
2025.07.10-20:08:48.91 - <debug> Cleaning up resources
2025.07.10-20:08:48.91 - <debug> Cleaning up plugins
2025.07.10-20:08:48.92 - <debug> Unregistered plugin torch
==1655==
==1655== HEAP SUMMARY:
==1655==     in use at exit: 304,924 bytes in 3,290 blocks
==1655==   total heap usage: 1,780,098 allocs, 1,776,808 frees, 406,800,553 bytes allocated
==1655==
==1655== LEAK SUMMARY:
==1655==    definitely lost: 0 bytes in 0 blocks
==1655==    indirectly lost: 0 bytes in 0 blocks
==1655==      possibly lost: 0 bytes in 0 blocks
==1655==    still reachable: 0 bytes in 0 blocks
==1655==         suppressed: 304,924 bytes in 3,290 blocks
==1655==
==1655== For lists of detected and suppressed errors, rerun with: -s
==1655== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3153 from 3153)
+ [ 1 = 1 ]
+ eval valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --errors-for-leak-kinds=all --max-stackframe=3150000 --keep-debuginfo=yes --error-exitcode=1 --suppressions=/home/ananos/vaccel-plugin-torch/scripts/common/config/valgrind.supp --main-stacksize=33554432 --max-stackframe=4000000 --fair-sched=no --suppressions=/home/ananos/vaccel-plugin-torch/scripts/config/valgrind.supp /home/runner/artifacts/bin/classify /home/runner/artifacts/share/vaccel/images/example.jpg 1 https://s3.nbfc.io/torch/mobilenet.pt
+ valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --errors-for-leak-kinds=all --max-stackframe=3150000 --keep-debuginfo=yes --error-exitcode=1 --suppressions=/home/ananos/vaccel-plugin-torch/scripts/common/config/valgrind.supp --main-stacksize=33554432 --max-stackframe=4000000 --fair-sched=no --suppressions=/home/ananos/vaccel-plugin-torch/scripts/config/valgrind.supp /home/runner/artifacts/bin/classify /home/runner/artifacts/share/vaccel/images/example.jpg 1 https://s3.nbfc.io/torch/mobilenet.pt
==1657== Memcheck, a memory error detector
==1657== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==1657== Using Valgrind-3.25.1 and LibVEX; rerun with -h for copyright info
==1657== Command: /home/runner/artifacts/bin/classify /home/runner/artifacts/share/vaccel/images/example.jpg 1 https://s3.nbfc.io/torch/mobilenet.pt
==1657==
2025.07.10-20:08:50.40 - <debug> Initializing vAccel
2025.07.10-20:08:50.42 - <info> vAccel 0.7.1-9-b175578f
2025.07.10-20:08:50.42 - <debug> Config:
2025.07.10-20:08:50.42 - <debug>   plugins = libvaccel-torch.so
2025.07.10-20:08:50.42 - <debug>   log_level = debug
2025.07.10-20:08:50.42 - <debug>   log_file = (null)
2025.07.10-20:08:50.42 - <debug>   profiling_enabled = false
2025.07.10-20:08:50.42 - <debug>   version_ignore = false
2025.07.10-20:08:50.43 - <debug> Created top-level rundir: /run/user/1000/vaccel/73XJNT
2025.07.10-20:09:48.93 - <info> Registered plugin torch 0.2.1-3-0b1978fb
2025.07.10-20:09:48.93 - <debug> Registered op torch_jitload_forward from plugin torch
2025.07.10-20:09:48.93 - <debug> Registered op torch_sgemm from plugin torch
2025.07.10-20:09:48.93 - <debug> Registered op image_classify from plugin torch
2025.07.10-20:09:48.93 - <debug> Loaded plugin torch from libvaccel-torch.so
2025.07.10-20:09:48.94 - <debug> New rundir for session 1: /run/user/1000/vaccel/73XJNT/session.1
2025.07.10-20:09:48.95 - <debug> Initialized session 1
Initialized session with id: 1
2025.07.10-20:09:48.97 - <debug> Initialized resource 1
2025.07.10-20:09:48.98 - <debug> New rundir for resource 1: /run/user/1000/vaccel/73XJNT/resource.1
2025.07.10-20:09:49.19 - <debug> Downloading https://s3.nbfc.io/torch/mobilenet.pt
2025.07.10-20:09:55.17 - <debug> Downloaded: 816.6 KB of 13.7 MB (5.8%) | Speed: 137.30 KB/sec
2025.07.10-20:09:57.71 - <debug> Downloaded: 13.7 MB of 13.7 MB (100.0%) | Speed: 1.62 MB/sec
2025.07.10-20:09:57.73 - <debug> Download completed successfully
2025.07.10-20:09:57.87 - <debug> session:1 Registered resource 1
2025.07.10-20:09:57.88 - <debug> session:1 Looking for plugin implementing VACCEL_OP_IMAGE_CLASSIFY
2025.07.10-20:09:57.88 - <debug> Returning func from hint plugin torch
2025.07.10-20:09:57.88 - <debug> Found implementation in torch plugin
2025.07.10-20:09:57.88 - <debug> [torch] Loading model from /run/user/1000/vaccel/73XJNT/resource.1/mobilenet.pt
CUDA not available, running in CPU mode
2025.07.10-20:11:31.42 - <debug> [torch] Prediction: banana
classification tags: banana
classification imagename: PLACEHOLDER
2025.07.10-20:11:31.93 - <debug> session:1 Unregistered resource 1
2025.07.10-20:11:31.93 - <debug> Removing file /run/user/1000/vaccel/73XJNT/resource.1/mobilenet.pt
2025.07.10-20:11:31.94 - <debug> Released resource 1
2025.07.10-20:11:31.95 - <debug> Released session 1
2025.07.10-20:11:44.12 - <debug> Cleaning up vAccel
2025.07.10-20:11:44.12 - <debug> Cleaning up sessions
2025.07.10-20:11:44.12 - <debug> Cleaning up resources
2025.07.10-20:11:44.12 - <debug> Cleaning up plugins
2025.07.10-20:11:44.12 - <debug> Unregistered plugin torch
==1657==
==1657== HEAP SUMMARY:
==1657==     in use at exit: 306,616 bytes in 3,294 blocks
==1657==   total heap usage: 3,167,511 allocs, 3,164,217 frees, 533,893,229 bytes allocated
==1657==
==1657== LEAK SUMMARY:
==1657==    definitely lost: 0 bytes in 0 blocks
==1657==    indirectly lost: 0 bytes in 0 blocks
==1657==      possibly lost: 0 bytes in 0 blocks
==1657==    still reachable: 0 bytes in 0 blocks
==1657==         suppressed: 306,616 bytes in 3,294 blocks
==1657==
==1657== For lists of detected and suppressed errors, rerun with: -s
==1657== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3153 from 3153)
+ set +x

and the actual test in Figure 3, taking 8 minutes, almost 7 times faster than the original execution:

Figure 3: Fixed arm64 valgrind test

Wrapping Up Link to heading

This experience was a great reminder that debugging tools and parallel workloads don’t always play nicely, especially on less mature platforms. Sometimes, the humble Raspberry Pi will leave a high-end chip in the dust, at least when Valgrind is in the mix.

So next time you’re staring at a progress bar that refuses to budge, remember: more cores might just mean more waiting. And don’t be afraid to try your tests on the “little guy” – you might be surprised by what you find.