If you’ve ever tried to debug a PyTorch program on an ARM64 system using Valgrind, you might have stumbled on something really odd: “Why does it take so long?”. And if you’re like us, you would probably try to run it locally, on a Raspberry Pi, to see what’s going on… And the madness begins!
TL;DR: as you probably figured out from the title of this post, it’s a counter-intuitive experience: the more cores your machine has, the slower your (torch) code seems to run under Valgrind. Shouldn’t more cores mean more speed? Let’s dive into why that’s not always the case ;)
The background
In an effort to improve our testing infrastructure for vAccel and make it more robust, we started cleaning up our examples, unifying the build & test scripts and adding more elaborate test cases for both the library and the plugins. Valgrind provides quite a decent experience for this, especially for catching multi-arch errors, memory leaks and dangling pointers (something quite common when writing in C :D).
The issue
While adding the Valgrind mode of execution in our tests for the vAccel plugins, we noticed something really weird in the Torch case. The test was taking forever! Specifically, while the equivalent amd64 run was taking roughly 4 and a half minutes (Figure 1), the arm64 run was taking nearly an hour (53 minutes) – see Figure 2.
Debugging
The first thing that came to mind was that there was something wrong with our infrastructure. We run self-hosted GitHub runners, with custom container images that support the relevant software components we need for each plugin/case. We run those on our infra, a set of VMs running on top of diverse low-end bare-metal machines, both amd64 and arm64. The arm64 runners run on a couple of Jetson AGX Orins, with 8 cores and 32GB of RAM.
And what’s the first thing to try (especially when debugging on arm64)? A Raspberry Pi of course!
So getting the runner container image on a Raspberry Pi 5, with 8GB of RAM, spinning up the container, building the library and the plugin, all took roughly 10 minutes. And we’re ready for the test:
# ninja run-examples-valgrind -C build-container
ninja: Entering directory `build-container'
[0/1] Running external command run-examples-valgrind (wrapped by meson to set env)
Arch is 64bit : true
[snipped]
Running examples with plugin 'libvaccel-torch.so'
+ valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --errors-for-leak-kinds=all --max-stackframe=3150000 --keep-debuginfo=yes --error-exitcode=1 --suppressions=/home/ananos/develop/vaccel-plugin-torch/scripts/common/config/valgrind.supp --main-stacksize=33554432 --max-stackframe=4000000 --suppressions=/home/ananos/develop/vaccel-plugin-torch/scripts/config/valgrind.supp /home/runner/artifacts/bin/torch_inference /home/runner/artifacts/share/vaccel/images/example.jpg https://s3.nbfc.io/torch/mobilenet.pt /home/runner/artifacts/share/vaccel/labels/imagenet.txt
==371== Memcheck, a memory error detector
==371== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==371== Using Valgrind-3.25.1 and LibVEX; rerun with -h for copyright info
==371== Command: /home/runner/artifacts/bin/torch_inference /home/runner/artifacts/share/vaccel/images/example.jpg https://s3.nbfc.io/torch/mobilenet.pt /home/runner/artifacts/share/vaccel/labels/imagenet.txt
==371==
2025.07.10-20:48:01.91 - <debug> Initializing vAccel
2025.07.10-20:48:01.93 - <info> vAccel 0.7.1-9-b175578f
2025.07.10-20:48:01.93 - <debug> Config:
2025.07.10-20:48:01.93 - <debug> plugins = libvaccel-torch.so
2025.07.10-20:48:01.93 - <debug> log_level = debug
2025.07.10-20:48:01.93 - <debug> log_file = (null)
2025.07.10-20:48:01.93 - <debug> profiling_enabled = false
2025.07.10-20:48:01.93 - <debug> version_ignore = false
2025.07.10-20:48:01.94 - <debug> Created top-level rundir: /run/user/0/vaccel/ZpNkGT
2025.07.10-20:48:47.87 - <info> Registered plugin torch 0.2.1-3-0b1978fb
[snipped]
2025.07.10-20:48:48.07 - <debug> Downloading https://s3.nbfc.io/torch/mobilenet.pt
2025.07.10-20:48:53.18 - <debug> Downloaded: 2.4 KB of 13.7 MB (17.2%) | Speed: 474.96 KB/sec
2025.07.10-20:48:54.93 - <debug> Downloaded: 13.7 MB of 13.7 MB (100.0%) | Speed: 2.01 MB/sec
2025.07.10-20:48:54.95 - <debug> Download completed successfully
2025.07.10-20:48:55.04 - <debug> session:1 Registered resource 1
2025.07.10-20:48:56.37 - <debug> session:1 Looking for plugin implementing torch_jitload_forward operation
2025.07.10-20:48:56.37 - <debug> Returning func from hint plugin torch
[snipped]
CUDA not available, running in CPU mode
Success!
Result Tensor :
Output tensor => type:7 nr_dims:2
size: 4000 B
Prediction: banana
[snipped]
==371== HEAP SUMMARY:
==371== in use at exit: 339,636 bytes in 3,300 blocks
==371== total heap usage: 1,779,929 allocs, 1,776,629 frees, 405,074,676 bytes allocated
==371==
==371== LEAK SUMMARY:
==371== definitely lost: 0 bytes in 0 blocks
==371== indirectly lost: 0 bytes in 0 blocks
==371== possibly lost: 0 bytes in 0 blocks
==371== still reachable: 0 bytes in 0 blocks
==371== suppressed: 339,636 bytes in 3,300 blocks
==371==
==371== For lists of detected and suppressed errors, rerun with: -s
==371== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3160 from 3160)
+ valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --errors-for-leak-kinds=all --max-stackframe=3150000 --keep-debuginfo=yes --error-exitcode=1 --suppressions=/home/ananos/develop/vaccel-plugin-torch/scripts/common/config/valgrind.supp --main-stacksize=33554432 --max-stackframe=4000000 --suppressions=/home/ananos/develop/vaccel-plugin-torch/scripts/config/valgrind.supp /home/runner/artifacts/bin/classify /home/runner/artifacts/share/vaccel/images/example.jpg 1 https://s3.nbfc.io/torch/mobilenet.pt
==376== Memcheck, a memory error detector
==376== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==376== Using Valgrind-3.25.1 and LibVEX; rerun with -h for copyright info
==376== Command: /home/runner/artifacts/bin/classify /home/runner/artifacts/share/vaccel/images/example.jpg 1 https://s3.nbfc.io/torch/mobilenet.pt
==376==
2025.07.10-20:54:37.78 - <debug> Initializing vAccel
2025.07.10-20:54:37.80 - <info> vAccel 0.7.1-9-b175578f
2025.07.10-20:54:37.80 - <debug> Config:
2025.07.10-20:54:37.80 - <debug> plugins = libvaccel-torch.so
2025.07.10-20:54:37.80 - <debug> log_level = debug
2025.07.10-20:54:37.80 - <debug> log_file = (null)
[snipped]
2025.07.10-20:55:30.78 - <debug> Found implementation in torch plugin
2025.07.10-20:55:30.78 - <debug> [torch] Loading model from /run/user/0/vaccel/zazTtc/resource.1/mobilenet.pt
CUDA not available, running in CPU mode
2025.07.10-21:01:14.77 - <debug> [torch] Prediction: banana
classification tags: banana
[snipped]
2025.07.10-21:01:23.92 - <debug> Unregistered plugin torch
==376==
==376== HEAP SUMMARY:
==376== in use at exit: 341,280 bytes in 3,304 blocks
==376== total heap usage: 3,167,523 allocs, 3,164,219 frees, 534,094,402 bytes allocated
==376==
==376== LEAK SUMMARY:
==376== definitely lost: 0 bytes in 0 blocks
==376== indirectly lost: 0 bytes in 0 blocks
==376== possibly lost: 0 bytes in 0 blocks
==376== still reachable: 0 bytes in 0 blocks
==376== suppressed: 341,280 bytes in 3,304 blocks
==376==
==376== For lists of detected and suppressed errors, rerun with: -s
==376== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3161 from 3161)
+ set +x
Note: We’ll talk about the suppressions a bit later.
The test took roughly 13 minutes. At this point, we were scratching our heads. Why would a high-end Jetson Orin, with way more cores and RAM, perform so much worse under Valgrind than a humble Raspberry Pi? Time to dig deeper into what’s really going on under the hood…
The Surprise
When the results came in, the numbers were still striking: the same Valgrind-wrapped Torch test that took almost an hour on our Jetson Orin finished in just 13 minutes on the Raspberry Pi. The Pi, with far less RAM and CPU muscle, still managed to outperform the Orin by a wide margin under these specific conditions.
This result was the definition of counter-intuitive. Everything we know about hardware says the Orin should wipe the floor with the Pi. Yet, here we were, staring at the Pi’s prompt, wondering if we’d missed something obvious.
Digging Deeper: What’s Really Happening?
So, what’s going on? Why does a high-end, multi-core ARM system get crushed by a humble Pi in this scenario? The answer lies at the intersection of Valgrind, multi-threaded workloads, and the quirks of the ARM64 ecosystem.
Thread Count: The Double-Edged Sword
Modern CPUs, especially high-end ARM chips like the Orin, have lots of cores, and frameworks like PyTorch are eager to use them all. By default, PyTorch will spawn as many threads as it thinks your system can handle, aiming for maximum parallelism.
But Valgrind, which works by instrumenting every memory access and synchronizing thread activity to catch bugs, doesn’t scale gracefully with thread count. In fact:
- Each additional thread multiplies Valgrind’s overhead. More threads mean more context switches, more synchronization, and more internal bookkeeping.
- On platforms where Valgrind’s threading support is less mature (like aarch64), this overhead can balloon out of control.
- On the Raspberry Pi, with its modest core count, PyTorch only spawns a handful of threads. But on the Orin, with many more cores, PyTorch ramps up the thread count, and Valgrind’s overhead explodes.
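You can see this effect for yourself by counting the kernel tasks of the instrumented process while it runs. A minimal sketch (PID 371 is just the one from the log above; take yours from the ==PID== prefix in Valgrind’s output):

```shell
# Count the threads of a running process via procfs.
# Replace 371 with the PID shown in Valgrind's ==PID== prefix.
ls /proc/371/task | wc -l
```

On the Pi you’d expect only a handful of tasks; on the Orin, roughly one worker per core, and every one of them adds to Valgrind’s serialization cost.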
The ARM64 Valgrind Quirk
The arm64 port of Valgrind is still catching up to its amd64 sibling in terms of optimizations and robustness. Some operations, especially those involving threads and memory, are simply slower to emulate and track on arm64. This compounds the thread explosion problem, making high-core-count systems paradoxically slower under Valgrind.
Dealing with library suppressions on arm64 with Valgrind
When running applications that rely on specific libraries under Valgrind on arm64 systems, developers frequently encounter a barrage of memory-related warnings and errors. Many of these issues are not actual bugs in your code, but rather artifacts of how these libraries manage memory internally, or limitations in Valgrind’s emulation on such architectures.
For instance, OpenSSL is known for its custom memory management strategies. It often allocates memory statically or uses platform-specific tricks, which can confuse Valgrind’s memory checker. For example, you might see reports of “still reachable” memory or even “definitely lost” memory at program exit. In reality, much of this memory is intentionally held for the lifetime of the process, such as global tables or the state for the random number generator. These are not leaks in the conventional sense, but Valgrind will still flag them, especially if you run with strict leak checking enabled.
On arm64 platforms, the situation can be further complicated. Valgrind may not fully emulate every instruction used by the specific library. This can lead to false positives, such as uninitialized-value warnings, or even more dramatic errors like SIGILL (illegal instruction) if Valgrind encounters an unsupported operation.
It’s not uncommon to see a flood of warnings that are, in practice, harmless or simply not actionable unless you’re developing for that specific library itself.
To manage this noise and focus on real issues in our application, we use Valgrind’s suppression mechanism. Suppression files allow us to tell Valgrind to ignore specific known issues, so we can zero in on genuine bugs in our own code.
Suppression entries are typically matched by library object names, so on arm64 we use patterns like /usr/lib/aarch64-linux-gnu/libssh.so* or obj:*libc10*.so*, obj:*libtorch*.so*.
An example suppression snippet (valgrind.supp) looks like the following:
[...]
{
   suppress_libtorch_leaks
   Memcheck:Leak
   match-leak-kinds: reachable,possible
   ...
   obj:*libtorch*.so*
}
{
   suppress_libtorch_overlaps
   Memcheck:Overlap
   ...
   obj:*libtorch*.so*
}
[...]
It’s important to note that not all problems can be suppressed away. For example, if Valgrind encounters a truly unsupported instruction and throws a SIGILL, a suppression file won’t help; you may need to update Valgrind or avoid that code path. Still, for the majority of benign memory warnings from OpenSSL or Torch, well-crafted suppressions keep our Valgrind output manageable and meaningful.
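As a side note, you don’t have to write these entries by hand: Valgrind can generate ready-to-paste suppression blocks for every error it reports via --gen-suppressions. A sketch (the binary and its arguments are abbreviated placeholders here):

```shell
# Ask Valgrind to print a suppression entry after each reported error;
# collect stderr and trim the resulting file down to the entries you
# actually want to keep.
valgrind --leak-check=full --show-leak-kinds=all \
  --gen-suppressions=all \
  ./torch_inference example.jpg mobilenet.pt imagenet.txt \
  2> suppressions-raw.log
```

The generated entries tend to be overly specific (full call stacks), so generalizing them with obj:* patterns, as above, keeps them robust across rebuilds.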
Debug Symbol Overhead
Another factor: large binaries with lots of debug symbols (common in deep learning stacks) can cause Valgrind to spend an inordinate amount of time just parsing and managing symbol information. The more complex the binary and its dependencies, the longer the startup and runtime overhead. Again, this is amplified on arm64.
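A quick way to gauge how much of this payload Valgrind has to chew through is to look at the debug-related sections of the shared objects involved. A sketch (the libtorch path is an example; adjust it to wherever your build installs it):

```shell
# List the sections of a shared object and keep only the debug/symbol
# ones; large .debug_info or .symtab sections mean more parsing work
# for Valgrind at startup.
readelf -SW /usr/lib/libtorch.so | grep -E 'debug|symtab'
```

This also explains part of the long gap between “Initializing vAccel” and “Registered plugin torch” in the logs above: that is where the big shared objects get loaded and their symbols processed.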
Lessons Learned (and What You Can Do)
- Limit thread count: When running under Valgrind, explicitly set PyTorch to use a single thread (e.g. with OMP_NUM_THREADS=1). This alone can make a world of difference.
- Test small: Use the smallest possible model and dataset for Valgrind runs. Save the big workloads for native or lighter-weight profiling tools.
- Expect the unexpected: Don’t assume that “bigger is better” when debugging with Valgrind – sometimes, less really is more!
- Profile performance separately: Use Valgrind for correctness and bug-hunting, not for benchmarking or performance profiling.
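Putting the first lesson into practice is a one-line change to the test invocation. A minimal sketch (binary, model and Valgrind flags are abbreviated here; the full command lines are in the logs):

```shell
# Pin the thread pools PyTorch consults to a single worker before
# handing the binary to Valgrind. MKL_NUM_THREADS only matters if
# your libtorch build links against MKL.
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
valgrind --leak-check=full --track-origins=yes \
  ./torch_inference example.jpg mobilenet.pt imagenet.txt
```

In our case we set the environment via the meson-wrapped test runner, so every example in the suite inherits the single-thread setting.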
And here’s the full snippet of the test, on a runner VM on the Jetson Orin, taking less than 6 minutes:
$ ninja run-examples-valgrind -C build
ninja: Entering directory `build'
[0/1] Running external command run-examples-valgrind (wrapped by meson to set env)
Arch is 64bit : true
Default config dir : /home/ananos/vaccel-plugin-torch/scripts/common/config
Package : vaccel-torch
Package config dir : /home/ananos/vaccel-plugin-torch/scripts/config
Package lib dir : /home/ananos/vaccel-plugin-torch/build/src
vAccel prefix : /home/runner/artifacts
vAccel lib dir : /home/runner/artifacts/lib/aarch64-linux-gnu
vAccel bin dir : /home/runner/artifacts/bin
vAccel share dir : /home/runner/artifacts/share/vaccel


Running examples with plugin 'libvaccel-torch.so'
+ eval valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --errors-for-leak-kinds=all --max-stackframe=3150000 --keep-debuginfo=yes --error-exitcode=1 --suppressions=/home/ananos/vaccel-plugin-torch/scripts/common/config/valgrind.supp --main-stacksize=33554432 --max-stackframe=4000000 --fair-sched=no --suppressions=/home/ananos/vaccel-plugin-torch/scripts/config/valgrind.supp /home/runner/artifacts/bin/torch_inference /home/runner/artifacts/share/vaccel/images/example.jpg https://s3.nbfc.io/torch/mobilenet.pt /home/runner/artifacts/share/vaccel/labels/imagenet.txt
+ valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --errors-for-leak-kinds=all --max-stackframe=3150000 --keep-debuginfo=yes --error-exitcode=1 --suppressions=/home/ananos/vaccel-plugin-torch/scripts/common/config/valgrind.supp --main-stacksize=33554432 --max-stackframe=4000000 --fair-sched=no --suppressions=/home/ananos/vaccel-plugin-torch/scripts/config/valgrind.supp /home/runner/artifacts/bin/torch_inference /home/runner/artifacts/share/vaccel/images/example.jpg https://s3.nbfc.io/torch/mobilenet.pt /home/runner/artifacts/share/vaccel/labels/imagenet.txt
==1655== Memcheck, a memory error detector
==1655== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==1655== Using Valgrind-3.25.1 and LibVEX; rerun with -h for copyright info
==1655== Command: /home/runner/artifacts/bin/torch_inference /home/runner/artifacts/share/vaccel/images/example.jpg https://s3.nbfc.io/torch/mobilenet.pt /home/runner/artifacts/share/vaccel/labels/imagenet.txt
==1655==
2025.07.10-20:06:28.83 - <debug> Initializing vAccel
2025.07.10-20:06:28.85 - <info> vAccel 0.7.1-9-b175578f
2025.07.10-20:06:28.86 - <debug> Config:
2025.07.10-20:06:28.86 - <debug> plugins = libvaccel-torch.so
2025.07.10-20:06:28.86 - <debug> log_level = debug
2025.07.10-20:06:28.86 - <debug> log_file = (null)
2025.07.10-20:06:28.86 - <debug> profiling_enabled = false
2025.07.10-20:06:28.86 - <debug> version_ignore = false
2025.07.10-20:06:28.87 - <debug> Created top-level rundir: /run/user/1000/vaccel/P01ae4
2025.07.10-20:07:27.35 - <info> Registered plugin torch 0.2.1-3-0b1978fb
2025.07.10-20:07:27.35 - <debug> Registered op torch_jitload_forward from plugin torch
2025.07.10-20:07:27.35 - <debug> Registered op torch_sgemm from plugin torch
2025.07.10-20:07:27.35 - <debug> Registered op image_classify from plugin torch
2025.07.10-20:07:27.35 - <debug> Loaded plugin torch from libvaccel-torch.so
2025.07.10-20:07:27.39 - <debug> Initialized resource 1
Initialized model resource 1
2025.07.10-20:07:27.39 - <debug> New rundir for session 1: /run/user/1000/vaccel/P01ae4/session.1
2025.07.10-20:07:27.39 - <debug> Initialized session 1
Initialized vAccel session 1
2025.07.10-20:07:27.40 - <debug> New rundir for resource 1: /run/user/1000/vaccel/P01ae4/resource.1
2025.07.10-20:07:27.62 - <debug> Downloading https://s3.nbfc.io/torch/mobilenet.pt
2025.07.10-20:07:33.90 - <debug> Downloaded: 555.7 KB of 13.7 MB (4.0%) | Speed: 88.84 KB/sec
2025.07.10-20:07:36.78 - <debug> Downloaded: 13.7 MB of 13.7 MB (100.0%) | Speed: 1.50 MB/sec
2025.07.10-20:07:36.80 - <debug> Download completed successfully
2025.07.10-20:07:36.94 - <debug> session:1 Registered resource 1
2025.07.10-20:07:38.16 - <debug> session:1 Looking for plugin implementing torch_jitload_forward operation
2025.07.10-20:07:38.16 - <debug> Returning func from hint plugin torch
2025.07.10-20:07:38.16 - <debug> Found implementation in torch plugin
2025.07.10-20:07:38.16 - <debug> [torch] session:1 Jitload & Forward Process
2025.07.10-20:07:38.16 - <debug> [torch] Model: /run/user/1000/vaccel/P01ae4/resource.1/mobilenet.pt
2025.07.10-20:07:38.17 - <debug> [torch] Loading model from /run/user/1000/vaccel/P01ae4/resource.1/mobilenet.pt
CUDA not available, running in CPU mode
Success!
Result Tensor :
Output tensor => type:7 nr_dims:2
size: 4000 B
Prediction: banana
2025.07.10-20:08:39.93 - <debug> session:1 Unregistered resource 1
2025.07.10-20:08:39.94 - <debug> Released session 1
2025.07.10-20:08:39.94 - <debug> Removing file /run/user/1000/vaccel/P01ae4/resource.1/mobilenet.pt
2025.07.10-20:08:39.95 - <debug> Released resource 1
2025.07.10-20:08:48.91 - <debug> Cleaning up vAccel
2025.07.10-20:08:48.91 - <debug> Cleaning up sessions
2025.07.10-20:08:48.91 - <debug> Cleaning up resources
2025.07.10-20:08:48.91 - <debug> Cleaning up plugins
2025.07.10-20:08:48.92 - <debug> Unregistered plugin torch
==1655==
==1655== HEAP SUMMARY:
==1655== in use at exit: 304,924 bytes in 3,290 blocks
==1655== total heap usage: 1,780,098 allocs, 1,776,808 frees, 406,800,553 bytes allocated
==1655==
==1655== LEAK SUMMARY:
==1655== definitely lost: 0 bytes in 0 blocks
==1655== indirectly lost: 0 bytes in 0 blocks
==1655== possibly lost: 0 bytes in 0 blocks
==1655== still reachable: 0 bytes in 0 blocks
==1655== suppressed: 304,924 bytes in 3,290 blocks
==1655==
==1655== For lists of detected and suppressed errors, rerun with: -s
==1655== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3153 from 3153)
+ [ 1 = 1 ]
+ eval valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --errors-for-leak-kinds=all --max-stackframe=3150000 --keep-debuginfo=yes --error-exitcode=1 --suppressions=/home/ananos/vaccel-plugin-torch/scripts/common/config/valgrind.supp --main-stacksize=33554432 --max-stackframe=4000000 --fair-sched=no --suppressions=/home/ananos/vaccel-plugin-torch/scripts/config/valgrind.supp /home/runner/artifacts/bin/classify /home/runner/artifacts/share/vaccel/images/example.jpg 1 https://s3.nbfc.io/torch/mobilenet.pt
+ valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --errors-for-leak-kinds=all --max-stackframe=3150000 --keep-debuginfo=yes --error-exitcode=1 --suppressions=/home/ananos/vaccel-plugin-torch/scripts/common/config/valgrind.supp --main-stacksize=33554432 --max-stackframe=4000000 --fair-sched=no --suppressions=/home/ananos/vaccel-plugin-torch/scripts/config/valgrind.supp /home/runner/artifacts/bin/classify /home/runner/artifacts/share/vaccel/images/example.jpg 1 https://s3.nbfc.io/torch/mobilenet.pt
==1657== Memcheck, a memory error detector
==1657== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==1657== Using Valgrind-3.25.1 and LibVEX; rerun with -h for copyright info
==1657== Command: /home/runner/artifacts/bin/classify /home/runner/artifacts/share/vaccel/images/example.jpg 1 https://s3.nbfc.io/torch/mobilenet.pt
==1657==
2025.07.10-20:08:50.40 - <debug> Initializing vAccel
2025.07.10-20:08:50.42 - <info> vAccel 0.7.1-9-b175578f
2025.07.10-20:08:50.42 - <debug> Config:
2025.07.10-20:08:50.42 - <debug> plugins = libvaccel-torch.so
2025.07.10-20:08:50.42 - <debug> log_level = debug
2025.07.10-20:08:50.42 - <debug> log_file = (null)
2025.07.10-20:08:50.42 - <debug> profiling_enabled = false
2025.07.10-20:08:50.42 - <debug> version_ignore = false
2025.07.10-20:08:50.43 - <debug> Created top-level rundir: /run/user/1000/vaccel/73XJNT
2025.07.10-20:09:48.93 - <info> Registered plugin torch 0.2.1-3-0b1978fb
2025.07.10-20:09:48.93 - <debug> Registered op torch_jitload_forward from plugin torch
2025.07.10-20:09:48.93 - <debug> Registered op torch_sgemm from plugin torch
2025.07.10-20:09:48.93 - <debug> Registered op image_classify from plugin torch
2025.07.10-20:09:48.93 - <debug> Loaded plugin torch from libvaccel-torch.so
2025.07.10-20:09:48.94 - <debug> New rundir for session 1: /run/user/1000/vaccel/73XJNT/session.1
2025.07.10-20:09:48.95 - <debug> Initialized session 1
Initialized session with id: 1
2025.07.10-20:09:48.97 - <debug> Initialized resource 1
2025.07.10-20:09:48.98 - <debug> New rundir for resource 1: /run/user/1000/vaccel/73XJNT/resource.1
2025.07.10-20:09:49.19 - <debug> Downloading https://s3.nbfc.io/torch/mobilenet.pt
2025.07.10-20:09:55.17 - <debug> Downloaded: 816.6 KB of 13.7 MB (5.8%) | Speed: 137.30 KB/sec
2025.07.10-20:09:57.71 - <debug> Downloaded: 13.7 MB of 13.7 MB (100.0%) | Speed: 1.62 MB/sec
2025.07.10-20:09:57.73 - <debug> Download completed successfully
2025.07.10-20:09:57.87 - <debug> session:1 Registered resource 1
2025.07.10-20:09:57.88 - <debug> session:1 Looking for plugin implementing VACCEL_OP_IMAGE_CLASSIFY
2025.07.10-20:09:57.88 - <debug> Returning func from hint plugin torch
2025.07.10-20:09:57.88 - <debug> Found implementation in torch plugin
2025.07.10-20:09:57.88 - <debug> [torch] Loading model from /run/user/1000/vaccel/73XJNT/resource.1/mobilenet.pt
CUDA not available, running in CPU mode
2025.07.10-20:11:31.42 - <debug> [torch] Prediction: banana
classification tags: banana
classification imagename: PLACEHOLDER
2025.07.10-20:11:31.93 - <debug> session:1 Unregistered resource 1
2025.07.10-20:11:31.93 - <debug> Removing file /run/user/1000/vaccel/73XJNT/resource.1/mobilenet.pt
2025.07.10-20:11:31.94 - <debug> Released resource 1
2025.07.10-20:11:31.95 - <debug> Released session 1
2025.07.10-20:11:44.12 - <debug> Cleaning up vAccel
2025.07.10-20:11:44.12 - <debug> Cleaning up sessions
2025.07.10-20:11:44.12 - <debug> Cleaning up resources
2025.07.10-20:11:44.12 - <debug> Cleaning up plugins
2025.07.10-20:11:44.12 - <debug> Unregistered plugin torch
==1657==
==1657== HEAP SUMMARY:
==1657== in use at exit: 306,616 bytes in 3,294 blocks
==1657== total heap usage: 3,167,511 allocs, 3,164,217 frees, 533,893,229 bytes allocated
==1657==
==1657== LEAK SUMMARY:
==1657== definitely lost: 0 bytes in 0 blocks
==1657== indirectly lost: 0 bytes in 0 blocks
==1657== possibly lost: 0 bytes in 0 blocks
==1657== still reachable: 0 bytes in 0 blocks
==1657== suppressed: 306,616 bytes in 3,294 blocks
==1657==
==1657== For lists of detected and suppressed errors, rerun with: -s
==1657== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3153 from 3153)
+ set +x
And here’s the actual test in Figure 3, taking 8 minutes, almost 7 times faster than the original execution.
Wrapping Up
This experience was a great reminder that debugging tools and parallel workloads don’t always play nicely, especially on less mature platforms. Sometimes, the humble Raspberry Pi will leave a high-end chip in the dust, at least when Valgrind is in the mix.
So next time you’re staring at a progress bar that refuses to budge, remember: more cores might just mean more waiting. And don’t be afraid to try your tests on the “little guy” – you might be surprised by what you find.