That’s what we thought when setting up a BERT-based hate speech classifier.
This was part of a broader experiment using vAccel, our hardware acceleration abstraction for AI inference across the Cloud-Edge-IoT continuum.
We had offloading working locally (on the same physical host and OS) and started experimenting with our transport plugins to make sure everything worked smoothly before deploying it as part of a distributed Kubernetes setup. The first thing to try was localhost, and we expected communication to be lightning fast. Instead, we got… surprises.
The original experiment
How BERT works
The BERT model (Bidirectional Encoder Representations from Transformers) is a transformer-based architecture that maps input text to contextual embeddings. In this example, we’re using a distilled BERT checkpoint, traced via TorchScript (cnn_trace.pt), to classify short tweets into three categories:
- offensive-language
- hate-speech
- neither
Each line of input (a tweet) goes through the following stages:
- Tokenization: The tweet is split into word/subword tokens using a predefined vocabulary and tokenizer (e.g. WordPiece), and each token is mapped to an integer ID (see the sketch after this list).
- Embedding + Encoding: The token IDs are passed through BERT’s embedding layer and several transformer encoder blocks, generating context-aware representations of each token.
- Classification Head: For classification, we only use the embedding of the special [CLS] token (added at the start). This vector is fed into a small feed-forward layer that outputs logits for each class.
- Prediction: The class with the highest logit is selected as the prediction.
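To make the tokenization step concrete, here is a toy sketch of WordPiece-style, greedy longest-match-first tokenization. It is illustrative only: the tiny vocabulary, the naive whitespace pre-tokenization and the `wordpiece_ids` helper are made up for this example; the real classifier uses the full bert_cased_vocab.txt and handles casing, punctuation and special tokens like [CLS]/[SEP].

```cpp
// Toy WordPiece tokenizer: greedy longest-match-first over a given vocabulary.
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

std::vector<int> wordpiece_ids(const std::string &text,
                               const std::unordered_map<std::string, int> &vocab) {
    std::vector<int> ids;
    std::istringstream ss(text);
    std::string word;
    while (ss >> word) {                        // naive whitespace split
        size_t word_start_ids = ids.size();
        size_t start = 0;
        bool ok = true;
        while (start < word.size()) {
            size_t end = word.size();
            int id = -1;
            // Greedily take the longest piece known to the vocabulary;
            // continuation pieces are prefixed with "##".
            while (end > start) {
                std::string piece = word.substr(start, end - start);
                if (start > 0) piece = "##" + piece;
                auto it = vocab.find(piece);
                if (it != vocab.end()) { id = it->second; break; }
                --end;
            }
            if (id < 0) { ok = false; break; }
            ids.push_back(id);
            start = end;
        }
        if (!ok) {                              // unknown word -> single [UNK]
            ids.resize(word_start_ids);
            ids.push_back(vocab.at("[UNK]"));
        }
    }
    return ids;
}

int main() {
    // A tiny hand-made vocabulary, purely for demonstration.
    std::unordered_map<std::string, int> vocab = {
        {"[UNK]", 0}, {"this", 1}, {"tweet", 2}, {"is", 3},
        {"harm", 4},  {"##less", 5}};
    for (int id : wordpiece_ids("this tweet is harmless", vocab))
        std::cout << id << ' ';                 // prints: 1 2 3 4 5
    std::cout << '\n';
}
```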
The model is serialized with TorchScript so that it can be loaded and run from C++ or via runtime frameworks like vAccel. This avoids Python overhead and allows seamless execution across backends (CPU, CUDA, remote offload, etc.).
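For reference, loading and running a traced checkpoint from C++ with libtorch looks roughly like the sketch below. This is not the project’s classifier code: the sequence length, the presence of an attention-mask input, the label order, and the assumption that the traced module returns the logits tensor directly are all illustrative guesses.

```cpp
// Rough sketch of loading and running a TorchScript checkpoint with libtorch.
// Build against libtorch, e.g. with the CMake setup from the official
// "Loading a TorchScript Model in C++" tutorial.
#include <torch/script.h>

#include <iostream>
#include <vector>

int main() {
    // Load the traced module; torch::jit::load() throws c10::Error on failure.
    torch::jit::script::Module module = torch::jit::load("cnn_trace.pt");
    module.eval();

    // Dummy token IDs for one tweet, padded to an assumed sequence length of 64.
    auto input_ids = torch::zeros({1, 64}, torch::kLong);
    auto attention_mask = torch::ones({1, 64}, torch::kLong);
    std::vector<torch::jit::IValue> inputs{input_ids, attention_mask};

    // Forward pass; we assume the traced model returns the logits tensor directly.
    torch::NoGradGuard no_grad;
    at::Tensor logits = module.forward(inputs).toTensor();

    // The class with the highest logit is the prediction (label order assumed).
    const char *labels[] = {"hate-speech", "offensive-language", "neither"};
    std::cout << "Prediction: " << labels[logits.argmax(1).item<int64_t>()]
              << std::endl;
    return 0;
}
```

The real classifier additionally tokenizes and pads the input tweets and, as the logs below show, switches to the GPU when CUDA is available.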
If we run this on a subset of an example dataset, we see the following:
$ ./build-stock-cpu/classifier -m build-local/cnn_trace.pt -v bert_cased_vocab.txt -f build-local/tweets_100.txt
Processing 100 lines from: build-local/tweets_100.txt
== [Vocab Loaded] ==
[snipped]
Line 5: Duration: 141.214 ms Prediction: offensive-language
Line 6: Duration: 115.528 ms Prediction: offensive-language
Line 7: Duration: 117.649 ms Prediction: offensive-language
[snipped]
Line 10: Duration: 69.3163 ms Prediction: neither
[snipped]
Line 99: Duration: 117.06 ms Prediction: offensive-language
Line 100: Duration: 110.692 ms Prediction: hate-speech
Average (after 4rd iteration): 92.07 ms
CPU execution of such a model takes quite some time. If we enable GPU execution, we get something much better:
$ ./build-stock/classifier -m build-local/cnn_trace.pt -v bert_cased_vocab.txt -f build-local/tweets_100.txt
Processing 100 lines from: build-local/tweets_100.txt
== [Vocab Loaded] ==
== [Using GPU] ==
[snipped]
Line 5: Duration: 7.98571 ms Prediction: offensive-language
Line 6: Duration: 7.81541 ms Prediction: offensive-language
Line 7: Duration: 7.76802 ms Prediction: offensive-language
[snipped]
Line 10: Duration: 8.22414 ms Prediction: neither
[snipped]
Line 99: Duration: 7.76586 ms Prediction: offensive-language
Line 100: Duration: 7.8277 ms Prediction: hate-speech
Average (after 4rd iteration): 7.88 ms
How vAccel facilitates the execution
vAccel enables seamless interchange of hardware and transport plugins at runtime. So, given a port of this classifier that consumes the vAccel API, all we need to do is configure which plugin (CPU or GPU) vAccel should use at runtime.
$ export VACCEL_LOG_LEVEL=3
$ export VACCEL_PLUGINS=$HOME/vaccel-plugin-torch/build-cpu/src/libvaccel-torch.so
$ ./build-local/classifier -m build-local/cnn_trace.pt -v bert_cased_vocab.txt -f build-local/tweets_100.txt
2025.06.22-13:44:38.82 - <info> vAccel 0.7.0-7-e67e52b6
2025.06.22-13:44:38.82 - <info> Registered plugin torch 0.2.0-2-f80dd939-dirty
Processing 100 lines from: build-local/tweets_100.txt
== [Vocab Loaded] ==
2025.06.22-13:44:38.86 - <warn> Path does not seem to have a `<prefix>://`
2025.06.22-13:44:38.86 - <warn> Assuming build-local/cnn_trace.pt is a local path
Created new model resource 1
Initialized vAccel session 1
[snipped]
Line 5: Duration: 142.267 ms Prediction: offensive-language
Line 6: Duration: 116.333 ms Prediction: offensive-language
Line 7: Duration: 117.776 ms Prediction: offensive-language
[snipped]
Line 10: Duration: 69.8447 ms Prediction: neither
[snipped]
Line 99: Duration: 118.195 ms Prediction: offensive-language
Line 100: Duration: 111.499 ms Prediction: hate-speech
Average (after 4rd iteration): 92.41 ms
and the equivalent GPU execution by just tweaking an environment variable:
$ export VACCEL_LOG_LEVEL=3
$ export VACCEL_PLUGINS=$HOME/vaccel-plugin-torch/build-gpu/src/libvaccel-torch.so
$ ./build-local/classifier -m build-local/cnn_trace.pt -v bert_cased_vocab.txt -f build-local/tweets_100.txt
2025.06.22-13:50:09.38 - <info> vAccel 0.7.0-7-e67e52b6
2025.06.22-13:50:09.39 - <info> Registered plugin torch 0.2.0-2-f80dd939
Processing 100 lines from: build-local/tweets_100.txt
== [Vocab Loaded] ==
2025.06.22-13:50:09.43 - <warn> Path does not seem to have a `<prefix>://`
2025.06.22-13:50:09.43 - <warn> Assuming build-local/cnn_trace.pt is a local path
Created new model resource 1
Initialized vAccel session 1
CUDA is available, switching to GPU mode
[snipped]
Line 5: Duration: 8.38249 ms Prediction: offensive-language
Line 6: Duration: 8.20436 ms Prediction: offensive-language
Line 7: Duration: 8.13868 ms Prediction: offensive-language
[snipped]
Line 10: Duration: 8.22329 ms Prediction: neither
[snipped]
Line 99: Duration: 8.17031 ms Prediction: offensive-language
Line 100: Duration: 8.1666 ms Prediction: hate-speech
Average (after 4rd iteration): 8.27 ms
Summary of Results
| Configuration | First inference (cold start) | Average inference time [*] |
|---|---|---|
| Stock PyTorch (CPU) | 531.083 ms | 92.07 ms |
| vAccel (CPU) | 532.026 ms | 92.41 ms |
| Stock PyTorch (GPU) | 507.915 ms | 7.88 ms |
| vAccel (GPU) | 643.077 ms | 8.27 ms |

[*] Excludes the first 4 lines to avoid cold-start effects.
So far so good. The overhead is marginal, and comes from the library calls we make under the hood in vAccel and from the actual data copies when needed.
Remote execution
Given that we can run this remotely over vAccel, we first test the execution on a single node, for the sake of debugging.
First we spawn the agent:
$ export VACCEL_PLUGINS=$HOME/vaccel-plugin-torch/build-cpu/src/libvaccel-torch.so
$ vaccel-rpc-agent -a unix:///tmp/bert.sock
[2025-06-22T15:01:10Z INFO ttrpc::sync::server] server listen started
[2025-06-22T15:01:10Z INFO ttrpc::sync::server] server started
[2025-06-22T15:01:10Z INFO vaccel_rpc_agent] vAccel RPC agent started
[2025-06-22T15:01:10Z INFO vaccel_rpc_agent] Listening on 'unix:///tmp/bert.sock', press Ctrl+C to exit
Then we specify the RPC plugin and point it to where the agent listens:
$ export VACCEL_PLUGINS=libvaccel-rpc.so
$ export VACCEL_RPC_ADDRESS=unix:///tmp/bert.sock
$ ./build-local/classifier -m build-local/cnn_trace.pt -v bert_cased_vocab.txt -f build-local/tweets_100.txt
2025.06.22-15:02:36.25 - <info> vAccel 0.7.0-7-e67e52b6
2025.06.22-15:02:36.27 - <info> Registered plugin rpc 0.2.0-1-eca9e440
Processing 100 lines from: build-local/tweets_100.txt
== [Vocab Loaded] ==
2025.06.22-15:02:36.31 - <warn> Path does not seem to have a `<prefix>://`
2025.06.22-15:02:36.31 - <warn> Assuming build-local/cnn_trace.pt is a local path
Created new model resource 1
Initialized vAccel session 1
[snipped]
Line 5: Duration: 147.155 ms Prediction: offensive-language
Line 6: Duration: 115.631 ms Prediction: offensive-language
Line 7: Duration: 124.443 ms Prediction: offensive-language
[snipped]
Line 10: Duration: 68.7716 ms Prediction: neither
[snipped]
Line 99: Duration: 123.694 ms Prediction: offensive-language
Line 100: Duration: 112.011 ms Prediction: hate-speech
Average (after 4rd iteration): 92.29 ms
and the equivalent GPU execution:
Agent:
$ export VACCEL_PLUGINS=$HOME/vaccel-plugin-torch/build-gpu/src/libvaccel-torch.so
$ vaccel-rpc-agent -a unix:///tmp/bert.sock
[2025-06-22T15:27:16Z INFO ttrpc::sync::server] server listen started
[2025-06-22T15:27:16Z INFO ttrpc::sync::server] server started
[2025-06-22T15:27:16Z INFO vaccel_rpc_agent] vAccel RPC agent started
[2025-06-22T15:27:16Z INFO vaccel_rpc_agent] Listening on 'unix:///tmp/bert.sock', press Ctrl+C to exit
And the classifier:
$ ./build-local/classifier -m build-local/cnn_trace.pt -v bert_cased_vocab.txt -f build-local/tweets_100.txt
2025.06.22-15:27:28.47 - <info> vAccel 0.7.0-7-e67e52b6
2025.06.22-15:27:28.49 - <info> Registered plugin rpc 0.2.0-1-eca9e440
Processing 100 lines from: build-local/tweets_100.txt
== [Vocab Loaded] ==
2025.06.22-15:27:28.53 - <warn> Path does not seem to have a `<prefix>://`
2025.06.22-15:27:28.53 - <warn> Assuming build-local/cnn_trace.pt is a local path
Created new model resource 1
Initialized vAccel session 1
[snipped]
Line 5: Duration: 8.62251 ms Prediction: offensive-language
Line 6: Duration: 8.18224 ms Prediction: offensive-language
Line 7: Duration: 8.40555 ms Prediction: offensive-language
[snipped]
Line 10: Duration: 8.74234 ms Prediction: neither
[snipped]
Line 99: Duration: 8.26042 ms Prediction: offensive-language
Line 100: Duration: 8.38511 ms Prediction: hate-speech
Average (after 4rd iteration): 8.36 ms
We can see the overhead is negligible, almost identical to the local execution. This is expected.
When we do that over a TCP socket though, we see a completely different result.
Again, we spawn the agent:
$ vaccel-rpc-agent -a tcp://0.0.0.0:8192
[2025-06-22T15:06:45Z INFO ttrpc::sync::server] server listen started
[2025-06-22T15:06:45Z INFO ttrpc::sync::server] server started
[2025-06-22T15:06:45Z INFO vaccel_rpc_agent] vAccel RPC agent started
[2025-06-22T15:06:45Z INFO vaccel_rpc_agent] Listening on 'tcp://0.0.0.0:8192', press Ctrl+C to exit
and point the RPC plugin to the TCP endpoint:
$ export VACCEL_RPC_ADDRESS=tcp://localhost:8192
$ ./build-local/classifier -m build-local/cnn_trace.pt -v bert_cased_vocab.txt -f build-local/tweets_100.txt
2025.06.22-15:07:29.84 - <info> vAccel 0.7.0-7-e67e52b6
2025.06.22-15:07:29.90 - <info> Registered plugin rpc 0.2.0-1-eca9e440
Processing 100 lines from: build-local/tweets_100.txt
== [Vocab Loaded] ==
2025.06.22-15:07:29.95 - <warn> Path does not seem to have a `<prefix>://`
2025.06.22-15:07:29.95 - <warn> Assuming build-local/cnn_trace.pt is a local path
Created new model resource 1
Initialized vAccel session 1
[snipped]
Line 5: Duration: 187.425 ms Prediction: offensive-language
Line 6: Duration: 156.973 ms Prediction: offensive-language
Line 7: Duration: 164.083 ms Prediction: offensive-language
[snipped]
Line 10: Duration: 109.588 ms Prediction: neither
[snipped]
Line 99: Duration: 159.926 ms Prediction: offensive-language
Line 100: Duration: 149.851 ms Prediction: hate-speech
Average (after 4rd iteration): 131.94 ms
We see some overhead, which we could easily attribute to the network stack (~130 ms vs ~92 ms). Still a bit high, but one could mistake it for the cost of TCP/IP stack traversals. However, if we do the same using the GPU plugin, things get really weird!
Agent spawn:
$ export VACCEL_PLUGINS=$HOME/vaccel-plugin-torch/build-gpu/src/libvaccel-torch.so
$ vaccel-rpc-agent -a tcp://0.0.0.0:8192
[2025-06-22T15:09:56Z INFO ttrpc::sync::server] server listen started
[2025-06-22T15:09:56Z INFO ttrpc::sync::server] server started
[2025-06-22T15:09:56Z INFO vaccel_rpc_agent] vAccel RPC agent started
[2025-06-22T15:09:56Z INFO vaccel_rpc_agent] Listening on 'tcp://0.0.0.0:8192', press Ctrl+C to exit
$ export VACCEL_RPC_ADDRESS=tcp://localhost:8192
$ ./build-local/classifier -m build-local/cnn_trace.pt -v bert_cased_vocab.txt -f build-local/tweets_100.txt
2025.06.22-15:10:00.40 - <info> vAccel 0.7.0-7-e67e52b6
2025.06.22-15:10:00.42 - <info> Registered plugin rpc 0.2.0-1-eca9e440
Processing 100 lines from: build-local/tweets_100.txt
== [Vocab Loaded] ==
2025.06.22-15:10:00.46 - <warn> Path does not seem to have a `<prefix>://`
2025.06.22-15:10:00.46 - <warn> Assuming build-local/cnn_trace.pt is a local path
Created new model resource 1
Initialized vAccel session 1
[snipped]
Line 5: Duration: 1598.79 ms Prediction: offensive-language
Line 6: Duration: 50.5106 ms Prediction: offensive-language
Line 7: Duration: 1896.81 ms Prediction: offensive-language
[snipped]
Line 10: Duration: 90.4444 ms Prediction: neither
[snipped]
Line 99: Duration: 90.4848 ms Prediction: offensive-language
Line 100: Duration: 90.3126 ms Prediction: hate-speech
Average (after 4rd iteration): 124.18 ms
Updated Summary of Results
| Configuration | Average inference time [*] |
|---|---|
| Stock PyTorch (CPU) | 92.07 ms |
| vAccel (CPU) | 92.41 ms |
| Stock PyTorch (GPU) | 7.88 ms |
| vAccel (GPU) | 8.27 ms |
| vAccel remote UNIX (CPU) | 92.29 ms |
| vAccel remote UNIX (GPU) | 8.36 ms |
| vAccel remote TCP (CPU) | 131.94 ms |
| vAccel remote TCP (GPU) | 124.18 ms |
How is that even possible? 124 ms vs 8.36 ms for the GPU execution?
A Curious Latency Bump
When calling the BERT classifier from the client via ttrpc, we saw minimal overhead for UNIX sockets (<~5%) compared to native execution. This is expected, as the RPC data path in vAccel uses copies. When running over TCP sockets, however, we saw an almost 10x difference (~100 ms vs ~10 ms). This got us thinking about what could have gone wrong in the plugin… The code is identical; the only thing that differs is the kind of socket.

The classifier code itself was fast (inference in ~7-9 ms), and over UNIX sockets it was almost the same (~9-10 ms); but with TCP sockets we were getting ~100 ms.
Setting the Stage
We isolated the issue to the transport mechanism (ttrpc-rust), so we wrote a small microbenchmark: a ttrpc program exchanging empty protobuf messages in a tight loop. The goal was to measure raw round-trip latency: no ML, no I/O, just the transport.

You can find the benchmark here: ttrpc-rs-benchmark
Reproducing the Problem
Here’s how we tested:
$ git clone https://github.com/nubificus/ttrpc-rs-benchmark
$ cd ttrpc-rs-benchmark
$ cargo build --release
$ ./target/release/ttrpc-benchmark
Running ttrpc-rust latency benchmark with 1000 iterations...

Testing Unix sockets...
Unix Socket Results:
  Min: 58.029µs
  Average: 73.217µs
  Max: 887.728µs
  P99: 116.979µs

Testing TCP sockets...
TCP Socket Results:
  Min: 81.40514ms
  Average: 82.004085ms
  Max: 83.021974ms
  P99: 82.465309ms

Comparison:
  Unix sockets are 1120.01x faster than TCP
That’s even worse than what we saw with vAccel.
The Nagle Surprise
We generated flamegraphs for both paths, and one thing stood out: send() was stalling on the TCP path. Digging deeper, we found the culprit: Nagle’s algorithm.

Nagle tries to reduce small-packet overhead by coalescing small writes, but that is poison for latency-sensitive communication over TCP, especially when the protocol uses small messages. The classic failure mode is a write-write-read pattern (for example, a small header followed by a small payload): the second write is held back until the first one is acknowledged, and the peer’s delayed-ACK timer can postpone that acknowledgment by tens of milliseconds.
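To see the effect outside of ttrpc altogether, here is a small, self-contained C++ sketch (not the ttrpc-rs-benchmark code) that mimics a framed RPC over loopback TCP: each message leaves the application as a 4-byte length header in one write() followed by the payload in a second write(). The port number, payload size and iteration count are arbitrary choices for illustration.

```cpp
// Loopback TCP "framed ping-pong": header and payload are sent as two small
// writes, which is exactly the pattern Nagle + delayed ACK punishes.
// Build: g++ -O2 -pthread nagle_demo.cpp -o nagle_demo ; run with/without --nodelay.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <thread>

// Read or write exactly n bytes.
static void io_full(int fd, void *buf, size_t n, bool writing) {
    auto *p = static_cast<char *>(buf);
    while (n > 0) {
        ssize_t r = writing ? write(fd, p, n) : read(fd, p, n);
        if (r <= 0) { std::perror(writing ? "write" : "read"); _exit(1); }
        p += r;
        n -= static_cast<size_t>(r);
    }
}

// Framed send: length header and payload leave as two separate small writes.
static void send_msg(int fd, char *payload, uint32_t len) {
    io_full(fd, &len, sizeof(len), true);
    io_full(fd, payload, len, true);
}

static uint32_t recv_msg(int fd, char *payload) {
    uint32_t len = 0;  // host byte order; fine for a loopback-only demo
    io_full(fd, &len, sizeof(len), false);
    io_full(fd, payload, len, false);
    return len;
}

int main(int argc, char **argv) {
    const bool nodelay = argc > 1 && !std::strcmp(argv[1], "--nodelay");
    const int iters = 200;
    int flag = nodelay ? 1 : 0;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(9123);

    // Echo server: reads a framed message and echoes it back, also framed.
    std::thread server([&] {
        int ls = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt(ls, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        bind(ls, reinterpret_cast<sockaddr *>(&addr), sizeof(addr));
        listen(ls, 1);
        int c = accept(ls, nullptr, nullptr);
        setsockopt(c, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));
        char buf[64];
        for (int i = 0; i < iters; i++) send_msg(c, buf, recv_msg(c, buf));
        close(c);
        close(ls);
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(100));

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));
    connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr));

    char msg[16] = "ping", buf[64];
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++) {
        send_msg(fd, msg, sizeof(msg));
        recv_msg(fd, buf);
    }
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    std::printf("TCP_NODELAY=%d: %.1f us per round trip\n", flag,
                static_cast<double>(us) / iters);
    server.join();
    close(fd);
}
```

On a typical Linux box we would expect round trips in the tens of milliseconds without --nodelay (once the kernel leaves its initial quick-ACK phase) and in the tens of microseconds with it, which is the same shape as the ttrpc numbers above; exact figures depend on the kernel’s delayed-ACK timing.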
Disabling Nagle Without Touching Code
Unfortunately, ttrpc-rust does not expose a socket configuration option, and we preferred not to patch the library.

So we wrote a preloadable shared object, nodelay.so, that intercepts socket() and setsockopt() to enforce TCP_NODELAY.
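The core of such an interposer fits in a few dozen lines. The sketch below captures the idea, hooking socket() to set TCP_NODELAY on every new TCP socket and hooking setsockopt() so the option cannot be switched back off; the actual nodelay repository may differ in scope and details (for example, in how it treats accepted sockets), and the file name in the build comment is made up.

```cpp
// Sketch of an LD_PRELOAD interposer that forces TCP_NODELAY on TCP sockets.
// Build (assumed file name): g++ -shared -fPIC -O2 nodelay_sketch.cpp -o nodelay_sketch.so -ldl
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for RTLD_NEXT
#endif
#include <dlfcn.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#include <cstdio>

// `noexcept` matches glibc's __THROW on the original declarations.
extern "C" int socket(int domain, int type, int protocol) noexcept {
    // Resolve the real socket() once, on first use.
    static auto real_socket =
        reinterpret_cast<int (*)(int, int, int)>(dlsym(RTLD_NEXT, "socket"));
    int fd = real_socket(domain, type, protocol);
    if (fd >= 0 && (domain == AF_INET || domain == AF_INET6)) {
        int one = 1;
        // Ignore failures (the option is meaningless for non-TCP sockets).
        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) == 0)
            std::fprintf(stderr, "[hook] TCP_NODELAY enabled on socket %d\n", fd);
    }
    return fd;
}

extern "C" int setsockopt(int fd, int level, int optname, const void *optval,
                          socklen_t optlen) noexcept {
    static auto real_setsockopt =
        reinterpret_cast<int (*)(int, int, int, const void *, socklen_t)>(
            dlsym(RTLD_NEXT, "setsockopt"));
    // If the application ever tries to toggle TCP_NODELAY, force it to stay on.
    if (level == IPPROTO_TCP && optname == TCP_NODELAY) {
        int one = 1;
        return real_setsockopt(fd, level, optname, &one, sizeof(one));
    }
    return real_setsockopt(fd, level, optname, optval, optlen);
}
```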
You can get the real nodelay.so from here and build it like this:
git clone https://github.com/nubificus/nodelay
cd nodelay
make
and preload it like this:
LD_PRELOAD=./nodelay.so ./target/release/ttrpc-benchmark
This reduces the TCP latency and brings it on par with UNIX-socket latency.
$ LD_PRELOAD=../nodelay/nodelay.so ./target/release/ttrpc-benchmark
Running ttrpc-rust latency benchmark with 1000 iterations...

Testing Unix sockets...
Unix Socket Results:
  Min: 55.985µs
  Average: 69.387µs
  Max: 373.772µs
  P99: 101.441µs

Testing TCP sockets...
[hook] TCP_NODELAY enabled on socket 16
[hook] TCP_NODELAY enabled on socket 12
TCP Socket Results:
  Min: 81.323µs
  Average: 92.432µs
  Max: 420.38µs
  P99: 126.688µs

Comparison:
  Unix sockets are 1.33x faster than TCP
Running the original example using the nodelay.so hack
Keeping the same settings as in our last execution attempt, only now using nodelay.so, we see the following:
Agent spawn:
$ LD_PRELOAD=../nodelay/nodelay.so vaccel-rpc-agent -a tcp://0.0.0.0:8192
[hook] TCP_NODELAY enabled on socket 3
[2025-06-22T15:14:49Z INFO ttrpc::sync::server] server listen started
[2025-06-22T15:14:49Z INFO ttrpc::sync::server] server started
[2025-06-22T15:14:49Z INFO vaccel_rpc_agent] vAccel RPC agent started
[2025-06-22T15:14:49Z INFO vaccel_rpc_agent] Listening on 'tcp://0.0.0.0:8192', press Ctrl+C to exit
$ LD_PRELOAD=../nodelay/nodelay.so ./build-local/classifier -m build-local/cnn_trace.pt -v bert_cased_vocab.txt -f build-local/tweets_100.txt
2025.06.22-15:16:19.66 - <info> vAccel 0.7.0-7-e67e52b6
2025.06.22-15:16:19.68 - <info> Registered plugin rpc 0.2.0-1-eca9e440
Processing 100 lines from: build-local/tweets_100.txt
== [Vocab Loaded] ==
2025.06.22-15:16:19.72 - <warn> Path does not seem to have a `<prefix>://`
2025.06.22-15:16:19.72 - <warn> Assuming build-local/cnn_trace.pt is a local path
Created new model resource 1
[hook] TCP_NODELAY enabled on socket 3
Initialized vAccel session 1
[snipped]
Line 5: Duration: 8.61344 ms Prediction: offensive-language
Line 6: Duration: 8.4136 ms Prediction: offensive-language
Line 7: Duration: 8.69745 ms Prediction: offensive-language
[snipped]
Line 10: Duration: 8.75759 ms Prediction: neither
[snipped]
Line 99: Duration: 8.80859 ms Prediction: offensive-language
Line 100: Duration: 8.30694 ms Prediction: hate-speech
Average (after 4rd iteration): 8.50 ms
Final Summary of Results
| Configuration | Average inference time [*] |
|---|---|
| Stock PyTorch (CPU) | 92.07 ms |
| vAccel (CPU) | 92.41 ms |
| Stock PyTorch (GPU) | 7.88 ms |
| vAccel (GPU) | 8.27 ms |
| vAccel remote UNIX (CPU) | 92.29 ms |
| vAccel remote UNIX (GPU) | 8.36 ms |
| vAccel remote TCP (CPU) | 131.94 ms |
| vAccel remote TCP (GPU) | 124.18 ms |
| vAccel remote TCP_NODELAY (CPU) | 92.55 ms |
| vAccel remote TCP_NODELAY (GPU) | 8.50 ms |
Takeaways
- Even on localhost, TCP can be surprisingly slow if Nagle’s algorithm is enabled. UNIX sockets avoid these issues entirely, but come with deployment trade-offs.
- Flamegraphs are a powerful way to uncover unexpected bottlenecks.
- LD_PRELOAD is still a useful hack for tweaking behaviors without code changes.

For AI workloads that rely on tight client-server loops (like ML inference offloading), these small optimizations matter.
Resources
Appendix I – Flamegraphs
To produce a flamegraph follow these steps:
sudo apt install linux-tools-common linux-tools-$(uname -r) linux-tools-generic
git clone https://github.com/brendangregg/Flamegraph
cd Flamegraph
sudo perf record -F 99 -g -- ./my-program --arg1 myarg1 etc..
sudo perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > flamegraph.svg
Appendix II – Hardware Testbed
Hardware Testbed Summary
| Component | Specification |
|---|---|
| OS | Ubuntu 24.04.2 LTS (noble) |
| CPU | AMD Ryzen 5 2600, 6 cores / 12 threads @ 3.4 GHz |
| RAM | 64 GB DDR4 |
| GPU | NVIDIA GeForce RTX 2060 SUPER (8 GB GDDR6) |
| CUDA Toolkit | 12.0 (nvcc 12.0.140) |
| CUDA Driver | 12.8 (Driver Version: 570.133.20) |
| GPU Usage | vaccel-rpc-agent (228 MiB GPU memory used during tests) |
Description
All benchmarks were executed on a modern workstation equipped with a 6-core AMD Ryzen CPU and 64 GB of system memory. For GPU-accelerated runs, the machine used an NVIDIA RTX 2060 SUPER with 8 GB of VRAM, running CUDA 12.8. The vAccel RPC agent utilized a small portion of GPU memory during execution. Tests were conducted under Ubuntu 24.04.2 with minimal background processes to ensure measurement consistency.
Appendix III – ttrpc-rust: Tiny Transport RPC in Rust
As part of the latency benchmarking setup, we used ttrpc-rust, a minimalist transport abstraction for low-latency RPC-style communication. ttrpc-rust is the Rust version of ttrpc, which is GRPC for low-memory environments.
By default, ttrpc-rust supports:
- Unix domain sockets (fast, local IPC)
- AF_VSOCK sockets
For our experiments, we extended it in a downstream fork to also support TCP sockets. In addition, we added an environment variable to control the TCP_NODELAY feature, so that we don’t have to do the nodelay.so hack. Use it with the following variable set:
export TTRPC_TCP_NODELAY_ENABLED=1