That’s what we thought when setting up a BERT-based hate speech classifier. This was part of a broader experiment using vAccel, our hardware acceleration abstraction for AI inference across the Cloud-Edge-IoT continuum.

We had offloading working locally (on the same physical host and OS), and started experimenting with our transport plugins to make sure everything worked smoothly before deploying it as part of a distributed Kubernetes setup. The first thing to try was localhost, and we expected communication to be lightning fast. Instead, we got… surprises.

The original experiment Link to heading

How BERT works Link to heading

The BERT model (Bidirectional Encoder Representations from Transformers) is a transformer-based architecture that maps input text to contextual embeddings. In this example, we’re using a distilled BERT checkpoint, traced via TorchScript (cnn_trace.pt), to classify short tweets into three categories:

  • offensive-language
  • hate-speech
  • neither

Each line of input (a tweet) goes through the following stages:

  • Tokenization: The tweet is split into word/subword tokens using a predefined vocabulary and tokenizer (e.g. WordPiece). Each token is mapped to an integer ID.

  • Embedding + Encoding: These token IDs are passed through BERT’s embedding layer and several transformer encoder blocks, generating context-aware representations of each token.

  • Classification Head: For classification, we only use the embedding of the special [CLS] token (added at the start). This vector is fed into a small feed-forward layer that outputs logits for each class.

  • Prediction: The class with the highest logit is selected as the prediction.

The model is serialized with TorchScript so that it can be loaded and run from C++ or via runtime frameworks like vAccel. This avoids Python overhead and allows seamless execution across backends (CPU, CUDA, remote offload, etc.).
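
To make the flow concrete, below is a minimal libtorch sketch of the C++ side: it loads the TorchScript trace, runs a forward pass over a batch of token IDs, and takes the arg-max of the logits. The hard-coded token IDs and the label ordering are illustrative assumptions; the real classifier builds the IDs from the WordPiece tokenizer and the vocabulary file shown in the runs below.

// classify_sketch.cpp -- illustrative libtorch usage, not the actual classifier code.
// Build against libtorch (e.g. with CMake, as described in the libtorch docs).
#include <torch/script.h>
#include <iostream>
#include <vector>

int main() {
    // Load the TorchScript trace produced from the distilled BERT checkpoint.
    torch::jit::script::Module model = torch::jit::load("cnn_trace.pt");
    model.eval();

    // Token IDs for one tweet; hard-coded here for illustration. In the real
    // classifier these come from the WordPiece tokenizer and vocabulary file.
    torch::Tensor input =
        torch::tensor({101, 7592, 2088, 102}, torch::kLong).unsqueeze(0); // [1, seq_len]

    // Forward pass; the traced model is assumed to return per-class logits.
    torch::NoGradGuard no_grad;
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(input);
    torch::Tensor logits = model.forward(inputs).toTensor(); // shape [1, 3]

    // The class with the highest logit is the prediction. The label order here
    // is an assumption; the real mapping comes from the training setup.
    const char* labels[] = {"hate-speech", "offensive-language", "neither"};
    int64_t pred = logits.argmax(1).item<int64_t>();
    std::cout << "Prediction: " << labels[pred] << std::endl;
    return 0;
}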

So if we run this on a subset of an example dataset, we see the following:

$ ./build-stock-cpu/classifier -m build-local/cnn_trace.pt -v bert_cased_vocab.txt -f build-local/tweets_100.txt
Processing 100 lines from: build-local/tweets_100.txt
== [Vocab Loaded] ==
[snipped]
Line 5: Duration: 141.214 ms Prediction: offensive-language
Line 6: Duration: 115.528 ms Prediction: offensive-language
Line 7: Duration: 117.649 ms Prediction: offensive-language
[snipped]
Line 10: Duration: 69.3163 ms Prediction: neither
[snipped]
Line 99: Duration: 117.06 ms Prediction: offensive-language
Line 100: Duration: 110.692 ms Prediction: hate-speech
Average (after 4rd iteration): 92.07 ms

CPU execution of such models takes quite some time. If we enable GPU execution, we get something considerably better:

$ ./build-stock/classifier -m build-local/cnn_trace.pt -v bert_cased_vocab.txt -f build-local/tweets_100.txt
Processing 100 lines from: build-local/tweets_100.txt
== [Vocab Loaded] ==
== [Using GPU] ==
[snipped]
Line 5: Duration: 7.98571 ms Prediction: offensive-language
Line 6: Duration: 7.81541 ms Prediction: offensive-language
Line 7: Duration: 7.76802 ms Prediction: offensive-language
[snipped]
Line 10: Duration: 8.22414 ms Prediction: neither
[snipped]
Line 99: Duration: 7.76586 ms Prediction: offensive-language
Line 100: Duration: 7.8277 ms Prediction: hate-speech
Average (after 4rd iteration): 7.88 ms

How vAccel facilitates the execution Link to heading

vAccel enables seamless interchange of hardware and transport plugins at runtime. So, given a port of this classifier that consumes the vAccel API, all we need to do is configure vAccel to use the CPU or the GPU plugin at runtime.

$ export VACCEL_LOG_LEVEL=3
$ export VACCEL_PLUGINS=$HOME/vaccel-plugin-torch/build-cpu/src/libvaccel-torch.so
$ ./build-local/classifier -m build-local/cnn_trace.pt -v bert_cased_vocab.txt -f build-local/tweets_100.txt
2025.06.22-13:44:38.82 - <info> vAccel 0.7.0-7-e67e52b6
2025.06.22-13:44:38.82 - <info> Registered plugin torch 0.2.0-2-f80dd939-dirty
Processing 100 lines from: build-local/tweets_100.txt
== [Vocab Loaded] ==
2025.06.22-13:44:38.86 - <warn> Path does not seem to have a `<prefix>://`
2025.06.22-13:44:38.86 - <warn> Assuming build-local/cnn_trace.pt is a local path
Created new model resource 1
Initialized vAccel session 1
[snipped]
Line 5: Duration: 142.267 ms Prediction: offensive-language
Line 6: Duration: 116.333 ms Prediction: offensive-language
Line 7: Duration: 117.776 ms Prediction: offensive-language
[snipped]
Line 10: Duration: 69.8447 ms Prediction: neither
[snipped]
Line 99: Duration: 118.195 ms Prediction: offensive-language
Line 100: Duration: 111.499 ms Prediction: hate-speech
Average (after 4rd iteration): 92.41 ms

and the equivalent GPU execution by just tweaking an environment variable:

$ export VACCEL_LOG_LEVEL=3
$ export VACCEL_PLUGINS=$HOME/vaccel-plugin-torch/build-gpu/src/libvaccel-torch.so
$ ./build-local/classifier -m build-local/cnn_trace.pt -v bert_cased_vocab.txt -f build-local/tweets_100.txt
2025.06.22-13:50:09.38 - <info> vAccel 0.7.0-7-e67e52b6
2025.06.22-13:50:09.39 - <info> Registered plugin torch 0.2.0-2-f80dd939
Processing 100 lines from: build-local/tweets_100.txt
== [Vocab Loaded] ==
2025.06.22-13:50:09.43 - <warn> Path does not seem to have a `<prefix>://`
2025.06.22-13:50:09.43 - <warn> Assuming build-local/cnn_trace.pt is a local path
Created new model resource 1
Initialized vAccel session 1
CUDA is available, switching to GPU mode
[snipped]
Line 5: Duration: 8.38249 ms Prediction: offensive-language
Line 6: Duration: 8.20436 ms Prediction: offensive-language
Line 7: Duration: 8.13868 ms Prediction: offensive-language
[snipped]
Line 10: Duration: 8.22329 ms Prediction: neither
[snipped]
Line 99: Duration: 8.17031 ms Prediction: offensive-language
Line 100: Duration: 8.1666 ms Prediction: hate-speech
Average (after 4rd iteration): 8.27 ms

Table Summary of Results: Link to heading

| Configuration | First inference (cold start) | Average inference time [*] |
|---|---|---|
| Stock PyTorch (CPU) | 531.083 ms | 92.07 ms |
| vAccel (CPU) | 532.026 ms | 92.41 ms |
| Stock PyTorch (GPU) | 507.915 ms | 7.88 ms |
| vAccel (GPU) | 643.077 ms | 8.27 ms |

[*] Excludes the first 4 lines to avoid cold-start effects.

So far so good. The overhead is small and expected, attributable to the library calls we make under the hood in vAccel and to the copying of data when needed.

Remote execution Link to heading

Since we can run this remotely over vAccel, we first test the execution on a single node, for the sake of debugging.

First we spawn the agent:

$ export VACCEL_PLUGINS=$HOME/vaccel-plugin-torch/build-cpu/src/libvaccel-torch.so
$ vaccel-rpc-agent -a unix:///tmp/bert.sock
[2025-06-22T15:01:10Z INFO  ttrpc::sync::server] server listen started
[2025-06-22T15:01:10Z INFO  ttrpc::sync::server] server started
[2025-06-22T15:01:10Z INFO  vaccel_rpc_agent] vAccel RPC agent started
[2025-06-22T15:01:10Z INFO  vaccel_rpc_agent] Listening on 'unix:///tmp/bert.sock', press Ctrl+C to exit

Then we specify the RPC plugin and point it to where the agent listens:

$ export VACCEL_PLUGINS=libvaccel-rpc.so
$ export VACCEL_RPC_ADDRESS=unix:///tmp/bert.sock
$ ./build-local/classifier -m build-local/cnn_trace.pt -v bert_cased_vocab.txt -f build-local/tweets_100.txt
2025.06.22-15:02:36.25 - <info> vAccel 0.7.0-7-e67e52b6
2025.06.22-15:02:36.27 - <info> Registered plugin rpc 0.2.0-1-eca9e440
Processing 100 lines from: build-local/tweets_100.txt
== [Vocab Loaded] ==
2025.06.22-15:02:36.31 - <warn> Path does not seem to have a `<prefix>://`
2025.06.22-15:02:36.31 - <warn> Assuming build-local/cnn_trace.pt is a local path
Created new model resource 1
Initialized vAccel session 1
[snipped]
Line 5: Duration: 147.155 ms Prediction: offensive-language
Line 6: Duration: 115.631 ms Prediction: offensive-language
Line 7: Duration: 124.443 ms Prediction: offensive-language
[snipped]
Line 10: Duration: 68.7716 ms Prediction: neither
[snipped]
Line 99: Duration: 123.694 ms Prediction: offensive-language
Line 100: Duration: 112.011 ms Prediction: hate-speech
Average (after 4rd iteration): 92.29 ms

and the equivalent GPU execution:

Agent:

$ export VACCEL_PLUGINS=$HOME/vaccel-plugin-torch/build-gpu/src/libvaccel-torch.so
$ vaccel-rpc-agent -a unix:///tmp/bert.sock
[2025-06-22T15:27:16Z INFO  ttrpc::sync::server] server listen started
[2025-06-22T15:27:16Z INFO  ttrpc::sync::server] server started
[2025-06-22T15:27:16Z INFO  vaccel_rpc_agent] vAccel RPC agent started
[2025-06-22T15:27:16Z INFO  vaccel_rpc_agent] Listening on 'unix:///tmp/bert.sock', press Ctrl+C to exit

And the classifier:

$ ./build-local/classifier -m build-local/cnn_trace.pt -v bert_cased_vocab.txt -f build-local/tweets_100.txt
2025.06.22-15:27:28.47 - <info> vAccel 0.7.0-7-e67e52b6
2025.06.22-15:27:28.49 - <info> Registered plugin rpc 0.2.0-1-eca9e440
Processing 100 lines from: build-local/tweets_100.txt
== [Vocab Loaded] ==
2025.06.22-15:27:28.53 - <warn> Path does not seem to have a `<prefix>://`
2025.06.22-15:27:28.53 - <warn> Assuming build-local/cnn_trace.pt is a local path
Created new model resource 1
Initialized vAccel session 1
[snipped]
Line 5: Duration: 8.62251 ms Prediction: offensive-language
Line 6: Duration: 8.18224 ms Prediction: offensive-language
Line 7: Duration: 8.40555 ms Prediction: offensive-language
[snipped]
Line 10: Duration: 8.74234 ms Prediction: neither
[snipped]
Line 99: Duration: 8.26042 ms Prediction: offensive-language
Line 100: Duration: 8.38511 ms Prediction: hate-speech
Average (after 4rd iteration): 8.36 ms

We can see the overhead is negligible, almost identical to the local execution. This is expected.

When we do that over a TCP socket though, we see a completely different result.

Again, we spawn the agent:

$ vaccel-rpc-agent -a tcp://0.0.0.0:8192
[2025-06-22T15:06:45Z INFO  ttrpc::sync::server] server listen started
[2025-06-22T15:06:45Z INFO  ttrpc::sync::server] server started
[2025-06-22T15:06:45Z INFO  vaccel_rpc_agent] vAccel RPC agent started
[2025-06-22T15:06:45Z INFO  vaccel_rpc_agent] Listening on 'tcp://0.0.0.0:8192', press Ctrl+C to exit

and point the RPC plugin to the TCP endpoint:

$ export VACCEL_RPC_ADDRESS=tcp://localhost:8192
$ ./build-local/classifier -m build-local/cnn_trace.pt -v bert_cased_vocab.txt -f build-local/tweets_100.txt
2025.06.22-15:07:29.84 - <info> vAccel 0.7.0-7-e67e52b6
2025.06.22-15:07:29.90 - <info> Registered plugin rpc 0.2.0-1-eca9e440
Processing 100 lines from: build-local/tweets_100.txt
== [Vocab Loaded] ==
2025.06.22-15:07:29.95 - <warn> Path does not seem to have a `<prefix>://`
2025.06.22-15:07:29.95 - <warn> Assuming build-local/cnn_trace.pt is a local path
Created new model resource 1
Initialized vAccel session 1
[snipped]
Line 5: Duration: 187.425 ms Prediction: offensive-language
Line 6: Duration: 156.973 ms Prediction: offensive-language
Line 7: Duration: 164.083 ms Prediction: offensive-language
[snipped]
Line 10: Duration: 109.588 ms Prediction: neither
[snipped]
Line 99: Duration: 159.926 ms Prediction: offensive-language
Line 100: Duration: 149.851 ms Prediction: hate-speech
Average (after 4rd iteration): 131.94 ms

We see some overhead, which we could easily attribute to the network stack (~130ms vs ~90ms). It is still a bit high, but one could mistake it for the cost of TCP/IP stack traversals. However, if we do the same using the GPU plugin, things get really weird!

Agent spawn:

$ export VACCEL_PLUGINS=$HOME/vaccel-plugin-torch/build-gpu/src/libvaccel-torch.so
$ vaccel-rpc-agent -a tcp://0.0.0.0:8192
[2025-06-22T15:09:56Z INFO  ttrpc::sync::server] server listen started
[2025-06-22T15:09:56Z INFO  ttrpc::sync::server] server started
[2025-06-22T15:09:56Z INFO  vaccel_rpc_agent] vAccel RPC agent started
[2025-06-22T15:09:56Z INFO  vaccel_rpc_agent] Listening on 'tcp://0.0.0.0:8192', press Ctrl+C to exit

And the classifier:

$ export VACCEL_RPC_ADDRESS=tcp://localhost:8192
$ ./build-local/classifier -m build-local/cnn_trace.pt -v bert_cased_vocab.txt -f build-local/tweets_100.txt
2025.06.22-15:10:00.40 - <info> vAccel 0.7.0-7-e67e52b6
2025.06.22-15:10:00.42 - <info> Registered plugin rpc 0.2.0-1-eca9e440
Processing 100 lines from: build-local/tweets_100.txt
== [Vocab Loaded] ==
2025.06.22-15:10:00.46 - <warn> Path does not seem to have a `<prefix>://`
2025.06.22-15:10:00.46 - <warn> Assuming build-local/cnn_trace.pt is a local path
Created new model resource 1
Initialized vAccel session 1
[snipped]
Line 5: Duration: 1598.79 ms Prediction: offensive-language
Line 6: Duration: 50.5106 ms Prediction: offensive-language
Line 7: Duration: 1896.81 ms Prediction: offensive-language
[snipped]
Line 10: Duration: 90.4444 ms Prediction: neither
[snipped]
Line 99: Duration: 90.4848 ms Prediction: offensive-language
Line 100: Duration: 90.3126 ms Prediction: hate-speech
Average (after 4rd iteration): 124.18 ms

Updated Table Summary of Results: Link to heading

| Configuration | Average inference time [*] |
|---|---|
| Stock PyTorch (CPU) | 92.07 ms |
| vAccel (CPU) | 92.41 ms |
| Stock PyTorch (GPU) | 7.88 ms |
| vAccel (GPU) | 8.27 ms |
| vAccel remote UNIX (CPU) | 92.29 ms |
| vAccel remote UNIX (GPU) | 8.36 ms |
| vAccel remote TCP (CPU) | 131.94 ms |
| vAccel remote TCP (GPU) | 124.18 ms |

How could that even be possible? 124 ms vs 8.36 ms for GPU execution?

A Curious Latency Bump Link to heading

When calling the BERT classifier from the client via ttrpc, we saw minimal overhead for UNIX sockets (<~5%) compared to native execution. This is expected, as the data path for RPC in vAccel involves copies. When running over TCP sockets, however, we saw an almost 10x difference (~100ms vs ~10ms). This got us thinking about what could have gone wrong in the plugin… The code is identical; the only thing that differs is the kind of socket…

The classifier code itself was fast (inference ~7-9ms), and over UNIX sockets it was almost the same (~9-10ms); but over TCP sockets we were getting ~100ms.

Setting the Stage Link to heading

We isolated the issue to the transport mechanism (ttrpc-rust), so we wrote a small microbenchmark: a ttrpc program exchanging empty protobuf messages in a tight loop.

The goal was to measure raw round-trip latency—no ML, no I/O, just the transport.

You can find the benchmark here: ttrpc-rs-benchmark
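
For intuition about what such a benchmark stresses, here is a hypothetical plain-socket analogue in C++ (not the actual ttrpc-rs-benchmark, which is written in Rust): a localhost echo server and a client timing round trips, where each message is sent as two small write() calls, a header followed by a payload, much like RPC framing does. The port number and message layout are made up for the sketch.

// rtt_probe.cpp -- hypothetical localhost TCP round-trip probe, illustrative only.
// Each message is sent as two small write() calls (a 1-byte "header" and a
// 1-byte "payload"), mirroring the framing an RPC library typically performs.
// Build: g++ -O2 -std=c++17 rtt_probe.cpp -o rtt_probe -pthread
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

// Read exactly n bytes (read() may return short counts).
static bool read_full(int fd, char* buf, size_t n) {
    while (n > 0) {
        ssize_t r = read(fd, buf, n);
        if (r <= 0) return false;
        buf += r;
        n -= static_cast<size_t>(r);
    }
    return true;
}

int main() {
    const int iters = 1000;
    const uint16_t port = 18192;  // arbitrary test port

    // Minimal echo server on localhost.
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    ::bind(lfd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(lfd, 1);

    std::thread server([lfd] {
        int cfd = accept(lfd, nullptr, nullptr);
        char msg[2];
        while (read_full(cfd, msg, 2)) {
            // Reply with the same two-write framing pattern.
            write(cfd, &msg[0], 1);
            write(cfd, &msg[1], 1);
        }
        close(cfd);
    });

    // Client: time round trips of header+payload messages.
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    ::connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));

    char hdr = 'h', payload = 'p', reply[2];
    double total_us = 0.0;
    for (int i = 0; i < iters; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        write(fd, &hdr, 1);      // small write #1
        write(fd, &payload, 1);  // small write #2
        read_full(fd, reply, 2);
        auto t1 = std::chrono::steady_clock::now();
        total_us += std::chrono::duration<double, std::micro>(t1 - t0).count();
    }
    std::printf("TCP average round trip: %.1f us over %d iterations\n",
                total_us / iters, iters);

    close(fd);
    close(lfd);
    server.join();
    return 0;
}

With default socket options, this write-write-read pattern is exactly the kind of traffic that can stall for tens of milliseconds per round trip, even on localhost; the next sections show why.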

Reproducing the Problem Link to heading

Here’s how we tested:

$ git clone https://github.com/nubificus/ttrpc-rs-benchmark
$ cd ttrpc-rs-benchmark
$ cargo build --release
$ ./target/release/ttrpc-benchmark
Running ttrpc-rust latency benchmark with 1000 iterations...

Testing Unix sockets...
Unix Socket Results:
  Min:     58.029µs
  Average: 73.217µs
  Max:     887.728µs
  P99:     116.979µs

Testing TCP sockets...
TCP Socket Results:
  Min:     81.40514ms
  Average: 82.004085ms
  Max:     83.021974ms
  P99:     82.465309ms

Comparison:
  Unix sockets are 1120.01x faster than TCP

That’s even worse than what we’ve seen with vAccel.

The Nagle Surprise Link to heading

We generated flamegraphs for both paths, and one thing stood out: send() was stalling on the TCP path.

Digging deeper, we figured out the culprit: Nagle’s algorithm.

Nagle tries to reduce small-packet overhead by coalescing small writes, but that’s poison for latency-sensitive communication over TCP, especially when the protocol exchanges small messages. In particular, when a sender issues two small writes back to back (say, an RPC header followed by its payload), Nagle holds the second write until the first one is acknowledged, and the receiver’s delayed ACK can hold that acknowledgment back for tens of milliseconds. That is exactly the kind of stall we saw in send().
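
For reference, turning Nagle off on a socket you control is a single setsockopt() call; the sketch below assumes an already-created TCP socket fd:

#include <netinet/in.h>   // IPPROTO_TCP
#include <netinet/tcp.h>  // TCP_NODELAY
#include <sys/socket.h>   // setsockopt()

// Disable Nagle's algorithm on a TCP socket `fd`, so small writes are sent
// immediately instead of being coalesced. Returns 0 on success, -1 on error.
static int disable_nagle(int fd) {
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}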

Disabling Nagle — Without Touching Code Link to heading

Unfortunately, ttrpc-rust does not expose a socket configuration option, and we preferred not to patch the library.

So we wrote a preloadable shared object, nodelay.so, that intercepts socket() and setsockopt() to enforce TCP_NODELAY.
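
The real nodelay.so is linked below; as a rough sketch of the idea (simplified, and not necessarily identical to the actual implementation), an interposer that hooks socket() can look like this:

// nodelay_sketch.cpp -- simplified illustration of the LD_PRELOAD approach
// (the real nodelay.so also intercepts setsockopt() and may differ in detail).
// Build: g++ -shared -fPIC -O2 nodelay_sketch.cpp -o nodelay_sketch.so -ldl
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for RTLD_NEXT
#endif
#include <dlfcn.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

// `noexcept` matches glibc's declaration of socket() in C++ builds.
extern "C" int socket(int domain, int type, int protocol) noexcept {
    // Look up the real socket() the first time we are called.
    using socket_fn = int (*)(int, int, int);
    static socket_fn real_socket =
        reinterpret_cast<socket_fn>(dlsym(RTLD_NEXT, "socket"));

    int fd = real_socket(domain, type, protocol);

    // Try to force TCP_NODELAY on new IPv4/IPv6 sockets; the setsockopt()
    // only succeeds (and logs) for TCP, and UNIX/VSOCK sockets are untouched.
    if (fd >= 0 && (domain == AF_INET || domain == AF_INET6)) {
        int one = 1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) == 0)
            fprintf(stderr, "[hook] TCP_NODELAY enabled on socket %d\n", fd);
    }
    return fd;
}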

You can get it from here and build it like this:

git clone https://github.com/nubificus/nodelay
cd nodelay
make

and preload it like this:

LD_PRELOAD=./nodelay.so ./target/release/ttrpc-benchmark

This reduces the TCP latency and brings it on par with UNIX-socket latency.

$ LD_PRELOAD=../nodelay/nodelay.so ./target/release/ttrpc-benchmark
Running ttrpc-rust latency benchmark with 1000 iterations...

Testing Unix sockets...
Unix Socket Results:
  Min:     55.985µs
  Average: 69.387µs
  Max:     373.772µs
  P99:     101.441µs

Testing TCP sockets...
[hook] TCP_NODELAY enabled on socket 16
[hook] TCP_NODELAY enabled on socket 12
TCP Socket Results:
  Min:     81.323µs
  Average: 92.432µs
  Max:     420.38µs
  P99:     126.688µs

Comparison:
  Unix sockets are 1.33x faster than TCP

Running the original example using the nodelay.so hack Link to heading

Keeping the same settings as in our last execution attempt, only now using nodelay.so, we see the following:

Agent spawn:

$ LD_PRELOAD=../nodelay/nodelay.so vaccel-rpc-agent -a tcp://0.0.0.0:8192
[hook] TCP_NODELAY enabled on socket 3
[2025-06-22T15:14:49Z INFO  ttrpc::sync::server] server listen started
[2025-06-22T15:14:49Z INFO  ttrpc::sync::server] server started
[2025-06-22T15:14:49Z INFO  vaccel_rpc_agent] vAccel RPC agent started
[2025-06-22T15:14:49Z INFO  vaccel_rpc_agent] Listening on 'tcp://0.0.0.0:8192', press Ctrl+C to exit

And the classifier:

$ LD_PRELOAD=../nodelay/nodelay.so ./build-local/classifier -m build-local/cnn_trace.pt -v bert_cased_vocab.txt -f build-local/tweets_100.txt
2025.06.22-15:16:19.66 - <info> vAccel 0.7.0-7-e67e52b6
2025.06.22-15:16:19.68 - <info> Registered plugin rpc 0.2.0-1-eca9e440
Processing 100 lines from: build-local/tweets_100.txt
== [Vocab Loaded] ==
2025.06.22-15:16:19.72 - <warn> Path does not seem to have a `<prefix>://`
2025.06.22-15:16:19.72 - <warn> Assuming build-local/cnn_trace.pt is a local path
Created new model resource 1
[hook] TCP_NODELAY enabled on socket 3
Initialized vAccel session 1
[snipped]
Line 5: Duration: 8.61344 ms Prediction: offensive-language
Line 6: Duration: 8.4136 ms Prediction: offensive-language
Line 7: Duration: 8.69745 ms Prediction: offensive-language
[snipped]
Line 10: Duration: 8.75759 ms Prediction: neither
[snipped]
Line 99: Duration: 8.80859 ms Prediction: offensive-language
Line 100: Duration: 8.30694 ms Prediction: hate-speech
Average (after 4rd iteration): 8.50 ms

Final Table Summary of Results: Link to heading

| Configuration | Average inference time [*] |
|---|---|
| Stock PyTorch (CPU) | 92.07 ms |
| vAccel (CPU) | 92.41 ms |
| Stock PyTorch (GPU) | 7.88 ms |
| vAccel (GPU) | 8.27 ms |
| vAccel remote UNIX (CPU) | 92.29 ms |
| vAccel remote UNIX (GPU) | 8.36 ms |
| vAccel remote TCP (CPU) | 131.94 ms |
| vAccel remote TCP (GPU) | 124.18 ms |
| vAccel remote TCP_NODELAY (CPU) | 92.55 ms |
| vAccel remote TCP_NODELAY (GPU) | 8.50 ms |

Takeaways Link to heading

  • Even on localhost, TCP can be surprisingly slow if Nagle’s algorithm is enabled.
  • UNIX sockets avoid these issues entirely, but come with deployment trade-offs.
  • Flamegraphs are a powerful way to uncover unexpected bottlenecks.
  • LD_PRELOAD is still a useful hack for tweaking behaviors without code changes.

For AI workloads that rely on tight client-server loops (like ML inference offloading), these small optimizations matter.

Resources Link to heading

Appendix I – Flamegraphs Link to heading

To produce a flamegraph, follow these steps:

sudo apt install linux-tools-common linux-tools-$(uname -r) linux-tools-generic
git clone https://github.com/brendangregg/Flamegraph
cd Flamegraph
sudo perf record -F 99 -g -- ./my-program --arg1 myarg1 etc..
sudo perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > flamegraph.svg

Appendix II – Hardware Testbed Link to heading

Hardware Testbed Summary Link to heading

| Component | Specification |
|---|---|
| OS | Ubuntu 24.04.2 LTS (noble) |
| CPU | AMD Ryzen 5 2600, 6 cores / 12 threads @ 3.4 GHz |
| RAM | 64 GB DDR4 |
| GPU | NVIDIA GeForce RTX 2060 SUPER (8 GB GDDR6) |
| CUDA Toolkit | 12.0 (nvcc 12.0.140) |
| CUDA Driver | 12.8 (Driver Version: 570.133.20) |
| GPU Usage | vaccel-rpc-agent (228 MiB GPU memory used during tests) |

Description Link to heading

All benchmarks were executed on a modern workstation equipped with a 6-core AMD Ryzen CPU and 64 GB of system memory. For GPU-accelerated runs, the machine used an NVIDIA RTX 2060 SUPER with 8 GB of VRAM, running CUDA 12.8. The vAccel RPC agent utilized a small portion of GPU memory during execution. Tests were conducted under Ubuntu 24.04.2 with minimal background processes to ensure measurement consistency.

Appendix III – ttrpc-rust – Tiny Transport RPC in Rust Link to heading

As part of the latency benchmarking setup, we used ttrpc-rust, a minimalist transport abstraction for low-latency RPC-style communication. ttrpc-rust is the Rust version of ttrpc, which is gRPC for low-memory environments.

By default, ttrpc-rust supports:

  • Unix domain sockets (fast, local IPC)
  • AF_VSOCK sockets

For our experiments, we extended it in a downstream fork to also support TCP sockets. In addition, we added an environment variable to control TCP_NODELAY, so that we don’t have to rely on the nodelay.so hack (a conceptual sketch follows the snippet below). Use it with the following variable set:

export TTRPC_TCP_NODELAY_ENABLED=1
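
The fork itself is in Rust, but conceptually the change boils down to checking that variable when a TCP connection is set up and applying TCP_NODELAY, roughly like this hypothetical C++ rendering (names and placement in the actual fork differ):

#include <cstdlib>
#include <cstring>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Apply TCP_NODELAY to a freshly connected/accepted TCP socket `fd` when
// TTRPC_TCP_NODELAY_ENABLED=1 is set in the environment (illustrative only;
// the real change lives in the Rust fork of ttrpc-rust).
static void maybe_enable_nodelay(int fd) {
    const char* v = std::getenv("TTRPC_TCP_NODELAY_ENABLED");
    if (v != nullptr && std::strcmp(v, "1") == 0) {
        int one = 1;
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    }
}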