Lorenzo Piarulli

PhD Student

AI Accelerators Applications Benchmarking

Lorenzo is a PhD student in the HLC lab at Sapienza University of Rome. His research focuses on high-performance computing, with particular interest in network interconnects and MPI runtime optimisation.

Publications

2026

SC26
High-Performance Tensor Formulation of the Viterbi Algorithm for Hidden Semi-Markov Models

Lorenzo Piarulli, Elia Belli, and Daniele De Sensi

In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’26), Nov 2026

Hidden Semi-Markov Models (HSMMs) are fundamental probabilistic models widely adopted across diverse domains, from computational biology to finance and signal processing. The Viterbi algorithm decodes the most likely state sequence given an HSMM and can be applied iteratively for ab initio model learning. However, existing Viterbi implementations remain sequential, and GPU-accelerated solutions are entirely absent, making HSMM decoding impractical for large-scale workloads. We present a tensor-based formulation of the Viterbi algorithm for HSMMs, restructuring the inner loops into tensor operations that naturally map onto SIMD units and massively parallel architectures. Building on this formulation, we provide optimized implementations spanning single- and multi-core CPUs, and, for the first time, GPU. Experimental evaluation demonstrates speedups of up to 14× on a single core, over 200× with multi-core, and over 570× on GPU over the state-of-the-art sequential baseline, establishing a new performance baseline for large-scale HSMM decoding.
@inproceedings{tensor-viterbi,
  author = {Piarulli, Lorenzo and Belli, Elia and De Sensi, Daniele},
  title = {High-Performance Tensor Formulation of the Viterbi Algorithm for Hidden Semi-Markov Models},
  year = {2026},
  month = nov,
  booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'26)},
  note = {To appear},
}
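The abstract above describes restructuring the inner loops of HSMM Viterbi decoding into tensor operations. As a rough illustration of that general idea only (this is not the paper's formulation or implementation), the NumPy sketch below decodes an explicit-duration HSMM and replaces the loops over durations and states with one array expression per time step. The function name `hsmm_viterbi`, the parameter names, and the array layouts are all assumptions made for this example; it also assumes finite emission log-likelihoods and that the last segment ends exactly at the final observation.

```python
import numpy as np

def hsmm_viterbi(log_pi, log_A, log_dur, log_B):
    """Most likely state path for an explicit-duration HSMM (log domain).

    Hypothetical layout chosen for this sketch:
    log_pi:  (N,)    initial state log-probabilities
    log_A:   (N, N)  log_A[i, j] = log P(next segment is j | segment i ended)
    log_dur: (N, D)  log_dur[j, d-1] = log P(a segment in j lasts d steps)
    log_B:   (T, N)  log_B[t, j] = log P(observation t | state j), finite
    """
    T, N = log_B.shape
    D = log_dur.shape[1]
    # Prefix sums: a segment's emission score becomes a difference of two rows.
    cum = np.vstack([np.zeros((1, N)), np.cumsum(log_B, axis=0)])
    enter = np.full((T + 1, N), -np.inf)    # best score of starting j at time t
    enter[0] = log_pi
    prev_state = np.zeros((T + 1, N), dtype=int)
    best_dur = np.zeros((T, N), dtype=int)
    delta = np.full((T, N), -np.inf)        # best score of a j-segment ending at t
    cols = np.arange(N)
    for t in range(T):
        dmax = min(D, t + 1)
        starts = t - np.arange(dmax)        # segment start for d = 1..dmax
        # One (dmax, N) array covers every duration and every state at once.
        cand = enter[starts] + log_dur[:, :dmax].T + cum[t + 1] - cum[starts]
        k = cand.argmax(axis=0)
        delta[t] = cand[k, cols]
        best_dur[t] = k + 1
        # Transition step written as a max-plus matrix-vector product.
        trans = delta[t][:, None] + log_A
        prev_state[t + 1] = trans.argmax(axis=0)
        enter[t + 1] = trans.max(axis=0)
    # Backtrack segment by segment from the last time step.
    path = np.empty(T, dtype=int)
    t, j = T - 1, int(delta[T - 1].argmax())
    score = float(delta[T - 1, j])
    while t >= 0:
        start = t - int(best_dur[t, j]) + 1
        path[start:t + 1] = j
        j, t = int(prev_state[start, j]), start - 1
    return path, score
```

The time loop stays sequential because each step depends on earlier ones; only the work inside a step is expressed as whole-array operations, which is the part that maps onto SIMD units or a GPU backend.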
SC26
Characterizing the Scalability and Performance of Large-Scale AI Training Under Multi-Tenancy

Jacopo Raffi, Thomas Pasquali, Lorenzo Piarulli, and 6 more authors

In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’26), Nov 2026

Characterising AI workload performance on modern HPC systems requires understanding both their scalability in isolation and their behaviour under concurrent execution. However, the interplay among parallelisation strategies, network congestion, compute capability, and interconnect technologies remains poorly understood. This work investigates the performance and scalability of AI models up to 2620 GPUs. We quantify the communication overheads and their impact across different interconnects by evaluating scale-up, scale-out, and rack-scale configurations under multiple allocation schemes. Finally, we study how multiple concurrent training jobs interfere with each other by designing a realistic noise model. We design a benchmark suite of AI models to evaluate the performance of five distinct parallelisation strategies across different supercomputing clusters, including Alps, Leonardo, LUMI, JUPITER, NVL72 GB300, and DGX A100. Our work provides a systematic characterization of the scalability and execution efficiency of distributed AI training, while offering key insights into performance behavior under realistic multi-tenant scenarios.
@inproceedings{multi-tenant-ai,
  author = {Raffi, Jacopo and Pasquali, Thomas and Piarulli, Lorenzo and Spiga, Filippo and Faltelli, Marco and Herten, Andreas and Siracusa, Domenico and De Sensi, Daniele and Vella, Flavio},
  title = {Characterizing the Scalability and Performance of Large-Scale AI Training Under Multi-Tenancy},
  year = {2026},
  month = nov,
  booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'26)},
  note = {To appear},
}

MICPRO

NET4EXA: Toward Ethernet-native interconnects for exascale HPC and AI systems

John Gliksberg, Luca Andreetti, Daniele Di Bari, and 53 more authors

Microprocessors and Microsystems, 2026

High-Performance Computing (HPC) and Artificial Intelligence (AI) infrastructures are evolving toward highly heterogeneous exascale systems, where communication scalability and data movement increasingly limit overall performance and efficiency. This paper presents NET4EXA, a European initiative targeting the development of BXIv3, the third generation of the Bull eXascale Interconnect (BXI), based on an Ethernet-native architecture combining FPGA-based SmartNICs, custom switching ASICs, and hardware-accelerated communication semantics. The paper reports a selection of mid-project research results spanning multiple layers of the communication stack, including NIC- and switch-level collective offloading, TCP/RDMA acceleration, on-NIC processing, communication–computation overlap through triggered operations, and optimized data movement using IOVEC mechanisms. Additional contributions include OpenSHMEM integration over a Portals-based runtime, congestion-aware routing and mapping strategies for Dragonfly+ topologies, and benchmarking frameworks for HPC and AI communication workloads. Finally, the paper discusses emerging research directions toward future interconnect architectures based on silicon photonics and co-packaged optics targeting multi-terabit communication systems.

@article{GLIKSBERG2026105296,
  title = {NET4EXA: Toward Ethernet-native interconnects for exascale HPC and AI systems},
  journal = {Microprocessors and Microsystems},
  pages = {105296},
  year = {2026},
  issn = {0141-9331},
  doi = {10.1016/j.micpro.2026.105296},
  url = {https://www.sciencedirect.com/science/article/pii/S0141933126000530},
  author = {Gliksberg, John and Andreetti, Luca and {Di Bari}, Daniele and Pagano, Giuseppe and Monterubbiano, Andrea and Turisini, Matteo and Zaourar, Lilia and Benazouz, Mohamed and Pottier, Antony and Takhoubit, Khaled and Jaeger, Julien and Beaulieu, Corentin and Peysieux, Anis and Reynier, Florian and Taboada, Hugo and Bilas, Angelos and Chaix, Fabien and Chrysos, Nikolaos and Mageiropoulos, Evangelos and Moreau, Gilles and Ammendola, Roberto and Biagioni, Andrea and Chiarini, Carlotta and Frezza, Ottorino and {Lo Cicero}, Francesca and Lonardo, Alessandro and Martinelli, Michele and Paolucci, Pier Stanislao and Pastorelli, Elena and Perticaroli, Pierpaolo and Pontisso, Luca and Rossi, Cristian and Simula, Francesco and Vicini, Piero and Mocholi, Samuel Rodrigo and Vallero, Marzio and Fan, Kaijie and Fiore, Sandro Luigi and Granelli, Fabrizio and Pasquali, Thomas and Patrignani, Marco and Pezzuto, Simone and Pichetti, Lorenzo and Potestio, Raffaello and Raffi, Jacopo and Siracusa, Domenico and Tubiana, Luca and Velha, Philippe and Vella, Flavio and {De Sensi}, Daniele and Fazzari, Francesco and Paschali, Vladimiro and Pasqualoni, Saverio and Piarulli, Lorenzo and Pontarelli, Salvatore and Rahmani, Taha Abdelazziz},
  keywords = {High-performance interconnects, Exascale computing, SmartNIC, RDMA, HPC and AI systems},
}

ISC26
PICO: Performance Insights for Collective Operations

Saverio Pasqualoni, Tommaso Bonato, Lorenzo Piarulli, and 3 more authors

In ISC High Performance 2026 Research Paper Proceedings (41st International Conference), Hamburg, Germany, June 22-26, 2026

Best Paper Award (1st/35)

Collective operations are cornerstones of both HPC applications and large-scale AI training and inference, yet benchmarking them in a systematic and reproducible way remains difficult on modern systems due to the complexity of their hardware and software stacks. Existing suites primarily report end-to-end timings and offer limited support for controlled algorithm and configuration selection, fine-grained profiling, and capturing the runtime environment. We present PICO (Performance Insights for Collective Operations), an open-source framework that decouples portable experiment setup from platform execution, provides a backend-adaptive parameter selection interface across MPI and NCCL, supplies plain-MPI reference collective implementations that can optionally be instrumented, and records the system configuration for reproducible comparisons. Evaluated on three major supercomputers, PICO shows that default collective algorithms and transport settings can be up to 5x slower than the best available choice. It provides diagnostic evidence by isolating topology-sensitive algorithmic choices and, through instrumentation, reveals detailed algorithmic breakdowns. To assess end-to-end effects of benchmark-informed tuning and evaluate application-level impacts, we replay open-source LLM training traces in the ATLAHS simulator with optimized collective profiles identified by PICO, achieving reductions in training times of up to 44%.
@inproceedings{pico,
  author = {Pasqualoni, Saverio and Bonato, Tommaso and Piarulli, Lorenzo and Hoefler, Torsten and Canini, Marco and De Sensi, Daniele},
  title = {{PICO: Performance Insights for Collective Operations}},
  booktitle = {{ISC} High Performance 2026 Research Paper Proceedings (41st International Conference), Hamburg, Germany, June 22-26, 2026},
  publisher = {Prometeus GmbH / {IEEE}},
  year = {2026},
  note = {To appear},
}
ISC26
Characterizing the Impact of Congestion in Modern HPC Interconnects

Lorenzo Piarulli, Marco Faltelli, Dirk Pleiter, and 7 more authors

In ISC High Performance 2026 Research Paper Proceedings (41st International Conference), Hamburg, Germany, June 22-26, 2026

High-performance computing (HPC) systems increasingly support both scalable AI training and large-scale simulation workloads. Both typically rely heavily on collective communication operations. On modern supercomputers, however, network congestion has emerged as a major limitation, driven by heterogeneous traffic patterns resulting from diverse workload mixes. As system scale and active users continue to grow, understanding how today’s interconnect technologies respond to congestion is essential for establishing realistic performance expectations and informing future system design. This paper presents a comprehensive characterization of congestion behavior across four major HPC fabrics: EDR InfiniBand, HDR InfiniBand, NDR InfiniBand, Cray Slingshot, and emerging Ethernet fabrics. These fabrics span high-performance proprietary interconnects as well as adaptive Ethernet-based designs aligned with emerging standards such as Ultra Ethernet. We evaluate their responses to both steady congestion and a wide range of bursty patterns that vary in duration, intensity, and pause length, capturing the bursty communication typical of AI workloads. Our study covers multiple scales, examining how congestion manifests differently as system size increases and identifying scale-dependent behaviors that influence collective performance. By analyzing the challenges that arise under these controlled stress conditions, we aim to provide a practical overview of congestion issues and possible optimizations. The insights derived from this evaluation can guide researchers and HPC architects in designing more effective congestion-control mechanisms and network load-balancing strategies.
@inproceedings{congestion_isc26,
  author = {Piarulli, Lorenzo and Faltelli, Marco and Pleiter, Dirk and Sivalingam, Karthee and Zhang, Dancheng and Zhao, Kexue and Turisini, Matteo and Iannone, Francesco and Artigiani, Aldo and De Sensi, Daniele},
  title = {{Characterizing the Impact of Congestion in Modern HPC Interconnects}},
  booktitle = {{ISC} High Performance 2026 Research Paper Proceedings (41st International Conference), Hamburg, Germany, June 22-26, 2026},
  publisher = {Prometeus GmbH / {IEEE}},
  year = {2026},
  note = {To appear},
}

2025

SC25
Bine Trees: Enhancing Collective Operations by Optimizing Communication Locality

Daniele De Sensi, Saverio Pasqualoni, Lorenzo Piarulli, and 5 more authors

In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’25), Nov 2025

As high-performance computing (HPC) systems grow, optimizing communication locality becomes essential for performance. HPC networks are often oversubscribed, consisting of fully connected groups that are sparsely connected. We introduce Binomial Negabinary (Bine) trees, a novel approach to enhance collective operations by reducing inter-group communication. They minimize the distance between communicating ranks, reducing traffic on global links and alleviating congestion. Unlike traditional hierarchical algorithms, Bine trees are topology-agnostic and do not assume a uniform partition of ranks, making them ideal for production supercomputers with irregular process allocations. We design algorithms for eight collectives, achieving up to 5x speedups and 33% less global traffic on four supercomputers with four different topologies. Our results emphasize their effectiveness in improving performance while reducing the load on global links.
@inproceedings{bine,
  author = {De Sensi, Daniele and Pasqualoni, Saverio and Piarulli, Lorenzo and Bonato, Tommaso and Ba, Seydou and Turisini, Matteo and Domke, Jens and Hoefler, Torsten},
  title = {Bine Trees: Enhancing Collective Operations by Optimizing Communication Locality},
  year = {2025},
  month = nov,
  booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'25)},
  note = {To appear},
}
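The "Negabinary" in Bine refers to base −2 notation. The abstract does not say how the trees use it, so the sketch below only illustrates the number system itself, not the paper's tree construction; the helper names `to_negabinary` and `from_negabinary` are made up for this example.

```python
def to_negabinary(n: int) -> str:
    """Return the base -2 digit string of an integer (no sign is needed)."""
    if n == 0:
        return "0"
    digits = []
    while n != 0:
        n, r = divmod(n, -2)
        if r < 0:
            # Python's remainder takes the divisor's sign; shift it to 0 or 1.
            n, r = n + 1, r + 2
        digits.append(str(r))
    return "".join(reversed(digits))

def from_negabinary(s: str) -> int:
    """Inverse of to_negabinary: sum of digit * (-2)**position."""
    return sum(int(c) * (-2) ** i for i, c in enumerate(reversed(s)))

print(to_negabinary(2), to_negabinary(3), to_negabinary(-1))  # 110 111 11
```

Because the place values alternate in sign (1, −2, 4, −8, ...), both positive and negative integers have a representation using only the digits 0 and 1.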