Saverio Pasqualoni

Intern

pasqualoni.1845572@studenti.uniroma1.it

Collective Operations AI Accelerators

Publications

2026

MICPRO

NET4EXA: Toward Ethernet-native interconnects for exascale HPC and AI systems

John Gliksberg, Luca Andreetti, Daniele Di Bari, and 53 more authors

Microprocessors and Microsystems, 2026


High-Performance Computing (HPC) and Artificial Intelligence (AI) infrastructures are evolving toward highly heterogeneous exascale systems, where communication scalability and data movement increasingly limit overall performance and efficiency. This paper presents NET4EXA, a European initiative targeting the development of BXIv3, the third generation of the Bull eXascale Interconnect (BXI), based on an Ethernet-native architecture combining FPGA-based SmartNICs, custom switching ASICs, and hardware-accelerated communication semantics. The paper reports a selection of mid-project research results spanning multiple layers of the communication stack, including NIC- and switch-level collective offloading, TCP/RDMA acceleration, on-NIC processing, communication–computation overlap through triggered operations, and optimized data movement using IOVEC mechanisms. Additional contributions include OpenSHMEM integration over a Portals-based runtime, congestion-aware routing and mapping strategies for Dragonfly+ topologies, and benchmarking frameworks for HPC and AI communication workloads. Finally, the paper discusses emerging research directions toward future interconnect architectures based on silicon photonics and co-packaged optics targeting multi-terabit communication systems.

@article{GLIKSBERG2026105296,
  title = {NET4EXA: Toward Ethernet-native interconnects for exascale HPC and AI systems},
  journal = {Microprocessors and Microsystems},
  pages = {105296},
  year = {2026},
  issn = {0141-9331},
  doi = {10.1016/j.micpro.2026.105296},
  url = {https://www.sciencedirect.com/science/article/pii/S0141933126000530},
  author = {Gliksberg, John and Andreetti, Luca and {Di Bari}, Daniele and Pagano, Giuseppe and Monterubbiano, Andrea and Turisini, Matteo and Zaourar, Lilia and Benazouz, Mohamed and Pottier, Antony and Takhoubit, Khaled and Jaeger, Julien and Beaulieu, Corentin and Peysieux, Anis and Reynier, Florian and Taboada, Hugo and Bilas, Angelos and Chaix, Fabien and Chrysos, Nikolaos and Mageiropoulos, Evangelos and Moreau, Gilles and Ammendola, Roberto and Biagioni, Andrea and Chiarini, Carlotta and Frezza, Ottorino and {Lo Cicero}, Francesca and Lonardo, Alessandro and Martinelli, Michele and Paolucci, Pier Stanislao and Pastorelli, Elena and Perticaroli, Pierpaolo and Pontisso, Luca and Rossi, Cristian and Simula, Francesco and Vicini, Piero and Mocholi, Samuel Rodrigo and Vallero, Marzio and Fan, Kaijie and Fiore, Sandro Luigi and Granelli, Fabrizio and Pasquali, Thomas and Patrignani, Marco and Pezzuto, Simone and Pichetti, Lorenzo and Potestio, Raffaello and Raffi, Jacopo and Siracusa, Domenico and Tubiana, Luca and Velha, Philippe and Vella, Flavio and {De Sensi}, Daniele and Fazzari, Francesco and Paschali, Vladimiro and Pasqualoni, Saverio and Piarulli, Lorenzo and Pontarelli, Salvatore and Rahmani, Taha Abdelazziz},
  keywords = {High-performance interconnects, Exascale computing, SmartNIC, RDMA, HPC and AI systems},
}

ISC26
PICO: Performance Insights for Collective Operations

Saverio Pasqualoni, Tommaso Bonato, Lorenzo Piarulli, and 3 more authors

In ISC High Performance 2026 Research Paper Proceedings (41st International Conference), Hamburg, Germany, June 22-26, 2026

Best Paper Award (1st/35)

Collective operations are cornerstones of both HPC applications and large-scale AI training and inference, yet benchmarking them in a systematic and reproducible way remains difficult on modern systems due to the complexity of their hardware and software stacks. Existing suites primarily report end-to-end timings and offer limited support for controlled algorithm and configuration selection, fine-grained profiling, and capturing the runtime environment. We present PICO (Performance Insights for Collective Operations), an open-source framework that decouples portable experiment setup from platform execution, provides a backend-adaptive parameter selection interface across MPI and NCCL, supplies optionally instrumentable plain-MPI reference collective implementations, and records the system configuration for reproducible comparisons. Evaluated on three major supercomputers, PICO shows that default collective algorithms and transport settings can be up to 5x slower than the best available choice. It provides diagnostic evidence by isolating topology-sensitive algorithmic choices and, through instrumentation, reveals detailed algorithmic breakdowns. To assess end-to-end effects of benchmark-informed tuning and evaluate application-level impacts, we replay open-source LLM training traces in the ATLAHS simulator with optimized collective profiles identified by PICO, achieving reductions in training times of up to 44%.

@inproceedings{pico,
  author = {Pasqualoni, Saverio and Bonato, Tommaso and Piarulli, Lorenzo and Hoefler, Torsten and Canini, Marco and {De Sensi}, Daniele},
  title = {{PICO: Performance Insights for Collective Operations}},
  booktitle = {{ISC} High Performance 2026 Research Paper Proceedings (41st International Conference), Hamburg, Germany, June 22-26, 2026},
  publisher = {Prometeus GmbH / {IEEE}},
  year = {2026},
  note = {To appear},
}

2025

SC25
Bine Trees: Enhancing Collective Operations by Optimizing Communication Locality

Daniele De Sensi, Saverio Pasqualoni, Lorenzo Piarulli, and 5 more authors

In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’25), Nov 2025


As high-performance computing (HPC) systems grow, optimizing communication locality becomes essential for performance. HPC networks are often oversubscribed, consisting of fully connected groups that are only sparsely connected to one another. We introduce Binomial Negabinary (Bine) trees, a novel approach to enhance collective operations by reducing inter-group communication. They minimize the distance between communicating ranks, reducing traffic on global links and alleviating congestion. Unlike traditional hierarchical algorithms, Bine trees are topology-agnostic and do not assume a uniform partition of ranks, making them ideal for production supercomputers with irregular process allocations. We design algorithms for eight collectives, achieving up to 5x speedups and 33% less global traffic on four supercomputers with four different topologies. Our results emphasize their effectiveness in improving performance while reducing the load on global links.

@inproceedings{bine,
  author = {{De Sensi}, Daniele and Pasqualoni, Saverio and Piarulli, Lorenzo and Bonato, Tommaso and Ba, Seydou and Turisini, Matteo and Domke, Jens and Hoefler, Torsten},
  title = {Bine Trees: Enhancing Collective Operations by Optimizing Communication Locality},
  booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'25)},
  year = {2025},
  month = nov,
  note = {To appear},
}