Octotron: Towards Autonomous and Reliable Operation of Supercomputers
The goal of the Octotron project is to design an approach that guarantees the reliable autonomous operation of large supercomputing centers. The approach is based on a formal model of a supercomputer that describes the proper functioning of its components and their interconnections. The supercomputer continually compares its current state with the information in the model. If reality (the current supercomputer state) deviates from theory (the supercomputer model), Octotron performs one of the predefined actions: notifying administrators via email and/or SMS, disabling malfunctioning services, restarting software, etc. This approach not only guarantees reliable operation of the existing fleet of systems at a supercomputing center, but also ensures high-quality maintenance when moving to a new generation of machines. Indeed, once an emergency situation arises, it is reflected in the model, along with its root causes and symptoms, and an appropriate reaction is programmed into the model.
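The compare-and-react loop described above can be illustrated with a minimal sketch. This is not Octotron's actual API: the names `Rule`, `Component`, `evaluate`, and `notify_admin`, as well as the example thresholds, are hypothetical and chosen only for illustration. Interconnections between components are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Rule:
    """One expectation from the model plus the reaction to run when it fails.

    Hypothetical structure, not taken from Octotron itself.
    """
    name: str
    check: Callable[[Dict[str, float]], bool]  # True when the state matches the model
    reaction: Callable[[str], None]            # called with a description of the deviation


@dataclass
class Component:
    """A modeled component (node, switch, service) with its expectations."""
    name: str
    rules: List[Rule] = field(default_factory=list)


def evaluate(component: Component, state: Dict[str, float]) -> List[str]:
    """Compare the observed state with the model and trigger reactions.

    Returns the names of the rules that were violated.
    """
    violated = []
    for rule in component.rules:
        if not rule.check(state):
            rule.reaction(f"{component.name}: {rule.name}")
            violated.append(rule.name)
    return violated


# Stand-in for an email/SMS notification: messages are collected in a list.
notifications: List[str] = []


def notify_admin(message: str) -> None:
    notifications.append(message)


# Assumed example thresholds; a real model would take them from hardware specs.
node = Component("node-01", [
    Rule("cpu_temp_below_80", lambda s: s["cpu_temp"] < 80.0, notify_admin),
    Rule("fan_running", lambda s: s["fan_rpm"] > 0, notify_admin),
])

# The observed temperature deviates from the model, so one reaction fires.
violated = evaluate(node, {"cpu_temp": 91.5, "fan_rpm": 3200})
```

In this sketch each reaction is attached to the rule it belongs to, which mirrors the idea that an emergency situation and the response to it are recorded together in the model.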
Evaluation: MSU HPC Center.
S. I. Sobolev, A. S. Antonov, P. A. Shvets, D. A. Nikitenko, K. S. Stefanov, Vad. V. Voevodin, Vl. V. Voevodin, and S. A. Zhumatiy. Evaluation of the Octotron system on the Lomonosov-2 supercomputer. In: Parallel Computing Technologies (PaCT'2018): international conference proceedings (April 2-6, 2018, Rostov-on-Don, Russia), pages 176–184, SUSU Publishing center, 2018. http://omega.sp.susu.ru/pavt2018/short/047.pdf
Alexander Antonov, Dmitry Nikitenko, Pavel Shvets, Sergey Sobolev, Konstantin Stefanov, Vadim Voevodin, Vladimir Voevodin, and Sergey Zhumatiy. An approach for ensuring reliable functioning of a supercomputer based on a formal model. In: Parallel Processing and Applied Mathematics. 11th International Conference, PPAM 2015, Krakow, Poland, September 6-9, 2015. Revised Selected Papers, Part I, volume 9573 of Lecture Notes in Computer Science, pages 12–22. Springer International Publishing, 2016. DOI: 10.1007/978-3-319-32149-3_2
Sergey Sobolev, Konstantin Stefanov, and Vadim Voevodin. Automatic discovery of the communication network topology for building a supercomputer model. In: Numerical Computations: Theory and Algorithms (NUMTA–2016): Proceedings of the 2nd International Conference “Numerical Computations: Theory and Algorithms”, volume 1776 of AIP Conference Proceedings, pages 090014–1–090014–4, 2016. DOI: 10.1063/1.4965378