Anomaly detection in the supercomputer job flow
Currently, many supercomputing applications use the available resources inefficiently. Reducing the number of such applications requires a tool that analyzes the supercomputer task flow and detects inefficient application runs. This research applies machine learning methods to this problem: the methods classify application runs based on system monitoring data (e.g., CPU load and network usage).
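As a rough illustration of classifying job runs from monitoring data, here is a minimal Python sketch that uses a k-nearest-neighbours vote over per-job summary features. It is not the method from the publications listed here. The metric names (`cpu_load`, `net_mb_s`), the chosen features (mean and peak CPU load, mean network traffic), the labels "normal"/"abnormal", and the toy data are all hypothetical.

```python
import math
from statistics import mean

def job_features(series):
    """Summarize a job's monitoring time series as a fixed-length feature vector.

    The metric names ("cpu_load", "net_mb_s") are hypothetical placeholders.
    """
    cpu = series["cpu_load"]  # CPU load samples, fraction of cores busy (0..1)
    net = series["net_mb_s"]  # network traffic samples, MB/s
    return [mean(cpu), max(cpu), mean(net)]

def fit_scaler(vectors):
    """Return per-feature (min, max) bounds so all features share a 0..1 scale."""
    return [(min(col), max(col)) for col in zip(*vectors)]

def scale(vec, bounds):
    # Min-max scaling; constant features (max == min) map to 0 to avoid division by zero.
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, (lo, hi) in zip(vec, bounds)]

def knn_classify(train, query, k=3):
    """Label `query` by majority vote of its k nearest labeled jobs.

    `train` is a list of (feature_vector, label) pairs.
    """
    bounds = fit_scaler([f for f, _ in train])
    q = scale(query, bounds)
    nearest = sorted((math.dist(scale(f, bounds), q), label) for f, label in train)[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# Toy labeled history of past job runs (hypothetical data).
history = [
    ({"cpu_load": [0.90, 0.95, 0.92], "net_mb_s": [120, 130, 110]}, "normal"),
    ({"cpu_load": [0.85, 0.90, 0.88], "net_mb_s": [100, 90, 95]}, "normal"),
    ({"cpu_load": [0.80, 0.92, 0.86], "net_mb_s": [140, 150, 135]}, "normal"),
    ({"cpu_load": [0.05, 0.02, 0.03], "net_mb_s": [0, 1, 0]}, "abnormal"),
    ({"cpu_load": [0.10, 0.04, 0.06], "net_mb_s": [2, 1, 3]}, "abnormal"),
]
train = [(job_features(s), label) for s, label in history]

# A new run that barely uses the CPU or the network.
new_job = {"cpu_load": [0.03, 0.05, 0.04], "net_mb_s": [1, 0, 2]}
print(knn_classify(train, job_features(new_job)))  # → abnormal
```

Scaling matters here: without it, network traffic in MB/s would dominate the Euclidean distance and CPU load would barely influence the vote. A production tool would also need far more labeled runs and richer features than this toy example.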
Voevodin V., Voevodin V., Shaykhislamov D., Nikitenko D. Data mining method for anomaly detection in the supercomputer task flow // Numerical Computations: Theory and Algorithms (NUMTA–2016): Proceedings of the 2nd International Conference “Numerical Computations: Theory and Algorithms”. — Vol. 1776 of AIP Conference Proceedings. — 2016. — P. 090015–1–090015–4. DOI: 10.1063/1.4965379
Shaykhislamov D. Using machine learning methods to detect applications with abnormal efficiency // Russian Supercomputing Days. — Springer, 2016. — P. 345–355. DOI: 10.1007/978-3-319-55669-7_27
Shaykhislamov D., Voevodin V. An approach for detecting abnormal parallel applications based on time series analysis methods // Parallel Processing and Applied Mathematics. — Vol. 10777 of Lecture Notes in Computer Science. — Springer International Publishing, 2018. — P. 359–369. DOI: 10.1007/978-3-319-78024-5_32