Performance monitoring is a method for diagnosing performance issues in applications of various types. It relies on performance metrics collected from the servers the application runs on, and may also use metrics produced by the application itself.
The Distributed Modular MONitoring system (DiMMon) is a performance monitoring system that makes it possible to analyze the performance of a supercomputer as a whole and of every job executed on it. It is based on the following principles:
• Ability to direct different data flows along different routes, or copy the same data to several recipients for different processing functions.
• Support for dynamic reconfiguration of the monitoring system's operating modes (data transmission routes, data collection parameters, data processing rules).
• Ability to calculate performance metrics for individual jobs while the data is being collected, without first writing it to disk and reading it back.
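The three principles above can be illustrated with a small sketch. This is not DiMMon's actual API: the names Router, JobAverager, add_route, remove_route and dispatch, as well as the sample layout and the "cpu_user" metric name, are hypothetical and chosen only for illustration. The sketch shows one sample being copied to several recipients, a route being removed at run time, and a per-job average being updated as samples arrive, without storing the raw data.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Assumed sample layout (hypothetical):
# {"node": "n1", "job": 7, "metric": "cpu_user", "value": 0.5}
Sample = dict
Handler = Callable[[Sample], None]


class Router:
    """Hypothetical router: copies each sample to every handler on its route."""

    def __init__(self) -> None:
        self._routes: Dict[str, List[Handler]] = defaultdict(list)

    def add_route(self, metric: str, handler: Handler) -> None:
        # Routes can be added while data is flowing (dynamic reconfiguration).
        self._routes[metric].append(handler)

    def remove_route(self, metric: str, handler: Handler) -> None:
        self._routes[metric].remove(handler)

    def dispatch(self, sample: Sample) -> None:
        # Iterate over a copy so a handler may safely change the routes.
        for handler in list(self._routes.get(sample["metric"], [])):
            handler(sample)


class JobAverager:
    """Keeps only a running count and sum per job; raw samples are not stored."""

    def __init__(self) -> None:
        self._count: Dict[int, int] = defaultdict(int)
        self._total: Dict[int, float] = defaultdict(float)

    def __call__(self, sample: Sample) -> None:
        self._count[sample["job"]] += 1
        self._total[sample["job"]] += sample["value"]

    def mean(self, job: int) -> float:
        return self._total[job] / self._count[job]


router = Router()
per_job = JobAverager()
archive: List[Sample] = []  # stands in for a second recipient, e.g. a database

# The same data flow is delivered to two recipients.
router.add_route("cpu_user", per_job)
router.add_route("cpu_user", archive.append)
router.dispatch({"node": "n1", "job": 7, "metric": "cpu_user", "value": 0.5})

# Reconfigure at run time: stop archiving, keep the per-job calculation.
router.remove_route("cpu_user", archive.append)
router.dispatch({"node": "n2", "job": 7, "metric": "cpu_user", "value": 1.0})

print(per_job.mean(7))  # 0.75
print(len(archive))     # 1
```

In this sketch the per-job mean reflects both samples, while the archive holds only the one received before its route was removed. A real monitoring system would also need to handle network transport, job start and end events, and concurrency, none of which are modeled here.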
K. Stefanov and Vl. Voevodin. Distributed Modular Monitoring (DiMMon) approach to supercomputer monitoring. In Proceedings of the 2015 IEEE International Conference on Cluster Computing, pages 502–503. IEEE Computer Society Conference Publishing Services, 2015. DOI: 10.1109/CLUSTER.2015.83
K. Stefanov, Vl. Voevodin, S. Zhumatiy, and Vad. Voevodin. Dynamically reconfigurable distributed modular monitoring system for supercomputers (DiMMon). In 4th International Young Scientist Conference on Computational Science, volume 66 of Procedia Computer Science, pages 625–634. Elsevier B.V., Netherlands, 2015. DOI: 10.1016/j.procs.2015.11.071
K. Stefanov and A. Gradskov. Analysis of CPU usage data properties and their possible impact on performance monitoring. Supercomputing Frontiers and Innovations, 3(4):66–73, 2016. DOI: 10.14529/jsfi160405