This title appears in the Scientific Report: 2023
Please use the identifier http://dx.doi.org/10.34734/FZJ-2023-03546 in citations.
HPC system and job monitoring with LLview
Personal Name(s): | Silva, Vitor |
---|---|
 | Guimaraes, Filipe (Corresponding author) |
Contributing Institute: | Jülich Supercomputing Center; JSC |
Imprint: | 2022 |
DOI: | 10.34734/FZJ-2023-03546 |
Conference: | RISC2 webinar series, Online (Germany) |
Document Type: | Talk (non-conference) |
Research Program: | A network for supporting the coordination of High-Performance Computing research between Europe and Latin America |
 | Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups |
LLview is a monitoring infrastructure developed by the Jülich Supercomputing Centre to provide an easy-to-use and adaptable software suite for monitoring High Performance Computing systems. With the emergence of large heterogeneous machines in the Exascale range, the challenges of monitoring such huge systems increase significantly. To address this, LLview is under continuous development so that it works across a wide range of hardware systems and software interfaces with negligible overhead, while providing fast, reliable access to job reports, system-wide monitoring data, and real-time system information. That information is provided to system users, project advisors, support teams and system administrators. It helps them manage jobs and identify performance issues at many levels, and it helps system administrators find failures and system malfunctions. This webinar gives an overview of the different LLview components and their interaction with each other and with the system. Particular attention is given to the system monitoring views and the job reporting features, which make it possible to trace the entire life cycle of a job and can help identify problems and bottlenecks at a very early stage.