
Additionally, the library comes with High Level Operation (HLO) Execution Time Distribution metrics, which offer detailed timing breakdowns of compiled operations, and an HLO Queue Size metric, which monitors execution pipeline congestion.
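For context, Google surfaces these metrics through a Python SDK bundled with LibTPU. The snippet below is a minimal sketch, not a verbatim reproduction of the official API: the `libtpu.sdk.tpumonitoring` module path and the `hlo_exec_timing` and `hlo_queue_size` metric names are assumptions based on the library's documentation, and exact names or return shapes may differ across LibTPU versions.

```python
# Minimal sketch: reading HLO metrics via the LibTPU monitoring SDK.
# Assumptions: the `libtpu.sdk.tpumonitoring` module and the metric names
# "hlo_exec_timing" / "hlo_queue_size" -- verify against your LibTPU version.
from libtpu.sdk import tpumonitoring

# Discover which metrics this LibTPU build actually supports.
print(tpumonitoring.list_supported_metrics())

# Distribution of execution times for compiled HLO operations.
hlo_timing = tpumonitoring.get_metric(metric_name="hlo_exec_timing")
print(hlo_timing.data)

# Current depth of the HLO execution queue (a pipeline-congestion signal).
hlo_queue = tpumonitoring.get_metric(metric_name="hlo_queue_size")
print(hlo_queue.data)
```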
However, Google isn’t the only AI infrastructure provider releasing tools to optimize the performance and utilization of compute resources such as CPUs, GPUs, and other accelerators.
Rival hyperscaler AWS offers a host of ways for enterprises to optimize the cost of running AI workloads while ensuring maximum use of their resources.
To begin with, it offers Amazon CloudWatch, a service that delivers end-to-end observability for training workloads running on Trainium and Inferentia, covering metrics such as GPU/accelerator utilization, latency, throughput, and resource availability.
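As an illustration, CloudWatch metrics can be pulled programmatically with boto3. The sketch below assumes accelerator metrics are already being published to CloudWatch (for example via AWS's Neuron monitoring tooling); the `NeuronMonitor` namespace and `neuroncore_utilization` metric name are placeholders for whatever your setup actually emits, not official identifiers.

```python
# Minimal sketch: querying accelerator utilization from CloudWatch.
# The namespace and metric name below are hypothetical placeholders;
# substitute the values your monitoring pipeline publishes.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.datetime.now(datetime.timezone.utc)

response = cloudwatch.get_metric_statistics(
    Namespace="NeuronMonitor",            # assumption: your configured namespace
    MetricName="neuroncore_utilization",  # assumption: your published metric name
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=300,  # 5-minute buckets
    Statistics=["Average", "Maximum"],
)

# Print utilization datapoints in chronological order.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```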