Understanding the Presto Metrics for Monitoring
Presto clusters support the Datadog monitoring service. You can configure the Datadog monitoring service at the cluster level as described in Advanced configuration: Modifying Cluster Monitoring Settings, or at the account level. (For more information on configuring the Datadog monitoring service at the account level on AWS, see iam-keys or Managing Roles.)
The following table lists the Presto metrics that are displayed in the Datadog account, the abnormalities each metric can indicate, and the actions you can take to address the underlying cause.
| Presto Metric | Metric Definition | Abnormalities indicated in the Metrics | Actions |
|---|---|---|---|
| presto.Workers | Number of workers that are part of the Presto cluster | presto.Workers is less than the configured minimum number of nodes | Perform these actions: |
| presto.MaxYoungGenGC-Time | Maximum time spent in YoungGen Garbage Collection (GC) across all nodes of the cluster | A sudden spike in the value indicates a problem. | GC problems typically occur when the cluster is under heavy load. They can be reduced through better workload management using Queues or ResourceGroups. |
| presto.MaxYoungGenGC-Count | Maximum number of YoungGen GC events across all nodes of the cluster | A sudden increase in the value can point to a problem, but correlating it with GC time usually gives a clearer picture. | GC problems typically occur when the cluster is under heavy load. They can be reduced through better workload management using Queues or ResourceGroups. |
| presto.MaxOldGenGC-Time | Maximum time spent in OldGen GC across all nodes of the cluster | A sudden increase in the value can point to a problem, but correlating it with GC time usually gives a clearer picture. | GC problems typically occur when the cluster is under heavy load. They can be reduced through better workload management using Queues or ResourceGroups. |
| presto.MaxOldGenGC-Count | Maximum number of OldGen GC events across all nodes of the cluster | A sudden increase in the value can point to a problem, but correlating it with GC time usually gives a clearer picture. | GC problems typically occur when the cluster is under heavy load. They can be reduced through better workload management using Queues or ResourceGroups. |
| presto.AveragePlanningTime | Average planning time (in milliseconds) in the planning phase of queries | These are the possible abnormalities: | Perform these actions: |
| presto.requestFailures | Number of requests that failed at the master while contacting worker nodes during task execution | A few of these errors can occur due to network congestion, but a consistent increase in the value indicates a problem. | Perform these actions: |
| presto.RUNNING-Queries | Number of Running Queries in the cluster | Not Applicable | Not Applicable |
| presto.FINISHED-Queries | Number of Finished Queries | Not Applicable | Not Applicable |
| presto.FAILED-Queries | Number of Failed Queries | Not Applicable | Not Applicable |
| presto.bytesReadPerSecondPerQuery | Bytes read per second per query. (This metric considers only running queries in its calculations; if there are none, no data is reported.) | A value trending toward 0 indicates an issue. The absence of a value does not indicate an issue. | This is most probably due to read operators getting stuck while reading from the cloud object store. The most probable reason is a network issue, which can be checked manually from the nodes. If you can manually reach the cloud object store, create a ticket with Qubole Support. |
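
The abnormality conditions in the table can be expressed as simple checks over metric samples. The following is a minimal sketch, not part of Qubole or Datadog: it assumes the samples have already been fetched from Datadog into plain Python lists (oldest first), and every function name, threshold, and window size here (`factor`, `min_points`, `floor`) is a hypothetical value chosen for illustration that you would tune for your own cluster.

```python
from statistics import mean
from typing import Optional, Sequence


def workers_below_minimum(workers: int, configured_min_nodes: int) -> bool:
    """presto.Workers is abnormal when it is less than the configured minimum."""
    return workers < configured_min_nodes


def sudden_spike(samples: Sequence[float], factor: float = 3.0) -> bool:
    """Flag the latest sample when it exceeds `factor` times the mean of the
    earlier samples (for example, presto.MaxYoungGenGC-Time).
    `factor` is an assumed threshold. A zero baseline is never flagged."""
    if len(samples) < 2:
        return False
    baseline = mean(samples[:-1])
    return baseline > 0 and samples[-1] > factor * baseline


def consistently_increasing(samples: Sequence[float], min_points: int = 4) -> bool:
    """Return True when the last `min_points` samples are strictly increasing
    (for example, presto.requestFailures). `min_points` is an assumed window."""
    if len(samples) < min_points:
        return False
    recent = samples[-min_points:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def read_throughput_stalled(latest: Optional[float], floor: float = 1024.0) -> bool:
    """Check presto.bytesReadPerSecondPerQuery against an assumed `floor`.
    None means no data was reported (no running queries), which is not an issue."""
    if latest is None:
        return False
    return latest < floor
```

For instance, `sudden_spike` could be applied to recent presto.MaxYoungGenGC-Time samples and `consistently_increasing` to presto.requestFailures. A short strictly-increasing window is a deliberately crude stand-in for "consistent increase"; a trend fit over a longer window may suit noisy data better.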