Creating warehouse performance notifications

by Joey Lane

What are warehouse performance alerts?

To give you more confidence in the quality of Slingshot’s schedule recommendations, you can now set custom performance thresholds that alert you (and/or any members of your team) when a warehouse experiences an unexpected spike in either of two key performance metrics: query queuing and query execution time.

You can configure an alert for these metrics in any of the following three ways:

  1. After applying a recommendation, a modal pops up asking if you want to set a performance alert on that warehouse. This lets you confirm that the new warehouse schedule is not performing anomalously.
  2. You can create or manage alerts for any warehouse at any time on the Snowflake Warehouses page.
  3. As a tenant admin, you can create alerts for groups of warehouses that are assigned to the same tag(s) on the Configurations page.

How should I determine performance thresholds for a given warehouse?

While determining performance thresholds is not an exact science, here are a few tips to help ensure you are alerted about performance spikes that need attention (and not about the ones that don’t):

  1. Each threshold has different implications for your warehouse’s performance:

    • Query queuing: When a warehouse has a high number of queued queries, there is a delay between when a query is sent to the warehouse and when it starts executing. This can result in longer wait times for queries to complete after being run. If query response time is especially important for a particular warehouse, we recommend setting a lower query queuing threshold.
      • If your warehouse begins to breach the query queuing threshold, temporarily increasing the max cluster count might relieve the issue (see the sketch after this list).
    • Query execution time: This is the time it takes a query to finish once it begins running. If a warehouse has an important SLA in place, or queries on that warehouse need a fast response time, we recommend setting a query execution time alert in line with your SLA.
      • If your warehouse begins to breach the query execution time threshold, temporarily increasing the warehouse size or turning on QAS might relieve the issue (see the sketch after this list).
  2. You can use the following queries to find your maximum queuing and execution time over the last 30 days to help inform your thresholds:

    Query queuing

    SET WAREHOUSE_NAME = '<NAME>'; -- change parameter

    -- Highest number of queries that queued in any single hour over the last 30 days
    SELECT $WAREHOUSE_NAME AS WAREHOUSE_NAME, MAX(DELAYED) AS "Query Queue Exceeds"
    FROM (SELECT DATE(START_TIME) AS DATE, HOUR(START_TIME) AS HOUR, COUNT(QUERY_ID) AS DELAYED
          FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
          WHERE QUEUED_OVERLOAD_TIME > 0
            AND START_TIME > CURRENT_DATE - 30
            AND WAREHOUSE_NAME = $WAREHOUSE_NAME
          GROUP BY ALL);

    Query execution time

    SET WAREHOUSE_NAME = '<NAME>'; -- change parameter

    -- Longest hourly query run time (in minutes) over the last 30 days
    SELECT $WAREHOUSE_NAME AS WAREHOUSE_NAME, MAX(RUN_TIME_IN_MINS) AS "Query Execution Time Exceeds"
    FROM (SELECT DATE(START_TIME) AS DATE, HOUR(START_TIME) AS HOUR,
                 MAX(ROUND(TOTAL_ELAPSED_TIME / 60000, 0)) AS RUN_TIME_IN_MINS
          FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
          WHERE START_TIME > CURRENT_DATE - 30
            AND WAREHOUSE_NAME = $WAREHOUSE_NAME
          GROUP BY ALL);
  3. For informed thresholds based on a specific warehouse’s historical performance, you can review that warehouse’s historical queuing and execution time on the Warehouse performance dashboard.

  4. Once you’ve set your thresholds, if alerting is too frequent or too infrequent, you can always raise or lower a warehouse’s thresholds on the Warehouses page.
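
If an alert does fire, the remedies mentioned above correspond to standard Snowflake warehouse settings. The following is a minimal sketch, assuming a multi-cluster warehouse with the hypothetical name MY_WH; adjust the values to your workload and revert them once the spike has passed.

    -- Relieve query queuing: allow more clusters to spin up (multi-cluster warehouses only)
    ALTER WAREHOUSE MY_WH SET MAX_CLUSTER_COUNT = 4;

    -- Relieve long execution times: size the warehouse up and/or enable the Query Acceleration Service (QAS)
    ALTER WAREHOUSE MY_WH SET WAREHOUSE_SIZE = 'LARGE';
    ALTER WAREHOUSE MY_WH SET ENABLE_QUERY_ACCELERATION = TRUE;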

-- Heaviest queries over the last 30 days, aggregated per parameterized query hash
WITH query_hist AS
(
    SELECT
        tenant_id AS tenantid,
        account_locator,
        region,
        query_parameterized_hash,
        SUM(partitions_scanned) AS partitions_scanned,
        SUM(partitions_total) AS partitions_total,
        SUM(partitions_scanned) / SUM(partitions_total) AS fraction_scanned,
        SUM(bytes_scanned) / POWER(1024, 3) AS bytes_scanned_gb,
        SUM(execution_time) / COUNT(*) AS avg_execution_time
    FROM
        slingshot_db.slingshot.query_history
    WHERE
        tenantid IN ($tenants)
        AND start_time > CURRENT_DATE - 30
    GROUP BY
        tenantid,
        account_locator,
        region,
        query_parameterized_hash
    HAVING
        SUM(bytes_scanned) / POWER(1024, 3) > 1
        AND SUM(partitions_scanned) / SUM(partitions_total) > 0.8
),
-- Execution counts and attributed compute credits per parameterized query hash
query_attr AS
(
    SELECT
        tenantid,
        account_locator,
        region,
        query_parameterized_hash,
        COUNT(*) AS total_executions,
        SUM(COALESCE(credits_attributed_compute, 0) + COALESCE(credits_used_query_acceleration, 0)) AS total_credits
    FROM
        slingshot_db.slingshot.query_attribution_history
    WHERE
        tenantid IN ($tenants)
        AND start_time > CURRENT_DATE - 30
        AND query_parameterized_hash IS NOT NULL
    GROUP BY
        tenantid,
        account_locator,
        region,
        query_parameterized_hash
),
-- Rank each tenant's queries by total gigabytes scanned
ranked_queries AS
(
    SELECT
        h.tenantid,
        h.account_locator,
        h.region,
        h.query_parameterized_hash,
        h.partitions_scanned,
        h.partitions_total,
        h.fraction_scanned,
        h.bytes_scanned_gb,
        h.avg_execution_time,
        a.total_executions,
        a.total_credits,
        ROW_NUMBER() OVER (PARTITION BY h.tenantid ORDER BY h.bytes_scanned_gb DESC) AS rank
    FROM
        query_hist h
    INNER JOIN
        query_attr a
    ON
        h.tenantid = a.tenantid
        AND h.account_locator = a.account_locator
        AND h.region = a.region
        AND h.query_parameterized_hash = a.query_parameterized_hash
)
-- Keep the 100 most scan-heavy queries per tenant
SELECT
    account_locator,
    region,
    query_parameterized_hash,
    partitions_scanned,
    partitions_total,
    fraction_scanned,
    bytes_scanned_gb,
    avg_execution_time,
    total_executions,
    total_credits
FROM
    ranked_queries
WHERE
    rank <= 100;