Query across InfluxDb metrics?
I have 3 time-series metrics in an InfluxDb database, akin to:
so to get a time-series set of values, I have a grafana graph that maps:
select sum(value) from myservice_processed where $timeFilter GROUP BY time($interval) fill(null)
…for each of the three values. This gives an idea of how many invocations, successes and failures are occurring every minute. Generally, the sum of processed and error should equal the value of invoked.
Now I want to get a time series value, based upon the above metrics, that gives me the percentage of failures. For example, in any given time interval, I may have 1000 invocations, with 900 processed and 100 errors; I’d like that metric to be 10% for that interval.
For the life of me, I cannot figure out how to do this, and I have begun to suspect it cannot be done, which is mind-boggling to me. Can someone please tell me I’m wrong and show me how to do it?
This is currently not possible since Influxdb does not support aggregations function over multiple series right now (influxdb 1.0)
So far Grafana does not support time serie calculations but we do have an ticket for the issue https://github.com/grafana/grafana/issues/3677
This could be done within InfluxDB by a set of continuous queries.
InfluxDB seems to work on the principle that storage is cheap and unscheduled processor time is expensive. Setting up background continuous calculations that store their results is easy, and it lets a calculation quietly churn in the background. Doing on-the-fly calculations within InfluxDB quickly gets awkward (or impossible, if they span measurements).
Every e.g. five minutes, perform a sum of each metric, grouped by time, and insert the sums into a fourth measurement, called
Instead of having a single field called
myservice_summary will have several fields; one for invoked calls, one for processed calls, and one for calls with errors. Instead of the default name
value, we name the fields something meaningful to the people reading the data.
Note that condensing the data with
GROUP BY time(x) (in this example, every five minutes) also reduces your storage overhead, and the client query time (fewer points to retrieve, transmit, and display on the client). It also reduces storage requirements. It’s common in InfluxDB to use at least two retention policies: raw data gets trimmed within a short time (e.g. 30 days), and the condensed and processed data can remain for much longer (e.g. months, years, …)
Of course, picking too large a
GROUP BY time() interval means coarse resolution that might be bad for fault-finding. e.g. It’s not much use having
GROUP BY time(1d) when you need to know in which hour to start looking for a particular change.
An optimal time grouping window balances meaningful detection of when issues start / stop with speed of client response and storage load. Finding this optimal value is left as an exercise. 🙂
Note that when using the CLI, for each of the three continuous queries below, everything from
CREATE CONTINUOUS QUERY to
END might need to be on one line to avoid syntax errors. I’ve put line breaks in only to improve readability.
The square brackets
[ ] indicate optional parameters. The brackets themselves are not to be literally included.
In this case, you would use the extra tag keys to choose which keys are significant and should be in the new measurement.
CREATE CONTINUOUS QUERY myservice_processed_sum_5m ON your_db_name BEGIN SELECT sum(value) AS processed_sum_5m INTO myservice_summary FROM myservice_processed GROUP BY time(5m)[, other_tag_keys e.g. vendor_id] END CREATE CONTINUOUS QUERY myservice_invoked_sum_5m ON your_db_name BEGIN SELECT sum(value) AS invoked_sum_5m INTO myservice_summary FROM myservice_invoked GROUP BY time(5m)[, other_tag_keys e.g. vendor_id] END CREATE CONTINUOUS QUERY myservice_error_sum ON your_db_name BEGIN SELECT sum(value) AS error_sum_5m INTO myservice_summary FROM myservice_error GROUP BY time(5m)[, other_tag_keys e.g. vendor_id] END
So now we have a new measurement, called
myservice_summary, with three fields:
error_sum_5m (assuming that 5-minute summaries are what you want).
From there, a query for the past 24h of fail percentage would be:
SELECT (error_sum_5m / invoked_sum_5m) * 100.0 AS error_pct_5m FROM myservice_summary WHERE time > now() - 1d [GROUP BY other_tags e.g. vendor_id]
Or in a more tabular format:
SELECT [vendor_id, etc, ](error_sum_5m / invoked_sum_5m) * 100.0 AS error_pct_5m FROM myservice_summary WHERE time > now() - 1d
Using the results stored in
myservice_summary in another CQ is possible, but I’m not 100% sure about avoiding race conditions, i.e. what if the CQ that depends on
myservice_summary executes before a query that populates that measurement?
Hope that helps.
InfluxDB lacks the analytics constructs to do that kind of things. If you want to stick with influxdb you’ll have to implement that in an external layer and feed the data back into influx.