Query across InfluxDb metrics?

Query across InfluxDb metrics?

I have 3 time-series metrics in an InfluxDb database, akin to:
myservice_processed
myservice_invoked
myservice_error

so to get a time-series set of values, I have a grafana graph that maps:
select sum(value) from myservice_processed where $timeFilter GROUP BY time($interval) fill(null)

…for each of the three values. This gives an idea of how many invocations, successes and failures are occurring every minute. Generally, the sum of processed and error should equal the value of invoked.
Now I want to get a time series value, based upon the above metrics, that gives me the percentage of failures. For example, in any given time interval, I may have 1000 invocations, with 900 processed and 100 errors; I’d like that metric to be 10% for that interval.
For the life of me, I cannot figure out how to do this, and I have begun to suspect it cannot be done, which is mind-boggling to me. Can someone please tell me I’m wrong and show me how to do it?

Solutions/Answers:

Solution 1:

This is currently not possible since Influxdb does not support aggregations function over multiple series right now (influxdb 1.0)

So far Grafana does not support time serie calculations but we do have an ticket for the issue https://github.com/grafana/grafana/issues/3677

Related:  Integrating Grafana into a web app

Solution 2:

This could be done within InfluxDB by a set of continuous queries.

InfluxDB seems to work on the principle that storage is cheap and unscheduled processor time is expensive. Setting up background continuous calculations that store their results is easy, and it lets a calculation quietly churn in the background. Doing on-the-fly calculations within InfluxDB quickly gets awkward (or impossible, if they span measurements).

Strategy

Every e.g. five minutes, perform a sum of each metric, grouped by time, and insert the sums into a fourth measurement, called myservice_summary.

Instead of having a single field called value, myservice_summary will have several fields; one for invoked calls, one for processed calls, and one for calls with errors. Instead of the default name value, we name the fields something meaningful to the people reading the data.

Note that condensing the data with GROUP BY time(x) (in this example, every five minutes) also reduces your storage overhead, and the client query time (fewer points to retrieve, transmit, and display on the client). It also reduces storage requirements. It’s common in InfluxDB to use at least two retention policies: raw data gets trimmed within a short time (e.g. 30 days), and the condensed and processed data can remain for much longer (e.g. months, years, …)

Related:  Influxdb for a financial application

Of course, picking too large a GROUP BY time() interval means coarse resolution that might be bad for fault-finding. e.g. It’s not much use having GROUP BY time(1d) when you need to know in which hour to start looking for a particular change.

An optimal time grouping window balances meaningful detection of when issues start / stop with speed of client response and storage load. Finding this optimal value is left as an exercise. 🙂

Example

Note that when using the CLI, for each of the three continuous queries below, everything from CREATE CONTINUOUS QUERY to END might need to be on one line to avoid syntax errors. I’ve put line breaks in only to improve readability.

The square brackets [ ] indicate optional parameters. The brackets themselves are not to be literally included.

In this case, you would use the extra tag keys to choose which keys are significant and should be in the new measurement.

CREATE CONTINUOUS QUERY myservice_processed_sum_5m ON your_db_name
BEGIN
    SELECT sum(value) AS processed_sum_5m 
    INTO myservice_summary 
    FROM myservice_processed GROUP BY time(5m)[, other_tag_keys e.g. vendor_id]
END 

CREATE CONTINUOUS QUERY myservice_invoked_sum_5m ON your_db_name
BEGIN
    SELECT sum(value) AS invoked_sum_5m 
    INTO myservice_summary 
    FROM myservice_invoked GROUP BY time(5m)[, other_tag_keys e.g. vendor_id]
END 

CREATE CONTINUOUS QUERY myservice_error_sum ON your_db_name
BEGIN
    SELECT sum(value) AS error_sum_5m 
    INTO myservice_summary 
    FROM myservice_error GROUP BY time(5m)[, other_tag_keys e.g. vendor_id]
END

So now we have a new measurement, called myservice_summary, with three fields: processed_sum_5m, invoked_sum_5m, and error_sum_5m (assuming that 5-minute summaries are what you want).

Related:  InfluxDB : single or multiple measurement

From there, a query for the past 24h of fail percentage would be:

SELECT (error_sum_5m / invoked_sum_5m) * 100.0 
    AS error_pct_5m
    FROM myservice_summary
    WHERE time > now() - 1d
    [GROUP BY other_tags e.g. vendor_id]

Or in a more tabular format:

SELECT [vendor_id, etc, ](error_sum_5m / invoked_sum_5m) * 100.0 
    AS error_pct_5m
    FROM myservice_summary
    WHERE time > now() - 1d

Using the results stored in myservice_summary in another CQ is possible, but I’m not 100% sure about avoiding race conditions, i.e. what if the CQ that depends on myservice_summary executes before a query that populates that measurement?

Hope that helps.

Solution 3:

InfluxDB lacks the analytics constructs to do that kind of things. If you want to stick with influxdb you’ll have to implement that in an external layer and feed the data back into influx.

References