This article is part of a series where we look at doing data science work within ClickHouse. Articles in this series include forecasting, anomaly detection, linear regression and time series classification.
Though this type of analysis would more typically take place outside of ClickHouse in a programming language such as Python or R, our preference is to take things as far as possible using just the database.
By taking this approach, we can rely on the power of ClickHouse to process large datasets with high performance, and reduce or even totally avoid the amount of code that we need to write. This also means that we can work with smaller in-memory datasets on the client side and potentially avoid the need for distributed computation using frameworks such as Spark.
About This Example
In this article, we will demonstrate a simple time series classification technique using ClickHouse's machine learning functions.
We will look at time series of electrical power demand in Italy, and automatically classify each day into one of two categories: fall / winter (October to March) or spring / summer (April to September). In the dataset, the label -1 represents summer and +1 represents winter.
Obviously, power demand is very different at the peak of summer and the peak of winter. In spring and autumn, however, where the seasons cross over, the classification task becomes more difficult. There are also anomalies throughout the year, such as cool days in summer and warm days in winter. Intuitively, we should therefore expect the model to perform fairly well, but it is not an entirely trivial exercise.
Regardless, it is an interesting demonstration of the technique and of how to work with time series data in ClickHouse, so on we go.
Dataset
We use the Italy Power Demand dataset from the UCR Time Series Classification Archive as the basis for our investigation.
Each time series in the dataset represents one day (24 hours) of electrical power demand in Italy. The dataset has the following columns:
- Split - Whether the time series belongs to the training set (train) or to the test set (test).
- Label - The time series binary label, either -1 or 1 representing summer and winter respectively.
- h1-h24 - The time series values over the 24 hours of the day.
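The original table definition is not shown here, but a minimal schema matching these columns might look like the sketch below. The column types, table engine and sorting key are assumptions on our part rather than the definition used for the article:
CREATE TABLE power_demand
(
    -- 'train' or 'test'
    split LowCardinality(String),
    -- -1 = summer, 1 = winter
    label Int8,
    -- 24 hourly demand values
    h1 Float64, h2 Float64, h3 Float64, h4 Float64, h5 Float64, h6 Float64,
    h7 Float64, h8 Float64, h9 Float64, h10 Float64, h11 Float64, h12 Float64,
    h13 Float64, h14 Float64, h15 Float64, h16 Float64, h17 Float64, h18 Float64,
    h19 Float64, h20 Float64, h21 Float64, h22 Float64, h23 Float64, h24 Float64
)
ENGINE = MergeTree
ORDER BY (split, label)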
A sample of the dataset is shown below:
SELECT
split,
label,
h1
FROM power_demand
LIMIT 5
Query id: 25dc4b0d-9d0c-4cf8-a638-507540e31499
┌─split─┬─label─┬─────────h1─┐
│ test │ -1 │ 0.47297301 │
│ test │ -1 │ -1.0987371 │
│ test │ -1 │ -1.0294417 │
│ test │ -1 │ 0.29127426 │
│ test │ -1 │ -0.8146062 │
└───────┴───────┴────────────┘
5 rows in set. Elapsed: 0.002 sec. Processed 1.10 thousand rows, 31.85 KB (565.67 thousand rows/s., 16.44 MB/s.)
Peak memory usage: 1.25 MiB.
If we take the average hourly power demand across the two periods and plot them, we get the following visualisation. The blue line represents summer (-1) and the orange line represents winter (1).
We can see that power demand in winter is slightly higher than in summer, particularly in the evening, but the two lines do cross and their averages are relatively close, which again demonstrates that this is not a trivial classification problem.
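For reference, the hourly averages behind a plot like this can be computed directly in ClickHouse. The query below is a sketch of one way to do it (not necessarily the exact query used for the article's chart), using the -ForEach combinator to average the 24 hourly values element-wise for each label:
SELECT
    label,
    -- element-wise average across all days with this label, one value per hour
    avgForEach([h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11, h12, h13, h14, h15, h16, h17, h18, h19, h20, h21, h22, h23, h24]) AS avg_demand_by_hour
FROM power_demand
GROUP BY label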
Data Preparation
In our previous examples, we had to manually partition our dataset into training and test sets. In this instance, however, our dataset is already split.
We can also see that the data is balanced as the two labels (-1 and 1) appear with similar frequencies, both in the training set and in the test set.
SELECT
split,
label,
count(*)
FROM power_demand
GROUP BY
split,
label
Query id: 148ec1da-8fa8-4ddf-8080-189d52b1ec04
┌─split─┬─label─┬─count()─┐
│ test │ 1 │ 513 │
│ test │ -1 │ 516 │
│ train │ 1 │ 34 │
│ train │ -1 │ 33 │
└───────┴───────┴─────────┘
The time series are already standardized as they have mean equal to zero and standard deviation equal to one, and therefore no additional preprocessing is required.
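As a quick sanity check (not part of the original article), the per-series mean and standard deviation could be computed directly; each row should come out close to zero and one respectively:
WITH [h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11, h12, h13, h14, h15, h16, h17, h18, h19, h20, h21, h22, h23, h24] AS hours
SELECT
    -- mean and sample standard deviation of each 24-hour series
    round(arrayReduce('avg', hours), 3) AS series_mean,
    round(arrayReduce('stddevSamp', hours), 3) AS series_stddev
FROM power_demand
LIMIT 5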
Model Training
We use a logistic regression model to predict the time series labels from the time series values. We fit the model to the training set using ClickHouse's stochasticLogisticRegressionState aggregate function (with a learning rate of 0.1, no L2 regularisation, a mini-batch size of 4 and the 'SGD' update method) and store the trained state in a view:
CREATE OR REPLACE VIEW power_demand_model AS
SELECT
    stochasticLogisticRegressionState(0.1, 0.0, 4, 'SGD')(
        label, h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11, h12, h13, h14, h15, h16, h17, h18, h19, h20, h21, h22, h23, h24
    ) AS state
FROM power_demand
WHERE split = 'train'
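As an aside, the trained state could also be persisted in a table rather than a view, so the model is not re-fitted every time it is read. The sketch below follows the pattern shown in the ClickHouse documentation for the stochastic regression functions; the table name and the Memory engine are our own choices:
-- persist the trained state once, instead of re-training on every read
CREATE TABLE power_demand_model_table ENGINE = Memory AS
SELECT
    stochasticLogisticRegressionState(0.1, 0.0, 4, 'SGD')(
        label, h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11, h12, h13, h14, h15, h16, h17, h18, h19, h20, h21, h22, h23, h24
    ) AS state
FROM power_demand
WHERE split = 'train'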
Inference
After fitting the model to the training set, we can use it to make predictions on the test set.
Note that the model returns the predicted probabilities and not the class labels. In order to transform the predicted probabilities into class labels, we compare them with the standard decision threshold of 0.5: if the predicted probability is greater than 0.5 then the predicted class label is 1, otherwise it's -1.
CREATE OR REPLACE VIEW power_demand_results AS
WITH (
SELECT state
FROM power_demand_model
) AS model
SELECT
label AS ACTUAL_LABEL,
evalMLMethod(model, h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11, h12, h13, h14, h15, h16, h17, h18, h19, h20, h21, h22, h23, h24) AS PREDICTED_PROBABILITY,
if(PREDICTED_PROBABILITY > 0.5, 1., -1.) AS PREDICTED_LABEL
FROM power_demand
WHERE split = 'test'
This generates the following results view which maps the predicted probability to a label, and then compares it with the actual observed result in our test partition:
SELECT *
FROM power_demand_results
LIMIT 5
Query id: 5570eca8-dc09-4d1f-a4a6-00e1446553a8
┌─ACTUAL_LABEL─┬─PREDICTED_PROBABILITY─┬─PREDICTED_LABEL─┐
│ -1 │ 0.23409648001131916 │ -1 │
│ -1 │ 0.3862228028377256 │ -1 │
│ -1 │ 0.3815130710249039 │ -1 │
│ -1 │ 0.4265333624840515 │ -1 │
│ -1 │ 0.39001302830997614 │ -1 │
└──────────────┴───────────────────────┴─────────────────┘
As shown in the plot below, the model classifies most time series correctly, given that the predicted probability is lower than 0.5 for most of the time series in the negative class (-1) and greater than 0.5 for most of the time series in the positive class (1).
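The same picture can be summarised numerically. The query below (a sketch, not part of the original article) shows, for each actual label, the average predicted probability and the share of days that land on the correct side of the 0.5 threshold:
SELECT
    ACTUAL_LABEL,
    -- average predicted probability for this class
    round(avg(PREDICTED_PROBABILITY), 3) AS avg_predicted_probability,
    -- fraction of days classified correctly for this class
    round(avg(if(PREDICTED_LABEL = ACTUAL_LABEL, 1., 0.)), 3) AS share_correct
FROM power_demand_results
GROUP BY ACTUAL_LABEL
ORDER BY ACTUAL_LABEL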
Model Evaluation
We find that the model achieves a 97% accuracy on the test set.
Considering the points made above, and how closely the average demand curves for the two labels track each other, we find this to be quite remarkable performance.
SELECT avg(if(ACTUAL_LABEL = PREDICTED_LABEL, 1., 0.))
FROM power_demand_results
Query id: 88f1e866-ca4b-4746-8b47-5f5b8bbe7b82
┌─avg(if(equals(ACTUAL_LABEL, PREDICTED_LABEL), 1., 0.))─┐
│ 0.967930029154519 │
└────────────────────────────────────────────────────────┘
1 row in set. Elapsed: 0.011 sec. Processed 2.19 thousand rows, 467.03 KB (192.64 thousand rows/s., 41.04 MB/s.)
Peak memory usage: 13.49 MiB.
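Accuracy is a single number; if we wanted to see where the remaining errors occur, a confusion-matrix style breakdown could be produced as follows (a sketch, not part of the original article):
SELECT
    ACTUAL_LABEL,
    PREDICTED_LABEL,
    count() AS days
FROM power_demand_results
GROUP BY
    ACTUAL_LABEL,
    PREDICTED_LABEL
ORDER BY
    ACTUAL_LABEL,
    PREDICTED_LABEL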
Conclusion
In this article we have demonstrated how we can take time series data and classify it into one of two labels: winter or summer. We trained a logistic regression model on the training partition and evaluated it on the test partition, achieving an impressive 97% accuracy.
Though the classification task sounds trivial on the surface, we explained how it becomes harder once spring and autumn are taken into account, and noted that the average power consumption for the two labelled periods is very similar.
Again, this analysis was completed entirely within ClickHouse, with only a little Python used to render visualisations in the notebook.
Sample Notebook
A notebook describing the full worked example can be found at this URL.