Developing machine learning models can require and produce significant amounts of data. This includes training datasets, results of experiments, and data that is captured during the subsequent inference process which will be used to monitor how models are performing in production.
Though much of this data tends to be stored on cloud object stores such as AWS S3, we believe that the machine learning workflow can be enhanced by including a high-performance OLAP database such as ClickHouse.
Data Engineering - By using ClickHouse, data engineers can give data scientists access to their data as clean, well-structured tables instead of unstructured CSV or Parquet files;
Faster Exploratory Analysis - Data scientists begin the machine learning development process by performing exploratory analysis against their datasets, querying the data from multiple angles to understand its shape and characteristics. This kind of interactive querying can benefit from a fast OLAP database;
Performance - Scalability and performance for managing very large training datasets and batch inference results;
Feature Store - A database such as ClickHouse can act as a feature store, which involves saving data for reuse and reproducibility;
Batch Inference Results - ClickHouse is a good location for storing batch inference results, which can then be rendered on a dashboard or accessed from an application efficiently;
Monitoring and Observability - The ability to capture and store observability data about deployed models;
Retrieval Augmented Generation - RAG involves enhancing the results of language models with details specific to your organisation. A high-performance database with vector capabilities can support RAG scenarios in which models are connected to databases.
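To make the last item a little more concrete: the retrieval step in RAG amounts to ranking stored embeddings by their distance to a query embedding and passing the closest documents to the model as context. Below is a minimal sketch in plain Python. The document names and three-dimensional vectors are made up for illustration (real embeddings come from an embedding model and have hundreds of dimensions), and the helper names are our own. In ClickHouse, the same ranking would be expressed in SQL with a distance function such as cosineDistance.

```python
import math

# Hypothetical document embeddings, invented for this sketch.
DOCS = {
    "returns_policy": [0.9, 0.1, 0.0],
    "shipping_times": [0.1, 0.9, 0.0],
    "gift_cards": [0.0, 0.2, 0.9],
}


def cosine_distance(a, b):
    """Return 1 - cosine similarity; smaller values mean more similar vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, docs, k=2):
    """Return the names of the k documents closest to the query embedding."""
    ranked = sorted(docs, key=lambda name: cosine_distance(query, docs[name]))
    return ranked[:k]


# Hypothetical embedding of a user question about returning an order.
query_embedding = [0.8, 0.3, 0.1]
context_docs = top_k(query_embedding, DOCS, k=2)
print(context_docs)
```

The sketch scans every vector, which is fine for a handful of documents. The argument for a database with vector capabilities is that the same ranking can be run over far larger collections, alongside ordinary SQL filters.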
In short, the entire machine learning lifecycle benefits from being backed by a high-performance, SQL-based database such as ClickHouse. This idea has not yet received much attention, but with the growth of AI and ML, we suspect that more people will come to the same conclusion.