In this lesson we will:
- Learn about the Snowflake billing model and credit system;
- Explain the various cost visibility and cost control features of Snowflake;
- Share some high-level information about how to minimise and control costs.
Consumption Based Pricing
Snowflake offers a genuine consumption-based pricing model where you pay only for the services that you use.
There are three classes of resource which you will use and pay for:
Compute
When you execute queries or updates, you need processing power to carry out the requests. You pay for this compute on a per-second basis, with the rate depending on the amount of compute you provision - e.g. an X-Large warehouse for 180 seconds.
In practice, the compute is usually the largest cost in running a Snowflake deployment.
Storage
The data in your Snowflake instance needs to be held on a persistent store which exists even when you are not running any compute resources. You pay for this storage based on the volume held - e.g. 108 GB of data - at a flat rate per terabyte per month.
Cloud Services
In order to work with your account, you will carry out various activities such as logging in, administering users and running scheduled jobs. These are billed by Snowflake as a separate line item.
Credit System
As discussed in a previous lesson, Snowflake is billed using a credit system. As you use Snowflake, your compute and cloud services usage consumes credits, which are then billed at a per-credit cost at month end.
The per credit cost is a function of the tier you are on, the cloud provider and the cloud provider region.
For instance, at the time of writing, a user with their Snowflake instance deployed in the AWS us-east region would pay the following per-credit cost:
Tier | Credit Cost |
---|---|
Standard | $2.60 |
Enterprise | $3.90 |
Business Critical | $5.20 |
Whereas the same deployment in Google Cloud Platform London region would cost:
Tier | Credit Cost |
---|---|
Standard | $2.70 |
Enterprise | $4.00 |
Business Critical | $5.40 |
This makes GCP slightly more expensive with regard to your Snowflake credits if you need to host in that environment.
Virtual Warehouse Costs
Your compute costs will likely be the largest part of your Snowflake bill. This is because storage is cheap, and because Snowflake passes its storage costs along without a significant markup.
Snowflake offers a number of virtual warehouse sizes, structured on a T-shirt size basis, each of which has an associated credits-per-hour cost.
Warehouse Size | Credits Per Hour |
---|---|
X-Small | 1 |
Small | 2 |
Medium | 4 |
Large | 8 |
X-Large | 16 |
2X-Large | 32 |
3X-Large | 64 |
4X-Large | 128 |
5X-Large | 256 |
6X-Large | 512 |
Combining the two tables above, we can work out the cost per hour of running a 2X-Large warehouse in AWS us-east:
Tier | Credit Cost | Warehouse Size | Warehouse Credits Per Hour | Warehouse Cost Per Hour |
---|---|---|---|---|
Standard | $2.60 | 2X-Large | 32 | $83.20 |
Enterprise | $3.90 | 2X-Large | 32 | $124.80 |
Business Critical | $5.20 | 2X-Large | 32 | $166.40 |
And the same for GCP:
Tier | Credit Cost | Warehouse Size | Warehouse Credits Per Hour | Warehouse Cost Per Hour |
---|---|---|---|---|
Standard | $2.70 | 2X-Large | 32 | $86.40 |
Enterprise | $4.00 | 2X-Large | 32 | $128.00 |
Business Critical | $5.40 | 2X-Large | 32 | $172.80 |
This shows that your actual costs are a function of the choices you make and your actual consumption.
This can be challenging for businesses, as the final monthly bill can be unpredictable, depending on how heavily users actually use Snowflake. This is a common risk of consumption-based pricing models.
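To make this concrete, a back-of-the-envelope monthly estimate can be run as plain SQL. The usage figures below (8 hours a day, 22 working days) are hypothetical, using the AWS us-east Standard tier price from the tables above:

```sql
-- hypothetical monthly estimate: a 2X-Large warehouse (32 credits/hour)
-- at $2.60/credit, running 8 hours a day for 22 working days
select 32 * 8 * 22        as credits_consumed,  -- 5632 credits
       32 * 8 * 22 * 2.60 as usd_per_month;     -- 14643.20 USD
```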
Bespoke Arrangements
For larger deployments, it is possible to achieve discounts through direct negotiation with Snowflake. This will likely require pre-payment or agreements for compute and storage consumption.
Snowflake Tables For Billing
Snowflake makes a great deal of information about accrued usage available through views in the SNOWFLAKE.ACCOUNT_USAGE schema. We will now take a quick tour to highlight the main ones:
WAREHOUSE_METERING_HISTORY
```sql
select * from snowflake.account_usage.warehouse_metering_history;
```
The WAREHOUSE_METERING_HISTORY view records warehouse credit usage in hourly windows, showing how many credits each warehouse consumed for compute and for cloud services in each period.
Note that this view can take up to 3 hours to be updated.
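For example, here is a sketch of totalling credits per warehouse over the last 30 days, using the column names of the ACCOUNT_USAGE version of the view:

```sql
select warehouse_name,
       sum(credits_used)                as total_credits,
       sum(credits_used_compute)        as compute_credits,
       sum(credits_used_cloud_services) as cloud_services_credits
from snowflake.account_usage.warehouse_metering_history
where start_time >= dateadd('day', -30, current_timestamp())
group by warehouse_name
order by total_credits desc;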
QUERY_HISTORY
select * from snowflake.account_usage.query_history;
The QUERY_HISTORY view can be joined with the WAREHOUSE_METERING_HISTORY view to understand which queries ultimately consume the most compute and therefore generate the highest costs.
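A rough sketch of such a join is below. Note that credits are metered per warehouse-hour rather than per query, so this only places each query inside an hourly window whose total credit spend is known - it is a starting point for attribution, not an exact per-query cost:

```sql
select q.query_id,
       q.query_text,
       q.total_elapsed_time,
       m.credits_used as credits_for_that_hour
from snowflake.account_usage.query_history q
join snowflake.account_usage.warehouse_metering_history m
  on q.warehouse_name = m.warehouse_name
 and q.start_time >= m.start_time
 and q.start_time <  m.end_time
order by m.credits_used desc, q.total_elapsed_time desc;
```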
DATABASE_STORAGE_USAGE_HISTORY
```sql
select * from snowflake.account_usage.database_storage_usage_history;
```
The DATABASE_STORAGE_USAGE_HISTORY view gives us information about storage usage, and therefore storage costs, per database over time.
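For example, a sketch of average storage per database over the last 30 days:

```sql
select database_name,
       avg(average_database_bytes) / power(1024, 3) as avg_storage_gb
from snowflake.account_usage.database_storage_usage_history
where usage_date >= dateadd('day', -30, current_date())
group by database_name
order by avg_storage_gb desc;
```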
Controlling Costs
Snowflake provides several specific features that can help you control costs. These features include:
- Multi-cluster warehouses: Snowflake allows you to create multiple clusters within a single warehouse, each with its own set of compute resources. This allows you to scale your compute resources up or down as needed, and to allocate resources to specific workloads or users. By doing so, you can avoid over-provisioning and paying for more compute than you actually need.
- Automatic scaling: Snowflake automatically scales compute resources up or down based on your workload. This means you don't need to manually provision or de-provision compute resources to handle spikes or dips in demand. Automatic scaling allows you to optimise resource utilisation and avoid paying for idle resources.
- Resource monitoring: Snowflake provides detailed resource monitoring and usage reporting, which allows you to track your resource usage and costs over time. You can use this information to identify areas where you can optimise your resource usage and reduce costs.
- Time travel: Snowflake's time travel feature allows you to access historical data from your data warehouse. This can help you avoid creating and maintaining separate data copies for historical analysis, which saves storage costs.
- Automatic query optimisation: Snowflake's query optimisation engine automatically optimises queries to minimise resource usage. This can help you reduce query runtime and avoid over-provisioning compute resources.
- Data sharing: Snowflake's data sharing feature allows you to share data securely with other Snowflake accounts, without the need for data replication. This can help you avoid data duplication and reduce storage costs.
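Beyond these, two practical levers worth knowing are aggressive auto-suspend settings and resource monitors, which cap credit consumption. A sketch is shown below, assuming a warehouse named MY_WH (a hypothetical name):

```sql
-- suspend the warehouse after 60 seconds of inactivity, resuming on demand
alter warehouse my_wh set auto_suspend = 60 auto_resume = true;

-- cap monthly spend: notify at 80% of the credit quota,
-- suspend the warehouse at 100%
create resource monitor monthly_cap with credit_quota = 100
  triggers on 80 percent do notify
           on 100 percent do suspend;

alter warehouse my_wh set resource_monitor = monthly_cap;
```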
Aggregation Queries In Spark
Spark provides several built-in functions for performing aggregation queries on your data. Here are some examples of how to use Spark to perform common aggregations such as group by, count, sum and others:
Group by and count:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read data from a CSV file
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

# Group by a column and count the number of occurrences in each group
df.groupBy("column_name").count().show()
```
Group by and sum:
```python
# Group by a column and sum the values in another column
df.groupBy("column_name").agg({"other_column": "sum"}).show()
```
Group by and average:
```python
# Group by a column and calculate the average of the values in another column
df.groupBy("column_name").agg({"other_column": "avg"}).show()
```
Group by multiple columns:
```python
# Group by multiple columns and count the number of occurrences in each group
df.groupBy("column_name_1", "column_name_2").count().show()
```
Using SQL-like syntax:
```python
# Register the DataFrame as a temporary view
df.createOrReplaceTempView("my_table")

# Perform a group by query using SQL syntax
spark.sql("SELECT column_name, COUNT(*) FROM my_table GROUP BY column_name").show()
```