Understanding Google BigQuery Columnar Storage Architecture
Traditional RDBMS systems use a Row-level storage - The entire row is stored together, so it can be retrieved. A columnar database stores data by columns rather than by rows, which makes it suitable for analytical query processing, and thus for data warehouses
- BigQuery stores data in a proprietary columnar format called Capacitor, which has a number of benefits for data warehouse workloads.
- Each column in the table is stored in a separate file block and all the columns are stored in a single capacitor file, which are compressed and encrypted on disk.
- BigQuery dynamically uses query access patterns to determine the optimal number of physical shards
Colossus - Google’s distributed file system
The data persistence layer is provided by Google’s distributed file system, Colossus, where data is automatically compressed, encrypted, replicated, and distributed. Colossus ensures durability using erasure encoding to store redundant chunks of data on multiple physical disks
Long-Term Storage vs Active Storage
There are two storage classes (Matters during billing)
If you have a table or partition that is not modified for 90 consecutive days, it is considered long term storage and the price of storage for that table automatically drops by 50%. Discount is applied on a per-table, per-partition basis. If you modify the data in the table, the 90-day counter resets
If you have a table or partition modified in the last 90 days, it is considered as active storage and incurs a standard monthly storage charge
NOTE: Reading a table doesn't reset the storage clock. You can read the table as many times, but only when you write into a partition the 90 day clock is reset
Data Organization in Bigquery
The data is organized in Tables, Views, Datasets and Projects. Dataset is synonymous to Database. Project is a collection of Datasets and other GCP resources. The entire stack is tightly coupled with IAM access controls
Best file format to Load data into Bigquery
Google Supports multiple source file formats. You can load CSV, JSON, GZIP, AVRO and ARROW file formats into Bigquery. There is no priority of one format over another, but Google seems to prefer AVRO source files wherever possible
Understanding Bigquery Storage Pricing
Although the Bigquery stores the data in compressed format. While billing the customers, Google charges for the uncompressed size (At the time of writing this tutorial). Meaning if you have a table with 100GB compressed size (350GB uncompressed), You will have to pay for the 350GB storage cost.
You can find more on the Bigquery Storage pricing here