Learn Bigquery
Introduction
Architecture
Architecture Compute Storage
Unique Features
Enncryption Column Level Security Time Travel Caching Compression
Database Management
Create Table Create View Create Procedure
Best Practices
Execution plans
Interview Questions
Top Questions
03 Dec 20 · mCollins · #Bigquery ·   Bookmark  

Understanding Google BigQuery Columnar Storage Architecture


Traditional RDBMS systems use a Row-level storage - The entire row is stored together, so it can be retrieved.  A columnar database stores data by columns rather than by rows, which makes it suitable for analytical query processing, and thus for data warehouses

  • BigQuery stores data in a proprietary columnar format called Capacitor, which has a number of benefits for data warehouse workloads. 
  • Each column in the table is stored in a separate file block and all the columns are stored in a single capacitor file, which are compressed and encrypted on disk. 
  • BigQuery dynamically uses query access patterns to determine the optimal number of physical shards

Colossus - Google’s distributed file system

The data persistence layer is provided by Google’s distributed file system, Colossus, where data is automatically compressed, encrypted, replicated, and distributed. Colossus ensures durability using erasure encoding to store redundant chunks of data on multiple physical disks

Long-Term Storage vs Active Storage

There are two storage classes (Matters during billing)

Long-Term Storage

If you have a table or partition that is not modified for 90 consecutive days, it is considered long term storage and the price of storage for that table automatically drops by 50%. Discount is applied on a per-table, per-partition basis. If you modify the data in the table, the 90-day counter resets

Active Storage

If you have a table or partition modified in the last 90 days, it is considered as active storage and incurs a standard monthly storage charge

NOTE: Reading a table doesn't reset the storage clock. You can read the table as many times, but only when you write into a partition the 90 day clock is reset

Data Organization in Bigquery


The data is organized in Tables, Views, Datasets and Projects. Dataset is synonymous to Database. Project is a collection of Datasets and other GCP resources. The entire stack is tightly coupled with IAM access controls

Best file format to Load data into Bigquery

Google Supports multiple source file formats. You can load CSV, JSON, GZIP, AVRO and ARROW file formats into Bigquery. There is no priority of one format over another, but Google seems to prefer AVRO source files wherever possible

Understanding Bigquery Storage Pricing

Although the Bigquery stores the data in compressed format. While billing the customers, Google charges for the uncompressed size (At the time of writing this tutorial). Meaning if you have a table with 100GB compressed size (350GB uncompressed), You will have to pay for the 350GB storage cost.

You can find more on the Bigquery Storage pricing here