The Ultimate BigQuery Cheat Sheet: Commands for Data Engineers

Google BigQuery is one of the most popular cloud-based data warehouses, widely used for large-scale analytics. As a Data Engineer, knowing the right commands not only saves time but also ensures efficient resource utilization and cost optimization.

Here are the Top 20 BigQuery commands with examples you should add to your toolkit.


1. Show All Datasets in a Project

SELECT schema_name  
FROM `<project_id>.INFORMATION_SCHEMA.SCHEMATA`;

2. List All Tables in a Dataset

SELECT table_name  
FROM `<project_id>.<dataset>.INFORMATION_SCHEMA.TABLES`;

3. Get Table Metadata (Size, Rows, Created Date)

(INFORMATION_SCHEMA.TABLES doesn’t expose row counts or size; the legacy __TABLES__ meta-table does.)

SELECT table_id, row_count, size_bytes/1024/1024 AS size_mb, TIMESTAMP_MILLIS(creation_time) AS created  
FROM `<project_id>.<dataset>.__TABLES__`;

4. Database (Dataset) Size

SELECT SUM(size_bytes)/1024/1024/1024 AS dataset_size_gb  
FROM `<project_id>.<dataset>.__TABLES__`;

5. Query Execution History

SELECT user_email, query, start_time, end_time, total_bytes_processed/1024/1024/1024 AS processed_gb  
FROM `<project_id>.region-<location>.INFORMATION_SCHEMA.JOBS`  
WHERE state='DONE'  
ORDER BY start_time DESC  
LIMIT 10;

6. Long-Running Queries

SELECT query, TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_seconds, total_bytes_processed/1024/1024/1024 AS processed_gb  
FROM `<project_id>.region-<location>.INFORMATION_SCHEMA.JOBS`  
WHERE state='DONE'  
ORDER BY duration_seconds DESC  
LIMIT 5;

7. Identify Cached Queries

SELECT query, cache_hit  
FROM `<project_id>.region-<location>.INFORMATION_SCHEMA.JOBS`  
WHERE state='DONE'  
ORDER BY start_time DESC  
LIMIT 10;

8. Inspect a Query’s Execution Plan

(BigQuery has no EXPLAIN statement. The execution plan is shown in the console’s Execution Details tab, or available from the Jobs API once the query has run.)

bq show --format=prettyjson -j <job_id>

9. Estimate Query Cost Before Running

(Use a dry run: it validates the query and reports how many bytes would be processed, without executing or billing.)

bq query --use_legacy_sql=false --dry_run \
'SELECT * FROM `<project_id>.<dataset>.<table>` WHERE column_name = "value"'
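Once you know how many bytes a query will scan (e.g. from a `bq query --dry_run` or the JOBS view’s total_bytes_processed), on-demand cost is simple arithmetic. A minimal sketch; the $6.25-per-TiB US on-demand rate is an assumption, so verify current pricing for your region, and `estimate_query_cost_usd` is an illustrative helper, not a BigQuery API:

```python
# Assumed US on-demand rate; check current BigQuery pricing for your region.
PRICE_PER_TIB_USD = 6.25

def estimate_query_cost_usd(total_bytes_processed: int) -> float:
    """Convert bytes scanned (e.g. from a dry run) to an on-demand USD estimate."""
    tib = total_bytes_processed / (1024 ** 4)  # bytes -> TiB
    return round(tib * PRICE_PER_TIB_USD, 4)

# Example: a query scanning 500 GiB
print(estimate_query_cost_usd(500 * 1024 ** 3))  # ≈ 3.0518 USD
```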

10. Preview Table Data Without Charges

(A SELECT … LIMIT still bills for every column scanned; LIMIT doesn’t reduce bytes processed. Use the console’s Preview tab or bq head, which are free.)

bq head -n 10 <project_id>:<dataset>.<table>

11. Partition Details of a Table

SELECT partition_id, total_rows, total_logical_bytes/1024/1024 AS size_mb  
FROM `<project_id>.<dataset>.INFORMATION_SCHEMA.PARTITIONS`  
WHERE table_name='your_table';

12. Clustering Details of a Table

SELECT clustering_ordinal_position, column_name  
FROM `<project_id>.<dataset>.INFORMATION_SCHEMA.COLUMNS`  
WHERE table_name='your_table'  
  AND clustering_ordinal_position IS NOT NULL  
ORDER BY clustering_ordinal_position;

13. Active Sessions Monitoring

SELECT session_id, principal_subject, creation_time, is_active  
FROM `<project_id>.region-<location>.INFORMATION_SCHEMA.SESSIONS_BY_PROJECT`  
WHERE is_active;

14. Find Duplicate Rows

SELECT column1, COUNT(*)  
FROM `<project_id>.<dataset>.<table>`  
GROUP BY column1  
HAVING COUNT(*) > 1;
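Counting duplicates is step one; to keep exactly one row per key, a common pattern is ROW_NUMBER with QUALIFY. A sketch assuming `column1` is the dedup key and `load_time` decides which copy survives (both placeholder names):

```sql
SELECT *
FROM `<project_id>.<dataset>.<table>`
WHERE TRUE  -- BigQuery requires a WHERE/GROUP BY/HAVING clause alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY column1
  ORDER BY load_time DESC  -- keep the most recently loaded row
) = 1;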

15. Detect Missing Values

SELECT COUNT(*)  
FROM `<project_id>.<dataset>.<table>`  
WHERE column_name IS NULL;
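Running one NULL count per column gets tedious on wide tables; COUNTIF profiles several at once. A sketch with placeholder column names:

```sql
SELECT
  COUNT(*) AS total_rows,
  COUNTIF(column_a IS NULL) AS column_a_nulls,
  COUNTIF(column_b IS NULL) AS column_b_nulls,
  ROUND(COUNTIF(column_a IS NULL) / COUNT(*) * 100, 2) AS column_a_null_pct
FROM `<project_id>.<dataset>.<table>`;
```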

16. Terminate a Running Query

(BigQuery doesn’t support KILL; cancel the job instead, from SQL, the console, or the bq CLI.)

CALL BQ.JOBS.CANCEL('<job_id>');

bq cancel <job_id>

17. Index-Like Optimization (Using Clustering)

CREATE OR REPLACE TABLE `<project_id>.<dataset>.<table>`  
PARTITION BY DATE(timestamp_column)  
CLUSTER BY column1, column2 AS  
SELECT * FROM `<project_id>.<dataset>.<raw_table>`;
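Partitioning and clustering only pay off when queries filter on those columns, so BigQuery can prune the data it reads. A sketch of a query that scans a single day’s partition, using the same column names as the example above (the date literal is a placeholder):

```sql
SELECT column1, COUNT(*) AS cnt
FROM `<project_id>.<dataset>.<table>`
WHERE DATE(timestamp_column) = '2024-01-01'  -- prunes to one partition
  AND column1 = 'value'                      -- clustering narrows the scan further
GROUP BY column1;
```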

18. Get Column Data Types

SELECT column_name, data_type  
FROM `<project_id>.<dataset>.INFORMATION_SCHEMA.COLUMNS`  
WHERE table_name='your_table';

19. Check User Permissions

(SCHEMA_PRIVILEGES isn’t a BigQuery view; OBJECT_PRIVILEGES covers dataset- and table-level grants.)

SELECT grantee, privilege_type, object_name, object_type  
FROM `<project_id>.region-<location>.INFORMATION_SCHEMA.OBJECT_PRIVILEGES`  
WHERE object_schema = '<dataset>';

20. Top Users by Query Usage

SELECT user_email, COUNT(*) AS query_count, SUM(total_bytes_processed)/1024/1024/1024 AS total_gb  
FROM `<project_id>.region-<location>.INFORMATION_SCHEMA.JOBS`  
WHERE state='DONE'  
GROUP BY user_email  
ORDER BY total_gb DESC;

Conclusion:
BigQuery is powerful, but cost and performance optimization rely heavily on understanding how your queries work. These commands will give you visibility into your datasets, queries, and overall usage—helping you become a more efficient Data Engineer.
