BigQuery’s robust functionality often goes unnoticed, with analysts sticking to conventional SQL queries. However, beneath the surface lies a set of advanced functions that can unlock unparalleled insights and streamline workflows. These features offer capabilities beyond standard SQL, making BigQuery a powerful tool for tackling complex data challenges.
This article explores five underrated BigQuery functions, highlighting their syntax, practical examples, and business use cases. Alongside these functions, we’ll provide actionable insights for addressing common ML transformation issues, helping you enhance decision-making and make the most of BigQuery’s full potential.
BigQuery goes beyond traditional data analysis, offering a suite of advanced functions to handle complex challenges easily. These tools empower data analysts to uncover hidden insights, improve model performance, and address data inconsistencies directly within the platform.
By applying these functions, businesses can make smarter, data-driven decisions without relying on additional tools.
BigQuery’s advanced features cater to diverse analytical needs, from predictive modeling to geospatial analysis. This section highlights key functions that simplify workflows and elevate data analysis to the next level.
BigQuery’s advanced capabilities extend far beyond basic queries, offering powerful yet underutilized functions that can transform how analysts work with data. In this section, we explore five lesser-known BigQuery functions, their syntax, and practical business applications.
ML.DESCRIBE_DATA generates descriptive statistics for table columns or subqueries, including min, max, mean, median, unique values, and quantiles. Ideal for feature preprocessing or model monitoring, it’s essential for analyzing training or serving data in ML workflows.
This is the syntax for the ML.DESCRIBE_DATA function, which generates detailed statistics for a table or query to support feature analysis and data preprocessing.
ML.DESCRIBE_DATA(
  { TABLE `project_id.dataset.table` | (query_statement) },
  STRUCT(
    [num_quantiles AS num_quantiles]
    [, num_array_length_quantiles AS num_array_length_quantiles]
    [, top_k AS top_k])
)
Here:
- num_quantiles: the number of quantiles to compute for numerical columns.
- num_array_length_quantiles: the number of quantiles to compute for the lengths of array columns.
- top_k: the number of most frequent values to report for categorical columns.
Let's assume we have a dataset containing historical sales data, including columns for weekly sales, advertising spend, and seasonal tags. Using ML.DESCRIBE_DATA, you can analyze this data to identify key patterns and ensure data readiness for your forecasting model.
SELECT *
FROM ML.DESCRIBE_DATA(
  TABLE `owox-d-ikrasovytskyi-001.OWOX_Demo.Sales_Data`,
  STRUCT(5 AS num_quantiles, 3 AS top_k)
);
Here, 5 AS num_quantiles splits each numerical column, such as weekly sales and advertising spend, into five quantiles, and 3 AS top_k returns the three most frequent values for categorical columns like the seasonal tags.
This analysis helps validate whether predictors like advertising spend align with sales trends and allows you to confirm the influence of seasonality, ensuring reliable input for the forecasting model.
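Since the function also accepts a query statement in place of a table, you can profile a filtered slice of the data. Here is a minimal sketch, assuming a hypothetical seasonal_tag column in the same demo table:
SELECT *
FROM ML.DESCRIBE_DATA(
  (SELECT *
   FROM `owox-d-ikrasovytskyi-001.OWOX_Demo.Sales_Data`
   WHERE seasonal_tag = 'holiday'),  -- hypothetical column and value
  STRUCT(5 AS num_quantiles)
);
This lets you compare the statistics of a single segment, such as holiday weeks, against the profile of the full table.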
The ML.VALIDATE_DATA_SKEW function identifies data skew by comparing a model's training and serving data statistics. It highlights anomalous differences, enabling effective model monitoring and ensuring consistency between training and real-world data for reliable predictions. Note that it requires a trained model, because the training data statistics it compares against are computed when the model is created.
First, here is the syntax for creating the training model:
CREATE OR REPLACE MODEL `your_project.your_dataset.your_model_name`
OPTIONS(
  model_type = 'logistic_reg',           -- Logistic Regression for binary classification
  input_label_cols = ['target_column'],  -- Column to predict
  auto_class_weights = TRUE,             -- Automatically balances class weights
  data_split_method = 'RANDOM',          -- Splits data randomly
  data_split_eval_fraction = 0.2         -- Uses 20% of data for evaluation
) AS
SELECT
  feature_1,
  feature_2,
  feature_3,
  ...,
  target_column                          -- Binary target variable (0 or 1)
FROM `your_project.your_dataset.your_table`;
Then the following syntax can be used for the ML.VALIDATE_DATA_SKEW function, which identifies and analyzes data skew by comparing statistics between the training and serving datasets.
ML.VALIDATE_DATA_SKEW(
  MODEL `project_id.dataset.model`,
  { TABLE `project_id.dataset.table` | (query_statement) },
  STRUCT(
    [categorical_default_threshold AS categorical_default_threshold]
    [, categorical_metric_type AS categorical_metric_type]
    [, numerical_default_threshold AS numerical_default_threshold]
    [, numerical_metric_type AS numerical_metric_type]
    [, thresholds AS thresholds])
)
Here:
- categorical_default_threshold: the default anomaly threshold applied to categorical features.
- categorical_metric_type: the metric used to compare categorical features, such as L_INFTY or JENSEN_SHANNON_DIVERGENCE.
- numerical_default_threshold: the default anomaly threshold applied to numerical features.
- numerical_metric_type: the metric used to compare numerical features, such as JENSEN_SHANNON_DIVERGENCE.
- thresholds: an array of per-feature thresholds that override the defaults.
Imagine you’ve deployed a customer churn prediction model for an e-commerce platform, and you want to validate that the serving data aligns with the training data used to build the model.
CREATE MODEL `owox-d-ikrasovytskyi-001.OWOX_Demo.training_model`
OPTIONS(
  model_type = 'logistic_reg',                  -- Logistic Regression for binary classification
  input_label_cols = ['HighValueTransaction'],  -- Column to predict
  auto_class_weights = TRUE,                    -- Automatically balances class weights
  data_split_method = 'RANDOM',                 -- Splits data randomly
  data_split_eval_fraction = 0.2                -- Uses 20% of data for evaluation
) AS
SELECT
  transaction_id,
  customer_id,
  transaction_amount,
  product_category,
  customer_location,
  -- Define a binary label: high-value transactions (above $300)
  CASE
    WHEN transaction_amount > 300 THEN 1
    ELSE 0
  END AS HighValueTransaction
FROM `owox-d-ikrasovytskyi-001.OWOX_Demo.Training_dataset`;
Here’s how you can use ML.VALIDATE_DATA_SKEW to identify any discrepancies.
SELECT *
FROM ML.VALIDATE_DATA_SKEW(
  MODEL `owox-d-ikrasovytskyi-001.OWOX_Demo.training_model`,
  TABLE `owox-d-ikrasovytskyi-001.OWOX_Demo.Serving_Dataset`,
  STRUCT(0.2 AS categorical_default_threshold)
);
Here, 0.2 AS categorical_default_threshold flags any categorical feature, such as product_category or customer_location, whose training and serving distributions differ by more than 0.2.
The ML.VALIDATE_DATA_DRIFT function detects data drift by comparing statistics between two serving datasets. It identifies anomalous differences, ensuring that data changes over time don’t negatively impact model performance or decision-making processes.
This is the syntax for the ML.VALIDATE_DATA_DRIFT function, which compares two serving datasets to identify and analyze data drift.
ML.VALIDATE_DATA_DRIFT(
  { TABLE `project_id.dataset.base_table` | (base_query_statement) },
  { TABLE `project_id.dataset.study_table` | (study_query_statement) },
  STRUCT(
    [num_histogram_buckets AS num_histogram_buckets]
    [, num_quantiles_histogram_buckets AS num_quantiles_histogram_buckets]
    [, num_values_histogram_buckets AS num_values_histogram_buckets]
    [, num_rank_histogram_buckets AS num_rank_histogram_buckets]
    [, categorical_default_threshold AS categorical_default_threshold]
    [, categorical_metric_type AS categorical_metric_type]
    [, numerical_default_threshold AS numerical_default_threshold]
    [, numerical_metric_type AS numerical_metric_type]
    [, thresholds AS thresholds])
)
Here:
- num_histogram_buckets, num_quantiles_histogram_buckets, and num_values_histogram_buckets control the granularity of the histograms computed for numerical features.
- num_rank_histogram_buckets controls the rank histograms computed for categorical features.
- The threshold and metric options behave as in ML.VALIDATE_DATA_SKEW, with the base table acting as the reference dataset.
Imagine you’re running an online marketplace and monitoring user behavior data to ensure your recommendation engine remains effective. You want to compare a snapshot of last month’s serving data with the current serving data to identify shifts in user preferences or behavior patterns.
SELECT *
FROM ML.VALIDATE_DATA_DRIFT(
  TABLE `owox-d-ikrasovytskyi-001.OWOX_Demo.User_Behavior_Data_Previous`,
  TABLE `owox-d-ikrasovytskyi-001.OWOX_Demo.User_Behavior_Data_Current`,
  STRUCT(0.2 AS categorical_default_threshold)
);
Here, last month’s snapshot is the base dataset, the current data is the study dataset, and 0.2 AS categorical_default_threshold sets the sensitivity for flagging drift in categorical features.
This analysis helps detect significant changes in user behavior, enabling timely updates to your recommendation model for improved relevance and customer experience.
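To track drift over time rather than inspecting a one-off result, you can persist each run’s output alongside a timestamp. Here is a minimal sketch; the Drift_History table name is an assumption:
-- First run: create the history table from the drift check
-- (hypothetical table name)
CREATE TABLE IF NOT EXISTS `owox-d-ikrasovytskyi-001.OWOX_Demo.Drift_History` AS
SELECT
  CURRENT_TIMESTAMP() AS checked_at,  -- when this drift check ran
  *
FROM ML.VALIDATE_DATA_DRIFT(
  TABLE `owox-d-ikrasovytskyi-001.OWOX_Demo.User_Behavior_Data_Previous`,
  TABLE `owox-d-ikrasovytskyi-001.OWOX_Demo.User_Behavior_Data_Current`,
  STRUCT(0.2 AS categorical_default_threshold)
);
On subsequent runs, replace the CREATE TABLE with an INSERT into the same table, and schedule the query to run after each data refresh.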
The ML.TFDV_DESCRIBE function generates detailed statistics for table columns, including distributions, missing values, and data types. It’s ideal for assessing data quality and preparing features for machine learning models with precision and efficiency.
This is the syntax for the ML.TFDV_DESCRIBE function, which generates detailed statistics for table columns to assess data quality and prepare features for machine learning.
ML.TFDV_DESCRIBE(
  { TABLE `project_id.dataset.table` | (query_statement) },
  STRUCT(
    [num_histogram_buckets AS num_histogram_buckets]
    [, num_quantiles_histogram_buckets AS num_quantiles_histogram_buckets]
    [, num_values_histogram_buckets AS num_values_histogram_buckets]
    [, num_rank_histogram_buckets AS num_rank_histogram_buckets])
)
Here, the four optional STRUCT fields set how many buckets are used for the standard, quantile, value-count, and rank histograms in the generated statistics.
Let’s assume you’re onboarding a new sales dataset and need to ensure it aligns with the schema expected by a trained sales forecasting model. This example analyzes the schema and distribution of the dataset, using 10 buckets for rank histograms to evaluate categorical features like product categories.
SELECT *
FROM ML.TFDV_DESCRIBE(
  TABLE `owox-d-ikrasovytskyi-001.OWOX_Demo.New_Sales_Data`,
  STRUCT(10 AS num_rank_histogram_buckets)
);
Here, 10 AS num_rank_histogram_buckets builds ten-bucket rank histograms, giving a clearer view of the most common values in categorical columns such as product categories.
This process helps verify that the dataset matches the expected schema, minimizing the risks of inconsistencies in the sales forecasting model’s performance.
The ML.TFDV_VALIDATE function compares statistics between training and serving data or two serving datasets to detect anomalies. It mirrors the TensorFlow Data Validation (TFDV) validate_statistics API, supporting model monitoring by identifying inconsistencies that may impact model reliability.
This is the syntax for the ML.TFDV_VALIDATE function, which compares statistics between datasets and detects anomalies for effective model monitoring.
ML.TFDV_VALIDATE(
  base_statistics,
  study_statistics
  [, detection_type]
  [, categorical_default_threshold]
  [, categorical_metric_type]
  [, numerical_default_threshold]
  [, numerical_metric_type]
  [, thresholds]
)
Here:
- base_statistics: the baseline statistics as JSON, for example the training data statistics produced by ML.TFDV_DESCRIBE.
- study_statistics: the statistics of the dataset being checked, also as JSON.
- detection_type: 'SKEW' to compare training and serving data, or 'DRIFT' to compare two serving datasets.
- The remaining arguments set default thresholds, metric types, and per-feature thresholds, as in ML.VALIDATE_DATA_SKEW.
In this example, we validate a new customer transaction dataset against the statistics of the training data to detect anomalies or missing fields before the data is used for predictions. First, create the table where the serving statistics will be stored for tracking:
CREATE TABLE `owox-d-ikrasovytskyi-001.OWOX_Demo.serve_stats` (
  timestamp TIMESTAMP,
  dataset_feature_statistics_list JSON
);
Next, generate statistics for both datasets with ML.TFDV_DESCRIBE, compare them with ML.TFDV_VALIDATE, and apply custom thresholds for specific features:
DECLARE training_stats JSON;
DECLARE serving_stats JSON;

-- Generate schema and statistics for training data
SET training_stats = (
  SELECT *
  FROM ML.TFDV_DESCRIBE(
    TABLE `owox-d-ikrasovytskyi-001.OWOX_Demo.Training_dataset`
  )
);

-- Generate schema and statistics for serving data
SET serving_stats = (
  SELECT *
  FROM ML.TFDV_DESCRIBE(
    TABLE `owox-d-ikrasovytskyi-001.OWOX_Demo.Serving_Dataset`
  )
);

-- Validate serving data against training statistics
SELECT ML.TFDV_VALIDATE(
  training_stats,
  serving_stats,
  'SKEW',
  0.3,
  'L_INFTY',
  0.3,
  'JENSEN_SHANNON_DIVERGENCE',
  [('transaction_amount', 0.2), ('product_category', 0.5)]
);

-- Store the serving statistics for future comparisons
INSERT `owox-d-ikrasovytskyi-001.OWOX_Demo.serve_stats`
  (timestamp, dataset_feature_statistics_list)
SELECT CURRENT_TIMESTAMP() AS timestamp, serving_stats;
Here:
- ML.TFDV_DESCRIBE computes TFDV-format statistics for the training and serving tables and stores them in JSON variables.
- ML.TFDV_VALIDATE compares the two sets of statistics in 'SKEW' mode with a default threshold of 0.3, plus custom thresholds of 0.2 for transaction_amount and 0.5 for product_category.
- The final INSERT archives the serving statistics with a timestamp so future runs can be compared against them.
This process ensures the new transaction data aligns with the expected schema and highlights potential issues before model predictions.
Data transformations play a critical role in machine learning workflows, but they can often introduce errors or inconsistencies. Here are some common issues and how to address them.
⚠️ Problem: Mismatched data types occur when incompatible types, like text and numbers, are used in operations. This often results in errors or incorrect outcomes during data transformations.
✅ Solution: Use functions like CAST() or SAFE_CAST() in SQL to convert types or Python’s astype() method for proper alignment. Ensuring type consistency avoids errors and supports accurate model training.
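For instance, here is a minimal sketch using SAFE_CAST, which returns NULL instead of failing when a value can’t be converted (table and column names are hypothetical):
SELECT
  order_id,
  SAFE_CAST(revenue AS FLOAT64) AS revenue_numeric   -- NULL if not parseable
FROM `your_project.your_dataset.raw_sales`
WHERE SAFE_CAST(revenue AS FLOAT64) IS NOT NULL;     -- drop unparseable rows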
⚠️ Problem: Missing values in datasets can disrupt data transformations, lead to errors during model training, and impact predictions. If left unaddressed, they compromise data quality and reduce the reliability of machine learning models.
✅ Solution: Handle missing data by removing rows/columns with dropna(), using imputation (mean, median, mode), applying forward/backward fill for time-series, or predicting values with interpolation.
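In BigQuery itself, mean imputation takes a single window function. A minimal sketch, with hypothetical table and column names:
SELECT
  weekly_sales,
  -- Replace NULL ad spend with the column mean; AVG ignores NULLs
  COALESCE(ad_spend, AVG(ad_spend) OVER ()) AS ad_spend_filled
FROM `your_project.your_dataset.sales`
WHERE weekly_sales IS NOT NULL;  -- drop rows missing the label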
⚠️ Problem: Features with inconsistent scales can bias models or degrade performance, especially in algorithms like SVM or KNN that rely on distance calculations.
✅ Solution: Use scaling methods like Scikit-learn’s StandardScaler or MinMaxScaler to normalize data. These ensure consistent scaling across features, improving model accuracy and reliability.
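If you prefer to stay in SQL, z-score standardization is straightforward with window functions (BigQuery ML also provides ML.STANDARD_SCALER for use inside a TRANSFORM clause). A minimal sketch with hypothetical names:
SELECT
  transaction_id,
  -- Subtract the mean and divide by the standard deviation;
  -- NULLIF guards against division by zero for constant columns
  (transaction_amount - AVG(transaction_amount) OVER ())
    / NULLIF(STDDEV(transaction_amount) OVER (), 0) AS amount_scaled
FROM `your_project.your_dataset.transactions`;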
⚠️ Problem: Identifying the root cause of errors during complex data transformations can be challenging, hindering debugging and troubleshooting processes.
✅ Solution: Use GoogleSQL’s ERROR function to produce custom error messages during query execution, enabling precise identification of unexpected values or scenarios. Combine ERROR with conditional logic (e.g., CASE or IF) to isolate issues effectively.
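A minimal sketch (column names are hypothetical): the query aborts with a descriptive message as soon as an invalid value is encountered.
SELECT
  CASE
    WHEN transaction_amount < 0
      -- Fail fast and name the offending row in the error message
      THEN ERROR(FORMAT('Negative amount in transaction %t', transaction_id))
    ELSE transaction_amount
  END AS validated_amount
FROM `your_project.your_dataset.transactions`;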
⚠️ Problem: Missing values, treated as NULL in BigQuery ML, can lead to incomplete model training or errors during predictions. They may result from unseen or unrecorded data, causing unreliable model outcomes.
✅ Solution: Use imputation techniques to replace missing values with substitutes like mean, median, or mode for numerical data and the most frequent value for categorical data.
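BigQuery ML’s ML.IMPUTER preprocessing function covers both cases; a minimal sketch with hypothetical column names:
SELECT
  ML.IMPUTER(transaction_amount, 'mean') OVER () AS amount_imputed,           -- numeric: fill with mean
  ML.IMPUTER(product_category, 'most_frequent') OVER () AS category_imputed  -- categorical: fill with mode
FROM `your_project.your_dataset.transactions`;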
⚠️ Problem: Repeating or redundant transformation steps in data workflows can lead to inefficiencies, increased processing time, and cluttered code.
✅ Solution: Streamline processes by using BigQuery ML’s TRANSFORM clause or frameworks like Spark to create reusable, efficient transformation pipelines. This reduces duplication, improves readability, and ensures scalable workflows.
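The TRANSFORM clause of CREATE MODEL embeds preprocessing in the model itself, so the same steps are reapplied automatically at prediction time instead of being duplicated in every query. A minimal sketch with hypothetical model and column names:
CREATE OR REPLACE MODEL `your_project.your_dataset.churn_model`
TRANSFORM(
  -- Scaling defined once here is reapplied automatically by ML.PREDICT
  ML.STANDARD_SCALER(transaction_amount) OVER () AS amount_scaled,
  product_category,
  label
)
OPTIONS(model_type = 'logistic_reg', input_label_cols = ['label']) AS
SELECT transaction_amount, product_category, label
FROM `your_project.your_dataset.training_table`;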
BigQuery's versatile functions simplify complex data analysis, helping you perform advanced calculations, manipulate data, and extract actionable insights.
OWOX Reports bridges the gap between BigQuery’s powerful analytics and the simplicity of Google Sheets. It allows users to query BigQuery data directly from Google Sheets, eliminating the need for complex scripts or manual data exports.
Additionally, OWOX Reports empowers analysts to visualize BigQuery results in an intuitive, spreadsheet-friendly format. You can easily create dashboards, share insights with stakeholders, and customize reports without extensive technical expertise.
By integrating OWOX Reports into your workflow, you save time, reduce errors, and unlock the full potential of BigQuery in a familiar environment.
Functions like ML.PREDICT, ML.EVALUATE, and ML.RECOMMEND simplify machine learning tasks directly in BigQuery, often bypassing the need for external tools.
BigQuery ML functions automate model evaluation, predictions, and data validation, streamlining workflows, reducing processing time, and enabling faster insights without additional infrastructure.
Yes, functions like ML.TFDV_DESCRIBE, ML.TFDV_VALIDATE, and ML.VALIDATE_DATA_SKEW support data quality analysis, anomaly detection, and consistency checks on machine learning datasets.
Use debugging functions like ERROR to generate custom error messages and log transformation steps with conditional logic (e.g., CASE, IF) to pinpoint errors efficiently.