
Uncover 5 Underrated BigQuery Functions Every Data Analyst Must Know

BigQuery’s robust functionality often goes unnoticed, with analysts sticking to conventional SQL queries. However, beneath the surface lies a set of advanced functions that can unlock unparalleled insights and streamline workflows. These features offer capabilities beyond standard SQL, making BigQuery a powerful tool for tackling complex data challenges.


This article explores five underrated BigQuery functions, highlighting their syntax, practical examples, and business use cases. Alongside these functions, we’ll provide actionable insights for addressing common ML transformation issues, helping you enhance decision-making and make the most of BigQuery’s full potential.

Understanding the Power of BigQuery’s Advanced Functions

BigQuery goes beyond traditional data analysis, offering a suite of advanced functions to handle complex challenges easily. These tools empower data analysts to uncover hidden insights, improve model performance, and address data inconsistencies directly within the platform. 

By applying these functions, businesses can make smarter, data-driven decisions without relying on additional tools.

BigQuery’s advanced features cater to diverse analytical needs, from predictive modeling to geospatial analysis. This section highlights key functions that simplify workflows and elevate data analysis to the next level.

5 Rarely Known BigQuery Functions You Need to Know

BigQuery’s advanced capabilities extend far beyond basic queries, offering powerful yet underutilized functions that can transform how analysts work with data. In this section, we explore five rarely known BigQuery functions, their syntax, and practical business applications.

ML.DESCRIBE_DATA

ML.DESCRIBE_DATA generates descriptive statistics for table columns or subqueries, including min, max, mean, median, unique values, and quantiles. Ideal for feature preprocessing or model monitoring, it’s essential for analyzing training or serving data in ML workflows.

ML.DESCRIBE_DATA Syntax

This is the syntax for the ML.DESCRIBE_DATA function, which generates detailed statistics for a table or query to support feature analysis and data preprocessing.

ML.DESCRIBE_DATA(
  { TABLE `project_id.dataset.table` | (query_statement) },
  STRUCT(
    [num_quantiles AS num_quantiles]
    [, num_array_length_quantiles AS num_array_length_quantiles]
    [, top_k AS top_k])
)

Here:

  • TABLE project_id.dataset.table: Specifies the table to analyze. Alternatively, you can use a query_statement to provide a subquery for analysis.
  • STRUCT: A structure that defines optional parameters for the analysis.
  • num_quantiles: The number of quantiles to compute for numerical columns. 
  • num_array_length_quantiles: Calculates quantiles for array lengths in columns containing array data.
  • top_k: Limits the number of top elements returned for categorical data, such as the most frequent values in a column.

ML.DESCRIBE_DATA Example

Let's assume we have a dataset containing historical sales data, including columns for weekly sales, advertising spend, and seasonal tags. Using ML.DESCRIBE_DATA, you can analyze this data to identify key patterns and ensure data readiness for your forecasting model.

SELECT *
FROM ML.DESCRIBE_DATA(
  TABLE `owox-d-ikrasovytskyi-001.OWOX_Demo.Sales_Data`,
  STRUCT(5 AS num_quantiles, 3 AS top_k)
);

Analyzing Historical Sales Data with ML.DESCRIBE_DATA.

Here:

  • owox-d-ikrasovytskyi-001.OWOX_Demo.Sales_Data: Represents the dataset containing columns such as weekly sales, marketing expenses, and seasonality indicators.
  • 5 AS num_quantiles: Calculates five quantiles for numerical columns like weekly sales and advertising spend to assess their distribution.
  • 3 AS top_k: Returns the top three most frequent values for categorical columns, such as seasonality tags (e.g., "Holiday", "Off-Season", "Peak").

This analysis helps validate whether predictors like advertising spend align with sales trends and allows you to confirm the influence of seasonality, ensuring reliable input for the forecasting model.

ML.VALIDATE_DATA_SKEW

The ML.VALIDATE_DATA_SKEW function identifies data skew by comparing a model's training and serving data statistics. It highlights anomalous differences, enabling effective model monitoring and ensuring consistency between training and real-world data for reliable predictions. However, it requires a trained model to provide the training data statistics.

ML.VALIDATE_DATA_SKEW Syntax

First, here is the syntax for creating the training model:

CREATE OR REPLACE MODEL `your_project.your_dataset.your_model_name`
OPTIONS(
  model_type = 'logistic_reg',  -- Logistic Regression for binary classification
  input_label_cols = ['target_column'],  -- Column to predict
  auto_class_weights = TRUE,  -- Automatically balances class weights
  data_split_method = 'RANDOM',  -- Splits data randomly
  data_split_eval_fraction = 0.2  -- Uses 20% of data for evaluation
) AS
SELECT
  feature_1,
  feature_2,
  feature_3,
  ...,
  target_column  -- Binary target variable (0 or 1)
FROM `your_project.your_dataset.your_table`;

Then the following syntax can be used for the ML.VALIDATE_DATA_SKEW function, which identifies and analyzes data skew by comparing statistics between training and serving datasets.

ML.VALIDATE_DATA_SKEW(
  MODEL `project_id.dataset.model`,
  { TABLE `project_id.dataset.table` | (query_statement) },
  STRUCT(
    [categorical_default_threshold AS categorical_default_threshold]
    [, categorical_metric_type AS categorical_metric_type]
    [, numerical_default_threshold AS numerical_default_threshold]
    [, numerical_metric_type AS numerical_metric_type]
    [, thresholds AS thresholds])
)

Here:

  • MODEL project_id.dataset.model: Specifies the ML model whose data skew is being analyzed.
  • TABLE project_id.dataset.table: Represents the serving dataset or a query result for comparison.
  • STRUCT: Defines optional parameters.
  • categorical_default_threshold: Sets the threshold for categorical data skew.
  • categorical_metric_type: Chooses the metric type for categorical data (e.g., "JS Divergence").
  • numerical_default_threshold: Sets the threshold for numerical data skew.
  • numerical_metric_type: Selects the metric type for numerical data (e.g., "Wasserstein Distance").
  • thresholds: Customizes thresholds for specific features to refine the analysis.

ML.VALIDATE_DATA_SKEW Example

Imagine you’ve deployed a customer churn prediction model for an e-commerce platform, and you want to validate that the serving data aligns with the training data used to build the model.

CREATE MODEL `owox-d-ikrasovytskyi-001.OWOX_Demo.training_model`
OPTIONS(
  model_type = 'logistic_reg',  -- Logistic Regression for binary classification
  input_label_cols = ['HighValueTransaction'],  -- Column to predict
  auto_class_weights = TRUE,  -- Automatically balances class weights
  data_split_method = 'RANDOM',  -- Splits data randomly
  data_split_eval_fraction = 0.2  -- Uses 20% of data for evaluation
) AS
SELECT
  transaction_id,
  customer_id,
  transaction_amount,
  product_category,
  customer_location,
  -- Define a binary label: High-value transactions (e.g., above $300)
  CASE
    WHEN transaction_amount > 300 THEN 1
    ELSE 0
  END AS HighValueTransaction
FROM `owox-d-ikrasovytskyi-001.OWOX_Demo.Training_dataset`;

Creation of a logistic regression model for ML.VALIDATE_DATA_SKEW.

Here’s how you can use ML.VALIDATE_DATA_SKEW to identify any discrepancies.

SELECT *
FROM ML.VALIDATE_DATA_SKEW(
  MODEL `owox-d-ikrasovytskyi-001.OWOX_Demo.training_model`,
  TABLE `owox-d-ikrasovytskyi-001.OWOX_Demo.Serving_Dataset`,
  STRUCT(0.2 AS categorical_default_threshold)
);

Validating data skew for the churn model with ML.VALIDATE_DATA_SKEW.

 

Here:

  • MODEL: Creates a logistic regression model predicting HighValueTransaction.
  • model_type = 'logistic_reg': Logistic regression for binary classification.
  • input_label_cols = ['HighValueTransaction']: Column to predict.
  • auto_class_weights = TRUE: Balances class weights automatically.
  • data_split_method = 'RANDOM': Random data split for training and evaluation.
  • data_split_eval_fraction = 0.2: 20% data for evaluation.
  • SELECT Statement: Retrieves transaction_id, customer_id, transaction_amount, product_category, customer_location, and defines HighValueTransaction based on transaction_amount.
  • FROM: Dataset used for training the model.
  • ML.VALIDATE_DATA_SKEW: Compares the serving dataset Serving_Dataset against the training data statistics captured when the model was created.
  • STRUCT(0.2 AS categorical_default_threshold): Flags categorical features whose skew metric exceeds 0.2.

ML.VALIDATE_DATA_DRIFT

The ML.VALIDATE_DATA_DRIFT function detects data drift by comparing statistics between two serving datasets. It identifies anomalous differences, ensuring that data changes over time don’t negatively impact model performance or decision-making processes.

ML.VALIDATE_DATA_DRIFT Syntax

This is the syntax for the ML.VALIDATE_DATA_DRIFT function, which compares two serving datasets to identify and analyze data drift.

ML.VALIDATE_DATA_DRIFT(
  { TABLE `project_id.dataset.base_table` | (base_query_statement) },
  { TABLE `project_id.dataset.study_table` | (study_query_statement) },
  STRUCT(
    [num_histogram_buckets AS num_histogram_buckets]
    [, num_quantiles_histogram_buckets AS num_quantiles_histogram_buckets]
    [, num_values_histogram_buckets AS num_values_histogram_buckets]
    [, num_rank_histogram_buckets AS num_rank_histogram_buckets]
    [, categorical_default_threshold AS categorical_default_threshold]
    [, categorical_metric_type AS categorical_metric_type]
    [, numerical_default_threshold AS numerical_default_threshold]
    [, numerical_metric_type AS numerical_metric_type]
    [, thresholds AS thresholds])
)

Here:

  • TABLE base_table and study_table: The two datasets to compare for drift; either tables or query results can be used.
  • STRUCT: Optional parameters for drift analysis.
  • num_histogram_buckets: Number of buckets for histogram-based comparisons.
  • num_quantiles_histogram_buckets: Number of quantiles for histogram comparison.
  • num_values_histogram_buckets: Number of histogram buckets for unique values.
  • num_rank_histogram_buckets: Number of histogram buckets for rank-based analysis.
  • categorical_default_threshold: Threshold for drift detection in categorical features.
  • categorical_metric_type: Metric type for categorical data, like "JS Divergence."
  • numerical_default_threshold: Threshold for numerical data drift detection.
  • numerical_metric_type: Metric for numerical drift, like "Wasserstein Distance."
  • thresholds: Specific thresholds for individual features to refine drift analysis.

ML.VALIDATE_DATA_DRIFT Example

Imagine you’re running an online marketplace and monitoring user behavior data to ensure your recommendation engine remains effective. You want to compare a snapshot of last month’s serving data with the current serving data to identify shifts in user preferences or behavior patterns.

SELECT *
FROM ML.VALIDATE_DATA_DRIFT(
  TABLE `owox-d-ikrasovytskyi-001.OWOX_Demo.User_Behavior_Data_Previous`,
  TABLE `owox-d-ikrasovytskyi-001.OWOX_Demo.User_Behavior_Data_Current`,
  STRUCT(0.2 AS categorical_default_threshold)
);

  

Validating user behavior patterns with ML.VALIDATE_DATA_DRIFT.

Here:

  • SELECT *: Retrieves all columns from the result of the ML.VALIDATE_DATA_DRIFT function.
  • TABLE User_Behavior_Data_Previous: Specifies the baseline (previous) dataset for the drift comparison.
  • TABLE User_Behavior_Data_Current: Specifies the current dataset to compare against the baseline.
  • STRUCT(0.2 AS categorical_default_threshold): Sets the threshold for categorical data drift to 0.2.

This analysis helps detect significant changes in user behavior, enabling timely updates to your recommendation model for improved relevance and customer experience.

ML.TFDV_DESCRIBE

The ML.TFDV_DESCRIBE function generates detailed statistics for table columns, including distributions, missing values, and data types. It’s ideal for assessing data quality and preparing features for machine learning models with precision and efficiency.

ML.TFDV_DESCRIBE Syntax

This is the syntax for the ML.TFDV_DESCRIBE function that generates detailed statistics for table columns to assess data quality and prepare features for machine learning.

ML.TFDV_DESCRIBE(
  { TABLE `project_id.dataset.table` | (query_statement) },
  STRUCT(
    [num_histogram_buckets AS num_histogram_buckets]
    [, num_quantiles_histogram_buckets AS num_quantiles_histogram_buckets]
    [, num_values_histogram_buckets AS num_values_histogram_buckets]
    [, num_rank_histogram_buckets AS num_rank_histogram_buckets])
)

Here:

  • TABLE project_id.dataset.table: Specifies the table or subquery for which statistics are generated.
  • STRUCT: Defines optional parameters for the statistical analysis:
  • num_histogram_buckets: Sets the number of buckets for value distribution histograms.
  • num_quantiles_histogram_buckets: Specifies the number of quantiles for histogram distribution.
  • num_values_histogram_buckets: Limits the number of histogram buckets for unique values.
  • num_rank_histogram_buckets: Sets the number of rank-based histogram buckets.

ML.TFDV_DESCRIBE Example

Let’s assume you’re onboarding a new sales dataset and need to ensure it aligns with the schema expected by a trained sales forecasting model. This example analyzes the schema and distribution of the dataset, using 10 buckets for rank histograms to evaluate categorical features like product categories.

SELECT *
FROM ML.TFDV_DESCRIBE(
  TABLE `owox-d-ikrasovytskyi-001.OWOX_Demo.New_Sales_Data`,
  STRUCT(10 AS num_rank_histogram_buckets)
);

 

Validating Sales Data Schema with ML.TFDV_DESCRIBE.

Here:

  • SELECT *: Retrieves all columns from the result of the ML.TFDV_DESCRIBE function.
  • TABLE: Specifies the dataset to describe and analyze.
  • STRUCT(10 AS num_rank_histogram_buckets): Sets the number of histogram buckets for ranking to 10.

This process helps verify that the dataset matches the expected schema, minimizing the risks of inconsistencies in the sales forecasting model’s performance.

ML.TFDV_VALIDATE

The ML.TFDV_VALIDATE function compares statistics between training and serving data, or between two serving datasets, to detect anomalies. It replicates the TensorFlow Data Validation (TFDV) validate_statistics API, supporting model monitoring by identifying inconsistencies that may impact model reliability.

ML.TFDV_VALIDATE Syntax

This is the syntax for the ML.TFDV_VALIDATE function, which compares statistics between datasets and detects anomalies for effective model monitoring.

ML.TFDV_VALIDATE(
  base_statistics,
  study_statistics
  [, detection_type]
  [, categorical_default_threshold]
  [, categorical_metric_type]
  [, numerical_default_threshold]
  [, numerical_metric_type]
  [, thresholds]
)

Here:

  • base_statistics: The statistics of the baseline dataset, such as training data.
  • study_statistics: The statistics of the dataset being compared, such as serving data.
  • detection_type (optional): Specifies the type of anomaly detection, such as "FEATURE_SCHEMA" or "DISTRIBUTION_CHANGE."
  • categorical_default_threshold (optional): Sets the default threshold for detecting anomalies in categorical features.
  • categorical_metric_type (optional): Defines the metric used for categorical comparison, like "JS Divergence."
  • numerical_default_threshold (optional): Sets the threshold for detecting anomalies in numerical features.
  • numerical_metric_type (optional): Specifies the metric for numerical feature comparison, such as "Wasserstein Distance."
  • thresholds (optional): Custom thresholds for specific features to refine anomaly detection.

ML.TFDV_VALIDATE Example

In this example, we validate a new customer transaction dataset against the schema used during training to detect any anomalies or missing fields. Using the ML.TFDV_VALIDATE function, we compare the statistics and schema of the serving data with the training dataset. First, we create the table where the statistics comparison data will be stored.

CREATE TABLE `owox-d-ikrasovytskyi-001.OWOX_Demo.serve_stats` (
  timestamp TIMESTAMP,
  dataset_feature_statistics_list JSON
);

SQL query to create a table where ML.TFDV_VALIDATE data will be stored after validation.

Before deploying it for predictions, let’s assume you are validating a new customer transaction dataset against the expected schema used during training to detect anomalies or missing fields. This example uses ML.TFDV_VALIDATE to identify issues and apply custom thresholds for specific features.

DECLARE training_stats JSON;
DECLARE serving_stats JSON;

-- Generate schema and statistics for training data
SET training_stats = (
  SELECT *
  FROM ML.TFDV_DESCRIBE(
    TABLE `owox-d-ikrasovytskyi-001.OWOX_Demo.Training_dataset`
  )
);

-- Generate schema and statistics for serving data
SET serving_stats = (
  SELECT *
  FROM ML.TFDV_DESCRIBE(
    TABLE `owox-d-ikrasovytskyi-001.OWOX_Demo.Serving_Dataset`
  )
);

-- Validate serving data against training schema
SELECT ML.TFDV_VALIDATE(
  training_stats,
  serving_stats,
  'SKEW',
  0.3,
  'L_INFTY',
  0.3,
  'JENSEN_SHANNON_DIVERGENCE',
  [('transaction_amount', 0.2), ('product_category', 0.5)]
);

-- Store validation statistics for tracking
INSERT `owox-d-ikrasovytskyi-001.OWOX_Demo.serve_stats`
  (timestamp, dataset_feature_statistics_list)
SELECT CURRENT_TIMESTAMP() AS timestamp, training_stats;

Reviewing Customer Transaction Data with ML.TFDV_VALIDATE.

Here:

  • DECLARE training_stats JSON: Declares a variable training_stats to hold the schema and statistics for the training data.
  • DECLARE serving_stats JSON: Declares a variable serving_stats to hold the schema and statistics for the serving data.
  • SET training_stats = (SELECT * FROM ML.TFDV_DESCRIBE): Retrieves the schema and statistics for the training dataset Training_dataset and stores it in training_stats.
  • SET serving_stats = (SELECT * FROM ML.TFDV_DESCRIBE): Retrieves the schema and statistics for the serving dataset Serving_Dataset and stores it in serving_stats.
  • SELECT ML.TFDV_VALIDATE: Compares the serving_stats against training_stats and validates the schema using the specified detection type, metrics, and default thresholds.
  • Custom thresholds: Sets feature-specific thresholds, where transaction_amount has a threshold of 0.2 and product_category has a threshold of 0.5.
  • INSERT: Inserts the training_stats along with the current timestamp into the serve_stats table for tracking purposes.

This process ensures the new transaction data aligns with the expected schema and highlights potential issues before model predictions.

Troubleshooting Transformation Issues in ML Functions

Data transformations play a critical role in machine learning workflows, but they can often introduce errors or inconsistencies. Here are some common issues and how to address them.

Data Type Mismatches

⚠️ Problem: Mismatched data types occur when incompatible types, like text and numbers, are used in operations. This often results in errors or incorrect outcomes during data transformations. 

✅ Solution: Use functions like CAST() or SAFE_CAST() in SQL to convert types or Python’s astype() method for proper alignment. Ensuring type consistency avoids errors and supports accurate model training.
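For instance, a minimal BigQuery sketch using SAFE_CAST, assuming a hypothetical raw_orders table with a string revenue column, would look like this:

-- SAFE_CAST returns NULL instead of failing when a value such as 'N/A'
-- cannot be converted to a number (hypothetical table and column names).
SELECT
  order_id,
  SAFE_CAST(revenue AS FLOAT64) AS revenue_numeric
FROM `your_project.your_dataset.raw_orders`;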

Missing or Null Values

⚠️Problem: Missing values in datasets can disrupt data transformations, lead to errors during model training, and impact predictions. If left unaddressed, they compromise data quality and reduce the reliability of machine learning models.

✅ Solution: Handle missing data by removing rows/columns with dropna(), using imputation (mean, median, mode), applying forward/backward fill for time-series, or predicting values with interpolation. 
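Inside BigQuery itself, a minimal mean-imputation sketch (hypothetical table and column names) could look like this:

-- Replace NULL advertising_spend values with the column mean (hypothetical names).
SELECT
  * EXCEPT (advertising_spend),
  IFNULL(
    advertising_spend,
    (SELECT AVG(advertising_spend) FROM `your_project.your_dataset.sales_data`)
  ) AS advertising_spend
FROM `your_project.your_dataset.sales_data`;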

Improper Feature Scaling

⚠️Problem: Features with inconsistent scales can bias models or degrade performance, especially in algorithms like SVM or KNN that rely on distance calculations.

✅Solution: Use scaling methods like Scikit-learn’s StandardScaler or MinMaxScaler to normalize data. These ensure consistent scaling across features, improving model accuracy and reliability.
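If you prefer to keep the scaling step in BigQuery rather than Scikit-learn, a minimal z-score sketch with window functions (hypothetical table and column names) is:

-- Standardize transaction_amount to zero mean and unit variance (hypothetical names).
SELECT
  customer_id,
  (transaction_amount - AVG(transaction_amount) OVER ())
    / STDDEV(transaction_amount) OVER () AS transaction_amount_scaled
FROM `your_project.your_dataset.transactions`;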

Unclear Error Sources During Transformations

⚠️Problem: Identifying the root cause of errors during complex data transformations can be challenging, hindering debugging and troubleshooting processes. 

✅Solution: Use GoogleSQL’s ERROR function to produce custom error messages during query execution, enabling precise identification of unexpected values or scenarios. Combine ERROR with conditional logic (e.g., CASE or IF) to isolate issues effectively. 
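For example, a minimal sketch combining CASE with ERROR (hypothetical table and column names) fails fast when a value falls outside the expected range:

-- Raise a descriptive error as soon as an unexpected value is encountered (hypothetical names).
SELECT
  CASE
    WHEN transaction_amount < 0
      THEN ERROR(FORMAT('Negative transaction_amount for transaction %t', transaction_id))
    ELSE transaction_amount
  END AS transaction_amount_checked
FROM `your_project.your_dataset.transactions`;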

Missing Values in Data

⚠️Problem: Missing values, treated as NULL in BigQuery ML, can lead to incomplete model training or errors during predictions. They may result from unseen or unrecorded data, causing unreliable model outcomes.

✅Solution: Use imputation techniques to replace missing values with substitutes like mean, median, or mode for numerical data and most frequent values for categorical data. 
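For categorical columns, a minimal sketch that fills gaps with the most frequent value (hypothetical table and column names) might look like this:

-- Fill NULL product_category values with the most frequent category (hypothetical names).
SELECT
  * EXCEPT (product_category),
  IFNULL(
    product_category,
    (SELECT product_category
     FROM `your_project.your_dataset.transactions`
     WHERE product_category IS NOT NULL
     GROUP BY product_category
     ORDER BY COUNT(*) DESC
     LIMIT 1)
  ) AS product_category
FROM `your_project.your_dataset.transactions`;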

Redundant or Repetitive Transformation Steps

⚠️Problem: Repeating or redundant transformation steps in data workflows can lead to inefficiencies, increased processing time, and cluttered code.

✅Solution: Streamline processes by using SQL’s TRANSFORM clause or frameworks like Spark to create reusable, efficient transformation pipelines. This reduces duplication, improves readability, and ensures scalable workflows.
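As a sketch of BigQuery ML's TRANSFORM clause (hypothetical project, dataset, model, and column names), preprocessing is declared once at training time and reused automatically at prediction time:

-- Preprocessing defined in TRANSFORM is applied both during training and at prediction time.
CREATE OR REPLACE MODEL `your_project.your_dataset.churn_model`
TRANSFORM(
  ML.STANDARD_SCALER(transaction_amount) OVER () AS transaction_amount_scaled,
  product_category,
  label
)
OPTIONS(
  model_type = 'logistic_reg',
  input_label_cols = ['label']
) AS
SELECT
  transaction_amount,
  product_category,
  label
FROM `your_project.your_dataset.transactions`;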

Unlock the Power of BigQuery Functions

BigQuery's versatile functions simplify complex data analysis, helping you perform advanced calculations, manipulate data, and extract actionable insights. 

Gain Deeper Insights with the OWOX Reports Extension for Google Sheets

The OWOX Reports extension bridges the gap between BigQuery’s powerful analytics and the simplicity of Google Sheets. It allows users to query BigQuery data directly from Google Sheets, eliminating the need for complex scripts or manual data exports.

Additionally, OWOX Reports empowers analysts to visualize BigQuery results in an intuitive, spreadsheet-friendly format. You can easily create dashboards, share insights with stakeholders, and customize reports without extensive technical expertise.

By integrating OWOX Reports into your workflow, you save time, reduce errors, and unlock the full potential of BigQuery in a familiar environment.

FAQ

What are some underrated BigQuery functions that analysts often overlook?
How can BigQuery functions improve data analysis efficiency in ML workflows?
Are there specific BigQuery functions for advanced machine-learning tasks?
How do I debug issues in BigQuery ML transformations?
