BigQuery ML in Action: Transforming Data into Decisions

Wondering how to turn your data into decisions quickly and easily? BigQuery ML gives data analysts and business intelligence teams the ability to build, train, and deploy machine learning models directly within their data environment, streamlining the process from raw data to actionable insights - without the need for complex setups.


By integrating with Vertex AI, BigQuery ML improves predictive accuracy, simplifies data workflows, and helps organizations efficiently manage machine learning projects at scale.

What is BigQuery ML?

BigQuery ML integrates machine learning directly into Google BigQuery, allowing users to create and deploy models without needing to move data to external services. This setup speeds up development by letting data analysts and machine learning practitioners work within the familiar BigQuery environment.

By eliminating data transfers to other Google Cloud AI services, BigQuery ML simplifies workflows, boosts productivity, and streamlines the process of developing and deploying machine learning models.

Key Advantages of BigQuery ML

BigQuery ML simplifies machine learning by enabling data analysts to build models within Google BigQuery using SQL, working directly on the large datasets already stored there.

  1. BigQuery ML simplifies machine learning by allowing data analysts to build and run models using SQL. While its SQL capabilities are powerful, certain models and advanced use cases may still require Python or other languages and tools, especially when integrating with Vertex AI for enhanced functionalities.
  2. Empowers data analysts to leverage AI and ML directly within their familiar business intelligence tools and spreadsheets.
  3. Democratizes access to machine learning, enabling predictive analytics to guide decision-making across the organization.
  4. Eliminates the need to move data, speeding up model development and reducing complexity.
  5. Accelerates deployment by bringing machine learning to the data, eliminating the need for external tools and data transfers.

Understanding BigQuery ML Models and Their Types

BigQuery ML supports various model types for different use cases, including:

  1. Internally Trained Models: Businesses can create machine learning models within BigQuery for faster insights, leveraging models such as Linear and Logistic Regression, K-Means Clustering, Matrix Factorization, Principal Component Analysis (PCA), and Time Series (ARIMA_PLUS) forecasting. More specialized algorithms outside this list may still require external libraries or custom code.

  2. Externally Trained Models (via Vertex AI): BigQuery ML also supports model types such as DNN, Wide & Deep, Autoencoder, Boosted Tree, Random Forest, and AutoML. These are still created with the CREATE MODEL statement in BigQuery ML, but the training itself runs on Vertex AI infrastructure behind the scenes; once trained, the models can be used for prediction in BigQuery like any other model.

  3. Remote Models: Hosted on Vertex AI endpoints or Cloud AI services and referenced from BigQuery through a connection; prediction requests are sent to the remote endpoint, so charges apply both for BigQuery processing and for the remote service (see the pricing section below).

  4. Imported Models: Pre-trained models in formats such as ONNX, TensorFlow, and XGBoost can be imported into BigQuery for seamless predictions (a minimal example follows this list).
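To illustrate the imported-model option, here is a minimal sketch of importing a TensorFlow SavedModel from Cloud Storage; the bucket path, model name, and input table are placeholders rather than part of the original example:

CREATE OR REPLACE MODEL `bqml.imported_tf_model`
  OPTIONS (MODEL_TYPE = 'TENSORFLOW',
           MODEL_PATH = 'gs://your-bucket/exported_model/*');

-- Once imported, the model is queried like any native BigQuery ML model
SELECT * FROM ML.PREDICT(MODEL `bqml.imported_tf_model`, TABLE `bqml.new_data`);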

How to Create ML Models in BigQuery

A model in BigQuery ML represents learned patterns from training data. It analyzes data, identifies trends, and makes predictions or classifications based on historical data to inform future outcomes.

Prerequisites for BigQuery ML:

#1: GCP Account: Sign up and create a project.

#2: BigQuery API: Enable the BigQuery API in your project via the API Library. 

Additionally, ensure that billing is active for your project and that you have the necessary roles – such as BigQuery Data Editor and BigQuery Job User – to create and manage models effectively. 

These roles provide the permissions required to handle data and execute machine learning jobs within BigQuery.
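If you prefer the command line, a minimal sketch of these prerequisites using the gcloud CLI could look like this (PROJECT_ID and USER_EMAIL are placeholders):

# Enable the BigQuery API for the project
gcloud services enable bigquery.googleapis.com --project=PROJECT_ID

# Grant the roles needed to manage data and run ML jobs
gcloud projects add-iam-policy-binding PROJECT_ID --member=user:USER_EMAIL --role=roles/bigquery.dataEditor
gcloud projects add-iam-policy-binding PROJECT_ID --member=user:USER_EMAIL --role=roles/bigquery.jobUser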

Step 1: Select a BigQuery Dataset

In the BigQuery console, start by selecting or creating a dataset to hold your data and machine learning models. 

If you need to create a new one, click your project name and choose the "Create dataset" option.
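Alternatively, a dataset can be created from the command line with the bq tool; the sketch below assumes a dataset named bqml, which matches the model names used later in this article:

bq mk --location=US --dataset PROJECT_ID:bqml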

BigQuery console displaying a list of datasets with the 'Create dataset' option highlighted for adding a new dataset.

For this article, we'll use the pre-cleaned and processed 'Google Analytics Sample' dataset from the BigQuery Public Datasets, adding it to our project for convenience.

Google Analytics Sample dataset selected from BigQuery Public Datasets for machine learning model training.

Step 2: Build Your First Machine Learning Model

To create a logistic regression model in BigQuery, use the CREATE MODEL statement. 

Example:

This example demonstrates how to define and train a model to predict whether a transaction occurred based on user session data from the Google Analytics sample dataset.

CREATE OR REPLACE MODEL `bqml.new_sample_model`
OPTIONS(model_type='logistic_reg', input_label_cols=['purchase_made'])
AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS purchase_made,  -- Target variable (binary classification)
  IFNULL(device.browser, "") AS browser,                   -- Input feature: Browser type
  device.deviceCategory AS device_category,                -- Input feature: Device category (mobile, desktop, tablet)
  IFNULL(geoNetwork.region, "") AS region,                 -- Input feature: Region
  IFNULL(totals.hits, 0) AS total_hits                     -- Input feature: Total hits count
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20171231'
LIMIT 150000;

Here:

  • CREATE OR REPLACE MODEL bqml.new_sample_model: Creates a logistic regression model in BigQuery.
  • purchase_made: Target variable (declared via input_label_cols); 0 for no purchase, 1 for a purchase.
  • browser, device_category, region, total_hits: Input features for the model.
  • Dataset: Google Analytics public dataset.
  • Date Filter: Filters data between January 1, 2017, and December 31, 2017.
  • LIMIT 150000: Restricts data to 150,000 rows for model training.

How to Train and Deploy a BigQuery ML Model

To train and deploy a BigQuery ML model, first prepare and load your dataset. Then, use SQL to create and train the model. Once trained, evaluate its performance and deploy it for predictions.

Step 1: Prepare Your Training Dataset

Before training your model, ensure your dataset is clean and organized. Create a BigQuery dataset to store your data and models, and load the relevant data (e.g., Google Analytics 4 export tables). Make sure the data location (US/EU) matches your compliance requirements.

After loading the data, you can perform exploratory data analysis to gain insights or visualize it using Google Data Studio to better understand it. This prepares your dataset for the next steps in model training.

Step 2: Feature Engineering using the TRANSFORM Statement

In BigQuery ML, you can enhance model training using the TRANSFORM clause for feature engineering and data preprocessing. Instead of relying solely on raw data, the TRANSFORM clause lets you define data transformations directly in the CREATE MODEL statement, and the same preprocessing is then applied automatically at prediction time.

This enables both feature selection and engineering, providing the model with more relevant and refined features. The TRANSFORM clause helps data analysts and BI professionals ensure that their models use the most meaningful data, leading to improved predictions and analysis accuracy.

BigQuery ML CREATE MODEL statement using the TRANSFORM clause to perform feature engineering and data preprocessing, refining input features for improved model accuracy.
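As a concrete sketch of what such a statement can look like, the query below builds on the Google Analytics sample data used earlier; the scaling and bucketizing choices are purely illustrative:

CREATE OR REPLACE MODEL `bqml.sample_transform_model`
TRANSFORM(
  ML.STANDARD_SCALER(total_hits) OVER() AS scaled_hits,          -- standardize a numeric feature
  ML.QUANTILE_BUCKETIZE(pageviews, 4) OVER() AS pageview_bucket, -- bucketize pageviews into quartiles
  device_category,                                               -- pass a categorical feature through unchanged
  purchase_made                                                  -- label column must be kept in the output
)
OPTIONS(model_type = 'logistic_reg', input_label_cols = ['purchase_made'])
AS
SELECT
  IFNULL(totals.hits, 0) AS total_hits,
  IFNULL(totals.pageviews, 0) AS pageviews,
  device.deviceCategory AS device_category,
  IF(totals.transactions IS NULL, 0, 1) AS purchase_made
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170331';

Because the transformations live inside the model definition, the same preprocessing is applied automatically when the model is later used for evaluation and prediction.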

Step 3: Train Your Model

To train a model in BigQuery ML, start by defining it with a standard SQL query. First, specify the model’s name, then define the features and any necessary data transformations using the TRANSFORM clause. 

BigQuery ML SQL query defining a machine learning model with a specified name, selected features, and data transformations using the TRANSFORM clause.

Next, select the model type and set parameters, including the target column. Finally, execute a SQL query to select the input data from a table.

Model training time varies depending on complexity, from minutes for simple models to hours for more advanced models. 

BigQuery ML training progress displayed in the TRAINING tab, showing estimated duration based on model complexity.

Track progress in the BigQuery UI's TRAINING tab.

BigQuery ML training process running, with progress tracked in the TRAINING tab.
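You can also follow training from SQL. For iterative models, ML.TRAINING_INFO returns one row per training iteration with the loss and learning rate; this sketch assumes the sample model created earlier:

SELECT iteration, loss, eval_loss, learning_rate, duration_ms
FROM ML.TRAINING_INFO(MODEL `bqml.new_sample_model`)
ORDER BY iteration;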

Step 4: Evaluate the Model Performance

BigQuery automatically calculates performance metrics based on the model type, such as precision, recall, accuracy, F1 score, log loss, or ROC AUC. You can view these metrics in the EVALUATION tab within the BigQuery console, or access them via SQL queries like ML.EVALUATE. 
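For example, a sketch of evaluating the logistic regression model built earlier against sessions outside its training filter (here, late 2016 data) could look like this:

SELECT *
FROM ML.EVALUATE(MODEL `bqml.new_sample_model`,
  (
    SELECT
      IF(totals.transactions IS NULL, 0, 1) AS purchase_made,
      IFNULL(device.browser, "") AS browser,
      device.deviceCategory AS device_category,
      IFNULL(geoNetwork.region, "") AS region,
      IFNULL(totals.hits, 0) AS total_hits
    FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20161031'
  ));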

BigQuery ML EVALUATION tab showing performance metrics such as accuracy, precision, recall, and F1 score.

These metrics help assess model performance. If the results are unsatisfactory, you can return to the model training step to fine-tune and improve performance. 

BigQuery ML model evaluation metrics displaying three performance graphs: (1) Precision-recall by threshold curve, (2) Precision-recall curve, and (3) ROC curve.

Step 5: Model Deployment to Predict Outcomes

Once you're satisfied with your model's performance, it's time to use it for predictions on new, unseen data. In BigQuery, prepare a dataset with new data that wasn’t used during training, such as using a date range outside the training period. 

For example, predict the probability of the addedToCart event using data from December 31, 2020. Alternatively, you can export the model for deployment in production systems to generate online predictions.

BigQuery ML SQL query using ML.PREDICT to estimate the probability of an event during a session, based on a logistic regression model.
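The exact query depends on your model and dataset, but a sketch of scoring new sessions with ML.PREDICT, reusing the logistic regression model built earlier, could look like this:

SELECT
  predicted_purchase_made,
  predicted_purchase_made_probs
FROM ML.PREDICT(MODEL `bqml.new_sample_model`,
  (
    SELECT
      IFNULL(device.browser, "") AS browser,
      device.deviceCategory AS device_category,
      IFNULL(geoNetwork.region, "") AS region,
      IFNULL(totals.hits, 0) AS total_hits
    FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20161031'  -- data outside the training filter
  ));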

Overview of BigQuery ML Pricing Models

BigQuery ML offers two model types: built-in and external. Built-in models, like linear regression and K-means, are trained within BigQuery, while external models (e.g., deep neural networks and AutoML) are trained using Vertex AI. 

Pricing depends on the model type and usage, either through editions or on-demand. Prediction and evaluation functions are executed within BigQuery, with costs based on the model type and resources used.

BigQuery ML Editions Pricing

BigQuery ML offers Enterprise and Enterprise Plus Editions with a slot-based pricing model for running BigQuery ML on dedicated resources, which is ideal for customers who want predictable costs compared to the on-demand model. Pricing and resource management differ for built-in and external models.

Built-in Models
Built-in models, such as linear regression or K-means, use BigQuery reservations that share resources with analytical queries, ensuring efficient slot utilization for both machine learning and data analysis tasks.

External Models
External models, trained outside BigQuery, require ML_EXTERNAL reservations to avoid resource conflicts. This ensures external workloads do not affect other processes, maintaining cost parity with BigQuery slots and external Google Cloud services.

BigQuery ML On-Demand Pricing

BigQuery ML on-demand pricing is determined by the type of operation, including model creation, evaluation, inspection, and prediction. For iterative models, the CREATE MODEL statement is limited to 50 iterations, applying to both on-demand and edition pricing.

For time series models with auto-arima enabled, multiple candidate models are evaluated during training. The number of models can be controlled by the AUTO_ARIMA_MAX_ORDER setting. This impacts model creation costs, but for evaluation, inspection, and prediction, only the selected model is used, following standard query pricing.

BigQuery ML Remote Models

BigQuery ML allows customers to create remote models targeting Vertex AI, Cloud AI APIs (e.g., Vision, NLP, Translate), or prediction endpoints. Pricing includes charges for BigQuery processing and additional costs for remote endpoints, such as text generation, translation, or image annotation. 

For supervised tuning with Vertex AI LLMs, costs include BigQuery data processing, but GPU/TPU usage is billed separately. Use billing labels like billing_service = 'bigquery_ml' to track these charges across services.
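As an illustration, a remote model over the Cloud Translation API can be created roughly as follows; this sketch assumes a Cloud resource connection (here PROJECT_ID.us.my_connection) already exists and has been granted access to the service:

CREATE OR REPLACE MODEL `bqml.translate_model`
  REMOTE WITH CONNECTION `PROJECT_ID.us.my_connection`
  OPTIONS (REMOTE_SERVICE_TYPE = 'CLOUD_AI_TRANSLATE_V3');

Prediction requests against such a model are billed partly as BigQuery processing and partly by the remote service itself.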

BigQuery ML Dry Run

For certain BigQuery ML model types, due to the complexity of their underlying algorithms, the bytes processed for billing may not be calculated until after the training process is complete.

This is because estimating the initial resource usage is more complex for these models, resulting in billing adjustments being made once training has finished.

Cost Optimization Tips for BigQuery ML

When using BigQuery ML, it's essential to optimize costs while maintaining performance. This can be achieved by selecting the appropriate pricing model, leveraging built-in features like dry runs for accurate cost estimation, managing model training efficiently, and using editions for predictable pricing. 

Use BigQuery ML Editions for Predictable Pricing

Opt for BigQuery ML Enterprise or Enterprise Plus Editions for predictable, capacity-based pricing instead of on-demand pricing. These editions allocate compute slots, avoiding unexpected costs during frequent, large-scale model training. 

Built-in models use shared BigQuery reservations, while external models require ML_EXTERNAL reservations, ensuring efficient resource management and cost parity with Google Cloud services.

Take Advantage of Dry Runs

Use dry runs to estimate data processed and potential costs before executing queries or training models. This allows you to identify costly mistakes in advance, manage resources efficiently, and optimize BigQuery ML usage without incurring unexpected expenses.
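For instance, with the bq CLI a dry run reports the bytes a query would process without actually running it; the query below is just a placeholder:

bq query --use_legacy_sql=false --dry_run \
  'SELECT IF(totals.transactions IS NULL, 0, 1) AS purchase_made
   FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
   WHERE _TABLE_SUFFIX BETWEEN "20170101" AND "20171231"'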

Register Your Models in the Vertex AI Model Registry

Register your models in the Vertex AI Model Registry to monitor, compare, and manage versions efficiently. This prevents unnecessary retraining, reduces additional costs, and ensures better control over model lifecycle management within BigQuery ML.

What is Vertex AI for BigQuery?

Vertex AI is Google’s platform for developing, deploying, and managing AI/ML models. It integrates with BigQuery to prepare training data, manage features in the Vertex AI Feature Store, and support various training methods like AutoML and custom training. Models trained in BigQuery ML can be registered in Vertex AI for easier management and deployment.

Benefits of Managing BigQuery ML Models in Vertex AI

  1. Online Model Serving: Register models in Vertex AI to enable real-time predictions through endpoints.
  2. MLOps Capabilities: Automate model monitoring, retraining, and integration of BigQuery ML jobs for continuous accuracy.

Steps for Integrating BigQuery ML with Vertex AI

You can register BigQuery ML models with the Vertex AI Model Registry to manage them alongside Vertex AI models without exporting. This allows you to version, evaluate, and deploy models for online predictions through a unified interface, eliminating the need for a separate serving container or complex setup.

Enable the Vertex AI API

Before registering BigQuery ML models in the Vertex AI Model Registry, you must enable the Vertex AI API in your Google Cloud project. This can be done through the APIs & Services page in the Google Cloud Console, or by running the following Google Cloud CLI command:

gcloud --project PROJECT_ID services enable aiplatform.googleapis.com

Grant IAM Permissions for Vertex AI

To register BigQuery ML models in the Vertex AI Model Registry, the service or user account must have the Vertex AI Administrator role (roles/aiplatform.admin). You can grant this role using the Google Cloud CLI.

For a service account, use the following command:

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member=serviceAccount:SERVICE_ACCOUNT_EMAIL \
  --role=roles/aiplatform.admin --condition=None

For a user account, use the following command:

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member=user:USER_ACCOUNT --role=roles/aiplatform.admin \
  --condition=None

Register BigQuery ML Model to Vertex AI Model Registry During Creation

When creating a BigQuery ML model, use the MODEL_REGISTRY option in the CREATE MODEL statement to register it in the Vertex AI Model Registry. You can specify model IDs and version aliases with the VERTEX_AI_MODEL_ID and VERTEX_AI_MODEL_VERSION_ALIASES options. Once registered, models can be deployed, managed, and monitored, with Vertex Explainable AI offering feature-based insights.

Example:
We'll create a BigQuery time series forecasting model and register it to the Vertex AI Model Registry during its creation.

Here's a step-by-step guide:

Step 1: Import Time Series Data

First, upload the dataset in BigQuery and create a table. You can also use an existing dataset.

  1. Navigate to BigQuery in the console.
  2. Create a dataset or use an existing one.
  3. Upload the CSV file and name the table as needed; here, we name it transportation_revenue_insights. Let the schema be auto-detected (a CLI alternative is sketched below).
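If you prefer the command line, a sketch of the same load with the bq tool (the local file name is a placeholder) could look like this:

bq load --autodetect --source_format=CSV \
  demo.transportation_revenue_insights ./cta_ridership.csv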

Step 2: Create a Time Series Model

Using BigQuery ML syntax, we will create an ARIMA time series model (note that the model gets its own name, distinct from the source table):

CREATE OR REPLACE MODEL
  `demo.transportation_revenue_forecast`
OPTIONS(MODEL_TYPE='ARIMA',
    TIME_SERIES_TIMESTAMP_COL='service_date',
    TIME_SERIES_DATA_COL='total_rides',
    HOLIDAY_REGION='us',
    MODEL_REGISTRY='vertex_ai',
    VERTEX_AI_MODEL_ID='cta_ridership_forecast') AS
SELECT
  service_date, total_rides
FROM
  `demo.transportation_revenue_insights`;

Here:

  • MODEL_REGISTRY='vertex_ai': Registers the model in the Vertex AI Model Registry.
  • VERTEX_AI_MODEL_ID: Specifies a custom Vertex AI Model ID for easy management.

Step 3: Evaluate the Model

Once the model is trained, evaluate it with this query:

SELECT
  *
FROM
  ML.EVALUATE(MODEL `demo.transportation_revenue_forecast`);

This returns model parameters, evaluation metrics, and the best ARIMA model based on AIC.

Step 4: Forecast with the Model

Use the ML.FORECAST function to forecast ridership for the next 7 days:

SELECT
  *
FROM
  ML.FORECAST(MODEL `demo.transportation_revenue_forecast`,
    STRUCT(7 AS horizon));

This query generates a 7-day forecast along with confidence intervals.

By following these steps, you not only create and evaluate a time series model but also register it in Vertex AI for easy management and deployment.

Register an Existing BigQuery ML Model to Vertex AI

To register an existing BigQuery ML model in the Vertex AI Model Registry, use the ALTER MODEL statement.

ALTER MODEL [IF EXISTS] <model_name> SET OPTIONS
  (vertex_ai_model_id="VERTEX_AI_MODEL_ID");

  • If you use a new model ID, it registers as version 1. To add it as a new version of an existing model, specify the existing model's ID.
  • This allows you to manage and compare versions in the Model Registry.

Manage Time Series Models in Vertex AI Model Registry

Once your time series forecasting model is registered in the Vertex AI Model Registry, you can deploy it for predictions on new time series data, monitor the model’s performance and predictions, and use Vertex Explainable AI to gain insights into the key factors influencing the forecasts. This ensures better understanding and control over the model’s outputs.

Specify a Vertex AI Model ID and Aliases

When working with BigQuery ML models registered in Vertex AI, you can assign a model ID and aliases to simplify management and versioning.

Specify a Vertex AI Model ID

When creating a model, use the VERTEX_AI_MODEL_ID option to assign a model ID. Ensure it's lowercase. If unspecified, the BigQuery ML model ID will be used. To update, you must delete and recreate the model.

Specify a Vertex AI Model Alias

Use VERTEX_AI_MODEL_VERSION_ALIASES to create model aliases. These allow easy reference to specific model versions without needing the exact version ID.
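For example, here is a sketch combining both options at creation time (the model name, ID, and aliases are illustrative):

CREATE OR REPLACE MODEL `bqml.sample_registered_model`
OPTIONS(model_type = 'logistic_reg',
        input_label_cols = ['purchase_made'],
        MODEL_REGISTRY = 'vertex_ai',
        VERTEX_AI_MODEL_ID = 'sample_purchase_model',
        VERTEX_AI_MODEL_VERSION_ALIASES = ['staging', 'experimental'])
AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS purchase_made,
  device.deviceCategory AS device_category,
  IFNULL(totals.hits, 0) AS total_hits
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170331';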

BigQuery ML vs. Traditional Machine Learning Approaches

BigQuery ML streamlines model building by integrating with BigQuery, while traditional ML requires more complex workflows and external tools. 

Here is a detailed comparison of the two:

  • Data Processing: BigQuery ML works directly within BigQuery with no need to export data; traditional ML requires exporting data to external environments for processing.
  • Model Development: SQL-based model creation accessible to data analysts, versus programming knowledge in languages like Python or R.
  • Complexity: Simplified with SQL queries and built-in functions, versus specialized libraries such as TensorFlow or scikit-learn.
  • Environment: Integrated within Google Cloud and BigQuery, with models trained directly in BigQuery, versus external ML platforms for model training.
  • Deployment: Built-in deployment capabilities with Vertex AI integration, versus custom deployment pipelines.
  • Speed to Production: Faster, with no data movement and minimal setup, versus slower multi-step processes.
  • Prediction Type: Batch and online predictions via Vertex AI, versus custom APIs for deployment and real-time predictions.
  • User Skill Level: Accessible to SQL users with no advanced ML skills required, versus knowledge of ML frameworks and programming.

By simplifying model creation and deployment within a familiar environment, BigQuery ML offers a more efficient, accessible approach to machine learning compared to traditional workflows.

Unlock In-Depth Insights with the OWOX Reports Extension for Google Sheets

The OWOX Reports Extension for Google Sheets empowers users to transform raw data into actionable insights with ease. By connecting Google Sheets directly to your data sources, such as Google BigQuery and Google Analytics, it enables users to automate report generation, streamline data analysis, and eliminate the need for manual data imports. 

With intuitive dashboards, charts, and reports, business users and data analysts can track key metrics, uncover trends, and make data-driven decisions without leaving Google Sheets. The extension simplifies data management, making it accessible to non-technical users while providing in-depth insights.

