Overview of Big Query’s Unique feature, BQML with a regression model example

February 2, 2023
Posted by: Mamakar Jetti
Category: Data Analytics

In this Blog you are going to see what Big Query is, its best feature of Big Query (BQML), Areas of BQML, with clear example to understand its easiness of building machine learning model with simple SQL code.

The blog will go through the following topics:

What is Big Query?
Best features of Big Query?
Why BQML? Areas of BQML?
Regression model to show efficiency of BQML

Let’s dive into the article.,

What is Big Query?

With built-in technologies like machine learning, business intelligence and geospatial analysis, Big Query is a managed service data management warehouse that can enable you to manage and analyse your data. With no need for infrastructure administration, Big Query’s serverless architecture enables you to leverage SQL queries to tackle the most critical issues facing your company. You may query data in terabytes in a matter of seconds and petabytes of data in a matter of minutes thanks to Big Query’s robust, distributed analytical engine.

Best features of Big Query?

Built-in ML Integration (BQ ML), Multi cloud Functionality (BQ Omni), Geospatial Analysis (BQ GIS), Foundation for BI (BQ BI Engine), Free Access (BQ Sandbox), Automated Data Transfer (BQ Data Transfer Service). These are the amazing features of BQ, in this blog we will discuss the most amazing feature of Big Query which is Big Query ML.

*An amazing feature of Big Query is Big Query ML,

Big Query ML allows you to use standard SQL queries to develop and run machine learning models in Big Query. Machine learning on huge datasets requires extensive programming and ML framework skills. These criteria restrict solution development within each organization to a small group of people, and they exclude data analysts who understand the data but lack machine learning and programming skills. This is where Big Query ML comes in handy; it allows data analysts to employ machine learning using their existing SQL tools and skills. Big Query ML allows analysts to create and evaluate machine learning models in Big Query with large volumes of data.

For more information on Big Query Machine Learning services and solutions

Why BQML?

The major advantages I’ve identified using BQML

There is no need to read your data from local memory because, like any other ML language, BQML can subsample your dataset, but BQML can also train your model directly in your database.
Working in SQL can help you collaborate more easily if you’re working in a team and the majority of your teammates don’t know Python, R, or your favourite modelling language.
Because your model will be in the same location as your data, you can serve it immediately after it has been trained and make predictions directly from it.

Areas we can use BQML

Retail Industry (Demand forecasting, Customer segmentation, Propensity to purchase or propensity to click on item, Product recommendations by emails and ads).
Logistics Industry (Time estimation of package delivery, Predictive maintenance).
Finance Industry (Product recommendations by emails and ads, Product recommendations by emails and ads, Product recommendations by emails and ads, Product recommendations by emails and ads).
Gaming Industry (Content recommendation, Predicting churn customers).

Another blog worth reading: Databricks Overview, Why Databricks, and More

Regression model to show efficiency of BQML

For this we will build a linear regression model to predict the house prices in the USA, as it is best fit to predict the value of one variable using another. Also, for understanding about model working in the article I am using example of regression model as it is simpler to communicate how the model itself works and interpret results.
With the USA housing dataset, we will see how efficient and easy Big Query ML feature is to build machine learning linear regression model with SQL code.

Step 1: Creating the Model

CREATE OR REPLACE MODEL

`testproject-351804.regression.house_prices2` OPTIONS(model_type = ‘linear_reg’, input_label_cols = [‘price’],l2_reg = 1, early_stop = false, max_iterations = 12, optimize_strategy = ‘batch_gradient_descent’) ASSELECT avg_house_age, avg_rooms, avg_bedrooms, avg_income, population, price/100000 AS priceFROM `regression.usa_housing_train`

SELECT avg_house_age, avg_rooms, avg_bedrooms, avg_income, population, price/100000 AS price FROM `regression.usa_housing_train

Model creation

The above code will create and train the model.
With the simple CREATE MODEL function we can create the ML model, you need to specify the OPTIONS, we need basically only model_type and input_label_cols(predicting variable) to create the model but why I used other OPTIONS, you will see in evaluation section.

Step 2: Evaluating the Model

SELECT * FROM ML.EVALUATE(MODEL `regression.house_prices2`,TABLE ` testproject- 351804._8b41b9f5a2e85d72c62e834e3e9dd60a58ba542d.anoncb5de70d_1e3d_4213_8c5d_bb10d6b9385b_imported_data_split_eval_data`)

Model Evaluation

We have to see how well our model is performing by using ML.EVALUATE funtion, So now we will see why I used other OPTIONS in creating the model,
First I created a model in BigQuery ML, with model options model_type= ‘linear_reg’ and input_label_cols = ‘price’ but while evaluating the model “r square” is only 0.3 which I felt less accurate and I came to know that model is overfitt by seeing huge difference between the training loss and evaluation loss.
So, as a solution I added options in creating model, used L2 regularization to overcome overfitt and generalize the model to adapt the data points and changed values for three times to made it generalize and after the r square is 0.92 with above 90% accuracy.

*We need to look upon R-Squared, which is Coefficient of determination. Higher is better.

Step 3: Predicting the Model

The model’s prediction process is as simple as calling ML.PREDICT

SELECT * FROM ML.PREDICT (Model `regression.house_prices2`,TABLE `regression.usa_housing_predict`)

Model Prediction

See, how efficient is Big Query ML feature of Big Query, it predicted the house prices basing upon the trained data of avg_house_age, avg_rooms, avg_bedrooms, avg_income, avg_population.

Summary

Now you know how to create linear regression models in BigQuery ML. We have discussed how to build a model, assess it, apply it to make predictions, and analyse model coefficients.

In next coming blogs you will see other unique features of Big Query like Geospatial Analytics and Array/Structs.

Happy Reading

Hope you find this useful.