At Arpeely, we help our customers grow by leveraging data science, A/B testing, and machine learning to optimize user engagement. We rely on Google BigQuery to efficiently store, process, and analyze the massive amounts of data that drive our insights.
Treating Queries as Code
Queries run the world at Arpeely. They are responsible for ETL processes, model training, dashboards, production bidding, and more.
To make BigQuery work for us — and to keep our data analysis processes efficient — we treat our queries with the same level of care that developers do with their code. That means version control, continuous integration (CI), linting, and all the good practices you’d expect in a well-maintained software project. Why? Because queries are not just one-off scripts; they are key pieces of our data infrastructure. If you've ever had to troubleshoot a complex SQL query at 2 a.m., you know how much of a lifesaver it is when things are documented, tested, and organized.
In this post, we'll share how we approach data analysis at Arpeely by treating our queries as code, using automation to manage our data infrastructure, and leveraging User-Defined Functions (UDFs - we’ll get to these later) to simplify complex data transformations. We'll also highlight some of the specific tools and techniques we've found most helpful along the way.
UDFs - Your Company’s Common Library in BigQuery
UDFs let us extend the capabilities of BigQuery with custom functions written in SQL or JavaScript. Over time, we’ve built some pretty handy UDFs that have made our lives a lot easier, especially when working with large datasets or running repeatable processes.
UDFs don't just save time. Like classes, they create abstractions and bring uniformity to your team's work. You can build simple utilities, such as the proper way to convert currencies, as well as complex abstractions, such as calculating statistical significance for an A/B test dataset. If a process changes, the UDF can be updated in one place by anyone on your team, without the need to change every query that depends on it.
Everyone in the company knows that the 'func' dataset is used for common tools, making it easy for both developers and data analysts to find useful utility functions that others have written before them, and encouraging them to add their own abstractions for future use across the company. Just like with code, everyone is encouraged to contribute to and improve our library of UDFs, which saves us time and keeps us from reinventing the wheel.
CI/CD with BigQuery
Automation plays a huge role in how we manage our BigQuery environment. Our queries, table schemas, and UDFs live in parallel in BigQuery and in Git. Using TeamCity as our CI/CD infrastructure and the Python Client for Google BigQuery, we've implemented a couple of CI jobs that keep the tables, views, and UDFs in Git and in BigQuery in sync.
First, we’ve implemented a CI step that runs whenever someone changes a query and opens a pull request (PR). This CI step attempts to deploy the modified table, view, or UDF into a “temp” dataset. By doing this, we can verify the syntax and ensure that all dependencies, like datasets and referenced tables or views, are correctly set up. This kind of preemptive testing catches issues early, making our deployment process smoother and reducing the chances of nasty surprises in production.
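To make this concrete, here's a minimal sketch of what such a validation step can look like with the Python client. The temp dataset name, file layout, and helper function are hypothetical, for illustration only:

from pathlib import Path
from google.cloud import bigquery

client = bigquery.Client()
TEMP_DATASET = "your_project.ci_temp"  # hypothetical scratch dataset used only by CI

def validate_view(sql_path: Path) -> None:
    """Deploy a changed view into the temp dataset to surface syntax errors or missing dependencies before merge."""
    temp_id = f"{TEMP_DATASET}.{sql_path.stem}"
    client.delete_table(temp_id, not_found_ok=True)  # start clean so the new query is actually checked
    view = bigquery.Table(temp_id)
    view.view_query = sql_path.read_text()
    client.create_table(view)  # raises if the query or any of its references are invalid
    client.delete_table(temp_id, not_found_ok=True)  # clean up the temporary copy

validate_view(Path("views/your_dataset/your_view.sql"))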
After the PR is merged into the main branch, a second CI job deploys the changed files over the existing BigQuery entities, creating tables, views, and UDFs if they don't exist, updating them if they do, and altering table schemas when needed. This way, we maintain a single source of truth for our data infrastructure in Git, which is crucial for collaboration and minimizing errors.
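The sync job for tables follows a similar pattern. Below is a hedged sketch of reconciling a table's schema with the Python client, assuming schemas are stored as JSON files in Git; the paths and names are illustrative rather than our exact layout:

import json
from pathlib import Path
from google.cloud import bigquery
from google.cloud.exceptions import NotFound

client = bigquery.Client()

def sync_table(table_id: str, schema_path: Path) -> None:
    """Create the table if it doesn't exist, otherwise update its schema in place."""
    schema = [bigquery.SchemaField.from_api_repr(field) for field in json.loads(schema_path.read_text())]
    try:
        table = client.get_table(table_id)
        table.schema = schema
        client.update_table(table, ["schema"])  # BigQuery only allows additive or relaxing schema changes here
    except NotFound:
        client.create_table(bigquery.Table(table_id, schema=schema))

sync_table("your_project.your_dataset.your_table", Path("tables/your_dataset/your_table.json"))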
This automated workflow not only saves time but also reduces the manual overhead of managing changes across our BigQuery environment. It’s much easier to iterate on queries, tables, and UDFs when you know that the deployment process is seamless and reliable. Plus, by treating our queries as code, we make it possible for anyone on the team to contribute improvements, knowing that their changes will be tested and version-controlled just like any other software project.
UDFs - How to Source Control Them
Creating and managing a view in Python using the BigQuery Python client is straightforward:
from google.cloud import bigquery
view_id = "your_project.your_dataset.your_view"
view_query = """
SELECT column1, column2
FROM `your_project.your_dataset.your_table`
WHERE column1 IS NOT NULL
"""
view = bigquery.Table(view_id)
view.view_query = view_query
client = bigquery.Client()
view = client.create_table(view, exists_ok=True)
print(f"Successfully created view: {view.table_id}")
Creating a UDF is a bit trickier, because there is no equivalent API for creating a UDF by name. Instead, we execute the query:
create or replace function <directory>.<file name><file contents>
with the placeholders filled in for each version-controlled UDF. For example, to deploy a function that accepts two integers and adds them together, we first create a file functions/adder.udf with the content:
-- <parameters> [optional returns clause] as <function body>
(left int, right int) returns int as (left + right)
We run the query:
create or replace function functions.adder(left int, right int) returns int as (left + right)
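Wiring this into the deployment job takes only a little Python. The sketch below assumes the <directory>/<file name>.udf layout described above and is illustrative rather than our exact implementation:

from pathlib import Path
from google.cloud import bigquery

client = bigquery.Client()

def deploy_udf(udf_path: Path) -> None:
    """Build and run `create or replace function <dataset>.<name><body>` for one version-controlled UDF."""
    dataset = udf_path.parent.name  # e.g. "functions"
    name = udf_path.stem            # e.g. "adder"
    body = udf_path.read_text()     # e.g. "(left int, right int) returns int as (left + right)"
    client.query(f"create or replace function {dataset}.{name}{body}").result()  # wait so CI fails loudly on errors

deploy_udf(Path("functions/adder.udf"))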
Examples of Our Most Used UDFs
The following examples showcase specific UDFs we've developed, highlighting how they help us solve common problems we encounter when working with our data and let us focus on developing new algorithms.
Filtering by Time: Filtering by time is a common operation in our data workflow, as most of our tables are partitioned by ingestion time. Partition filters are efficient and let our queries digest only the most relevant data. To simplify time filtering, we created functions like since_minutes, since_hours, and since_days.
Here's the implementation of the since_hours function:
(time timestamp, hours int) as (time >= timestamp_sub(current_timestamp(), interval hours hour))
And here's an example usage of it:
with sample_data as (
select timestamp '2024-10-01 10:00:00' as event_time
)
select
event_time
from sample_data
where
since_hours(event_time, 5);
-- Shorter than:
-- event_time >= timestamp_sub(current_timestamp(), INTERVAL 5 hour)
This function allows us to easily filter records that fall within a given number of hours from the current timestamp. Creating these functions saves us a lot of time and hassle, especially when it comes to remembering the specific syntax of timestamp_sub.
String Processing and Manipulation: We often need to perform simple string processing and manipulation, so we've created several useful functions to handle these tasks. For example, we use to_adid to convert an array of bytes into a formatted advertising ID. Other useful functions are parse_url_params and url_param, which extract query params from a URL, and url_decode. Here's the implementation of parse_url_params:
(current_url string) returns array<struct<key string, val string>> as ((
select
array_agg(struct(
-- Before the = is the key, it must exist
split(x,'=')[offset(0)] as key,
-- After the = is the value, maybe missing
split(x,'=')[safe_offset(1)] as val
))
-- Split the URLs and extract the array of query params
from unnest(split(split(current_url,'?')[safe_offset(1)],'&')) as x
))
This function allows us to parse URLs, which can be particularly handy when dealing with data coming from web tracking or APIs. Here's an example usage of parse_url_params:
with sample_data as (
select 'https://www.example.com/search?q=test&lang=en' as url
)
select parse_url_params(url) as query_params
from sample_data
This query will output the array:
[
{"key": "q", "value": "test"},
{"key": "lang", "value": "en"}
]
Similarly, url_param allows us to extract the value of a specific query param:
(url string, param string) returns string as ((
select val
from unnest(func.parse_url_params(url))
where key = param
limit 1
))
An example similar to the previous one:
with sample_data as (
select 'https://www.example.com/search?q=test&lang=en&sort=desc' as url
)
select
url_param(url, 'lang') as query_params
from sample_data;
This query will output "en".
This relatively simple example shows that by creating small units of abstraction, we can build more complex functions that will help us to work more efficiently throughout the entire company.
JSON Processing: Another powerful feature of BigQuery is the ability to implement User-Defined Functions (UDFs) using JavaScript. This is particularly useful for processing JSON data, where more complex logic can be applied. Later on, we'll mention the json_extract_keys and json_extract_values functions from the BigQuery Utils project, which extract the keys and values of a JSON object. Another useful function we've developed is merge_json, which allows us to merge two JSON objects together. This makes it easy to work with multiple sources of JSON data that need to be combined.
The declaration of merge_json:
(sourceJson JSON, targetJson JSON) returns JSON
language js as
"""
<JS code>
""";
And its implementation:
// Utility function for merging two objects together
function mergeDeep(source, target) {
for (const key in source) {
// Key is not an object, we don't need to deep-merge
if (typeof source[key] !== 'object' || Array.isArray(source[key])) {
target[key] = source[key];
continue
}
// key is an object, create intermediate object in target
if (!target[key]) {
target[key] = {};
}
mergeDeep(source[key], target[key]);
}
return target;
}
if (targetJson === null) {
return sourceJson
}
if (sourceJson === null) {
return targetJson
}
return mergeDeep(sourceJson, targetJson);
Here's an example usage of merge_json:
with sample_data as (
select
JSON '{"person": {"name": "Alice", "age": 30}, "location": {"city": "Wonderland"}}' as json1,
JSON '{"person": {"age": 35, "profession": "Explorer"}, "location": {"country": "Dreamland"}}' as json2
)
select
merge_json(json1, json2) as merged_json
from sample_data;
This query will output the merged JSON object (note that when a key exists in both inputs, the value from sourceJson, the first argument, takes precedence):
{
"person": {
"name": "Alice",
"age": 35,
"profession": "Explorer"
},
"location": {
"city": "Wonderland",
"country": "Dreamland"
}
}
Array Manipulation: Another powerful feature of BigQuery is its support for array data types. Unlike many other SQL databases, BigQuery allows us to work directly with arrays, which can significantly simplify data processing and transformations. Arrays are particularly useful when dealing with repeated fields or when you need to perform aggregations without excessive joins. To simplify working with arrays, one useful function we created is the slice function:
-- ANY TYPE allows our function to be "generic", accepting any type of array
(arr ANY TYPE, start int64, stop int64) as (
ARRAY(
select element
from unnest(arr) as element -- for all element in the array
with offset
-- return only those in the specified range
where offset between start and stop - 1
)
);
An example usage of this function:
with sample_data as (
select
[1, 2, 3, 4, 5] as int_array,
["apple", "banana", "cherry", "date"] as str_array
)
select
func.slice(int_array, 1, 4) as sliced_int_array,
func.slice(str_array, 2, 4) as sliced_str_array
from sample_data;
returns the output:
{
"sliced_int_array": [2, 3, 4],
"sliced_str_array": ["cherry", "date"]
}
ETL Marker: The last example we want to share is the etl_marker function. Looking just at its definition may create some confusion: (day date) as (day)
The goal of this function is not to stand by itself, but rather to be used as part of our internal ETL infrastructure. Specifically, when performing data rollup and aggregation, we wanted to be able to aggregate older data if needed while keeping all of our views and queries functional for the latest day (which is the usual granularity for our rollup operations).
When executing a rollup, we replace calls to etl_marker with calls to autogen_etl_marker, which is defined by the statement:
create temp function autogen_etl_marker(d date) as (date(<rollup date>))
Here, <rollup date> is dynamically built based on the rollup job configuration. This approach allows us to adjust the aggregation period dynamically while ensuring that our views and queries continue to work seamlessly for the most recent day.
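To make the mechanism concrete, here's a hedged Python sketch of how a rollup job could pin the date. The query, table names, and the naive string rewrite are simplified assumptions for illustration, not our actual ETL code, and it assumes etl_marker lives in the shared func dataset:

import datetime
from google.cloud import bigquery

client = bigquery.Client()

def run_rollup(query: str, rollup_date: datetime.date) -> None:
    """Prepend a temp function pinned to the rollup date and redirect etl_marker calls to it."""
    temp_fn = f"create temp function autogen_etl_marker(d date) as (date('{rollup_date.isoformat()}'));\n"
    rewritten = query.replace("func.etl_marker(", "autogen_etl_marker(")  # naive rewrite, for illustration
    client.query(temp_fn + rewritten).result()

# Hypothetical rollup query: with the identity etl_marker it processes the latest day
daily_rollup = """
select func.etl_marker(current_date()) as day, count(*) as events
from your_dataset.raw_events
where date(ingestion_time) = func.etl_marker(current_date())
group by day
"""
run_rollup(daily_rollup, datetime.date(2024, 10, 1))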
Leveraging BigQuery Utils
One great resource that helped us along the way is the BigQuery Utils project by Google Cloud. It’s a collection of tools, utilities, and even some inspiration for how we can make our SQL queries more maintainable. If you haven't checked it out yet, we highly recommend giving it a look.
We’ve already seen our merge_json function, which helps us work with nested or dynamic JSON data. A couple of similar functions from the BigQuery Utils project are bqutil.fn.json_extract_keys and bqutil.fn.json_extract_values, which allow you to extract the keys and values from a JSON object. This functionality is extremely useful, and Google eventually added a native JSON_KEYS function that behaves similarly to json_extract_keys, though it is still in preview at the time of writing.
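Here's a quick usage sketch run through the Python client; the JSON literal is just an example, and we're assuming the string-input signatures of the bqutil functions:

from google.cloud import bigquery

client = bigquery.Client()
query = """
select
  bqutil.fn.json_extract_keys('{"q": "test", "lang": "en"}') as extracted_keys,
  bqutil.fn.json_extract_values('{"q": "test", "lang": "en"}') as extracted_values
"""
for row in client.query(query).result():
    print(row["extracted_keys"], row["extracted_values"])  # expected: ['q', 'lang'] ['test', 'en']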
Conclusion
Our journey with BigQuery has taught us that treating queries as code and investing in tools like UDFs really pays off. It makes our work more collaborative, less error-prone, and much easier to maintain. Whether you’re just starting out with BigQuery or looking for ways to level up your SQL game, we hope this post has given you some inspiration to try out a few of these practices. And remember: always version control your queries; your future self will thank you!