Strategies for easier Google Analytics + BigQuery Analysis & Custom Dimensions

I’m a huge fan of BigQuery, and for Google Analytics 360 users you can automatically have your data go directly into BigQuery. It enabled advanced analysis not possible through the standard reporting tools, and you can join in other datasets you have in BigQuery with your Google Analytics data.

With BigQuery you can have nested structures, so each record in the Google Analytics dataset a visit. Nested inside each visit record is a hits nested record, which contains every pageview and event that occurred during that visit.

Let’s pull some data to show what I mean. Normally in BigQuery, the datasetname is the Google Analytics View ID, and the table name structure is ga_sessions_<date> for a single day’s data. An example dataset/table combo would be 123456789.ga_sessions_20180501. You can use Google’s Sample GA Dataset if you want to play around with a dataset, though there are some limitations (which I’ll explain later). Here is a query to pull the visitId and device.operatingSystem for each visit that happened on May 1st, 2018:

SELECT 
  VISIT.visitId,
  VISIT.device.operatingSystem
FROM `123456789.ga_sessions_20180501` VISIT
ORDER BY VISIT.visitStartTime ASC
Results showing visit data

If we want to view things at a hit level, we need to use the feature called UNNEST. This “un-nests” the data and creates a row for each visit & hit. It would be very similar to using a JOIN if these were stored in separate tables.

Here is an example of using UNNEST:

SELECT 
  VISIT.visitId,
  VISIT.device.operatingSystem,
  HIT.type,
  HIT.page.pageTitle,
  SUBSTR(HIT.page.pagePath, 0, 20) AS pagePath
FROM `123456789.ga_sessions_20180501` VISIT,
  UNNEST(VISIT.hits) HIT
ORDER BY VISIT.visitStartTime ASC, 
  HIT.hitNumber ASC

Things become tricky with another level of nesting: custom dimensions. Because you can have up to 200 custom dimensions with Google Analytics 360, instead of having one record with 200 columns they store them in an array. Here is an example of how they are structured:

{
  "customDimensions": [{
      "index": "4",
      "value": "Homepage"
    },
    {
      "index": "18",
      "value": "McKenzie Romero"
    },
    {
      "index": "27",
      "value": "courts"
    }
  ]
}

The challenge with using UNNEST in this scenario is each record will have an individual custom dimension:

SELECT 
  VISIT.visitId,
  VISIT.device.operatingSystem,
  HIT.type,
  HIT.page.pagePath,
  CD.index,
  CD.value
FROM `123456789.ga_sessions_20180501` VISIT,
  UNNEST(VISIT.hits) HIT,
  UNNEST(HIT.customDimensions) CD
ORDER BY VISIT.visitStartTime ASC, 
  HIT.hitNumber ASC,
  CAST(CD.index AS INT64) ASC

This isn’t inherently bad, except when you want to filter or select multiple custom dimensions. With BigQuery’s Standard SQL there isn’t a clear, easy way of doing this. We’ve come up with two basic strategies:

Strategy #1: WITH and JOIN statements w/ pure SQL

You can solve the problem with nothing but SQL. Our strategy is to query the custom dimensions as if they were a separate table and then join them in. To do this we’ll use the less known (but awesome) WITH statement.

Using some KSL.com News data as a sample, let’s look at pageviews and events that occurred on pages where McKenzie Romero is the author (custom dimension 18) and see what topic they belong in (custom dimension 27). Here is the SQL query:

WITH CD AS (
  SELECT 
    WITH_VISIT.fullVisitorId,
    WITH_VISIT.visitId,
    WITH_HIT.hitNumber,
    WITH_CD.index,
    WITH_CD.value
FROM `123456789.ga_sessions_20180501` WITH_VISIT,
  UNNEST(WITH_VISIT.hits) WITH_HIT,
  UNNEST(WITH_HIT.customDimensions) WITH_CD
WHERE WITH_CD.index IN (18, 27)
) 
SELECT 
  VISIT.visitId,
  VISIT.device.operatingSystem,
  HIT.type,
  SUBSTR(HIT.page.pageTitle, 0, 30),
  CD_18.value AS customDimension18_author,
  CD_27.value AS customDimension27_topic
FROM `123456789.ga_sessions_20180501` VISIT,
  UNNEST(VISIT.hits) HIT
  LEFT JOIN CD CD_18 ON VISIT.fullVisitorId = CD_18.fullVisitorId 
    AND VISIT.visitId = CD_18.visitId 
    AND HIT.hitNumber = CD_18.hitNumber 
    AND CD_18.index = 18
  LEFT JOIN CD CD_27 ON VISIT.fullVisitorId = CD_27.fullVisitorId 
    AND VISIT.visitId = CD_27.visitId 
    AND HIT.hitNumber = CD_27.hitNumber 
    AND CD_27.index = 27
WHERE CD_18.value LIKE "%McKenzie Romero%"

It works but isn’t straight forward-looking. Before the SELECT we define the table CD using a WITH. We select the data from the VISIT and HIT tables, and then we join in CD using the fullVisitorId , visitId , hitNumber , and index of the custom dimension. We do this for every custom dimension we use, so if we need to use a dozen custom dimensions then this SQL statement is going to get long and complicated. This can lead to easy errors that are hard to track down.

One easy-to-miss optimization is limiting in the WITH statement which indexes you want. If you forget this part, your WITH statement is going to do a lot more work than it needs to and will run much slower.

Pure SQL Strategy Pros & Cons:

Pros:

  • Pure SQL, don’t need to know another language.
  • Fast, performant.

Cons:

  • Complicated SQL (depending on your definition of complicated)
  • Error-prone

Strategy #2: Using JavaScript with User-defined Functions

So what if instead of using WITH and JOIN instead we use a User Defined Function? With this approach, we could pass the entire customDimensionobject and desired index value into the function and let it return the result.

Here is an example:

CREATE TEMPORARY FUNCTION getCustomDimension(cd ARRAY<STRUCT>, index INT64)
  RETURNS STRING
  LANGUAGE js AS """
    for(var i = 0; i < cd.length; i++) {
      var item = cd[i];
      if(item.index == index) {
        return item.value
      }
    }
    return '';
  """;

SELECT 
  VISIT.visitId,
  VISIT.device.operatingSystem,
  HIT.type,
  SUBSTR(HIT.page.pageTitle, 0, 30) AS pageTitle,
  getCustomDimension(HIT.customDimensions, 18) AS customDimension18_author,
  getCustomDimension(HIT.customDimensions, 27) AS customDimension27_topic
FROM `123456789.ga_sessions_20180501` VISIT,
  UNNEsT(VISIT.hits) HIT
WHERE getCustomDimension(HIT.customDimensions, 18) LIKE "%McKenzie Romero%"

This allows us to treat custom dimensions as if they were columns on the hitobject. This makes for easier ad-hoc analysis and reduces errors. It does come at a cost: these queries run slower. In our testing, I ran these same queries against an entire’s month of data, processing about 40GB of data:

  • Pure Standard SQL: avg 19.1 seconds
  • User Defined Function: avg 69.9 seconds (3.5 times slower)

There are definite tradeoffs between the approaches. I would use user-defined functions with initial and ad-hoc analysis. Then, when making regular running reports or queries, I would switch to using the pure SQL methodology. Either of these patterns will allow you to more easily use custom dimensions when analyzing your Google Analytics data.

Also, there are probably even better methods out there for querying against custom dimensions. If you know of one, please, don’t hesitate to share!

— Carmony

%d bloggers like this:
search previous next tag category expand menu location phone mail time cart zoom edit close