I’m a huge fan of BigQuery, and Google Analytics 360 users can have their data flow directly into BigQuery automatically. This enables advanced analysis that isn’t possible through the standard reporting tools, and you can join your Google Analytics data with other datasets you already have in BigQuery.
With BigQuery you can have nested structures, so each record in the Google Analytics dataset represents a visit. Nested inside each visit record is a `hits` nested record, which contains every pageview and event that occurred during that visit.
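To picture the shape of one of these records, here is a trimmed, hypothetical sketch as a JavaScript object (field names follow the GA360 export schema, but the values are made up):

```javascript
// A trimmed, hypothetical sketch of one session record from the
// GA360 BigQuery export: one row per visit, with hits nested inside.
const session = {
  visitId: 1525190400,
  visitStartTime: 1525190400,
  device: { operatingSystem: "Android" },
  hits: [
    { hitNumber: 1, type: "PAGE", page: { pagePath: "/", pageTitle: "Home" } },
    { hitNumber: 2, type: "EVENT", page: { pagePath: "/", pageTitle: "Home" } },
  ],
};

console.log(session.hits.length); // two hits nested inside this one visit
```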
Let’s pull some data to show what I mean. Normally in BigQuery, the dataset name is the Google Analytics View ID, and the table name follows the pattern `ga_sessions_<date>` for a single day’s data. An example dataset/table combo would be `123456789.ga_sessions_20180501`. You can use Google’s Sample GA Dataset if you want to play around with a dataset, though there are some limitations (which I’ll explain later). Here is a query to pull the `visitId` and `device.operatingSystem` for each visit that happened on May 1st, 2018:
```sql
SELECT
  VISIT.visitId,
  VISIT.device.operatingSystem
FROM `123456789.ga_sessions_20180501` VISIT
ORDER BY VISIT.visitStartTime ASC
```

If we want to view things at a hit level, we need to use the feature called UNNEST. This “un-nests” the data and creates a row for each visit & hit. It would be very similar to using a JOIN if these were stored in separate tables.
Here is an example of using UNNEST:
```sql
SELECT
  VISIT.visitId,
  VISIT.device.operatingSystem,
  HIT.type,
  HIT.page.pageTitle,
  SUBSTR(HIT.page.pagePath, 0, 20) AS pagePath
FROM `123456789.ga_sessions_20180501` VISIT,
  UNNEST(VISIT.hits) HIT
ORDER BY VISIT.visitStartTime ASC, HIT.hitNumber ASC
```

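If it helps to think in procedural terms, UNNEST behaves a lot like a flat-map over the nested array. This is only an analogy, not how BigQuery executes the query, and the sample data below is hypothetical:

```javascript
// Analogy only: UNNEST(VISIT.hits) behaves like flat-mapping each
// session row into one row per (visit, hit) pair.
const sessions = [
  {
    visitId: 1,
    device: { operatingSystem: "iOS" },
    hits: [
      { hitNumber: 1, type: "PAGE" },
      { hitNumber: 2, type: "EVENT" },
    ],
  },
  {
    visitId: 2,
    device: { operatingSystem: "Android" },
    hits: [{ hitNumber: 1, type: "PAGE" }],
  },
];

// Rough equivalent of:
//   SELECT visitId, operatingSystem, type FROM sessions, UNNEST(hits)
const rows = sessions.flatMap((visit) =>
  visit.hits.map((hit) => ({
    visitId: visit.visitId,
    operatingSystem: visit.device.operatingSystem,
    type: hit.type,
  }))
);

console.log(rows.length); // 3 rows: one per visit/hit pair
```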
Things become tricky with another level of nesting: custom dimensions. Because Google Analytics 360 allows up to 200 custom dimensions, they are stored in an array rather than as one record with 200 columns. Here is an example of how they are structured:
```json
{
  "customDimensions": [
    { "index": "4", "value": "Homepage" },
    { "index": "18", "value": "McKenzie Romero" },
    { "index": "27", "value": "courts" }
  ]
}
```
The challenge with using UNNEST in this scenario is that each output row will contain only a single custom dimension:
```sql
SELECT
  VISIT.visitId,
  VISIT.device.operatingSystem,
  HIT.type,
  HIT.page.pagePath,
  CD.index,
  CD.value
FROM `123456789.ga_sessions_20180501` VISIT,
  UNNEST(VISIT.hits) HIT,
  UNNEST(HIT.customDimensions) CD
ORDER BY VISIT.visitStartTime ASC, HIT.hitNumber ASC, CAST(CD.index AS INT64) ASC
```

This isn’t inherently bad, except when you want to filter or select multiple custom dimensions. With BigQuery’s Standard SQL there isn’t a clear, easy way of doing this. We’ve come up with two basic strategies:
Strategy #1: WITH and JOIN statements w/ pure SQL
You can solve the problem with nothing but SQL. Our strategy is to query the custom dimensions as if they were a separate table and then join them in. To do this we’ll use the lesser-known (but awesome) `WITH` statement.
Using some KSL.com News data as a sample, let’s look at pageviews and events that occurred on pages where McKenzie Romero is the author (custom dimension 18) and see what topic they belong in (custom dimension 27). Here is the SQL query:
```sql
WITH CD AS (
  SELECT
    WITH_VISIT.fullVisitorId,
    WITH_VISIT.visitId,
    WITH_HIT.hitNumber,
    WITH_CD.index,
    WITH_CD.value
  FROM `123456789.ga_sessions_20180501` WITH_VISIT,
    UNNEST(WITH_VISIT.hits) WITH_HIT,
    UNNEST(WITH_HIT.customDimensions) WITH_CD
  WHERE WITH_CD.index IN (18, 27)
)
SELECT
  VISIT.visitId,
  VISIT.device.operatingSystem,
  HIT.type,
  SUBSTR(HIT.page.pageTitle, 0, 30),
  CD_18.value AS customDimension18_author,
  CD_27.value AS customDimension27_topic
FROM `123456789.ga_sessions_20180501` VISIT,
  UNNEST(VISIT.hits) HIT
LEFT JOIN CD CD_18
  ON VISIT.fullVisitorId = CD_18.fullVisitorId
  AND VISIT.visitId = CD_18.visitId
  AND HIT.hitNumber = CD_18.hitNumber
  AND CD_18.index = 18
LEFT JOIN CD CD_27
  ON VISIT.fullVisitorId = CD_27.fullVisitorId
  AND VISIT.visitId = CD_27.visitId
  AND HIT.hitNumber = CD_27.hitNumber
  AND CD_27.index = 27
WHERE CD_18.value LIKE "%McKenzie Romero%"
```

It works, but it isn’t straightforward. Before the SELECT we define the table `CD` using a `WITH` statement. We select the data from the `VISIT` and `HIT` tables, and then we join in `CD` using the `fullVisitorId`, `visitId`, `hitNumber`, and `index` of the custom dimension. We do this for every custom dimension we use, so if we need a dozen custom dimensions this SQL statement is going to get long and complicated. That makes it easy to introduce errors that are hard to track down.
One easy-to-miss optimization is limiting in the `WITH` statement which indexes you want. If you forget this part, your `WITH` statement is going to do a lot more work than it needs to and will run much slower.
Pure SQL Strategy Pros & Cons:
Pros:
- Pure SQL; no need to know another language.
- Fast and performant.
Cons:
- Complicated SQL (depending on your definition of complicated)
- Error-prone
Strategy #2: Using JavaScript with User-defined Functions
So what if, instead of using `WITH` and `JOIN`, we use a user-defined function? With this approach, we can pass the entire `customDimensions` array and the desired index into the function and let it return the value.
Here is an example:
```sql
CREATE TEMPORARY FUNCTION getCustomDimension(cd ARRAY<STRUCT<index INT64, value STRING>>, index INT64)
RETURNS STRING
LANGUAGE js AS """
  for (var i = 0; i < cd.length; i++) {
    var item = cd[i];
    if (item.index == index) {
      return item.value;
    }
  }
  return '';
""";

SELECT
  VISIT.visitId,
  VISIT.device.operatingSystem,
  HIT.type,
  SUBSTR(HIT.page.pageTitle, 0, 30) AS pageTitle,
  getCustomDimension(HIT.customDimensions, 18) AS customDimension18_author,
  getCustomDimension(HIT.customDimensions, 27) AS customDimension27_topic
FROM `123456789.ga_sessions_20180501` VISIT,
  UNNEST(VISIT.hits) HIT
WHERE getCustomDimension(HIT.customDimensions, 18) LIKE "%McKenzie Romero%"
```

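The JavaScript body of that UDF is simple enough to sanity-check outside BigQuery. Here is a standalone version of the same lookup, with a sample array mirroring the `customDimensions` structure shown earlier:

```javascript
// Standalone version of the UDF body: scan the customDimensions array
// and return the value whose index matches, or '' if none does.
function getCustomDimension(cd, index) {
  for (var i = 0; i < cd.length; i++) {
    if (cd[i].index == index) {
      return cd[i].value;
    }
  }
  return "";
}

// Sample data mirroring the structure from the article.
const customDimensions = [
  { index: 4, value: "Homepage" },
  { index: 18, value: "McKenzie Romero" },
  { index: 27, value: "courts" },
];

console.log(getCustomDimension(customDimensions, 18)); // "McKenzie Romero"
console.log(getCustomDimension(customDimensions, 99)); // "" (dimension not set)
```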
This allows us to treat custom dimensions as if they were columns on the `hit` object, which makes for easier ad-hoc analysis and reduces errors. It does come at a cost: these queries run slower. In our testing, I ran these same queries against an entire month of data, processing about 40GB:
- Pure Standard SQL: avg 19.1 seconds
- User Defined Function: avg 69.9 seconds (about 3.7 times slower)
There are definite tradeoffs between the approaches. I would use user-defined functions for initial and ad-hoc analysis. Then, for regularly scheduled reports or queries, I would switch to the pure SQL methodology. Either of these patterns will allow you to more easily use custom dimensions when analyzing your Google Analytics data.
Also, there are probably even better methods out there for querying against custom dimensions. If you know of one, please, don’t hesitate to share!
— Carmony