Privacy With Google Analytics

When I started this blog, I automatically included Google Analytics because it was convenient and useful. I’ve recently been thinking more about privacy, so I’ve been looking for alternatives. I settled on the following solution, which still uses Google Analytics but sends less user information (and loads faster).

How Analytics.js Works

The most common way to use GA is to include a script like this somewhere on the page:

(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');

ga('create', 'UA-XXXXX-Y', 'auto');
ga('send', 'pageview');

This code creates a GA object and a script tag, which then loads analytics.js. Until the script loads, calls to ga() are only queued; once it loads, it runs them to collect information from the browser, set some cookies, and then send a pageview. This pageview GET request is the most important part – it’s where all the data is collected. From a privacy standpoint, the other relevant part is the cookies, which GA sets by default to track users (the auto argument only tells it to choose the cookie domain automatically). One of these cookies persists for 2 years, and is meant to track users across multiple sessions.
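To make the queueing step easier to see, here is a rough, unminified sketch of what the stub part of that snippet does. It is a simplification and not Google's code: installStub and fakeWindow are names I made up for illustration, and the part that creates the script tag is left out so the example can run outside a browser.

```javascript
// Simplified sketch of the command-queue stub (illustrative, not Google's code).
// `win` stands in for the browser's window object; `name` is the global name ('ga').
function installStub(win, name) {
  win.GoogleAnalyticsObject = name;
  win[name] = win[name] || function () {
    // Until analytics.js has loaded, calls are only recorded in a queue.
    (win[name].q = win[name].q || []).push(arguments);
  };
  win[name].l = 1 * new Date(); // timestamp of when the stub was created
  return win[name];
}

var fakeWindow = {}; // hypothetical stand-in for window
var ga = installStub(fakeWindow, 'ga');
ga('create', 'UA-XXXXX-Y', 'auto');
ga('send', 'pageview');

// Both calls are now sitting in ga.q, waiting to be replayed by the real library.
console.log(ga.q.length);
```

Nothing is sent at this point. The queued commands only take effect when analytics.js arrives and replays them.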

Google does offer ways to increase privacy with custom settings on the GA object. For example, the following settings disable cookies, force SSL, and store fewer digits of the IP address:

ga('create', 'UA-XXXXX-Y', {
    'storage': 'none',
    'storeGac': false
});
ga('set', 'anonymizeIp', true);
ga('set', 'forceSSL', true);
ga('send', 'pageview');

The problem with the above approach is that you depend on what Google is willing to build into their API. They still control what happens behind the scenes, and can change it in the future by modifying analytics.js. In addition, they only anonymize the last octet of the IP address (e.g. 12.214.31.144 -> 12.214.31.0), which isn’t very anonymous if you’re on a corporate or university network (and still allows tracking at the city level for a large ISP).
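As a small illustration of how little this hides, the sketch below zeroes the last octet the same way the example above does. The function name anonymizeLastOctet is my own, and this only imitates the visible effect; I don't know how Google implements it internally.

```javascript
// Imitates the visible effect of last-octet anonymization (IPv4 only).
// anonymizeLastOctet is a hypothetical helper, not part of any GA API.
function anonymizeLastOctet(ip) {
  var parts = ip.split('.');
  if (parts.length !== 4) {
    throw new Error('expected an IPv4 address like 12.214.31.144');
  }
  parts[3] = '0'; // only the final octet is dropped
  return parts.join('.');
}

console.log(anonymizeLastOctet('12.214.31.144'));
console.log(anonymizeLastOctet('12.214.31.7'));
```

Both addresses map to 12.214.31.0, so a visitor is only hidden among the 256 addresses that share the first three octets. On a small network that may be a handful of machines.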

The Measurement Protocol

An alternative to loading analytics.js is to manually collect whatever data you need, then send it to Google using the Measurement Protocol. This way you don’t have to load analytics.js, and you have more fine-tuned control over what data is sent. Here’s an example script that sends a GET request (an example POST request is in the appendix):

function sendData() {
    var ls = [];
    var tid = 'UA-XXXXX-Y';  // Your Analytics ID
    // Random 3-digit client ID, regenerated on every page load
    var cid = Math.floor(100 + Math.random() * 900);
    // Measurement Protocol field names and their values, in matching order
    var fields = ['v', 'tid', 'cid', 't', 'aip', 'uip', 'dl', 'dt'];
    var values = [1, tid, cid, 'pageview', 1, '0.0.0.0', window.location.href, document.title];
    for (var i = 0; i < fields.length; i++) {
        ls.push(String(fields[i]) + '=' + encodeURIComponent(String(values[i])));
    }
    var url = 'https://www.google-analytics.com/r/collect?' + ls.join('&');
    var request = new XMLHttpRequest();
    request.open('GET', url, true);  // true = asynchronous
    request.send();
}
sendData();

There are more details about how this works in the Measurement Protocol reference, but the main point is that I only collect the page URL and page title for each visit. The uip field is an override that sets the IP to 0.0.0.0, and the aip field tells Google to anonymize this IP address.
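If you want to see exactly what would be sent before sending anything, one option is to split the string-building step into its own function. This is only a sketch: buildPayload is a name I made up, and the client ID, URL, and title below are placeholder values.

```javascript
// Builds the query string for a hit without sending it, so it can be inspected.
// buildPayload is a hypothetical helper; the field names are the ones used above.
function buildPayload(params) {
  return Object.keys(params).map(function (key) {
    return key + '=' + encodeURIComponent(String(params[key]));
  }).join('&');
}

var payload = buildPayload({
  v: 1,                                  // protocol version
  tid: 'UA-XXXXX-Y',                     // your Analytics ID
  cid: 555,                              // placeholder client ID
  t: 'pageview',                         // hit type
  aip: 1,                                // ask for IP anonymization
  uip: '0.0.0.0',                        // IP override
  dl: 'https://example.com/post?id=1',   // placeholder page URL
  dt: 'Hello World'                      // placeholder page title
});

console.log(payload);
```

Printing the payload makes it easy to confirm that nothing beyond the page URL and title is being attached to the request.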

One caveat: It took me a while to realize this, but much of the data that GA collects is actually gathered when the user’s browser performs the GET request, regardless of the data that’s attached to the request. This is how GA gets the user’s IP address and User Agent information, so in the end this approach comes down to trusting that Google actually overrides and anonymizes the IP address in their database. So I’m not completely sold on this approach yet, but it seems like an improvement.

Other Options

  • Other third-party analytics providers - The problem is, they can still sell your data.
  • Google Search Console (formerly Webmaster Tools) - This shows you how often your site appears in searches, how often people clicked through to your site, and inbound links. The downside is it only shows traffic from Google searches.
  • Self-hosted - Options like Matomo/Piwik are nice, or you could just set up your own server and perform GET requests to it whenever a site page loads. The problem is, these options are more complicated than the static site itself.
  • Amazon Bucket - Static sites hosted in S3 buckets can be configured to store access logs. You could pull these down using the command line and analyze them locally. Unfortunately, static sites on S3 are a pain to set up.
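For a sense of what the "set up your own server" option involves, here is a minimal sketch using Node's built-in http module. Everything specific in it is an assumption for illustration: the p query parameter, the in-memory Map, and the port in the usage note are my own choices, and a real setup would need persistent storage and HTTPS.

```javascript
var http = require('http');

var counts = new Map(); // in-memory tally: page path -> number of hits

// Records one hit for the page named in the "p" query parameter.
// ("p" is a hypothetical parameter name chosen for this sketch.)
function recordHit(requestUrl) {
  var page = new URL(requestUrl, 'http://localhost').searchParams.get('p') || '(unknown)';
  var total = (counts.get(page) || 0) + 1;
  counts.set(page, total);
  return total;
}

// Every request counts as one hit; the response is empty.
var server = http.createServer(function (req, res) {
  recordHit(req.url);
  res.writeHead(204); // 204 = No Content
  res.end();
});

// server.listen(8080); // uncomment to start accepting requests
```

A page would then request something like /hit?p=%2Fabout on that server when it loads. Even this small version shows the trade-off: it is one more thing to host, secure, and keep running next to a site that otherwise needs no server.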

Appendix - POST Request

Note that this uses a different URL (/collect) and sends the joined fields as the body of the POST.

function sendData() {
    var ls = [];
    var tid = 'UA-XXXXX-Y';  // Your Analytics ID
    // Random 3-digit client ID, regenerated on every page load
    var cid = Math.floor(100 + Math.random() * 900);
    // Measurement Protocol field names and their values, in matching order
    var fields = ['v', 'tid', 'cid', 't', 'aip', 'uip', 'dl', 'dt'];
    var values = [1, tid, cid, 'pageview', 1, '0.0.0.0', window.location.href, document.title];
    for (var i = 0; i < fields.length; i++) {
        ls.push(String(fields[i]) + '=' + encodeURIComponent(String(values[i])));
    }
    var data = ls.join('&');  // sent as the request body instead of the query string
    var request = new XMLHttpRequest();
    request.open('POST', 'https://www.google-analytics.com/collect', true);
    request.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded');
    request.send(data);
}
sendData();