When I started this blog, I automatically included Google Analytics because it was convenient and useful. I’ve recently been thinking more about privacy, so I’ve been looking for alternatives. I settled on the following solution, which still uses Google Analytics but sends less user information (and loads faster).
How Analytics.js Works
The most common way to use GA is to include a script like this somewhere on the page:
This code creates a GA object and a script tag, which then loads analytics.js. When the script loads, functions are called to collect information from the browser, set some cookies, and then send a pageview. This pageview GET request is the most important part – it’s where all the data is collected. From a privacy standpoint, another relevant part is the auto argument, which tells GA to automatically set cookies to track users. One of these cookies persists for 2 years, and is meant to track users across multiple sessions.
The problem with the above approach is that you depend on what Google is willing to build into their API. They still control what happens behind the scenes, and can change it in the future by modifying analytics.js. In addition, they only anonymize the last octet of the IP address (e.g. 220.127.116.11), which isn’t very anonymous if you’re on a corporate or university network (and still allows tracking at the city level for a large ISP).
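To make the last-octet point concrete, here is a small sketch of what zeroing the final octet does. The function name `maskLastOctet` is my own, and the addresses come from a range reserved for documentation; this only mimics the behavior described above and is not Google's code.

```javascript
// Zero the last octet of a dotted-quad IPv4 address.
// Illustrative only: this mimics the masking described in the text.
function maskLastOctet(ip) {
  const parts = ip.split('.');
  const valid =
    parts.length === 4 &&
    parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
  if (!valid) {
    throw new Error(`not an IPv4 address: ${ip}`);
  }
  parts[3] = '0';
  return parts.join('.');
}

// Two hosts on the same /24 network collapse to the same value,
// but the network itself stays identifiable.
console.log(maskLastOctet('203.0.113.7'));   // 203.0.113.0
console.log(maskLastOctet('203.0.113.200')); // 203.0.113.0
```

If a whole /24 block belongs to one small organization, the masked address still points straight at it, which is why this is weak anonymization.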
The Measurement Protocol
An alternative to loading analytics.js is to manually collect whatever data you need, then send it to them using the measurement protocol. This way you don’t have to load analytics.js, and you have more fine-tuned control over what data is sent. Here’s an example script that sends a GET request (an example POST request is in the appendix):
There are more details about how this works in the reference, but the main point is that I only collect the page URL and page title for each visit. The uip field is an override that sets the IP to 0.0.0.0, and the aip field tells Google to anonymize this IP address.
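As a rough sketch of how such a request could be assembled (this is not the original script), the function below builds the query string from the fields discussed above. The function name `buildPageviewUrl`, the placeholder tracking ID, the fixed client ID, and the `example.com` endpoint are all assumptions for illustration; the endpoint is passed in rather than hardcoded. The parameter names (`v`, `tid`, `cid`, `t`, `dl`, `dt`, `uip`, `aip`) are the ones documented in the Measurement Protocol reference.

```javascript
// Build a Measurement Protocol pageview URL carrying only the page URL and title.
// `endpoint`, `trackingId`, and `clientId` are supplied by the caller;
// the values used in the usage example are placeholders, not real ones.
function buildPageviewUrl(endpoint, trackingId, clientId, pageUrl, pageTitle) {
  const params = new URLSearchParams({
    v: '1',          // protocol version
    tid: trackingId, // GA property ID
    cid: clientId,   // client ID (the protocol requires one; a fixed value is an assumption)
    t: 'pageview',   // hit type
    dl: pageUrl,     // document location (page URL)
    dt: pageTitle,   // document title
    uip: '0.0.0.0',  // IP override
    aip: '1',        // ask GA to anonymize the IP
  });
  return `${endpoint}?${params.toString()}`;
}

// Usage with placeholder values.
const hitUrl = buildPageviewUrl(
  'https://example.com/collect',
  'UA-XXXXX-Y',
  '555',
  'https://example.com/my-post',
  'My Post'
);
console.log(hitUrl);
```

In the browser, the resulting URL can be requested with a plain GET, for example by setting it as the src of an Image object. Using URLSearchParams means the page URL and title are percent-encoded automatically.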
One caveat: It took me a while to realize this, but much of the data that GA collects is actually gathered when the user’s browser performs the GET request, regardless of the data that’s attached to the request. This is how GA gets the user’s IP address and User Agent information, so in the end this approach comes down to trusting that Google actually overrides and anonymizes the IP address in their database. So I’m not completely sold on this approach yet, but it seems like an improvement.
- Other third party analytics providers - The problem is, they can still sell your data.
- Google Webmasters Console - This shows you how often your site appears in searches, how often people clicked through to your site, and inbound links. The downside is it only shows traffic from Google searches.
- Self Hosted - Options like Matomo/Piwik are nice, or you could just set up your own server and send GET requests to it whenever a page on the site loads. The problem is, these options are more complicated than the static site itself.
- Amazon Bucket - Static sites hosted in S3 buckets can be configured to store access logs. You could pull these down using the command line and analyze them locally. Unfortunately, static sites on S3 are a pain to set up.
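For a sense of how small the "own server" option from the list above could be, here is a minimal sketch of a request handler that tallies pageviews in memory. The `/hit` route, the `page` query parameter, and the name `createPageviewCounter` are made up for this example; a real setup would also need persistence and hosting, which is where the extra complexity comes from.

```javascript
// Tally pageviews in memory, keyed by page path.
// The `/hit?page=...` route and the `page` parameter are hypothetical.
function createPageviewCounter() {
  const counts = new Map();

  // Shaped like a Node `http` request listener, so it could be passed
  // to http.createServer(counter.handle) in a real server.
  function handle(req, res) {
    // The base URL is only needed so the relative request URL can be parsed.
    const url = new URL(req.url, 'http://localhost');
    const page = url.searchParams.get('page');
    if (url.pathname === '/hit' && page) {
      counts.set(page, (counts.get(page) || 0) + 1);
      res.statusCode = 204; // success, no response body needed
    } else {
      res.statusCode = 400; // unknown route or missing page
    }
    res.end();
  }

  return { counts, handle };
}

// Usage with stand-in request/response objects instead of a live server.
const counter = createPageviewCounter();
const fakeRes = { statusCode: 0, end() {} };
counter.handle({ url: '/hit?page=%2Fabout' }, fakeRes);
counter.handle({ url: '/hit?page=%2Fabout' }, fakeRes);
console.log(counter.counts.get('/about')); // 2
```

The server still sees each visitor's IP address and User Agent when the request arrives, but here you decide what gets stored, and this sketch stores only a count per page.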
Appendix - Post Request
Note this uses a different URL (/collect), and sends the joined fields as the body of the