# Notes on the Cloudflare Web Analytics API
So the particular technical challenge at hand is that Cloudflare is stingy about what analytics data it will give me over the API, and I want all of the analytics data so I can stop using Cloudflare and start using Umami again. Cloudflare likes aggregate statistics. Umami does not.
I mean, actually, aggregate statistics aren't a bad idea; they just make this one use case difficult.
The Cloudflare dashboard makes queries like the following (I'll save you the time of digging it out yourself):
```graphql
query GetRumAnalyticsTopNs {
  viewer {
    accounts(filter: { accountTag: $accountTag }) {
      total: rumPageloadEventsAdaptiveGroups(filter: $filter, limit: 1) {
        count
        sum {
          visits
        }
      }
      topReferers: rumPageloadEventsAdaptiveGroups(
        filter: $filter
        limit: 15
        orderBy: [$order]
      ) {
        count
        avg {
          sampleInterval
        }
        sum {
          visits
        }
        dimensions {
          metric: refererHost
        }
      }
      topPaths: rumPageloadEventsAdaptiveGroups(
        filter: $filter
        limit: 15
        orderBy: [$order]
      ) {
        count
        avg {
          sampleInterval
        }
        sum {
          visits
        }
        dimensions {
          metric: requestPath
        }
      }
      topHosts: rumPageloadEventsAdaptiveGroups(
        filter: $filter
        limit: 15
        orderBy: [$order]
      ) {
        count
        avg {
          sampleInterval
        }
        sum {
          visits
        }
        dimensions {
          metric: requestHost
        }
      }
      topBrowsers: rumPageloadEventsAdaptiveGroups(
        filter: $filter
        limit: 15
        orderBy: [$order]
      ) {
        count
        avg {
          sampleInterval
        }
        sum {
          visits
        }
        dimensions {
          metric: userAgentBrowser
        }
      }
      topOSs: rumPageloadEventsAdaptiveGroups(
        filter: $filter
        limit: 15
        orderBy: [$order]
      ) {
        count
        avg {
          sampleInterval
        }
        sum {
          visits
        }
        dimensions {
          metric: userAgentOS
        }
      }
      topDeviceTypes: rumPageloadEventsAdaptiveGroups(
        filter: $filter
        limit: 15
        orderBy: [$order]
      ) {
        count
        avg {
          sampleInterval
        }
        sum {
          visits
        }
        dimensions {
          metric: deviceType
        }
      }
      countries: rumPageloadEventsAdaptiveGroups(
        filter: $filter
        limit: 200
        orderBy: [$order]
      ) {
        count
        avg {
          sampleInterval
        }
        sum {
          visits
        }
        dimensions {
          metric: countryName
        }
      }
    }
  }
}
```
and
```json
{
  "accountTag": "<x>",
  "filter": {
    "AND": [
      {
        "datetime_geq": "2023-12-12T04:21:53Z",
        "datetime_leq": "2024-01-11T04:21:53Z"
      },
      {
        "OR": [
          {
            "siteTag": "<x>"
          }
        ]
      }
    ]
  },
  "order": "count_DESC"
}
```
This is how the API is supposed to be used. Let's abuse it.
I tried something like
```graphql
{
  viewer {
    accounts(filter: { accountTag: $tag }) {
      rumPageloadEventsAdaptiveGroups(
        filter: { datetime_gt: $start }
        limit: 1000
        orderBy: [datetimeMinute_ASC]
      ) {
        dimensions {
          countryName
          date
          datetimeMinute
          deviceType
          refererHost
          refererPath
          requestHost
          requestPath
          userAgentBrowser
          userAgentOS
        }
        avg {
          sampleInterval
        }
        sum {
          visits
        }
        count
      }
    }
  }
}
```
and
```json
{
  "tag": "<x>",
  "start": "2024-01-01T02:07:05Z"
}
```
I thought my query would work, because there is just no way all these people fall neatly into minute groups. But I was only getting 60 rows back. Hmm. What's important to understand is that all of this is just guesses: only around one in ten requests is actually recorded, and in the grand scheme of things it all averages out. So we need to run with that. `count` and `sum.visits` are the same for me.
## Umami Internals
There are two tables of interest here:

```
website_event:
    event_id: UUID()
    website_id: const UUID
    session_id: ref session.session_id
    created_at: date
    url_path: string
    url_query: string
    referrer_path: string
    referrer_query: string
    referrer_domain: string
    page_title: string
    event_type: 1
    event_name: ""

session:
    session_id: UUID()
    website_id: const UUID
    hostname: string
    browser: string
    os: string
    device: string
    screen: string
    language: string
    country: string
    subdivision1: ""
    subdivision2: ""
    city: ""
    created_at: date
```
Most of this data maps over pretty well. The real question is: how do we map sessions to sessions and pages to pages (keeping in mind, of course, that all we are getting out of this is aggregate statistics)?
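To make the column mapping concrete, here's a rough sketch of how a single Cloudflare group's dimensions could translate into the Umami columns above. The `CfGroup` type and the helper names are my own, and this glosses over the session/event fan-out discussed below:

```typescript
// Hypothetical shape of one Cloudflare adaptive group, based on the
// dimensions requested in the query above.
interface CfGroup {
  count: number;
  sum: { visits: number };
  avg: { sampleInterval: number };
  dimensions: {
    countryName: string;
    datetimeMinute: string;
    deviceType: string;
    refererHost: string;
    requestHost: string;
    requestPath: string;
    userAgentBrowser: string;
    userAgentOS: string;
  };
}

// Map a group's dimensions onto Umami's `session` columns.
function toSession(g: CfGroup) {
  return {
    hostname: g.dimensions.requestHost,
    browser: g.dimensions.userAgentBrowser,
    os: g.dimensions.userAgentOS,
    device: g.dimensions.deviceType,
    country: g.dimensions.countryName,
    created_at: new Date(g.dimensions.datetimeMinute),
  };
}

// Map a group's dimensions onto Umami's `website_event` columns,
// attached to an existing session.
function toEvent(g: CfGroup, session_id: string) {
  return {
    session_id,
    url_path: g.dimensions.requestPath,
    referrer_domain: g.dimensions.refererHost,
    created_at: new Date(g.dimensions.datetimeMinute),
    event_type: 1,
  };
}
```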
Let's take a closer look at what we are getting back from the Cloudflare API:
```jsonc
"avg": {
  // Average sample interval
  "sampleInterval": 10
},
// The number of pages viewed by end-users
"count": 10,
"sum": {
  // The number of pages viewed by end-users that were initiated from a
  // different website (i.e. where the Document.referrer does not match
  // the hostname)
  "visits": 10
}
```
So each visit corresponds to a session, and each `count` (or `avg.sampleInterval`, which matches it exactly in my data) corresponds to the number of pages those 10 users viewed (for all we know they reloaded the page an average of 2 times in this cluster):
```javascript
v.data.viewer.accounts[0].rumPageloadEventsAdaptiveGroups.reduce(
  (a, b) => a + b.avg.sampleInterval,
  0,
); // 660
v.data.viewer.accounts[0].rumPageloadEventsAdaptiveGroups.reduce(
  (a, b) => a + b.count,
  0,
); // 660
v.data.viewer.accounts[0].rumPageloadEventsAdaptiveGroups.reduce(
  (a, b) => a + b.sum.visits,
  0,
); // 460
v.data.viewer.accounts[0].rumPageloadEventsAdaptiveGroups.filter(
  (a) => a.count != a.avg.sampleInterval,
); // []
```
All of these are multiples of 10. What's interesting is that `visits` (what denotes a session) can be 0. This would indicate they clicked around on my site!
Let’s verify if that is indeed the case:
```javascript
v.data.viewer.accounts[0].rumPageloadEventsAdaptiveGroups
  .filter((a) => a.sum.visits == 0)
  .map((a) =>
    v.data.viewer.accounts[0].rumPageloadEventsAdaptiveGroups.find(
      (b) =>
        b.dimensions.userAgentBrowser == a.dimensions.userAgentBrowser &&
        b.dimensions.deviceType == a.dimensions.deviceType &&
        a.dimensions.countryName == b.dimensions.countryName &&
        b.sum.visits > 0,
    ),
  ); // [{…}, {…}, {…}, undefined, undefined, undefined, {…}, undefined, undefined, undefined, undefined, {…}, undefined, {…}]
```
This is to be expected, of course. You need to be lucky for the samples to line up here; what's to say that the first page a person visited counts as a sample? Well, the ones we did match together we can happily associate.
Also, keep in mind that when we dry run, at the very least the number of pages should come out the same.
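That dry-run invariant can be written down as a tiny check. A sketch under my own naming, nothing from the real script: the number of events we plan to insert should equal the total sampled pageview `count` from Cloudflare.

```typescript
// Hypothetical dry-run sanity check: planned Umami events must add up to
// the total `count` reported across all Cloudflare groups.
function checkEventTotal(
  groups: { count: number }[],
  plannedEventCount: number,
): void {
  const expected = groups.reduce((a, g) => a + g.count, 0);
  if (plannedEventCount !== expected) {
    throw new Error(
      `planned ${plannedEventCount} events, expected ${expected}`,
    );
  }
}
```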
What to do about the rest? I suppose we just need to make extra sessions.
Let’s make a game plan:
1. Step through the list. For each `visits`, make a session. To determine the amount of time between the `created_at`s of each session, space each one out by `((time_of_next_group - time_of_this_group) / number_of_visits)`, starting at `time_of_this_group`.
2. Step through the list again. If there are `10` counts and `10` visits, make one event for each session. If there are `20` counts and `10` visits, make 2 events for each session. If there are no visits, continue to step 3.
3. If there are no sessions for this group, try to find another group with matching sessions. If you do, associate these events with that group. If you don't, just make `count` worth of sessions. Not worth the drama.
4. Insert.
A little hacking later, and https://github.com/boehs/site/blob/master/utils/cloudflare-analytics-to-umami-lord-save-me.ts is born.