Quality of Tracking Data

ericteubert · February 15, 2015, 3:16am

That’s a huge difference! Would be interesting to compare and find out why.

Here’s the relevant excerpt on how we handle aggregation:

Before tracking data is presented in the analytics area, it is cleaned up. Cleanup involves the following steps:

1. Based on the UA analysis bots are filtered out.
2. Duplicate requests are filtered out. A request is considered a duplicate if it contains
   - the same File ID
   - and the same Request ID
   - and was made within the same hour
3. Pre-Release downloads are filtered out. They may happen if you test downloads before publishing the episode.

(Source: http://docs.podlove.org/guides/download-analytics/)

… which immediately raises the question of how much difference it would make if I changed “within the same hour” to “within 24 hours”. I will investigate.
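To get a feel for what that change does, here is a minimal Python sketch of the duplicate rule with a configurable window. It is not the plugin's actual code: the record fields (`file_id`, `request_id`, `accessed_at`) and the `dedupe` helper are hypothetical names, and it assumes that “within the same hour” means “falls into the same fixed time bucket” rather than “within 60 minutes of the previous request”.

```python
from datetime import datetime, timezone


def dedupe(requests, window_seconds=3600):
    """Keep the first request per (file_id, request_id, time bucket).

    Assumption: requests are duplicates when they share a file ID and a
    request ID and fall into the same fixed bucket of `window_seconds`.
    """
    seen = set()
    kept = []
    for req in requests:
        # Integer division of the Unix timestamp yields the bucket index.
        bucket = int(req["accessed_at"].timestamp()) // window_seconds
        key = (req["file_id"], req["request_id"], bucket)
        if key not in seen:
            seen.add(key)
            kept.append(req)
    return kept


def ts(hour, minute=0):
    # Helper for the made-up sample data below (all on one UTC day).
    return datetime(2015, 2, 14, hour, minute, tzinfo=timezone.utc)


sample = [
    {"file_id": 1, "request_id": "a", "accessed_at": ts(9, 5)},
    {"file_id": 1, "request_id": "a", "accessed_at": ts(9, 40)},  # same hour
    {"file_id": 1, "request_id": "a", "accessed_at": ts(15, 0)},  # same day
    {"file_id": 1, "request_id": "b", "accessed_at": ts(9, 10)},  # other request ID
    {"file_id": 2, "request_id": "a", "accessed_at": ts(9, 10)},  # other file
]

hourly = dedupe(sample, window_seconds=3600)
daily = dedupe(sample, window_seconds=24 * 3600)
print(len(hourly), len(daily))  # prints "4 3"
```

With the hourly window, the 15:00 request counts as a new download. With the 24-hour window it collapses into the 09:05 one, so the wider window can only lower the totals or leave them unchanged.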

It’s unlikely that they are all bots. I use the same UA (user agent) parser as Piwik, which already has pretty decent bot detection. Its one blind spot was podcast clients. That’s why I forked their library and started adding my own detection rules for popular clients. The rule set is far from complete, because adding rules is a slow and tedious process. We have a crowd-sourced solution in mind but lack the manpower to build it at the moment.

When building the system, I only had data from the Metaebene to work with. I kept adding rules until the “Unknown” podcast client had dropped out of the top 10 most of the time. That’s why your report of such a high number surprises me. If you want to help, run this against your database:

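-- Count cleaned download intents per user agent that has no detected client.
-- Adjust the wp_ table prefix if your installation uses a different one.
SELECT
	COUNT(ua.id) AS cnt, ua.*
FROM
	wp_podlove_downloadintentclean di
	JOIN `wp_podlove_useragent` ua ON ua.id = di.`user_agent_id`
WHERE client_name IS NULL
GROUP BY ua.id
ORDER BY cnt DESC;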

It looks for unknown clients and orders them by popularity.