Quality of Tracking Data

That’s a huge difference! Would be interesting to compare and find out why.

Here’s the relevant excerpt on how we handle aggregation:

Before tracking data is presented in the analytics area, it is cleaned up. Cleanup involves the following steps:

  • Based on the UA analysis, bots are filtered out.
  • Duplicate requests are filtered out. A request is considered a duplicate if it contains
    • the same File ID
    • and the same Request ID
    • and was made within the same hour
  • Pre-Release downloads are filtered out. They may happen if you test downloads before publishing the episode.

(Source: http://docs.podlove.org/guides/download-analytics/)

… which immediately raises the question of how much of a difference it would make if I changed “within the same hour” to “within 24 hours”. I will investigate.
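
A rough way to estimate that before changing any code: compare how many rows survive when bucketing by calendar hour versus calendar day. This is only a sketch, not the real deduplication logic (it buckets by clock hour/day rather than using a rolling window), and the column names media_file_id, request_id and accessed_at are guesses based on the “File ID” / “Request ID” wording above, so adjust them to your schema:

SELECT
    COUNT(*) AS total_rows,
    -- roughly what the current 1-hour window keeps
    COUNT(DISTINCT media_file_id, request_id, DATE(accessed_at), HOUR(accessed_at)) AS distinct_per_hour,
    -- roughly what a 24-hour window would keep
    COUNT(DISTINCT media_file_id, request_id, DATE(accessed_at)) AS distinct_per_day
FROM
    wp_podlove_downloadintentclean

If distinct_per_day comes out much lower than the total, widening the window would make a noticeable dent.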

It’s unlikely that they are all bots. I use the same UA (user agent) parser as Piwik, and they already have pretty decent bot detection. They just had a complete blind spot for podcast clients. That’s why I forked their library and started adding my own detection rules for popular clients. This is far from complete because it’s a slow and tedious process. We have a crowd-sourced solution in mind but lack the manpower to execute it at the moment.
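
If you want to sanity-check the bot side on your own data first, a query along these lines shows the split between requests from agents flagged as bots and everything else. It assumes the raw intent table is wp_podlove_downloadintent and that the user agent table carries a bot flag; both are assumptions, so adjust to your installation:

SELECT
    ua.bot, COUNT(*) cnt  -- 1 = flagged as bot, 0/NULL = not flagged (assumed meaning)
FROM
    wp_podlove_downloadintent di
    JOIN wp_podlove_useragent ua ON ua.id = di.user_agent_id
GROUP BY
    ua.bot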

When building the system, I only had data from the Metaebene to work with. I kept working on the detection rules until the “Unknown” podcast client disappeared from the top 10 most of the time. That’s why your report of such a high number surprises me. If you want to help, throw this against your database:

SELECT
    COUNT(ua.id) cnt, ua.*
FROM
    wp_podlove_downloadintentclean di
    JOIN wp_podlove_useragent ua ON ua.id = di.user_agent_id
WHERE
    ua.client_name IS NULL  -- requests whose client could not be identified
GROUP BY ua.id
ORDER BY cnt DESC

It looks for unknown clients and orders them by popularity.
