Quality of Tracking Data


#1

Regarding the Download Analytics: Many of us podcasters already had some kind of tracking. Let’s assume Podlove is the new “reference implementation” and collect our experience with how well (or not) our old systems match Podlove’s numbers. Preferably in a dedicated topic; Admin, please :wink:

For example, Tim apparently observed more listeners. What did you observe when comparing your old tracking system with Podlove’s?

For KonScience I can say (greetings from Mariëlle at this point, as she is the more advanced data guru of the two of us):

  • Podlove reports almost twice as many downloads as we measured with our old system :smile:
  • There are about 40% unknown podcast clients & operating systems. If these were all bots of some kind, we’d have a similar number of “real” downloaders as we thought.
  • However, these unknown downloads have a very similar distribution of sources, contexts and assets. This makes it less likely that they are bots. Could be Turing test winners, though!

#2

That’s a huge difference! Would be interesting to compare and find out why.

Here’s the relevant excerpt on how we handle aggregation:

Before tracking data is presented in the analytics area, it is cleaned up. Cleanup involves the following steps:

  • Based on the UA analysis bots are filtered out.
  • Duplicate requests are filtered out. A request is considered a duplicate if it contains
    • the same File ID
    • and the same Request ID
    • and was made within the same hour
  • Pre-Release downloads are filtered out. They may happen if you test downloads before publishing the episode.

(Source: http://docs.podlove.org/guides/download-analytics/)

… which immediately raises the question of how much of a difference it would make if I changed “within the same hour” to “within 24 hours”. I will investigate.
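To get a feeling for how the duplicate window affects the counts, here is a minimal Python sketch of the duplicate rule quoted above. It is not the plugin's actual code: the function name `deduplicate`, the `window` parameter and the sample rows are made up for illustration, and it assumes that "within the same hour" means "falls into the same fixed time bucket".

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def deduplicate(requests, window=timedelta(hours=1)):
    """Keep the first request per (file_id, request_id) and time bucket.

    `requests` is an iterable of (file_id, request_id, accessed_at) tuples.
    Assumption: time is cut into fixed buckets of length `window`, and two
    requests are duplicates if file ID, request ID and bucket all match.
    """
    seen = set()
    kept = []
    for file_id, request_id, accessed_at in sorted(requests, key=lambda r: r[2]):
        bucket = (accessed_at - EPOCH) // window  # integer bucket index
        key = (file_id, request_id, bucket)
        if key not in seen:
            seen.add(key)
            kept.append((file_id, request_id, accessed_at))
    return kept

# Hypothetical sample data, not taken from any real installation.
sample = [
    (1, "abc", datetime(2015, 3, 1, 10, 5)),
    (1, "abc", datetime(2015, 3, 1, 10, 40)),  # same file, request ID and hour
    (1, "abc", datetime(2015, 3, 1, 14, 0)),   # same file and request ID, later hour
    (1, "xyz", datetime(2015, 3, 1, 10, 6)),   # different request ID
    (2, "abc", datetime(2015, 3, 1, 10, 7)),   # different file
]

hourly = deduplicate(sample)
daily = deduplicate(sample, window=timedelta(hours=24))
print(len(hourly), len(daily))
```

With a one-hour window the 14:00 request counts as a new download; with a 24-hour window it is folded into the first one. Listeners who resume a download later the same day would therefore be counted once instead of several times.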

Unlikely that they are all bots. I use the same UA (user agent) parser as Piwik, and it already has pretty decent bot detection. It just had a complete blind spot for podcast clients. That’s why I forked their library and started adding my own detection rules for popular clients. This is far from complete, because it’s a slow and tedious process. We have a crowd-sourced solution in mind but lack the manpower to execute it at the moment.

When building the system, I only had data from the Metaebene to work with. I actually kept working until the “Unknown” podcast client disappeared from the top 10 most of the time. That’s why your report of such a high number surprises me. If you want to help, run this against your database:

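-- List unrecognized user agents, most frequent first.
-- Adjust the wp_ table prefix if your installation uses a different one.
SELECT
	COUNT(ua.id) AS cnt, ua.*
FROM
	wp_podlove_downloadintentclean di
	JOIN wp_podlove_useragent ua ON ua.id = di.user_agent_id
WHERE ua.client_name IS NULL
GROUP BY ua.id
ORDER BY cnt DESC;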

It looks for unknown clients and orders them by popularity.


#3

OK, done! Easier than I thought :slight_smile: The top three are (cnt & user_agent):

Which export format do you need/want for the full dataset?


#4

CSV is fine :smile:

bingbot and majesticbot should already be identified as bots.
Mechanize, technically, is a library, but should probably also be identified as a bot since it’s used for crawling websites.
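
As a rough illustration of what such a rule looks like, here is a small Python sketch that flags user agents by substring. It is not the rule set of the UA parser discussed here; the pattern list, the function name `is_bot` and the sample user agent strings are assumptions made up for this example.

```python
import re

# Hypothetical patterns for illustration only; a real rule set is far larger.
BOT_PATTERN = re.compile(r"bingbot|MJ12bot|Mechanize", re.IGNORECASE)

def is_bot(user_agent):
    """Return True if the user agent string matches one of the bot patterns."""
    return bool(BOT_PATTERN.search(user_agent))

# Illustrative user agent strings, not copied from the exported dataset.
samples = [
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)",
    "WWW-Mechanize/1.73",
    "AntennaPod/1.2",
]

for ua in samples:
    print("bot   " if is_bot(ua) else "client", ua)
```

Treating Mechanize as a bot is a judgment call: any script built on the library would be filtered out, including a legitimate download tool.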


#5

OK, sent to the address you published on GitHub.
PS: Just a collection of related threads: GitHub #659, others to follow…


#6

Regarding the bots: Maybe a first step towards a user_agent database could be to crowdsource a robots.txt? Could be done in an EtherPad. Or could excluding (non-podcast related) bots from accessing the media files be a problem? Why filter them out of the data later, when they can be blocked from the start?


#7

A robots.txt doesn’t block bots, it merely asks them nicely to stay away. It’s their choice to follow the robots.txt rules or not.
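
To illustrate: the check happens on the crawler's side. A well-behaved crawler does something like the following Python sketch, using the standard library's `urllib.robotparser`. The robots.txt content, the bot name `ExampleBot` and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks all crawlers to stay away from media files.
ROBOTS_TXT = """\
User-agent: *
Disallow: /podlove/file/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

media_url = "https://example.com/podlove/file/1/episode.mp3"
page_url = "https://example.com/episodes/"

# A polite crawler asks before fetching. Nothing on the server enforces this.
allowed_media = parser.can_fetch("ExampleBot", media_url)
allowed_page = parser.can_fetch("ExampleBot", page_url)
print(allowed_media, allowed_page)
```

A crawler that never calls `can_fetch` (or its equivalent) can still request the media file. Actually blocking it would have to happen on the server.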


#8

True. Are they mostly being ignored, though? When I give it a try, I’ll report the findings here.


#9

It would be nice if Mariëlle could

a) open an account here
b) use Podlove Tracking Parameters in her stats tool to make proper comparisons with our system


#10

No success with disallowing /wp_content/uploads/ :-/
I queried again with
AND accessed_at >= (curdate() - INTERVAL DAYOFWEEK(curdate())+N DAY)
inserted into the WHERE clause of the query above. Result: WWW-Mechanize and MJ12bot appear at the top again, along with Googlebot. Some less prominent ones are gone, though, Bingbot for example. Kudos to their engineers for teaching them respect :+1:
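
For anyone wondering which date that filter actually produces, here is a small Python sketch of the same arithmetic. It assumes MySQL evaluates the interval as `DAYOFWEEK(curdate()) + N` days and relies on `DAYOFWEEK()` returning 1 for Sunday through 7 for Saturday. The function names are made up, and `N` stays a parameter because its value is not given above.

```python
from datetime import date, timedelta

def mysql_dayofweek(d):
    """Mirror MySQL's DAYOFWEEK(): 1 = Sunday ... 7 = Saturday."""
    return d.isoweekday() % 7 + 1

def cutoff(today, n):
    """Date matching curdate() - INTERVAL DAYOFWEEK(curdate())+n DAY."""
    return today - timedelta(days=mysql_dayofweek(today) + n)

today = date(2015, 3, 4)  # a Wednesday, so DAYOFWEEK is 4
print(cutoff(today, 0))   # the Saturday before the current week
print(cutoff(today, 6))   # the Sunday that started the previous week
```

So `N = 0` reaches back to the last Saturday, and each increment of `N` moves the cutoff one day further into the past.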

I’ll try disallowing /podlove/file/ additionally during the next week.


#11

I didn’t look further into the robots.txt approach, but wanted to report that after 1 month of running v2.3.0, AntennaPod has climbed to the top of the “unknown clients” list :smile: