Quality of Tracking Data

katrinleinweber · February 14, 2015, 7:17am

Regarding the Download Analytics: Many of us podcasters already had some kind of tracking. Let’s assume Podlove is the new “reference implementation” and collect our experience with how well (or not) our old systems match Podlove’s numbers. Preferably in a dedicated topic; Admin, please.

For example, Tim apparently observed more listeners. What did you observe when comparing your old tracking system with Podlove’s?

For KonScience I can say (greetings from Mariëlle at this point, as she is the more advanced data guru of the two of us):

Podlove reports almost twice as many downloads as we measured with our old system.
About 40% of podcast clients & operating systems are unknown. If these were all bots of some kind, we’d have about as many “real” downloaders as we thought.
However, these unknown downloads have a very similar distribution of sources, contexts and assets to the known ones. This makes it less likely that they are bots. Could be Turing test winners, though!

ericteubert · February 15, 2015, 3:16am

That’s a huge difference! Would be interesting to compare and find out why.

Here’s the relevant excerpt on how we handle aggregation:

Before tracking data is presented in the analytics area, it is cleaned up. Cleanup involves the following steps:

Based on the UA analysis bots are filtered out.

Duplicate requests are filtered out. A request is considered a duplicate if it contains

the same File ID

and the same Request ID

and was made within the same hour

Pre-Release downloads are filtered out. They may happen if you test downloads before publishing the episode.

(Source: http://docs.podlove.org/guides/download-analytics/)

… which immediately raises the question of how much of a difference it would make if I changed “within the same hour” to “within 24 hours”. I will investigate.
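To get a feel for how much the window size matters, here is a minimal Python sketch of the duplicate rule quoted above. It is not Podlove’s actual implementation: the function name, the tuple layout and the sample rows are made up for illustration, and it assumes “within the same hour” means “falls into the same fixed time bucket”.

```python
from datetime import datetime, timezone


def dedupe(requests, window_hours=1):
    """Keep the first request per (file_id, request_id, time bucket).

    requests: iterable of (file_id, request_id, accessed_at) tuples,
    where accessed_at is a timezone-aware datetime.
    """
    seen = set()
    kept = []
    for file_id, request_id, accessed_at in requests:
        # Assumption: the window is a fixed bucket counted from the Unix
        # epoch, not a sliding interval relative to the previous request.
        bucket = int(accessed_at.timestamp() // (window_hours * 3600))
        key = (file_id, request_id, bucket)
        if key not in seen:
            seen.add(key)
            kept.append((file_id, request_id, accessed_at))
    return kept


# Made-up sample rows, not real tracking data.
UTC = timezone.utc
SAMPLE = [
    (1, "abc", datetime(2015, 2, 14, 7, 5, tzinfo=UTC)),
    (1, "abc", datetime(2015, 2, 14, 7, 50, tzinfo=UTC)),  # same hour
    (1, "abc", datetime(2015, 2, 14, 9, 10, tzinfo=UTC)),  # same day, later hour
    (1, "xyz", datetime(2015, 2, 14, 7, 20, tzinfo=UTC)),  # other request ID
]

hourly = dedupe(SAMPLE, window_hours=1)
daily = dedupe(SAMPLE, window_hours=24)
print(len(hourly), len(daily))
```

With these sample rows the hourly window keeps three requests and the 24-hour window keeps two, so a client that re-requests the same file every few hours would be counted several times per day under the hourly rule.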

Unlikely that they are all bots. I use the same UA (user agent) parser as Piwik, and it already has pretty decent bot detection. It just had a complete blind spot for podcast clients. That’s why I forked their library and started adding my own detection rules for popular clients. This is far from complete, because it’s a slow and tedious process. We have a crowd-sourced solution in mind but lack the manpower to execute it at the moment.

When building the system, I only had data from the Metaebene to work with. I kept adding rules until the “Unknown” podcast client stayed out of the top 10 most of the time. That’s why your report of such a high number surprises me. If you want to help, throw this against your database:

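-- Count cleaned download intents per user agent that has no recognized
-- client, most frequent first.
SELECT
	COUNT(ua.id) AS cnt, ua.*
FROM
	wp_podlove_downloadintentclean di
	JOIN wp_podlove_useragent ua ON ua.id = di.user_agent_id
WHERE client_name IS NULL
GROUP BY ua.id
ORDER BY cnt DESC;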

It looks for unknown clients and orders them by popularity.

katrinleinweber · February 15, 2015, 5:05am

OK, done! Easier than I thought. The top three are (cnt & user_agent):

7335 WWW-Mechanize/1.71
2821 Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)
1419 Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

Which export format do you need/want for the full dataset?

ericteubert · February 15, 2015, 6:41am

CSV is fine

bingbot and majesticbot should already be identified as bots.
Mechanize, technically, is a library, but should probably also be identified as a bot since it’s used for crawling websites.
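As a rough illustration of what UA-based bot filtering amounts to, here is a small Python sketch. It is not the rule set of the Piwik parser or of the Podlove fork; the patterns are hypothetical and only cover the three user agents named in this thread, and the non-bot sample string is invented.

```python
import re

# Hypothetical patterns, chosen only to match the user agents reported above.
BOT_PATTERNS = [
    re.compile(r"bingbot", re.IGNORECASE),
    re.compile(r"MJ12bot", re.IGNORECASE),
    re.compile(r"WWW-Mechanize", re.IGNORECASE),  # a library, but used for crawling
]


def is_bot(user_agent):
    """Return True if the user agent matches any known bot pattern."""
    return any(pattern.search(user_agent) for pattern in BOT_PATTERNS)


def split_counts(rows):
    """Sum download counts separately for bots and everything else.

    rows: iterable of (cnt, user_agent) pairs, as returned by the query above.
    """
    bots = sum(cnt for cnt, ua in rows if is_bot(ua))
    others = sum(cnt for cnt, ua in rows if not is_bot(ua))
    return bots, others


ROWS = [
    (7335, "WWW-Mechanize/1.71"),
    (2821, "Mozilla/5.0 (compatible; MJ12bot/v1.4.5)"),
    (1419, "Mozilla/5.0 (compatible; bingbot/2.0)"),
    (10, "SomePodcastApp/1.0"),  # invented non-bot entry for contrast
]

print(split_counts(ROWS))
```

Plain substring rules like these are easy to crowd-source, but they only catch agents that identify themselves honestly.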

katrinleinweber · February 15, 2015, 7:51am

OK, sent to the address you published on GitHub.
PS: Just a collection of related threads: GitHub #659, others to follow…

katrinleinweber · February 16, 2015, 1:58pm

Regarding the bots: Maybe a first step towards a user_agent database could be to crowdsource a robots.txt? Could be done in an EtherPad. Or could excluding (non-podcast related) bots from accessing the media files be a problem? Why filter them out of the data later, when they can be blocked from the start?

ericteubert · February 16, 2015, 2:48pm

A robots.txt doesn’t block bots, it merely asks them nicely to stay away. It’s their choice to follow the robots.txt rules or not.
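To make that concrete, here is a sketch using Python’s standard `urllib.robotparser`. The robots.txt content and the paths are only illustrative. The point is that the check runs inside the crawler: a well-behaved bot asks `can_fetch` before requesting a URL, and nothing on the server side stops a bot that never asks.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; the disallowed path is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /podlove/file/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def polite_crawler_may_fetch(user_agent, path):
    """What a rule-abiding crawler would conclude before making a request."""
    return parser.can_fetch(user_agent, path)


print(polite_crawler_may_fetch("MJ12bot", "/podlove/file/1/episode.mp3"))
print(polite_crawler_may_fetch("MJ12bot", "/feed/"))
```

A crawler that skips this check can still download the media file, which is why filtering the tracking data afterwards remains necessary.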

katrinleinweber · February 16, 2015, 6:22pm

True. Are they mostly being ignored, though? When I give it a try, I’ll report the findings here.

timpritlove · February 17, 2015, 7:47am

It would be nice if Mariëlle could

a) open an account here
b) use Podlove Tracking Parameters in her stats tool to make proper comparisons with our system

katrinleinweber · February 21, 2015, 6:55am

No success with disallowing /wp_content/uploads/ :-/
I queried again with
AND accessed_at >= (curdate() - INTERVAL DAYOFWEEK(curdate())+N DAY)
added to the WHERE clause of the query above. Result: WWW-Mechanize and MJ12bot are at the top again, along with Googlebot. Some less prominent ones are gone, though, Bingbot for example. Kudos to their engineers for teaching them respect.
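For anyone wondering which date that condition actually cuts off at, here is a small Python sketch of the same arithmetic. It relies on MySQL’s `DAYOFWEEK()` returning 1 for Sunday through 7 for Saturday; the function names are made up, and the values passed for N are only examples, since the post leaves N open.

```python
from datetime import date, timedelta


def mysql_dayofweek(d):
    """Mimic MySQL DAYOFWEEK(): 1 = Sunday, 2 = Monday, ..., 7 = Saturday."""
    return d.isoweekday() % 7 + 1


def cutoff(today, n):
    """Equivalent of curdate() - INTERVAL DAYOFWEEK(curdate())+N DAY."""
    return today - timedelta(days=mysql_dayofweek(today) + n)


today = date(2015, 2, 21)  # the date of this post, a Saturday
print(cutoff(today, 0))    # the Saturday one week earlier
print(cutoff(today, 7))    # one more week back
```

With N = 0 the cutoff is always a Saturday, either the most recent one before today or, if today is itself a Saturday, the one a week earlier. Each additional 7 in N moves it back by another week.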

I’ll try disallowing /podlove/file/ additionally during the next week.

katrinleinweber · October 31, 2015, 11:10am

I didn’t look further into the robots.txt approach, but wanted to report that after 1 month of running v2.3.0, AntennaPod has climbed to the top of the “unknown clients” list.