Tuesday, August 10, 2010

Can we count users without uniquely identifying them?

Aaaah
Hi all. I'm just back from a rather nice holiday. Well, technically, I'm still on holiday, but there were a few things I wanted to take care of, so I popped in for a few hours of work yesterday and today. I saw a post on Phoronix that prompted me to write something I've been meaning to write for the last few weeks; since the Canonical Platform Team got together in Prague three weeks ago, to be exact.

Pre-installed desktops ftw
One of the roles of Canonical relative to Ubuntu is to get Ubuntu pre-installed on as many computers as possible. This is one of the dreams of the Linux desktop. Pre-installs mean end users don't have to fiddle with configuration, install drivers, and so on (at least when done well), and users can make an apples-to-apples comparison between their free desktop and the proprietary systems that normally come pre-installed.

Canonical does this by working with OEM customers. OEMs are companies that sell assembled computers to people. One of these customers asked Canonical if there was some way to know how many of the computers they ship with Ubuntu keep Ubuntu installed. The customer's engineer came up with a system where they would create a unique identifier for each Ubuntu computer they sold, and then when a computer requested update info daily, it would send that unique identifier along with the request.

The customer didn't really want to use a unique identifier, though. Even though the identifier was anonymous, the customer wanted to *count* computers, and unique identifiers are for *tracking* (following a user over time). We mulled it over and over, and finally, based on our experience with web browsers, we hit upon a system of non-unique channel identifiers to do the counting. This would make tracking impossible, but of course tracking is not the goal; counting is.

Non-unique channel identifiers
So, we flashed on this: if each install sent just the model name and the number of times it has updated, systems could be counted, but no unique data would ever be sent to the server. Now, I am not a mathematician, so each time I try to explain why I think this works, it takes me a while. But in the end, everyone is convinced. In fact, Matt Zimmerman ended up writing a test program to prove to himself that it worked. Let me try, stick with me here ...

Every day, each computer from the customer sends its model name and the number of times it has already sent this data to the server. So if a model of computer is called, say, "foo", the first day it sends "foo" and 0 to census.canonical.com. After sending the 0, the computer remembers that it already sent a 0, so it will send a 1 next time. When the server sees the foo.0 in the log data, it essentially starts a new counter for the model foo. The total number of foo.0 pings is the total number of foo computers ever activated.
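
Here is a minimal sketch of the client side in Python. The names `CensusClient` and `record` are made up for illustration, and the rule that the counter only advances after a successful send is an assumption; this is not the actual client code.

```python
from typing import Callable


class CensusClient:
    """Hypothetical client: reports a model name plus a ping counter."""

    def __init__(self, model: str, send: Callable[[str, int], bool]) -> None:
        self.model = model
        self.count = 0  # number of pings already delivered
        self._send = send  # assumed transport; returns True on success

    def ping(self) -> bool:
        """Send (model, count); advance the counter only if delivery worked."""
        delivered = self._send(self.model, self.count)
        if delivered:
            self.count += 1
        return delivered


log = []


def record(model: str, count: int) -> bool:
    """Stand-in for the real server: just remember what was sent."""
    log.append((model, count))
    return True


client = CensusClient("foo", record)
client.ping()  # day 1 sends foo.0
client.ping()  # day 2 sends foo.1
print(log)  # [('foo', 0), ('foo', 1)]
```

Note that nothing in the message distinguishes this machine from any other foo machine that has pinged the same number of times.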

Take one of those foo computers. The next day it will send foo.1, saying "this is a computer of model foo, and this is the 2nd time it has pinged that it's alive". Notice that neither foo nor the number 1 is unique data. Any number of computers will be reporting the exact same model name and increment number. When the server sees a 1 come in, it finds the first counter at 0 and increments that counter to 1. Now it knows the total number of computers ever activated (all the counters), and it can count all the counters that were incremented in a day and thereby know how many computers were online that day.
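
The server-side bookkeeping can be sketched like this. It is only an illustration under assumptions: `CensusServer` and its method names are hypothetical, and it is not the test program Matt wrote. For each model it keeps a tally of how many anonymous counters sit at each value, and it assumes a machine that misses a day simply resends its unadvanced counter the next time it is online.

```python
from collections import Counter


class CensusServer:
    """Hypothetical server: per model, how many machines sit at each count."""

    def __init__(self) -> None:
        self.counters = {}  # model -> Counter{last reported count: machines}
        self.seen_today = Counter()  # model -> pings received today

    def receive(self, model: str, count: int) -> None:
        per_model = self.counters.setdefault(model, Counter())
        if count > 0:
            # Move one anonymous counter from count-1 up to count.
            if per_model[count - 1] == 0:
                raise ValueError("no counter at %d for %s" % (count - 1, model))
            per_model[count - 1] -= 1
        per_model[count] += 1  # count == 0 starts a brand-new counter
        self.seen_today[model] += 1

    def total_activated(self, model: str) -> int:
        """Every counter ever started for this model."""
        return sum(self.counters.get(model, Counter()).values())

    def end_of_day(self, model: str) -> int:
        """Return today's active count for the model and reset it."""
        return self.seen_today.pop(model, 0)


server = CensusServer()
# Three foo machines; one of them is offline on the third day.
days = [[0, 0, 0], [1, 1, 1], [2, 2], [3, 2, 3]]
active = []
for pings in days:
    for count in pings:
        server.receive("foo", count)
    active.append(server.end_of_day("foo"))
print(active, server.total_activated("foo"))  # [3, 3, 2, 3] 3
```

The machine that skipped a day is not lost: its counter just stays one behind the others, and the total number of counters never drops.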

Future?
Currently this system is only slated to be used by the specific OEM customer who requested it, and it will be up to the customer to disclose the data they collect as they wish. I wonder if it would be a good thing to install on normal ISOs though, but this would be part of our normal participatory community decision-making process. Projects like this make me think that users would like to be counted, so long as they can't be tracked. We'll see how it plays out; it may be something to discuss at UDS if the community feels the data would be useful.

22 comments:

  1. Hmmm... so if a computer fails to send its number one day, it's lost forever:

    Day 1:
    - computer #1 sends 0
    - computer #2 sends 0
    - computer #3 sends 0
    Server has 3*0 -> 3 computers are active.

    Day 2:
    - computer #1 sends 1
    - computer #2 sends 1
    - computer #3 sends 1
    Server has 3*1 -> 3 computers are active.

    Day 3:
    - computer #1 sends 2
    - computer #2 is off-line
    - computer #3 sends 2
    Server has 1*1 and 2*2 -> 2 computers are active.

    Day 4:
    - computer #1 sends 3
    - computer #2 sends 2
    - computer #3 sends 3
    Server has 1*2 and 2*3 -> 2 computers are active.

    Day 5:
    - computer #1 is off-line
    - computer #2 sends 3
    - computer #3 sends 4
    Server has 2*3 and 1*4 -> 1 computer is active.

    etc...

    What am I missing here?

  2. Privacy is hard, guys.
    One cannot be sure that you don't collect IP addresses. So say I am the only one with foo.434 and so far I have always had the same IP. Now I visit one of my secret lovers (I have many) and log on from her place. Now you know that I have probably moved my laptop.

    Bottom line:
    This is unacceptable and has to be opt-in.

  3. Anyway, why don't you just count the number of connections of type "foo" for each day (without the counter)?

  4. Even if this were opt-out, the numbers would not be accurate if a significant number of users deactivated it.
    What about others who, for some reason (e.g. to discourage OEMs from doing such counting in the future), send such beacons from random computers (not from that vendor) to skew the results?

  5. @Tom
    But you'll never be the only one with foo.434. There are literally millions of Ubuntu users; it's a fair bet that there's at least a half-dozen other foo.434's, unless you have some kind of crazy rare laptop.
    And shouldn't you be more worried about, say, your email provider? They can uniquely identify you (Since chances are you're the only one that uses your email account), and it's a known fact that most of them keep logs of IP addresses. Same goes for almost any website that you sign into. An Ubuntu counter system is the least of your worries.

  6. Dieki:
    There are literally millions of users? How do you know that? Up till this discussion about how to count users, there's been no actual _counting_ of users. The reality is no one knows how many Ubuntu users are out there. Your _millions_ is completely pulled out of thin air and is a faith-based estimate.

    Beyond that, you are not taking into account the time distribution of how Ubuntu systems are installed. If you install in the post-release rush, sure, you are probably somewhat anonymous in the numbers for a period of time. But if you install 1 month, 2 months, 3 months out from release, can you be sure that your low counter is not unique? And these late system activations are exactly how OEM installs would trickle in.

    -jef

  7. Jef:

    http://ostatic.com/blog/canonical-announces-12-million-ubuntu-users-google-makes-a-comeback

    Please tell me how installing later in the release cycle somehow magically makes the counter in any way more specific to you. People don't update all at the same time, nor do OEMs that ship an OS all ship it at the same time. Dell is only more recently moving to 10.04 even, meaning they've been selling 9.10 systems for quite some time.

  8. ModplanMan:

    Ah, thanks for the link. I've been pushing people for _any_ description of how the counting is happening for something like 2 years. That article is the first public statement I've seen that attempts to describe how it's done. Much appreciated. Now to see if they will publish the algorithm used to boil down the number, and to get specifics about the time window over which unique IP addresses are compiled.

    -jef

  9. @Julien Read closely. They are just counting the connections of type "foo" per day. The numbers appear to be there so that they can figure out on how many days the computers are used.

  10. I think the best solution would be to send MAC addresses to a census server (maybe even a hash of [MAC + BIOS info] - because some people would surely spoof their MACs). And this should be built into the standard ISO. Why should a user be bothered if his computer would send this data with the sole purpose of knowing how many Ubuntu users are out there? I don't think this invades the user's privacy at all...

  11. @Radu, I would be bothered if that same data could serve other purposes like tracking my location without my explicit approval.

  12. You don't need to be a mathmo to explain how it works, instead just reframe it.

    Each computer sends a unique ID to the server, then you count the unique IDs.

    foo, bar, baz, etc.

    But you don't want them to persist, so each time you send a unique ID, you generate a new one and throw away the old. You need to link them together, so the server gets sent both the old and new IDs.

    Now each request the server can pull up the current unique ID, and replace the record with the new one, and so on.

    It doesn't actually matter if the IDs are unique, just as long as the server replaces one with the other and leaves the second record alone.

    And since it doesn't matter, it doesn't matter if you use a complex algorithm or a counter. In fact, a counter is better, since then the server can infer the next ID itself and you only need to send your current counter to the server and increment.

    Obviously you don't increment until you're sure the server got the count, otherwise you'd leave gaps and create odd artifacts in the data.

  13. See "Tom"'s comment above for my objections.

    I love Ubuntu, but if I should read sometime in the future that Canonical has supplied a system like the one described in this blog (except when it is opt-in), I will make it a point of honor to never use it again.

    Come on, surely you have marketing people among your staff? They should already be crying uncle about all the bad publicity this will get you, regardless of whether it works as you describe or not.

  14. What I think would be cool is if all OEM installs had this, and then maybe Canonical could release stats like:
    33% of OEM installs come from Dell
    33% of OEM installs come from System76
    34% of OEM installs come from ZaReason

    Or ya know...whatever it actually is. Then we might see them start vying to be the OEM selling the most Ubuntu machines. While I suspect the latter two probably are already trying that, it'd be incentive for Dell to try to have their Linux sales outstrip the smaller OEMs' Linux sales, and so maybe then they'd start actually advertising the existence of their Linux machines.

  15. Mackenzie,
    Why do you believe that Canonical is in a contractual business relationship with all three of those OEMs at the moment? And why do you believe that Dell's sales aren't outstripping the niche OEMs already?

    -jef

  16. @Jef:
    I didn't say they were. I said "would be cool if" -- as in, I'd like a way to see the stats between them. And I was just giving them each 1/3 ;-) But given that Dell hides Ubuntu at a URL you need to have memorized already (not listing it as an option with their usual stuff), I doubt Linux is selling too well there.

  17. Mackenzie:
    I'm sure Dell sells more than its fair share of _linux_ systems when you look at its full line of products including servers and mobile devices. And I would imagine the _linux_ based Streak will sell its fair share as well even though it's not currently on the homepage. I wonder how many of the _linux_ based unlocked Nokia N900s Dell has sold to date. Wouldn't it be fascinating if they have sold more N900s than System76 has sold netbooks?

    -jef

  18. Bounce the counts through an anonymizing service, and tracking the source IP of the requests becomes a non-issue. I'm sure Anonymizer would be happy to take Canonical's money to do this (disclaimer: I am a former employee of Anonymizer). Or the tor network would be fine, though I'd guess they'd like a few more nodes added for this.

    --- Mad scientist idea disclaimer ---

    There's still a signature, and it's important. If there is only one "Bob's Internet Terminal", then it should *not* send its count every day, as this is highly trackable. However, it can send an "i386 machine" count every day. If you feed back a score to the program that indicates how large the crowd it has just claimed to be a part of is, it can add more info. It would go something like:

    client->: hi I am a machine, my counter is 0
    server->: thank you. You are in a MASSIVE group

    client waits 24 hours

    client->: hi I am a Dell+OEM installed, my counter is 1
    server->: thank you. You are in a LARGE group

    client waits 24 hours

    client->: hi I am a Dell mini10n OEM installed, my counter is 2
    server->: thank you. You are in a MEDIUM group

    client will then send Dell mini10n as long as it gets back MEDIUM

    scenario 2:

    .0 is repeated as above

    client->: Hi I am a Generic Ubuntu Box, my counter is 1
    server->: Thank you, you are in a MASSIVE group

    24 hrs.

    client->: Hi I am a Bob's Super Crazy Unique machine, my counter is 2
    server->: Thank you, you are in a TINY group

    24 hrs.

    client->: Hi I am a Generic Ubuntu Box...

    and that would continue for a *random* length time of at least 90 days before it feeds back its model string again.

    This is highly open to abuse, so you can fight that with random challenge and response to aid in at least keeping the abusers honest. Basically, when you get the .0, you feed them back a token that the client should keep. Then in a tiny sampling every day you say "Hey can I get back the token I gave you?" The client will only feed back the token once per year, so there's no chance of the server being able to "track" the user, but the server should have a reasonable chance of getting back valid tokens, and ONLY getting back the tokens it fed to people once per year.

    The token might embed the date it was given, so any abusers will be limited to messing with numbers closer to 0, rather than closer to 365, because they'll have to *wait* all year to get those numbers screwed up. If there are abnormal rates of declined token response, then most likely these are abusers and can be removed statistically.

    If people are worried about the sample size of the tokens, I suggest that the community runs and audits this service to ensure that it is not being tampered with.

    One thing that worries me is that this token could actually be recovered by other means and then tied to the responses, but again, if you anonymize the IP, it's pretty hard to do anything with that other than say that yes, this computer's OS was in fact first booted on day X.

  19. Model numbers? Isn't that why UUIDs were created? Unique identifier, and anonymous:

    https://secure.wikimedia.org/wikipedia/en/wiki/UUID

  20. I might be a black sheep in this discussion, but here it goes...

    It would be great if there was a way to get accurate statistics on the, let's say, number of Ubuntu installations for a specific country (geoip). This would help tremendously our LoCo efforts as we are now in the dark.

  21. Jef:
    I don't count phones on par with laptops & desktops. Most people don't know or care what's on their phone, as long as it makes calls.

  22. Isn't this system over-complicating counting the number of machines? Why not use a system like:

    (0 is an activation ping, 1 is a post-activation ping)
    Day 0:
    computer #1 sends 0
    computer #2 sends 0
    server count: day 0, 0 normal, 2 activation, 2 all-time

    Day 1:
    computer #1 sends 1
    computer #2 is offline
    computer #3 sends 0
    server count: day 1, 1 normal, 1 activation, 3 all-time

    Day 2:
    computer #1 sends 1
    computer #2 sends 1
    computer #3 sends 1
    computer #4 sends 0
    server count: day 2, 3 normal, 1 activation, 4 all-time

    Day 3:
    computer #1 is offline
    computer #2 is offline
    computer #3 sends 1
    computer #4 sends 1
    computer #5 sends 0
    computer #6 sends 0
    server count: day 3, 2 normal, 2 activation, 6 all-time

    The counting method is simple and obvious (it can even include model info if it's really necessary). To get the number of all-time activations, the server keeps track of the 0's, or activation pings. For each day, the server can track the number of activations (0's), and also the number of normal users active on that day.

    This method is not only simpler than the suggested method, it also solves the privacy problems (barring IP address logging, of course). Because the systems do not send a "day #5301" ping, they are not uniquely identifiable. After its "day 0" ping, a system only sends an "I'm alive" 1 ping.

    I'd love to hear feedback on this idea.

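
For comparison, the two-value scheme proposed in the last comment (0 for an activation ping, 1 for every later ping) can be sketched as below. The function name `tally` and the list-of-lists input format are assumptions made for illustration; the input data is the four-day example from that comment.

```python
from collections import Counter


def tally(pings_by_day):
    """Summarise daily pings, where 0 is an activation and 1 is 'alive'.

    Returns one (day, normal, activation, all_time) tuple per day.
    """
    all_time = 0
    report = []
    for day, pings in enumerate(pings_by_day):
        counts = Counter(pings)
        all_time += counts[0]  # only activation pings grow the total
        report.append((day, counts[1], counts[0], all_time))
    return report


days = [
    [0, 0],        # day 0: computers #1 and #2 activate
    [1, 0],        # day 1: #1 alive, #2 offline, #3 activates
    [1, 1, 1, 0],  # day 2: #1-#3 alive, #4 activates
    [1, 1, 0, 0],  # day 3: #3 and #4 alive, #5 and #6 activate
]
for row in tally(days):
    print("day %d, %d normal, %d activation, %d all-time" % row)
```

This gives the daily active count and the all-time activation count, but unlike the per-machine counter it cannot tell how many days each machine has been in use.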