pydenticon Files · docs/privacy.rst

branko
PYD-2: Added privacy chapter and release notes.
Privacy
=======

It is fundamentally important to understand the privacy issues involved when
using Pydenticon to generate uniquely identifiable avatars for users who leave
comments and similar content.

The most common way to expose identicons is to have a web application generate
them on the fly from data passed to it through HTTP GET requests. Those GET
requests commonly include either the raw data, or the data as a hex string that
is then used to generate an identicon. The URLs for the GET requests are most
commonly placed in image tags in an HTML page.

The data passed needs to be unique in order to generate distinct identicons. In
most cases the data used will be either the name or the e-mail address that the
visitor posting the comment fills in. E-mail addresses usually provide a much
better identifier than names (especially if the website verifies comments by
sending out e-mails).

Needless to say, in such cases, especially if the website where the comments are
being posted is public, using raw data can completely reveal the identity of the
user. If e-mail addresses are used for generating the identicons, the situation
is even worse, since those addresses can then be easily harvested for spam
purposes. Using e-mail addresses also provides data mining companies with a much
more reliable user identifier that can be coupled with information from other
websites.

Therefore, it is highly recommended to pass the data to the web application that
generates the identicons using **hex digest only**. I.e. **never** pass the raw
data.
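
As a minimal sketch of this recommendation, the snippet below builds an
identicon URL that carries only a hex digest. The ``identicon_url`` helper, the
``https://example.com/identicon`` endpoint, and the ``digest`` query parameter
are hypothetical names chosen for illustration, and SHA-256 is just one possible
hash function:

```python
import hashlib
from urllib.parse import urlencode


def identicon_url(identifier: str,
                  base_url: str = "https://example.com/identicon") -> str:
    """Return a URL that carries only a hex digest of the identifier.

    The endpoint and the "digest" parameter name are assumptions made for
    this example; they are not part of Pydenticon.
    """
    # Normalise so that differences in case or surrounding whitespace do not
    # produce different identicons for the same address.
    normalised = identifier.strip().lower()
    digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
    return base_url + "?" + urlencode({"digest": digest})


url = identicon_url(" Alice@Example.com")
# The raw e-mail address never appears in the URL, only its digest does.
```

The resulting URL could then be used as the ``src`` of an image tag. Note that
an unsalted digest like this one hides the raw value from casual readers only.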

Although passing a hash instead of the real data as part of the GET request is a
good step forward, it can still cause problems, since the hashes can be
collected and then used in conjunction with rainbow tables to identify the
original data. This is particularly problematic when using hex digests of e-mail
addresses as data for generating the identicon.
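
To illustrate why unsalted digests are weak, the sketch below recovers an
e-mail address from its digest using a plain precomputed lookup table. This is
a simplification: a real rainbow table uses hash chains to save space, but the
effect for the attacker is the same. The addresses and the helper names
``unsalted_digest`` and ``build_lookup`` are made up for the example:

```python
import hashlib


def unsalted_digest(email: str) -> str:
    """Hex digest of the e-mail address with no salt applied."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()


def build_lookup(candidates):
    """Precompute a digest -> e-mail table from a list of candidate addresses."""
    return {unsalted_digest(candidate): candidate for candidate in candidates}


# A digest harvested from an image URL on a public page.
harvested = unsalted_digest("alice@example.com")

# The attacker hashes every address from some leaked or guessed list once...
lookup = build_lookup([
    "bob@example.com",
    "alice@example.com",
    "carol@example.com",
])

# ...and afterwards each harvested digest is a single dictionary lookup away
# from the original address (None if the address was not among the candidates).
recovered = lookup.get(harvested)
```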

There are two feasible approaches to resolve this:

* Always apply a *salt* to user-identifiable data before calculating the hex
  digest. This greatly reduces the effectiveness of attacks based on
  precomputed rainbow tables (although it does not eliminate the risk
  completely).
* Instead of hashing the user-identifiable data itself, generate some random
  data the first time an identicon is needed for it, hash that random data, and
  store the resulting digest for future use (cache it), linked to the original
  data it was generated for. This way the hex digest placed in image links in
  HTML pages is not derived in any way from the original data, and therefore
  cannot be used to reveal what the original data was.
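
Both approaches can be sketched roughly as follows. All names here
(``SITE_SALT``, ``salted_digest``, ``RandomDigestStore``) are hypothetical, the
salt is applied via HMAC as one possible way of salting (simple concatenation
before hashing would also match the description above), and the in-memory
dictionary stands in for whatever persistent storage the website uses:

```python
import hashlib
import hmac
import secrets

# Hypothetical site-wide secret salt. It is generated here only so that the
# example runs; a real site would load a persistent secret from its
# configuration, since a new salt changes every identicon.
SITE_SALT = secrets.token_bytes(32)


def salted_digest(identifier: str, salt: bytes = SITE_SALT) -> str:
    """Approach 1: hex digest of the identifier combined with a secret salt."""
    normalised = identifier.strip().lower().encode("utf-8")
    return hmac.new(salt, normalised, hashlib.sha256).hexdigest()


class RandomDigestStore:
    """Approach 2: digests derived from random data, cached per identifier."""

    def __init__(self):
        # Maps the normalised identifier to its stored digest. A real site
        # would keep this mapping in its database.
        self._digests = {}

    def digest_for(self, identifier: str) -> str:
        key = identifier.strip().lower()
        if key not in self._digests:
            # The digest is computed from fresh random bytes, so it carries
            # no information about the identifier itself.
            random_data = secrets.token_bytes(32)
            self._digests[key] = hashlib.sha256(random_data).hexdigest()
        return self._digests[key]


store = RandomDigestStore()
first = store.digest_for("alice@example.com")
second = store.digest_for("alice@example.com")  # same digest, read from cache
```

With the second approach, the same user keeps the same identicon only as long
as the stored mapping is kept.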

Keep in mind that using identicons will inevitably still allow people to track
someone's posts across your website. Identicons will effectively automatically
create pseudonyms for people posting on your website. If that may pose a
problem, it might be better not to use identicons at all.

Finally, a short summary of the points explained above:

* Always use hex digests in order to retrieve an identicon from a server.
* Instead of using privately identifiable data for generating the hex digest,
  use randomly generated data, and associate it with the privately identifiable
  data. This way the hex digest cannot be traced back to the original data
  through brute force or rainbow tables.
* If unwilling to generate and store random data, at least make sure to use a
  salt when hashing privately identifiable data.