Big data’s multiple personality disorder

By R E J Saunders

Who are you? Be honest. Are you the person in the mirror, the person you post on social media, or a collection of personalities you choose to present to yourself and the world? You are never really a singular you, more a collection of personas that people see and online databases collect as you live your life. You may change from situation to situation, one person in private and another in public. You are as much the mask you wear on Instagram as you are the person offline, at least to datasets and algorithms. This raises the question: how can data ever be effective and trustworthy enough for us to build our lives on it?

The you that machine learning algorithms digest is always going to be an incomplete imprint of your actual self. Data by its very nature is out of date the moment it is captured, and while it may bear a partial relevance to who you are as a person, it is ultimately still a prism through which machine learning interprets you.

Every interaction online, be it through an app like Facebook or a search engine like Google, leaves digital fingerprints. As any archaeologist or CSI fan will tell you, you can only glean so much from evidence. The picture you build depends on how much you can contextualise a person from that evidence. This inherent fuzziness is the puzzle data scientists and AI programmers are desperate to solve, yet the sum of all their computations will forever butt up against the problem of context: you are more than the imprint you leave.

Because we all leave multiple versions of ourselves scattered across a plethora of databases, some of which, such as medical, insurance, and government records, we have no direct access to or control over, any algorithm that attempts to reconstruct us has to do so by walking through a hall of mirrors that distorts and twists who we are. These multiple personalities are not you, merely a version of you that gets reconstructed every time new data is fed into the algorithms.

The impact of this is profound. Cathy O’Neil’s Weapons of Math Destruction probed this question, highlighting how algorithmic injustice is intrinsic precisely because no database can ever truly know you. James Baldwin described race as a white person’s problem, arguing that white people deliberately constructed racial studies to prove their own superiority; this carries over into the database age, where the construction of identity is not a you problem, but a problem caused by the very people designing and using the databases. The fact that they want to bottle you into neat packages, resell you to the highest bidder, then rinse and repeat sums up the nature of the disorder.

Aside from burning down the whole edifice of the digital networked economy, there are two obvious solutions to the you dilemma. The first is to allow all databases to talk to each other and harvest unfettered data about us, creating a better synthetic approximation of who we are. This is the route many companies are taking, harvesting vast quantities of our freely given data to better serve their own needs.

The second option is digital literacy: educating ourselves, our peers, and our children on the realities of lives built and lived in data. The price of free access is the very essence of who we are, and for many of us that price is getting progressively steeper. There is no such thing as unfettered access; everything comes at a cost, and by educating ourselves on the true cost of “free” access to information and services we can stop the worst excesses before they happen. Education takes time, effort, and a willingness to engage, and as databases harvest ever more parts of ourselves, their version of you becomes ever more distorted.

There are other options available: regulations, laws, trust busting, boycotts, data access requests, opting out of harvesting when given the chance, blocking cookies, blocking adverts, paying for news content from reputable sources, refusing to engage on social media, using a web browser and search engine that do not harvest your data. All those options require active engagement with the problem and a degree of sacrifice that many cannot make due to personal circumstances. It is a privilege to be able to circumvent the “free” internet, especially if you cannot afford, or do not have access to, alternatives that are not driven by ads and algorithms. Boycotts, pressure groups, unions, political activism, and grassroots education hold the keys to effectively combating the atomization of you into an ever more dystopian version of who we are.

As a society we have to acknowledge that databases are always going to be flawed, not just because they are created by humans based on humans, but also because they will only ever be fragments of a greater whole. No amount of manipulation or algorithmic alchemy will transmute a dataset, no matter how large, into a gold-plated version of you. Ever since the Sumerians wrote on clay tablets, humanity has been trying to tabulate and recreate lives in data, and each iteration of technology has edged closer to a version of the truth. But it is still a hall of mirrors, subjective and biased towards the needs of the programmer, not those whose lives are constructed within the data.

In educating ourselves, in reframing the data problem as a multitude of yous, and in acknowledging the inherent flaws of databases, we as a society can better protect all citizens and harness technologies that benefit everyone, not just those at the centres of power and money. A digital utopia for some is fast becoming a dystopia for many, and without remedy the distorted you will become the only you that matters.

Writer, researcher, and generally curious
