Computer Scientist: Cambridge Analytica Are Dirty Tricksters With An Overhyped Data Operation
It appears that their analytics product is overhyped and their main service may be dirty tricks — which deserves scrutiny
By now, you’ve probably heard all about the “manic pixie Canadian tech bro” with bizarre far-right ties and a tool capable of stealing all your and your friends’ Facebook data to predict your every move. According to all the animated interviews with Christopher Wylie, Cambridge Analytica and a cluster of other shady companies like it have a product for mining all your social media data and predicting exactly what ad will get you to vote for whomever they want, essentially brainwashing any tech-savvy electorate. It’s a compelling story that sounds like something out of a Black Mirror episode. It’s also scientifically and technologically implausible.
Yes, it sounds very impressive when you hear that 50 million profiles were siphoned off from Facebook and, combined with other sources, used to create an immense database covering every voter in America. Certainly, after running all those profiles through some sophisticated algorithms, the programmers involved should have a good idea of who you are and what you’re like. But there’s a huge caveat. All the data and algorithms in question would be fairly arbitrary, depending on what Facebook provided and on the personal hunches of the system’s designers. It’s practically a certainty that CA knew this, which is why its data product was a lure to attract clients to whom it could sell its underhanded tricks. We can see that when we consider the details of how such a system would have to work and what happens when its users and creators are asked to explain how it performs in the real world.
The False Hype Over Cambridge Analytica’s Data
Imagine if you will that yours truly, who works on data analysis on a regular basis, evaluated your profile and decided that because you frequently use the word “safe” or the phrase “stay safe” in your posts and comments, you must have some degree of neuroticism and are concerned about safety. With tens of millions of records to analyze, it would be highly impractical to review every user’s profile and check whether my conclusion was right. You may keep posting safety tips and tell everyone to stay safe while they’re traveling just because that’s your personality. Or you may be a salesperson who only uses social media in a professional capacity to talk about safes and other safety devices. Unless I have a few hundred researchers working to confirm which one it is, I’ll just have to hope that the program’s conclusions are within an acceptable margin of error.
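To make the thought experiment concrete, here’s a minimal sketch of that keyword-counting approach. The trait lexicon below is entirely made up for illustration; a real system would use validated psychometric word lists, but it would share the same core weakness — the counter can’t tell an anxious poster from a safe salesperson.

```python
import re
from collections import Counter

# Hypothetical trait lexicon: words assumed (not proven) to signal a trait.
TRAIT_KEYWORDS = {
    "neuroticism": {"safe", "safety", "worried", "anxious", "careful"},
}

def trait_scores(posts):
    """Score traits by how often a user's posts contain lexicon words.

    This is the naive keyword-counting approach described above: it has
    no way to check WHY someone keeps saying "safe."
    """
    words = Counter(
        w for post in posts for w in re.findall(r"[a-z']+", post.lower())
    )
    total = sum(words.values()) or 1
    return {
        trait: sum(words[w] for w in vocab) / total
        for trait, vocab in TRAIT_KEYWORDS.items()
    }

anxious = ["Stay safe out there!", "I get anxious when friends travel."]
seller = ["New safes in stock.", "Our safety products keep your safe safe."]

# The salesperson scores at least as "neurotic" as the genuinely anxious user.
print(trait_scores(anxious), trait_scores(seller))
```

Without humans reviewing profiles, both users land in the same bucket for entirely different reasons.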
And that’s not the end of the potential problems. Just knowing that you use the word “safe” a lot isn’t enough to do much. I’d need to keep looking for keywords that represent political topics, then calculate your general tone by grading the adjectives and modifiers you use for positivity or negativity to try to guess how you feel about those topics. This is known as sentiment analysis in the computer world, and while it’s very widely used and accepted as a valid technique, it can be extremely misleading. Even with the best neural networks employed for this task, it can be difficult to tell whether you’re very angry and agitated about, say, immigrants, or angry at someone who showed up to bash immigrants, if the words being analyzed for sentiment are too close together to tell the difference.
Sentiment analysis needs a large data set which catches you both discussing a topic of interest and arguing about it. If the only samples of how you feel about immigrants are you quoting, mocking, and arguing with a random xenophobe, the score showing your positivity or negativity may be a wash, or indicate that you also feel very negatively about immigrants. But if you often post positive links about immigration or say nice things about immigrants in your community, the cumulative score will shift into positive territory. In other words, you can have extremely strong opinions about crucial political topics, but if you’re trying your best to keep your politics off social media, or argue primarily with bitter sarcasm and anger, the code could classify you as not caring about those topics, as holding the opposite of your actual opinion, or just mark you as an apolitical dead end for targeted messaging.
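The pitfall is easy to see with the simplest form of the technique, a lexicon-based scorer. The word lists and sample posts below are invented for illustration; production systems use trained models, but the cumulative-score problem carries over.

```python
# Toy sentiment lexicons, purely illustrative — not a real word list.
POSITIVE = {"welcome", "great", "love", "enrich", "nice"}
NEGATIVE = {"hate", "invade", "criminals", "angry", "awful"}

def sentiment(text):
    """Return (positive - negative) word counts; the sign is the predicted tone."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Quoting a xenophobe in order to argue with him reads as negative on the topic:
quote = "so you think immigrants are criminals who invade and you hate them? awful take"
# A genuinely positive post shifts the cumulative score back, but not enough:
praise = "immigrants enrich our town and we love our new neighbors"

print(sentiment(quote), sentiment(praise), sentiment(quote) + sentiment(praise))
```

Here the pro-immigrant user nets out as negative on immigration overall, because the scorer can’t tell quoted or sarcastic words from sincere ones.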
Likewise, I could also test whatever other hypothesis catches my fancy. Do the average colors of the pictures you upload have any correlation to the posts or personality types of users? Is there anything about the size of one’s friends list? Is there something to the combination of personality type and average rate and type of engagement with friends’ posts? I can mine all this data in countless ways to come up with classifications and correlations which I think might be important. But it’s a given that some of these correlations will be extremely weak and there will be outliers, and there’s no practical way that I could make unique ads tailored to each of the tens of millions of individuals in the database. What all these impressive calculations will end up doing is just p-hacking their way to something that looks scientific but really isn’t.
Even worse, what if the you online bears little resemblance to the real you? We all know that people generally try to present the face they want the web to see on social media, and post content is no exception. What if the racist troll who lights up our dashboard as a neo-Nazi is actually just trying to be “edgy” and his belief structure is virtually non-existent offline? Or what if the passionate social justice warrior on Twitter actually thinks that white supremacists might be right about other races having lower intelligence, and is just using social justice activism to appeal to certain people with whom he wants to ingratiate himself? False positives like this are a certainty in large data sets.
Knowing this, I’d do what I always do when working with large, arbitrary data and start classifying people in relatively broad categories which would be both more actionable for the advertising team, and smooth out the outliers in the processed results. While putting users in slots, I could take the confidence level of the decisions the algorithms make and create a bell curve within each category, discarding extreme outliers and focusing on those whose scores fall in the middle 80% for targeted messaging. Though notice, we just went from talking about what you personally may be posting on social media to what a few hundred thousand people like you are posting, allocated in buckets with a score calculated by an artificial neural network, then smoothing out the data to discard the inevitable false positives and weak correlations.
This is a lot more realistic than the hype you’ve probably heard and sounds a lot like the typical marketing focus groups and market segmentation practices of every major corporation. It also makes sense because while each person is a unique individual with a unique backstory, we all have group identities and know we can find a lot of others with similar interests, beliefs, concerns, and even similar backgrounds, broadly speaking. Trying to describe us will often strip away individual details and backstory to focus on what defines us from a very high level, and without those individual details, groups of millions can be summed up as if they were one person. In marketing, this is known as building a model customer, and this is exactly what Facebook delivers to advertisers.
The power of using data to classify us as part of a certain group is that while you may be in denial that you act most like, say, young urban gentrifiers, your posts, shopping habits, and favorite locations line up neatly with the group you disdain, and the algorithms can show that very clearly. When those of us fluent in working with large data sets say that we can write code that knows you better than you know yourself, one part of it is self-promotional hype. The other part is understanding that human episodic memory and self-image can often ignore the broader picture: the person in question may not realize that he or she is behaving a certain way, while patterns and the big picture of that behavior are what the code truly cares about.
However, as we just saw, the code will have to deal in generalities for every person in the database it’s analyzing to avoid throwing out a mess filled with bad classifications, radical outliers, and barely extant correlations to random calculated data points the programmer created to test his or her hypotheses. So in other words, Wylie’s big claims of being able to nano-target voters with individualized ads are mostly self-hype. If he really managed to develop the technology that can live up to his claims, there’s absolutely no evidence of it being used. Instead, we just have proof of the Trump campaign using the exact same targeting and segmentation tactics as any company marketing its wares and services on social media, showing a certain version of an ad to relatively broadly targeted groups of people.
Now, I’m certainly not implying that Wylie didn’t build something he and his bosses saw as incredibly important. What I am saying, however, is that there’s no evidence that it does what he claims it does. Even if it exists and lives up to the hype he’s trying to sell, it would’ve had to solve problems which are still very much open questions in the world of computer science, then use the resulting algorithms to do its magic without a trace. And when you’re targeting a group like, for example, 137,432 users classified as “very politically active gun owners,” generating 137,432 versions of the same ad would be very financially and psychologically wasteful. There’s no indication in any literature on the subject that this level of individualization has any real effect on one’s buying or voting habits.
For the record, I reached out to Wylie asking if he would be willing to discuss some of the specifics of his work and disclosed my relevant credentials to be fair, but as of this writing, I have not received a response.
The Real Product
Trump won by roughly 80,000 votes in three key states. In one investigative video for Channel 4, representatives of Cambridge Analytica claimed that the number was actually half that, and heavily implied that their data product somehow had a role in eking out this victory. They had to imply and hint because saying their data operation was responsible would be a very shaky claim. First of all, well known GOP voter suppression efforts through red tape at the polls and during registration can account for Trump’s win in Wisconsin. In Michigan and Pennsylvania, third-party candidates sealed the deal for him. Unless the data tools they had were Prophecy Orbs from the Harry Potter universe, the likelihood they could reliably deduce this series of events only from scraped Facebook profiles and ad engagement is extremely slim.
But don’t just take my word for it. Aides and former CA employees were very underwhelmed by the actual data product and played down its effects on real campaigns, wondering aloud whether there was really any merit to it from a scientific standpoint when pressed for detail by reporters. Even Alexander Nix, the company’s CEO, carefully hedged his words when it came to the efficacy of its data operation, saying that it was just one ingredient, in stark contrast to the borderline excessive number of times CA’s website uses the words “data” and “data-driven” to describe its work. Their tagline is “data drives all that we do,” for crying out loud. Yet their data product doesn’t have a name, and their representatives shy away from making it front and center in their sales pitch, further validating informed skepticism of their data operation.
In effect, Cambridge Analytica and Wylie may be trying to take credit for a fluke in order to land their next big job, using survivorship bias in their respective sales pitches. This leaves the all-important question of why anyone would hire Cambridge Analytica in the first place. According to the same Channel 4 investigation, their specialty is dirty tricks: working with hackers, planting fake news and amplifying it with bots, and creating scandals for their clients’ opponents. All that data-driven marketing and voter analytics are just a cover for political hit jobs and custom propaganda behind the glossy veneer of typical political ads and lectures about the power of big data. Pay them enough and they’ll “send some girls up to a politician’s office,” particularly Ukrainian ones, or pose as wealthy developers to set up a James O’Keefe-style “expose” and break some more rules, coordinating with dark money groups to make the end result go viral on Facebook, Twitter, and YouTube.
They are a natural consequence of sociopathic politics with no limits on how much money can be spent to win elections. When winning is all that matters, not why you want to win, and money is no object, driving up the stakes of all campaigns, it’s only a matter of time until candidates resort to enlisting the services of political hitmen for whom election laws are merely polite and outdated suggestions. And now that they’ve attracted some real scrutiny after claiming credit for one of the biggest upsets in American political history, and implicated social media’s opportunistic personal data practices in the process, people like Wylie are taking their leave and whitewashing their participation, while the rest of the company can’t decide what it should pretend to be in light of the very incriminating and disturbing investigations into its dirty deeds currently making international headlines.
Make no mistake, it’s a huge deal that your social media data is being sold or given away by companies that profit from using it irresponsibly. It’s also a very big deal that people are trying to build tools to brainwash voters by flooding their media diet with propaganda they hope will make them think or vote in certain ways. But their chances of success are very slim, and the real scandal here is that so many elections around the world are becoming expensive, years-long affairs with few rules and perfectly legal dark money backers who are happy to break the law, lie, and pit citizens against each other to get their favorite candidates in office by any means they deem necessary. In his quest to “drain the swamp” in D.C., Trump managed not just to make it even swampier, filling it with even worse monsters, but to also taint the electoral process with immoral henchmen whose sense of decency is nonexistent.