Popping the Filter Bubble: Data Collection and Visualization

Haley Fox

Prof. Lampe

SI 316

2 December 2016

Introduction

When I first imagined this project, I aimed to uncover patterns in my Facebook feed in order to better understand how Facebook’s algorithm shapes my filter bubble. After the election, I was intrigued by how similar and frequent the political posts flooding my newsfeed were, and I wanted to find out why Facebook orders the posts on my feed the way it does. Do all of my friends make political posts, or is the “flooding” effect perpetuated by a vocal minority? Does click-based engagement (likes, loves, etc.) dominate the algorithm, or are comments the more rewarded form of engagement? Does Facebook prioritize posts by people who link back to another Facebook page? My theory was that where a post showed up on my feed would correlate strongly with how old the post was, how much engagement it received, and whether it linked to an internal or external source. What I uncovered about my own feed surprised me, and I have visualized the data in a way that will allow my peers (who are in some ways demographically similar to me) to see these trends within their own feeds. Through my analysis, I will show that the Facebook algorithm is mindbogglingly difficult to decipher, and that even the most outwardly obvious ranking metrics might not be as obvious as one would think. And, if the Facebook “echo chamber” is the result of this increasingly complex, constantly learning algorithm, perhaps it is already out of the developers’ power to change it.

Methods

Data collection occurred from November 18, 2016, to November 30, 2016. To collect data, I started at the top of my Facebook feed and looked for posts that dealt with the political landscape: for example, the election, the environment, and the alt-right. I did not include ads, suggested posts, posts to a group, or posts made by a page. I then coded each post for several demographic variables and searched it for the presence of certain keywords. These more than 85 keywords, like “Donald/Trump/Donald Trump” and “America/American”, were chosen because they were highly likely to appear on my feed. Keywords such as “Hamilton” and “Boo/Booed” were added as certain topics became trendy. However, the keywords proved insignificant to my overall argument and thus have not been visualized. The raw data is available to examine here. To preserve anonymity, all parties within the analysis of my data will be referred to only by their identification letter.

Demographics

I shall begin by explaining the demographics of my participant pool. I coded 49 posts made by 30 individuals. Approximately 60% were female, 30% were male, and 10% were gender nonconforming (fig. 1). Of these individuals, 61.2% were straight-identifying and 38.8% were queer-identifying (fig. 2). By my own account, I am a single, liberal millennial interested in video games, dogs, and my education. According to Facebook (fig. 3), I am a single, liberal millennial interested in African-American culture, books, and the Communist Party of India, among other things. While Facebook’s perception of me isn’t exactly correct, it appears that the posts on my feed are stacked in a way that aligns with that surmised identity.

Correlations

I began by finding that approximately 41% of the posts included in this study were made by 20% of the participants, revealing that a small group of people is very politically active on my Facebook feed, rather than a large group of people who are only somewhat politically active. Then, I looked at whether a person’s age was a determinant of how often their posts appeared on my feed. The Pearson correlation coefficient is 0.30, and as is visible in Figure 4, there is only a weak correlation between an individual’s age and how often they post. However, there is a strong correlation of 0.87 between the number of people in an age group and the total number of posts they made, as shown in Figure 5. While this could simply be because the majority of my Facebook friends are close to me in age, it still suggests that Facebook most frequently shows me posts made by people my own age.
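For readers who want to check coefficients like these against the raw data, here is a minimal Python sketch of how a Pearson correlation coefficient can be computed by hand. The function name pearson_r and the sample values are assumptions made purely for illustration; the numbers are invented and are not taken from my dataset, and this is not necessarily how the figures above were produced.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of the products of each pair's deviations from the means
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return co_dev / sqrt(ss_x * ss_y)


# Invented example values (NOT the study's data): ages and post counts
ages = [19, 20, 21, 22, 25, 31]
post_counts = [1, 3, 2, 4, 1, 2]

r = pearson_r(ages, post_counts)
print(round(r, 2))
```

The result always falls between -1 and 1: values near 0 indicate little linear relationship, while values near either extreme indicate a strong one.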

Next, I wanted to look at what might keep a post from showing up on my feed. First, I tested whether a post’s age is a barrier to the feed by looking for a correlation between when it was first posted and how far down the feed it appeared. As one can see in Figure 6, there is barely any correlation between the age of a post and how early or late it appears on my feed. Indeed, the correlation coefficient is 0.34, only slightly higher than the correlation between age and post frequency. However, given the complexity and vastness of the algorithm, even a coefficient this small suggests that a post’s age is indeed a factor in feed placement.

Second, I looked at whether click-based engagement affects feed placement. Based on the coefficient of -0.23, the two appear less correlated than I had originally suspected. To elaborate, if a post with 100 likes appeared first and a post with zero likes appeared last, the coefficient would be -1. In Figure 7, a correlation of -1 would produce two line graphs that mirror each other horizontally and cross at the center. Based on my findings, a high number of likes does not immediately lead to a first-place spot on the feed.

Third, I looked at where each post’s link points as a possible indicator of the algorithm’s favor. My original theory was that, since Facebook is a company that sells ads, it wants to keep the user on the site as long as possible. Therefore, one could assume that posts with internal links (links that reference back to Facebook) would receive preferred placement. Despite this logical train of thought, the coefficient for these variables is a mere 0.04, the smallest correlation yet. In coding this data, I set any “# on Feed” value that was less than 20 to 1, and any value greater than 20 to 2. I also set internal links to equal 1 and external links to equal 2. If my original theory held, the graphs made by these two datasets would be identical, but Figure 8 shows just how dissimilar a 0.04 correlation is.
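The 1/2 coding described above, along with the "100 likes first, zero likes last" example, can be sketched in Python as follows. Everything named here (pearson_r, code_feed_position, code_link, and the sample posts) is hypothetical and invented for illustration, not drawn from my dataset. In particular, my coding rule does not say what happens to a post at exactly position 20, so the sketch assumes it is coded as 2.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists of numbers."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / sqrt(ss_x * ss_y)


def code_feed_position(position, cutoff=20):
    """Code a '# on Feed' value as 1 (above the cutoff) or 2 (below it).

    ASSUMPTION: a value exactly equal to the cutoff is coded as 2;
    the coding rule in the text leaves that case unspecified.
    """
    return 1 if position < cutoff else 2


def code_link(kind):
    """Code a link as 1 for 'internal' (back to Facebook) or 2 for 'external'."""
    return {"internal": 1, "external": 2}[kind]


# Invented example posts (NOT the study's data): (# on Feed, link type)
sample_posts = [
    (3, "internal"), (8, "external"), (15, "external"),
    (27, "internal"), (34, "external"), (41, "internal"),
]

feed_codes = [code_feed_position(pos) for pos, _ in sample_posts]
link_codes = [code_link(kind) for _, kind in sample_posts]
r_links = pearson_r(feed_codes, link_codes)

# The two-post example from the text: 100 likes at position 1,
# zero likes at the last position (49 here, the number of coded posts).
r_likes = pearson_r([100, 0], [1, 49])

print(round(r_links, 2), round(r_likes, 2))
```

If internal links always landed near the top of the feed, the two coded lists would match exactly and the coefficient would be 1; the closer the result is to 0, the less the two codings have to do with each other.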

Conclusions

Thus, at first glance, my theory about how Facebook ranks its newsfeed was incorrect. Because the algorithm is “stubbornly opaque” (Oremus, 2016), correlations that seem minute may actually be significant when the scope of the algorithm is taken into consideration. Since we do not know exactly how many variables the Facebook algorithm quantifies, the question then becomes: who should be held responsible when the algorithm fails us? This is perhaps one of the most important ethical questions society will have to confront in the coming years. Based on the data I collected, the research I’ve done, and the black box that is the algorithm, it would seem that the complexity of the algorithm is beyond human understanding. At the risk of sounding hyperbolic, I would liken it to trying to completely understand the inner workings of the brain, a feat society has long been striving towards but has yet to achieve. Finally, I invite the reader to look at my original data, provided here. Many variables were coded that were not discussed here, and they may be of interest. Perhaps the secret to the Facebook algorithm is there, waiting for us, hiding in plain sight.

 

Secondary Sources

Deng, B. (2015, July 01). Machine ethics: The robot’s dilemma. Retrieved December 02, 2016, from http://www.nature.com/news/machine-ethics-the-robot-s-dilemma-1.17881

Oremus, W. (2016, January 03). Who Really Controls What You See in Your Facebook Feed—and Why They Keep Changing It. Retrieved December 02, 2016, from http://www.slate.com/articles/technology/cover_story/2016/01/how_facebook_s_news_feed_algorithm_works.html

Nickolas, S. (2015). What does it mean if the correlation coefficient is positive, negative, or zero? Retrieved December 02, 2016, from http://www.investopedia.com/ask/answers/032515/what-does-it-mean-if-correlation-coefficient-positive-negative-or-zero.asp

Kirk, A. (2012). Data Visualization: A successful design process. Birmingham: Packt Publishing.

Appendix

 

 
