The Data Files: Twitter Emoji Analysis

Published in Code Like A Girl, Dec 6, 2017

Emojis and data are two of my favorite things and I have been itching to combine them in a fun project. A few weeks ago, as I was scrolling through my Twitter feed, inspiration struck:

Describe your dream trip in three emojis.

See the project here.

The Process

1. Scrape the data:

At first I thought scraping would be simple, since all I needed to do was query the Twitter API. Surely there was an endpoint that would give me the responses to a specific tweet. But alas, this was not the case. After several hours of googling and stack-overflowing, I found some possible solutions. What I ended up doing was using the following R script (with help from @HamdanAzhar from @PRISMOJI).

The workaround is essentially to get all mentions of the original account and then filter those mentions by the original tweet ID. It’s better to do this as soon as possible, since for popular accounts you might need to gather hundreds of thousands of responses before you get the full set. Another caveat is that the free API account currently only allows queries as far back as 7 days, so make sure to collect responses within the week.
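The filtering half of that workaround can be sketched in a few lines. This is a minimal Python illustration, not the R script mentioned above. It assumes the mentions have already been downloaded as a list of dicts shaped like Twitter v1.1 tweet objects, where `in_reply_to_status_id_str` holds the ID of the tweet being replied to. The function name `filter_replies` and the sample data are hypothetical.

```python
def filter_replies(mentions, original_tweet_id):
    """Keep only the mentions that reply directly to one tweet."""
    # Compare IDs as strings, since tweet IDs are too large to handle
    # safely as numbers in some tools.
    target = str(original_tweet_id)
    return [
        tweet for tweet in mentions
        if tweet.get("in_reply_to_status_id_str") == target
    ]


# Hypothetical sample of already-downloaded mentions.
mentions = [
    {"id_str": "11", "in_reply_to_status_id_str": "100", "text": "🌴🍹✈"},
    {"id_str": "12", "in_reply_to_status_id_str": "999", "text": "reply to another tweet"},
    {"id_str": "13", "in_reply_to_status_id_str": None, "text": "plain mention"},
]

replies = filter_replies(mentions, 100)
print([tweet["id_str"] for tweet in replies])
```

Only the first mention replies to tweet 100, so it is the only one kept. Replies to other tweets and plain mentions are dropped.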

2. Translate emojis into English:

Once I had the data, I had to convert the Unicode into English so I could better filter and analyze the data. While there are a few Unicode-to-English dictionaries out there, I happened to be working in R and on a Windows machine (so I had some pretty obscure encodings to work with). After several hours of frustration, I finally arrived at a solution that worked. I first extracted the emojis from the comments, since some of them also had extraneous text. From there, I did a few stringr manipulations to get the format I needed and used this dictionary to translate my emojis.
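For readers who are not working in R, here is a rough Python sketch of the same two steps: pull the emojis out of a comment, then translate each one into English. It is not the stringr-plus-dictionary approach described above. It swaps in Python's built-in `unicodedata` names as the dictionary and assumes that Unicode category "So" (Symbol, other) is a good enough emoji test, which holds for most pictographs but also lets through a few non-emoji symbols. The function names are hypothetical.

```python
import unicodedata


def extract_emojis(text):
    """Return the characters in `text` that look like emojis.

    Assumption: category "So" (Symbol, other) is used as a rough emoji
    test, so surrounding words, punctuation, and variation selectors
    are dropped.
    """
    return [ch for ch in text if unicodedata.category(ch) == "So"]


def emoji_to_english(emoji):
    """Translate one emoji character into its lowercase Unicode name."""
    return unicodedata.name(emoji, "unknown").lower()


comment = "Can't wait! 🌴🍹✈"
names = [emoji_to_english(e) for e in extract_emojis(comment)]
print(names)  # ['palm tree', 'tropical drink', 'airplane']
```

Emojis built from several code points (flags, skin-tone variants, family emojis) would be split into their parts by this sketch, so a real dictionary lookup is still the safer choice for those.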

3. Analyze and clean the data:

After I was done translating the emojis, I was finally ready for analysis. I primarily used the tidytext library to work with the dataset.

I also wanted to categorize the different emojis. After thinking about writing a classifier (hi scipy) or using Princeton’s WordNet, I ultimately decided against both. Neither solution offered exactly what I wanted, and since my dataset had only about 150 unique emojis (small by dataset standards), I decided to just bite the bullet, pop on my favorite Spotify playlist, and hand-code the categories that I wanted.

4. Visualize the data:

One of my primary goals from the beginning was to build a custom D3 chord diagram to visualize the different pairings of emojis. Since this was my first time building this sort of D3 chart, I had a bit of learning to do. In particular, I had to figure out the radial scale for positioning, how to append custom images (the emojis) to an SVG, and how to create perfect arcs.
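Two pieces of math sit behind a chart like this: counting how often two emojis show up in the same response, and spacing the emoji images evenly around a circle. The sketch below shows both in Python rather than D3, purely to illustrate the idea. The function names, the sample responses, and the choice to start at 12 o'clock and move clockwise are all assumptions, not details of the actual chart.

```python
import math
from collections import Counter
from itertools import combinations


def count_pairs(responses):
    """Count how often two different emojis appear in the same response."""
    pairs = Counter()
    for emojis in responses:
        # Sort so ("airplane", "tent") and ("tent", "airplane") share one key.
        for a, b in combinations(sorted(set(emojis)), 2):
            pairs[(a, b)] += 1
    return pairs


def radial_positions(labels, radius):
    """Place labels evenly around a circle, starting at the top, clockwise."""
    step = 2 * math.pi / len(labels)
    positions = {}
    for i, label in enumerate(labels):
        angle = i * step
        # SVG y grows downward, so -cos puts angle 0 at the top of the circle.
        positions[label] = (radius * math.sin(angle), -radius * math.cos(angle))
    return positions


# Hypothetical responses, already translated into English names.
responses = [
    ["palm tree", "airplane", "tropical drink"],
    ["palm tree", "airplane", "mountain"],
    ["airplane", "mountain", "tent"],
]

pairs = count_pairs(responses)
positions = radial_positions(["airplane", "mountain", "palm tree", "tent"], 100)
print(pairs[("airplane", "palm tree")])
print(positions["airplane"])
```

The pair counts are what each chord's thickness would encode, and the (x, y) offsets from the circle's center are where each emoji image would be placed in the SVG.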

After figuring all of that out, I arrived at this (tada!): Click here for the interactive version.

Thanks for reading! Let me know if you have any feedback or questions about how I made this. Follow me to read more about data visualization, mini tutorials, and various data projects I am working on. :)

Christine