Counting Tweets

Chinese abacus by David R. Tribble — displaying the decimal number 2,048

Today Twitter released new API endpoints that let you fetch the number of tweets matching a query (such as a hashtag) over time.

… we’re excited to launch two new endpoints, recent Tweet counts and full-archive Tweet counts, to the Twitter API v2. These endpoints are valuable for a number of reasons, but most commonly, to understand the size of the conversation, or the amount of data a query will return, prior to submitting a search request.

Thanks to some light coordination between Twitter and software developer Igor Brigadir, a new version of the twarc utility was released today with a new command for collecting data from the counts endpoint.

This is significant news for researchers because these aggregate statistics can be critical for data gathering activities, and until now the numbers were known only to Twitter itself.

For example, it’s now possible to use a single twarc command to get the number of #blacklivesmatter tweets per day since the very first tweet, and save it as a CSV file, that you can then use to create a graph of the hashtag’s usage over time:

#blacklivesmatter hashtag usage by day

This admittedly dull graph is really quite remarkable, because it clearly shows that on May 28, 2020 there was a huge increase in the number of #blacklivesmatter tweets. There’s really no need to look any closer at the data to realize what happened that day:

https://www.nytimes.com/video/us/100000007161078/george-floyd-minneapolis-protests.html?smid=url-share

The results are a bit surprising, however, because the total number of #blacklivesmatter tweets is 71,190,571. That’s a large number, but it’s not as big as you might expect.

Counting tweets that match #blacklivesmatter or #blm doesn’t change the shape of the graph too much, but it does result in more than four times as many tweets being returned: 302,401,621. It’s important to remember here that these counts do not include tweets that have since been deleted or protected. Depending on what you are counting, this can be a significant factor when interpreting counts for particular events.

Once twarc is installed and configured, the command for collecting these counts is pretty simple. Here is the command for retrieving the raw hourly count data (JSON) from the API for tweets matching blacklivesmatter or blm over the last seven days:

$ twarc2 counts "blacklivesmatter OR blm" counts.jsonl
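As a rough sketch of how that JSON output could be totaled in Python: this assumes each line of counts.jsonl is one API response whose `data` list holds time buckets with a `tweet_count` field, as in the Twitter API v2 counts response. The function name `total_from_jsonl` and the sample line are invented for illustration.

```python
import json


def total_from_jsonl(lines):
    """Sum tweet_count over every bucket in every response line.

    Assumes one JSON response per line, each with a "data" list of
    buckets carrying a "tweet_count" field (an assumption about the file).
    """
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        response = json.loads(line)
        for bucket in response.get("data", []):
            total += bucket["tweet_count"]
    return total


# Invented sample line standing in for one line of counts.jsonl.
sample = [
    '{"data": [{"start": "2021-01-01T00:00:00.000Z", '
    '"end": "2021-01-01T01:00:00.000Z", "tweet_count": 3}, '
    '{"start": "2021-01-01T01:00:00.000Z", '
    '"end": "2021-01-01T02:00:00.000Z", "tweet_count": 4}]}'
]
print(total_from_jsonl(sample))
```

With a real file, the same function can be called on the open file handle, for example `total_from_jsonl(open("counts.jsonl"))`.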

To collect the daily counts from the full archive and write the data as CSV instead of JSON, a few more options are needed:

$ twarc2 counts "blacklivesmatter OR blm" counts.csv --archive --csv --granularity day 

For an example of how the simple graph above was created, take a look at this Jupyter notebook.
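Short of plotting, the daily CSV can also be summarized directly to get a total and the busiest day. The sketch below is not the notebook's code; it assumes the CSV has a `start` column and a `day_count` column (check the header row of your own file), and the function name `summarize_counts` and the sample rows are made up.

```python
import csv
import io


def summarize_counts(csv_file, count_column="day_count"):
    """Return (total, peak_start, peak_count) for a counts CSV.

    The column names "start" and "day_count" are assumptions about
    the file layout; pass a different count_column if yours differs.
    """
    total = 0
    peak_start, peak_count = None, -1
    for row in csv.DictReader(csv_file):
        count = int(row[count_column])
        total += count
        if count > peak_count:
            peak_start, peak_count = row["start"], count
    return total, peak_start, peak_count


# Invented rows standing in for counts.csv.
sample_csv = io.StringIO(
    "start,end,day_count\n"
    "2021-01-01T00:00:00.000Z,2021-01-02T00:00:00.000Z,10\n"
    "2021-01-02T00:00:00.000Z,2021-01-03T00:00:00.000Z,50\n"
    "2021-01-03T00:00:00.000Z,2021-01-04T00:00:00.000Z,20\n"
)
print(summarize_counts(sample_csv))
```

For a real file, open it with `open("counts.csv", newline="")` and pass the handle in place of `sample_csv`.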

Finally, as a counterpoint, consider this graph of tweets mentioning #maga:

There are a few peaks and valleys there, but zooming in on the largest of them yields this graph:

Zoomed in on January 2021

That peak is January 6, 2021.
