From its early adopters in 2007 at SXSW to its use today by journalists around the world, Twitter has always been a social media platform for documenting events. In 2014 the Documenting the Now project itself got started around efforts to document the protests and activism following the murder of Michael Brown in Ferguson, Missouri. Twitter remains one of the most accessible sources of data for breaking events of various kinds.
The combination of Twitter’s new V2 API and their Academic Research Product Track means that Twitter is still an excellent source of information about events. But figuring out the best practices for documenting events can be challenging. So we were really pleased to see Ryan Gallagher’s short thread about the 4 steps he follows when using Twitter’s API to document events.
https://twitter.com/ryanjgallag/status/1390725526726266882
This post is simply a gloss on Ryan’s thread, providing examples of how to perform each step with the new functionality we released in twarc2. But put on your snorkel and diving gear, because we are headed to the command line. (Aside: if you are new to the command line and are looking for some help, see the links to the documentation and our Slack channel at the end of this post.)
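One thing Ryan’s thread takes for granted is that twarc2 already knows your Twitter API credentials. If you haven’t set that up yet, the interactive configure command will prompt you for your keys (assuming you have already created an app in the Twitter developer portal) and save them for all the commands below:
$ twarc2 configure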
1. Collect Live Tweets
https://twitter.com/ryanjgallag/status/1390726300906700801
If the event is ongoing and you know a few relevant hashtags or keywords you can use the filter stream API to collect tweets as they happen. We did this recently for the Kenmure Street Protests, where the hashtags #kenmurestreet and #kenmurest were being used. First we added the two keywords to our streaming rules:
$ twarc2 stream-rules add kenmurest
$ twarc2 stream-rules add kenmurestreet
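If you ever lose track of what you are streaming, you can ask Twitter for the currently active rules:
$ twarc2 stream-rules list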
And then started streaming the tweets to a file tweets.jsonl:
$ twarc2 stream > tweets.jsonl
You can leave this running as long as you want to collect tweets: hours, days, months … we’ve run a twarc job for over a year before. The hard part is keeping the computer turned on and connected to the Internet. Most of the time you are only interested in a few days or weeks, which was the case for the #kenmurestreet data collection.
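Once the event has wound down it’s worth deleting the rules, so that a future stream doesn’t unexpectedly pick up where the old one left off:
$ twarc2 stream-rules delete kenmurest
$ twarc2 stream-rules delete kenmurestreet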
2. Collect Past Tweets
https://twitter.com/ryanjgallag/status/1390727096616505349
Unless you planned to document an event before it happened, you will often need to go back and get older tweets that were already sent. If you note down when you started collecting the live tweets, you can use that time in combination with when the event began to scope the data collection, for example:
$ twarc2 search 'kenmurest OR kenmurestreet' --start-time 2021-05-13 --end-time 2021-05-21 > tweets.jsonl
You can collect tweets from the last 7 days, but as Ryan says you will need access to the Academic Research product track to search further back. If you have access you can simply add the --archive option, but be careful to limit how much you are collecting with a --start-time or --limit (discussed below) or else you can use up your monthly quota!
$ twarc2 search --archive 'kenmurest OR kenmurestreet' --start-time 2021-01-01 --end-time 2021-05-21 > tweets.jsonl
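If you are worried about your quota, it can help to estimate how big a query is before running a full-archive search like the one above. Newer versions of twarc2 include a counts command that uses Twitter’s tweet counts endpoint, which (to the best of our knowledge) doesn’t count against your monthly tweet cap; a sketch, assuming your twarc2 is recent enough to have it:
$ twarc2 counts --archive --granularity day 'kenmurest OR kenmurestreet'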
3. Collect Conversations
https://twitter.com/ryanjgallag/status/1390727373713203205
Collecting hashtags and keywords can be a useful entry point into tweets about an event, but it will not collect responses to those tweets and all the threaded conversations that can happen. You can use the conversations command to collect all the conversation threads that were referenced in your initially collected data:
$ twarc2 conversations tweets.jsonl > conversations.jsonl
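The conversations command is built on the search endpoints, so the same 7 day window applies. If your tweets are older than that, and you have Academic Research access, it should also accept the --archive option (check twarc2 conversations --help for your version to confirm):
$ twarc2 conversations --archive tweets.jsonl > conversations.jsonl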
4. Collect Timelines
https://twitter.com/ryanjgallag/status/1390727713082728448
It can also be useful to see what users were doing while they were documenting an event. Collecting user timelines can be one way of seeing what some of this ambient content and activity might be.
$ twarc2 timelines tweets.jsonl > timelines.jsonl
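If you are only curious about one account, there is also a singular timeline command that takes a username rather than a file of tweets (the username here is just a placeholder):
$ twarc2 timeline placeholder_user > placeholder_user.jsonl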
If the event didn’t happen in the last 7 days you will need to use the historical archive:
$ twarc2 timelines --archive tweets.jsonl > timelines.jsonl
As Ryan notes, depending on how many tweets you have collected you could end up putting a big dent in your monthly quota. You can use --limit and --timeline-limit to control how much data is returned. For example, if you only wanted to collect the last 100 tweets of each user and no more than 25,000 timeline tweets in total:
$ twarc2 timelines tweets.jsonl --limit 25000 --timeline-limit 100 > timelines.jsonl
5. Limits
https://twitter.com/ryanjgallag/status/1390728895352221697
Speaking of limits, you can actually apply --limit at any stage of the process. So if you wanted to search for tweets, but collect no more than 50,000, you could run:
$ twarc2 search kenmurest --limit 50000 > search.jsonl
Or if you wanted to cap the total number of tweets collected across all the conversations:
$ twarc2 conversations tweets.jsonl --limit 50000 > conversations.jsonl
You can also control how many tweets per conversation to collect. So for example if you only wanted to collect the first 100 tweets in each conversation:
$ twarc2 conversations tweets.jsonl --conversation-limit 100 > conversations.jsonl
Now What?
twarc will log what activities it is performing, and when, to a file named twarc.log.
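If you have a long-running collection going, like the stream in step 1, you can keep an eye on it from another terminal by following the log with a standard Unix tool:
$ tail -f twarc.log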
This log can be useful for remembering what you’ve done along the way. Once you have collected the data you will no doubt wonder what to do with all the JSON. While this is largely a topic for another post, a lot of people want to turn the data into CSV for examination in Google Sheets, or as an R or Pandas dataframe. To do that you can install the twarc-csv plugin and run:
$ twarc2 csv tweets.jsonl > tweets.csv
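If twarc2 complains that it doesn’t know a csv command, the plugin probably isn’t installed yet; it lives in a separate package on PyPI:
$ pip install twarc-csv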
If you have any questions about this please check out the twarc2 documentation or join us in the Documenting the Now Slack. Documenting the Now’s Ed Summers also recently gave a presentation about the Twitter V2 API and twarc2 at the University of Maryland Social Data Science Center: