Twitter's terms of service don't allow tweet datasets to be published on the web, but they do allow tweet identifier datasets to be shared. This speaks to users' rights as content creators, while also allowing researchers to share their data with others.
This site is a catalog of datasets that are publicly available on the web. If you would like to turn these tweet identifier datasets back into the original JSON, first download the dataset and then use the Hydrator desktop application, or Twarc if you are comfortable working at the command line.
You can add your own datasets to the catalog by following these instructions. If you'd like updates when datasets are added please subscribe to the RSS feed. All metadata listed here is licensed CC0. You may want to refer to our code of conduct if you have questions or concerns about the datasets we list here.
The Unite the Right rally (also known as the Charlottesville rally) was a protest in Charlottesville, Virginia, United States from August 11–12, 2017, to oppose the removal of a statue of Robert E. Lee in Emancipation Park, which itself was renamed from Lee Park two months earlier. Protesters included white supremacists, white nationalists, neo-Confederates, neo-Nazis, and militias. This dataset contains 200,113 tweet ids collected with the #unitetheright hashtag. Data collection was performed twice from the search API using twarc: once at 2017-08-13 11:46:05 GMT and the other at 2017-08-15 12:03:48 GMT. The second search was run to collect only up to where the first search left off. The time ranges for the tweets are from 2017-08-04 11:44:12 to 2017-08-15 16:03:30 GMT.
The #WITBragDay hashtag was used starting August 12, 2017 by women sharing their accomplishments in technology. Tweets matching the query WITBragDay were collected using the POST statuses/filter method of the Twitter Stream API and the GET statuses/search Twitter REST API using Social Feed Manager. There are 34,266 ids for tweets retrieved from the filter stream and 47,621 ids for tweets retrieved using the search API. The dataset includes a list of 52,457 unique tweet ids from both APIs.
On Friday, August 11th, 2017 a large group of racist white nationalists carrying torches marched on the University of Virginia campus in Charlottesville, VA as an intimidation tactic against proponents of the removal of Confederate statues of Robert E. Lee. The Friday evening march was held ahead of a much larger racist white nationalist rally in the center of Charlottesville planned for Saturday, August 12th, 2017. This dataset includes 100,000 tweet ids collected using the DocNow prototype http://app.docnow.io/ and includes tweets sent from 01:13:56 - 7:11:36 EDT on August 12.
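The counts above imply an overlap between the two sources: 34,266 + 47,621 = 81,887 ids in total, of which 52,457 are unique, so 29,430 ids appear in both. A minimal sketch of how such a combined list could be built is shown below. The function name and the sample ids are illustrative assumptions, not part of the dataset or of Social Feed Manager.

```python
def merge_unique_ids(*id_lists):
    """Return tweet ids from all lists, de-duplicated, in first-seen order."""
    seen = set()
    merged = []
    for ids in id_lists:
        for raw in ids:
            tweet_id = raw.strip()  # tolerate trailing newlines from files
            if tweet_id and tweet_id not in seen:
                seen.add(tweet_id)
                merged.append(tweet_id)
    return merged


# Hypothetical sample ids standing in for the filter and search id lists.
filter_ids = ["1", "2", "3"]
search_ids = ["2", "3", "4"]
unique_ids = merge_unique_ids(filter_ids, search_ids)
print(len(unique_ids))
```

Ids are kept as strings so that they round-trip unchanged to and from text files.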
39,264 IDs for tweets related to the Charlottesville KKK rally on July 8, 2017. These tweet IDs matched a search for ‘Charlottesville KKK OR #charlottesvilleKKK OR #blocKKK or #blocKKKparty’. These tweet IDs were collected with the twarc command line tool from Documenting the Now. Using twarc’s hydrate command, researchers can retrieve the full content of those tweets—with additional metadata provided by Twitter’s API—provided the tweets still exist.
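The hydration step described above can be sketched in outline. This is not twarc's implementation: the `lookup` callable is a hypothetical stand-in for whatever API client is used, and the batch size of 100 is an assumption based on the per-request limit of Twitter's statuses/lookup endpoint. Tweets that no longer exist are simply absent from the results.

```python
from typing import Callable, Dict, Iterable, Iterator, List

BATCH_SIZE = 100  # assumption: the lookup endpoint accepts up to 100 ids per call


def batched(ids: Iterable[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` ids."""
    batch: List[str] = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def hydrate(ids: Iterable[str],
            lookup: Callable[[List[str]], List[Dict]]) -> Iterator[Dict]:
    """Yield tweet objects for the ids that `lookup` can still find."""
    for batch in batched(ids):
        yield from lookup(batch)


def fake_lookup(batch: List[str]) -> List[Dict]:
    # Hypothetical stand-in for an API call: pretends odd-numbered ids were deleted.
    return [{"id_str": i} for i in batch if int(i) % 2 == 0]


tweets = list(hydrate([str(n) for n in range(1, 251)], fake_lookup))
print(len(tweets))
```

Because deleted tweets drop out silently, the number of hydrated tweets is usually smaller than the number of ids in the dataset.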
Identifiers for 25,489 tweets about the students’ strike at the University of Puerto Rico. The tweets included the hashtag #HuelgaUPR or #Huelga2017 and are from April 11 to May 18, 2017. The tweets were collected using twarc. For a list of resources about the strike visit Puerto Rico Syllabus. Identificadores de 25,439 tuits sobre la huelga estudiantil en la Universidad de Puerto Rico. Los tuits fueron capturados utilizando twarc y cubren el periodo del 11 de abril al 18 de mayo. Para más información sobre la huelga visite Puerto Rico Syllabus.
Identifiers for 782,509 tweets that included the hashtag #macronleaks or #macrongate that were sent between 2017-05-02 07:02:05 and 2017-05-10 16:14:51 UTC. The tweets were collected from the Twitter Search API using twarc. The data does not include the first use of the #macrongate hashtag, but it does include the first use of the #macronleaks hashtag which went viral after Wikileaks retweeted it. More about the story of the #macronleaks hashtag can be found at: http://www.newyorker.com/news/news-desk/the-far-right-american-nationalist-who-tweeted-macronleaks
On 20 April 2017 the Australian Government announced that the Australian citizenship test would be made harder, with an increased focus on ‘Australian values’. Suggestions as to what ‘Australian values’ might actually be soon started to be shared on Twitter using the hashtag #australianvalues. 55,698 tweet ids for #australianvalues collected with Documenting the Now’s Twarc from 20 to 27 April 2017.
681,668 tweet ids for #climatemarch collected with Documenting the Now’s twarc from January 22-26, 2017. Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.com/DocNow/twarc). twarc.py hydrate climatemarch_tweet_ids.txt > climatemarch.json.
This bag contains 10,159,892 tweets and retweets sent by or to Twitter user jk_rowling between 2015-07-08 and 2017-03-18. The tweets were collected with Social Feed Manager (m5_003).
1,276,220 tweet ids for #MarchForScience collected with Documenting the Now’s twarc from January 22-26, 2017. Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.com/DocNow/twarc). twarc.py hydrate MarchForScience_tweet-ids.txt > MarchForScience.json.
The hashtag #BlackWomenAtWork began trending following Fox News host Bill O’Reilly’s sexist and racist comments about California Congresswoman Maxine Waters’s hair on March 28th, 2017 and White House Press Secretary Sean Spicer’s remarks to journalist April Ryan during a press briefing on the same day. The hashtag began trending after Brittany Packnett used it in a set of tweets where she asked black women to share their experiences about being black women at work. These tweet ids were collected on four separate occasions using the DocNow prototype twitter collection tool. bwaw1 (10,000 tweets), bwaw2 (41,256 tweets), bwaw3 (92,756 tweets) were collected on March 28th, the day the hashtag began trending. bwaw4 (140,000 tweets) was collected on March 29th.
This bag contains 2,711,011 tweet identifiers collected from the Twitter filter stream between 2017-02-09 and 2017-03-18 that used one or more of the following hashtags: alternativefacts, fakenews, truthiness, postfact, posttruth, factcheck. The original tweets were collected using twarc.
This dataset contains the tweet ids of 7,275,228 tweets related to the Women’s March on January 21, 2017. They were collected between December 19, 2016 and January 23, 2017 from the Twitter API using Social Feed Manager. See included README.txt for additional information.
#brexit tweets collected from the 5th of May to the 24th August 2016.
14,478,518 tweet ids for #WomensMarch collected with Documenting the Now’s twarc from January 21-28, 2017. Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.com/DocNow/twarc). twarc.py --hydrate WomensMarch_tweet_ids.txt > WomensMarch.json. Also included are the log files for the Filter API and Search API queries. The Filter API query captures the cumulative number of dropped tweets.
On January 12th, 2017 the Senate voted 51-48 to approve a budget resolution as the first step in repealing the Affordable Care Act. The hashtag #SaveACA began being used heavily on Twitter the same day as a response. This dataset includes tweet ids collected on four separate occasions on January 12th and 13th, 2017 for the hashtag #SaveACA
An ongoing collection of Tweets collected by NCSU Libraries using twarc for the key terms “HB2”, “WeAreNotThis”, “BoycottNC”, “KeepNCFair”, and “ThisIsNotUs”. Collection of “WeAreNotThis”, “BoycottNC”, “ThisIsNotUs”, and “North Carolina” began on 2016-03-24, and collection of “HB2” began on 2016-12-25. Only Tweets including “HB2”, “bathroom”, “bill”, or “KeepNCFair” are included from the “North Carolina” set. These tags were used to discuss North Carolina House Bill 2 (The Public Facilities Privacy & Security Act), passed in March 2016, which includes provisions (among others) that disallow local municipalities from passing their own anti-discrimination ordinances and also require individuals, when using public bathrooms, to use those that align with their sex as stated on their birth certificates rather than the restroom that is consistent with their gender identity (see: https://en.wikipedia.org/wiki/Public_Facilities_Privacy_%26_Security_Act). This dataset is broken into files of no more than 50,000 Tweet IDs each.
A list of 10,538 Twitter IDs for tweets harvested between 4 January at 11am and 9 January at 11am using Social Feed Manager. As this used the search API, the 4 January at 11am crawl went back about 5-9 days. Tweet IDs are included, as is a log of the decisions made to curate this dataset.
A list of 24,876 Twitter IDs for tweets harvested between Nov. 28 and Dec. 6 2014 containing the hashtag #bill10. Bill 10 in the Alberta legislature would have given public and Catholic school boards the right to refuse student requests to form gay-straight alliances in schools. Under intense public interest it was withdrawn by the Conservative government.
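Splitting a long id list into files of no more than 50,000 ids each, as this dataset does, can be sketched as follows. This is an illustration only, not the script NCSU Libraries used; the function name and the stand-in ids are assumptions.

```python
def split_ids(ids, max_per_file=50_000):
    """Split a list of tweet ids into consecutive chunks of at most max_per_file."""
    return [ids[start:start + max_per_file]
            for start in range(0, len(ids), max_per_file)]


# Hypothetical stand-in ids; a real run would read them from the collection.
all_ids = [str(n) for n in range(120_000)]
chunks = split_ids(all_ids)
print([len(chunk) for chunk in chunks])
```

Each chunk can then be written to its own text file, one id per line, and hydrated separately.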
These 136,990 tweet ids represent reaction to a Facebook Live video that was posted on January 3rd, 2017, showing four African American men violently attacking a white, mentally disabled man. The tweets were collected on 01/05/2017. After the video surfaced, the Twitter hashtag, #BLMkidnapping, was created and used to incorrectly attribute the violent attack to members of the Black Lives Matter movement. Police in Chicago, where the attack took place, have found no evidence the attack has any connection to the Black Lives Matter movement. This link is to a CNN story documenting the police denial of Black Lives Matter connection: http://www.cnn.com/2017/01/05/us/black-lives-matter-chicago-facebook-live-beating/index.html
This is a dataset of ids for tweets purchased from Twitter as part of the Beyond the Hashtags study http://cmsimpact.org/resource/beyond-hashtags-ferguson-blacklivesmatter-online-struggle-offline-justice/ The dataset includes a year of tweets that mention one or more of 45 keywords associated with the BlackLivesMatter movement. This period covers a critical time in which social media was used to raise awareness about police killings of unarmed Black citizens in the United States.
228,086 tweet ids for “TheHip, hipinkingston” captured during the Tragically Hip’s final concert in Kingston, Ontario in August 2016. Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.com/DocNow/twarc). twarc.py --hydrate th_final_concert_kingston_tweet_ids.txt > th_final_concert_kingston.json
These are tweets that were collected between August 27, 2015 and January 4, 2016 that mention the word “trump”. This period marked important early months in the Republican primaries. They were collected from Twitter’s streaming API using twarc.
There are 40,202,199 tweet identifiers in all. Due to network outages there are gaps at the following points: 2015-08-27 19:12:37 - 2015-08-27 20:13:44 ; 2015-11-02 02:02:13 - 2015-11-05 16:20:35 ; 2015-12-28 02:02:42 - 2015-12-28 02:04:00
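Gaps like these can be checked from the ids alone, because tweet ids issued since late 2010 encode their creation time: the bits above the lowest 22 hold milliseconds since Twitter's custom epoch (1288834974657 ms after the Unix epoch). The sketch below is an illustration of that idea, not the method used to produce the list above. The function names and the one-hour default threshold are assumptions.

```python
from datetime import datetime, timedelta, timezone

TWITTER_EPOCH_MS = 1288834974657  # offset used by Twitter's snowflake ids
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tweet_time(tweet_id: int) -> datetime:
    """Return the UTC creation time encoded in a snowflake tweet id."""
    ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return UNIX_EPOCH + timedelta(milliseconds=ms)


def find_gaps(tweet_ids, min_gap_seconds=3600):
    """Return (start, end) pairs where consecutive tweets are far apart in time."""
    times = sorted(tweet_time(i) for i in tweet_ids)
    gaps = []
    for earlier, later in zip(times, times[1:]):
        if (later - earlier).total_seconds() >= min_gap_seconds:
            gaps.append((earlier, later))
    return gaps
```

A suitable threshold depends on how busy the stream is: a shorter value is needed to catch an outage of a minute or two, at the cost of flagging ordinary quiet periods.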
8,595,589 tweet ids for aleppo tweets captured during the fall of Aleppo in December 2016. Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.com/DocNow/twarc). twarc.py --hydrate aleppo_tweet_ids.txt > aleppo.json
This dataset contains the tweet ids of approximately 280 million tweets related to the 2016 United States presidential election. They were collected between July 13, 2016 and November 10, 2016 from the Twitter API using Social Feed Manager. These tweet ids are broken up into 12 collections. Each collection was collected either from the GET statuses/user_timeline method of the Twitter REST API or the POST statuses/filter method of the Twitter Stream API.
Tweet ids for #YMMfire tweets captured during the 2016 Fort McMurray Wildfire from 2016-05-01 to 2016-06-25.
This data set identifies 38M tweets collected for the analysis of social media messages related to the 2012 U.S. Presidential election. The data set provides tweet IDs for tweets containing the words “obama”, “romney”, or both (case-insensitive matching) during the period from July 1, 2012 through November 7, 2012. The paper, “Online and Social Media Data As an Imperfect Continuous Panel Survey.” PLoS ONE 11(1): e0145406 by Diaz et al. provides further description of the dataset.
Tweet ids for #NDP2016 tweets during the 2016 NDP Convention.
Tweet ids for #panamapapers tweets.
Tweet ids for #thechalkening tweets.
Tweet ids for #MakeDonaldDrumpfAgain tweets.
Tweet IDs for tweets carrying the #cdnpoli hashtag, applied to Canadian politics, collected as part of a larger project centered on Canada’s 42nd federal election.
Tweet ids for #paris #Bataclan #parisattacks #porteouverte tweets.
Tweet ids for #elxn42 tweets.
This item represents a collection of 13,480,000 tweet IDs that mentioned ‘ferguson’ from 2014-08-10 to 2014-08-27 and 15,080,078 tweet IDs that mention “ferguson” between 2014-11-11 and 2014-12-08. The first set includes tweets for the two week period after the shooting of Michael Brown, and the second range includes tweets around the grand jury’s decision not to indict police officer Darren Wilson which was announced on 2014-11-24. The first set of tweets was collected by Ed Summers at the University of Maryland and the second was a collaboration between Molly Loyd, Gregory Coleman, Kimberly Lamke, Benjamin Sugar and Ed Summers.
This dataset contains 32,056 tweets that mention “ferguson” between August 8 and August 10, 2014. They were collected on May 7th, 2015 from the search form on Twitter’s website. One important side effect to be aware of is that the dataset does not include retweets or tweets that were deleted before May 7th, 2015.
Tweet ids for #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, #CharlieHebdo tweets.
This dataset contains Twitter JSON data for several Twitter search queries that were collected around the #YesAllWomen Twitter “conversation” between May 25, 2014 and June 8, 2014 using the twarc (https://github.com/edsu/twarc) package that makes use of Twitter’s search API. A total of 2,805,763 Tweets and 34,532 images make up the combined dataset.