Twitter's terms of service don't allow tweet datasets to be published on the web, but they do allow tweet identifier datasets to be shared. This speaks to users' rights as content creators, while also allowing researchers to share their data with others.
This site is a catalog of datasets that are publicly available on the web. If you would like to turn these tweet identifier datasets back into the original JSON, first download the dataset and then use the Hydrator desktop application, or Twarc if you are comfortable working at the command line.
You can add your own datasets to the catalog by following these instructions. If you'd like updates when datasets are added please subscribe to the RSS feed. All metadata listed here is licensed CC0. You may want to refer to our code of conduct if you have questions or concerns about the datasets we list here.
This hashtag was used by Twitter user @MatthewACherry on February 9th, 2018 to start a conversation about the history of popular gifs on Twitter. Here is a link to the original tweet using the hashtag https://twitter.com/MatthewACherry/status/962011241815277568
Ursula K. Le Guin died on January 22, 2018 and there was an outpouring of tweets that commemorated her life as a writer. This dataset contains tweet identifiers for 251,287 tweets that mentioned the phrase “Le Guin” between January 14 and January 24, 2018. They were collected with the twarc utility.
The first #BLKTwitterstorians chat of 2018.
This dataset includes 2,995 tweets collected using the keyword “BlackDigArchive” and 1,888 tweets collected using the hashtag “#BlackDigArchive”. The second Documenting the Now symposium, “Digital Blackness in the Archive”, was held on December 11th and 12th, 2017 and addressed issues at the intersection of archival practice and the existence of Black people on the web and social media. Invited speakers discussed their work on the Black experience in online spaces including research on joy and creativity expressed by Black people on the web, cultural and social expression, activism and other acts of resistance, the Black experience with state sponsored online surveillance, and racism and bias in algorithm and social media platform design. The program was an opportunity for the general public, activists, archivists, library and museum professionals, and the academic community, to learn and share together in conversations about digital culture and digital archives that center blackness.
2,278,757 tweet ids for #JeffSessions collected with Documenting the Now’s twarc. Tweets can be “rehydrated” with Documenting the Now’s twarc, or Hydrator.
59,261,490 tweet ids for tweets directed at Donald Trump (@realDonaldTrump), collected with Documenting the Now’s twarc. Tweets can be “rehydrated” with Documenting the Now’s twarc, or Hydrator: twarc hydrate to_realdonaldtrump_ids.txt > to_donaltrump.jsonl. From May 7, 2017 to June 21, 2017 the tweets were collected using a combination of the Filter (Streaming) API and the Search API. The Filter API failed on June 21, 2017. From June 23, 2017 forward only the Search API was used, run every 5 days from a cron job. Collection is ongoing, and this dataset will be periodically updated with additional sets of tweet ids.
1,797,260 tweet ids for #paradisepapers collected with Documenting the Now’s twarc from November 5-26, 2017.
987,938 tweets retrieved that mentioned #PuertoRico over the period of October 4 to November 7, 2017. This was a period where there was increased concern being expressed in social media about the response to the humanitarian crisis caused by Hurricane Maria, which made landfall on September 20. Tweets with ids greater than 919222753353457664 were collected from the streaming API, and the earlier tweets were collected using the search API. In both cases tweets using #PuertoRico were collected.
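Because tweet ids increase over time, the id boundary quoted above can be used to tell which API a given id came from. The following is a minimal Python sketch, not part of the dataset: the function names are illustrative, and the epoch constant is an assumption based on Twitter's published snowflake id layout rather than anything stated in the description.

```python
from datetime import datetime, timezone

# Boundary id quoted in the dataset description.
STREAM_BOUNDARY = 919222753353457664

# Assumption: snowflake tweet ids store a millisecond timestamp in the
# bits above the low 22, measured from this epoch (in milliseconds).
TWITTER_EPOCH_MS = 1288834974657


def collection_method(tweet_id: int) -> str:
    """Return which API an id was collected from, per the description."""
    return "stream" if tweet_id > STREAM_BOUNDARY else "search"


def approximate_time(tweet_id: int) -> datetime:
    """Estimate the creation time (UTC) encoded in a snowflake tweet id."""
    ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```

Calling approximate_time(STREAM_BOUNDARY) gives a rough idea of when the switch from the search API to the streaming API happened.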
This dataset contains the tweet ids of 35,596,281 tweets related to Hurricanes Irma and Harvey. They were collected during these events from the Twitter API using Social Feed Manager. These tweet ids are broken up into 2 collections. Each collection was collected using the POST statuses/filter method of the Twitter Stream API.
This is a collection of 1,430 tweet ids for tweets using the hashtag #BlackTheory collected on September 19th, 2017. The hashtag was used in a conversation started by Dr. Jessica Marie Johnson (@jmjafrx) on September 18th, 2017, where she asked people to name black theorists. See these tweets for context: https://twitter.com/jmjafrx/status/909850396377706496 and https://twitter.com/jmjafrx/status/911739572685438976
This dataset contains identifiers for 8,410,431 tweets that were collected between September 19, 2017 and October 5, 2017 that mentioned #CatalanReferendum, #CatalalonianReferendum, #Catalonia, #1oct, #1o or #votarem. These hashtags were used in the lead up to the Catalan Independence Referendum on October 1, 2017. The referendum was declared illegal under Spanish law, and the Spanish police attempted to prevent it. The data collection was a collaboration with Vicenç Ruiz Gómez and Aniol Maria of the Society of Catalan Archivists working in conjunction with Ed Summers of the Maryland Institute for Technology in the Humanities. The hashtags were selected after monitoring the #CatalanReferendum hashtag for several hours on September 28 to determine which hashtags were being used most. The tweets themselves were collected from the Twitter Search API using twarc and its twarc-archive utility. twarc-archive was run every hour to collect the tweets that occurred since the last run.
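The hourly "collect what occurred since the last run" pattern can be sketched as follows. This is a hypothetical Python illustration of the general idea, not twarc-archive's actual implementation: it assumes each search result is a dict with an integer "id" field, as in Twitter's tweet JSON, and the function names are made up for the example.

```python
from typing import Iterable, List, Optional


def next_since_id(archived_ids: Iterable[int]) -> Optional[int]:
    """Largest id already archived, or None if this is the first run."""
    return max(archived_ids, default=None)


def new_tweets(results: List[dict], since_id: Optional[int]) -> List[dict]:
    """Keep only results newer than since_id, ordered oldest first."""
    fresh = [t for t in results if since_id is None or t["id"] > since_id]
    return sorted(fresh, key=lambda t: t["id"])
```

Each run would compute next_since_id from the existing archive, query the search API, and append only what new_tweets returns.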
This dataset contains 17,292,130 tweet ids for tweets collected from the Twitter filter stream API for #blm and #blacklivesmatter between 2016-01-29 and 2017-03-18. The data was collected using the twarc utility. The files are broken into segments because of network connectivity problems that were encountered during data collection, so there are varying time gaps between the files. Also, when the hashtags were trending globally, rate limits may have prevented some tweets from being streamed over the API.
This dataset includes 80,339 tweet ids collected on October 14th, 2017 that use the hashtag #WOCAffirmation. The hashtag was started by April Reign (@ReignOfApril) as a way to amplify voices of women of color and partly as a response to a Twitter boycott started in support of actress Rose McGowan, after she revealed that she was sexually assaulted by Harvey Weinstein. These tweets by April Reign show her calling for Twitter users to use the hashtag: https://twitter.com/ReignOfApril/status/918691938143834112, https://twitter.com/ReignOfApril/status/918695092587601920, https://twitter.com/ReignOfApril/status/918696352359391232.
This dataset includes 10,894 tweet ids for tweets that used the hashtag #AmplifyWomen. The tweets were collected on October 14th, 2017. The hashtag started being used in response to a Twitter boycott that started in support of actress Rose McGowan, after she revealed she was sexually assaulted by Harvey Weinstein. These two tweets by @bardgal and @Chatvert give some context for the purpose of the hashtag: https://twitter.com/bardgal/status/918729587625902080, https://twitter.com/Chatvert/status/918817455505575942.
This dataset contains 18,646 tweet ids documenting the March for Black Women which was held on September 30th, 2017 in Washington D.C. The dataset contains 2,925 tweet ids for tweets that included the hashtag #marchforblackwomen and 15,271 tweet ids for tweets that included the hashtag #M4BW. The march website is here: https://www.mamablack.org/march-for-black-women.
The 2017 Catalonia attacks were terrorist attacks against pedestrians on La Rambla (Barcelona) and the beach promenade of Cambrils on the afternoon and night of August 17-18, 2017. We selected the #NoTincPor hashtag because it was the motto of the demonstrations during that period and was the most positive message. No one knew how the situation would develop, so the best option was to collect the dataset using the search API and limit it to the period from August 17 to August 26, the date of the final demonstration in Barcelona for the victims.
This dataset contains Twitter JSON data for Tweets related to Hurricane Harvey and the subsequent flooding along the Texas gulf region. This dataset was created using the twarc (https://github.com/edsu/twarc) package that makes use of Twitter’s search API. A total of 7,041,866 Tweets make up the combined dataset. See included README.txt for additional information.
The hashtag #DrawingWhileBlack was started by the artist Annabelle on September 15th, 2017 to celebrate the work of Black artists. The dataset includes 69,236 tweet ids collected on 09/17/2017. Annabelle’s Tumblr website can be found at http://sparklyfawn.tumblr.com/ and her Twitter profile is @sparklyfawn.
This dataset contains the tweet ids of 20,040,948 tweets collected from the Twitter accounts of approximately 4,500 news outlets, i.e., accounts of media organizations intended to disseminate news. The media organizations include everything from local U.S. newspapers to foreign television stations. They were collected between August 4, 2016 and September 28, 2017 from the Twitter API using Social Feed Manager. Note that not all accounts may have been collected for the entire duration and there may be tweets from before the time period. We intend to update this dataset approximately every 3 months.
This dataset contains the tweet ids of 1,594,687 tweets from the Twitter accounts of members of the 115th U.S. Congress. They were collected between January 27, 2017 and December 19, 2017 from the Twitter API using Social Feed Manager. Some tweets may come before this time period. This collection will be updated periodically.
The 2017 solar eclipse occurred on August 21 and was total for Oregon, Idaho, Wyoming, Nebraska, Kansas, Missouri, Illinois, Kentucky, Tennessee, North Carolina, Georgia, and South Carolina. This dataset includes 13,548,321 tweet identifiers for tweets that included any of the keywords solareclipse2017, solareclipse, eclipse2017, eclipseday or eclipse for the period August 17 to August 23, 2017. The hashtags were selected after watching Twitter’s streaming API for the trending hashtag #solareclipse2017 and counting the most popular co-occurring hashtags. The search API was used instead of the filter stream API because the volume was so high that the stream was emitting notifications that many tweets were not delivered.
This dataset contains Twitter JSON data for several Twitter search queries that were collected the week following the shooting of police officers in Dallas, Texas on July 7th, 2016, using the twarc (https://github.com/edsu/twarc) package that makes use of Twitter’s search API. See included README.txt for additional information.
This dataset contains the tweet ids of 5,655,632 tweets that were collected from approximately 3000 Twitter accounts affiliated with the U.S. government. They were collected between October 21, 2016 and January 21, 2017 from the Twitter API using Social Feed Manager. This dataset was created as part of the End of Term Web Archiving initiative. The lists of accounts came from the U.S. Digital Registry and from public submissions.
The Unite the Right rally (also known as the Charlottesville rally) was a protest in Charlottesville, Virginia, United States from August 11–12, 2017, to oppose the removal of a statue of Robert E. Lee in Emancipation Park, which itself was renamed from Lee Park two months earlier. Protesters included white supremacists, white nationalists, neo-Confederates, neo-Nazis, and militias. This dataset contains 200,113 tweet ids collected with the #unitetheright hashtag. Data collection was performed twice from the search API using twarc: once at 2017-08-13 11:46:05 GMT and the other at 2017-08-15 12:03:48 GMT. The second search was run to collect only up to where the first search left off. The time ranges for the tweets are from 2017-08-04 11:44:12 to 2017-08-15 16:03:30 GMT.
The #WITBragDay hashtag was used starting August 12, 2017 by women sharing their accomplishments in technology. Tweets matching the query WITBragDay were collected using the POST statuses/filter method of the Twitter Stream API and the GET statuses/search Twitter REST API using Social Feed Manager. There are 34,266 ids for tweets retrieved from the filter stream and 47,621 ids for tweets retrieved using the search API. The dataset includes a list of 52,457 unique tweet ids from both APIs.
On Friday, August 11th, 2017 a large group of racist white nationalists carrying torches marched on the University of Virginia campus in Charlottesville, VA as an intimidation tactic against proponents of the removal of confederate statues of Robert E. Lee. The Friday evening march was held ahead of a much larger racist white nationalist rally in the center of Charlottesville planned for Saturday, August 12th, 2017. This dataset includes 100,000 tweet ids collected using the DocNow prototype http://app.docnow.io/ and includes tweets sent from 01:13:56 - 7:11:36 EDT on August 12.
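The unique list is the union of the two id lists. If neither list contains internal duplicates (an assumption, not something the description states), the counts imply 34,266 + 47,621 - 52,457 = 29,430 ids found by both APIs. A minimal Python sketch of that merge, with an illustrative function name:

```python
from typing import Iterable, List, Tuple


def merge_ids(stream_ids: Iterable[int],
              search_ids: Iterable[int]) -> Tuple[List[int], int]:
    """Return the sorted unique ids from both sources and the overlap size."""
    stream, search = set(stream_ids), set(search_ids)
    unique = sorted(stream | search)   # ids seen by either API
    overlap = len(stream & search)     # ids seen by both APIs
    return unique, overlap
```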
39,264 IDs for tweets related to the Charlottesville KKK rally on July 8, 2017. These tweet IDs matched a search for ‘Charlottesville KKK OR #charlottesvilleKKK OR #blocKKK or #blocKKKparty’. These tweet IDs were collected with the twarc command line tool from Documenting the Now. Using twarc’s hydrate command, researchers can retrieve the full content of those tweets—with additional metadata provided by Twitter’s API—provided the tweets still exist.
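Since hydration only returns tweets that still exist, comparing the id list with the hydrated output shows which tweets are no longer available. A minimal Python sketch, assuming the hydrated output is line-oriented JSON with one tweet per line and an "id_str" field as in Twitter's tweet JSON; the function name is illustrative.

```python
import json
from typing import Iterable, List


def missing_ids(requested_ids: Iterable[str],
                hydrated_lines: Iterable[str]) -> List[str]:
    """Return requested ids that do not appear in the hydrated output."""
    found = {
        json.loads(line)["id_str"]
        for line in hydrated_lines
        if line.strip()  # skip blank lines
    }
    return [tweet_id for tweet_id in requested_ids if tweet_id not in found]
```

The function accepts any iterables of strings, so open file handles for the id file and the hydrated file can be passed in once trailing newlines are stripped from the ids.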
Identifiers for 25,489 tweets about the students’ strike at the University of Puerto Rico. The tweets included the hashtag #HuelgaUPR or #Huelga2017 and are from April 11 to May 18, 2017. The tweets were collected using twarc. For a list of resources about the strike visit Puerto Rico Syllabus. Identificadores de 25,439 tuits sobre la huelga estudiantil en la Universidad de Puerto Rico. Los tuits fueron capturados utilizando twarc y cubren el periodo del 11 de abril al 18 de mayo. Para más información sobre la huelga visite Puerto Rico Syllabus.
Identifiers for 782,509 tweets that included the hashtag #macronleaks or #macrongate that were sent between 2017-05-02 07:02:05 and 2017-05-10 16:14:51 UTC. The tweets were collected from the Twitter Search API using twarc. The data does not include the first use of the #macrongate hashtag, but it does include the first use of the #macronleaks hashtag which went viral after Wikileaks retweeted it. More about the story of the #macronleaks hashtag can be found at: http://www.newyorker.com/news/news-desk/the-far-right-american-nationalist-who-tweeted-macronleaks
On 20 April 2017 the Australian Government announced that the Australian citizenship test would be made harder, with an increased focus on ‘Australian values’. Suggestions as to what ‘Australian values’ might actually be soon started to be shared on Twitter using the hashtag #australianvalues. 55,698 tweet ids for #australianvalues collected with Documenting the Now’s twarc from 20 to 27 April 2017.
681,668 tweet ids for #climatemarch collected with Documenting the Now’s twarc from January 22-26, 2017. Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.com/DocNow/twarc). twarc.py hydrate climatemarch_tweet_ids.txt > climatemarch.json.
This bag contains 10,159,892 tweets and retweets sent by or to Twitter user jk_rowling between 2015-07-08 and 2017-03-18. The tweets were collected with Social Feed Manager (m5_003).
1,276,220 tweet ids for #MarchForScience collected with Documenting the Now’s twarc from January 22-26, 2017. Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.com/DocNow/twarc). twarc.py hydrate MarchForScience_tweet-ids.txt > MarchForScience.json.
The hashtag #BlackWomenAtWork began trending following Fox News host Bill O’Reilly’s sexist and racist comments about California Congresswoman Maxine Waters’s hair on March 28th, 2017, and White House Press Secretary Sean Spicer’s remarks to journalist April Ryan during a press briefing on the same day. The hashtag began trending after Brittany Packnett used it in a set of tweets where she asked black women to share their experiences about being black women at work. These tweet ids were collected on four separate occasions using the DocNow prototype twitter collection tool. bwaw1 (10,000 tweets), bwaw2 (41,256 tweets), and bwaw3 (92,756 tweets) were collected on March 28th, the day the hashtag began trending. bwaw4 (140,000 tweets) was collected on March 29th.
This bag contains 2,711,011 tweet identifiers collected from the Twitter filter stream between 2017-02-09 and 2017-03-18 that used one or more of the following hashtags: alternativefacts, fakenews, truthiness, postfact, posttruth, factcheck. The original tweets were collected using twarc.
This dataset contains the tweet ids of 7,275,228 tweets related to the Women’s March on January 21, 2017. They were collected between December 19, 2016 and January 23, 2017 from the Twitter API using Social Feed Manager. See included README.txt for additional information.
#brexit tweets collected from the 5th of May to the 24th August 2016.
14,478,518 tweet ids for #WomensMarch collected with Documenting the Now’s twarc from January 21-28, 2017. Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.com/DocNow/twarc). twarc.py --hydrate WomensMarch_tweet_ids.txt > WomensMarch.json. Also included are the log files for the Filter API and Search API queries. The Filter API query captures the cumulative number of dropped tweets.
On January 12th, 2017 the Senate voted 51-48 to approve a budget resolution as the first step in repealing the Affordable Care Act. The hashtag #SaveACA began being used heavily on Twitter the same day as a response. This dataset includes tweet ids collected on four separate occasions on January 12th and 13th, 2017 for the hashtag #SaveACA.
An ongoing collection of Tweets collected by NCSU Libraries using twarc for the key terms “HB2”, “WeAreNotThis”, “BoycottNC”, “KeepNCFair”, and “ThisIsNotUs”. Collection of “WeAreNotThis”, “BoycottNC”, “ThisIsNotUs”, and “North Carolina” began on 2016-03-24, and collection of “HB2” began on 2016-12-25. Only Tweets including “HB2”, “bathroom”, “bill”, or “KeepNCFair” are included from the “North Carolina” set. These tags were used to discuss North Carolina House Bill 2 (The Public Facilities Privacy & Security Act), passed in March 2016, which includes provisions (among others) that disallow local municipalities from passing their own anti-discrimination ordinances and also require individuals, when using public bathrooms, to use those that align with their sex as stated on their birth certificates rather than the restroom that is consistent with their gender identity (see: https://en.wikipedia.org/wiki/Public_Facilities_Privacy_%26_Security_Act). This dataset is broken into files of no more than 50,000 Tweet IDs each.
A list of 10,538 Twitter IDs for tweets harvested between 4 January at 11am and 9 January at 11am using Social Feed Manager. As this used the search API, the 4 January at 11am crawl went back about 5-9 days. The Tweet IDs are included, as is a log of the decisions made to curate this dataset.
A list of 24,876 Twitter IDs for tweets harvested between Nov. 28 and Dec. 6, 2014 containing the hashtag #bill10. Bill 10 in the Alberta legislature would have given public and Catholic school boards the right to refuse student requests to form gay-straight alliances in schools. Under intense public interest it was withdrawn by the Conservative government.
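Splitting an id list into files of at most 50,000 ids can be sketched as below. This is a generic Python illustration and not the script NCSU Libraries used; the function name and the default chunk size parameter are assumptions made for the example.

```python
from typing import Iterator, List, Sequence


def chunk_ids(ids: Sequence[str], size: int = 50_000) -> Iterator[List[str]]:
    """Yield consecutive slices of ids, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])
```

Each yielded chunk could then be written to its own text file, one id per line.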
These 136,990 tweet ids represent reaction to a Facebook Live video that was posted on January 3rd, 2017, showing four African American men violently attacking a white, mentally disabled man. The tweets were collected on 01/05/2017. After the video surfaced, the Twitter hashtag, #BLMkidnapping, was created and used to incorrectly attribute the violent attack to members of the Black Lives Matter movement. Police in Chicago, where the attack took place, have found no evidence the attack has any connection to the Black Lives Matter movement. This link is to a CNN story documenting the police denial of Black Lives Matter connection: http://www.cnn.com/2017/01/05/us/black-lives-matter-chicago-facebook-live-beating/index.html
This is a dataset of ids for tweets purchased from Twitter as part of the Beyond the Hashtags study http://cmsimpact.org/resource/beyond-hashtags-ferguson-blacklivesmatter-online-struggle-offline-justice/ The dataset includes a year of tweets that mention one or more of 45 keywords associated with the BlackLivesMatter movement. This period covers a critical time in which social media was used to raise awareness about police killings of unarmed Black citizens in the United States.
228,086 tweet ids for “TheHip, hipinkingston” captured during the Tragically Hip’s final concert in Kingston, Ontario in August 2016. Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.com/DocNow/twarc). twarc.py --hydrate th_final_concert_kingston_tweet_ids.txt > th_final_concert_kingston.json
These are tweets that were collected between August 27, 2015 and January 4, 2016 that mention the word “trump”. This period marked important early months in the Republican primaries. They were collected from Twitter’s streaming API using twarc.
There are 40,202,199 tweet identifiers in all. Due to network outages there are gaps at the following points: 2015-08-27 19:12:37 - 2015-08-27 20:13:44 ; 2015-11-02 02:02:13 - 2015-11-05 16:20:35 ; 2015-12-28 02:02:42 - 2015-12-28 02:04:00
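To see how much collection time the listed outages cover, the gap boundaries can be turned into durations. A minimal Python sketch using only the timestamps given above; the variable and function names are illustrative.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Gap boundaries copied from the dataset description.
GAPS: List[Tuple[str, str]] = [
    ("2015-08-27 19:12:37", "2015-08-27 20:13:44"),
    ("2015-11-02 02:02:13", "2015-11-05 16:20:35"),
    ("2015-12-28 02:02:42", "2015-12-28 02:04:00"),
]
FMT = "%Y-%m-%d %H:%M:%S"


def gap_durations(gaps: List[Tuple[str, str]] = GAPS) -> List[timedelta]:
    """Return the length of each outage as a timedelta."""
    return [
        datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
        for start, end in gaps
    ]
```

The middle gap, at more than three and a half days, accounts for nearly all of the missing time.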
8,595,589 tweet ids for aleppo tweets captured during the fall of Aleppo in December 2016. Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.com/DocNow/twarc). twarc.py --hydrate aleppo_tweet_ids.txt > aleppo.json
This dataset contains the tweet ids of approximately 280 million tweets related to the 2016 United States presidential election. They were collected between July 13, 2016 and November 10, 2016 from the Twitter API using Social Feed Manager. These tweet ids are broken up into 12 collections. Each collection was collected either from the GET statuses/user_timeline method of the Twitter REST API or the POST statuses/filter method of the Twitter Stream API.
Tweet ids for #YMMfire tweets captured during the 2016 Fort McMurray Wildfire from 2016-05-01 to 2016-06-25.
This data set identifies 38M tweets collected for the analysis of social media messages related to the 2012 U.S. Presidential election. The data set provides tweet IDs for tweets containing the words “obama”, “romney”, or both (case-insensitive matching) during the period from July 1, 2012 through November 7, 2012. The paper, “Online and Social Media Data As an Imperfect Continuous Panel Survey.” PLoS ONE 11(1): e0145406 by Diaz et al. provides further description of the dataset.
Tweet ids for #NDP2016 tweets during the 2016 NDP Convention.
Tweet ids for #panamapapers tweets.
Tweet ids for #thechalkening tweets.
Tweet ids for #MakeDonaldDrumpfAgain tweets.
Tweet IDs for tweets carrying the #cdnpoli hashtag, applied to Canadian politics, collected as part of a larger project centered on Canada’s 42nd federal election.
Tweet ids for #paris #Bataclan #parisattacks #porteouverte tweets.
Tweet ids for #elxn42 tweets.
This item represents a collection of 13,480,000 tweet IDs that mentioned “ferguson” from 2014-08-10 to 2014-08-27 and 15,080,078 tweet IDs that mentioned “ferguson” between 2014-11-11 and 2014-12-08. The first set includes tweets for the two week period after the shooting of Michael Brown, and the second range includes tweets around the grand jury’s decision not to indict police officer Darren Wilson, which was announced on 2014-11-24. The first set of tweets was collected by Ed Summers at the University of Maryland and the second was a collaboration between Molly Loyd, Gregory Coleman, Kimberly Lamke, Benjamin Sugar and Ed Summers.
This dataset contains 32,056 tweets that mention “ferguson” between August 8 and August 10, 2014. They were collected on May 7th, 2015 from the search form on Twitter’s website. An important limitation to be aware of is that the dataset does not include retweets or tweets that were deleted before May 7th, 2015.
Tweet ids for #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, #CharlieHebdo tweets.
This dataset contains Twitter JSON data for several Twitter search queries that were collected around the #YesAllWomen Twitter “conversation” between May 25, 2014 and June 8, 2014 using the twarc (https://github.com/edsu/twarc) package that makes use of Twitter’s search API. A total of 2,805,763 Tweets and 34,532 images make up the combined dataset.