Twitter's terms of service don't allow tweet datasets to be published on the web, but they do allow tweet identifier datasets to be shared. This speaks to users' rights as content creators, while also allowing researchers to share their data with others.
This site is a catalog of datasets that are publicly available on the web. If you would like to turn these tweet identifier datasets back into the original JSON, first download the dataset and then use the Hydrator desktop application, or Twarc if you are comfortable working at the command line.
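As a rough illustration of the first step of hydration, the sketch below reads tweet identifiers (one per line) and groups them into batches that a hydration tool could then send to the Twitter API. The function names and the batch size of 100 are assumptions made for this example, and the API call itself is deliberately left out; Hydrator and Twarc handle all of this for you.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield tweet IDs, skipping blank lines and anything that is not all digits."""
    for line in lines:
        tweet_id = line.strip()
        if tweet_id.isdigit():
            yield tweet_id


def batches(ids: Iterable[str], size: int = 100) -> Iterator[List[str]]:
    """Group IDs into lists of at most `size` items (100 is an assumed batch size)."""
    iterator = iter(ids)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Hypothetical sample standing in for a downloaded ids.txt file.
    sample = ["1234567890123456789\n", "\n", "not-an-id\n", "1234567890123456790\n"]
    for chunk in batches(read_ids(sample)):
        # A real hydrator would look up `chunk` through the Twitter API here.
        print(len(chunk), "ids ready to hydrate")
```

Working with generators keeps memory use flat even for identifier files with tens of millions of lines.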
You can add your own datasets to the catalog by following these instructions. If you'd like updates when datasets are added please subscribe to the RSS feed. All metadata listed here is licensed CC0. You may want to refer to our code of conduct if you have questions or concerns about the datasets we list here.
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts because we filtered them from other data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 30th and yielded over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to February 27th, to provide extra longitudinal coverage. The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We include two different files. Full-dataset.tsv contains all of the 100 million tweet ids, whereas full_dataset-clean.tsv contains only original tweets (no retweets). For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
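For readers unfamiliar with frequent-term, bigram, and trigram lists, the following is a minimal sketch of how such counts can be computed from hydrated tweet text. It is not the dataset authors' pipeline: the tokenizer (lowercasing and splitting on non-word characters) and the function names are assumptions chosen for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-word characters (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())


def top_ngrams(texts: Iterable[str], n: int, k: int = 1000) -> List[Tuple[Tuple[str, ...], int]]:
    """Return the k most common n-grams (as tuples of tokens) with their counts."""
    counts: Counter = Counter()
    for text in texts:
        tokens = tokenize(text)
        # zip over n shifted copies of the token list to form consecutive n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


if __name__ == "__main__":
    # Hypothetical tweet texts, used only to exercise the functions.
    tweets = ["Stay home stay safe", "stay home"]
    print(top_ngrams(tweets, n=1, k=2))
    print(top_ngrams(tweets, n=2, k=1))
```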
This dataset includes more than 20 million tweet IDs related to the Coronavirus outbreak of 2019-2020. It is being added to over time, so the numbers and dates may change. The tweets have been collected by the LSTM model deployed at sentiment.live. The model monitors the real-time Twitter feed for coronavirus-related tweets, using the filters language “en” and keyword “corona”. The CSV files include the columns tweet ID and sentiment score. The dataset should be used solely for non-commercial research purposes.
This dataset contains the tweet ids of 100,063,982 tweets related to Coronavirus or COVID-19. They were collected between March 3, 2020 and March 31, 2020 (midnight UTC-0) from the Twitter API using Social Feed Manager. These tweets were collected using the POST statuses/filter method of the Twitter Stream API, using the track parameter with the following keywords: #Coronavirus, #Coronaoutbreak, #COVID19.
The dataset contains an ongoing collection of tweet IDs associated with the novel coronavirus COVID-19 (SARS-CoV-2), which commenced on January 28, 2020. Twitter’s search API was used to gather historical Tweets from the preceding 7 days, leading to the first Tweets in our dataset dating back to January 22, 2020. Twitter’s streaming API was used to follow specified accounts and also to collect, in real time, tweets that mention specific keywords, which include: Coronavirus, Koronavirus, Corona, CDC, Wuhancoronavirus, Wuhanlockdown, Ncov, Wuhan, N95, Kungflu, Epidemic, outbreak, Sinophobia, China, covid-19, corona virus, covid, covid19, sars-cov-2, COVIDー19, COVD, pandemic.
EveTAR is the first freely available Arabic test collection for multiple information retrieval tasks in Twitter. It supports Event Detection (ED), Ad-hoc Search (AS), Timeline Generation (TTG), and Real-time Summarization (RTS). EveTAR includes a crawl of 355M Arabic tweets and covers 50 significant events for which about 62K tweets were judged with substantial average inter-annotator agreement (Kappa value of 0.71). Besides the full collection (EveTAR-F), we provide four different subsets of EveTAR: (1) EveTAR-S: random sample of 15M tweets; (2) EveTAR-S.m: MSA tweets of the sample; (3) EveTAR-S.d: dialectal tweets of the sample; (4) EveTAR-Q: judged tweets only.
The 1619 Project was developed by The New York Times Magazine in 2019 with the goal of re-examining the legacy of slavery in the United States and timed for the 400th anniversary of the arrival in America of the first enslaved people from West Africa. It is an interactive project by Nikole Hannah-Jones, a reporter for The New York Times, with contributions by the paper’s writers, including essays, poems, short fiction, and a photo essay. Originally conceived of as a special issue for August 20, 2019, it was soon turned into a full-fledged project, including a special broadsheet section in the newspaper, live events, and a multi-episode podcast series. (from Wikipedia) This bag contains metadata for 58,506 tweets related to the hashtag #1619project. They were collected on January 8, 2020 using twint and the keyword 1619project: twint -s '1619project' --csv --output twint.csv. twint scrapes Twitter’s search results and writes the results as CSV. The tweet identifiers were extracted from this CSV and included as the ids.txt file.
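The extraction of tweet identifiers from the twint CSV could look roughly like the sketch below. The column name "id" is an assumption about the CSV header and should be checked against the actual file; the function name is likewise made up for this example.

```python
import csv
import io
from typing import List, TextIO


def extract_ids(csv_file: TextIO, id_column: str = "id") -> List[str]:
    """Return unique tweet IDs from a CSV file, preserving first-seen order.

    `id_column` is assumed to be the header of the tweet ID column.
    """
    seen = set()
    ids: List[str] = []
    for row in csv.DictReader(csv_file):
        tweet_id = (row.get(id_column) or "").strip()
        if tweet_id and tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


if __name__ == "__main__":
    # Hypothetical two-column sample standing in for twint.csv.
    sample = io.StringIO("id,tweet\n111,first\n222,second\n111,first\n")
    for tweet_id in extract_ids(sample):
        print(tweet_id)  # one identifier per line, as in an ids.txt file
```

Identifiers are kept as strings so that large values are never rounded by tools that treat them as floating-point numbers.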
2,944,525 tweet ids for #elxn43 tweets. Tweets were collected via the Standard Search API on a cron job every five days from September 9, 2019 - November 23, 2019.
These tweets were collected during Hurricane Dorian as it made landfall on the east coast of the US in late August to early September 2019. Tweets were collected using Social Feed Manager based on the keywords #dorian, #hurricane, #hurricanedorian, and the search term ‘hurricane dorian’. This dataset is stored in the Purdue University Research Repository via the DOI listed. Summaries of the data are available at the associated GitHub repo: https://github.com/brachunok/dorian-tweets
On July 24, 2019, after two weeks of mass protests in Puerto Rico, Governor Ricardo Rosselló announced he was resigning from office effective August 2nd. This collection contains 1,113,759 IDs for tweets related to the protests, with the hashtags ‘RickyRenuncia’ and ‘RickyVeteYa’. Along with this collection, there’s the “#RickyRenuncia web collection”, part of the Internet Archive Global Events (https://archive-it.org/collections/12491). / El 24 de julio de 2019, luego de dos semanas de protestas masivas en Puerto Rico, el gobernador Ricardo Rosselló anunció que renunciaba a su puesto efectivo el 2 de agosto. Esta colección contiene 1,113,759 identificadores de tuits relacionados a las protestas, con las etiquetas ‘RickyRenuncia’ y ‘RickyVeteYa’. Además, puede acceder a la “Colección Web #RickyRenuncia”, la cual es parte del Internet Archive Global Events (https://archive-it.org/collections/12491).
These tweets were collected for the third Democratic Party Debate held on September 12, 2019 at Texas Southern University in Houston, Texas. The official hashtag #demdebate was collected from the Twitter filter stream API using twarc from 2019-09-12 15:44 to 2019-09-14 10:09 UTC. The API reported that 195,725 tweets were not delivered due to rate limiting during high volume periods. The total number of tweets collected was 1,280,956. This dataset was created as part of a collaboration between the Maryland Institute for Technology in the Humanities and the Department of Communication at the University of Maryland.
This file provides the tweet identifiers of a dataset created after the July 5, 2015 Greek referendum. A sample of 204,713 tweets with the hashtag #greferendum was collected with NodeXL software from July 6 to July 16, 2015. Thus, the dataset provides tweets produced in the aftermath of the Greek vote during the short period that led to a third financial assistance programme.
This corpus contains a snapshot of German-language Twitter posts from the month of April, 2013. The posts have been automatically collected and filtered by language (see paper). The collection method guarantees a near-complete (>90%) sample of all tweets in German sent during that time period. For more information see the paper that used this dataset to study http://www.lrec-conf.org/proceedings/lrec2014/pdf/1146_Paper.pdf
On August 5, 2019, Toni Morrison, one of the greatest literary figures of the 20th and 21st centuries, died at the age of 88. She was the first Black woman to win the Nobel Prize in literature and the author of eleven novels, dozens of essays, as well as literary criticism. This dataset was collected using the streaming API with the keywords “toni morrison” and “tonimorrison” beginning on August 6 (when the news of Morrison’s death was announced) and concluding on August 9. The dataset contains 1,170,032 tweets.
This dataset contains tweets using the #ach2019 hashtag that were sent as part of the Association for Computers and the Humanities conference held between July 23 and July 26, 2019 in Pittsburgh, Pennsylvania. The dataset includes ids for 10,340 tweets, sent between July 15 and July 29, that were collected with the DocNow application. The zip file contains a minimal viewer that you can use by opening the index.html in your browser. To get the original tweet data you will want to rehydrate the tweet ids with a tool like the Hydrator https://github.com/docnow/hydrator
This dataset consists of 402,650 tweet ids depicting commentary and memes from Black Twitter. Tweets containing the hashtag #BestofThrowbackBlackTwitter were collected from July 12th, 2019 to July 14th, 2019 using the DocNow demo Twitter data collection tool.
This dataset consists of 14,588 tweet IDs related to the “Obama Hope” movement. Tweets matching the search phrase “fairey AND hope” were collected to restrict the dataset to records referencing Shepard Fairey’s “Obama Hope” image. This archive spans the date range 2008-2016 (inclusive).
Nipsey Hussle was an American rapper, entrepreneur, and community activist from Los Angeles. On March 31, 2019, Hussle was fatally shot outside his store, Marathon Clothing, in South Los Angeles. Hussle’s memorial service was held on April 11 at the Staples Center in Los Angeles. The 25.5-mile (41.0 km) funeral procession wound through the streets of South L.A. including Watts where he spent some of his formative years. On Wed Apr 03, 2019 tweets with the following hashtags were collected from the Twitter streaming and search APIs: NipseyHussle, Nipsey, Nipsey Hussle, RIPNipseyHussle, RIPNipsey. The collection includes 11,642,103 tweet identifiers from March 28 until April 15.
This dataset contains the tweet ids of 39,622,026 tweets related to climate change. They were collected between September 21, 2017 and May 17, 2019 using Social Feed Manager. There is a gap in data collection between January 7, 2019 and April 17, 2019. Tweets were collected using the POST statuses/filter method of the Twitter Stream API, using the track parameter with the following keywords: #climatechange, #climatechangeisreal, #actonclimate, #globalwarming, #climatechangehoax, #climatedeniers, #climatechangeisfalse, #globalwarminghoax, #climatechangenotreal, climate change, global warming, climate hoax
Replies to Tweets and Retweets posted by Rep. Alexandria Ocasio-Cortez in March 2019. Topics include Green New Deal developments, the MSNBC town hall with Chris Hayes, and links to articles. Some retweets surfaced older Tweets and their replies from before March 2019. Replies to a Tweet were captured manually with a TamperMonkey userscript (https://github.com/mapmeld/aoc_reply_dataset/blob/master/scan.js) because there is no equivalent API endpoint.
This dataset contains Twitter JSON data for Tweets related to the fire at Notre Dame Cathedral in Paris, France. This dataset was created using the twarc (https://github.com/edsu/twarc) package that makes use of Twitter’s search API. A total of 8,046,185 Tweets and 163,055 media files make up the combined dataset.
These tweets have been collected with the Hashtagger platform, which considered these tweets relevant to the monitored stream of news from Irish sources (The Irish Times, Irish Examiner, etc.). All 198,725,860 tweets contain at least one hashtag.
This dataset contains the tweet ids of 171,248,476 tweets related to the 2018 U.S. Congressional Election, collected between January 22, 2018 and January 3, 2019. The collection includes tweets by Senate candidates, tweets by House candidates, election-related hashtags, partisan Republican hashtags, and partisan Democratic hashtags.
This dataset contains the tweet ids of 2,041,399 tweets from the Twitter accounts of members of the 115th U.S. Congress. They were collected between January 27, 2017 and January 2, 2019 from the Twitter API using Social Feed Manager. Some tweets may come before this time period.
This tweet identifier dataset was collected from the Twitter streaming and search APIs and covers tweets containing the phrase “R Kelly” or the hashtag “#SurvivingRKelly” between December 25, 2018 and January 4, 2019. This partially covers the time period in which the six-part Lifetime documentary Surviving R Kelly was released (January 3 to January 5). It includes 1,431,655 tweet ids. The documentary had an estimated 1.9 million viewers.
This dataset was released by Twitter on October 17, 2018 in order to provide transparency for state sponsored propaganda that was alleged to have occurred on their platform in the lead up to and directly following the 2016 Presidential Election. The original dataset of deleted content was downloaded from Twitter and deposited at the Internet Archive by Ed Summers.
This dataset contains 14,108,104 tweets that document the mass shooting that occurred in Las Vegas, Nevada on October 1, 2017. Data were gathered via concurrent queries to the Twitter search API for all tweets containing the term “vegas”, occurring September 29 - October 7 before 5:00 PM PT. Because of the volume, and the fact that the tweets were being recorded after the event, but within the 7 day time window of the search API, the data was collected by limiting the search to a particular day, for example, “vegas until:2017-10-03”, “vegas until:2017-10-04”, “vegas until:2017-10-05”, etc. The resulting JSON data was then deduplicated using twarc’s deduplicate.py utility, and the identifiers were then sorted chronologically.
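The day-by-day search strategy and the later clean-up can be sketched as follows. This is an illustrative stand-in, not twarc's deduplicate.py: the function names are invented for the example, and sorting by numeric identifier is used as a proxy for chronological order on the assumption that tweet IDs from this period increase with time.

```python
from datetime import date, timedelta
from typing import Iterable, List


def daily_queries(term: str, first_until: date, last_until: date) -> List[str]:
    """Build one search query per day using the `until:` search operator."""
    days = (last_until - first_until).days
    return [
        f"{term} until:{(first_until + timedelta(days=i)).isoformat()}"
        for i in range(days + 1)
    ]


def dedupe_and_sort(ids: Iterable[str]) -> List[str]:
    """Drop duplicate IDs and sort numerically (assumed to track time order)."""
    return [str(i) for i in sorted({int(x) for x in ids})]


if __name__ == "__main__":
    for query in daily_queries("vegas", date(2017, 10, 3), date(2017, 10, 5)):
        print(query)
    # Overlapping daily searches return the same tweet more than once.
    print(dedupe_and_sort(["30", "4", "30", "12"]))
```

Because each `until:` query overlaps with the previous day's results, deduplication is a required step rather than an optional tidy-up.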
University of Glasgow’s unique Integrated Multimedia City Data (iMCD) Survey is a cross-sectional survey based on a sample of the general population in private residences across eight local authority areas of Glasgow and Clyde Valley. The purpose of the iMCD dataset was to provide a 360° overview of life in the city, combining various datasets and methods of collection. The survey fieldwork was run by Ipsos MORI and took place between 15th April 2015 and 21st November 2015. This project was funded by ESRC and was the result of a collaboration between the Universities of Glasgow, Newcastle and Sheffield. The intention was to provide innovative datasets, with new methods and methodologies, that could be used by policy makers. iMCD consists of 5 main strands (Survey, GPS, life-logging devices, image analysis, textual media and multimedia data). The core of the iMCD is the Household Survey. Each of the five data strands is built on unique models of data sampling. As part of the project, we also collected, for example, a sample of GPS and lifelogging sensor data. Lifelogging sensors collected data through GPS devices and wearable cameras. Concurrently with the collection of the iMCD Household Survey data, UBDC researchers undertook a significant information extraction exercise to capture data streams related to Glasgow from a variety of online sources. Twitter data comprises a large part of this collection, and this dataset comprises a selection of tweets during the period 1/12/14 - 30/11/2015 that arose from the greater Glasgow area. This may, for example, give insights into citizens’ behaviour, reactions and moods in certain contexts or at particular times. The dataset can be queried through a bespoke online tool by specific hashtag or tweet term, in order to return statistical information or specific Tweet IDs. Captured tweets were those geolocated in Glasgow (based on a polygon around the geography), those from certain known Glasgow accounts (e.g. @BBCWestScot; @policescotland) and those containing certain terms or hashtags (e.g. glasgow, or #glasgow2014). Tweets from Glasgow users: BBCWestScot, DailyMailUK, BBCScotWeather, GlasgowCC, Daily_Record, GLA_Airport, WeatherCast_UK, STVGlasgow, glabreakingnews, GreaterGlasgPol, OpenGlasgow, TheEveningTimes, trafficscotland, policescotland, Herald_editor, TheScotsman, scotairquality, GdnScotland, PeopleMakeGLA, EverttHerald, GlasgowSubway, newsundayherlad, EducationScot, TheSunNewspaper, TravelineScot, BBCTravelScot, CBItweets, MetroUK, FristinGlasgow, PeopleMakeGLA, FoEScot, gtcs, WhatsOnGlasgow, scdiglobal, edhubSctoland, Heart_Glasgow. Tweets with certain terms or hashtags: glasgow, BetterTogether, GlasgowCC, Yougov, glasgow2014, CommonWealthGame, GlasgowRoads, scotnight, indyref, CWG2014, goforitscotland, BBCScotWeather, scotdecides, VoteYes, NoThanks, the45, Scotland, yesScotland, AlexSalmond, CommonWealthGames, NoBecause, VoteNo, Darling, train, undecided, HopeOverFear, PatronisingBTlady
This dataset contains Twitter JSON data for Tweets related to Hurricane Florence and the subsequent flooding along the Carolina coastal region. This dataset was created using the twarc (https://github.com/edsu/twarc) package that makes use of Twitter’s search API. A total of 4,971,575 Tweets and 347,205 media files make up the combined dataset. See included README.txt for additional information.
On August 16, 2018 Aretha Franklin died in Detroit, Michigan at the age of 76. Franklin, also known as the Queen of Soul, had an award winning career as a singer, songwriter, actress and pianist while also being described as the voice of the civil rights movement. This dataset contains two tweet id files. The first was collected from the search API during the response to the announcement of her death, which includes tweets from August 8 - August 19 using the query “Aretha Franklin” OR “Queen of Soul”. The second dataset was collected over August 24 to September 3, which includes the date of her funeral on August 31. This second dataset was collected using the query “Aretha Franklin” OR “Queen of Soul” OR ArethaHomegoing OR ArethaFranklinFuneral OR ArethaFranklin which includes hashtags that were trending at the time. The datasets contain 2,832,128 and 1,332,442 tweets respectively.
This dataset contains the tweet ids of 9,673,959 tweets for approximately 3400 U.S. government accounts. These are accounts that are associated with federal government agencies, not individuals. They were collected between January 20, 2017 and July 20, 2018 from the GET statuses/user_timeline method of the Twitter API using Social Feed Manager.
A collection of 184,759 tweet ids for tweets collected that included the hashtags #J20, #DisruptJ20, and #DefendJ20. On January 20th, 2017, during President Donald Trump’s inauguration, 234 protestors were arrested in Washington D.C. and subsequently prosecuted by the U.S. federal government. 21 of the defendants pleaded guilty to various levels of charges over the course of the trials while all the other charges were dismissed. The last of the charges were dismissed on July 6th, 2018. Buzzfeed News: https://www.buzzfeed.com/zoetillman/the-government-is-dropping-all-charges-in-the-remaining?utm_term=.kvwGjgW4dN#.ii6kWOrp8D 7-6-2018 US Motion to Dismiss J20 Remaining: https://www.scribd.com/document/383351930/7-6-18-US-Motion-to-Dismiss-J20-Remaining
This dataset contains 2,279,396 tweets related to the referendum to repeal the 8th amendment to the Irish constitution on May 25, 2018. They were collected between April 13, 2018 and June 4, 2018 from the Twitter filter stream API. These hashtags were provided by Helena Byrne, British Library.
4,058,754 tweets collected from the streaming and search APIs using the keyword “gaza”, covering the period 2018-05-08 to 2018-05-19. The search and stream data collection was started on 2018-05-16. This time period included the opening of the US Embassy in Jerusalem on May 14th. On the same day Israeli forces killed over 60 Palestinians and injured 2,700 who were part of a non-violent protest in the Gaza Strip. https://www.democracynow.org/2018/5/24/after_latest_gaza_slaughter_open_an
This collection includes data for 30 different Twitter datasets associated with real world events. The datasets were collected between 2012 and 2016, always using the streaming API with a set of keywords. More details are in this paper: https://doi.org/10.1002/asi.24026
The #blackwomanhood Twitter chat took place on May 10th, 2018. It was a discussion around the syllabus for the Black Womanhood course taught by Martha Jones and Jessica Marie Johnson at Johns Hopkins University. The discussion coincided with the end of the course. http://dh.jmjafrx.com/2018/01/27/black-womanhood-the-syllabus/
This dataset contains tweets that were sent during the Ethics and Archiving the Web conference that was held at the New Museum in New York City, March 22 - 24, 2018. The total of 3,627 tweets included 3,155 tweets that used the hashtag #eaw18 and 472 replies to those tweets. More about the conference itself can be found at https://eaw.rhizome.org
During the political campaign for the 2018 Italian election, populism, racism and far-right acts rose, gaining the attention of international media. On February 3, a drive-by shooting targeted sub-Saharan Africans in Macerata, in central Italy, wounding at least 6 immigrants. Police charged the shooter with aggravated assault motivated by racism. The data set gives a glimpse of debates, hate speech, protests, and unofficial and official marches, including the national march held on Feb 10 in Macerata. The attached data set contains all conversations related to “Macerata” and the shooter’s name, collected from January 24, 2018 to February 19, 2018 (574,925 tweets by 116,586 users). More info at Oohmm (src: @remagio @Oohmminfo https://oohmm.info/about).
#Expo2015 was a Universal Exposition hosted by Milan, Italy (https://goo.gl/iRi7Yf). It opened on May 1 at 10:00 and closed on October 31. Milan hosted an exposition for the second time; the first was the 1906 Milan International. The data set contains all conversations about #Expo2015 collected between April 23, 2015 and November 13, 2015 (1,437,301 Tweets by 218,641 users). As this was the first social media Expo, the attached archive aims to show the debate through 7 months of conversations about events, protests, finances, media, marketing and politics related to #Expo2015. Bot-repeaters and minor comprop are present. The Expo was previously held in Shanghai, China in 2010; the next one is in Dubai, UAE in 2020. More info at Oohmm (src: @remagio @Oohmminfo https://oohmm.info/about).
#EstamosporTI: A state-sponsored hashtag. Spain’s Ministry of the Interior and security forces promoted a hashtag during the referendum shutdown and police repression in Catalunya. More details and visualizations are available in Erin Gallagher’s Medium post (https://goo.gl/ADWJ1Q) about the #EstamosporTI state-sponsored hashtag. The attached data set is a collection of 18,831 Tweet IDs using the #EstamosporTI hashtag covering some more days. The first tweet is from Oct 1 12:07:02 UTC 2017 and the last from Oct 24 05:37:24 UTC 2017. The attached Twitter IDs are part of a collection covering Catalan events and elections since July 2017 (src: @remagio @Oohmminfo).
Tweet ids and referenced URLs collected from the #twitterandnews Twitter chat hosted by the Knight Foundation on Thursday, March 8th, 2018. The chat was about a newly released report on media coverage of Twitter subcultures. https://knightfoundation.org/features/twittermedia
This dataset contains the tweet ids of 13,816,206 tweets related to the 2018 Winter Olympics held in Pyeongchang, South Korea. They were collected between January 31, 2018 and February 27, 2018 from the Twitter filter stream API (POST statuses/filter) using Social Feed Manager. The filter tracked: #olympics, #pyeongchang2018, #winterolympics, #평창동계올림픽
This dataset contains the tweet ids of 7,665,497 tweets related to events in Charlottesville, Virginia in August, 2017. They were collected from the Twitter search and filter stream APIs using Social Feed Manager.
On February 27, 2018 the National Museum of African American History and Culture hosted a Twitter chat with the Documenting the Now project. Bergis Jules from the Documenting the Now team coordinated the project responses and delivered them as the @documentnow user on Twitter. Other people from the project and elsewhere responded. The event started at 9:30 AM EST and finished at 10:30 AM EST. Tweets with the designated hashtag #ArchivesBlackHistory were collected using twarc at 11:30 AM on February 27, 2018. It collected 1402 tweets, some of which were created prior to the Twitter chat, since the hashtag has been used in other promotional outreach by the NMAAHC.
This dataset contains the tweet ids of 16,875,766 tweets related to the immigration and travel ban executive order announced by the Trump Administration in January 2017. They were collected between January 30, 2017 and April 20, 2017 from the Twitter filter stream API using Social Feed Manager. The terms used for the filter were: #MuslimBan, #NoBanNoWall, #NoMuslimBan, #JFKTerminal4, #RefugeesWelcome, muslim ban, immigrant ban, immigration ban, travel ban, immigration order, #ImmigrationBan, #TravelBan.
This hashtag was used by Twitter user @MatthewACherry on February 9th, 2018 to start a conversation about the history of popular gifs on Twitter. Here is a link to the original tweet using the hashtag https://twitter.com/MatthewACherry/status/962011241815277568
On 7 April 2017, in central Stockholm, the capital of Sweden, a hijacked lorry was deliberately driven into crowds along Drottninggatan (Queen Street) before being crashed through a corner of an Åhléns department store. Five people were killed and 14 others were seriously injured. Police considered the attack an act of terrorism. The Stockholm-attack-2017-ids.txt contains tweet identifiers for 1067963 tweets that mentioned or tagged one or more of the following: #PrayForStockholm, #Stockholm, #Tukholma, #kärleksmanifestation, #lastnightinsweden, #openstockholm, #prayforsweden, #stockholmimitthjärta, #stockholmterror, #swedenattack, #swedenincident, Drottninggatan, Fridhemsplan, Hötorget, Märsta, Rakhmat Akilov, Sergels, Suecia, Swedish Police, Åhlens. This dataset was collected with the twarc utility.
Ursula K. Le Guin died on January 22, 2018 and there was an outpouring of tweets that commemorated her life as a writer. This dataset contains tweet identifiers for 251,287 tweets that mentioned the phrase “Le Guin” between January 14 and January 24, 2018. They were collected with the twarc utility.
The first #BLKTwitterstorians chat of 2018.
This dataset includes 2,995 tweets collected using the keyword “BlackDigArchive” and 1,888 tweets collected using the hashtag “#BlackDigArchive”. The second Documenting the Now symposium, “Digital Blackness in the Archive”, was held on December 11th and 12th, 2017 and addressed issues at the intersection of archival practice and the existence of Black people on the web and social media. Invited speakers discussed their work on the Black experience in online spaces including research on joy and creativity expressed by Black people on the web, cultural and social expression, activism and other acts of resistance, the Black experience with state sponsored online surveillance, and racism and bias in algorithm and social media platform design. The program was an opportunity for the general public, activists, archivists, library and museum professionals, and the academic community, to learn and share together in conversations about digital culture and digital archives that center blackness.
2,278,757 tweet ids for #JeffSessions collected with Documenting the Now’s twarc. Tweets can be “rehydrated” with Documenting the Now’s twarc, or Hydrator.
238,991,027 tweet ids for tweets directed at Donald Trump (@realDonaldTrump), collected with Documenting the Now’s twarc. Tweets can be “rehydrated” with Documenting the Now’s twarc, or Hydrator, for example: twarc hydrate to_realdonaldtrump_ids.txt > to_donaltrump.jsonl. Tweets from May 7, 2017 to June 21, 2017 were collected using a combination of the Filter (Streaming) API and Search API. The Filter API failed on June 21, 2017. From June 23, 2017 forward only the Search API was used, run every 5 days by a cron job. Collection is ongoing, and this dataset will be periodically updated with additional sets of tweet ids.
1,797,260 tweet ids for #paradisepapers collected with Documenting the Now’s twarc from November 5-26, 2017.
987,938 tweets retrieved that mentioned #PuertoRico over the period of October 4 to November 7, 2017. This was a period where there was increased concern being expressed in social media about the response to the humanitarian crisis caused by Hurricane Maria, which made landfall on September 20. Tweets with ids greater than 919222753353457664 were collected from the streaming API, and the earlier tweets were collected using the search API. In both cases tweets using #PuertoRico were collected.
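To separate the two collection methods in practice, the identifiers can be compared against the threshold quoted above. The sketch below also decodes an approximate creation time from an identifier, which relies on an assumption: that these are Twitter Snowflake IDs whose upper bits hold milliseconds since an epoch of 1288834974657. The function and constant names are invented for this example.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Tuple

# Assumption: Snowflake epoch in milliseconds, per Twitter's published ID format.
TWITTER_EPOCH_MS = 1288834974657
# Threshold quoted in the dataset description above.
STREAM_START_ID = 919222753353457664


def snowflake_to_datetime(tweet_id: int) -> datetime:
    """Approximate creation time encoded in a Snowflake-style tweet ID."""
    milliseconds = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(milliseconds / 1000, tz=timezone.utc)


def split_by_source(ids: Iterable[int]) -> Tuple[List[int], List[int]]:
    """Return (search_ids, stream_ids) using the documented ID threshold."""
    search: List[int] = []
    stream: List[int] = []
    for tweet_id in ids:
        (stream if tweet_id > STREAM_START_ID else search).append(tweet_id)
    return search, stream


if __name__ == "__main__":
    print(snowflake_to_datetime(STREAM_START_ID).isoformat())
    print(split_by_source([STREAM_START_ID - 1, STREAM_START_ID + 1]))
```

If the Snowflake assumption holds, the decoded time for the threshold should fall inside the October 4 to November 7, 2017 collection window, which is a quick sanity check on the split.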
This dataset contains the tweet ids of 35,596,281 tweets related to Hurricanes Irma and Harvey. They were collected during these events from the Twitter API using Social Feed Manager. These tweet ids are broken up into 2 collections. Each collection was collected using the POST statuses/filter method of the Twitter Stream API.
This is a collection of 1,430 tweet ids for tweets using the hashtag #BlackTheory collected on September 19th, 2017. The hashtag was used in a conversation started by Dr. Jessica Marie Johnson (@jmjafrx) on September 18th, 2017, where she asked people to name black theorists. See these tweets for context: https://twitter.com/jmjafrx/status/909850396377706496 and https://twitter.com/jmjafrx/status/911739572685438976
This dataset contains 17,292,130 tweet ids for tweets collected from the Twitter filter stream API for #blm and #blacklivesmatter between 2016-01-29 and 2017-03-18. The data was collected using the twarc utility. The files are broken into segments because of network connectivity problems that were encountered during data collection, so there are varying time gaps between the files. Also, when the hashtags were trending globally, rate limits may have prevented some tweets from being streamed over the API.
This dataset contains identifiers for 8,410,431 tweets that were collected between September 19, 2017 and October 5, 2017 that mentioned #CatalanReferendum, #CatalalonianReferendum, #Catalonia, #1oct, #1o or #votarem. These hashtags were used in the lead up to the Catalan Independence Referendum on October 1, 2017. The referendum was declared illegal under Spanish law, and the Spanish police attempted to prevent it. The data collection was a collaboration with Vicenç Ruiz Gómez and Aniol Maria of the Society of Catalan Archivists working in conjunction with Ed Summers of the Maryland Institute for Technology in the Humanities. The hashtags were selected after monitoring the #CatalanReferendum hashtag for several hours on September 28 to determine which hashtags were being used most. The tweets themselves were collected from the Twitter Search API using twarc and its twarc-archive utility. twarc-archive was run every hour to collect the tweets that occurred since the last run.
This dataset includes 10,894 tweet ids for tweets that used the hashtag #AmplifyWomen. The tweets were collected on October 14th, 2017. The hashtag started being used in response to a Twitter boycott that started in support of actress Rose McGowan, after she revealed she was sexually assaulted by Harvey Weinstein. These two tweets by @bardgal and @Chatvert give some context for the purpose of the hashtag: https://twitter.com/bardgal/status/918729587625902080, https://twitter.com/Chatvert/status/918817455505575942.
This dataset includes 80,339 tweet ids collected on October 14th, 2017 that use the hashtag #WOCAffirmation. The hashtag was started by April Reign (@ReignOfApril) as a way to amplify voices of women of color and partly as a response to a Twitter boycott started in support of actress Rose McGowan, after she revealed that she was sexually assaulted by Harvey Weinstein. These tweets by April Reign show her calling for Twitter users to use the hashtag: https://twitter.com/ReignOfApril/status/918691938143834112, https://twitter.com/ReignOfApril/status/918695092587601920, https://twitter.com/ReignOfApril/status/918696352359391232.
This dataset contains 18,646 tweet ids documenting the March for Black Women which was held on September 30th, 2017 in Washington D.C. The dataset contains 2,925 tweet ids for tweets that included the hashtag #marchforblackwomen and 15,271 tweet ids for tweets that included the hashtag #M4BW. The march website is here: https://www.mamablack.org/march-for-black-women.
The 2017 Catalonia attacks were terrorist attacks against pedestrians on La Rambla (Barcelona) and the beach promenade of Cambrils on the afternoon and night of August 17-18, 2017. We selected the #NoTincPor hashtag because it was the motto of the demonstrations during that period and carried the most positive message. Because no one knew how the situation would develop, the best option was to collect the dataset using the search API and limit it to August 17 - 26, the latter being the date of the final demonstration in Barcelona for the victims.
This dataset contains Twitter JSON data for Tweets related to Hurricane Harvey and the subsequent flooding along the Texas gulf region. This dataset was created using the twarc (https://github.com/edsu/twarc) package that makes use of Twitter’s search API. A total of 7,041,866 Tweets make up the combined dataset. See included README.txt for additional information.
The hashtag #DrawingWhileBlack was started by the artist Annabelle on September 15th, 2017 to celebrate the work of Black artists. The dataset includes 69,236 tweet ids collected on 09/17/2017. Annabelle’s Tumblr website can be found at http://sparklyfawn.tumblr.com/ and her Twitter profile is @sparklyfawn.
This dataset contains the tweet ids of 39,695,156 tweets collected from the Twitter accounts of approximately 4,500 news outlets, i.e., accounts of media organizations intended to disseminate news. The media organizations include everything from local U.S. newspapers to foreign television stations. They were collected between August 4, 2016 and July 20, 2018 from the Twitter API using Social Feed Manager. Note that not all accounts may have been collected for the entire duration and there may be tweets from before the time period. We intend to update this dataset periodically.
This dataset contains the tweet ids of 5,655,632 tweets that were collected from approximately 3,000 Twitter accounts affiliated with the U.S. government. They were collected between October 21, 2016 and January 21, 2017 from the Twitter API using Social Feed Manager. This dataset was created as part of the End of Term Web Archiving initiative. The lists of accounts came from the U.S. Digital Registry and from public submissions.
The 2017 solar eclipse occurred on August 21 and was total for Oregon, Idaho, Wyoming, Nebraska, Kansas, Missouri, Illinois, Kentucky, Tennessee, North Carolina, Georgia, and South Carolina. This dataset includes 13,548,321 tweet identifiers for tweets that included any of the keywords solareclipse2017, solareclipse, eclipse2017, eclipseday or eclipse for the period August 17 to August 23, 2017. The hashtags were selected after watching Twitter’s streaming API for the trending hashtag #solareclipse2017 and counting the most popular co-occurring hashtags. The search API was used instead of the filter stream API because the volume was so high that the stream was emitting notifications that many tweets were not delivered.
This dataset contains Twitter JSON data for several Twitter search queries that were collected the week following the shooting of police officers in Dallas, Texas on July 7th, 2016, using the twarc (https://github.com/edsu/twarc) package that makes use of Twitter’s search API. See included README.txt for additional information.
The #WITBragDay hashtag was used starting August 12, 2017 by women sharing their accomplishments in technology. Tweets matching the query WITBragDay were collected using the POST statuses/filter method of the Twitter Stream API and the GET statuses/search Twitter REST API using Social Feed Manager. There are 34,266 ids for tweets retrieved from the filter stream and 47,621 ids for tweets retrieved using the search API. The dataset includes a list of 52,457 unique tweet ids from both APIs.
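The list of unique ids is presumably the union of the two id sets. A minimal sketch of that merge, assuming each file holds one tweet id per line (the file names below are hypothetical and not part of the dataset description):

```python
from pathlib import Path
from typing import Iterable, List, Set


def read_ids(lines: Iterable[str]) -> Set[str]:
    """Collect tweet ids from an iterable of lines, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def merge_unique(*id_sets: Set[str]) -> List[str]:
    """Union several id sets and sort them numerically for stable output."""
    merged: Set[str] = set().union(*id_sets)
    return sorted(merged, key=int)


if __name__ == "__main__":
    # Hypothetical file names; substitute the files shipped with the dataset.
    paths = [Path("filter_ids.txt"), Path("search_ids.txt")]
    sets = [read_ids(p.read_text().splitlines()) for p in paths if p.exists()]
    for tweet_id in merge_unique(*sets):
        print(tweet_id)
```

If the unique list is a plain union, the counts above imply that roughly 29,430 ids (34,266 + 47,621 - 52,457) were returned by both APIs.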
The Unite the Right rally (also known as the Charlottesville rally) was a protest in Charlottesville, Virginia, United States from August 11–12, 2017, to oppose the removal of a statue of Robert E. Lee in Emancipation Park, which itself was renamed from Lee Park two months earlier. Protesters included white supremacists, white nationalists, neo-Confederates, neo-Nazis, and militias. This dataset contains 200,113 tweet ids collected with the #unitetheright hashtag. Data collection was performed twice from the search API using twarc: once at 2017-08-13 11:46:05 GMT and the other at 2017-08-15 12:03:48 GMT. The second search was run to collect only up to where the first search left off. The time ranges for the tweets are from 2017-08-04 11:44:12 to 2017-08-15 16:03:30 GMT.
On Friday, August 11th, 2017 a large group of racist white nationalists carrying torches marched on the University of Virginia campus in Charlottesville, VA as an intimidation tactic against proponents of the removal of Confederate statues of Robert E. Lee. The Friday evening march was held ahead of a much larger racist white nationalist rally in the center of Charlottesville planned for Saturday, August 12th, 2017. This dataset includes 100,000 tweet ids collected using the DocNow prototype http://app.docnow.io/ and includes tweets sent from 01:13:56 - 7:11:36 EDT on August 12.
This dataset contains 32,056 tweets that mention “ferguson” between August 8 and August 10, 2014. They were collected on May 7th, 2015 from the search form on Twitter’s website. An important limitation to be aware of is that the dataset does not include retweets or tweets that were deleted before May 7th, 2015.
39,264 IDs for tweets related to the Charlottesville KKK rally on July 8, 2017. These tweet IDs matched a search for ‘Charlottesville KKK OR #charlottesvilleKKK OR #blocKKK or #blocKKKparty’. These tweet IDs were collected with the twarc command line tool from Documenting the Now. Using twarc’s hydrate command, researchers can retrieve the full content of those tweets—with additional metadata provided by Twitter’s API—provided the tweets still exist.
Identifiers for 25,489 tweets about the students’ strike at the University of Puerto Rico. The tweets included the hashtag #HuelgaUPR or #Huelga2017 and are from April 11 to May 18, 2017. The tweets were collected using twarc. For a list of resources about the strike visit Puerto Rico Syllabus. Identificadores de 25,439 tuits sobre la huelga estudiantil en la Universidad de Puerto Rico. Los tuits fueron capturados utilizando twarc y cubren el periodo del 11 de abril al 18 de mayo. Para más información sobre la huelga visite Puerto Rico Syllabus.
Identifiers for 782,509 tweets that included the hashtag #macronleaks or #macrongate that were sent between 2017-05-02 07:02:05 and 2017-05-10 16:14:51 UTC. The tweets were collected from the Twitter Search API using twarc. The data does not include the first use of the #macrongate hashtag, but it does include the first use of the #macronleaks hashtag which went viral after Wikileaks retweeted it. More about the story of the #macronleaks hashtag can be found at: http://www.newyorker.com/news/news-desk/the-far-right-american-nationalist-who-tweeted-macronleaks
On 20 April 2017 the Australian Government announced that the Australian citizenship test would be made harder, with an increased focus on ‘Australian values’. Suggestions as to what ‘Australian values’ might actually be soon started to be shared on Twitter using the hashtag #australianvalues. This dataset contains 55,698 tweet ids for #australianvalues collected with Documenting the Now’s twarc from 20 to 27 April 2017.
681,668 tweet ids for #climatemarch collected with Documenting the Now’s twarc from January 22-26, 2017. Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.com/DocNow/twarc). twarc.py hydrate climatemarch_tweet_ids.txt > climatemarch.json.
This bag contains 10,159,892 tweets and retweets sent by or to Twitter user jk_rowling between 2015-07-08 and 2017-03-18. The tweets were collected with Social Feed Manager (m5_003).
1,276,220 tweet ids for #MarchForScience collected with Documenting the Now’s twarc from January 22-26, 2017. Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.com/DocNow/twarc). twarc.py hydrate MarchForScience_tweet-ids.txt > MarchForScience.json.
The hashtag #BlackWomenAtWork began trending following Fox News host Bill O’Reilly’s sexist and racist comments about California Congresswoman Maxine Waters’s hair on March 28th, 2017 and White House Press Secretary Sean Spicer’s remarks to journalist April Ryan during a press briefing on the same day. The hashtag began trending after Brittany Packnett used it in a set of tweets where she asked black women to share their experiences about being black women at work. These tweet ids were collected on four separate occasions using the DocNow prototype twitter collection tool. bwaw1 (10,000 tweets), bwaw2 (41,256 tweets), bwaw3 (92,756 tweets) were collected on March 28th, the day the hashtag began trending. bwaw4 (140,000 tweets) was collected on March 29th.
This bag contains 2,711,011 tweet identifiers collected from the Twitter filter stream between 2017-02-09 and 2017-03-18 that used one or more of the following hashtags: alternativefacts, fakenews, truthiness, postfact, posttruth, factcheck. The original tweets were collected using twarc.
This dataset contains the tweet ids of 7,275,228 tweets related to the Women’s March on January 21, 2017. They were collected between December 19, 2016 and January 23, 2017 from the Twitter API using Social Feed Manager. See included README.txt for additional information.
#brexit tweets collected from the 5th of May to the 24th of August 2016.
14,478,518 tweet ids for #WomensMarch collected with Documenting the Now’s twarc from January 21-28, 2017. Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.com/DocNow/twarc). twarc.py --hydrate WomensMarch_tweet_ids.txt > WomensMarch.json. Also included are the log files for the Filter API and Search API queries. The Filter API query captures the cumulative number of dropped tweets.
These 136,990 tweet ids represent reaction to a Facebook Live video that was posted on January 3rd, 2017, showing four African American men violently attacking a white, mentally disabled man. The tweets were collected on 01/05/2017. After the video surfaced, the Twitter hashtag, #BLMkidnapping, was created and used to incorrectly attribute the violent attack to members of the Black Lives Matter movement. Police in Chicago, where the attack took place, have found no evidence the attack has any connection to the Black Lives Matter movement. This link is to a CNN story documenting the police denial of Black Lives Matter connection: http://www.cnn.com/2017/01/05/us/black-lives-matter-chicago-facebook-live-beating/index.html
On January 12th, 2017 the Senate voted 51-48 to approve a budget resolution as the first step in repealing the Affordable Care Act. The hashtag #SaveACA came into heavy use on Twitter the same day as a response. This dataset includes tweet ids collected on four separate occasions on January 12th and 13th, 2017 for the hashtag #SaveACA.
An ongoing collection of Tweets collected by NCSU Libraries using twarc for the key terms “HB2”, “WeAreNotThis”, “BoycottNC”, “KeepNCFair”, and “ThisIsNotUs”. Collection of “WeAreNotThis”, “BoycottNC”, “ThisIsNotUs”, and “North Carolina” began on 2016-03-24, and collection of “HB2” began on 2016-12-25. Only Tweets including “HB2”, “bathroom”, “bill”, or “KeepNCFair” are included from the “North Carolina” set. These tags were used to discuss North Carolina House Bill 2 (The Public Facilities Privacy & Security Act), passed in March 2016, which includes provisions (among others) that disallow local municipalities from passing their own anti-discrimination ordinances and also require individuals, when using public bathrooms, to use those that align with their sex as stated on their birth certificates rather than the restroom that is consistent with their gender identity (see: https://en.wikipedia.org/wiki/Public_Facilities_Privacy_%26_Security_Act). This dataset is broken into files of no more than 50,000 Tweet IDs each.
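Splitting an id list into files of at most 50,000 ids can be sketched as follows. This is only an illustration of the layout described above, not the script NCSU Libraries used; the function names, file prefix and numbering scheme are assumptions.

```python
from pathlib import Path
from typing import Iterator, List, Sequence


def chunk_ids(ids: Sequence[str], size: int = 50_000) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` ids."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def write_chunks(ids: Sequence[str], directory: Path,
                 prefix: str = "tweet_ids", size: int = 50_000) -> List[Path]:
    """Write each chunk to its own numbered text file, one id per line."""
    directory.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for number, chunk in enumerate(chunk_ids(ids, size), start=1):
        # Hypothetical naming scheme: tweet_ids_0001.txt, tweet_ids_0002.txt, ...
        path = directory / f"{prefix}_{number:04d}.txt"
        path.write_text("\n".join(chunk) + "\n")
        written.append(path)
    return written
```

Only the last file can hold fewer than `size` ids, so every file stays within the stated limit.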
A list of 10,538 Twitter IDs for tweets harvested between 4 January at 11am and 9 January at 11am using Social Feed Manager. As this used the search API, the 4 January at 11am crawl went back about 5-9 days. Tweet IDs are included, as is a log of the decisions made to curate this dataset.
A list of 24,876 Twitter IDs for tweets harvested between Nov. 28 and Dec. 6, 2014 containing the hashtag #bill10. Bill 10 in the Alberta legislature would have given public and Catholic school boards the right to refuse student requests to form gay-straight alliances in schools. Under intense public interest it was withdrawn by the Conservative government.
This is a dataset of ids for tweets purchased from Twitter as part of the Beyond the Hashtags study http://cmsimpact.org/resource/beyond-hashtags-ferguson-blacklivesmatter-online-struggle-offline-justice/ The dataset includes a year of tweets that mention one or more of 45 keywords associated with the BlackLivesMatter movement. This period covers a critical time in which social media was used to raise awareness about police killings of unarmed Black citizens in the United States.
228,086 tweet ids for “TheHip, hipinkingston” captured during the Tragically Hip’s final concert in Kingston, Ontario in August 2016. Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.com/DocNow/twarc). twarc.py --hydrate th_final_concert_kingston_tweet_ids.txt > th_final_concert_kingston.json
These are tweets that were collected between August 27, 2015 and January 4, 2016 that mention the word “trump”. This period marked important early months in the Republican primaries. They were collected from Twitter’s streaming API using twarc. There are 40,202,199 tweet identifiers in all. Due to network outages there are gaps at the following points: 2015-08-27 19:12:37 - 2015-08-27 20:13:44 ; 2015-11-02 02:02:13 - 2015-11-05 16:20:35 ; 2015-12-28 02:02:42 - 2015-12-28 02:04:00
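When analysing tweet volume over time it can help to know how much of the collection period the outages cover. The sketch below encodes the three gaps listed above and totals their length. The timestamps are treated as naive values because the description does not state a timezone, and the helper names are illustrative.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

FMT = "%Y-%m-%d %H:%M:%S"

# Gap boundaries copied from the dataset description (timezone not stated).
RAW_GAPS = [
    ("2015-08-27 19:12:37", "2015-08-27 20:13:44"),
    ("2015-11-02 02:02:13", "2015-11-05 16:20:35"),
    ("2015-12-28 02:02:42", "2015-12-28 02:04:00"),
]

Gap = Tuple[datetime, datetime]


def parse_gaps(raw: List[Tuple[str, str]]) -> List[Gap]:
    """Convert (start, end) strings into datetime pairs."""
    return [(datetime.strptime(s, FMT), datetime.strptime(e, FMT)) for s, e in raw]


def in_gap(moment: datetime, gaps: List[Gap]) -> bool:
    """Return True if `moment` falls inside any outage window."""
    return any(start <= moment <= end for start, end in gaps)


def total_gap(gaps: List[Gap]) -> timedelta:
    """Sum the lengths of all outage windows."""
    return sum((end - start for start, end in gaps), timedelta())


if __name__ == "__main__":
    gaps = parse_gaps(RAW_GAPS)
    print("total outage time:", total_gap(gaps))
```

The second gap, at a little over three and a half days, accounts for nearly all of the missing time.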
8,595,589 tweet ids for aleppo tweets captured during the fall of Aleppo in December 2016. Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.com/DocNow/twarc). twarc.py --hydrate aleppo_tweet_ids.txt > aleppo.json
Tweet ids for #elxn42 tweets.
This item represents a collection of 13,480,000 tweet IDs that mention “ferguson” from 2014-08-10 to 2014-08-27 and 15,080,078 tweet IDs that mention “ferguson” between 2014-11-11 and 2014-12-08. The first set includes tweets for the two-week period after the shooting of Michael Brown, and the second set includes tweets around the grand jury’s decision not to indict police officer Darren Wilson, which was announced on 2014-11-24. The first set of tweets was collected by Ed Summers at the University of Maryland and the second was a collaboration between Molly Loyd, Gregory Coleman, Kimberly Lamke, Benjamin Sugar and Ed Summers.
This dataset contains Twitter JSON data for several Twitter search queries that were collected around the #YesAllWomen Twitter “conversation” between May 25, 2014 and June 8, 2014 using the twarc (https://github.com/edsu/twarc) package that makes use of Twitter’s search API. A total of 2,805,763 Tweets and 34,532 images make up the combined dataset.
Tweet ids for #YMMfire tweets captured during the 2016 Fort McMurray Wildfire from 2016-05-01 to 2016-06-25.
Tweet ids for #NDP2016 tweets during the 2016 NDP Convention.
Tweet ids for #panamapapers tweets.
Tweet ids for #thechalkening tweets.
Tweet ids for #MakeDonaldDrumpfAgain tweets.
Tweet ids for #paris #Bataclan #parisattacks #porteouverte tweets.
This dataset contains the tweet ids of approximately 280 million tweets related to the 2016 United States presidential election. They were collected between July 13, 2016 and November 10, 2016 from the Twitter API using Social Feed Manager. These tweet ids are broken up into 12 collections. Each collection was collected either from the GET statuses/user_timeline method of the Twitter REST API or the POST statuses/filter method of the Twitter Stream API.
This data set identifies 38M tweets collected for the analysis of social media messages related to the 2012 U.S. Presidential election. The data set provides tweet IDs for tweets containing the words “obama”, “romney”, or both (case-insensitive matching) during the period from July 1, 2012 through November 7, 2012. The paper, “Online and Social Media Data As an Imperfect Continuous Panel Survey.” PLoS ONE 11(1): e0145406 by Diaz et al. provides further description of the dataset.
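After rehydrating the ids, the selection rule can be reproduced approximately with a case-insensitive match. The sketch below assumes plain substring matching on the tweet text; the description does not say whether whole-word matching was used, and the function name is illustrative.

```python
import re
from typing import Set

# Case-insensitive match for either candidate name (substring match assumed).
CANDIDATE_PATTERN = re.compile(r"obama|romney", re.IGNORECASE)


def candidates_mentioned(text: str) -> Set[str]:
    """Return the lowercased candidate names that appear in `text`."""
    return {match.group(0).lower() for match in CANDIDATE_PATTERN.finditer(text)}


def is_election_tweet(text: str) -> bool:
    """True if the text mentions at least one of the two names."""
    return bool(candidates_mentioned(text))
```

A tweet mentioning both names yields a two-element set, which makes it easy to separate the “obama”, “romney”, and “both” groups.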
Tweet ids for #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, #CharlieHebdo tweets.
Tweet IDs for tweets carrying the #cdnpoli hashtag, applied to Canadian politics, collected as part of a larger project centered on Canada’s 42nd federal election.