If you’ve been following our work for the past few years you may have occasionally seen us mention something called the Catalog. The Catalog is a clearinghouse for Twitter datasets that has been built by a small but growing community of researchers, archivists and librarians who recognize that social media platforms like Twitter provide essential documentary evidence of our times.
For a variety of reasons, Twitter’s Terms of Service don’t allow data collected from their APIs to be published on the public web. But they do allow, and even encourage, researchers to share Tweet Identifier (ID) datasets, which can then be reconstituted (or hydrated) as data using Twitter’s API. We’ve developed a desktop application called the Hydrator, which makes it easy to turn a dataset of tweet IDs (essentially just a list of numbers) back into tweet JSON, and even to transform that JSON into CSV for research and analysis.
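If you’re curious what hydration involves under the hood, here is a minimal sketch in Node.js. It assumes Node 18+ (for the built-in fetch), a Twitter API bearer token in a BEARER_TOKEN environment variable, and the v1.1 statuses/lookup endpoint, which accepts up to 100 IDs per request. The Hydrator handles authentication and rate limiting for you, so treat this as an illustration rather than a replacement:

```javascript
// hydrate.mjs — minimal hydration sketch (not how the Hydrator itself
// is implemented). Assumes Node 18+, a bearer token in BEARER_TOKEN,
// and a newline-separated ids.txt.
import { readFileSync } from "fs";

const ids = readFileSync("ids.txt", "utf8").trim().split("\n");

for (let i = 0; i < ids.length; i += 100) {
  // statuses/lookup accepts up to 100 comma-separated tweet IDs per call
  const batch = ids.slice(i, i + 100).join(",");
  const res = await fetch(
    `https://api.twitter.com/1.1/statuses/lookup.json?id=${batch}`,
    { headers: { Authorization: `Bearer ${process.env.BEARER_TOKEN}` } }
  );
  if (!res.ok) throw new Error(`Twitter API error: ${res.status}`);
  const tweets = await res.json();
  // Deleted and protected tweets are simply absent from the response,
  // which is how hydration honors a user's decision to remove content.
  for (const tweet of tweets) {
    console.log(JSON.stringify(tweet)); // one JSON object per line
  }
}
```

Running it as `node hydrate.mjs > tweets.jsonl` yields line-oriented tweet JSON, which can then be flattened into CSV for analysis.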
Data archiving is important for memory work and for the reproducibility of research, but we also think it’s important to further the agency of content creators who are using social media to document their lives. Consent to have one’s content archived is difficult to ascertain when collecting “public” tweets from thousands (or millions) of users. However, when a user decides to delete a tweet or take their account private, we think it’s important to honor that decision. It’s far from perfect (some deleted tweets need to be remembered), but sharing tweet ID datasets is one way of simultaneously pursuing the goals of data publishing for research and ethical archiving. P.S. Look for more approaches for navigating consent later this year.
So, the Catalog is simply a descriptive list of these tweet ID datasets, with pointers to the repositories around the web where the data resides. After a few years of modest but steady submissions, the Catalog now contains 113 datasets that, taken together, comprise 2.5 billion tweets. While this is a large number, the Documenting the Now community aims for quality over quantity. Each dataset has been assembled by a librarian, archivist or researcher for a specific purpose. When it comes to social media and the web, it’s just not possible to preserve everything, so our decisions about what to preserve and why become that much more important.
Today we are releasing a new version of the Catalog to support the increased number of datasets it contains, and also to make it a bit easier to explore and submit them. The first version of the Catalog was a very simple static website built using the Jekyll publishing framework. At the time we wanted the Catalog to be a static site to make it easier to maintain and sustain over time: once deployed, static sites have far fewer moving parts than a typical web application. But as the number of datasets grew, and more and more people wanted to submit them, we knew we had to make some changes.
Version 2 of the Catalog is still a static site, but it uses the Gatsby framework, which allows the website to function more like a web application while still operating as a simple static site on the server side. This means we keep the sustainability wins of running a static site, while allowing users to filter and sort the datasets, and to visit a distinct page for each dataset.
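For a flavor of how this works, here is a sketch of the kind of gatsby-node.js code that generates one page per dataset at build time. The query, slug field and template path are illustrative rather than what docnow/catalog actually does, and it assumes the datasets are Markdown files sourced with gatsby-transformer-remark:

```javascript
// gatsby-node.js — illustrative sketch of building one page per dataset.
// Assumes datasets are Markdown files sourced via gatsby-source-filesystem
// and gatsby-transformer-remark; field names and paths are hypothetical.
exports.createPages = async ({ graphql, actions }) => {
  const { data } = await graphql(`
    {
      allMarkdownRemark {
        nodes {
          frontmatter {
            slug
          }
        }
      }
    }
  `);
  for (const node of data.allMarkdownRemark.nodes) {
    actions.createPage({
      path: `/catalog/${node.frontmatter.slug}`,
      component: require.resolve("./src/templates/dataset.js"),
      context: { slug: node.frontmatter.slug },
    });
  }
};
```

Because all of this runs at build time, what gets deployed to the server is still just static HTML and JavaScript.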
Furthermore, the netlify-cms JavaScript library allows the Catalog to offer administrative forms for adding and editing datasets; changes made through these forms are seamlessly committed back to the docnow/catalog GitHub repository. This means we still have a static site that is easy to maintain over time, where all the content lives in a versioned Git repository, but we are no longer asking users to hand-edit data files and submit pull requests in order to share information about their datasets.
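The wiring for this is small: the netlify-cms configuration names a GitHub backend, and edits made through its forms become commits to that repository. A sketch (the config path and branch are illustrative):

```yaml
# static/admin/config.yml (path and branch are illustrative)
backend:
  name: github
  repo: docnow/catalog
  branch: main
```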
Having a form also provides an opportunity to learn more from contributors about their engagement with the ethical aspects of their data collecting work. This work has been informed specifically by Alexandra Dolan-Mescal’s Social Humans project; Alexandra also designed the new Catalog. The new add-dataset form asks this additional set of questions, each of which maps naturally onto a form field in the CMS configuration (see the sketch after the list):
- Did you publicly share (on Twitter) that you were collecting this content?
- Do content creators have the option to opt out of your collecting or remove their content after collection has taken place?
- Do you have an easy-to-find way for content creators to reach out to you or your organization about your collecting project?
- Have you analyzed the tweet dataset for potential threats to those whose content has been collected?
- Is this tweet dataset part of a larger collecting effort of topical materials, including oral histories, web archiving, or physical materials?
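Here is a rough sketch of how such questions can be expressed as netlify-cms fields on the dataset collection, with hypothetical field names and paths (the Catalog’s actual schema may differ):

```yaml
collections:
  - name: datasets
    label: Datasets
    folder: content/datasets   # hypothetical location of the Markdown files
    create: true
    fields:
      - { label: Title, name: title, widget: string }
      - label: Did you publicly share (on Twitter) that you were collecting this content?
        name: announced_collection
        widget: boolean
      - label: Do content creators have the option to opt out or remove their content?
        name: opt_out
        widget: boolean
```

Each answer submitted through the form then lands as structured metadata in the dataset’s Markdown document.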
And finally, for the data nerds out there, one nice side effect of using Gatsby and netlify-cms is that each dataset in the Catalog is now stored as a distinct Markdown document. Markdown lets you compose rich textual documents, with lists, links, images and other formatting, while also expressing structured metadata. The netlify-cms JavaScript library makes it easy for users to edit these documents without even knowing it. For example, consider a dataset contributed by Laura Wrubel and Dan Kerchner from George Washington University.
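Its Catalog entry is, at heart, a Markdown file along these lines (the frontmatter fields and description are simplified here for illustration; the real file lives in the docnow/catalog repository):

```markdown
---
title: Coronavirus Tweet IDs
creators:
  - Laura Wrubel
  - Dan Kerchner
institution: George Washington University
---

Tweet IDs for tweets related to the Coronavirus pandemic, collected
from the Twitter API. The dataset continues to be updated as the
pandemic unfolds.
```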
George Washington University is still in the process of updating this dataset as the Coronavirus pandemic continues to spread around the world. The new Catalog allows them, and others, to easily edit the description as they release more data.
We hope that you will get a chance to give the new Catalog a try, whether for publishing a new dataset or for finding one to work with. If you use it, have ideas, or notice problems, please don’t hesitate to send us an email at [email protected] or to add issues to the issue tracker on GitHub. We’ve got more things planned for integrating the Catalog with the Hydrator and the DocNow application, so look for those later this year.