Visualizing Archival Appraisal
The web is an immense and constantly changing information landscape that seems intrinsically to resist the idea of the archive. But this challenge hasn’t stopped the digital preservation community from spending the past 15 years building tools and practices for crawling, storing, and replaying web content. Even the herculean efforts of the Internet Archive to archive the entire web are performed with the understanding that practical decisions need to be made about what to save. These decisions, known as the process of archival appraisal, shape the historical record and what we know about the past (Cook, 2011).
Just like in the image of the archivist above, until fairly recently appraisal decisions have been made by people (usually archivists) about particular sets of physical records (usually, but not always, paper documents). While this practice continues, appraisal is increasingly enacted as a human-machine collaboration, in which archivists use computers to help select and collect content from the web using automated processes and algorithms (Dourish, 2016). Thinking about appraisal as a form of human-computer collaboration allows us to look at it as a data visualization problem, where the analysis, summarization, and representation of web content can enhance the archivist’s ability to identify web content for the archive. In addition, the introduction of social media offers the promise of a more participatory archive, in which the archivist makes appraisal decisions in collaboration with others (Huvila, 2008).
To illustrate, consider this tweet:
https://twitter.com/AntonioFrench/status/498283364672348160
After the killing of Michael Brown in Ferguson, Missouri, St. Louis Alderman Antonio French video-recorded the protests and demonstrations with his smartphone. French uploaded these videos to Vine and shared them on Twitter, where they circulated as retweets. These tweets, and others like them, were cited by mainstream media around the world. This sharing activity and the discussion around it are important traces of attention and engagement that can provide essential context for an archivist attempting to document the Ferguson protests.
We think it’s useful to look at this engagement in social media as attention data that can be used to identify and evaluate content in need of archiving. How can this attention data be visualized to help archivists document the web? Most importantly, how can the data serve as signals that connect archivists with the communities they are documenting?
Previous Work
Of course, we aren’t the first to think about the potential of social media for web archiving. In 2013 the British Library experimented with using Twitter as a source for collecting URLs to archive. They found that while the data held promise for appraising web content, a significant amount of processing was needed to make the URLs usable. In particular, they encountered problems with duplication (multiple URLs for the same resource) and significant difficulties with spam.
Rollason-Cass and Reed (2015) discuss how the Internet Archive used social media to help document the #BlackLivesMatter movement. The Documenting the Now project collaborated with the Internet Archive to supply a list of URLs found in 13 million Ferguson tweets collected in the two weeks following the killing of Michael Brown. However, this process was performed in an ad hoc fashion that was difficult to repeat and didn’t involve archivists directly in the appraisal work.
The iCrawl framework introduced by Gossen, Demidova, and Risse (2015) uses a human-in-the-loop approach: an archivist enters an initial query based on a recent event, and this query is run against search engines and social media sites to generate a list of URLs. The web content found at these URLs is then collected, and named-entity recognition is performed on the text to help generate a list of entities to collect from the web. This work closely resembles what we are trying to do in DocNow, but rather than focusing on extracting semantics from web resources, we wanted to focus on a visualization environment for the appraisal activity to happen in.
Another strand of relevant work is that of AlNoamany, Weigle, and Nelson (2016) on how social media can be viewed as a form of storytelling about the web. These stories and their characteristics are useful models, and sources of data, for archivists as they assemble collections of web content for an archive.
Before diving into our design, it’s important to understand two concepts related to web archiving work: the seed list and nomination.
Seed List
Much web archiving appraisal work to date centers on the concept of a seed list. The Archive-It Glossary of Web Archiving Terms defines a seed list as:
One or more starting point URLs from which a web crawler begins capturing web resources.
Each URL in a seed list is a candidate for web archiving. The seed URL itself is fetched by the web crawler and the response is saved. Each seed has an associated scope that defines how far outwards from the seed the web crawler will collect. Determining what URLs to place in a seed list, and what their scope should be, is the core appraisal activity that the archivist performs. The web archiving software then processes the seed list, which results in the creation of a web archive collection. But given the size of the web, determining what should and should not be in a seed list for a particular collection is a challenge.
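To make the seed and scope concepts concrete, here is a minimal sketch in Python; the seed URLs, the scope vocabulary (host vs. path), and the in_scope helper are hypothetical illustrations rather than the behavior of any particular crawler:

```python
from urllib.parse import urlparse

# A hypothetical seed list: each seed pairs a starting URL with a scope
# rule that tells the crawler how far outward to follow links.
seeds = [
    {"url": "https://example.org/protests/", "scope": "path"},  # stay under this path
    {"url": "https://example.com/", "scope": "host"},           # stay on this host
]

def in_scope(seed, candidate):
    """Naive scope check: is a candidate URL within a seed's scope?"""
    s, c = urlparse(seed["url"]), urlparse(candidate)
    if seed["scope"] == "host":
        return c.netloc == s.netloc
    if seed["scope"] == "path":
        return c.netloc == s.netloc and c.path.startswith(s.path)
    return False

print(in_scope(seeds[0], "https://example.org/protests/day1.html"))  # True
print(in_scope(seeds[0], "https://example.org/about.html"))          # False
```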
Nomination
Nomination is the process by which URLs are added to seed lists. Efforts to support nomination have a long history in the web archiving community. Schneider (2003) describes an early system that allowed archivists to submit candidate seed URLs for the September 11 Web Archive. Since 2008 the End of Term Web Archive has collected US government websites at the end of presidential administrations, using a web-based nomination tool that allows archivists to suggest seed URLs to archive. In 2016 the project received heightened attention because of concerns about environmental data disappearing from the public web during the Trump administration. In response, the Environmental Data and Governance Initiative created a browser extension that allowed volunteers at Data Rescue events to submit large numbers of seed URLs: users could browse government websites, identify datasets and other resources of value, and quickly submit them to the End of Term Nomination Tool without interrupting their browsing.
Design
While there has been innovative work around constructing web archiving seed lists using these approaches to nomination, there has been little work on how to present the nominated URLs to archivists for use in building their seed lists. The current state of the art, provided by the Internet Archive’s Archive-It service, is to present these URLs in a tabular format:
Our design explores visualizations for seed list construction, with particular attention to the selection and deselection of material about a specific hashtag event for an archive. As the British Library noted, unwanted resources (spam) in the social media stream are a significant challenge for archiving web content. But spam and other types of unwanted messages can sometimes be desirable to record, depending on the type of collection being assembled and the nature of documentary evidence. Archivists need an efficient workflow for perusing lists of URLs in the context of the social media conversation in order to evaluate them for an archival collection. We wanted an interface that allows users to quickly peruse websites and select or reject them.
Our design draws on the concept of a document card developed by Strobelt et al. (2009) and popularized by Google’s Material Design framework, which offers a Card component described as:
… a sheet of material that serves as an entry point to more detailed information.
The use of cards allows content to reflow naturally across desktop and mobile environments using responsive design techniques.
The majority of the implementation work focused on the data analysis pipeline that takes tweets as input and produces website metadata as output. One beneficial side effect of social media’s influence on the web has been the widespread use of metadata in HTML to control how web documents appear on social media platforms like Facebook and Twitter. We lean on this metadata in our data collection. The pipeline’s output is made available, via a web service call, to a card-based visualization of the website metadata.
The Tweet Collector is driven by user interaction: the user views Twitter content from a real-time search of tweets via the Twitter API. Co-occurring hashtags, as well as specific user accounts, can be added to the search by selecting them. All retrieved tweets are stored in ElasticSearch, where they are indexed by properties such as user name, hashtag, and tweet text.
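A minimal sketch of this step, assuming the twarc client for the Twitter API and the official elasticsearch Python client; the credentials, the search query, and the index name are all placeholders:

```python
from twarc import Twarc
from elasticsearch import Elasticsearch

# Placeholder Twitter API credentials.
twitter = Twarc(consumer_key="...", consumer_secret="...",
                access_token="...", access_token_secret="...")
es = Elasticsearch("http://localhost:9200")

# Index each matching tweet so it can later be queried by
# user name, hashtag, tweet text, etc.
for tweet in twitter.search("#ferguson"):
    es.index(index="tweets", id=tweet["id_str"], body=tweet)
```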
The URL Extractor walks through the collected tweets in ElasticSearch, extracts the URLs mentioned in them, and puts those URLs into a work queue. The work queue is implemented as a list in Redis, where each item includes the URL to be fetched and metadata about the tweet collection it came from.
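A sketch of the extractor under the same assumptions; the Redis list name (url_queue) and the job fields are illustrative choices of ours, not a documented format:

```python
import json

import redis
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")
queue = redis.Redis()

# Walk every collected tweet and queue each expanded URL along with
# metadata about the collection it came from.
for hit in scan(es, index="tweets", query={"query": {"match_all": {}}}):
    tweet = hit["_source"]
    for u in tweet.get("entities", {}).get("urls", []):
        job = {"url": u["expanded_url"],
               "collection": "ferguson",
               "tweet_id": tweet["id_str"]}
        queue.rpush("url_queue", json.dumps(job))
```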
The URL Fetcher listens for jobs on the work queue; when it receives one, it fetches the URL from the web and looks for metadata about the web document (HTML metadata, Facebook’s Open Graph Protocol, Twitter Cards). The extracted metadata includes a representative image for the page, the title, a descriptive summary, the canonical URL, and keywords for the page. The metadata is then stored in Redis, keyed by both the original and canonical URLs, in order to prevent repeated lookups for the same URL. Redis is also used to keep a count of the number of times a given URL is referenced in a collection of tweets.
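A sketch of the fetcher, assuming requests and BeautifulSoup; the og: property names follow the Open Graph Protocol, while the Redis key layout (meta:, count:) is an assumption of ours:

```python
import json

import redis
import requests
from bs4 import BeautifulSoup

queue = redis.Redis()

def og(soup, prop):
    """Pull an Open Graph property (e.g. og:title) out of the page."""
    tag = soup.find("meta", property="og:" + prop)
    return tag.get("content") if tag else None

while True:
    # Block until a job arrives on the work queue.
    _, raw = queue.blpop("url_queue")
    job = json.loads(raw)
    resp = requests.get(job["url"], timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    meta = {
        "url": resp.url,  # canonical URL, after any redirects
        "title": og(soup, "title") or (soup.title.string if soup.title else None),
        "description": og(soup, "description"),
        "image": og(soup, "image"),
    }
    # Key by both the original and canonical URL to avoid repeat
    # lookups, and count how often the URL appears in the collection.
    queue.set("meta:" + job["url"], json.dumps(meta))
    queue.set("meta:" + resp.url, json.dumps(meta))
    queue.incr("count:" + resp.url)
```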
Finally, the Data API is a REST API that makes URL metadata for a given collection of tweets available to the Card Visualization. Each card in the visualization represents a web page that was referenced in the Twitter dataset. The cards are ordered by how many times they were mentioned in the collected Twitter data:
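Behind such a view, the endpoint could look like this minimal Flask sketch; the route, the Redis key layout, and the single-collection simplification are all illustrative assumptions:

```python
import json

import redis
from flask import Flask, jsonify

app = Flask(__name__)
db = redis.Redis()

@app.route("/api/collections/<name>/urls")
def urls(name):
    # This sketch stores a single collection, so <name> is unused;
    # a real service would scope the keys by collection.
    cards = []
    for key in db.scan_iter("count:*"):
        url = key.decode()[len("count:"):]
        meta = db.get("meta:" + url)
        if meta:
            card = json.loads(meta)
            card["mentions"] = int(db.get(key))
            cards.append(card)
    # Order the cards by how often each URL was mentioned.
    cards.sort(key=lambda c: c["mentions"], reverse=True)
    return jsonify(cards)
```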
As users scroll through the document cards they can select or reject each web page for archiving. Because the Internet Archive is such an important service to the web archiving community, an icon on each document card indicates whether the web page is currently archived in the Wayback Machine. If it is not currently archived, clicking on the logo lets the user ask the Internet Archive to collect it now using the Save Page Now feature (Rossi, 2017). Clicking on the number next to the Twitter logo in the document card opens a modal dialog with the tweets that mention the web page.
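Both interactions map onto public Internet Archive endpoints: the Wayback Machine availability API and Save Page Now. A sketch (the example page URL is hypothetical):

```python
import requests

def is_archived(url):
    """Check the Wayback Machine availability API for a snapshot."""
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url}, timeout=10)
    return bool(resp.json().get("archived_snapshots"))

def save_page_now(url):
    """Ask the Internet Archive to archive the page right now."""
    return requests.get("https://web.archive.org/save/" + url, timeout=60)

page = "https://example.org/protests/day1.html"  # hypothetical
if not is_archived(page):
    save_page_now(page)
```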
Future Work
The experience of implementing our design using document cards highlighted several areas for future work to improve the visualization of archival appraisal. Primary among these is a user study to observe archivists actually using the visualization in their work. Observing archivists while performing web archiving appraisal would help identify whether the document card visualization is working as desired. A user study would also identify additional data and behaviors that could be useful.
For example, as part of some initial informal testing with a set of volunteer users, we identified several potential improvements:
- grouping document cards by website (e.g. all YouTube videos)
- limiting to or filtering out mainstream media content (e.g. websites in the Alexa Top 500)
- scoping document cards to those active in Twitter in a particular time period
But the primary feature that we feel is worth exploring in future work is what a collaborative model for the appraisal activity could look like. Web documents are by definition remotely accessible, so there is no reason why a set of documents couldn’t be processed, in real time or asynchronously, by a group of web archivists. The question is how to mediate interaction with the document cards in a way that allows for meaningful collaboration.
Finally, the approach of using Twitter as a proxy for popular consciousness, and as a way of identifying web documents potentially in need of archiving, has limitations that would be useful to address in the DocNow visualization. As boyd and Crawford (2012) and many other scholars have stressed, Twitter, and social media more broadly, should not be taken as representative of society as a whole. Social media data is representative of social media. Treating this data as an objective measure is fraught with bias and a host of other problems. How can this data be visualized and interacted with so as to make these inherent biases clear? Can the data be triangulated with other information sources to help assess the material? Answering these questions would be easier after conducting an in-depth user study.
References
AlNoamany, Y., Weigle, M. C., and Nelson, M. L. (2016). Characteristics of social media stories. In International Conference on Theory and Practice of Digital Libraries, pages 267–279. Springer.
boyd, d. and Crawford, K. (2012). Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society, 15(5):662–679.
Cook, T. (2011). We are what we keep; we keep what we are: archival appraisal past, present and future. Journal of the Society of Archivists, 32(2):173–189.
Dourish, P. (2016). Algorithms and their others: Algorithmic culture in context. Big Data & Society, 3(2).
Gossen, G., Demidova, E., and Risse, T. (2015). iCrawl: Improving the freshness of web collections by integrating social web and focused web crawling. In Proceedings of the Joint Conference on Digital Libraries. Association for Computing Machinery.
Huvila, I. (2008). Participatory archive: towards decentralised curation, radical user orientation, and broader contextualisation of records management. Archival Science, 8(1):15–36.
Rollason-Cass, S. and Reed, S. (2015). Living movements, living archives: Selecting and archiving web content during times of social unrest. New Review of Information Networking, 20(1–2):241–247.
Strobelt, H., Oelke, D., Rohrdantz, C., Stoffel, A., Keim, D. A., and Deussen, O. (2009). Document cards: A top trumps visualization for documents. IEEE Transactions on Visualization and Computer Graphics, 15(6):1145–1152.