When we first got started on Documenting the Now a few months ago, Yvonne Ng raised an interesting use case: archiving video cited in social media. She was specifically interested in Periscope videos found in the Twitter stream. Yvonne cares about this because of her work as an archivist at WITNESS, where she helps human rights activists in their use of video. It’s also of particular interest to the Documenting the Now project because it speaks to the role that the archivist plays in selecting content. Archiving media streams is tricky because first you have to find the streams, sometimes while they are still active, and once you find them you need to be able to effectively capture the streaming content.
A few days ago this topic popped up again as part of a wider conversation as members of Congress used Periscope and other video streaming tools to record and spread awareness about the sit-in that was taking place in the Capitol building. What record of this protest will remain when Periscope no longer exists? While we don’t yet have functionality in DocNow to archive this video content that is being shared, we thought it might be an opportunity to share a possible recipe and example for how to do this work, and to get some feedback from you.
The first, and perhaps most difficult, step is to identify the content to be collected. This began on Wednesday when MITH’s Director, Neil Fraistat, stopped by my desk and suggested we try collecting tweets using the hashtag #NoBillNoBreak. There are whole areas of study devoted to event detection, but in DocNow we are focused, at least at first, on people making these initial decisions about what is of value. Archivists know these decisions as appraisal. In DocNow we’re thinking of appraisal as an iterative process, in which the archivist is a human-in-the-loop, or part of a sociotechnical system in which the archivist and automated agents coproduce the archive.
So what could this coproduction look like in practice? Let’s take #NoBillNoBreak as an example. It was clear from quickly looking at the #NoBillNoBreak activity that the conversation spanned multiple hashtags. So rather than collect just the one hashtag, we started with an initial search for #NoBillNoBreak, which yielded 17,000 tweets.
With these 17,000 tweets in hand we could then generate a list of other hashtags that co-occur with #NoBillNoBreak. Here were the top 25 at that time.
hashtag tweets
--------------------------
#holdthefloor 1897
#goodtrouble 1674
#disarmhate 643
#enough 576
#noflynobuy 565
#periscope 484
#sitin 442
#turnonthecameras 413
#nomoresilence 354
#endgunviolence 308
#gunviolence 269
#guncontrol 180
#gunsense 166
#trumpspeech 136
#gop 116
#onepulse 115
#nra 94
#ham4sitin 94
#stopgunviolence 73
#demsneversat 68
#euref 64
#brexit 64
#enoughisenough 53
#2a 53
#imwithher 43
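A co-occurrence count like the one above can be sketched in a few lines of Python. This is only an illustration, not the code we ran: it assumes line-oriented JSON as written by twarc, with hashtags under `entities.hashtags[].text` as in Twitter's v1.1 tweet format, and the function name `cooccurring_hashtags` is made up for this example.

```python
import json
from collections import Counter


def cooccurring_hashtags(lines, seed="nobillnobreak"):
    """Count hashtags that appear in the same tweet as the seed hashtag.

    `lines` is any iterable of JSON strings, one tweet per line.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        # Assumed layout: entities.hashtags is a list of {"text": ...} objects.
        hashtags = tweet.get("entities", {}).get("hashtags", [])
        # Lowercase and de-duplicate so each tweet counts a hashtag once.
        tags = {h["text"].lower() for h in hashtags}
        if seed in tags:
            counts.update(tags - {seed})
    return counts
```

Calling `cooccurring_hashtags(gzip.open("search.json.gz", "rt")).most_common(25)` would then give a ranked list to review by hand.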
At this point we took a look at these hashtags to decide which ones we wanted to start collecting. This is the nitty-gritty of appraisal, so it’s an important place in DocNow where we want to be able to record the why behind our decisions about what to collect. In our case we weren’t interested in tweets about Trump, the NRA, the British referendum on leaving the EU, or the Republican or Democratic parties specifically, so we removed some hashtags related specifically to those things. You often see people piggybacking on or hijacking hashtags with their own hashtags, so it’s important to be able to curate this list. After pruning we were left with the following list of hashtags that we wanted to collect:
- #nobillnobreak
- #holdthefloor
- #goodtrouble
- #disarmhate
- #enough
- #noflynobuy
- #sitin
- #turnonthecameras
- #nomoresilence
- #endgunviolence
- #gunviolence
- #guncontrol
- #gunsense
- #stopgunviolence
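One way to keep the why alongside the what is to prune the candidate list against a table of exclusions that carries a reason for each one. The sketch below is hypothetical (DocNow does not do this yet, and `prune_hashtags` is a name invented here); the example reasons simply restate the ones given above.

```python
def prune_hashtags(candidates, exclusions):
    """Split candidate hashtags into a kept list and a log of exclusions.

    `exclusions` maps a hashtag to the reason it was dropped, so the
    appraisal decision is recorded next to the result.
    """
    kept = [tag for tag in candidates if tag not in exclusions]
    log = [
        {"hashtag": tag, "decision": "exclude", "reason": exclusions[tag]}
        for tag in candidates
        if tag in exclusions
    ]
    return kept, log


# Illustrative input only: a few of the co-occurring hashtags and reasons.
candidates = ["#holdthefloor", "#trumpspeech", "#nra", "#sitin", "#brexit"]
exclusions = {
    "#trumpspeech": "about Trump rather than the sit-in",
    "#nra": "about the NRA specifically",
    "#brexit": "about the British referendum on leaving the EU",
}
kept, log = prune_hashtags(candidates, exclusions)
```

The log could be saved with the collection so a later researcher can see what was left out and why.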
We’re thinking this iterative appraisal process of starting with a simple search, examining results, expanding the search, examining results, refining the search, rinse, lather, repeat, is an important user interaction for DocNow to support.
With our final set of hashtags in hand we started two twarc processes: one to collect what we could from the tweets that had been sent so far (in the 7 day window that the Twitter Search API allows), and another to collect new tweets with any of those hashtags as they came in from the Twitter Streaming API. The search finished after 16 hours, having collected 1,098,389 tweets. As of this morning the stream had collected 1,607,875 tweets. Of course running twarc isn’t something we expect users of DocNow to do. This, or something like it, will happen behind the scenes. But one thing we are considering is functionality that will let users on Twitter know that tweets are being collected, give them an opportunity to learn more about why the data is being collected, and possibly even opt out of the collection.
One question that always arises in this sort of work is how long to collect for, since the stream never really ends completely. The sit-in is now over, so one could make an argument that data collection could stop now. But Neil suggested that we keep data collection going until Congress reconvenes on July 5th. His idea is that it could be interesting to see how the conversation is sustained over that period. So having a research question in mind, or being able to articulate a collection development policy to guide the data collection, is important.
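The two processes need the same hashtag list in slightly different forms: the Search API takes a query with terms joined by `OR`, while the Streaming API's `track` parameter takes a comma-separated list. Here is a small sketch of building both; the helper names are ours, and exactly how the strings are handed to twarc depends on the twarc version you have installed.

```python
def _with_hash(tag):
    """Return the tag lowercased with exactly one leading '#'."""
    return "#" + tag.strip().lstrip("#").lower()


def search_query(hashtags):
    """Build a Search API query matching tweets with any of the hashtags."""
    return " OR ".join(_with_hash(tag) for tag in hashtags)


def track_param(hashtags):
    """Build a Streaming API track value; commas act as a logical OR."""
    return ",".join(_with_hash(tag) for tag in hashtags)


hashtags = ["#nobillnobreak", "holdthefloor", "#GoodTrouble"]
query = search_query(hashtags)
track = track_param(hashtags)
```

Keeping the list in one place and deriving both strings from it makes it less likely that the search and the stream drift apart as the list is revised.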
It’s also important to note that at this point all we have is the JSON data from the Twitter API. This does not include any of the images or video that are being shared on the Web. However the URLs for these media resources are present in the JSON data. We were interested in the Periscope content, and it’s fairly easy to pull all the URLs being shared out of the Twitter data, and count the ones at periscope.tv. We did this on the command line using jq and some standard Unix utilities. But obviously this is also something that DocNow will make easily accessible and available to users:
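# Pull every expanded URL out of the tweets, keep the Periscope ones,
# and count how often each was shared (most shared first).
zcat search.json.gz stream.json.gz \
  | jq -r '.entities.urls[].expanded_url' \
  | grep -F periscope.tv \
  | sort \
  | uniq -c \
  | sort -rn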
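The same count can be done in Python if you would rather match on the URL's hostname than on a substring anywhere in the line. This is a sketch under the same assumptions as the pipeline (one tweet per line, URLs under `entities.urls[].expanded_url`); the function name is invented for the example.

```python
import json
from collections import Counter
from urllib.parse import urlparse


def count_periscope_urls(lines):
    """Count how often each periscope.tv URL appears across the tweets."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        tweet = json.loads(line)
        for entry in tweet.get("entities", {}).get("urls", []):
            url = entry.get("expanded_url")
            if not url:
                continue
            host = (urlparse(url).hostname or "").lower()
            # Accept periscope.tv itself and subdomains such as www.
            if host == "periscope.tv" or host.endswith(".periscope.tv"):
                counts[url] += 1
    return counts
```

Reading the files with `gzip.open("search.json.gz", "rt")` and passing the file object in as `lines` mirrors what `zcat` does on the command line.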
Believe it or not, there were 2,837 unique Periscope URLs in the dataset at that time. To give you an idea of what is there, here are the most-shared Periscope URLs, with the number of times each was shared in parentheses:
- (1391) We are back at it. We will be here all night until we see action on gun violence. #NoBillNoBreak by Rep Eric Swalwell
- (1009) Untitled by Rep Scott Peters
- (902) It’s 5AM & we aren’t sleeping. House Dems continue to demand action to end gun violence. #NoBillNoBreak by Rep Eric Swalwell
- (782) Untitled by Rep Scott Peters
- (749) Our fight continues into the night. We serve to protect you. #NoBillNoBreak by Rep Eric Swalwell
- (730) Untitled by Rep Scott Peters
- (305) Just chatted w/ colleagues. We are staying. We were sent to keep all of us safe. Watch live. #NoBillNoBreak by Rep Eric Swalwell
- (284) Untitled by Rep Scott Peters
- (215) Our fight continues into the night. We serve to protect you. #NoBillNoBreak by Rep Eric Swalwell
- (186) Untitled by Rep Scott Peters
- (118) Untitled by Rep Scott Peters
It is worth pointing out a few things that went on behind the scenes into generating this simple list.
- Two individuals were responsible for creating these videos: Congressmen Scott Peters and Eric Swalwell. Determining their identity was done by viewing each video’s page, recognizing the name of the creator, and doing a bit of searching on the Web for more information about them. The name itself is also present in the HTML metadata of the page, in the meta element named twitter:text:broadcaster_display_name. A combination of machine and human work is involved in figuring out who the content creators are.
- Each video has a title that can be obtained from the HTML page for the video. Not everyone is good at adding a title to their videos (hint, hint Congressman Peters). A good title is useful for displaying the video in another context.
- There are up to three unique URLs for a given video. For example the top URL from Congressman Swalwell is available at https://www.periscope.tv/w/ajnzDzE5Mjg4OTZ8MU95S0Fsdm5MQWJ4Yp3USgkvcxPuirSpcS89FUGU0MGDJ35qnv7LYFzyckc_ and https://www.periscope.tv/w/1OyKAlvnLAbxb and https://www.periscope.tv/RepSwalwell/1OyKAlvnLAbxb. When counting the occurrences these need to be normalized. The easiest way is to lean on the canonical URL that is expressed in the HTML.
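The machine half of that work can be sketched with Python's standard HTML parser: read the broadcaster name and canonical URL out of a page you have already fetched, then fold the per-URL counts onto the canonical URL. We are assuming here that the canonical URL appears as a `<link rel="canonical">` element; the class and function names are made up for the example.

```python
from collections import Counter
from html.parser import HTMLParser


class PeriscopePageParser(HTMLParser):
    """Collect the canonical URL and broadcaster name from a video page."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.broadcaster = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Assumption: the canonical URL is given as <link rel="canonical">.
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")
        elif tag == "meta" and attrs.get("name") == "twitter:text:broadcaster_display_name":
            self.broadcaster = attrs.get("content")


def describe_page(html):
    """Return (canonical_url, broadcaster_name) for a page's HTML."""
    parser = PeriscopePageParser()
    parser.feed(html)
    return parser.canonical, parser.broadcaster


def merge_counts(url_counts, canonical_for):
    """Sum share counts for URLs that resolve to the same canonical URL."""
    merged = Counter()
    for url, count in url_counts.items():
        merged[canonical_for.get(url, url)] += count
    return merged
```

URLs whose pages could not be fetched are simply left as they are by `merge_counts`, so nothing drops out of the tally.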
This massaging of the data so that it can be understood by an archivist or researcher is an important step in the process that will be automated by DocNow.
With a list like this in hand it’s now possible to decide to archive some of the video content. In this case these are public officials, so you could consider these videos as part of the public record. But generally speaking we want DocNow to support not only identifying the content to be archived, but also helping you communicate with the content owners, to ask for their consent in being part of your collection, and tracking the conversation you may have with them. Your decision of whether to do this or not largely depends on what you are planning to do with the content. If it’s going to form part of your research and you aren’t planning on publishing it perhaps you don’t need consent from the content creator. But if you are creating a collection for an archive that could be used by any researcher perhaps it is a good idea to seek their permission. We are conscious that there are multiple answers here, and that our work around the ethics of archiving social media will be important in defining functionality here.
But how do you go about archiving streaming video on the Web?
Archiving the Web, especially the dynamic Web, is a hard job. If you pick one of the Periscope URLs and use the Save Page Now function in the Internet Archive’s Wayback Machine here is what you will see:
This isn’t meant to criticize the Internet Archive. They are doing an amazing job at archiving the Web. It’s just important to recognize that archiving the Web isn’t a solved problem. Some things are missing, and some things that appear to be saved aren’t really there. It’s too big a job for just one organization, and one approach. We think this is an area where a tool like WebRecorder could help a great deal, because we are interested in Web archiving that is driven by an archivist who makes decisions about what content to save and is able to review the process. Below is a brief example of archiving one of the Periscope videos with WebRecorder.
One important thing to point out in this video is that it is possible to create an account on WebRecorder, so that you can save the content there and come back to view it later. Exactly how DocNow will interact with WebRecorder is still up in the air. But we ❤️ the way it empowers the user to drive the selection and review process.
Hopefully this provides you with one key user story we’re hoping that DocNow will support. One aspect of this work that hasn’t really been remarked on so far is the unstated assumption that the videos worth saving were the ones that were mentioned the most. An argument could be made that there could be valuable video material that was not discovered immediately, and did not get shared widely. The tools and views for sifting through the content will likely need to be more expressive and extensible than simply counting the most popular things. If you have ideas about the sorts of views you’d like to see, or have any questions, please leave them in comments here. Just highlight the text and leave your comment, or join us over in Slack.