Ilya Kreymer recently asked a question in the Documenting the Now Slack about Twitter’s API data and whether it includes metadata indicating that a tweet has been labeled as disinformation. This structured data is important for building tools that help trace how disinformation is propagating in Twitter and social media more generally. It can also provide a view into how Twitter themselves are working to combat the problem.
I’ve looked for the disinformation label in Twitter API JSON in the past and not seen it. But I figured it couldn’t hurt to look again, so I used this tweet as an example. It’s a snap to fetch the JSON data for a tweet with twarc:
$ twarc tweet 1297495295266357248 > tweet.json
I’ve included the retrieved data below. I don’t see anything related to the label, do you?
https://gist.github.com/edsu/3271d6aec4a2ed9192065425c9aeb56b
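Rather than squinting, one way to check is to walk the JSON and list every key whose name looks label-related. This is only a rough sketch: the search terms are my guesses at what such a field might be called, not names Twitter documents, and it assumes tweet.json holds one JSON object per line.

```python
import json


def find_keys(node, needles, path=""):
    """Yield (path, value) for every dict key whose name contains a needle."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if any(needle in key.lower() for needle in needles):
                yield child, value
            # Keep descending, since a match may also have nested matches.
            yield from find_keys(value, needles, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_keys(item, needles, f"{path}[{i}]")


def search_tweet_file(filename, needles=("label", "warning", "misinfo", "disput")):
    """Search each JSON line in a file. The default needles are guesses."""
    hits = []
    with open(filename) as f:
        for line in f:
            if line.strip():
                hits.extend(find_keys(json.loads(line), needles))
    return hits
```

Calling `search_tweet_file("tweet.json")` returns a list of matching key paths and their values. An empty list only means that none of the guessed terms appear in a key name, not that the label is definitely absent.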
I also took the opportunity to look at Twitter’s new v2 API and see if the tweet looks any different there. Twarc doesn’t support the v2 API yet, so I hand-rolled a little program to talk to the v2 Twitter API:
https://gist.github.com/edsu/1e2c4fc2ea2dae90ea8fb660b708e690
More about the options in a moment. This program fetched this representation of the same tweet:
https://gist.github.com/edsu/2e039cc078e1bb5a2e69940c8ff99e17
Again, I still don’t see any information about the content warning, but maybe I wasn’t squinting right? I did see that, according to the data, Donald Trump is in the class of Person, which covers “Named people in the world like Nelson Mandela”. I mean yes, but NO.
So about the options. Unlike the v1.1 API, when using the v2 API you need to indicate in the request which fields you would like to have returned in the response. There is a set of names for the types of fields, such as media.fields, tweet.fields, place.fields, poll.fields and user.fields. Each of these field types has an enumerated set of associated values, like duration_minutes for a poll or context_annotations for a tweet. Think of this as some kind of strange alternative to GraphQL.
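To make the field selection concrete, here is a minimal sketch of how such a request could be put together. It is not the program from the gist above: the function names are hypothetical, and the endpoint path and bearer token header are my understanding of the v2 tweet lookup API, so treat them as assumptions. It only builds the URL and headers and does not make a network call.

```python
from urllib.parse import urlencode

# Assumed base URL for looking up a single tweet in the v2 API.
API_ROOT = "https://api.twitter.com/2/tweets"


def tweet_lookup_url(tweet_id, fields):
    """Build a lookup URL from a mapping of field type -> list of values.

    Each field type (e.g. "tweet.fields") becomes one query parameter whose
    value is a comma-separated list. Empty field types are left out.
    """
    params = {name: ",".join(values) for name, values in fields.items() if values}
    # safe="," keeps the commas readable instead of encoding them as %2C.
    return f"{API_ROOT}/{tweet_id}?{urlencode(params, safe=',')}"


def auth_headers(bearer_token):
    """Headers for app-only authentication (assumed: OAuth 2.0 bearer token)."""
    return {"Authorization": f"Bearer {bearer_token}"}
```

For example, `tweet_lookup_url("1297495295266357248", {"tweet.fields": ["context_annotations"]})` produces a URL ending in `?tweet.fields=context_annotations`, which could then be fetched with any HTTP client along with the headers from `auth_headers`.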
There are lots of these enumerated values to choose from so I started by simply requesting all of them. Interestingly this failed, and the error message I received indicated that I had requested field values that required elevated privileges. Once I removed these from the request I was able to get back the JSON I pasted above.
The little error message actually provided a small glimpse of what data Twitter does and doesn’t provide to “regular” developer accounts through the v2 API. For example I had to remove non_public_metrics from media.fields and tweet.fields because the following fields required additional permissions:
- non_public_metrics.impression_count
- non_public_metrics.url_link_clicks
- non_public_metrics.user_profile_clicks
I guess that makes sense since they are non-public. Similarly I had to remove organic_metrics from media.fields and tweet.fields because the following fields required additional permissions:
- organic_metrics.impression_count
- organic_metrics.like_count
- organic_metrics.reply_count
- organic_metrics.retweet_count
- organic_metrics.url_link_clicks
- organic_metrics.user_profile_clicks
And finally I had to remove promoted_metrics from media.fields and tweet.fields because the following fields required additional permissions:
- promoted_metrics.impression_count
- promoted_metrics.like_count
- promoted_metrics.reply_count
- promoted_metrics.retweet_count
- promoted_metrics.url_link_clicks
- promoted_metrics.user_profile_clicks
I guess these are metrics for advertisers who are paying to have their tweets slipped into our timelines. Many of these seem to be present in the v1.1 Metrics API but it’s interesting that the new API is folding that functionality into the representation of tweets. I did notice that the public_metrics counts were all zero. As much as I might wish this to be true for this particular tweet I know it’s probably just a bug that Twitter are working on.
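The trimming I did by hand can be sketched as a small filter over the requested fields. The three restricted names come from the error message described above. The function name and the shape of the `fields` mapping are just for illustration, and the set would need updating for accounts with different privileges.

```python
# Field values the error message said required elevated privileges.
RESTRICTED = {"non_public_metrics", "organic_metrics", "promoted_metrics"}


def drop_restricted(fields, restricted=RESTRICTED):
    """Return a copy of the field selection without the restricted values.

    `fields` maps a field type such as "tweet.fields" to a list of values.
    The input mapping is not modified.
    """
    return {
        name: [value for value in values if value not in restricted]
        for name, values in fields.items()
    }


requested = {
    "tweet.fields": [
        "context_annotations",
        "public_metrics",
        "non_public_metrics",
        "organic_metrics",
        "promoted_metrics",
    ],
    "media.fields": ["non_public_metrics", "organic_metrics", "promoted_metrics"],
    "poll.fields": ["duration_minutes"],
}
allowed = drop_restricted(requested)
```

Here `allowed["tweet.fields"]` keeps only `context_annotations` and `public_metrics`, and `allowed["media.fields"]` ends up empty because this example only listed the restricted values for media.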
I think this highlights one of the major disadvantages to collecting the JSON representation of a tweet. Some information is passed by reference, such as URLs for images and videos. Those need to be fetched in order to understand a tweet.
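As a rough illustration of what "passed by reference" means in practice, this sketch collects the URLs a v1.1 tweet points at so they could be fetched separately. The key names (`entities.urls`, `extended_entities.media`, `video_info.variants`) are the ones I understand the v1.1 JSON to use, and the function name is made up for this example.

```python
def referenced_urls(tweet):
    """Collect link, image and video URLs from a v1.1 tweet dictionary."""
    urls = []
    # Links shared in the tweet text.
    for link in tweet.get("entities", {}).get("urls", []):
        if link.get("expanded_url"):
            urls.append(link["expanded_url"])
    # Attached photos and videos.
    for media in tweet.get("extended_entities", {}).get("media", []):
        if media.get("media_url_https"):
            urls.append(media["media_url_https"])
        # Videos list one URL per encoding.
        for variant in media.get("video_info", {}).get("variants", []):
            if variant.get("url"):
                urls.append(variant["url"])
    return urls
```

Each URL returned still has to be downloaded and stored alongside the JSON, and any of them can disappear later even if the JSON is safely archived.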
But some information is simply not included, such as this disinformation label, which must exist somewhere in Twitter’s infrastructure. The same is true for reply threads to a tweet, although this might get easier with the v2 API’s support for threads. And lastly the JSON representation changes. It has been remarkably stable for the past 10 years or so, with small things added here and there (e.g. 280 characters instead of 140). v2 marks a significant change, one where stream-based processing of tweets will require substantial changes to keep working.
Returning to the topic of disinformation, for the moment it might be useful to add support to a scraping tool like twint to see if these disinformation labels could be pulled out of the page and serialized into the JSON and CSV it generates. Scraping is against Twitter’s terms of service, but this information is so important for public discourse that the research and archival communities are left with little other choice.
Originally published at https://inkdroid.org.