TL;DR: By “redesign” I actually mean clusterfuck. 47% of the 315 whitehouse.gov URLs I’ve collected during 2017 now return 404 Not Found.
Since the White House website accidentally broke (or intentionally disabled) its RSS feed back when Trump entered the White House, I’ve been running a bot that scrapes the website to generate an unofficial RSS feed. I wanted an RSS feed that I could point my diffengine bot at, to watch for changes (or diffs) at whitehouse.gov, which are then tweeted by whitehouse_diff.
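To give a sense of what generating an unofficial feed involves, here is a minimal sketch in Python. It is not the bot's actual code: the `build_rss` function, the channel description and the example post URL are all assumptions made for illustration, and the scraping step that would produce the list of items is left out.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(title, link, items):
    """Build a minimal RSS 2.0 document.

    `items` is an iterable of (title, url, published) tuples, where
    `published` is a timezone-aware datetime. How the items get scraped
    from the site is out of scope for this sketch.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Unofficial feed built by scraping"
    for item_title, item_url, published in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_url
        # Use the URL as the item's identifier so feed readers can dedupe.
        ET.SubElement(item, "guid").text = item_url
        # RSS 2.0 expects RFC 822 style dates.
        ET.SubElement(item, "pubDate").text = format_datetime(published)
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical scraped item, for illustration only.
    posts = [
        (
            "Example post",
            "https://www.whitehouse.gov/blog/example-post",
            datetime(2017, 12, 14, tzinfo=timezone.utc),
        )
    ]
    print(build_rss("Unofficial whitehouse.gov feed", "https://www.whitehouse.gov/", posts))
```

Anything that can read RSS, diffengine included, can then be pointed at wherever that XML gets published.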
The RSS generation has run for almost a year without much fuss, but a few days ago I started to get error messages from the bot. When I dug in to figure out what was going on I could see that:
https://www.whitehouse.gov/blog
is now redirecting to:
https://www.whitehouse.gov/articles/
Poking around in the Wayback Machine you can see that this switch over happened last Thursday, December 14, 2017:
Checking out the representation of the web pages before and after you can tell that the style and layout have changed significantly:
Examining the HTML makes it clear that the new website is now being driven by WordPress instead of Drupal which, at least according to Wikipedia and lore, was running previously. The first mention I could find on Twitter about the change is in this tweet:
https://twitter.com/williamsba/status/941746481022799872
One thing that the conversation thread attached to that tweet makes evident is that it’s not clear who did this job, and people who seem to know aren’t saying.
Don’t get me wrong: content migrations like this can be hella tricky. But this is the White House website, not just someone’s personal blog. So, I thought it merited a bit of extra work to see how many of the old URLs still resolve properly.
Link checking work can be difficult to get right, particularly because of false positives when the web server responds 200 OK for content that is no longer available, or has drastically changed. In the web archiving community this is known as reference rot, which is related to the better known problem of link rot. Nevertheless, I took a quick look at the URLs that my diffengine instance has collected for the last year for www.whitehouse.gov. There are 315 of them, and can you believe that 47% (149) of the URLs are 404 Not Found? You can find the complete results in this CSV.
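A quick check like this can be sketched in a few lines of Python. This is not the link checking code used for the numbers above; the function names, the User-Agent string and the choice of HEAD requests are assumptions. It also only counts hard 404s, so it will miss the false positives mentioned above, where the server says 200 OK for content that is gone.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the final HTTP status code for `url`, or None on a network error.

    urllib follows redirects, so a URL that redirects to a working page
    reports 200. Some servers answer HEAD differently from GET, so a more
    careful checker would retry with GET.
    """
    req = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "link-check-sketch"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        # 4xx and 5xx responses are raised as exceptions; keep the code.
        return e.code
    except OSError:
        # DNS failures, timeouts, refused connections and the like.
        return None


def check_links(urls, fetch=fetch_status):
    """Map each URL to its status code. `fetch` can be swapped out for testing."""
    return {url: fetch(url) for url in urls}


def summarize(results):
    """Return (number of 404s, total URLs, percentage of 404s rounded to a whole number)."""
    total = len(results)
    not_found = sum(1 for status in results.values() if status == 404)
    pct = round(100 * not_found / total) if total else 0
    return not_found, total, pct
```

With 149 of 315 URLs returning 404, `summarize` gives `(149, 315, 47)`, since 149 / 315 is about 47.3%.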
Unfortunately this little sampling from 2017 does not bode well for the entirety of URLs on the White House website. Presumably (hopefully?) the content is still living at a new location. But not maintaining these links, or at least redirecting them, means that the many, many links from the larger web to the whitehouse.gov website are now broken. In addition to breaking the web, this will most likely affect how White House web pages are represented in Google’s index as well, since Google will treat these pages as effectively gone, which (if it were any other website) will negatively impact their Google Juice.
But I imagine the current administration isn’t too worried about a botched CMS migration as they focus their attention on trashing our economy and the environment while sending us teetering on the brink of nuclear war.
It seems like a small, almost petty thing to focus on a handful of website links while the White House is engaged in much more deliberate dismantling of our system of government — as it rends the social fabric that somehow, against the odds, keeps us working together. But let me close with what is most telling about this whitehouse.gov redesign.
WordPress is an amazing open source project. It has been going for 14 years and just keeps on getting stronger despite the many powerful forces that are arrayed against it. WordPress is a vestige of an older, more decentralized web, where people didn’t solely rely on a handful of big social media players to publish the content that matters to them. Central to the functioning of the WordPress ecosystem is the technology of RSS, also known as Really Simple Syndication. RSS allows people to subscribe to a website. It allows content to flow from one website into other websites.
When you install WordPress, RSS is on by default because, in many ways, it’s the point. You want to share what you are publishing on the web with the world. You have to go out of your way to turn it off. What kind of person decides to turn it off?
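One way to see whether a site still advertises a feed is to look for feed autodiscovery links in the page's HTML, which WordPress themes commonly emit in the `<head>`. The sketch below does that with Python's standard library. The `find_feeds` helper and the example.org markup are made up for illustration, and a site can still serve a feed without advertising it this way (a stock WordPress install typically answers at `/feed/`).

```python
from html.parser import HTMLParser

# MIME types used for feed autodiscovery links.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect href values from <link rel="alternate"> tags that point at feeds."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # Attribute values can be None when an attribute has no value.
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        href = attr.get("href")
        if "alternate" in rel and mime in FEED_TYPES and href:
            self.feeds.append(href)


def find_feeds(html):
    """Return the feed URLs advertised in an HTML document, in document order."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feeds


if __name__ == "__main__":
    # Hypothetical page markup, for illustration only.
    page = (
        '<html><head>'
        '<link rel="alternate" type="application/rss+xml" '
        'title="Feed" href="https://example.org/feed/" />'
        '<link rel="stylesheet" href="/style.css">'
        '</head><body></body></html>'
    )
    print(find_feeds(page))
```

An empty list from `find_feeds` means the page does not advertise a feed, which is worth a closer look on a site running WordPress.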
For more about the redesign story see this Mashable article by Sasha Lekach. Thanks to Derek Willis for pointers and taking a look at this as it was being written.
Originally published at inkdroid.org on December 20, 2017.
This post originally mistakenly stated that 98% of the URLs were broken, which was an error introduced by a small bug in my link checking code. My apologies. I jumped to conclusions about the dangerous criminals that are currently running the White House.