DocNow Database Schema
We have determined that it’s feasible to stream an adequate volume of tweets (a trending hashtag at 50 tweets/sec) into a PostgreSQL instance running on a Raspberry Pi 4. This was with a single table (id, json). Now we need to determine what database schema we would need in order to support the current (and anticipated) functionality provided by the DocNow app.
Create tables to model the following relations (a knex migration sketch for a few of them follows the list):
- Settings: name, twitter app keys, default quota
- User: username, api keys, email address, superuser, quota
- Location: location name, location id
- Trend: location id, tweet count, created (we could keep these?)
- User Locations: user id, location id
- Collection: title, created, updated, description, creator, tweet count, active
- Query: collection id, query, created
- Tweet: id, tweet_id, collection_id, created, twitter_user_id, retweet_id, quote_id, retweet count, quote count, reply count
- Image: url, title
- TweetImage: image_id, tweet_id
- Video: url, title
- TweetVideo: video_id, tweet_id
- Webpage: url, title, description, image
- TweetWebpage: webpage_id, tweet_id
- Hashtag: name
- TweetHashtag: hashtag_id, tweet_id
- Search Tweets Queue
- Filter Tweets Queue
- Url Lookup Queue
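
To make the knex mapping concrete, here is a minimal migration sketch for a few of these tables. Table names, column types, and defaults are assumptions for illustration, not the final schema:

```javascript
// migrations/<timestamp>_initial_schema.js -- a partial sketch, not the full schema
exports.up = async (knex) => {
  await knex.schema.createTable('collection', (t) => {
    t.increments('id').primary()
    t.string('title')
    t.text('description')
    t.integer('creator')                  // user id (user table omitted in this sketch)
    t.boolean('active').defaultTo(true)
    t.timestamp('created').defaultTo(knex.fn.now())
    t.timestamp('updated').defaultTo(knex.fn.now())
  })
  await knex.schema.createTable('query', (t) => {
    t.increments('id').primary()
    t.integer('collection_id').references('collection.id')
    t.json('query')
    t.timestamp('created').defaultTo(knex.fn.now())
  })
  await knex.schema.createTable('tweet', (t) => {
    t.increments('id').primary()
    t.bigInteger('tweet_id')              // twitter's own id
    t.integer('collection_id').references('collection.id')
    t.integer('retweet_count')
    t.integer('quote_count')
    t.integer('reply_count')
    t.timestamp('created').defaultTo(knex.fn.now())
  })
  await knex.schema.createTable('hashtag', (t) => {
    t.increments('id').primary()
    t.string('name').unique()
  })
  await knex.schema.createTable('tweet_hashtag', (t) => {
    t.integer('tweet_id').references('tweet.id')
    t.integer('hashtag_id').references('hashtag.id')
    t.primary(['tweet_id', 'hashtag_id'])
  })
}

exports.down = async (knex) => {
  await knex.schema.dropTable('tweet_hashtag')
  await knex.schema.dropTable('hashtag')
  await knex.schema.dropTable('tweet')
  await knex.schema.dropTable('query')
  await knex.schema.dropTable('collection')
}
```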
A few things to note about this schema:
- The query for a collection has been separated from the collection so that it can be changed over time. This lets a user add or remove things from what is being collected without having to create a new collection, which is what the current application requires them to do (see the query sketch after this list).
- Tweets are stored with their collection id. So if the same tweet is collected twice it will end up getting stored twice. While tweets are relatively static, some of their metadata is not, such as the number of times a tweet has been retweeted, or quoted, or replied to.
- One thing we need to pay attention to is the performance of joins. For example, to get the hashtag counts per collection it will be necessary to join Tweet, TweetHashtag and Hashtag (see the join sketch after this list). There may be ways to denormalize this further if it proves too slow.
- It may be feasible to collapse the task queues into one, but we don’t want requests to fetch tweets to block behind URL lookups. The different types of messages will need their own topic so that workers can subscribe to just the things they need to process (see the worker sketch after this list). Hopefully knex will let us express and version the Postgres queues as it does tables.
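
A sketch of the query-per-collection note in practice, assuming a knex instance `db` and the table names from the migration sketch above: changing what a collection gathers is just a matter of inserting a new Query row, and the query in effect is the most recent one.

```javascript
// db is a configured knex instance, e.g. require('knex')(require('./knexfile'))

// add or remove terms by recording a new query; older rows are kept as history
const changeQuery = (db, collectionId, query) => {
  return db('query').insert({
    collection_id: collectionId,
    query: JSON.stringify(query)
  })
}

// the query currently in effect for a collection is simply the newest row
const currentQuery = (db, collectionId) => {
  return db('query')
    .where({collection_id: collectionId})
    .orderBy('created', 'desc')
    .first()
}
```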
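For the join concern, this is roughly the query in question, again assuming the table names above; if it turns out to be too slow, a denormalized count maintained on Hashtag or Collection is one possible fallback.

```javascript
// hashtag counts for a single collection: Tweet -> TweetHashtag -> Hashtag
const hashtagCounts = (db, collectionId) => {
  return db('tweet')
    .join('tweet_hashtag', 'tweet.id', 'tweet_hashtag.tweet_id')
    .join('hashtag', 'hashtag.id', 'tweet_hashtag.hashtag_id')
    .where('tweet.collection_id', collectionId)
    .groupBy('hashtag.name')
    .select('hashtag.name')
    .count('tweet.id as tweets')
    .orderBy('tweets', 'desc')
}
```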
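And for the queues, one way to keep tweet fetching from blocking behind URL lookups while still living in Postgres is to give each topic its own table (or a topic column) and have workers claim rows with `FOR UPDATE SKIP LOCKED`. This is only a sketch, with the table and column names assumed:

```javascript
// claim the oldest unstarted job from one queue table; SKIP LOCKED lets several
// workers poll the same topic without blocking each other
const claimJob = (db, queueTable) => {
  return db.transaction(async (trx) => {
    const job = await trx(queueTable)
      .whereNull('started')
      .orderBy('created')
      .forUpdate()
      .skipLocked()
      .first()
    if (job) {
      await trx(queueTable).where({id: job.id}).update({started: new Date()})
    }
    return job
  })
}

// e.g. a url-fetcher worker only ever looks at its own topic:
// const job = await claimJob(db, 'url_lookup_queue')
```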
Implementation tasks:
- create the schema with knex using a db migration (see the knexfile sketch after this list)
- update server/database and server/redis to use the new schema. This should result in server/redis going away
- update server/api to use updated server/database module
- update server/url-fetcher to use updated server/database module
- update server/stream-loader to use updated server/database module
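
For the first task, something like the following knexfile would let the schema live in versioned migration files that can be applied from the knex CLI or at server startup (the connection settings here are placeholders):

```javascript
// knexfile.js -- a sketch; the real connection settings will differ
module.exports = {
  client: 'pg',
  connection: process.env.DATABASE_URL || 'postgres://docnow@localhost/docnow',
  migrations: {directory: './migrations'}
}
```

server/database could then call `knex.migrate.latest()` on startup so that server/api, server/url-fetcher, and server/stream-loader all run against the same schema version.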