DocNow Database Schema
We have determined that it’s feasible to stream an adequate volume of tweets (a trending hashtag at 50 tweets/sec) into a PostgreSQL instance running on a Raspberry Pi 4. This was with a single table (id, json). Now we need to determine what database schema we would need in order to support the current (and anticipated) functionality provided by the DocNow app.
Create tables to model the following relations (a knex migration sketch for a few of them follows the list):
- Settings: name, twitter app keys, default quota
- User: username, api keys, email address, superuser, quota
- Location: location name, location id
- Trend: location id, tweet count, created (we could keep these?)
- User Locations: user id, location id
- Collection: title, created, updated, description, creator, tweet count, active
- Query: collection id, query, created
- Tweet: id, tweet_id, collection_id, created, twitter_user_id, retweet_id, quote_id, retweet count, quote count, reply count
- Image: url, title
- TweetImage: image_id, tweet_id
- Video: url, title
- TweetVideo: video_id, tweet_id
- Webpage: url, title, description, image
- TweetWebpage: webpage_id, tweet_id
- Hashtag: name
- TweetHashtag: hashtag_id, tweet_id
- Search Tweets Queue
- Filter Tweets Queue
- Url Lookup Queue
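
To make the knex mapping concrete, here is a minimal migration sketch for a few of these tables. Table names, column types, and defaults are assumptions for illustration, not the final schema:

```javascript
// migrations/<timestamp>_initial_schema.js -- a partial sketch, not the full schema
exports.up = async (knex) => {
  await knex.schema.createTable('collection', (t) => {
    t.increments('id').primary()
    t.string('title')
    t.text('description')
    t.integer('creator')                  // user id (user table omitted in this sketch)
    t.boolean('active').defaultTo(true)
    t.timestamp('created').defaultTo(knex.fn.now())
    t.timestamp('updated').defaultTo(knex.fn.now())
  })
  await knex.schema.createTable('query', (t) => {
    t.increments('id').primary()
    t.integer('collection_id').references('collection.id')
    t.json('query')
    t.timestamp('created').defaultTo(knex.fn.now())
  })
  await knex.schema.createTable('tweet', (t) => {
    t.increments('id').primary()
    t.bigInteger('tweet_id')              // twitter's own id
    t.integer('collection_id').references('collection.id')
    t.integer('retweet_count')
    t.integer('quote_count')
    t.integer('reply_count')
    t.timestamp('created').defaultTo(knex.fn.now())
  })
  await knex.schema.createTable('hashtag', (t) => {
    t.increments('id').primary()
    t.string('name').unique()
  })
  await knex.schema.createTable('tweet_hashtag', (t) => {
    t.integer('tweet_id').references('tweet.id')
    t.integer('hashtag_id').references('hashtag.id')
    t.primary(['tweet_id', 'hashtag_id'])
  })
}

exports.down = async (knex) => {
  await knex.schema.dropTable('tweet_hashtag')
  await knex.schema.dropTable('hashtag')
  await knex.schema.dropTable('tweet')
  await knex.schema.dropTable('query')
  await knex.schema.dropTable('collection')
}
```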
A few things to note about this schema:
- The query for a collection has been separated from the collection so that it can be changed over time. This lets a user add or remove things from what is being collected without having to create a new collection, which is what the current application requires them to do (see the query sketch after this list).
- Tweets are stored with their collection id. So if the same tweet is collected twice it will end up getting stored twice. While tweets are relatively static, some of their metadata is not, such as the number of times a tweet has been retweeted, or quoted, or replied to.
- One thing we need to pay attention to is the performance of joins. For example, to get the hashtag counts per collection it will be necessary to join Tweet, TweetHashtag and Hashtag (see the join sketch after this list). There may be ways to denormalize this further if it proves too slow.
- It may be feasible to collapse the task queues into one, but we don’t want requests to fetch tweets to block behind URL lookups. The different types of messages will need their own topic so that workers can subscribe to just the things they need to process (see the worker sketch after this list). Hopefully knex will let us express and version the Postgres queues as it does tables.
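
A sketch of the query-per-collection note in practice, assuming a knex instance `db` and the table names from the migration sketch above: changing what a collection gathers is just a matter of inserting a new Query row, and the query in effect is the most recent one.

```javascript
// db is a configured knex instance, e.g. require('knex')(require('./knexfile'))

// add or remove terms by recording a new query; older rows are kept as history
const changeQuery = (db, collectionId, query) => {
  return db('query').insert({
    collection_id: collectionId,
    query: JSON.stringify(query)
  })
}

// the query currently in effect for a collection is simply the newest row
const currentQuery = (db, collectionId) => {
  return db('query')
    .where({collection_id: collectionId})
    .orderBy('created', 'desc')
    .first()
}
```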
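For the join concern, this is roughly the query in question, again assuming the table names above; if it turns out to be too slow, a denormalized count maintained on Hashtag or Collection is one possible fallback.

```javascript
// hashtag counts for a single collection: Tweet -> TweetHashtag -> Hashtag
const hashtagCounts = (db, collectionId) => {
  return db('tweet')
    .join('tweet_hashtag', 'tweet.id', 'tweet_hashtag.tweet_id')
    .join('hashtag', 'hashtag.id', 'tweet_hashtag.hashtag_id')
    .where('tweet.collection_id', collectionId)
    .groupBy('hashtag.name')
    .select('hashtag.name')
    .count('tweet.id as tweets')
    .orderBy('tweets', 'desc')
}
```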
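And for the queues, one way to keep tweet fetching from blocking behind URL lookups while still living in Postgres is to give each topic its own table (or a topic column) and have workers claim rows with `FOR UPDATE SKIP LOCKED`. This is only a sketch, with the table and column names assumed:

```javascript
// claim the oldest unstarted job from one queue table; SKIP LOCKED lets several
// workers poll the same topic without blocking each other
const claimJob = (db, queueTable) => {
  return db.transaction(async (trx) => {
    const job = await trx(queueTable)
      .whereNull('started')
      .orderBy('created')
      .forUpdate()
      .skipLocked()
      .first()
    if (job) {
      await trx(queueTable).where({id: job.id}).update({started: new Date()})
    }
    return job
  })
}

// e.g. a url-fetcher worker only ever looks at its own topic:
// const job = await claimJob(db, 'url_lookup_queue')
```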
Implementation tasks:
- create the schema with knex using a db migration (see the knexfile sketch after this list)
- update server/database and server/redis to use the new schema. This should result in server/redis going away
- update server/api to use updated server/database module
- update server/url-fetcher to use updated server/database module
- update server/stream-loader to use updated server/database module
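
For the first task, something like the following knexfile would let the schema live in versioned migration files that can be applied from the knex CLI or at server startup (the connection settings here are placeholders):

```javascript
// knexfile.js -- a sketch; the real connection settings will differ
module.exports = {
  client: 'pg',
  connection: process.env.DATABASE_URL || 'postgres://docnow@localhost/docnow',
  migrations: {directory: './migrations'}
}
```

server/database could then call `knex.migrate.latest()` on startup so that server/api, server/url-fetcher, and server/stream-loader all run against the same schema version.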