JAMStackifying the pollution dashboard

A. Documentation


Goals

  1. Make it blazingly fast to query pollution data for the user, no matter what the date, timespan, or number of cities
  2. Ensure that the data is as close to real-time as possible
 

Constraints

  1. Cost cannot be too high. Great user experience is important, but it has to be relatively inexpensive to maintain
  2. Security is super important. Need to have auth before the data is accessed
  3. Cannot be so computationally intensive on the front-end that it produces lag/overheating, and cannot be so slow on the backend that it introduces significant latency

Approach

  1. Use a JAMStack-based approach: have a basic serverless function "sum up" the data from the Cloudflare KV store to serve requests
  2. Update the "monthly" data (which contains hour-by-hour values) every week, and use the "live" weekly and hourly data for the rest of the calculations
 

Specifications

  • For city-level details: Store data for each city and each month from 2016 onwards at an hourly level in the key-value store, with key schema CITYNAME_YYYYMM and value schema {"POLLUTANT_NAME": [[TIMESTAMP, VALUE], ... ], ... }. All further processing can be done on the Cloudflare Worker. Since we are charged per write request, I will just overwrite the CITYNAME_YYYYMM entry with new data when available
  • For overall details, given a month: Store data for each month with the average value for all cities. Key schema: YYYYMM. Value schema: {DATE: {POLLUTANT_NAME: [[CITY_NAME, AVG_VALUE], ... ]}, ... }
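To make the two schemas above concrete, here's an illustrative example (the city names, timestamps, and values are made up):

import json

# City-level entry: key = CITYNAME_YYYYMM, value = hourly [timestamp, value] pairs per pollutant
city_key = "DELHI_202101"
city_value = {
    "PM2.5": [[1609459200, 182.4], [1609462800, 176.9]],
    "PM10": [[1609459200, 240.1], [1609462800, 231.7]],
}

# Month-level entry: key = YYYYMM, value = per-date, per-pollutant city averages
month_key = "202101"
month_value = {
    "20210101": {
        "PM2.5": [["DELHI", 165.2], ["MUMBAI", 88.4]],
        "PM10": [["DELHI", 221.0], ["MUMBAI", 120.3]],
    },
}

print(json.dumps({city_key: city_value, month_key: month_value}, indent=2))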

B. Code


[Done!] Step 1: Get data for all past months, and save it in the Key-Value store

Note: This may cause the SQL database to fail. So do it only at night!
 
First, get data from SQL
SELECT * FROM cpcb WHERE date_int >= %d AND date_int <= %d
CITY-LEVEL DETAILS: Then, do some cleaning in pandas and aggregate data by city, by hour, and by pollutant type
Upload the data to the pollution-city namespace in the key-value store
MONTH-LEVEL DETAILS: Do some cleaning in pandas and aggregate data by date, pollutant type, and city
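A minimal sketch of that pipeline, assuming a long-format cpcb table with city, pollutant, timestamp, and value columns (only the table name and date_int come from the actual setup; the connection string and other column names are assumptions):

import pandas as pd
from sqlalchemy import create_engine

# Connection string and column names (city, pollutant, timestamp, value) are assumptions
engine = create_engine("postgresql://user:password@host/db")
query = "SELECT * FROM cpcb WHERE date_int >= %d AND date_int <= %d" % (20210101, 20210131)
df = pd.read_sql(query, engine)

# City-level details: average value per city, per pollutant, per hour
df["hour"] = pd.to_datetime(df["timestamp"]).dt.floor("H")
city_hourly = df.groupby(["city", "pollutant", "hour"])["value"].mean().reset_index()

# Month-level details: average value per date, per pollutant, per city
df["date"] = pd.to_datetime(df["timestamp"]).dt.strftime("%Y%m%d")
month_daily = df.groupby(["date", "pollutant", "city"])["value"].mean().reset_index()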

[Done!] Step 2: Modify the cronjobs so they automatically update the relevant files in KV

Since there's no "direct" client library, just make a normal HTTP request to the KV bulk endpoint
# NOTE: X-Auth-Key is the account API key, NOT the API token
curl -X PUT "https://api.cloudflare.com/client/v4/accounts/ACCOUNT_ID/storage/kv/namespaces/NAMESPACE_ID/bulk" \
     -H "X-Auth-Email: USER_EMAIL" \
     -H "X-Auth-Key: USER_AUTH_KEY" \
     -H "Content-Type: application/json" \
     --data '[{"key":"My-Key","value":"Some string"}, {"key":"My-Key-2","value":"Some string"}, ...]'
Or from Python:
import json
import requests

url_to_put = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/storage/kv/namespaces/{KV_NAMESPACE_ID}/bulk"

headers = {
    "X-Auth-Email": EMAIL_ID,
    "X-Auth-Key": AUTH_KEY,  # note: this is the AUTH_KEY, NOT the AUTH_TOKEN
    "Content-Type": "application/json"
}

# fin_dets is a list of items in the {"key": key, "value": value} format
res = requests.put(url=url_to_put, data=json.dumps(fin_dets), headers=headers)
print(res.json())
Use environment variables for the account ID, namespace ID, email, and auth key, and you're good to go
 
Options
  • Create an hourly cronjob to update both the hourly data and the daily data (this is a bit wasteful, especially for the daily data)
  • Maybe do the daily data once a day, and just get the "latest" day's data by aggregating the hourly data? Doing this would significantly cut down on the number of write requests
 
Option I finally decided on:
Update the hourly data after every cronjob run
Update the daily data at the end of every day by creating a simple cron within Cloudflare Workers!
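A sketch of what that end-of-day roll-up computes. The real version runs inside the scheduled Worker (in JS); this is just the aggregation logic in Python, with made-up cities and values, and timezone handling glossed over:

from collections import defaultdict
from datetime import datetime, timezone

# Made-up hourly readings in the city-level format: {city: {pollutant: [[timestamp, value], ...]}}
hourly = {
    "DELHI": {"PM2.5": [[1609459200, 182.4], [1609462800, 176.9]]},
    "MUMBAI": {"PM2.5": [[1609459200, 88.1], [1609462800, 90.3]]},
}

def daily_averages(hourly_by_city, date_str):
    """Collapse one day's hourly readings into {DATE: {POLLUTANT: [[CITY, AVG], ...]}}."""
    out = defaultdict(list)
    for city, pollutants in hourly_by_city.items():
        for pollutant, readings in pollutants.items():
            day_values = [
                value for ts, value in readings
                if datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y%m%d") == date_str
            ]
            if day_values:
                out[pollutant].append([city, round(sum(day_values) / len(day_values), 2)])
    return {date_str: dict(out)}

print(daily_averages(hourly, "20210101"))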
 
Note: on the front-end, when making daily data requests, don't include the current day. Values for it would be misleading anyway, since it won't have a full day's worth of data.

[Done!] Step 3: Create some code on Workers to clean up the data and convert it into a form usable at the front-end

  • Check auth
  • Get data for the relevant city and relevant time period from KV store
  • Remove all data that's not needed, or combine multiple values together
  • Reduce data to the amount of granularity required
  • Convert concentration data to AQI data, if needed
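On the AQI conversion, the usual approach is linear interpolation between breakpoints. A sketch for PM2.5, using what I believe are the CPCB 24-hour breakpoints (verify against the official tables before relying on them):

# Breakpoints: (conc_low, conc_high, aqi_low, aqi_high), believed to be the CPCB PM2.5 bands
PM25_BREAKPOINTS = [
    (0, 30, 0, 50),
    (31, 60, 51, 100),
    (61, 90, 101, 200),
    (91, 120, 201, 300),
    (121, 250, 301, 400),
    (251, 500, 401, 500),
]

def concentration_to_aqi(conc, breakpoints=PM25_BREAKPOINTS):
    """Map a concentration to its AQI sub-index by linear interpolation."""
    for c_lo, c_hi, i_lo, i_hi in breakpoints:
        if c_lo <= conc <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (conc - c_lo) + i_lo)
    return 500  # cap at the top of the scale

print(concentration_to_aqi(75.0))  # -> 149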
 
Request params:
  • List of cities
  • Time from, in YYYYMMDD (IST)
  • Time to, in YYYYMMDD (IST)
  • Pollutant requested (PM2.5, PM10, O3, NO2, CO)
  • Time granularity (hours/days/weeks/months)
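For illustration, a client-side request with those params might look like the following (the endpoint URL, parameter names, and auth header are all placeholders, not the real API):

import requests

WORKER_URL = "https://pollution.example.workers.dev/api/data"  # placeholder endpoint

params = {
    "cities": "DELHI,MUMBAI",   # list of cities
    "from": "20210101",         # time from, YYYYMMDD (IST)
    "to": "20210131",           # time to, YYYYMMDD (IST)
    "pollutant": "PM2.5",       # PM2.5 / PM10 / O3 / NO2 / CO
    "granularity": "days",      # hours / days / weeks / months
}
headers = {"Authorization": "Bearer SOME_TOKEN"}  # placeholder; the Worker checks auth first

res = requests.get(WORKER_URL, params=params, headers=headers)
print(res.json())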
💡 Limit discovered! Cloudflare Workers imposes a 50 ms CPU-time limit on functions, so if a request takes too long, the function dies because the CPU time gets way too high.
My first thought was to be more thoughtful about how requests are sent to the server: instead of sending one request with the entire payload, send multiple smaller requests at once.
That won't work, though, as the main limitation is JSON parsing. JSON.parse blocks the main thread, and since JS is single-threaded, parallelising it doesn't really help. Even if I send different cities in different requests, the JSON parsing in each request remains a huge problem.
I should also profile the function to figure out which functions are slow and where the bottlenecks are.
 
Gotta move from JS to Rust if I want good performance!
Alternatively — save data in a different format. JSON is slow. Maybe save as text that's easy to parse with split?
So instead of {"POLLUTANT_NAME": [[TIMESTAMP, VALUE], ... ], ... }
use a different format that is not nested and works faster. Like a string that is POLLUTANT_NAME_1:t1,v1,t2,v2, ...|POLLUTANT_NAME_2:t1,v1,t2,v2, ...|.
Or even remove the pollutant names and save the series based on position (PM2.5 always in first position, PM10 always in second position, etc.)
Alternatively, use more keys (city_month_pollutant) and have the value be plain text? (went with this option)
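A sketch of that flat encoding for the chosen key scheme (encoding happens in the Python upload job; the Worker decodes the same format in JS, shown here in Python only to illustrate it):

def encode_series(readings):
    """[[timestamp, value], ...] -> 't1,v1,t2,v2,...'"""
    return ",".join(f"{ts},{value}" for ts, value in readings)

def decode_series(text):
    """'t1,v1,t2,v2,...' -> [[timestamp, value], ...]"""
    parts = text.split(",")
    return [[int(parts[i]), float(parts[i + 1])] for i in range(0, len(parts), 2)]

key = "DELHI_202101_PM2.5"  # illustrative city_month_pollutant key
value = encode_series([[1609459200, 182.4], [1609462800, 176.9]])
assert decode_series(value) == [[1609459200, 182.4], [1609462800, 176.9]]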

[Done!] Step 4: Repeat this exercise for METAR data

 

[Done!] Step 5: Integrate the API properly with the front-end and design it in a generalizable way

 

[Done!] Add auth+login/signup. Refactor code and try to generalize. Integrate auth with frontend

How would this work?

If a user is not authenticated, return a 401 error; only return a 200 response if the user is authenticated.
To do this, the user should send a header with some kind of authentication, perhaps an id and a hash?
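One possible shape for that (purely illustrative; the header name, the id-plus-hash scheme, and the signed message are assumptions, and the real check would live in the Worker):

import hashlib
import hmac

SHARED_SECRET = b"per-user-secret-issued-at-signup"  # assumption: a shared secret per user

def make_auth_header(user_id, path):
    """Client side: send an id plus an HMAC of the request path."""
    digest = hmac.new(SHARED_SECRET, f"{user_id}:{path}".encode(), hashlib.sha256).hexdigest()
    return {"X-Auth": f"{user_id}:{digest}"}

def check_auth_header(header_value, path):
    """Server side: recompute the hash and return 200 if it matches, 401 otherwise."""
    user_id, sent_digest = header_value.split(":", 1)
    expected = hmac.new(SHARED_SECRET, f"{user_id}:{path}".encode(), hashlib.sha256).hexdigest()
    return 200 if hmac.compare_digest(sent_digest, expected) else 401

header = make_auth_header("user42", "/api/data")
print(check_auth_header(header["X-Auth"], "/api/data"))  # 200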

C. Nice to haves (Thursday to Sunday)


[Thursday - Sunday] Move the front-end to React + make UI improvements

  • Started learning React, and realized that it may be overkill at this point in time. But it may be worth learning anyway, because it'll help me scale much better (and faster) in the future, especially if a lot of DOM manipulation (tables, images, charts) is involved
  • Can consider moving from Bootstrap to Tailwind, or at the very least having much more customized CSS than I currently do. Should also either restyle the graphs in Highcharts, or build them from scratch in D3
 
As an aside, I need to ensure:
  • consistent animations to indicate when data is loading
  • consistent datepickers and dropdowns
  • consistent graph styles
  • consistent animations for buttons, dropdowns, graph load etc

Create automated reports about something (say, pollution levels?) and host that on Cloudflare pages

  • Medium. Try to do on Friday evening/night

Add Stripe integration

Easy

Add testing (unit tests for each sub-function)

Easy

Copy the entire "pollution" worker to Rust [entirely optional — but may bring performance improvements in the future]

Easy, but will take time as I learn some of the fundamentals of Rust

D. Future direction


  • Add more data sciency things to this. Like whether or not two variables are correlated, has there been a sharp spike in one variable etc. Basic predictions would be dope, too
  • Create live popper-like reports with data on the fly, with an option of saving these reports somewhere
  • Scrape data directly from CF workers and put it in Google Sheets/KV storage — can reduce reliance on GCP this way + maybe open up other use cases?
  • Build the last (delivery) layer of the personalization algorithm on Workers. The calculation of item similarity can stay on GCP, while the similarity scores can be in KV. Can respond to updates in user preferences much faster this way