This is a PSA that you can build a surprising variety of tools to run in the browser, tools that you might have expected to only run on the command line or on a server.
Some context: I like toolmaking as a developer identity. I naively used to wonder if I could find a job making internal tools.
Turns out that, depending on the company, you can just fill a normal job description and develop internal tools as needed. That's the case at my current job. I'm a new hire at a smallish company, finding many opportunities to be a generalist under the pretense of web support.
One obstacle at the moment is that I don't have a good way to deploy custom code. At most, we use a low-code integration platform, but I haven't been introduced to it yet. I've been given a folder in public_html to deploy some internal tools in the meantime.
I'm ok with this! For most of my programming life, I've practiced building things that can be deployed as a static HTML page (read: not touching the backend). It's such a convenient and powerful way to share software.
So what have I put in public_html?
FONT MANIPULATION WITH PYODIDE
One of our dependencies chokes on variable fonts. I worked around the error with a Python one-liner:
uvx fonttools@4.62.1 varLib.instancer variable_font.ttf wght=700 -o bold_static_subfont.ttf
But non-technical teammates needed a way to replicate that without opening a terminal.
I used Pyodide to browser-ify the script:
Pyodide makes it possible to install and run Python packages in the browser with micropip. Any pure Python package with a wheel available on PyPI is supported. [Many other packages have also been ported for use with Pyodide.]
I found Pyodide remarkably convenient, especially since fonttools was among the 300-ish prepackaged libraries.
This was a quick page whipped up in a day or two. Kinda like a write-once maintain-never Perl script. Half the work was figuring out how to move bytes from the user to JavaScript to Python. The boilerplate looks something like:
dropzone.addEventListener('drop', async (e) => {
  let file = e.dataTransfer.files[0]
  let arrayBuffer = await file.arrayBuffer()
  let uint8 = new Uint8Array(arrayBuffer)

  let virtualInPath = 'in.ttf' // arbitrary paths in Pyodide's virtual filesystem
  let virtualOutPath = 'out.ttf'

  let pyodide = await loadPyodide()
  await pyodide.loadPackage('micropip')
  pyodide.FS.writeFile(virtualInPath, uint8)

  await pyodide.runPythonAsync(`
import micropip
await micropip.install("fonttools")
from fontTools.ttLib import TTFont
font = TTFont("${virtualInPath}")
# do work, producing new_font ...
new_font.save("${virtualOutPath}")
  `)

  let uint8Output = pyodide.FS.readFile(virtualOutPath)
  let blob = new Blob([uint8Output], { type: 'font/ttf' })
  let url = URL.createObjectURL(blob)
  let a = document.createElement('a')
  a.href = url
  a.download = 'processed_font.ttf'
  a.click()
})
ENCRYPTED PAGES WITH PAGECRYPT
Since public HTML is viewable by anyone with a link, we should not upload confidential content. But what if we upload encrypted content that can only be decrypted by fellow employees? This is the premise of Pagecrypt.
With Pagecrypt, users access pages by entering a password. I think we can do better. Nobody wants to manage more passwords. So I rewrote everything to use an OAuth login button instead of a password box.
Completing the OAuth flow results in an auth token. Ideally, I would tell our server -- or have an OAuth proxy layer tell our server -- to only respond to requests that carry an auth token. But I can't tell our server to do anything! It keeps serving static files to everyone.
Instead of using the auth token to request content from our server, the page uses the auth token to fetch a password from somewhere abstracted away from users. Somewhere gross and shameful... cell A1 of a spreadsheet in a shared Google Drive. I wonder if there's a better way to store a shared secret in this scenario.
Each public HTML page is a login template injected with an encrypted payload. The page fetches the password, decrypts the payload, and overwrites itself. It looks something like:
<h1>This page is only for $COMPANY employees</h1>
<button onclick="startLogin()">Sign in with Google</button>
<script type="text/plain" id="encrypted-payload">skPenNIiq3...lkfxKdwZ9h</script>
<script>
  startLogin = () => window.location.href = 'https://accounts.google.com/o/oauth2/v2/...'

  async function onSuccessfulLogin(authToken) {
    let response = await fetch('https://sheets.googleapis.com/v4/...', {
      headers: { Authorization: `Bearer ${authToken}` }
    })
    let secretSpreadsheet = await response.json()
    let password = secretSpreadsheet.values[0][0]

    let encryptedPayload = document.querySelector('#encrypted-payload').textContent
    let decryptedPayload = await decrypt(encryptedPayload, password)
    document.write(decryptedPayload)
    document.close()
  }
</script>
I'm omitting a few hundred lines of error handling, OAuth plumbing, and window.crypto.subtle functions.
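In case it's useful, here's a rough sketch of what that decrypt helper could look like. It assumes the payload is base64-encoded salt + IV + ciphertext and uses PBKDF2 with AES-GCM, which is roughly what Pagecrypt does; the exact layout and iteration count are my own placeholders, not Pagecrypt's.

async function decrypt(encryptedPayload, password) {
  // Assumed layout: base64( 16-byte salt || 12-byte IV || ciphertext )
  let bytes = Uint8Array.from(atob(encryptedPayload), c => c.charCodeAt(0))
  let salt = bytes.slice(0, 16)
  let iv = bytes.slice(16, 28)
  let ciphertext = bytes.slice(28)

  // Derive an AES-GCM key from the shared password
  let baseKey = await crypto.subtle.importKey(
    'raw', new TextEncoder().encode(password), 'PBKDF2', false, ['deriveKey'])
  let key = await crypto.subtle.deriveKey(
    { name: 'PBKDF2', salt, iterations: 100_000, hash: 'SHA-256' },
    baseKey, { name: 'AES-GCM', length: 256 }, false, ['decrypt'])

  let plaintext = await crypto.subtle.decrypt({ name: 'AES-GCM', iv }, key, ciphertext)
  return new TextDecoder().decode(plaintext)
}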
For completeness, here's what build.mjs looks like:
// pseudocode: read, write, and encrypt are small helpers, and the selector line
// stands in for injecting the payload into the template
let contentToEncrypt = read('src/index.html')
let encryptedPayload = encrypt(contentToEncrypt, password)
let template = read('template.html')
template('#encrypted-payload').textContent = encryptedPayload
write('dist/index.html', template)
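The encrypt half, running in Node at build time, might look something like this, producing the layout the decrypt sketch above assumes. Again, a sketch rather than the real build script:

import { webcrypto as crypto } from 'node:crypto'

async function encrypt(plaintext, password) {
  let salt = crypto.getRandomValues(new Uint8Array(16))
  let iv = crypto.getRandomValues(new Uint8Array(12))

  let baseKey = await crypto.subtle.importKey(
    'raw', new TextEncoder().encode(password), 'PBKDF2', false, ['deriveKey'])
  let key = await crypto.subtle.deriveKey(
    { name: 'PBKDF2', salt, iterations: 100_000, hash: 'SHA-256' },
    baseKey, { name: 'AES-GCM', length: 256 }, false, ['encrypt'])

  let ciphertext = new Uint8Array(await crypto.subtle.encrypt(
    { name: 'AES-GCM', iv }, key, new TextEncoder().encode(plaintext)))

  // Concatenate salt || iv || ciphertext and base64-encode for embedding in the template
  return Buffer.from([...salt, ...iv, ...ciphertext]).toString('base64')
}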
I think self-decrypting pages are cool. Kudos to the Pagecrypt author for sharing the idea and implementation.
But I am not comfortable with my franken-cryption. It feels too clever and hacky and brittle. I see several downsides:
- this depends on something outside of public_html
- I had to spin up a whole Google Cloud Console project
- I have to allowlist a new redirect URI for each public HTML page
- I anticipate someone accidentally modifying, moving, or deleting the secret spreadsheet
- I don't want to condition employees to click on more "Sign in with Google" buttons. Any adversary can spin up a Google project, name it something believable, and ask for read or write permissions just like I did
- if a savvy adversary gets a link to a public HTML page, they could save the page, take their time brute-forcing the decryption key, and retain access to all assets and functionality. So just in case, I make sure to only serve semi-sensitive read-only content, not highly sensitive content or tools that modify company data.
I think it's good enough as a temporary solution. Hopefully we get proper server-side authentication soon.
DOCS WITH STARLIGHT
I started taking onboarding notes on my first day. Eventually, the notes grew into a wiki worth sharing. My first instinct was to share it as a static site.
I probably should've put my notes somewhere simpler like a Google Doc. Maybe a Slack channel like "Justin's Dumb Questions" could be an informal wiki.
There is nothing surprising about a static site in public_html. But I still wanted to include this section so I could praise the tooling and share a couple new-to-me configurations.
Starlight is great for quick documentation sites.
At its simplest, you put markdown files in src/content/docs.
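For reference, the whole config can be about as small as this (a sketch; the site title is a placeholder):

// astro.config.mjs
import { defineConfig } from 'astro/config'
import starlight from '@astrojs/starlight'

export default defineConfig({
  integrations: [
    starlight({ title: 'Onboarding Notes' })
  ]
})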
I like what's generated automatically: a clean, readable theme; tables of contents; heading slugs; and most of all, a search box using Pagefind.
Pagefind is great.
At build time, Pagefind indexes site content and writes special binary artifacts to dist/.
When the user types in the search box, the Wasm search engine reads those index files.
contents of dist/pagefind/:
fragment/en_258b5fb.pf_fragment
fragment/en...
fragment/...
index/en_194adb5.pf_index
index/en..
index/...
pagefind.en_dad68cd660.pf_meta
pagefind...js
...
pagefind...css
wasm.en.pagefind
wasm.unknown.pagefind
But I don't have to worry about any of that. Starlight's Pagefind integration works well out of the box -- no need to fiddle with build steps or UI components. It's a superb developer experience.
One of my site tweaks was a custom post-build step.
I wanted to clean up URLs and also encrypt pages as mentioned earlier.
As of astro@5.10.0, here is the boilerplate for a custom build step:
// astro.config.mjs
import { defineConfig } from 'astro/config'
import fs from 'node:fs/promises'
import url from 'node:url'
import path from 'node:path'

export default defineConfig({
  integrations: [
    ...
    myBuildStep()
  ]
})

function myBuildStep() {
  return {
    name: 'myBuildStep',
    hooks: {
      'astro:build:done': async ({ pages, dir }) => {
        for (let page of pages) {
          let filepath = path.join(url.fileURLToPath(dir), page.pathname, 'index.html')
          let htmlContent = await fs.readFile(filepath, 'utf-8')
          let transformedHtmlContent = TODO(htmlContent)
          await fs.writeFile(filepath, transformedHtmlContent, 'utf-8')
        }
      }
    }
  }
}
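For the encryption case, the TODO could be roughly the page-wrapping step from build.mjs applied per page. Here encrypt and password are the helpers sketched earlier, and the template path and placeholder markup are assumptions of mine; since it's async, you'd await it inside the hook:

async function TODO(htmlContent) {
  // Encrypt the built page and inject it into the login template from earlier
  let encryptedPayload = await encrypt(htmlContent, password)
  let template = await fs.readFile('template.html', 'utf-8')
  return template.replace(
    '<script type="text/plain" id="encrypted-payload"></script>',
    `<script type="text/plain" id="encrypted-payload">${encryptedPayload}</script>`)
}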
One more site tweak: I wanted to deploy the site to a subfolder like example.com/a/b/my-astro-site.
I needed a config like defineConfig({ base: '/a/b' }) in order to make Astro build all links and references relative to that path.
Using <base href="/a/b"> was not enough.
This might break your HTTP dev server. To be clear, the default Vite-powered dev server will work fine. But I prefer to run my own dev server for various reasons -- one reason being that Pagefind doesn't work on the normal dev server.
If you're weird like me and use Caddy as a dev server, you can configure it to handle subfolder paths with something like:
caddy run --adapter caddyfile --config <(echo '
http://localhost:8000 {
  handle_path /a/b* {
    root * dist
    file_server
  }
}')
QUERYING DATABASE HISTORY WITH DATASETTE OR DOLT OR DUCKDB
For quality control, we want to audit our product catalog with queries like "collect all product descriptions" or "find products where attribute X is null". And we also want to query historical data like "view a history of all changes to product ABC-123" or "view a calendar showing when products were added or removed" or "compare yesterday's catalog state to today's catalog state".
Unfortunately, our legacy CMS provider only gives us a few preset reports. So I'm scraping our product catalog and storing it in another database where we can see everything.
I just found out that we might have access to a nightly data dump. I should've asked around more.
Either way, data retrieval happens outside of public_html. I'll focus on what goes inside public_html now.
First, I need to flatten the data, because nested JSON objects and arrays are not easily imported into a database. Products currently look something like:
{
  sku,
  name,
  description,
  ...
  price_breaks: [
    { quantity, price },
    { quantity, price },
    ...
  ],
  custom_attributes: {
    type: [ { name, value }, { name, value }, ... ],
    type: [ { name, value }, { name, value }, ... ],
    ...
  }
}
Once the data is flattened, I can start thinking about database tooling.
How can I store and query a database while staying inside public_html? And what "temporal" database features exist to store a history of updates?
I investigated a few options.
1) DATASETTE
The git scraping pattern was the first solution that came to mind. Tooling already exists to convert a git history of JSON scrapes into an SQLite database, and then query the database in the browser with Datasette Lite -- which, funny enough, is built with Pyodide as mentioned earlier.
I'm not sure if the history format and diff UI (demo) will support my use cases. I need to flesh out a prototype to be sure. But if I have to redesign the SQL and UI, I might as well consider other options.
2) DOLT
Dolt is a version controlled database, like if "Git and MySQL had a baby". Catalog management is a known use case for Dolt, so that's promising.
The import step seems simple enough.
On every scrape, I could run something like
dolt table import products.json and dolt commit.
Then to read the database, I think I could use their Wasm build in the browser.
Reading the latest catalog state would be a simple SELECT * FROM products.
To read historical data, I would query auto-generated tables like dolt_diff_products or dolt_history_products.
Dolt has a nice diff viewer (demo, source?) that seems to be using this React component. Maybe those components will be easy to put on my own page.
Like any new tech, Dolt is likely to have breaking changes and missing documentation. I'm not sure if the pros outweigh the cons of using un-boring technology.
I need to build a demo to confirm all my assumptions.
3) DUCKDB
Instead of bolting on extra database history tooling, I can design a "vanilla" SQL table upfront to support historical data.
I could make an append-only table product_histories where each record represents the state of a product at the point in time when it was inserted or updated.
I found this example from Stack Overflow:
{
sku: ABC-123
name: new name
...
version: 2
timestamp
},
{
sku: ABC-123
name: old name
...
version: 1
timestamp
},
{
sku: XYZ-987
name: something
...
version: 1
timestamp
},
Which seems pretty much the same as historized attributes, now storing intervals instead of points in time:
sku name description ... valid_from valid_to
Apparently this is a type of "slowly changing dimension". The Dolt guy explains it better.
I'm not sure how queries would look. I know that to make queries easier, I should have one table for current catalog state and another table for historical data.
But I have no intuition for anything else.
Frankly, this is my first time working with SQL beyond a trivial SELECT FROM WHERE.
I have lots of research and prototyping ahead of me.
I'll leave some breadcrumbs of related tools in case they're relevant:
- Postgres has a Wasm build, but the temporal tables extension has not been ported
- JSON patch (RFC 6902) is a diff format
- TerminusDB is a graph database similar to Dolt, but it's very serverful by design with no Wasm build (demo)
- Datomic can do time traveling queries, but it seems too clever for me, and I don't have a Clojure-friendly environment anyways
- Is there a simple way to use Google Sheets for scraping and tracking historical data so I don't have to write so much code?
After I figure out the SQL design, I will still need a way to store, query, and render the data.
DuckDB seems great for storing and querying data. It can convert JSON to the highly efficient Parquet format. It can run queries in the browser using the Wasm build. And I can write a barebones UI on top of everything.
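As a sketch of the browser side: the setup below follows @duckdb/duckdb-wasm's documented jsDelivr bundle pattern, while the Parquet file name and the audit query are placeholders of mine (this assumes a module script so top-level await works):

import * as duckdb from '@duckdb/duckdb-wasm'

// Spin up DuckDB in a web worker using the CDN-hosted bundles
let bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles())
let workerUrl = URL.createObjectURL(
  new Blob([`importScripts("${bundle.mainWorker}");`], { type: 'text/javascript' }))
let db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), new Worker(workerUrl))
await db.instantiate(bundle.mainModule, bundle.pthreadWorker)

// Query a Parquet file served alongside the page in public_html
let conn = await db.connect()
let result = await conn.query(`
  SELECT sku, name
  FROM read_parquet('${new URL('products.parquet', location.href)}')
  WHERE description IS NULL
`)
console.table(result.toArray())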
This is all part of a bigger vision to scaffold our legacy CMS with tests, observability, and version control. Some potential future steps:
- scrape our HTML/CSS/JS to finally have a local copy of source code
- then we can grep it, run it locally, run tests, isolate bugs
- and use version control, write commit messages, develop on feature branches
- then we could even roll back both the API and the frontend code to a particular point in time for time-travel debugging
POSTSCRIPTUM
In hindsight, I feel like this is a lot of overengineered, overeager junior behavior.
Wouldn't obtaining a VPS, or agreeing to deploy to a spare office laptop, be easier than contorting oneself inside public_html?
I guess it depends on who makes the decisions. I am still learning to navigate the non-technical parts of the job.
I also remind myself that these are not load-bearing tools: in a sense, it's all still throwaway code outside of my day-to-day responsibilities, not too painful to dispose of or replace if the team wishes.
But there will always be something elegant about whipping up a client-side tool, dropping it in public_html, and sharing a link.