hot take: indexeddb may not be as reliable as you think

--- the following post has been migrated from my old blogs and reposted here ---

as a developer building user-facing features with the indexeddb api, i expected it to be stable enough for offline recording. that assumption cost a user four hours of work and dropped a bomb on me at the worst possible time: early morning (5am, to be exact).

what happened?

i want you to picture this:

a user on android chrome recorded a 4-hour meeting, hit stop, and nothing showed up.

what do you mean nothing showed up??

there was no file, no recording session, nothing but a bunch of errors flagging us: "recording not found."

at first, i thought it was something our recovery system had caught, so we were just chilling. but the deeper i dug, the clearer it became that we had no clue what had just happened.

cue: this is fine

i then thought maybe they had switched to another tab or switched the phone off, but the user eventually confirmed they hadn't done any of those things. so now it's totally on me.

what's weird is that the recording kept going even after the failure. in fact, the failure only surfaced at the very end! here's my best-guess timeline:

Time Range	Event Description
T+0m	user recording started (session created successfully)
T+0m → xm	user is recording successfully
T+xm → xm	???????
T+xm → 240m	user hits done and woah nothing's there!!

so many questions (investigation)

first of all, why didn't the recorder stop as soon as it happened? we have checks on every chunk to make sure the session exists, BUT it just decided to nuke itself at the very end.

i admit: this is a design flaw and we should've had backups in place. but that's the issue, i trusted indexeddb way too much.

anyways.....

to understand the system: we saved audio in 5-second chunks, tracked by a session record. during debugging, we couldn't reproduce the issue at first. so we manually deleted the indexeddb entry for that session and, lo and behold:

same issue, replicated successfully
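
to make that repro concrete, here's a minimal sketch of the shape of the problem. everything in it is an assumption for illustration (the names `SessionRecord`, `ChunkRecord`, `finalizeRecording`, and the in-memory stand-ins for the two stores), not our actual schema. the point is that the chunks can all be sitting there, but once the session record is gone, the lookup at the end has nothing to hang them on:

```typescript
// hypothetical shapes; the real schema isn't shown in this post
interface SessionRecord {
  id: string;
  startedAt: number;
}

interface ChunkRecord {
  sessionId: string;
  index: number; // position of the chunk within the recording
  bytes: Uint8Array;
}

const CHUNK_SECONDS = 5; // from the post: audio is saved in 5-second chunks

// in-memory stand-ins for the session store and the chunk store
const sessions = new Map<string, SessionRecord>();
const chunks: ChunkRecord[] = [];

// look up the session first, then collect its chunks in order
function finalizeRecording(sessionId: string): ChunkRecord[] {
  const session = sessions.get(sessionId);
  if (!session) {
    // the chunks may still exist, but without the session they're orphaned
    throw new Error("recording not found");
  }
  return chunks
    .filter((chunk) => chunk.sessionId === sessionId)
    .sort((a, b) => a.index - b.index);
}

// how many chunks a recording of a given length should have produced
function expectedChunkCount(durationSeconds: number): number {
  return Math.ceil(durationSeconds / CHUNK_SECONDS);
}
```

for scale: a 4-hour recording is 14,400 seconds, so `expectedChunkCount(4 * 60 * 60)` comes out to 2,880 chunks hanging off a single session record.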

so what happened? what deleted our indexeddb entry?

we still have no idea, even now. we don't have any code that deletes session entries, so either .put failed (it didn't) or something else happened. so we shifted to making the system more robust instead.

while figuring stuff out, i came across a post by @pesterhazy called "the pain and anguish of using indexeddb: problems, bugs and oddities" and learned that indexeddb is somewhat unreliable.

not just for us, but apparently also for firebase, pouchdb, amplify, and basically any library that thought using browser storage was a good idea. the post describes cases where safari randomly deletes storage after a few days, where transactions hang without throwing, and where .put() fails silently but resolves anyway. even on chrome, tab throttling, memory pressure, and devtools can break your data without a trace. keep in mind that most of these happen in safari, and chrome's implementation is supposedly the "best". which still leaves the question:

how did we end up in this situation when we have no code that deletes anything from indexeddb for this feature?

[start] store.put (success)
[middle-end] random deletion without even a QuotaExceededError or any other error.

so no, this wasn’t a bug we could just fix. we literally don't know what happened, there were no stack traces, there was absolutely nothing.

rewrite (mitigation)

since we couldn’t prevent this class of bug, we started rewriting the whole recording flow to survive it. you're looking at:

failing loudly!!!!! (yes)
having redundancy (uploading it to a storage bucket in the bg)
another redundancy (separate metadata from chunks)
and verification everywhere there's an operation to make sure stuff's actually there and in the right order.
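
the "verify it's there and in the right order" part could look something like the sketch below. it's not our production code: `ChunkMeta` and `verifyChunkSequence` are hypothetical names, and it assumes chunk indexes start at 0 and that you know how many chunks to expect. the idea is just that any gap, reorder, or empty chunk throws right away instead of sliding through:

```typescript
// hypothetical shape: just enough metadata to check a chunk without loading its audio
interface ChunkMeta {
  index: number; // position in the recording, starting at 0
  byteLength: number; // size of the stored audio data
}

// throws on the first problem found; returns normally only if everything checks out
function verifyChunkSequence(stored: ChunkMeta[], expectedCount: number): void {
  if (stored.length !== expectedCount) {
    throw new Error(`expected ${expectedCount} chunks, found ${stored.length}`);
  }
  stored.forEach((chunk, position) => {
    // every chunk must sit at the position its index says it should
    if (chunk.index !== position) {
      throw new Error(`chunk at position ${position} has index ${chunk.index}`);
    }
    // a chunk that exists but holds no data is as bad as a missing one
    if (chunk.byteLength === 0) {
      throw new Error(`chunk ${chunk.index} is empty`);
    }
  });
}
```

running a check like this after every write (and again before finalizing) is what turns a silent failure at the 4-hour mark into a loud one within seconds of it happening.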

so what now?

i no longer trust anything that resolves without checking the result. if a write says “done,” i double-check, and it's making me paranoid. if a session disappears, the system stops. if storage breaks, i recover from another source.

indexeddb will still do what it wants, but at least now, when it breaks, it breaks absurdly loud.
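
as a rough sketch of what "double-check every write, recover from another source" means in code: the `AsyncStore` interface, `MemoryStore`, `putAndVerify`, and `getWithFallback` below are all made up for illustration, and the in-memory store only exists so the example runs on its own. with real indexeddb, `put` would wrap a readwrite transaction and the read-back would ideally happen in a separate transaction after the first one completes.

```typescript
// hypothetical minimal async store interface, not a real library api
interface AsyncStore<V> {
  put(key: string, value: V): Promise<void>;
  get(key: string): Promise<V | undefined>;
}

// in-memory stand-in; dropWrites simulates a put that resolves but stores nothing
class MemoryStore<V> implements AsyncStore<V> {
  private data = new Map<string, V>();
  private dropWrites: boolean;

  constructor(dropWrites = false) {
    this.dropWrites = dropWrites;
  }

  async put(key: string, value: V): Promise<void> {
    if (!this.dropWrites) {
      this.data.set(key, value);
    }
  }

  async get(key: string): Promise<V | undefined> {
    return this.data.get(key);
  }
}

// write, then read back and compare; a resolved put alone is not trusted
async function putAndVerify<V>(
  store: AsyncStore<V>,
  key: string,
  value: V,
  isSame: (a: V, b: V) => boolean,
): Promise<void> {
  await store.put(key, value);
  const readBack = await store.get(key);
  if (readBack === undefined || !isSame(readBack, value)) {
    throw new Error(`write verification failed for key "${key}"`);
  }
}

// try the primary store first, then the backup; fail loudly if both are empty
async function getWithFallback<V>(
  primary: AsyncStore<V>,
  backup: AsyncStore<V>,
  key: string,
): Promise<V> {
  const fromPrimary = await primary.get(key);
  if (fromPrimary !== undefined) return fromPrimary;
  const fromBackup = await backup.get(key);
  if (fromBackup !== undefined) return fromBackup;
  throw new Error(`"${key}" is missing from every source`);
}
```

the read-back doubles the number of storage operations per write, which is a cost. for a recorder that writes one chunk every 5 seconds, that trade felt easy to make.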

my takeaways on this

don't, don't, don't, don't ever trust blindly, and please, please add redundancy even if you're sure it will work. always verify, because goddamn, i don't want to be woken up at 6am because i blindly trusted a supposedly well-documented and widely used api.

about the author:

ryana