System Design · Hard · 12 min read

Design Google Docs (Collaborative Editor)

How two people type in the same paragraph at the same time without garbling it: OT vs CRDTs, contenteditable, presence cursors, offline edits, and autosave.

Published 20 May 2026 · by Frontend Masters India

"Design Google Docs" is really one question wearing a costume: when two people edit the same sentence at the same instant, how do you end up with one document instead of two corrupted ones. Everything else (toolbars, comments, autosave) is scaffolding around that one hard problem. Interviewers ask it because it forces you to reason about concurrency, conflict resolution, and a notoriously awful browser API all at once.

1. Scope it first

Don't reach for CRDTs in the first sentence. Ask:

How many people edit at once? Two people on a memo is a different problem from forty people on a meeting agenda. It changes how aggressive your conflict handling needs to be.
What can the document contain? Plain text, or rich text with bold, headings, images, tables? Rich text is much harder because formatting spans ranges of characters.
Do we need offline editing? If someone keeps typing on a plane and reconnects an hour later, the merge story gets serious.
How fast must remote changes appear? Sub-second feels live. A few seconds feels broken.

Assume the realistic answers: rich text, dozens of concurrent editors, offline support, near-instant sync. Then design to that.

2. The document model

Never store the document as one giant HTML string. You can't reason about concurrent edits to a string. Model it as a structured tree, the way a real editor framework (ProseMirror, Slate, Lexical) does.

type Doc = {
  type: "doc";
  content: BlockNode[];
};
// Named BlockNode rather than Node so it doesn't clash with the DOM's global Node type
type BlockNode =
  | { type: "paragraph"; content: Inline[] }
  | { type: "heading"; level: number; content: Inline[] };
type Inline = { text: string; marks?: ("bold" | "italic")[] };

Edits are expressed as operations against this model, not as "the new HTML." An operation is something like "insert 'x' at position 42" or "add bold mark to range 10–18." This matters because operations are what you'll send over the wire and transform against each other.
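
To make "operations against the model" concrete, here is a minimal sketch of applying two operations to one paragraph's inline content. The names Op, Char, and applyOp are hypothetical, not from ProseMirror, Slate, or Lexical, and the rule that inserted text inherits the marks of the character before it is an assumption made for the example.

```typescript
// Sketch only: Op, Char and applyOp are illustrative names, not a framework API.
type Mark = "bold" | "italic";
type Inline = { text: string; marks?: Mark[] };

type Op =
  | { kind: "insert"; pos: number; text: string }
  | { kind: "addMark"; from: number; to: number; mark: Mark };

// Per-character form, so positions are plain array indexes.
type Char = { ch: string; marks: Mark[] };

function toChars(content: Inline[]): Char[] {
  const chars: Char[] = [];
  for (const run of content) {
    for (const ch of run.text.split("")) {
      chars.push({ ch, marks: [...(run.marks ?? [])] });
    }
  }
  return chars;
}

// Merge neighbouring characters that carry identical marks back into runs.
function toRuns(chars: Char[]): Inline[] {
  const runs: Inline[] = [];
  let lastKey: string | null = null;
  for (const c of chars) {
    const sorted = [...c.marks].sort();
    const key = sorted.join(",");
    if (key === lastKey) {
      runs[runs.length - 1].text += c.ch;
    } else {
      runs.push(sorted.length > 0 ? { text: c.ch, marks: sorted } : { text: c.ch });
      lastKey = key;
    }
  }
  return runs;
}

function applyOp(content: Inline[], op: Op): Inline[] {
  const chars = toChars(content);
  if (op.kind === "insert") {
    const pos = Math.min(Math.max(op.pos, 0), chars.length);
    // Assumption: inserted text inherits the marks of the character before it.
    const inherited = pos > 0 ? chars[pos - 1].marks : [];
    const added = op.text.split("").map((ch) => ({ ch, marks: [...inherited] }));
    chars.splice(pos, 0, ...added);
  } else {
    for (let i = Math.max(op.from, 0); i < op.to && i < chars.length; i++) {
      if (!chars[i].marks.includes(op.mark)) chars[i].marks.push(op.mark);
    }
  }
  return toRuns(chars);
}
```

Applying an addMark over positions 0 to 5 of "hello world" yields a bold "hello" run followed by a plain " world" run. A real editor would not round-trip through a per-character array on every edit; the sketch trades efficiency for readable positions.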

3. Why contenteditable is the part everyone underestimates

The browser gives you contenteditable, and it looks like a gift. It is a trap. Different browsers produce different DOM for the same keystroke. Press Enter and you might get a <div>, a <p>, or a <br> depending on the browser. Paste from Word and you inherit a pile of junk markup. The DOM the user creates is not a DOM you control.

So the real editors do this: they keep their own model (the tree above) as the source of truth, render it into a contenteditable surface, and intercept input events to translate them back into model operations. The DOM becomes a view you write to, and you fight the browser's default editing behavior the whole way. This is the part most people fumble in interviews because they assume contenteditable "just works."

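editorEl.addEventListener("beforeinput", (e) => {
  // IME composition input can't be cancelled; reconcile it from the DOM on compositionend instead
  if (!e.cancelable) return;
  e.preventDefault();           // don't let the browser mutate the DOM
  const op = mapInputToOp(e);   // translate to a model operation
  applyLocal(op);               // update model, then re-render
  sendToServer(op);
});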
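
The translation step is where the work hides. Below is a rough sketch of what it might look like, written as a hypothetical mapInputToOps that returns a list because one keystroke over a selection is really a delete plus an insert. The inputType strings come from the Input Events spec; EditOp, ModelSelection, and InputLike are names assumed for this example, and only three input types are handled.

```typescript
// Sketch only: EditOp, ModelSelection and InputLike are assumed names.
type ModelSelection = { from: number; to: number };

// Structural subset of the browser's InputEvent, so the function runs without a DOM.
type InputLike = { inputType: string; data: string | null };

type EditOp =
  | { kind: "insert"; pos: number; text: string }
  | { kind: "delete"; from: number; to: number }
  | { kind: "splitBlock"; pos: number };

function mapInputToOps(e: InputLike, sel: ModelSelection): EditOp[] {
  const ops: EditOp[] = [];
  const hasRange = sel.from < sel.to;
  // Typing or pressing Enter over a selection replaces it, so delete it first.
  const deleteRange = (): void => {
    if (hasRange) ops.push({ kind: "delete", from: sel.from, to: sel.to });
  };

  switch (e.inputType) {
    case "insertText":
      deleteRange();
      if (e.data) ops.push({ kind: "insert", pos: sel.from, text: e.data });
      break;
    case "insertParagraph":
      deleteRange();
      ops.push({ kind: "splitBlock", pos: sel.from });
      break;
    case "deleteContentBackward":
      if (hasRange) deleteRange();
      else if (sel.from > 0) ops.push({ kind: "delete", from: sel.from - 1, to: sel.from });
      break;
    default:
      // Paste, drag-and-drop, IME composition and formatting need their own handling.
      break;
  }
  return ops;
}
```

Typing "x" over a selection from 2 to 5 produces a delete of that range followed by an insert at 2. Everything in the default branch (paste, composition, undo) is where most of the real-world pain lives, so say so in the interview rather than pretending the switch is complete.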

4. The hard problem: merging concurrent edits

Two people type into the same line at the same time. Both edits are valid against the document they each saw, but applied naively they conflict. There are two families of solutions, and you should be able to explain both.

Operational Transformation (OT). Every client sends operations to a central server. When op B arrives but op A already happened concurrently, the server (or client) transforms B so it accounts for A. If I insert "cat" at position 5 and you insert "dog" at position 5 at the same time, transformation shifts one of them so both survive. OT needs a central server to order operations, and the transform functions are genuinely hard to get right for rich text. This is what Google Docs actually uses.
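
The "cat" and "dog" case can be sketched in a few lines. This is a toy for plain-text inserts only, with a tie-break on a hypothetical clientId field; it is not Google's algorithm, and real OT also needs transforms for deletes and formatting.

```typescript
// Sketch only: insert-vs-insert transform on a plain string.
type Insert = { pos: number; text: string; clientId: string };

function applyInsert(doc: string, op: Insert): string {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Rewrite `op` so it can be applied after `other` has already been applied.
// Both ops must have been created against the same document state.
function transformInsert(op: Insert, other: Insert): Insert {
  const otherGoesFirst =
    other.pos < op.pos ||
    // Same position: break the tie on clientId so every site picks the same winner.
    (other.pos === op.pos && other.clientId < op.clientId);
  return otherGoesFirst ? { ...op, pos: op.pos + other.text.length } : op;
}

const base = "Hello world";
const cat: Insert = { pos: 5, text: "cat", clientId: "alice" };
const dog: Insert = { pos: 5, text: "dog", clientId: "bob" };

// Alice's site applies cat first, then dog transformed against cat.
const atAlice = applyInsert(applyInsert(base, cat), transformInsert(dog, cat));
// Bob's site applies dog first, then cat transformed against dog.
const atBob = applyInsert(applyInsert(base, dog), transformInsert(cat, dog));
```

Both sites end at "Hellocatdog world". The tie-break is the whole trick: without it, each site would put its own word first and the two copies would diverge.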

CRDTs (Conflict-free Replicated Data Types). Instead of transforming operations, every character gets a unique, sortable identity, and the merge rule is built into the data structure so any two replicas converge no matter what order changes arrive. No central authority needed, which makes them great for offline and peer-to-peer. The cost is metadata: every character carries an ID, so memory and document size grow. Libraries like Yjs and Automerge made this practical.

When to pick which: OT if you already have a central server and want a compact wire format. CRDT if you need offline-first, peer-to-peer, or want to avoid running transform logic on a server. For a from-scratch interview answer in 2026, reaching for a CRDT (Yjs) is the pragmatic call, and saying why you'd lean that way scores well.

// CRDT mental model: characters have IDs, merge is order-independent
[{ id: "a1", ch: "H" }, { id: "b2", ch: "i" }]
// two clients insert concurrently; IDs decide the final order deterministically
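
To see why order of arrival stops mattering, here is a toy sequence CRDT. It is a sketch under loud assumptions: it uses fractional positions with the client name as a tie-breaker, which is not how Yjs or Automerge work internally, and floating-point midpoints would run out of precision in a real document. All names are hypothetical.

```typescript
// Toy sequence CRDT: every character has a unique id and a sortable position.
type Elem = { id: string; ch: string; pos: number; client: string; deleted: boolean };
type Replica = { client: string; counter: number; elems: Map<string, Elem> };

function createReplica(client: string, base?: Replica): Replica {
  const elems = new Map<string, Elem>();
  if (base) base.elems.forEach((e, id) => elems.set(id, { ...e }));
  // Starting the counter at the copied size keeps new ids from colliding with old ones.
  return { client, counter: elems.size, elems };
}

// Total order: position first, client name as the deterministic tie-breaker.
function ordered(r: Replica): Elem[] {
  return Array.from(r.elems.values())
    .filter((e) => !e.deleted)
    .sort((a, b) => a.pos - b.pos || (a.client < b.client ? -1 : a.client > b.client ? 1 : 0));
}

function textOf(r: Replica): string {
  return ordered(r).map((e) => e.ch).join("");
}

function insertAt(r: Replica, index: number, ch: string): void {
  const list = ordered(r);
  const left = index > 0 ? list[index - 1].pos : 0;
  const right = index < list.length ? list[index].pos : 1;
  const id = `${r.client}:${r.counter++}`;
  r.elems.set(id, { id, ch, pos: (left + right) / 2, client: r.client, deleted: false });
}

function deleteAt(r: Replica, index: number): void {
  // Tombstone instead of removing, so a later merge can't resurrect the character.
  ordered(r)[index].deleted = true;
}

// Merge is a set union by id; a delete seen on either side wins.
function mergeInto(target: Replica, other: Replica): void {
  other.elems.forEach((e, id) => {
    const mine = target.elems.get(id);
    if (!mine) target.elems.set(id, { ...e });
    else if (e.deleted) mine.deleted = true;
  });
}
```

Fork two replicas from "Hi", have alice append "!" and bob append "?" concurrently, then merge in either direction: both read "Hi!?". Merging twice changes nothing, because a union is idempotent. The tombstones and per-character records are the metadata cost described above.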

5. Presence: other people's cursors

Live cursors and selections are what make the doc feel alive, and they're a separate channel from the document itself. Presence data is ephemeral. You don't save it, and you don't run it through the conflict resolution path. Broadcast each user's cursor position and selection range over the realtime connection, throttled so a fast typist doesn't flood the channel.
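
One way to throttle without losing the caret's final resting place is to send immediately when the channel has been quiet, and otherwise keep only the newest update for a later flush. A minimal sketch, assuming a hypothetical Presence shape and passing the clock in explicitly so the logic stays easy to test:

```typescript
// Sketch only: Presence and createPresenceThrottle are assumed names.
type Presence = { userId: string; anchor: number; head: number };

function createPresenceThrottle(intervalMs: number, send: (p: Presence) => void) {
  let lastSentAt = -Infinity;
  let pending: Presence | null = null;

  const emit = (p: Presence, now: number): void => {
    send(p);
    lastSentAt = now;
    pending = null;
  };

  return {
    // Call on every selection change.
    update(p: Presence, now: number): void {
      if (now - lastSentAt >= intervalMs) emit(p, now);
      else pending = p; // too soon: remember only the newest state
    },
    // Call from a timer so the last position is delivered once the user stops moving.
    flush(now: number): void {
      if (pending && now - lastSentAt >= intervalMs) emit(pending, now);
    },
  };
}
```

In a browser you would call update(state, Date.now()) from a selectionchange handler and flush(Date.now()) from a setInterval. Intermediate positions are dropped on purpose: nobody needs to see every pixel of someone else's caret travel.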

The tricky bit: a remote cursor is a position in the sender's copy of the document, and the document may have changed since they sent it. Anchor remote cursors to the same operation/position system the document uses, so when text is inserted before someone's cursor, their caret shifts along with the text. Render the carets as absolutely positioned overlays, not as DOM inside the editable surface, or you'll corrupt the model.
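
Anchoring a cursor to the position system boils down to mapping an offset through each incoming operation. A minimal sketch for flat text offsets, with hypothetical TextOp and mapCursor names; what happens when someone inserts exactly at the caret is a policy choice, and here the caret is pushed right.

```typescript
// Sketch only: maps a caret offset through one remote operation.
type TextOp =
  | { kind: "insert"; pos: number; text: string }
  | { kind: "delete"; from: number; to: number };

function mapCursor(cursor: number, op: TextOp): number {
  if (op.kind === "insert") {
    // Policy choice: an insert exactly at the caret pushes the caret right.
    return op.pos <= cursor ? cursor + op.text.length : cursor;
  }
  if (cursor <= op.from) return cursor;                   // deletion is entirely after the caret
  if (cursor >= op.to) return cursor - (op.to - op.from); // deletion is entirely before the caret
  return op.from;                                         // caret was inside the deleted range
}
```

A caret at offset 10 moves to 13 when three characters are inserted at offset 3, stays at 10 for an insert at 12, and collapses to 8 if the range 8 to 14 is deleted around it.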

6. Offline edits and reconciliation

If you chose a CRDT, offline is mostly free: queue local operations in IndexedDB while disconnected, and on reconnect, merge them. Convergence is guaranteed by the data structure. With OT you have more work: replay queued operations against the server's current state, transforming each one as you go.
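
For the OT path, "transforming each one as you go" looks roughly like the rebase below. It is a sketch for insert-only plain text with a hypothetical clientId tie-break, assuming the queued ops and the server ops both started from the same base document.

```typescript
// Sketch only: rebasing an offline queue over ops the server accepted meanwhile.
type Insert = { pos: number; text: string; clientId: string };

function applyInsert(doc: string, op: Insert): string {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Rewrite `op` so it applies after `other`; both must share the same starting state.
function transformInsert(op: Insert, other: Insert): Insert {
  const otherGoesFirst =
    other.pos < op.pos || (other.pos === op.pos && other.clientId < op.clientId);
  return otherGoesFirst ? { ...op, pos: op.pos + other.text.length } : op;
}

function rebase(
  pending: Insert[],
  serverOps: Insert[]
): { toSend: Insert[]; toApplyLocally: Insert[] } {
  let queue = pending;
  const toApplyLocally: Insert[] = [];
  for (const serverOp of serverOps) {
    let incoming = serverOp;
    const next: Insert[] = [];
    for (const local of queue) {
      next.push(transformInsert(local, incoming)); // local op now assumes the server op happened
      incoming = transformInsert(incoming, local); // server op now assumes the local op happened
    }
    queue = next;
    toApplyLocally.push(incoming);
  }
  return { toSend: queue, toApplyLocally };
}
```

The function returns two lists: the queued ops rewritten so the server can apply them on top of its current state, and the server ops rewritten so the client can apply them on top of its offline edits. Apply each list on its side and both documents match.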

Either way, persist unsynced changes locally so a refresh or crash doesn't lose work:

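// on every local op (assumes the store uses an auto-incrementing key, so order is preserved)
await idb.put("pendingOps", op);
// on reconnect
const pending = await idb.getAll("pendingOps");
await syncQueue(pending);
// once the server acknowledges, drop the synced ops so they aren't resent
// (delete by key instead of clearing if new ops can arrive mid-sync)
await idb.clear("pendingOps");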

Show the user the truth: a small "All changes saved" / "Saving..." / "Offline, will sync" indicator. People trust a doc they can see is safe.

7. Autosave and performance

You're not saving on every keystroke. Stream individual operations live for collaboration, and debounce a full snapshot to the server so it is written once the user has been idle for a couple of seconds. The operations keep collaborators in sync instantly; the snapshot is the durable checkpoint that lets a new joiner load the doc without replaying its entire history.
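
The debounce itself is small. Here is a sketch with the timer functions injected, so the same code runs against real setTimeout in the browser or a fake clock in a test. createAutosaver, Timers, and noteEdit are hypothetical names.

```typescript
// Sketch only: a debounced snapshot trigger with injectable timers.
type Timers<H> = {
  set: (fn: () => void, ms: number) => H;
  clear: (handle: H) => void;
};

function createAutosaver<H>(saveSnapshot: () => void, idleMs: number, timers: Timers<H>) {
  let handle: H | null = null;
  return {
    // Call after every local operation; each call restarts the idle countdown.
    noteEdit(): void {
      if (handle !== null) timers.clear(handle);
      handle = timers.set(() => {
        handle = null;
        saveSnapshot();
      }, idleMs);
    },
  };
}
```

In the browser the timers argument would be something like { set: (fn, ms) => setTimeout(fn, ms), clear: (h) => clearTimeout(h) }. Three quick edits schedule three timers but cancel the first two, so only one snapshot is written. A production version would also force a save after some maximum interval, so a user who never stops typing still gets checkpoints.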

For a long document, don't re-render the whole tree on each change. Diff at the node level and only repaint the paragraphs that changed. Very long docs (a 200-page contract) also benefit from virtualizing rendering so off-screen pages aren't in the DOM, though that complicates find-and-replace and printing, so call out the trade-off.
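
Node-level diffing is cheap if updates are immutable: an untouched paragraph keeps the same object identity, so a reference comparison tells you what to repaint. A minimal sketch, assuming each block carries a stable id (the document model earlier has none, so that field is an addition for this example) and ignoring reordering:

```typescript
// Sketch only: Block.id and diffBlocks are assumptions for illustration.
type Block = { id: string; type: "paragraph" | "heading"; text: string };

function diffBlocks(prev: Block[], next: Block[]): { repaint: string[]; remove: string[] } {
  const before = new Map<string, Block>();
  for (const b of prev) before.set(b.id, b);

  const after = new Set<string>();
  const repaint: string[] = [];
  for (const b of next) {
    after.add(b.id);
    // New block, or same id with a different object: it changed, so repaint it.
    if (before.get(b.id) !== b) repaint.push(b.id);
  }

  // Blocks that existed before but are gone now need their DOM removed.
  const remove = prev.filter((b) => !after.has(b.id)).map((b) => b.id);
  return { repaint, remove };
}
```

Unchanged paragraphs never appear in either list, so typing in one paragraph of a 200-page document touches one DOM node. The catch is discipline: mutate a block in place and the reference check will silently miss the change.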

What the interviewer will push on

"OT or CRDT, and why?" Name the trade-off: OT is compact but needs a central server and hard transform logic; CRDTs merge anywhere and shine offline but carry per-character metadata. Pick one and justify it for the scope you set.
"Two people format the same word at once." Marks are range operations. With a CRDT, formatting attaches to character IDs so concurrent bold/italic both apply. With OT, you transform the range against the concurrent insert.
"How do new joiners load a huge doc fast?" Load the latest server snapshot, then apply only operations since that snapshot, instead of replaying full history.
"How do you keep it accessible?" A custom editor must reimplement what the browser gave away: arrow-key navigation, screen-reader announcements of edits, focus management, and aria-live for collaborator activity. This is genuinely hard and worth admitting.
"What if the realtime connection drops mid-edit?" Buffer ops locally, show an offline state, reconnect with backoff, then reconcile. The user should never feel a hitch.

The one-paragraph recap

A collaborative editor keeps a structured document tree as the source of truth, renders it into a tamed contenteditable surface by intercepting input events, and resolves concurrent edits with either Operational Transformation (central server, compact ops) or a CRDT (offline-friendly, merges anywhere, more metadata). Live cursors ride a separate throttled presence channel and anchor to the same position system as the text. Offline edits queue in IndexedDB and reconcile on reconnect, autosave debounces durable snapshots while operations stream live, and rendering diffs at the node level to stay fast. Lead with the concurrency problem and contenteditable's hostility, and you've shown them the parts that actually matter.
