Been doing R&D on what a Chrome extension can capture without video or audio recording. The list is longer than expected:
DOM snapshots, user interactions (clicks, scrolls, hovers, selections), timing and sequences, navigation events, form inputs (masked), viewport position, element visibility, CSS computed styles, console logs, network request metadata, clipboard events, focus/blur patterns, window state changes.
That’s 13 distinct signal channels per tab. No screen recording, no audio, no permissions popup.
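To make "form inputs (masked)" concrete, here's a minimal sketch of what one captured event record could look like, with input values masked client-side before anything leaves the page. The schema and the `maskInput` / `recordEvent` names are mine, for illustration, not an actual implementation:

```javascript
// Minimal event record sketch. Form values are masked at capture
// time: we keep the field's shape (length, character classes) but
// never the raw text.
function maskInput(value) {
  return value.replace(/[a-z]/gi, "a").replace(/\d/g, "0");
}

function recordEvent(type, target, value) {
  return {
    type,                                        // "input", "click", "scroll", ...
    target,                                      // CSS selector of the element
    value: value != null ? maskInput(value) : null,
    ts: Date.now(),                              // timing for sequence reconstruction
  };
}

const ev = recordEvent("input", "#card-number", "4111 1111");
console.log(ev.value); // "0000 0000"
```

Shape-preserving masking keeps the signal (user typed something card-number-shaped) without ever storing the sensitive value.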
The question for agent builders: how do you structure this for LLM consumption?
Options I’m exploring: raw event stream summarized by a smaller model, structured JSON with semantic labels, natural language narration from events, or embeddings per “action chunk” for retrieval.
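The "structured JSON with semantic labels" option might look like this: raw DOM events promoted to intent-level records an LLM can read without parsing selectors. The label table here is hard-coded for illustration; in practice it could come from heuristics or the smaller summarizing model:

```javascript
// Map raw low-level events onto semantic labels an LLM can consume
// directly. Hypothetical label rules, hard-coded for the sketch.
const LABELS = [
  { match: /checkout|cart/, label: "checkout_interaction" },
  { match: /price|pricing/, label: "pricing_view" },
  { match: /search/, label: "search" },
];

function labelEvent(raw) {
  const hit = LABELS.find((l) => l.match.test(raw.target));
  return {
    label: hit ? hit.label : "other",
    action: raw.type,
    target: raw.target,
    ts: raw.ts,
  };
}

const labeled = [
  { type: "click", target: "#pricing-table", ts: 1 },
  { type: "click", target: "#checkout-btn", ts: 2 },
].map(labelEvent);

console.log(labeled.map((e) => e.label));
// ["pricing_view", "checkout_interaction"]
```

Keeping `action`, `target`, and `ts` alongside the label means the raw detail is still there for retrieval, but the model can reason at the label level.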
Signal-to-noise is the whole game. 13 channels of raw data is useless. 13 channels distilled into “user compared pricing on three tabs, then abandoned checkout” is gold.
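That distillation step can be sketched end-to-end: collapse a labeled event stream into "action chunks", then render the chunks as one narration line. Thresholds, labels, and phrasing below are arbitrary assumptions, not a real pipeline:

```javascript
// Collapse consecutive events with the same semantic label into one
// "action chunk" — the 13-raw-channels -> one-sentence distillation.
function chunk(events) {
  const chunks = [];
  for (const e of events) {
    const last = chunks[chunks.length - 1];
    if (last && last.label === e.label) last.count += 1;
    else chunks.push({ label: e.label, count: 1 });
  }
  return chunks;
}

// Render chunks as a single natural-language narration line.
function narrate(chunks) {
  return chunks
    .map((c) => `${c.label.replace(/_/g, " ")} (x${c.count})`)
    .join(", then ");
}

const stream = [
  { label: "pricing_view" },
  { label: "pricing_view" },
  { label: "pricing_view" },
  { label: "checkout_interaction" },
  { label: "tab_close" },
];
console.log(narrate(chunk(stream)));
// "pricing view (x3), then checkout interaction (x1), then tab close (x1)"
```

Five raw events become one line a model can act on — roughly the "compared pricing, then abandoned checkout" compression, minus the cross-tab logic.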