Chonkology: A Mathematical Theory of Audiovisual Narratives
Highlights only. This page is a narrative skim of the paper. For the full formal development and all definitions, download the full PDF.

Introduction
A chonk is a short-form video artifact: a still image brought to life through orchestrated visual transformations synchronized with audio. This document develops the mathematical foundations of chonks, proceeding from primitive spaces through increasingly refined definitions.
We begin with the raw ingredients (images, transforms, audio), introduce the space of arbitrary audiovisual pairings, and then characterize chonks as those pairings satisfying a synchronization condition. This approach mirrors the construction of, say, measurable functions from arbitrary functions—the larger space provides context for understanding what makes the smaller space special.
Primitive Spaces
We start with the ingredients: images, camera-like transforms, and a shared mood space that both audio and motion can live in. Those three pieces let us define what “synchronization” even means.
The Image Space
Definition. Let $\mathcal{I}$ denote the space of images; its explicit form is given in the full PDF.
The Similarity Transform Group
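Definition. Let $\mathcal{G}$ denote the group of 2D similarity transformations (uniform scaling composed with translation):
$$\mathcal{G} = \{\, g_{s,\mathbf{t}} : \mathbb{R}^2 \to \mathbb{R}^2,\ \ g_{s,\mathbf{t}}(\mathbf{x}) = s\,\mathbf{x} + \mathbf{t} \,\}$$
where $s > 0$ is the scale factor and $\mathbf{t} \in \mathbb{R}^2$ is the translation.
Proposition. $\mathcal{G}$ forms a 3-dimensional Lie group (one scale parameter plus two translation parameters) under composition:
$$g_{s_1,\mathbf{t}_1} \circ g_{s_2,\mathbf{t}_2} = g_{s_1 s_2,\ s_1 \mathbf{t}_2 + \mathbf{t}_1}$$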
Remark. We restrict to similarity transforms (uniform scaling) rather than the full affine group to preserve aspect ratio—a chonk zooms and pans but does not shear or stretch.
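As a minimal sketch of the group structure described above, the following Python class models a similarity transform as a positive scale plus a 2D translation. The class name `Similarity` and its methods are illustrative assumptions, not notation from the paper; the composition rule is the one that follows from applying one map after another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Similarity:
    """Map (x, y) -> (s*x + tx, s*y + ty) with uniform scale s > 0.

    Hypothetical helper for illustration; not an API from the paper.
    """
    s: float
    tx: float = 0.0
    ty: float = 0.0

    def __post_init__(self):
        if self.s <= 0:
            raise ValueError("scale must be positive")

    def apply(self, x, y):
        return (self.s * x + self.tx, self.s * y + self.ty)

    def compose(self, other):
        """Return the transform 'self after other' (other is applied first)."""
        return Similarity(
            self.s * other.s,
            self.s * other.tx + self.tx,
            self.s * other.ty + self.ty,
        )

    def inverse(self):
        return Similarity(1.0 / self.s, -self.tx / self.s, -self.ty / self.s)


IDENTITY = Similarity(1.0)

zoom = Similarity(2.0)             # 2x zoom about the origin
pan = Similarity(1.0, 3.0, -1.0)   # pure translation
g = pan.compose(zoom)              # zoom first, then pan
print(g.apply(1.0, 1.0))           # (5.0, 1.0)
print(g.compose(g.inverse()) == IDENTITY)  # True
```

Three numbers (`s`, `tx`, `ty`) fully describe each transform, which matches the zoom-and-pan restriction in the remark: there is no parameter that could shear or stretch the image.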
The Mood Manifold
Definition. Let $\mathcal{M}$ be a low-dimensional smooth manifold parameterizing narrative/emotional states.
Possible coordinates on $\mathcal{M}$ include:
- Tension: suspense vs. resolution
- Energy: calm vs. intense
- Valence: dark vs. bright
Remark. The exact dimensionality and coordinates are aesthetic choices. The essential property is that the mood manifold $\mathcal{M}$ is shared between audio and visual modalities.
The Narrative Bundle
We combine the mood state and transform state into a single product space; that is the stage on which a chonk’s narrative path lives.
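Definition. The narrative bundle is the product manifold
$$\mathcal{N} = \mathcal{M} \times \mathcal{G}$$
of the mood manifold $\mathcal{M}$ and the similarity transform group $\mathcal{G}$.
Definition. The canonical projections are
$$\pi_{\mathcal{M}} : \mathcal{N} \to \mathcal{M}, \qquad \pi_{\mathcal{G}} : \mathcal{N} \to \mathcal{G}$$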
Chonks as Synchronized Pairings
Given an audio track, the model imagines a path through the narrative bundle: mood on one axis, visual transform on the other. Chonks are precisely those audiovisual pairings where that path is synchronized with the audio. The PDF expands on the audio mood function and its construction; here we only use it.
The Narrative Path
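Definition. A narrative path is a continuous map
$$\gamma : [0, T] \to \mathcal{N}$$
from the time interval $[0, T]$ of the audio into the narrative bundle $\mathcal{N} = \mathcal{M} \times \mathcal{G}$ (mood manifold times transform group).
At each moment $t$, $\gamma(t)$ specifies both a mood state and a visual transform.
Definition. The visual trajectory derived from $\gamma$ is its projection onto the transform group,
$$\tau = \pi_{\mathcal{G}} \circ \gamma : [0, T] \to \mathcal{G}$$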
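To make the projections concrete, here is a small sketch in which a narrative path is a Python function from time to a (mood, transform) pair, and the mood arc and visual trajectory are obtained by picking out one component. The particular path `example_path`, the tuple layouts, and the 10-second duration are assumptions chosen only for illustration.

```python
from typing import Callable, Tuple

Mood = Tuple[float, float, float]        # (tension, energy, valence)
Transform = Tuple[float, float, float]   # (scale, tx, ty)
State = Tuple[Mood, Transform]           # a point of the narrative bundle


def example_path(t: float, duration: float = 10.0) -> State:
    """Hypothetical path: tension rises linearly while the camera zooms in."""
    u = min(max(t / duration, 0.0), 1.0)   # normalized time in [0, 1]
    mood = (u, 0.5, 0.5)
    transform = (1.0 + u, -0.5 * u, 0.0)
    return mood, transform


def mood_arc(path: Callable[[float], State]) -> Callable[[float], Mood]:
    """Projection onto the mood component."""
    return lambda t: path(t)[0]


def visual_trajectory(path: Callable[[float], State]) -> Callable[[float], Transform]:
    """Projection onto the transform component."""
    return lambda t: path(t)[1]


tau = visual_trajectory(example_path)
mu = mood_arc(example_path)
print(tau(5.0))  # (1.5, -0.25, 0.0)
print(mu(5.0))   # (0.5, 0.5, 0.5)
```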
The Synchronization Condition
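Definition. A narrative path $\gamma$ is synchronized with audio $A$ if its mood component tracks the audio mood function $\mu_A$:
$$\pi_{\mathcal{M}} \circ \gamma \approx \mu_A$$
Remark. The symbol "$\approx$" allows for degrees of synchronization.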
The Chonk
This is the core object: image + narrative path + audio. Everything else can be derived from that triple.
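Definition. A chonk is a 3-tuple
$$(I, \gamma, A)$$
where:
- $I$ is a source image,
- $\gamma$ is a narrative path synchronized with $A$,
- $A$ is an audio signal.
The visual trajectory is derived as $\tau = \pi_{\mathcal{G}} \circ \gamma$, the projection of $\gamma$ onto its transform component.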
Definition. The chonk space consists of all synchronized pairings (viewing as ).
Remark. A chonk is minimal: image, narrative path, audio. Everything else is derived:
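- The trajectory $\tau = \pi_{\mathcal{G}} \circ \gamma$ (the transform component of the narrative path $\gamma$)
- The mood arc $\pi_{\mathcal{M}} \circ \gamma$ (the mood component of $\gamma$)
- The focal point, defined as the limit point of the trajectory as zoom increases
- The final scale, defined as the scale component of the trajectory at the final time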
The chonk contains exactly the information needed to render, nothing more.
The Coherence Measure
Synchronization is not binary; we score how aligned the motion is with the audio. The coherence measure formalizes that and gives a knob for optimization and generation (see the PDF for the wider filtration story).
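Definition. The coherence of a pairing of a visual trajectory $\tau$ with audio $A$ of duration $T$ is
$$C(\tau, A) = \exp\!\left(-\frac{1}{T}\int_0^T \left\| \mu_\tau(t) - \mu_A(t) \right\|^2 \, dt\right)$$
where $\mu_A$ is the audio mood function and $\mu_\tau$ is a mood function induced by the trajectory (e.g. from its velocity or acceleration profile).
Remark. The exponential form is a convenient normalization that converts average squared mood mismatch into a similarity score in $(0, 1]$. Other monotone transforms would yield equivalent coherence orderings but different sensitivity profiles.
Proposition. Coherence satisfies:
- $C(\tau, A) \in (0, 1]$ for all pairings $(\tau, A)$
- $C(\tau, A) = 1$ if and only if $\mu_\tau = \mu_A$ (perfect synchronization)
- $C(\tau, A) \to 0$ as synchronization degrades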
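The remark above describes coherence as an exponential applied to the average squared mood mismatch. A discretized sketch of that idea follows, assuming both mood functions have already been sampled at the same equally spaced times; the function name `coherence` and the sample values are illustrative, and the paper may include scaling constants not shown here.

```python
import math


def coherence(trajectory_mood, audio_mood):
    """exp(-mean squared Euclidean mismatch) over equally spaced samples.

    Each argument is a non-empty list of mood vectors (tuples of floats)
    of the same length. A discretized sketch, not the paper's exact formula.
    """
    if not trajectory_mood or len(trajectory_mood) != len(audio_mood):
        raise ValueError("need two non-empty sample lists of equal length")
    total = 0.0
    for m_v, m_a in zip(trajectory_mood, audio_mood):
        total += sum((v - a) ** 2 for v, a in zip(m_v, m_a))
    return math.exp(-total / len(audio_mood))


audio = [(0.0, 0.5), (0.5, 0.5), (1.0, 0.5)]
perfect = list(audio)                        # identical mood samples
offset = [(t + 1.0, e) for t, e in audio]    # tension off by 1 everywhere
far_off = [(t + 3.0, e) for t, e in audio]   # tension off by 3 everywhere

print(coherence(perfect, audio))   # 1.0
print(coherence(offset, audio))    # exp(-1), about 0.3679
print(coherence(far_off, audio))   # exp(-9), close to 0
```

The three cases line up with the proposition: identical moods give exactly 1, every score is positive, and the score falls toward 0 as the mismatch grows.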
Summary
The punchline is simple: a chonk is a synchronized path in the narrative bundle.
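Theorem (Chonk Characterization). A chonk is a synchronized path $\gamma$ through the narrative bundle $\mathcal{N} = \mathcal{M} \times \mathcal{G}$, where synchronization means the mood component tracks the audio's emotional content, given by the audio mood function $\mu_A$:
$$\pi_{\mathcal{M}} \circ \gamma \approx \mu_A$$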
Future Directions
These are the main technical directions the full paper explores in more detail.
Optimization-Based Generation
If we can score coherence, we can optimize for it.
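Definition. Given image $I$, audio $A$, and waypoints $w_1, \dots, w_n$, the coherence-optimal trajectory problem is
$$\tau^* = \arg\max_{\tau} \; C(\tau, A)$$
where the maximum is taken over visual trajectories $\tau$ passing through the waypoints and $C$ is the coherence measure.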
Energy-Constrained Generation
Coherence alone can produce overly jittery motion; we can penalize jerk or energy.
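Definition. The energy-constrained objective is
$$\tau^* = \arg\max_{\tau} \; \bigl[\, C(\tau, A) - \lambda \, E(\tau) \,\bigr]$$
where $C$ is the coherence measure, $E(\tau)$ is a motion energy (or jerk) penalty on the trajectory, and $\lambda \ge 0$ sets the trade-off between the two.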
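The trade-off between coherence and smoothness can be sketched by scoring a few candidate zoom curves and keeping the best one. Everything specific here is an assumption made for illustration: the induced mood is taken to be absolute zoom speed, the energy term is the mean squared second difference of the zoom, and `lam` is a hypothetical penalty weight. Picking from a fixed candidate list also stands in for a real optimizer.

```python
import math


def induced_mood(scales, dt):
    """Hypothetical mood function: energy = absolute zoom speed."""
    return [abs(b - a) / dt for a, b in zip(scales, scales[1:])]


def coherence(mood_v, mood_a):
    """exp(-mean squared mismatch) between two scalar mood sequences."""
    mse = sum((v - a) ** 2 for v, a in zip(mood_v, mood_a)) / len(mood_a)
    return math.exp(-mse)


def motion_energy(scales, dt):
    """Mean squared second difference (discrete acceleration) of the zoom."""
    acc = [(c - 2 * b + a) / dt ** 2
           for a, b, c in zip(scales, scales[1:], scales[2:])]
    return sum(x * x for x in acc) / len(acc)


def objective(scales, audio_mood, dt, lam):
    """Coherence minus a weighted smoothness penalty."""
    return (coherence(induced_mood(scales, dt), audio_mood)
            - lam * motion_energy(scales, dt))


dt = 1.0
audio_mood = [0.5, 0.5, 0.5, 0.5]          # constant moderate energy
candidates = {
    "smooth": [1.0, 1.5, 2.0, 2.5, 3.0],   # steady zoom in
    "jittery": [1.0, 1.5, 1.0, 1.5, 1.0],  # same speed, flipping direction
}

for name, scales in candidates.items():
    print(name, objective(scales, audio_mood, dt, lam=0.0),
          objective(scales, audio_mood, dt, lam=0.1))

best = max(candidates, key=lambda k: objective(candidates[k], audio_mood, dt, 0.1))
print(best)  # smooth
```

With `lam=0.0` both candidates score a coherence of 1.0, since their zoom speeds match the audio mood equally well; only the penalty term separates the steady zoom from the oscillating one.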
Search Over Strategy Space
Strategies can be compared by the coherence distributions they induce.
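Definition. The coherence distribution of a strategy $S$ is the distribution of
$$C(S(I, A, w), A)$$
over random inputs $(I, A, w)$, where $S$ maps an image $I$, audio $A$, and waypoints $w$ to a visual trajectory and $C$ is the coherence measure.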
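One way to compare strategies empirically is to sample many inputs, compute a coherence score for each, and summarize the resulting samples. The sketch below makes several simplifying assumptions: audio mood is a scalar random walk standing in for real audio, a "strategy" maps the audio mood sequence directly to the mood induced by its trajectory, and both strategies (`constant_strategy`, `lagged_strategy`) are hypothetical.

```python
import math
import random
import statistics


def coherence(mood_v, mood_a):
    """exp(-mean squared mismatch) between two scalar mood sequences."""
    mse = sum((v - a) ** 2 for v, a in zip(mood_v, mood_a)) / len(mood_a)
    return math.exp(-mse)


def constant_strategy(audio_mood):
    """Hypothetical strategy: ignore the audio and hold mood at 0.5."""
    return [0.5] * len(audio_mood)


def lagged_strategy(audio_mood):
    """Hypothetical strategy: follow the audio mood one sample late."""
    return [audio_mood[0]] + audio_mood[:-1]


def coherence_samples(strategy, n_inputs=200, length=50, seed=0):
    """Coherence of `strategy` on n_inputs synthetic audio mood tracks."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_inputs):
        mood, track = 0.5, []
        for _ in range(length):
            # Random walk clipped to [0, 1]; a stand-in for real audio.
            mood = min(1.0, max(0.0, mood + rng.uniform(-0.1, 0.1)))
            track.append(mood)
        samples.append(coherence(strategy(track), track))
    return samples


for strategy in (constant_strategy, lagged_strategy):
    scores = coherence_samples(strategy)
    print(strategy.__name__,
          round(statistics.mean(scores), 4),
          round(min(scores), 4))
```

Using the same seed for both strategies means they are scored on identical inputs, so differences in the summaries reflect the strategies rather than the sampling.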
Learning
Problems include learning , learning , and learning direct audio-to-trajectory mappings. See the PDF for the more formal framing.
Composable Narratives
Longer-form content can be structured via chonk concatenation with appropriate boundary conditions (details in the PDF).
Multi-Image Extensions
Moving between images requires a transition object rather than a single-image chonk.
Definition. A transition chonk between images and is
Tooling Implications
The specification identifies exactly what the user must provide: image, waypoints, strategy, and audio reference. Full spec and notation are in the PDF.