As you might experience while reading this sentence, silent reading often involves a speech imagery component: we can hear our own "inner voice" pronouncing words mentally. Recent functional magnetic resonance imaging studies have associated that component with increased metabolic activity in the auditory cortex, including voice-selective areas. It remains to be determined, however, whether this activation arises automatically from early bottom-up visual inputs or whether it depends on late top-down control processes modulated by task demands. To answer this question, we collaborated with four epileptic human patients recorded with intracranial electrodes in the auditory cortex for therapeutic purposes, and measured high-frequency (50–150 Hz) "gamma" activity as a proxy for population-level spiking activity. Temporal voice-selective areas (TVAs) were identified with an auditory localizer task and monitored as participants viewed words flashed on screen. We compared neural responses depending on whether words were attended or ignored and found a significant increase in neural activity in response to words, strongly enhanced by attention. In one of the patients, we could record that response at 800 ms in TVAs, but also at 700 ms in the primary auditory cortex and at 300 ms in the ventral occipital temporal cortex. Furthermore, single-trial analysis revealed considerable jitter between activation peaks in visual and auditory cortices. Altogether, our results demonstrate that the multimodal mental experience of reading is in fact a heterogeneous complex of asynchronous neural responses, and that auditory and visual modalities often process distinct temporal frames of our environment at the same time.