Online cloud gaming platforms stream game media to multiple endpoints (e.g., a television display and a controller-connected headset), possibly over different networks with considerably different latencies. As a result, the streams play out of sync with one another, which severely degrades the user experience. Typical approaches that rely on network and software timing measurements fail to reach synchronization goals. In this work, we propose Ekho, a robust and efficient end-to-end approach for synchronizing streams transmitted to two devices. Ekho adds faint, human-inaudible pseudo-noise (PN) markers to the game audio, and listens for these markers in the chat audio captured by the player's microphone to measure the inter-stream delay (ISD). The game server then compensates for the ISD to synchronize the streams. We evaluate Ekho in depth with a corpus of audio samples from popular online games, and demonstrate that it calculates ISD with sub-millisecond accuracy, has low computational overhead, and is resilient to background chatter, compression, and variations in microphone quality. In end-to-end tests over WiFi and cellular links with frequent packet loss and playback disruption, Ekho maintains human-imperceptible ISD (<10 ms) 86.8% of the time. Without Ekho, the ISD exceeds 50 ms at all times.
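The marker-based measurement described above can be illustrated with a minimal sketch. It is not Ekho's implementation: it assumes a 48 kHz sample rate, a seeded ±1 sequence as the PN marker, a plain low gain in place of any perceptual shaping, white noise standing in for game audio, and simple cross-correlation for detection. All function names and parameter values are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 48_000  # Hz; assumed for this sketch


def make_pn_marker(length: int, seed: int) -> np.ndarray:
    """Return a pseudo-random +/-1 sequence reproducible from the seed."""
    rng = np.random.default_rng(seed)
    return rng.choice(np.array([-1.0, 1.0]), size=length)


def embed_marker(audio: np.ndarray, marker: np.ndarray,
                 position: int, gain: float) -> np.ndarray:
    """Add a scaled copy of the marker to the audio starting at `position`."""
    out = audio.copy()
    out[position:position + len(marker)] += gain * marker
    return out


def find_marker(recording: np.ndarray, marker: np.ndarray) -> int:
    """Return the sample index where the marker correlates most strongly."""
    corr = np.correlate(recording, marker, mode="valid")
    return int(np.argmax(np.abs(corr)))


# --- Simulated scenario (all values are illustrative assumptions) ---
rng = np.random.default_rng(0)
game_audio = 0.1 * rng.standard_normal(SAMPLE_RATE // 2)  # 0.5 s stand-in
marker = make_pn_marker(4096, seed=42)
marker_pos = 5000   # sample index where the marker is inserted
true_delay = 1234   # delay (in samples) the sketch should recover

marked = embed_marker(game_audio, marker, marker_pos, gain=0.02)

# Microphone capture: delayed, attenuated, with added background noise.
mic = 0.5 * np.concatenate([np.zeros(true_delay), marked])[:len(marked)]
mic += 0.05 * rng.standard_normal(len(mic))

estimated_delay = find_marker(mic, marker) - marker_pos
isd_ms = 1000.0 * estimated_delay / SAMPLE_RATE
print(f"estimated delay: {estimated_delay} samples ({isd_ms:.2f} ms)")
```

Because the marker is uncorrelated with the surrounding audio, the correlation peak stands out even when the marker is much quieter than the signal around it. At 48 kHz, one sample corresponds to roughly 0.02 ms, which is why a sample-accurate peak is compatible with sub-millisecond resolution in this simplified setting.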