When I was a kid, one of the coolest toys you could get was a set of intercom units connected with a thin wire. You could set them up between your room and somewhere else, as long as it was within the 10 meters the supplied wire would reach. You could then annoy the sh*t out of your parents or siblings with the single call button:
I live in a house with lots of rooms, and I have often joked that I would find a set of these and install them between the living room and my lair.
Not wanting to deal with wires or horrible sound quality, I opted to make something more modern that still preserves that feeling of immediacy.
I decided to make a couple of single-purpose, point-to-point intercom units that could easily be wall mounted. I wanted to swap the wire for WiFi for obvious reasons.
Digital audio on (Arduino) microcontrollers
Most modern (Arduino) microcontrollers, at least the ones that are also capable of WiFi, have built-in hardware to send and receive digital audio as I2S signals.
Usually some hardware inside the microcontroller handles sending and receiving buffers of audio data in the background with minimal attention from the main program.
I2S is a standard interface and lots of convenient devices are available for audio input and output.
The hardware I chose for this project consists of an ESP32 microcontroller, an I2S microphone and an I2S audio DAC with a built-in amplifier.
The microphone is a small module carrying the INMP441 integrated digital microphone. This device is actually a fairly high-end component containing everything needed to convert sound to digital audio. The chip eliminates all the potentially noisy analog circuitry that is normally associated with microphone input. In previous projects I have had lots of problems with analog noise, since the power supply voltage of many microcontroller boards tends to be very noisy and it is almost impossible to filter this kind of noise out. The INMP441 and similar digital microphone chips seem immune to noise from the power supply.
For audio output I chose a board with the MAX98357 digital amplifier. This small chip has an I2S digital audio input and is capable of directly driving a speaker.
These devices are just a few of the many available for I2S audio. Other devices may have analog line inputs/outputs in place of a microphone and speaker but operate in a similar way. Since the devices do not require any configuration in software, they can be swapped for one another seamlessly. I simply chose the above for convenience.
The circuit for the project is pretty simple:
The microphone and amplifier modules have different names for the same I2S signals, but in its basic form I2S needs 3 wires: a bit clock, a left/right select signal (aka Word Select) and a data signal. The bit clock and left/right select can be shared by the two devices, but each has a dedicated data line.
The pins on the ESP32 are more or less arbitrarily selected and can be changed in (Arduino) software.
I had just gotten a new 3D printer when I was experimenting with this project, so I went a bit overboard and made a 3D printable enclosure for the circuit:
The lid of the box has features where you can simply press fit the microphone, speaker and amplifier board.
If anyone is interested I have included the .STL files for the enclosure.
Originally I just wanted to use simple UDP signaling between two of these devices via WiFi. Once a button was pressed on one of the devices, buffers of raw samples from the microphone would be sent directly to a UDP port on the other. Upon receiving buffers, the other device would just send them to the amplifier module. I assumed that the latency and transmission time between two devices on the same WiFi network would be pretty consistent, so no buffering would be needed.
The above has a couple of problems.
Each device would need to know the IP address of the other. Depending on the configuration of your local WiFi, it may not always be possible to set the exact IP address of a device. The address may change from time to time, and it would be inconvenient to have to change the IPs in the code all the time.
You would also only be able to send from one device to another, and it could be interesting to be able to make an all-call where one device sends to several devices in different rooms.
I wrote a couple of Arduino sketches to experiment with different solutions.
Basic audio handling
Audio is handled the same way in all the examples:
A dedicated I2S system inside the ESP32 is set up to perform the sampling/playback of audio data (this is also available on other platforms such as RP2040 or STM32).
Data is sampled and played back constantly and simultaneously. This is handled in the background and does not involve the main loop of the Arduino code.
Data is played from a queue of buffers. If new data arrives it is added to the queue. When all buffers in the queue have been played, the output is replaced with a special buffer containing silence. This silence buffer keeps looping until new buffers are added to the queue.
Meanwhile data is constantly being recorded into a single buffer. Once the buffer is full a flag is set alerting the main program to potentially do something with it.
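The playback queue and the record flag described above can be sketched in plain C++, independent of any I2S driver. This is only an illustration of the logic, not the code from the actual sketches: the names PlaybackQueue, Recorder, next() and takeIfFull() are made up, and the buffer size of 512 samples assumes 1024-byte buffers of 16-bit samples.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Assumption: 1024-byte buffers of 16-bit samples, i.e. 512 samples each.
constexpr std::size_t kSamplesPerBuffer = 512;
using AudioBuffer = std::array<std::int16_t, kSamplesPerBuffer>;

// Hypothetical playback side: a FIFO of buffers with a silence fallback.
class PlaybackQueue {
public:
    // Called when a buffer arrives, for example from the network.
    void push(const AudioBuffer& buf) { queue_.push_back(buf); }

    // Called each time the output needs its next buffer.
    // Returns queued audio if there is any, otherwise the silence buffer.
    AudioBuffer next() {
        if (queue_.empty()) {
            return silence_;
        }
        AudioBuffer buf = queue_.front();
        queue_.pop_front();
        return buf;
    }

    std::size_t pending() const { return queue_.size(); }

private:
    std::deque<AudioBuffer> queue_;
    AudioBuffer silence_{};  // value-initialized, so every sample is zero
};

// Hypothetical recording side: one buffer plus a "full" flag.
class Recorder {
public:
    // Called for every captured sample; raises the flag when the buffer fills.
    void addSample(std::int16_t sample) {
        assert(pos_ < kSamplesPerBuffer);
        buffer_[pos_++] = sample;
        if (pos_ == kSamplesPerBuffer) {
            pos_ = 0;
            full_ = true;
        }
    }

    // Polled by the main loop; copies the buffer out once per fill.
    bool takeIfFull(AudioBuffer& out) {
        if (!full_) {
            return false;
        }
        out = buffer_;
        full_ = false;
        return true;
    }

private:
    AudioBuffer buffer_{};
    std::size_t pos_ = 0;
    bool full_ = false;
};
```

On real hardware the filling happens in the background, so the main loop has to copy or send the recorded buffer before the next one starts overwriting it. The sketch ignores that timing detail.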
The crude solution
Instead of sending the data to a specific IP address, the data is sent as a broadcast. This way all devices on the same network receive the data regardless of IP address. This is historically considered bad practice since you ‘bother’ devices with irrelevant data. This may have been a problem back in the days of slow computers and limited data speeds. Today I don’t think a few tens of kilobytes per second of broadcast traffic even registers in the big picture.
Data is continuously sampled at 16 kHz with 16-bit resolution into buffers of 1024 bytes. If a button is pressed, the buffers are broadcast over the local network to a specific UDP port. Meanwhile any data arriving over the network is added to the playback queue.
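The numbers above determine the packet rate and bandwidth, and the arithmetic can be written out as a small C++ sketch. The struct and function names are invented for illustration, and the calculation assumes a single (mono) channel, which the figures in this post suggest but do not state outright.

```cpp
#include <cassert>

// Hypothetical description of the audio stream (mono assumed).
struct StreamFormat {
    int sampleRateHz;    // samples per second
    int bytesPerSample;  // 2 for 16-bit audio
    int bufferBytes;     // size of one buffer / UDP payload
};

// Number of samples that fit in one buffer.
int samplesPerBuffer(const StreamFormat& f) {
    assert(f.bytesPerSample > 0 && f.bufferBytes % f.bytesPerSample == 0);
    return f.bufferBytes / f.bytesPerSample;
}

// Playing time of one buffer in milliseconds.
double bufferMillis(const StreamFormat& f) {
    assert(f.sampleRateHz > 0);
    return 1000.0 * samplesPerBuffer(f) / f.sampleRateHz;
}

// Buffers (and therefore packets) produced per second.
double packetsPerSecond(const StreamFormat& f) {
    return 1000.0 / bufferMillis(f);
}

// Raw audio payload per second, not counting UDP/IP headers.
int payloadBytesPerSecond(const StreamFormat& f) {
    return f.sampleRateHz * f.bytesPerSample;
}

// The values used in this project.
const StreamFormat kIntercomFormat{16000, 2, 1024};
```

With these values each buffer holds 512 samples and lasts 32 ms, which gives 31.25 packets per second and 32000 bytes of audio per second while the button is held.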
The “proper” solution
In this example each of the two devices broadcasts only its IP address. This way the sender knows the address of the receiver and can send data directly, avoiding “large” amounts of data being broadcast.
Every second each device broadcasts a string with the word “hello,” followed by its own IP address. With UDP, the data sent is just a chunk of bytes with no context. To distinguish between these ‘hello’ messages and audio data, the size of the received packet is used. If more than 200 bytes are received, the packet is assumed to be an audio buffer. Otherwise the program attempts to extract an IP address from the data. This address is then used when transmitting audio.
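The size-based dispatch described above could look roughly like the following portable C++ sketch. It is not the code from the actual sketch: the names classify(), parseHello() and PacketKind are hypothetical, and the exact message layout ("hello," directly followed by a dotted IPv4 address with nothing after it) is an assumption based on the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

enum class PacketKind { Audio, Hello, Unknown };

// Packets larger than this are treated as audio (threshold from the text).
constexpr std::size_t kHelloMaxBytes = 200;

// Tries to read "hello,a.b.c.d" and stores "a.b.c.d" in ipOut on success.
bool parseHello(const char* data, std::size_t len, std::string& ipOut) {
    static const char kPrefix[] = "hello,";
    const std::size_t prefixLen = sizeof(kPrefix) - 1;
    if (len <= prefixLen || std::memcmp(data, kPrefix, prefixLen) != 0) {
        return false;
    }
    // Copy into a std::string so sscanf sees a terminated buffer.
    const std::string rest(data + prefixLen, len - prefixLen);
    unsigned a = 0, b = 0, c = 0, d = 0;
    char trailing = 0;
    // Exactly 4 means four numbers matched and no extra character followed.
    const int matched = std::sscanf(rest.c_str(), "%3u.%3u.%3u.%3u%c",
                                    &a, &b, &c, &d, &trailing);
    if (matched != 4 || a > 255 || b > 255 || c > 255 || d > 255) {
        return false;
    }
    ipOut = std::to_string(a) + "." + std::to_string(b) + "." +
            std::to_string(c) + "." + std::to_string(d);
    return true;
}

// Decides what a received UDP payload is, based on its size alone.
PacketKind classify(const char* data, std::size_t len, std::string& ipOut) {
    assert(data != nullptr);
    if (len > kHelloMaxBytes) {
        return PacketKind::Audio;
    }
    return parseHello(data, len, ipOut) ? PacketKind::Hello
                                        : PacketKind::Unknown;
}
```

A receiver would call classify() on every incoming packet, queue the payload for playback when it is Audio, and remember ipOut as the peer address when it is Hello. A device also hears its own broadcasts, so a real implementation would presumably ignore a hello carrying its own address.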
Handling packet timing issues (just for background)
The choice of continuously sending data to the DAC/speaker even if no data has been received has some advantages. Imagine data packets being transmitted at a constant rate (about 31 per second). These packets may experience some short irregular delays, but overall the rate is constant. When a transmission starts, the first packet will arrive while a silence buffer is being played. The arriving data is put in the queue. When the silence buffer is done playing, this first packet is played. The little bit of extra time the packet spent sitting in the queue is now a margin for how late the next packet can arrive:
If any of the successive packets is delayed by more than this margin, the queue will run out of packets and a silence buffer will be inserted:
The result is that the successive packets stay in the queue while the silence plays, thus extending the original margin. Obviously there will be a small interruption of the audio, but in practice it is hardly noticeable.
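To make the margin argument concrete, here is a small simulation in plain C++. It is a simplified model, not the intercom code: playback is assumed to happen in fixed slots of one buffer length starting at time 0, packets are assumed to arrive in order, and the names simulatePlayback() and PlaybackTrace are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct PlaybackTrace {
    std::vector<int> slots;     // per slot: packet index played, -1 = silence
    std::vector<int> marginMs;  // per packet: time it waited in the queue
};

// arrivalMs: arrival time of each packet, in ascending order.
// slotMs:    playing time of one buffer (32 ms in this project).
PlaybackTrace simulatePlayback(const std::vector<int>& arrivalMs, int slotMs) {
    assert(slotMs > 0);
    assert(std::is_sorted(arrivalMs.begin(), arrivalMs.end()));
    PlaybackTrace trace;
    std::size_t next = 0;  // index of the oldest packet not yet played
    for (int t = 0; next < arrivalMs.size(); t += slotMs) {
        if (arrivalMs[next] <= t) {
            // The packet is already queued when the slot starts: play it.
            trace.slots.push_back(static_cast<int>(next));
            trace.marginMs.push_back(t - arrivalMs[next]);
            ++next;
        } else {
            // Queue is empty at the start of the slot: play silence.
            trace.slots.push_back(-1);
        }
    }
    return trace;
}
```

As a usage example, take arrival times of 10, 42, 114, 114 and 138 ms with 32 ms slots, where the third and fourth packets were held up by a short stall. The first two packets play with a 22 ms margin, the slot at 96 ms plays silence, and the last packet, which arrived on its normal schedule, ends up with a 54 ms margin: the original 22 ms plus one slot of inserted silence.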
On an average local network nothing will accumulate UDP packets if the transmission is interrupted for a longer period. If something should do so anyway, the held-up packets would arrive all at once, so the size of the queue should be limited.
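One way to limit the queue is to drop the oldest buffers when it grows past a fixed length. The helper below is a hedged sketch in plain C++: pushLimited() is a hypothetical name, and dropping the oldest data (rather than the newest) is my assumption about the sensible policy for an intercom, since it keeps the delay bounded.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Appends buf to the queue, discarding the oldest entries if needed so that
// the queue never holds more than maxBuffers items.
// Returns true if anything had to be dropped.
template <typename Buffer>
bool pushLimited(std::deque<Buffer>& queue, const Buffer& buf,
                 std::size_t maxBuffers) {
    assert(maxBuffers > 0);
    bool dropped = false;
    while (queue.size() >= maxBuffers) {
        queue.pop_front();  // oldest audio is the least useful to keep
        dropped = true;
    }
    queue.push_back(buf);
    return dropped;
}
```

With 32 ms buffers, a limit of 8 buffers would cap the added delay at roughly a quarter of a second. The right limit is a matter of taste.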
So far I have just been experimenting with this project and much to the annoyance of my partner I have yet to actually implement this system around the house.
(Update: We tried it and it works. Volume is a bit low, need to fix that.)
I want to make a nicer set of intercom units that either look and feel like the original vintage models or perhaps even like an older type of phone, for decoration.
I do prefer the broadcast model, since you can add intercom speakers at several locations, which is more practical. I don’t think the occasional extra 30 kB/s will be noticeable on my home network.
I think this is a fun little project that gives a good introduction to I2S audio and streaming over the network.