Reply To: Separation for 3D Audio



Thank you for your inquiry.

But I don’t know how to record the TSP response, mainly because I don’t know how to run wios and I don’t know how to make the loudspeaker output the TSP even after reading and watching the tutorials you have.

Starting with HARKTOOL5, we provide a method to create a transfer function from recorded data whose playback and recording are not synchronized. In other words, instead of wios, you can record the TSP with a combination of common playback tools (such as aplay) and recording tools (such as arecord).

Please refer to the following URL for the settings needed to use the transfer function creation algorithm newly provided in HARKTOOL5.

Below is an example that concatenates the TSP 60 times (about 60 seconds in total) and plays it in the background on the default playback device.

sox /usr/share/hark/16384.little_endian.wav tsp_60sec.wav repeat 59
play tsp_60sec.wav &

You can prepare the recording data by the following steps.

1. Keep playing the TSP repeatedly from the direction of the sound source you want to record.
A monitoring speaker with a flat frequency response is best, but a normal speaker can be used if you pay attention to the following points.
・In general, setting the volume too low or too high causes distortion, depending on the characteristics of the amplifier. It is better not to go below 20% or above 80%.
・PC speakers with a reasonably large diameter are better than the small speakers built into smartphones. You can also connect a good quality speaker to your smartphone.
・On PCs, software effects (theater mode, etc.) bundled with the audio chipset may be enabled, and on smartphones, the vendor’s own virtual high quality technology may be enabled. Please turn these off.

2. Record the TSP played in “1.” on all channels of your microphone array. Any recording software can be used, but a sampling frequency of 16 kHz is recommended so that HARK can be used with its default settings.
Respeaker can output 1ch data after beamforming, but please record all channels of raw PCM without using the beamforming function. It is a good idea to record at least 20 seconds per direction.

3. Repeat “1.” to “2.” for multiple sound source directions. Be sure to include the direction of the separation target.
Please adjust the speaker volume and microphone sensitivity to avoid clipping. Once they are set, do not change them, and use the same settings for all directions. Be especially careful if the speaker is close to the microphone array.

Regarding the distance between the speaker (sound source) and the microphone array:
Place the loudspeaker at the position of the sound source you want to localize or separate. For voice, the loudspeaker should be positioned close to where the mouth of the person speaking will be. For a conversational robot this is about 1 m away; if you are seated and talking to a PC, it is about 30 cm. No minimum distance is specified, but be careful not to clip.

4. After recording is completed in all directions, cut out from each recording a section that contains sound and is at least 10 seconds long (e.g. 16 seconds). You can use any software to cut it out, but please be careful not to convert the output format. With software such as sox, if you specify only the start time and length with the trim command, the format is preserved.
For ease of setting in HARKTOOL5, the PCM data for each direction should have the same length. Also, the room reflections are not yet included immediately after the sound starts, so it is better to start cutting about 1 second after the start of the section containing sound, not immediately at its start. However, if the reverberation continues for a few seconds, as in a hall, shift the extraction start point according to the reverberation time so that the extracted section fully includes the reflected sound.
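
The cutting in step 4 could be sketched with sox as follows. This is only an illustration, not an official HARK procedure: the function name cut_segment, the cut_ output prefix, and the file names are hypothetical, and the defaults (start at 1 second, length 16 seconds) are simply the example values mentioned above.

```shell
#!/bin/sh
# cut_segment INPUT [START_SEC] [LENGTH_SEC]
# Writes cut_<INPUT file name> in the current directory, containing
# LENGTH_SEC seconds of INPUT starting START_SEC seconds in.
# Only the trim effect is given and no output format options are set,
# so the format of the input is not converted.
cut_segment() {
    in=$1
    start=${2:-1}     # skip about 1 s so room reflections are included
    length=${3:-16}   # example length from the text (10 s or more)
    out="cut_$(basename "$in")"
    sox "$in" "$out" trim "$start" "$length"
}

# Hypothetical usage, same start and length for every direction:
#   for f in tsp_*.wav; do cut_segment "$f" 1 16; done
```

The 1-second default assumes each recording was started while the TSP was already playing; in a reverberant room, pass a larger start value.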

And I would like to know if there is a minimum distance between the loudspeaker and the microphone array when the response is being recorded?

As shown in the steps above, there is no minimum distance as long as you avoid clipping. However, if the microphone array is placed directly on a table, it can be adversely affected by reflections from the table, and it can also pick up table vibrations. If you judge that there is an adverse effect, consider raising the microphone array off the table with a tripod or the like.
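
One possible way to check a recording for clipping is to read the peak amplitude that the sox stat effect reports. The sketch below is an assumption-laden example, not part of HARK: the function names peak_level and looks_clipped are hypothetical, and the 0.99 threshold is an arbitrary margin chosen for illustration.

```shell
#!/bin/sh
# peak_level FILE
# Prints the largest absolute sample value in FILE on a 0..1 scale,
# taken from the "Maximum amplitude" and "Minimum amplitude" lines
# that "sox FILE -n stat" writes to stderr.
peak_level() {
    sox "$1" -n stat 2>&1 | awk '
        /^(Maximum|Minimum) amplitude/ {
            v = $3 + 0
            if (v < 0) v = -v
            if (v > peak) peak = v
        }
        END { printf "%.6f\n", peak + 0 }'
}

# looks_clipped FILE [THRESHOLD]
# Exit status 0 when the peak reaches THRESHOLD, non-zero otherwise.
# ASSUMPTION: the 0.99 default is an illustrative margin, not a value
# specified by HARK.
looks_clipped() {
    awk -v p="$(peak_level "$1")" -v t="${2:-0.99}" \
        'BEGIN { exit !(p + 0 >= t + 0) }'
}

# Hypothetical usage:
#   looks_clipped tsp_015.wav && echo "lower the volume and record again"
```

If a recording is flagged, lower the speaker volume or microphone sensitivity and then record every direction again with the new settings, since the settings must be the same in all directions.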

The location where the recording is made is complex with obstacles, what is the minimum number of TSP recordings that you recommend around the microphone array?

If the sound source position is fixed, the only mandatory source direction is the direction where the source to be separated is located. The sound source positions do not have to be evenly spaced. A sound source here is not limited to the target, such as a voice; it also includes noise from a specific direction if you want to remove it by separation. In addition, the recording must be made in a positional relationship where the sound reaches the microphone array directly. When recording the TSP, be careful not to place any obstacles between the speaker and the microphone array that would block the direct sound.

One concern is that in environments where the transfer function varies significantly with azimuth (e.g., environments with many obstacles or rooms with complex shapes), the transfer function may not match once the source deviates from the known direction, which degrades the separation performance. This is not a problem for simulations that synthesize inputs, but for live demonstrations it is recommended to also create a transfer function for positions that are 5 to 10 degrees off the original position of the sound source. With such transfer functions, even if the speaker’s position deviates a little, the SourceTracker node can track the movement of the nearby sound source.
The transfer function we provide is created with 72 directions in 5 degree units, for 360 degree omnidirectional tracking of a moving sound source, assuming that the speaker walks around the robot. On the other hand, in an environment where you know there are speakers at 15 and 45 degrees, it is fine to restrict the directions of the transfer function to just those in which the speakers are. For example, 10, 15, 20, 40, 45, and 50 degrees may be enough, assuming that the position of each speaker is slightly off. Even with such a direction-limited transfer function, it is possible to separate speaker A at 10 to 20 degrees and speaker B at 40 to 50 degrees. Of course, if you simply tell speakers A and B to stay at points marked as the sound source locations during the demo, you do not need additional TSP recordings in the vicinity of the assumed sound sources.
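
For the direction-limited example above, the per-direction recording could be sketched with arecord as follows. This is a minimal sketch under stated assumptions: the function name record_direction and the tsp_NNN.wav file names are hypothetical, the default of 4 channels is a placeholder for your own array, and the default capture device is assumed to be the microphone array (otherwise add arecord's -D option with your device name).

```shell
#!/bin/sh
# record_direction DEG [SECONDS] [CHANNELS]
# Records raw multi-channel PCM for one source direction while the TSP
# is already playing from that direction. DEG is a plain decimal
# number such as 15 (no leading zeros).
record_direction() {
    deg=$1
    secs=${2:-20}   # at least 20 s per direction
    ch=${3:-4}      # ASSUMPTION: 4 microphones; set this to your array
    out=$(printf 'tsp_%03d.wav' "$deg")
    # 16-bit little-endian PCM at 16 kHz, the HARK default sampling rate
    arecord -f S16_LE -r 16000 -c "$ch" -d "$secs" "$out"
}

# Hypothetical usage for the six directions in the example:
#   for deg in 10 15 20 40 45 50; do
#       printf 'Place the speaker at %s degrees, then press Enter: ' "$deg"
#       read _
#       record_direction "$deg"
#   done
```

Keeping the direction in the file name makes it easier to match each recording to its direction when setting up HARKTOOL5.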

If you have any questions, please feel free to contact us.
Thank you.

Best regards,
HARK support team