New feature: Custom audio representations (!255) · Merge requests · public_projects / ketos

New feature: Custom audio representations

This merge request introduces a new, but similar, way to load audio data and convert it to an audio representation.

TL;DR:

The AudioLoader class has two new arguments, representation and representation_params, which replace the single repres argument. This allows a custom audio representation to be passed.
The AudioFrameLoader and AudioEfficientFrameLoader classes received the same changes as the AudioLoader class.
The create_database function now requires the audio_repres dict to contain the representation class itself rather than a string with the name of the class.
The parsing module was changed so that it can convert a JSON string, or an unparsed dict of strings, into a Python dict with classes and correct types, and vice versa. The affected functions are encode_parameter and parse_audio_representation.
Tests were added or modified accordingly.

Longer explanation:

Currently, we can only use the audio representations defined in ketos (MagSpectrogram, PowerSpectrogram, Waveform, etc.), and there is no easy way to introduce new, custom audio representations. This merge request adds that option, which gives users a lot of flexibility: they can introduce not only new audio representations but, in fact, do anything with the data before it is saved into the database. Examples are given further below.

This merge request changes the AudioLoader class to receive two new parameters in place of the repres dictionary: representation and representation_params. Here, representation is a class (either a custom user-defined class or one already in ketos, such as MagSpectrogram) and representation_params holds the parameters used to initialize an object of that class. For instance, instead of calling the AudioLoader in the following way:

rep = {'type':'MagSpectrogram', 'window':0.2, 'step':0.02, 'window_func':'hamming'}
loader = AudioLoader(selection_gen=generator, repres=rep)

We now have to call it like this:

from ketos.audio.spectrogram import MagSpectrogram
rep = {'window':0.2, 'step':0.02, 'window_func':'hamming'}
loader = AudioLoader(selection_gen=generator, representation=MagSpectrogram, representation_params=rep)

Or like this:

from ketos.audio.spectrogram import MagSpectrogram
rep = {'type':MagSpectrogram, 'window':0.2, 'step':0.02, 'window_func':'hamming'}
loader = AudioLoader(selection_gen=generator, representation=rep['type'], representation_params=rep)

Note that we no longer pass a string with the class name but the class itself as the representation, and the parameters used to initialize the class are passed as a second argument, in the same way as before. To use a custom audio representation, you just need to import the custom class and use it in the same way. For instance:

from my_module import MyCustomAudioRepresentation  # my_module is a placeholder for your own module
rep = {'type': MyCustomAudioRepresentation, 'any':'parameter', 'for':'the', 'custom':'representation'}
loader = AudioLoader(selection_gen=generator, representation=rep['type'], representation_params=rep)
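To make the class-plus-parameters pattern concrete, here is a minimal, self-contained sketch that does not use ketos. ToyRepresentation and ToyLoader are hypothetical stand-ins, and the way the loader drops the 'type' key and calls from_wav is an assumption about how such a loader could work, not a description of the internals of AudioLoader.

```python
class ToyRepresentation:
    """Hypothetical stand-in for a representation class such as MagSpectrogram."""

    def __init__(self, path, window, step):
        self.path = path
        self.window = window
        self.step = step

    @classmethod
    def from_wav(cls, path, **params):
        # A real representation would read and transform the audio here.
        return cls(path, **params)


class ToyLoader:
    """Hypothetical loader that receives a class and its parameters separately."""

    def __init__(self, paths, representation, representation_params=None):
        self.paths = list(paths)
        self.representation = representation
        params = dict(representation_params or {})
        # Assumption: a 'type' entry names the class and is not a parameter.
        params.pop('type', None)
        self.params = params

    def __iter__(self):
        # Build one representation object per file.
        for path in self.paths:
            yield self.representation.from_wav(path, **self.params)


rep = {'type': ToyRepresentation, 'window': 0.2, 'step': 0.02}
loader = ToyLoader(['a.wav', 'b.wav'], representation=rep['type'], representation_params=rep)
items = list(loader)
```

Because the loader only relies on the class exposing from_wav, any class with a compatible method could be swapped in without touching the loader.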

Even with this change, ketos keeps the ability to specify the configuration as a dict, with a slight change and a few more options. For example, this is how we previously created an HDF5 database with the create_database function:

import ketos.data_handling.database_interface as dbi
config = {'type':'MagSpectrogram', 'window':0.2, 'step':0.02, 'window_func':'hamming'}
dbi.create_database(output_file, data_dir, selections, audio_repres=config)

Now, we need to pass a config with the class itself instead of the class name:

import ketos.data_handling.database_interface as dbi
from ketos.audio.spectrogram import MagSpectrogram
config = {'type':MagSpectrogram, 'window':0.2, 'step':0.02, 'window_func':'hamming'}
dbi.create_database(output_file, data_dir, selections, audio_repres=config)

To keep the ability to write the entire spectrogram configuration in a JSON file, I have modified and enhanced the parsing module to convert the JSON string into a Python dict with the appropriate classes and types, and vice versa. Given an unparsed dict, parse_audio_representation produces the following output:

from ketos.data_handling.parsing import parse_audio_representation
audio_representation = {'type':'MagSpectrogram', 'window':0.2, 'step':0.02, 'window_func':'hamming'}
parse_audio_representation(audio_representation)

{'type': ketos.audio.spectrogram.MagSpectrogram, 'window': 0.2, 'step': 0.02, 'window_func': 'hamming'}
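The name-to-class step can be illustrated with a small standalone sketch. It is not the ketos implementation: the registry, the stand-in classes and the resolve_type helper are all hypothetical and only show the general idea of replacing a class name with the class object.

```python
class MagSpectrogram:
    """Stand-in for a built-in representation class."""


class Waveform:
    """Stand-in for a built-in representation class."""


# Hypothetical registry of known representation names.
KNOWN_REPRESENTATIONS = {cls.__name__: cls for cls in (MagSpectrogram, Waveform)}


def resolve_type(config):
    """Return a copy of config whose 'type' entry is a class, not a string."""
    parsed = dict(config)
    typ = parsed.get('type')
    if isinstance(typ, str):
        try:
            parsed['type'] = KNOWN_REPRESENTATIONS[typ]
        except KeyError:
            raise ValueError(f"unknown representation: {typ!r}") from None
    return parsed


parsed = resolve_type({'type': 'MagSpectrogram', 'window': 0.2, 'step': 0.02})
```

A config that already contains a class passes through unchanged, so the same helper could accept both the old string form and the new class form.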

Full example using load_audio_representation to read a JSON file:

import json
import os
from ketos.data_handling.parsing import load_audio_representation
# create json file with spectrogram settings
json_str = '{"spectrogram": {"type": "MagSpectrogram", "rate": "20 kHz", "window": "0.1 s", "step": "0.025 s", "window_func": "hamming", "freq_min": "30Hz", "freq_max": "3000Hz"}}'
path = 'ketos/tests/assets/tmp/config.py'
with open(path, 'w') as file:
    file.write(json_str)
# load settings back from json file
settings = load_audio_representation(path=path, name='spectrogram')
print(settings)
os.remove(path)

The above code should output something like: {'type': <class 'ketos.audio.spectrogram.MagSpectrogram'>, 'rate': 20000.0, 'window': 0.1, 'step': 0.025, 'window_func': 'hamming', 'freq_min': 30, 'freq_max': 3000}
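The conversion of strings such as "20 kHz" or "0.1 s" into plain numbers can be sketched as follows. This is only an illustration with a regular expression and a hypothetical table of unit factors; ketos may well handle units differently (for example through a dedicated units library), and this sketch always returns floats.

```python
import re

# Hypothetical multipliers to base units (Hz and seconds).
UNIT_FACTORS = {'hz': 1.0, 'khz': 1e3, 's': 1.0, 'ms': 1e-3}

# A number, optional whitespace, then a unit made of letters.
_QUANTITY = re.compile(r'^\s*([-+]?\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$')


def parse_quantity(text):
    """Convert a string like '20 kHz' or '30Hz' to a float in base units."""
    match = _QUANTITY.match(text)
    if match is None:
        raise ValueError(f"cannot parse quantity: {text!r}")
    value, unit = match.groups()
    try:
        factor = UNIT_FACTORS[unit.lower()]
    except KeyError:
        raise ValueError(f"unknown unit: {unit!r}") from None
    return float(value) * factor


raw = {"rate": "20 kHz", "window": "0.1 s", "step": "0.025 s", "freq_min": "30Hz"}
converted = {key: parse_quantity(value) for key, value in raw.items()}
```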

It is now also possible to use a custom audio representation through the config file by also specifying the path to the module where the custom class is defined:

json_str = '{"custom_representation": {"type": "Cepstrum", "module": "path/to/my/audio_representation.py", "any": "parameter", "for": "the", "custom": "representation"}}'
settings = load_audio_representation(path='path/to/the/json/file', name='custom_representation')
{'type': <class 'audio_representation.Cepstrum'>, 'module': 'path/to/my/audio_representation.py', 'any': 'parameter', 'for': 'the', 'custom': 'representation'}
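For readers curious how a class can be picked up from a module path, the sketch below uses the standard importlib machinery. The load_class_from_file helper is hypothetical and is not taken from ketos; the demo writes a throwaway audio_representation.py with an empty Cepstrum class so that the example is self-contained.

```python
import importlib.util
import os
import tempfile


def load_class_from_file(module_path, class_name):
    """Import the Python file at module_path and return its attribute class_name."""
    # Use the file name without extension as the module name.
    module_name = os.path.splitext(os.path.basename(module_path))[0]
    spec = importlib.util.spec_from_file_location(module_name, module_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot import module from {module_path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, class_name)


# Demo with a temporary module containing a placeholder class.
with tempfile.TemporaryDirectory() as tmp:
    module_path = os.path.join(tmp, 'audio_representation.py')
    with open(module_path, 'w') as f:
        f.write("class Cepstrum:\n    pass\n")
    cepstrum_cls = load_class_from_file(module_path, 'Cepstrum')
```

The class reports its module as the file name without the extension, which is consistent with the audio_representation.Cepstrum shown in the output above.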

The encode_parameter function of the parsing module now properly converts a class into a string. The export functions use this to create a dict with properly encoded values.
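The reverse direction, from class back to string, might look like the sketch below. The encode_config helper is hypothetical and far simpler than encode_parameter; it only shows why the conversion is needed before a config can be written out as JSON.

```python
import json


class MagSpectrogram:
    """Stand-in for a representation class."""


def encode_config(config):
    """Return a copy of config in which classes are replaced by their names."""
    return {key: value.__name__ if isinstance(value, type) else value
            for key, value in config.items()}


config = {'type': MagSpectrogram, 'window': 0.2, 'step': 0.02, 'window_func': 'hamming'}
encoded = encode_config(config)
# json.dumps would raise TypeError on the raw config because a class is not serializable.
json_str = json.dumps(encoded)
```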

Note that custom classes aren't limited to new audio representations. They can be anything, really. For instance, if I currently want to create a dataset of augmented spectrograms by mixing two .wav files, I have to:

1. Create a script that loads two wav files and mixes them
2. Save the new augmented wav file
3. Load this new file through ketos and create the representation I want from the list we have

With a custom class, I can skip step 2 by passing the script as a class to ketos as my "custom audio representation". This should save a lot of processing time as well as a lot of disk space, especially if I have a big dataset.
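As a rough illustration of such a class, the sketch below mixes two recordings inside a from_wav class method, using only the Python standard library. Everything in it is hypothetical: the MixedWaveform name, the other_path and weight parameters, and the assumption that they would be supplied through the representation parameters. It also only handles 16-bit mono files.

```python
import array
import os
import sys
import tempfile
import wave


def read_mono_int16(path):
    """Read a 16-bit mono wav file and return (sample_rate, list of samples)."""
    with wave.open(path, 'rb') as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit mono files")
        samples = array.array('h')
        samples.frombytes(wav.readframes(wav.getnframes()))
        if sys.byteorder == 'big':
            samples.byteswap()  # wav data is little-endian
        return wav.getframerate(), samples.tolist()


def write_mono_int16(path, samples, rate=8000):
    """Write a list of integers as a 16-bit mono wav file (demo helper)."""
    data = array.array('h', samples)
    if sys.byteorder == 'big':
        data.byteswap()
    with wave.open(path, 'wb') as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(data.tobytes())


class MixedWaveform:
    """Hypothetical custom representation: the weighted sum of two recordings."""

    def __init__(self, data, rate):
        self.data = data
        self.rate = rate

    @classmethod
    def from_wav(cls, path, other_path, weight=0.5, **kwargs):
        rate, first = read_mono_int16(path)
        other_rate, second = read_mono_int16(other_path)
        if rate != other_rate:
            raise ValueError("sample rates differ")
        # zip stops at the shorter recording, so the mix has its length.
        data = [(1 - weight) * x + weight * y for x, y in zip(first, second)]
        return cls(data, rate)


# Demo: mix two tiny files in memory, without saving the mixed result to disk.
with tempfile.TemporaryDirectory() as tmp:
    first_path = os.path.join(tmp, 'first.wav')
    second_path = os.path.join(tmp, 'second.wav')
    write_mono_int16(first_path, [0, 100, 200, 300])
    write_mono_int16(second_path, [400, 300, 200, 100])
    mixed = MixedWaveform.from_wav(first_path, other_path=second_path, weight=0.5)
```

The mixed signal exists only in memory, which is the point of the approach: no intermediate augmented wav file has to be written.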

Limitation: currently, any new custom class must implement a from_wav() method in order to be used. I plan to change this in the near future so that all representations in ketos are initialized through the constructor by default.

Observations:

I think a very simple "tutorial" or example would be a useful reference for users.

As an additional observation, I think we should rename the representation key 'type' to 'class' throughout ketos, so as not to confuse a variable type with a representation class.

Finally, it would be a good idea to review whether the load_audio function in the database interface is still needed. It doesn't seem to be used anywhere, it seems to contain a lot of legacy code, and it does something the user can easily do themselves.

Edited Aug 03, 2022 by Fabio Frazao