create_database indexes separate annotations table incorrectly
When using the database_interface.create_database function to create a database with a separate annotations table, the resulting annotations table has incorrect values for the "data_index" column.
This is the case used when each data sample in the data table can have multiple annotations in the data_annot table, and the data_index field is supposed to act as a key linking each annotation to its corresponding data row.
However, I found that the values of this column don't actually match the data table.
Here is a minimal example to reproduce the issue:
import pandas as pd
from ketos.data_handling import selection_table as sl
import ketos.data_handling.database_interface as dbi
from ketos.audio.waveform import Waveform
sample_audio = Waveform.cosine(rate=1000,frequency=1000,duration=30.0)
sample_audio.to_wav("sample_audio.wav")
filelist = pd.DataFrame([{"filename":"sample_audio.wav", "duration":30.0}])
annot = pd.DataFrame([{"filename":"sample_audio.wav", "start":2.0, "end":3.0, "label":"A"},
                      {"filename":"sample_audio.wav", "start":5.0, "end":6.0, "label":"A"},
                      {"filename":"sample_audio.wav", "start":21.0, "end":22.0, "label":"A"},
                      {"filename":"sample_audio.wav", "start":25.0, "end":27.0, "label":"A"}])
annot_std = sl.standardize(table=annot, labels=["A"],trim_table=False, start_labels_at_1=True)
selection_table = sl.select_by_segmenting(files=filelist, length=10, annotations=annot_std, step=None, discard_empty=False, pad=False)
audio_repr = {'duration': 10.0,
              'rate': 1000,
              'window': 0.051,
              'step': 0.01955,
              'freq_min': 0,
              'freq_max': 500,
              'window_func': 'hamming',
              'type': 'MagSpectrogram'}
dbi.create_database(output_file="db.h5", data_dir="./",
                    dataset_name='train/', selections=selection_table[0], annotations=selection_table[1],
                    audio_repres=audio_repr, include_attrs=True, unique_labels=[1], include_label=False)
Once db.h5 is created, reading the `data_index` column returns:
import tables
db = tables.open_file("db.h5",'r')
tbl = db.get_node("/train/data_annot")
tbl.col("data_index")
array([0, 0, 0, 0], dtype=uint32)
But what I expected was:
array([0, 0, 2, 2], dtype=uint32)
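For reference, here is how I arrive at the expected values. This is only a sketch in plain Python, not ketos code. It assumes that the 30 s file is cut into three consecutive 10 s segments, and that because discard_empty=False all three are kept and stored in order as rows 0, 1 and 2 of the data table. The variable names are my own.

```python
# Segment length passed to select_by_segmenting (length=10)
segment_length = 10.0

# Start times of the four annotations in the example above
annot_starts = [2.0, 5.0, 21.0, 25.0]

# Assumption: row i of the data table holds the segment [10*i, 10*(i+1)),
# so each annotation should point at the segment containing its start time.
expected_index = [int(start // segment_length) for start in annot_starts]

print(expected_index)  # [0, 0, 2, 2]
```

The first two annotations fall in the first segment (row 0) and the last two in the third segment (row 2). The middle segment has no annotations, so index 1 should not appear.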