create_database indexes separate annotations table incorrectly (#175) · Issues · public_projects / ketos

create_database indexes separate annotations table incorrectly

When using the database_interface.create_database function to create a database with a separate annotations table, the resulting annotations table has incorrect values in the "data_index" column.

This is the case where each data sample in the data table can have multiple annotations in the data_annot table. The data_index field is supposed to act as a key linking each annotation to its corresponding row in the data table.
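To make the intended relationship concrete, here is a small pandas sketch of two toy tables linked through data_index. It does not use ketos or the HDF5 file. The "offset", "start" and "label" columns and their values are hypothetical stand-ins, and only the data_index name comes from the actual database.

```python
import pandas as pd

# Toy "data" table: one row per data sample; the row position is the data index.
# The "offset" column is a hypothetical stand-in for the real sample contents.
data = pd.DataFrame({"offset": [0.0, 10.0, 20.0]})

# Toy "data_annot" table: several annotations may point at the same data row.
annot = pd.DataFrame({
    "data_index": [0, 0, 2, 2],
    "start": [2.0, 5.0, 21.0, 25.0],
    "label": [1, 1, 1, 1],
})

# Link each annotation to its data row through the data_index key.
linked = annot.merge(data, left_on="data_index", right_index=True, how="left")

# Count the annotations attached to each data row (rows without any get 0).
counts = annot.groupby("data_index").size().reindex(data.index, fill_value=0)
print(counts.tolist())
```

If data_index is wrong, a lookup like this silently attaches annotations to the wrong samples.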

However, I found that the values of this column don't actually match the data table.

Here is a minimal example to reproduce the issue:

import pandas as pd
from ketos.data_handling import selection_table as sl
import ketos.data_handling.database_interface as dbi
from ketos.audio.waveform import Waveform




sample_audio = Waveform.cosine(rate=1000,frequency=1000,duration=30.0)
sample_audio.to_wav("sample_audio.wav")
filelist = pd.DataFrame([{"filename":"sample_audio.wav", "duration":30.0}])


annot = pd.DataFrame([{"filename":"sample_audio.wav", "start":2.0, "end":3.0, "label":"A"},
                     {"filename":"sample_audio.wav", "start":5.0, "end":6.0, "label":"A"},
                     {"filename":"sample_audio.wav", "start":21.0, "end":22.0, "label":"A"},
                     {"filename":"sample_audio.wav", "start":25.0, "end":27.0, "label":"A"}])


annot_std = sl.standardize(table=annot, labels=["A"],trim_table=False, start_labels_at_1=True)

selection_table = sl.select_by_segmenting(files=filelist, length=10, annotations=annot_std, step=None, discard_empty=False, pad=False)

audio_repr = {'duration': 10.0,
              'rate': 1000,
              'window': 0.051,
              'step': 0.01955,
              'freq_min': 0,
              'freq_max': 500,
              'window_func': 'hamming',
              'type': 'MagSpectrogram'}

dbi.create_database(output_file="db.h5", data_dir="./",
                    dataset_name='train/', selections=selection_table[0], annotations=selection_table[1],
                    audio_repres=audio_repr, include_attrs=True, unique_labels=[1], include_label=False)

Once db.h5 is created, reading the `data_index` column returns:

import tables
db = tables.open_file("db.h5",'r')
tbl = db.get_node("/train/data_annot")
tbl.col("data_index")

array([0, 0, 0, 0], dtype=uint32)

But what I expected was:

array([0, 0, 2, 2], dtype=uint32)
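The expected values can be derived by hand from the example above. The sketch below is only an illustration, not part of ketos. It assumes that the 30 s file is split into consecutive, non-overlapping 10 s segments starting at 0 s (length=10, step=None), that the segments are stored in the data table in that order, and that each annotation lies entirely within one segment.

```python
import numpy as np

# Segment length passed to select_by_segmenting in the example above.
segment_length = 10.0

# Start times (in seconds) of the four annotations in the example.
starts = np.array([2.0, 5.0, 21.0, 25.0])

# With consecutive non-overlapping segments, an annotation starting at time t
# belongs to segment floor(t / segment_length).
expected = np.floor(starts / segment_length).astype(np.uint32)
print(expected)  # [0 0 2 2]
```

The first two annotations fall in the 0-10 s segment (row 0) and the last two in the 20-30 s segment (row 2), while the 10-20 s segment (row 1) has none.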

Edited May 18, 2022 by Fabio Frazao
