Function file_duration_table long processing time (#181) · Issues · public_projects / ketos

Function file_duration_table long processing time

I noticed an issue with the function file_duration_table below: it takes an absurdly long time to process (10 minutes) when there is a large number of files (20000+):

def file_duration_table(path, search_subdirs=False, datetime_format=None):
    """ Create file duration table.

        Args:
            path: str
                Path to folder with audio files (\*.wav)
            search_subdirs: bool
                If True, also search for audio files in subdirectories.
                Default is False.
            datetime_format: str
                String defining the date-time format. 
                Example: %d_%m_%Y* would capture "14_3_1999.txt".
                See https://pypi.org/project/datetime-glob/ for a list of valid directives.
                If specified, the method will attempt to parse the datetime information from the filename.

        Returns:
            df: pandas DataFrame
                File duration table. Columns: filename, duration, (datetime)
    """
    paths = find_wave_files(path=path, return_path=True, search_subdirs=search_subdirs)
    durations = [librosa.get_duration(filename=os.path.join(path,p)) for p in paths]
    df = pd.DataFrame({'filename':paths, 'duration':durations})
    if datetime_format is None:
        return df

    df['datetime'] = df.apply(lambda x: parse_datetime(os.path.basename(x.filename), fmt=datetime_format), axis=1)
    return df

I know 20000+ is a large number of files, and I am reading them from an external hard drive connected via USB, but I don't think it should take this long, especially considering what the function does and what it is supposed to do. The culprit here is this line:

durations = [librosa.get_duration(filename=os.path.join(path,p)) for p in paths]

Everything else runs in less than a second. I wasn't able to identify why this line takes so long, and the timing is not consistent: if I run my script a second time, it is much quicker (only a few seconds). Maybe the file data is cached by then? I am not sure.
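
One way to check the caching guess would be to time each call separately and compare a first (cold) run against a second (warm) run. The helper names below (time_calls, summarize) are hypothetical and not part of ketos or librosa; this is only a rough sketch using the standard library.

```python
import time

def time_calls(func, items):
    """Call func on each item and record how long each call takes.

    Returns (results, timings), where timings are in seconds.
    """
    results, timings = [], []
    for item in items:
        start = time.perf_counter()
        results.append(func(item))
        timings.append(time.perf_counter() - start)
    return results, timings

def summarize(timings):
    """Return total, mean and max of a list of per-call timings (seconds)."""
    total = sum(timings)
    mean = total / len(timings) if timings else 0.0
    return {"total": total, "mean": mean, "max": max(timings, default=0.0)}
```

For example, passing a function that wraps librosa.get_duration(filename=...) and the list of full file paths, then running it twice in a row, would show whether the per-file cost drops sharply on the second pass. If it does, the operating system's file cache is a likely explanation, though this sketch cannot prove it.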

Regardless, let me explain why I think this function should be quicker anyway. Here is the librosa function we are calling:

def get_duration(
    *, y=None, sr=22050, S=None, n_fft=2048, hop_length=512, center=True, filename=None
):
    """Compute the duration (in seconds) of an audio time series,
    feature matrix, or filename.

    Examples
    --------
    >>> # Load an example audio file
    >>> y, sr = librosa.load(librosa.ex('trumpet'))
    >>> librosa.get_duration(y=y, sr=sr)
    5.333378684807256

    >>> # Or directly from an audio file
    >>> librosa.get_duration(filename=librosa.ex('trumpet'))
    5.333378684807256

    >>> # Or compute duration from an STFT matrix
    >>> y, sr = librosa.load(librosa.ex('trumpet'))
    >>> S = librosa.stft(y)
    >>> librosa.get_duration(S=S, sr=sr)
    5.317369614512471

    >>> # Or a non-centered STFT matrix
    >>> S_left = librosa.stft(y, center=False)
    >>> librosa.get_duration(S=S_left, sr=sr)
    5.224489795918367

    Parameters
    ----------
    y : np.ndarray [shape=(..., n)] or None
        audio time series. Multi-channel is supported.

    sr : number > 0 [scalar]
        audio sampling rate of ``y``

    S : np.ndarray [shape=(..., d, t)] or None
        STFT matrix, or any STFT-derived matrix (e.g., chromagram
        or mel spectrogram).
        Durations calculated from spectrogram inputs are only accurate
        up to the frame resolution. If high precision is required,
        it is better to use the audio time series directly.

    n_fft : int > 0 [scalar]
        FFT window size for ``S``

    hop_length : int > 0 [ scalar]
        number of audio samples between columns of ``S``

    center : boolean
        - If ``True``, ``S[:, t]`` is centered at ``y[t * hop_length]``
        - If ``False``, then ``S[:, t]`` begins at ``y[t * hop_length]``

    filename : str
        If provided, all other parameters are ignored, and the
        duration is calculated directly from the audio file.
        Note that this avoids loading the contents into memory,
        and is therefore useful for querying the duration of
        long files.

        As in ``load``, this can also be an integer or open file-handle
        that can be processed by ``soundfile``.

    Returns
    -------
    d : float >= 0
        Duration (in seconds) of the input time series or spectrogram.

    Raises
    ------
    ParameterError
        if none of ``y``, ``S``, or ``filename`` are provided.

    Notes
    -----
    `get_duration` can be applied to a file (``filename``), a spectrogram (``S``),
    or audio buffer (``y, sr``).  Only one of these three options should be
    provided.  If you do provide multiple options (e.g., ``filename`` and ``S``),
    then ``filename`` takes precedence over ``S``, and ``S`` takes precedence over
    ``(y, sr)``.
    """

    if filename is not None:
        try:
            return sf.info(filename).duration
        except RuntimeError:
            with audioread.audio_open(filename) as fdesc:
                return fdesc.duration

    if y is None:
        if S is None:
            raise ParameterError(
                "At least one of (y, sr), S, or filename must be provided"
            )

        n_frames = S.shape[-1]
        n_samples = n_fft + hop_length * (n_frames - 1)

        # If centered, we lose half a window from each end of S
        if center:
            n_samples = n_samples - 2 * int(n_fft // 2)

    else:
        n_samples = y.shape[-1]

    return float(n_samples) / sr

This function can get the duration from several kinds of input, and the filename option is checked first. In particular, our code will hit this line right at the start:

return sf.info(filename).duration

What this soundfile function does is read the file's metadata to retrieve the duration. It doesn't read the audio data in the WAV file to calculate the duration, so it should be very efficient.
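
To illustrate the idea of a header-only duration, here is a minimal sketch using Python's standard wave module instead of soundfile. It assumes plain PCM WAV files (the wave module does not handle every WAV variant), and wav_duration is a hypothetical name, not a function from ketos, librosa or soundfile.

```python
import wave

def wav_duration(path):
    """Return the duration in seconds of a PCM WAV file.

    Only the header is consulted: duration = number of frames / sample rate.
    The audio samples themselves are never read.
    """
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()
```

Since only a few bytes per file are needed, the cost per file should be dominated by opening the file and seeking on the disk rather than by the size of the recording.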

Perhaps there is not really an issue, and it just takes a long time because I am reading the files from an external hard drive over USB.
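
If the time really is spent waiting on the drive, one thing that could be tried is issuing the duration queries from several threads so the waits overlap. This is only a sketch under that assumption: durations_threaded is a hypothetical helper, and on a spinning USB disk the extra seeking may give little benefit or even make things slower.

```python
from concurrent.futures import ThreadPoolExecutor

def durations_threaded(paths, get_duration, max_workers=8):
    """Apply get_duration to every path using a pool of threads.

    pool.map preserves input order, so the result lines up with paths.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(get_duration, paths))
```

Here get_duration could be, for instance, a small function returning sf.info(p).duration for a full path p. Comparing the total time against the plain list comprehension on a cold run would show whether threading helps on this particular setup.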

Did you guys run into a similar issue?
