How to detect/classify signals with very different durations
Hydrophone recordings, such as the ones made by the team in Florida, can contain sounds of very different durations, e.g., a fish grunt (less than 1 second) and a boat passing by the hydrophone (several tens of seconds). How do we deal with such data? A short time window (say, 1 second) is well suited to detecting fish grunts, but may struggle to detect a passing boat because it fails to capture the full information, e.g., the initial increase and subsequent drop-off in sound intensity as the boat first approaches and then moves away from the hydrophone. A simple way to deal with this particular situation would be to train two separate neural networks, one fed with 1-second clips, the other with 1-minute clips. This approach could of course be extended to more than two networks, at the cost of training and running one network per time scale. But perhaps there are better and more efficient ways to do this? Perhaps a recurrent network with long short-term memory (LSTM) cells could be used?
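To make the two-network idea concrete, here is a minimal sketch of how one recording could be cut into clips at both time scales. The function name `window_bounds`, the 1 kHz sample rate, the 5-minute recording length, and the 50% overlap between consecutive windows are all assumptions chosen for illustration, not settings taken from the Florida data.

```python
def window_bounds(n_samples, sample_rate, window_sec, hop_sec):
    """Return (start, end) sample indices of fixed-length windows.

    Hypothetical helper: windows that would run past the end of the
    recording are dropped rather than padded.
    """
    win = int(round(window_sec * sample_rate))
    hop = int(round(hop_sec * sample_rate))
    if win <= 0 or hop <= 0:
        raise ValueError("window and hop must be at least one sample long")
    return [(start, start + win)
            for start in range(0, n_samples - win + 1, hop)]


# Assumed recording: 5 minutes sampled at 1 kHz.
sample_rate = 1000
n_samples = 300 * sample_rate

# 1-second clips for short events such as fish grunts.
short_windows = window_bounds(n_samples, sample_rate,
                              window_sec=1.0, hop_sec=0.5)

# 1-minute clips for long events such as a passing boat.
long_windows = window_bounds(n_samples, sample_rate,
                             window_sec=60.0, hop_sec=30.0)

print(len(short_windows), "short clips,", len(long_windows), "long clips")
```

Each list of index pairs would then be used to slice the waveform (or its spectrogram) into the clips fed to the corresponding network. Overlapping windows are one possible choice here; they reduce the chance that an event is split across a clip boundary.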
@fsfrazao, any thoughts?