for professional video editing, the problems come from finding suitable sounds, aligning sounds, video and tuning parameters
matches sound efects to video - uses quality sound efects libraries and CLIP - Concretely, the system has three parts: classification, synchronization, and mix
The classification matches efects to a video by classifying sound emitters within - To reduce the distinct sound emitters, the video is split based on absolute color histogram distances
In the synchronization part, intervals are identified comparing efects label with each frame and pinpointing consecutive matches above a threshold
In the mix part, efects are split into around one-second chunks - chunks are stitched via crossfades.