spectral-based splicing

This is a continuation of a project from a couple of years ago, in which I investigated algorithms to morph between spectra. Presentation of that project*, with audio examples, is here. The second algorithm is used later below.

What I've been tinkering with now is a suite of programs (in Java, but still using Perry Cook's fft.c program to convert between audio and FFT data) that analyze files to find splice points at moments of spectral similarity. The hope is that these more seamless splices can be used compositionally, and that a heavily spliced piece of audio can be mosaicked into something swell.

Basic Scheme - analysis for spectral similarity

The crux is a program called SeamFinder, which takes multiple FFT files and selects the most similar frames among them. For each frame in input 1, we look at each frame in input 2 and keep track of the most similar matches. Similarity is currently defined as the sum, over the bins of an FFT frame, of the absolute value of the difference between input 1's and input 2's bin energy (the loop starts at bin 1, so bin 0 is not counted). Lower values mean less difference. In code, this is the function compareFrames:

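// Returns the total absolute difference between two frames' bin energies.
// Lower values mean the frames are more similar.
private static double compareFrames(FFTFrame frame1, FFTFrame frame2) {
  double eval = 0;
  // Starts at bin 1, so bin 0 is left out of the comparison.
  // Assumes both frames have the same number of bins.
  for (int i = 1; i < frame1.bin.length; i++) {
    eval += Math.abs(frame1.bin[i] - frame2.bin[i]);
  }
  return eval;
}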

(This is simple, though it seems to work well. A future development is to experiment with evaluation schemes that, for example, give more weight to lower bins, since they carry more perceptual weight in our hearing.)
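As a rough illustration of what such a weighted scheme might look like, here is a minimal sketch. It is not part of SeamFinder: it works on plain double[] arrays of bin energies rather than FFTFrame objects, and the 1/(1+i) weighting curve is purely an assumption picked for simplicity, not a perceptually tuned one.

```java
// Hypothetical sketch: a weighted variant of compareFrames on raw bin arrays.
// The class name, method name, and weighting curve are illustrative assumptions.
class WeightedCompareSketch {

    // Sum of absolute per-bin differences, with lower bins weighted more heavily.
    static double compareFramesWeighted(double[] bins1, double[] bins2) {
        int n = Math.min(bins1.length, bins2.length);
        double eval = 0;
        // Start at bin 1, mirroring the loop in compareFrames.
        for (int i = 1; i < n; i++) {
            double weight = 1.0 / (1.0 + i); // assumed curve: falls off with bin index
            eval += weight * Math.abs(bins1[i] - bins2[i]);
        }
        return eval;
    }

    public static void main(String[] args) {
        double[] reference = {0, 5, 5, 5};
        double[] lowBinDiff = {0, 9, 5, 5};  // differs by 4 in bin 1
        double[] highBinDiff = {0, 5, 5, 9}; // differs by 4 in bin 3
        System.out.println(compareFramesWeighted(reference, lowBinDiff));
        System.out.println(compareFramesWeighted(reference, highBinDiff));
    }
}
```

With this weighting, the same amount of difference counts for more when it sits in a low bin than in a high one, so matches would be pushed toward frames that agree in the low end.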

A problem with this is that silence is similar to silence, and quieter noises are more similar to one another than louder noises, because there is less energy per bin. This leads to matches at quiet moments, which may not be what we want conceptually or aesthetically. So there is a variable "integralThreshold": a frame must have more than this amount of total energy (i.e. its integral must exceed this amount) in order to be evaluated. Changing this threshold leads to widely differing results. I prefer a somewhat high threshold (usually a value of 5000 per frame of 512 bins), which means that the splice must happen at a loud (i.e. noticeable) moment.
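The threshold test can be sketched roughly as follows. This is an illustration rather than the actual SeamFinder code: it assumes a frame is a plain double[] of non-negative bin energies, that the integral is simply the sum of all bins, and that the comparison is strictly "more than". The names integral and passesThreshold are made up for the example.

```java
// Hypothetical sketch of the integralThreshold gate on raw bin arrays.
class IntegralThresholdSketch {

    // Total energy of a frame: here assumed to be the plain sum of its bins.
    static double integral(double[] bins) {
        double total = 0;
        for (double b : bins) {
            total += b;
        }
        return total;
    }

    // A frame is only evaluated if its total energy exceeds the threshold.
    static boolean passesThreshold(double[] bins, double integralThreshold) {
        return integral(bins) > integralThreshold;
    }

    public static void main(String[] args) {
        double[] quiet = {1, 2, 1, 0.5};
        double[] loud = {900, 2500, 1800, 400};
        System.out.println(passesThreshold(quiet, 5000)); // quiet frame is skipped
        System.out.println(passesThreshold(loud, 5000));  // loud frame is evaluated
    }
}
```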

SeamFinder then spits out a list of matches (a frame from input 1 and a frame from input 2) sorted by degree of similarity. I decide where I want the cuts to be made, and then feed that information into another program that stitches the FFT frames together.
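Put together, the search described above could look something like the sketch below. It is a simplified stand-in for SeamFinder, not its real source: frames are plain double[] arrays, the Match class and the findMatches signature are invented for the example, and every candidate pair is held in memory before sorting, which would be wasteful on long files.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the all-pairs frame search with a sorted match list.
class SeamSearchSketch {

    // One candidate splice: a frame index in each input plus the difference score.
    static final class Match {
        final int frame1;
        final int frame2;
        final double difference;

        Match(int frame1, int frame2, double difference) {
            this.frame1 = frame1;
            this.frame2 = frame2;
            this.difference = difference;
        }
    }

    // Same measure as compareFrames: summed absolute bin differences from bin 1 up.
    static double compare(double[] a, double[] b) {
        double eval = 0;
        for (int i = 1; i < Math.min(a.length, b.length); i++) {
            eval += Math.abs(a[i] - b[i]);
        }
        return eval;
    }

    static double integral(double[] bins) {
        double total = 0;
        for (double b : bins) {
            total += b;
        }
        return total;
    }

    // Compares every loud-enough frame of input1 with every loud-enough frame
    // of input2 and returns up to `keep` matches, most similar first.
    static List<Match> findMatches(double[][] input1, double[][] input2,
                                   double integralThreshold, int keep) {
        List<Match> matches = new ArrayList<>();
        for (int i = 0; i < input1.length; i++) {
            if (integral(input1[i]) <= integralThreshold) continue;
            for (int j = 0; j < input2.length; j++) {
                if (integral(input2[j]) <= integralThreshold) continue;
                matches.add(new Match(i, j, compare(input1[i], input2[j])));
            }
        }
        matches.sort(Comparator.comparingDouble((Match m) -> m.difference));
        return new ArrayList<>(matches.subList(0, Math.min(keep, matches.size())));
    }

    public static void main(String[] args) {
        double[][] input1 = {{0, 10, 10}, {0, 1, 1}, {0, 20, 5}};
        double[][] input2 = {{0, 19, 5}, {0, 10, 12}};
        for (Match m : findMatches(input1, input2, 5, 2)) {
            System.out.println(m.frame1 + " -> " + m.frame2 + " : " + m.difference);
        }
    }
}
```

Because every frame of one input is compared with every frame of the other, the work grows with the product of the two frame counts.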

EXAMPLES: pure splicing

The first tests were among three noisy (spectrally complex) yet dissimilar sounds. Results are only moderately interesting, since even the closest match is not necessarily particularly close. Each of these splices represents the top match between the two sounds (at a particular integralThreshold).

example 1 (:06) is a splice (found using a low integralThreshold value) between a sound of a truck dumping coal and snoring
example 2 (:12) is a splice (with a high (5000) integralThreshold value) between pouring soda into a cup and a truck dumping coal
example 3 (:07) is a splice (with a high (5000) integralThreshold value) between pouring soda into a cup and snoring

The next test was to run a sound against a sound that it must necessarily be similar to: itself. I modified a variant of SeamFinder to look at only one file and find internal points of similarity. To avoid finding self-similarity at each frame, I required that the two frames of a match be at least a certain number of frames (i.e. a certain duration) apart.
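The minimum-distance rule can be sketched as a small change to the search loops. As before, this is an illustration under assumptions rather than the real variant of SeamFinder: frames are plain double[] arrays, and the name bestInternalMatch and the minGap parameter are made up for the example.

```java
// Hypothetical sketch: find the most similar pair of frames within one file,
// requiring the two frames to be at least minGap frames apart.
class SelfSimilaritySketch {

    // Same measure as compareFrames: summed absolute bin differences from bin 1 up.
    static double compare(double[] a, double[] b) {
        double eval = 0;
        for (int i = 1; i < Math.min(a.length, b.length); i++) {
            eval += Math.abs(a[i] - b[i]);
        }
        return eval;
    }

    // Returns {i, j} for the best pair with j - i >= minGap, or {-1, -1} if none.
    // minGap should be at least 1, otherwise every frame trivially matches itself.
    static int[] bestInternalMatch(double[][] frames, int minGap) {
        int bestI = -1;
        int bestJ = -1;
        double best = Double.POSITIVE_INFINITY;
        for (int i = 0; i < frames.length; i++) {
            // Starting j at i + minGap enforces the minimum distance.
            for (int j = i + minGap; j < frames.length; j++) {
                double d = compare(frames[i], frames[j]);
                if (d < best) {
                    best = d;
                    bestI = i;
                    bestJ = j;
                }
            }
        }
        return new int[] {bestI, bestJ};
    }

    public static void main(String[] args) {
        double[][] frames = {{0, 1, 2}, {0, 1, 2.1}, {0, 5, 5}, {0, 1, 2.5}, {0, 9, 9}};
        int[] neighbours = bestInternalMatch(frames, 1);
        int[] distant = bestInternalMatch(frames, 2);
        System.out.println(neighbours[0] + " -> " + neighbours[1]);
        System.out.println(distant[0] + " -> " + distant[1]);
    }
}
```

With a gap of 1 the search just picks adjacent, nearly identical frames; raising the gap forces it to find matches that actually jump somewhere else in the sound.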

(unmodified) source sound for example 4 (:06) (from Perry Cook's book)
example 4 (:03) from the list of matches generated by the program, I manually selected a sequence of 7 splices.

A major next step in development is to create a program that takes the list of matches and selects a sequence of splices. I've been doing this part by hand so far, since it involves a certain level of aesthetic judgement: what is the acceptable range of slice durations, for example. If I remember my CS, this sounds like it could be an NP-complete problem, in which case, as the level of resolution in a mosaic grows and the amount of sound and matches grows, the necessary computation time could grow exponentially.

Last, I ran it against two instrumental recordings of similar instrumentation: jazz ensemble with solo sax. I used Coltrane's Giant Steps and Coleman's Lonely Woman. I ran both complete files, which took quite a while to compute (since every frame of one file is compared against every frame of the other, the required computation grows with the product of the two lengths). Again, after looking at the list of similarities, I selected the sequence of splices. This was done at two different integral thresholds, with differing results.

example 5 (:23) (at a high integral threshold) splice points are usually found between sax notes, since those gaps are quieter, yet still above the threshold.
example 6 (:23) (at a very high (10000) integral threshold) now the threshold is above the level of the quieter gaps, so similarities are only found at the loudest moments (sax notes), which results in splices at sax notes (which I think is more interesting than between the notes) of, understandably, the same pitch.

EXAMPLES: splicing with morph

So I like how some of the splices among similar sounds work, but I decided to investigate using my old spectral morphing algorithm as the "glue" between two potentially dissimilar sounds. Here, I re-splice example 2 (soda pour to dumping coal), but morph across a given number of frames (the morph radius) before and after the splice.

example 7 (:13) same splice point as example 2, but with a morph radius of 10 frames on each side of the splice
example 8 (:13) as above, but with a morph radius of 50 frames on each side of the splice
example 9 (:13) as above, but with a morph radius of 100 frames on each side of the splice
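To make the idea of a morph radius concrete, here is a minimal sketch of splicing with a transition region around the splice point. Note that it does not use my actual morphing algorithm: as a stand-in it does a plain linear blend of bin energies, ignores phase entirely, and treats frames as double[] arrays. The names blend and spliceWithMorph are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splice input1 into input2 with a linear per-bin blend
// (a stand-in for the real morph) across `radius` frames on each side.
class MorphSpliceSketch {

    // Linear per-bin blend: t = 0 gives frame a, t = 1 gives frame b.
    static double[] blend(double[] a, double[] b, double t) {
        int n = Math.min(a.length, b.length);
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = (1 - t) * a[i] + t * b[i];
        }
        return out;
    }

    // s1 and s2 are the matched frame indices in input1 and input2.
    static double[][] spliceWithMorph(double[][] in1, double[][] in2,
                                      int s1, int s2, int radius) {
        // Shrink the radius if either input lacks frames around the splice point.
        int r = Math.min(radius, Math.min(Math.min(s1, in1.length - s1),
                                          Math.min(s2, in2.length - s2)));
        r = Math.max(r, 0);
        List<double[]> out = new ArrayList<>();
        // Untouched frames of input1 before the morph region.
        for (int i = 0; i < s1 - r; i++) {
            out.add(in1[i]);
        }
        // Morph region: 2 * r frames, sliding from mostly input1 to mostly input2.
        for (int k = -r; k < r; k++) {
            double t = (k + r + 0.5) / (2.0 * r);
            out.add(blend(in1[s1 + k], in2[s2 + k], t));
        }
        // Untouched frames of input2 after the morph region.
        for (int j = s2 + r; j < in2.length; j++) {
            out.add(in2[j]);
        }
        return out.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        double[][] in1 = {{0}, {0}, {0}, {0}};
        double[][] in2 = {{8}, {8}, {8}, {8}};
        for (double[] frame : spliceWithMorph(in1, in2, 2, 2, 1)) {
            System.out.println(frame[0]);
        }
    }
}
```

A radius of 0 reduces this to a pure splice, and the output has the same number of frames whatever the radius, which is why the morphed examples run about as long as the unmorphed one.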

The effect is fairly subtle, even with such a long morph time, since both sounds are noisy. Last, I tried it on two sounds with more pitch content (to hear the shift in prominent tones): the previous speech sample and some chimes.

example 10 (:11) speech to chimes, purely spliced (no morph)
example 11 (:11) speech to chimes, same splice point as above, but with morph radius of 5 frames
example 12 (:11) as above, but with morph radius of 20 frames

* - which won honorable mention and a cash prize at the Princeton undergraduate research symposium. Quite generous of them, considering I was up against people curing cancer and improving fuel efficiency. Maybe they took pity on me as the only humanities student in a room full of engineers.