
Drawing Audio Waveforms with the Accelerate Framework

April 12, 2015

As I was waiting for my turn in one of the audio labs at WWDC last year, another developer asked me how I draw audio waveforms in DanceMaster. I recommended using the Accelerate framework for the math, and found out that although he had heard that recommendation before, he didn't know how to do it, or even where to start.

The Accelerate framework allows you to write code for processing large amounts of data that performs dramatically better than more straightforward implementations, but it can be hard to find practical information about how to use it. If you don’t know what you are looking for in the documentation, it can be pretty difficult to get started. What makes it even harder is that in order to use the framework effectively, you need to change the way you think about how you design algorithms for processing non-trivial amounts of data.

I ended up spending an hour with this developer, walking him through my waveform drawing code and explaining how it works and why it does what it does. In this post I'm going to walk through the same process, because if you are not using Accelerate to process your audio data, your code is probably nowhere near as fast as it could be.

What is the Accelerate Framework?

The Accelerate framework is a set of optimized functions for working with large amounts of data. Many of these functions operate on vectors, or long lists of numbers, and most of them apply the same transformation to every element in the list. These functions have been optimized to take advantage of special features of the CPU (SIMD instructions) that allow a single instruction to operate on multiple pieces of data at the same time. As a result, they generally complete much faster than a simple loop that performs the same operation on every element in an array.
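
As a tiny illustration of the style (this snippet is not from the waveform code in this post, just a sketch), adding two arrays element by element can be written either as the usual loop or as a single vDSP call:

import Accelerate

let a: [Float] = [1, 2, 3, 4]
let b: [Float] = [10, 20, 30, 40]
var result = [Float](count: a.count, repeatedValue: 0.0)

//the straightforward loop...
for (var i = 0; i < a.count; i++) {
    result[i] = a[i] + b[i]
}

//...and the equivalent single Accelerate call (overwrites the same result array)
vDSP_vadd(a, 1, b, 1, &result, 1, vDSP_Length(a.count))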

Processing audio signals, such as the raw data used to draw an audio waveform, involves quite a bit of this kind of operation: applying a consistent transformation to every element in a long list of numbers. The Accelerate framework is a great tool for this job.

Waveform Drawing Overview

Drawing a waveform from an audio file involves three steps:

  1. Getting the audio data
  2. Calculating the values you want to draw
  3. Drawing the values into a graphics context

Most of the interesting stuff I am going to talk about happens in step 2 (calculating the values). I'm not going to say much about the specific implementation details of steps 1 or 3; each could be an entire blog post by itself, there are several options for both, and the right choice depends on the exact needs of your situation. Chris Liscio wrote a great post that describes these choices in much more detail and offers up some great options for how to do the drawing.

I’m just going to talk about the math, and how to do it as quickly as possible.

What Audio Data Looks Like

Before I get to that, let’s talk generally about audio data.

Audio data comes to you as a big block of bytes, and in order to work with it, you need to know how it is structured. In general, audio is a series of measurements of the amplitude of a signal, taken a specific number of times per second (the "sample rate"). One common sample rate is 44.1kHz, meaning that there are 44,100 measurements for every second of audio. However, because most audio is stereo, an audio file will usually contain two sets of measurements ("channels"), one for each side. Usually, the samples for the two channels are interleaved, alternating a sample from the left channel with a sample from the right channel. As a result, at a sample rate of 44.1kHz you have 44,100 "frames" of data per second, each containing 2 samples, for a total of 88,200 samples for each second of audio.

The samples themselves are usually delivered as 16-bit integers, which evenly divide the measurement range from maximum negative amplitude, to 0 (the neutral position of a speaker or microphone), to maximum positive amplitude.
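
To make the interleaving concrete, here is a minimal sketch; the interleaved array is an assumption standing in for raw data you have already read from the file, and the code later in this post averages the two channels together rather than splitting them like this:

//illustrative only: splitting interleaved 16-bit stereo samples into two channels
var left = [Int16]()
var right = [Int16]()
for (var i = 0; i + 1 < interleaved.count; i += 2) {
    left.append(interleaved[i])        //even indices hold the left channel
    right.append(interleaved[i + 1])   //odd indices hold the right channel
}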

Now, if you multiply 88,200 samples per second by 2 bytes per sample by the duration of your audio file, you end up with a very large number; a four-minute song, for example, works out to roughly 42 MB of raw sample data. That is almost certainly more data than you want to allocate and hold in memory at once. On the output side, though, in order to draw a waveform, we really only need as many values as there are horizontal pixels in the largest image we want to produce, which is generally on the order of several hundred to a few thousand pixels. As a result, we are going to read a lot of data, but we don't need to store most of it.

So, we will be downsampling the data, with every output value being derived from n samples, where n = (samples per second * number of channels * duration in seconds)/(number of output pixels).
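
As a rough sketch of that calculation (the names and values here are illustrative, not from the code later in the post):

//illustrative only: computing how many samples feed each output value
let sampleRate = 44100.0                 //samples per second, per channel
let channelCount = 2
let durationInSeconds = 240.0            //e.g. a four-minute track
let renderWidth = 1024                   //widest waveform image we plan to draw

let totalSamples = Int(sampleRate * durationInSeconds) * channelCount
let samplesPerPixel = totalSamples / renderWidth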

Because there is so much data to read, you don’t want to read a whole audio file into memory and then process it. Instead, we are going to read it in chunks, and process each chunk as we go.

Transforming the Audio Data

So now we have a buffer full of audio samples, and we want to calculate the output values that we are going to draw.

The general process that we are going to use to transform the data is this:

  1. Convert each 16 bit integer sample to a float

  2. Take the absolute value of each sample

    For drawing the envelope of the waveform, we don't actually care about the direction of the signal; we only care about how big it is

  3. Convert the sample value to Decibels

    Human perception of audio loudness is not linear, it is closer to logarithmic. As a result, we generally want to measure and output audio in the logarithmic Decibel scale, rather than the linear scale that the raw audio samples use. This conversion produces a value that ranges from negative infinity (for the smallest amplitude) to 0 (the largest).

  4. Clip the Decibel value to a range from the noise floor to 0

    Having a logarithmic scale means that small-valued samples in your audio file can have an extremely large effect on the overall range of the output. We probably don’t care about any extremely quiet samples in the track. For all practical purposes, we can treat those quiet sounds as silence. So we pick a value (I usually use -50 or -40) to use as the noise floor, and clip all the values to be in the range from that value to 0. Any values outside of that range are clamped to the nearest edge of that range.

  5. Downsample

    At this point, we are ready to downsample the data into our output buffer. Recall that we already figured out how many samples we need for every pixel of output (n). There are a number of different ways you could calculate an output value for each group of samples, but I have chosen to simply average the values into a single output value.[1]

So that’s what we want to do. Now let’s look at a basic implementation of this algorithm in Swift:[2]

var output = [Float](count:renderWidth, repeatedValue:noiseFloor)
var total:Float = 0.0
var sampleTally = 0
var nextDataOffset = 0

//for each chunk of audio data read into buffer: {
  let count = buffer.count
  for (var i=0; i<count; i++) {
      //convert to float
      var sampleValue = Float(buffer[i])
      
      //take absolute value and convert to dB
      sampleValue = (20.0 * log10(abs(sampleValue)/32767.0))
      
      //clip
      sampleValue = min(0, max(noiseFloor, sampleValue))

      total += sampleValue
      sampleTally++
	
      if sampleTally >= samplesPerPixel && nextDataOffset < renderWidth {
          //downsample and average
          output[nextDataOffset] = total / Float(sampleTally)
          total = 0
          sampleTally = 0
          nextDataOffset++
      }
  }
//}

This code is pretty simple. It iterates through the samples, accumulating a sum and a count, and every time it has accumulated samplesPerPixel samples, it averages them and writes the result into the next slot of the output array.

It turns out, though, that we can write code that does exactly the same thing but runs much faster using Accelerate. To do this, we are not going to iterate through all the samples. Instead, we are going to pass our sample buffer to a series of vector functions which will transform all the values as quickly as possible in a batch. “As quickly as possible,” in this case, means taking advantage of special CPU instructions that can do math on multiple values at the same time.

Most programmers are used to thinking about how to break an operation like this down into a set of instructions that will be performed over each element of an array, preserving any necessary state along the way. To use the Accelerate functions, we need to think about how to transform an entire array at the same time. Once you are thinking of the problem as a series of these transformations, the only tricky part of this is finding the functions that do what you want.

Conveniently, Accelerate contains a function for every step of our algorithm. We will be using the following functions:

  • vDSP_vflt16: convert 16 bit integers to floats
  • vDSP_vabs: take the absolute value of every element in the array
  • vDSP_vdbcon: convert each value from amplitude to decibels
  • vDSP_vclip: clip each value to a specified range
  • vDSP_desamp: downsample a large array to a smaller one, using a filter to specify how much each element in the large array affects the output value. We will use a specially constructed filter array so that the output value will be the average of the contributing input elements.
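
For reference, with a decimation factor of samplesPerPixel and a filter of the same length, vDSP_desamp computes the equivalent of the following loop (written with stand-in names); because every element of the filter we construct below is 1/samplesPerPixel, each output value works out to the average of its group of inputs:

//what vDSP_desamp computes, written out as a plain (and much slower) loop;
//input, filter, decimationFactor, and decimatedOutput are stand-in names
for n in 0..<decimatedOutput.count {
    var sum: Float = 0.0
    for p in 0..<filter.count {
        sum += input[n * decimationFactor + p] * filter[p]
    }
    decimatedOutput[n] = sum
}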

Here’s what it looks like in practice:

//setup:
var output = [Float](count:renderWidth, repeatedValue:noiseFloor)
var nextDataOffset = 0
var filter = [Float](count: Int(samplesPerPixel), 
                     repeatedValue: 1.0 / Float(samplesPerPixel))

//for each chunk of audio data: {
  //- copy any unprocessed samples carried over from the previous
  //   iteration into the beginning of the samples buffer
  //- copy the new chunk data into the samples buffer
  //- copy any samples we are not going to process in this iteration 
  //   to a buffer to carry over to the next iteration

  var processingBuffer = [Float](count: Int(samplesToProcess), 
                                 repeatedValue: 0.0)

  let sampleCount = vDSP_Length(samplesToProcess)
	
  // convert the 16bit int samples to floats
  vDSP_vflt16(samples, 1, &processingBuffer, 1, sampleCount)
	
  // take the absolute values to get amplitude
  vDSP_vabs(processingBuffer, 1, &processingBuffer, 1, sampleCount);
	
  // convert to dB, using full scale (32767) as the 0 dB reference
  var zero:Float = 32767.0;
  vDSP_vdbcon(processingBuffer, 1, &zero, &processingBuffer, 1, 
              sampleCount, 1);
	
  // clip to [noiseFloor, 0]
  var ceil:Float = 0.0
  vDSP_vclip(processingBuffer, 1, &noiseFloor, &ceil, 
             &processingBuffer, 1, sampleCount);
	
  // downsample and average
  var downSampledLength = Int(samplesToProcess / samplesPerPixel)
  var downSampledData = [Float](count:downSampledLength, 
                                repeatedValue:0.0)
	
  vDSP_desamp(processingBuffer, 
                vDSP_Stride(samplesPerPixel),  
                filter, &downSampledData, 
                vDSP_Length(downSampledLength),
                vDSP_Length(samplesPerPixel))
                
  output[nextDataOffset..<(nextDataOffset+downSampledLength)] = 
      downSampledData[0..<downSampledLength]
  nextDataOffset += downSampledLength;
//}         

When I tested these two styles of implementation early on in my development of DanceMaster, I found the Accelerate-based code to be approximately four times faster than the basic implementation when running on actual hardware. At that point, I was using Objective-C, but I recently tested implementations based on the Swift code listed above (using Swift 1.2) and found similar or better performance gains.[3]

In some ways, this code is more straightforwardly structured than our basic implementation above. There are no extra loops or branches, and the core algorithm is just a sequence of transformations. In this case, though, any readability gains or structural simplifications are balanced out by a little more complexity in how we set up the data coming into and going out of this block.

Note that compared to the basic implementation, we need to do a little more housekeeping for each chunk of data in order to maintain a buffer of unprocessed samples. Since we are downsampling in a batch, we can’t produce an output value unless our sample buffer has all of the values used to calculate that output value. For this reason, we want to make sure that the number of samples we are processing is a multiple of the number of samples per pixel. For each chunk, we process all the samples we can in the buffer and store the output data into an output array. Any samples we can’t process in this iteration, we carry over to the front of the buffer for the next iteration, before reading the next chunk of sample data.
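
Here is a minimal sketch of that housekeeping; chunkSamples is an assumption standing in for however you read each chunk of 16-bit samples:

//illustrative only: carrying unprocessed samples over to the next chunk
var leftover = [Int16]()

//for each chunk read from the file: {
  var samples = leftover + chunkSamples
  let samplesToProcess = (samples.count / Int(samplesPerPixel)) * Int(samplesPerPixel)
  leftover = Array(samples[samplesToProcess..<samples.count])
  samples = Array(samples[0..<samplesToProcess])
  //...hand `samples` and `samplesToProcess` to the vDSP code above...
//}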

Drawing the Output Values in a Graphics Context

If you made it this far, the rest is pretty easy. At this point, you have a relatively small array with one value for each point you want to plot, and drawing it into a graphics context is straightforward. In the first version of DanceMaster, I drew a series of vertically centered 1px-wide boxes, with heights linearly scaled from [minDataValue, maxDataValue] to [0, viewHeight]. Chris Liscio wrote about using Quartz paths to display the waveforms in Capo. There are many other possible options.
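
For what it’s worth, here is a rough sketch of that box-drawing approach; context, viewHeight, and the reuse of the output array from above are assumptions standing in for whatever your drawing code actually has available:

//illustrative only: one vertically centered 1px box per output value,
//assuming a CGContext named `context` and a CGFloat `viewHeight`
let midY = viewHeight / 2.0
for (index, value) in enumerate(output) {
    //scale [noiseFloor, 0] linearly to [0, viewHeight]
    let fraction = (value - noiseFloor) / (0.0 - noiseFloor)
    let boxHeight = CGFloat(fraction) * viewHeight
    let box = CGRectMake(CGFloat(index), midY - boxHeight / 2.0, 1.0, boxHeight)
    CGContextFillRect(context, box)
}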

Conclusion

This is not just a post about how to carry out an unusual programming task; it is about how to think differently about whole classes of programming problems that involve working with large amounts of data.

When you are dealing with large blocks of data, many of the most powerful tools at your disposal use a declarative style, where you specify the transformation that you want applied to the data, rather than providing the procedural instructions for transforming the data. This is not only true of the Accelerate framework, but also of many “Big Data” tools like MapReduce. This style has a lot of advantages for these tasks because it forces programmers to structure their code in ways that are easy for the system to parallelize and optimize, and it gives the system the ability to decide exactly how to optimize any particular block of work.

Once you learn to think about solving these kinds of problems with a declarative sequence of transformations, you will find lots of places where you can use frameworks like Accelerate to produce better-performing, and sometimes even simpler-looking, code.


  1. I am only drawing one output graph covering both output channels, which means that I will include the samples from both channels in the average. If I wanted to draw separate output for the left channel and the right channel, I would need to collect two sets of output, and select the appropriate samples from the buffer for each channel, with the even samples in the buffer being averaged into one output value and the odd values being averaged into another.

  2. You can find code following this pattern recommended in several places on Stack Overflow. The version presented here is slightly simpler than those, produces a single set of output values regardless of the number of input channels, and is implemented in Swift rather than Objective-C.

  3. Prior to Swift 1.2, the Accelerate-based code listed here was approximately 10–20 times faster than the basic implementation, but both versions were extremely slow. As of Swift 1.2, the Accelerate-based version is considerably faster than it was previously, and it is now a more rational 4-8 times faster than the basic implementation, depending on how the inner loop of the basic code is implemented.