Computational Audio and Image Analysis with the Spectrograms Library
A comprehensive mathematical and computational guide to FFT-based audio and image processing — covering DFT theory, window functions, STFT, perceptual scales, CQT, MFCC, image processing, and the full Python API.
Introduction
This manual provides a comprehensive mathematical and computational overview of the algorithms, optimizations, and transformations implemented in the spectrograms library. The focus is on understanding the theoretical foundations and practical implementations of FFT-based signal processing for audio and image analysis.
The material emphasizes the why and how of each algorithm: why certain design choices matter, and how they are implemented efficiently. The material covers both the mathematics and the practical computational considerations that make these methods work in production systems.
Target Audience and Prerequisites
This material assumes familiarity with:
- Linear algebra (matrix multiplication, vector operations, eigenvalues)
- Complex numbers and Euler's formula
- Discrete signals and sampling theory
- Basic programming concepts and algorithm analysis
- Calculus (integrals, derivatives, series)
No prior knowledge of Fourier analysis, psychoacoustics, or advanced signal processing is assumed.
Scope and Organization
The manual is organized into major thematic sections:
- Fourier Transform Theory — DFT, FFT algorithms, Hermitian symmetry, 2D FFT
- Window Functions and Spectral Analysis — Hanning, Hamming, Kaiser, Blackman, Gaussian
- Time-Frequency Representations — STFT and spectrograms
- Amplitude Scaling and Representations — Power, magnitude, decibels, Parseval's theorem
- Perceptual Frequency Scales — Mel, ERB, logarithmic scales
- Constant-Q Transform — Variable-length kernels, sparsity optimization
- Advanced Audio Features — MFCC, chroma, gammatone filters
- Image Processing via FFT — 2D convolution, filtering, edge detection
- Computational Optimizations — Sparse matrices, FFT plans, caching
- Numerical Considerations — Precision, normalization, dynamic range
Notation Conventions
Throughout this manual:
- $N$ denotes the signal length or FFT size
- $k$ denotes a frequency bin index ($0 \le k < N$)
- $n$ denotes a time/sample index ($0 \le n < N$)
- $i = \sqrt{-1}$ (the imaginary unit)
- $X[k]$ denotes frequency-domain values
- $x[n]$ denotes time-domain values
- Vectors are bold: $\mathbf{x}$
- Complex conjugate: $X^*[k]$
Fourier Transform Foundations
The Discrete Fourier Transform (DFT)
The Discrete Fourier Transform (DFT) decomposes a finite-length sequence into its constituent frequencies. For a sequence $x[n]$ of length $N$, the DFT is defined as:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-i 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1$$

where $X[k]$ represents the complex amplitude at frequency bin $k$. Using Euler's formula $e^{i\theta} = \cos\theta + i\sin\theta$, this can be expanded as:

$$X[k] = \sum_{n=0}^{N-1} x[n]\left[\cos\!\left(\tfrac{2\pi k n}{N}\right) - i\sin\!\left(\tfrac{2\pi k n}{N}\right)\right]$$

This reveals that the DFT correlates the input signal with complex sinusoids at each frequency. The computational complexity of this direct calculation is $O(N^2)$ — each of the $N$ outputs requires $N$ multiplications.
The inverse DFT (IDFT) reconstructs the time-domain signal:

$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, e^{i 2\pi k n / N}$$

The $1/N$ normalization factor ensures perfect reconstruction: $\mathrm{IDFT}(\mathrm{DFT}(x)) = x$.
Different implementations use different scaling conventions for the DFT and inverse DFT. The three most common conventions are:
- No forward scaling (this library): Forward DFT uses coefficient 1, inverse DFT uses $1/N$
- Symmetric scaling: Both forward and inverse use $1/\sqrt{N}$ (unitary transform)
- No inverse scaling: Forward uses $1/N$, inverse uses 1
This library uses Convention 1: The forward DFT has no normalization factor (coefficient of 1), and the inverse DFT includes the $1/N$ factor. This convention is standard in most signal processing literature and libraries (FFTW, NumPy, SciPy).
Energy implications: With this scaling, Parseval's theorem takes the form:

$$\sum_{n=0}^{N-1} |x[n]|^2 = \frac{1}{N} \sum_{k=0}^{N-1} |X[k]|^2$$
When comparing spectral amplitudes between implementations, always verify the scaling convention to ensure energy-consistent analysis.
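As a concrete check of Convention 1, the sketch below implements a direct $O(N^2)$ DFT/IDFT pair and verifies both round-trip reconstruction and the scaled Parseval identity. The helper names (`dft_naive`, `idft_naive`, `freq_energy_scaled`) are illustrative and not part of the library API.

```rust
use std::f64::consts::PI;

// Direct O(N^2) DFT with no forward scaling (Convention 1).
// Complex values are represented as (re, im) tuples for self-containment.
fn dft_naive(x: &[f64]) -> Vec<(f64, f64)> {
    let n = x.len();
    (0..n)
        .map(|k| {
            let (mut re, mut im) = (0.0, 0.0);
            for (j, &s) in x.iter().enumerate() {
                let angle = -2.0 * PI * (k * j) as f64 / n as f64;
                re += s * angle.cos();
                im += s * angle.sin();
            }
            (re, im)
        })
        .collect()
}

// Inverse DFT with the 1/N normalization on the inverse side.
fn idft_naive(spectrum: &[(f64, f64)]) -> Vec<f64> {
    let n = spectrum.len();
    (0..n)
        .map(|j| {
            let mut acc = 0.0;
            for (k, &(re, im)) in spectrum.iter().enumerate() {
                let angle = 2.0 * PI * (k * j) as f64 / n as f64;
                // Real part of X[k] * e^{+i 2 pi k j / N}
                acc += re * angle.cos() - im * angle.sin();
            }
            acc / n as f64 // the 1/N inverse normalization
        })
        .collect()
}

// Parseval under Convention 1: time energy == (1/N) * frequency energy.
fn time_energy(x: &[f64]) -> f64 {
    x.iter().map(|v| v * v).sum()
}

fn freq_energy_scaled(spectrum: &[(f64, f64)]) -> f64 {
    let n = spectrum.len() as f64;
    spectrum.iter().map(|&(re, im)| re * re + im * im).sum::<f64>() / n
}
```

For $x = [1, 2, 3, 4]$, `idft_naive(dft_naive(x))` recovers $x$ to within floating-point precision, and both energy measures agree.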
Fast Fourier Transform (FFT): The Cooley-Tukey Algorithm
The Fast Fourier Transform (FFT) is not a different transform, but an efficient algorithm for computing the DFT. The Cooley-Tukey algorithm, published in 1965 (though similar ideas date to Gauss in 1805), reduces complexity from $O(N^2)$ to $O(N \log N)$.
The key insight is divide-and-conquer using the symmetry and periodicity properties of complex exponentials:
Let $W_N = e^{-i 2\pi / N}$ be the twiddle factor. These complex exponentials satisfy:

- Periodicity: $W_N^{k+N} = W_N^k$
- Symmetry: $W_N^{k+N/2} = -W_N^k$

Periodicity means the complex exponential repeats every $N$ samples. Symmetry means that values half a period ($N/2$ indices) apart differ only in sign. These properties enable the FFT's recursive decomposition.
For $N$ even, split the DFT into even- and odd-indexed samples:

$$X[k] = \sum_{m=0}^{N/2-1} x[2m]\, W_N^{2mk} + W_N^{k} \sum_{m=0}^{N/2-1} x[2m+1]\, W_N^{2mk}$$

Using $W_N^2 = W_{N/2}$, this becomes:

$$X[k] = E[k] + W_N^{k}\, O[k], \qquad X[k + N/2] = E[k] - W_N^{k}\, O[k]$$

where $E[k]$ and $O[k]$ are the $N/2$-point DFTs of the even and odd samples. This recursive splitting continues until the base cases are reached (typically $N = 1$ or small prime factors).
The complexity analysis: At each of $\log_2 N$ levels, $O(N)$ operations (additions and a twiddle-factor multiplication) are performed, yielding $O(N \log N)$ total complexity.
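The recursive decomposition can be sketched directly. This is an illustrative recursive radix-2 implementation (complex numbers as `(re, im)` tuples), not the library's FFT backend; it assumes a power-of-two length, whereas production implementations are iterative and in-place.

```rust
use std::f64::consts::PI;

// Recursive radix-2 Cooley-Tukey sketch for power-of-two N.
fn fft_radix2(x: &[(f64, f64)]) -> Vec<(f64, f64)> {
    let n = x.len();
    assert!(n.is_power_of_two(), "radix-2 FFT needs a power-of-two length");
    if n == 1 {
        return x.to_vec(); // base case: the 1-point DFT is the identity
    }
    // Split into even- and odd-indexed samples and recurse.
    let even: Vec<_> = x.iter().step_by(2).copied().collect();
    let odd: Vec<_> = x.iter().skip(1).step_by(2).copied().collect();
    let e = fft_radix2(&even);
    let o = fft_radix2(&odd);
    // Butterfly: X[k] = E[k] + W^k O[k], X[k + N/2] = E[k] - W^k O[k]
    let mut out = vec![(0.0, 0.0); n];
    for k in 0..n / 2 {
        let angle = -2.0 * PI * k as f64 / n as f64;
        let (wr, wi) = (angle.cos(), angle.sin());
        let (o_re, o_im) = o[k];
        let t = (wr * o_re - wi * o_im, wr * o_im + wi * o_re); // W^k * O[k]
        out[k] = (e[k].0 + t.0, e[k].1 + t.1);
        out[k + n / 2] = (e[k].0 - t.0, e[k].1 - t.1);
    }
    out
}
```

As a sanity check, an impulse transforms to a flat spectrum, and a constant signal concentrates all energy in bin 0.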
Real-to-Complex FFT and Hermitian Symmetry
When the input signal is real-valued (as in audio processing), the DFT output exhibits a special property:
For a real-valued signal $x[n]$, the DFT satisfies Hermitian symmetry:

$$X[N-k] = X^*[k]$$

where $^*$ denotes complex conjugate. This means:
- $\mathrm{Re}(X[N-k]) = \mathrm{Re}(X[k])$ (real parts are symmetric)
- $\mathrm{Im}(X[N-k]) = -\mathrm{Im}(X[k])$ (imaginary parts are antisymmetric)
- $X[0]$ and $X[N/2]$ (for even $N$) are purely real
This property follows directly from the DFT definition and the fact that $x[n]$ is real.
Critical constraint for R2C transforms: For even $N$, the Nyquist frequency bin $X[N/2]$ must be purely real (i.e., $\mathrm{Im}(X[N/2]) = 0$). This follows from Hermitian symmetry since $X[N/2] = X^*[N/2]$, which requires the imaginary part to be zero. Violating this constraint results in non-real values after inverse FFT, causing numerical errors and incorrect signal reconstruction.
This symmetry makes it sufficient to store only the non-negative frequencies $k = 0, 1, \ldots, \lfloor N/2 \rfloor$.
This yields $\lfloor N/2 \rfloor + 1$ complex values instead of $N$, nearly halving storage requirements. The library computes real-to-complex (R2C) FFTs with output size $\lfloor N/2 \rfloor + 1$:
Implementation detail: The library uses specialized R2C FFT algorithms (from RealFFT or FFTW backends) that exploit this symmetry for both speed and memory efficiency.
pub const fn r2c_output_size(n: usize) -> usize {
n / 2 + 1
}
// Example: 512-point FFT on real signal
let samples = vec![0.0; 512];
let spectrum = fft(&samples, 512)?;
assert_eq!(spectrum.len(), 257); // 512/2 + 1
Inverse FFT and Perfect Reconstruction
The inverse FFT (IFFT) transforms frequency-domain data back to the time domain. For a complex-to-real (C2R) inverse FFT, given Hermitian-symmetric input $X[k]$:

$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, e^{i 2\pi k n / N}$$

Due to Hermitian symmetry, it is sufficient to sum over the non-redundant frequencies (shown here for even $N$):

$$x[n] = \frac{1}{N} \left( X[0] + 2 \sum_{k=1}^{N/2-1} \mathrm{Re}\!\left( X[k]\, e^{i 2\pi k n / N} \right) + (-1)^n X[N/2] \right)$$
The library's C2R inverse FFT ensures perfect reconstruction:
let original_image = Array2::from_shape_vec((2, 2), vec![1.0, 2.0, 3.0, 4.0])?;
let spectrum = fft2d(&original_image)?;
let reconstructed = ifft2d(&spectrum, original_image.ncols())?;
// reconstructed ≈ original_image (within floating-point precision)
2D FFT via Row-Column Decomposition
For 2D data like images with dimensions $M \times N$, the 2D DFT is:

$$X[k_1, k_2] = \sum_{n_1=0}^{M-1} \sum_{n_2=0}^{N-1} x[n_1, n_2]\, e^{-i 2\pi (k_1 n_1 / M + k_2 n_2 / N)}$$

The separability property allows this to be decomposed into row and column transforms:

$$X[k_1, k_2] = \sum_{n_1=0}^{M-1} e^{-i 2\pi k_1 n_1 / M} \left( \sum_{n_2=0}^{N-1} x[n_1, n_2]\, e^{-i 2\pi k_2 n_2 / N} \right)$$

Algorithm:
- Apply a 1D FFT to each row: $M$ transforms of length $N$
- Apply a 1D FFT to each column of the result: $N/2 + 1$ transforms of length $M$
For a real-valued image, the output is $M \times (N/2 + 1)$ due to Hermitian symmetry along rows.
Implementation: The library uses row-major layout and ensures contiguous memory:
pub fn fft2d(data: &Array2<f64>) -> SpectrogramResult<Array2<Complex<f64>>> {
let (nrows, ncols) = (data.nrows(), data.ncols());
if !data.is_standard_layout() {
return Err(SpectrogramError::invalid_input(
"array must be contiguous and row-major"));
}
let out_shape = r2c_output_size_2d(nrows, ncols);
// out_shape = (nrows, ncols/2 + 1)
let mut plan = planner.plan_r2c_2d(nrows, ncols)?;
// ... perform row-column FFT
}
Complexity: $O(N^2 \log N)$ for square $N \times N$ images.
Window Functions for Spectral Analysis
The Windowing Problem
When the DFT of a finite signal segment is computed, the signal is implicitly multiplied by a rectangular window (ones inside the segment, zeros outside). This abrupt truncation causes spectral leakage: energy from one frequency "leaks" into adjacent frequency bins.
Consider a pure sinusoid at frequency $f_0$. If sampling captures exactly an integer number of periods, the DFT produces a perfect spike at the corresponding bin. But with non-integer periods, the energy spreads across many bins.
This occurs because multiplying by a rectangular window in the time domain is equivalent to convolving with a sinc function in the frequency domain, which has slowly-decaying sidelobes.
The fundamental trade-off:
- Narrow main lobe → better frequency resolution
- Low sidelobes → less spectral leakage
Window functions smooth the transition to zero at the segment boundaries, reducing leakage at the cost of frequency resolution.
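The effect can be quantified with a small experiment: place a sinusoid halfway between two bins (the worst case) and compare the energy reaching a distant bin with and without a Hann window. All names and parameters here (`bin_power`, the 64-point frame, bin 30 as the "far" bin) are illustrative choices, not library API.

```rust
use std::f64::consts::PI;

// Direct single-bin DFT power (O(N) per bin).
fn bin_power(x: &[f64], k: usize) -> f64 {
    let n = x.len();
    let (mut re, mut im) = (0.0, 0.0);
    for (j, &s) in x.iter().enumerate() {
        let angle = -2.0 * PI * (k * j) as f64 / n as f64;
        re += s * angle.cos();
        im += s * angle.sin();
    }
    re * re + im * im
}

// Symmetric Hann window of length n.
fn hann(n: usize) -> Vec<f64> {
    (0..n)
        .map(|i| 0.5 - 0.5 * (2.0 * PI * i as f64 / (n - 1) as f64).cos())
        .collect()
}

// Leakage into a far bin relative to the peak bin, for a sinusoid with
// 10.5 periods in a 64-sample frame (energy lands between bins 10 and 11).
fn leakage_ratio(window: Option<&[f64]>) -> f64 {
    let n = 64;
    let x: Vec<f64> = (0..n)
        .map(|i| {
            let s = (2.0 * PI * 10.5 * i as f64 / n as f64).sin();
            match window {
                Some(w) => s * w[i],
                None => s,
            }
        })
        .collect();
    bin_power(&x, 30) / bin_power(&x, 10)
}
```

The rectangular (unwindowed) ratio comes out orders of magnitude larger than the Hann-windowed one, which is exactly the leakage suppression the window buys.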
Windowing Convention: Symmetric vs. Periodic
Window functions can be defined with two distinct conventions:
Symmetric windows use $N - 1$ in the denominator:

$$w[n] = 0.5\left(1 - \cos\frac{2\pi n}{N-1}\right) \quad \text{(Hanning, symmetric)}$$

Periodic windows use $N$ in the denominator:

$$w[n] = 0.5\left(1 - \cos\frac{2\pi n}{N}\right) \quad \text{(Hanning, periodic)}$$

The difference is subtle but important:
- Symmetric: Values at endpoints match (e.g., both zero for Hanning). Standard for spectral analysis.
- Periodic: The window is one period of an $N$-periodic sequence ($w[n+N] = w[n]$). Required for perfect reconstruction in overlap-add systems.
For the Constant Overlap-Add (COLA) property:

$$\sum_{m=-\infty}^{\infty} w[n - mH] = \text{constant}$$

where $H$ is the hop size. Periodic windows satisfy COLA for specific overlap ratios, enabling perfect signal reconstruction.
Library convention: All window implementations in this library use the symmetric formulation ($N - 1$ in the denominator), which is standard for analysis applications. For synthesis requiring COLA, use appropriate overlap ratios (e.g., 50% for Hanning).
Rectangular Window (No Windowing)
The rectangular window is simply:

$$w[n] = 1, \qquad 0 \le n < N$$

Its frequency response has the narrowest main lobe but the highest sidelobes (≈ −13 dB), causing severe spectral leakage.
WindowType::Rectangular => {
w.fill(1.0);
}
Use when: You need maximum frequency resolution and know the signal contains exact integer periods.
Hanning Window
Also called the Hann window (named after Julius von Hann, not Richard Hamming), defined as:

$$w[n] = 0.5\left(1 - \cos\frac{2\pi n}{N-1}\right), \qquad 0 \le n < N$$
This is a raised cosine function. It smoothly tapers to zero at both endpoints.
Properties:
- Main lobe width: 4 bins (2× rectangular)
- First sidelobe: ≈ −32 dB
- Sidelobe falloff: -18 dB/octave
WindowType::Hanning => {
let n1 = (n_fft - 1) as f64;
for (n, v) in w.iter_mut().enumerate() {
*v = 0.5 - 0.5 * cos(2.0 * PI * (n as f64) / n1);
}
}
Why this works: The cosine taper removes the discontinuities at the edges. In the frequency domain, the Hanning window's transform is a sum of three shifted copies of the rectangular window's transform (at 0 and ±1 bin) whose sidelobes largely cancel, producing much lower sidelobes.
The Hanning window is the most commonly used for general-purpose spectral analysis.
Hamming Window
Named after Richard Hamming, this window uses optimized coefficients:

$$w[n] = 0.54 - 0.46\cos\frac{2\pi n}{N-1}$$
The coefficients (0.54, 0.46) were chosen to minimize the first sidelobe:
- Main lobe width: 4 bins (same as Hanning)
- First sidelobe: ≈ −43 dB (better than Hanning)
- Sidelobe falloff: -6 dB/octave (worse than Hanning)
WindowType::Hamming => {
let n1 = (n_fft - 1) as f64;
for (n, v) in w.iter_mut().enumerate() {
*v = 0.54 - 0.46 * cos(2.0 * PI * (n as f64) / n1);
}
}
Trade-off: The Hamming window doesn't reach exactly zero at the endpoints (it's 0.08), which can cause issues for overlap-add reconstruction. However, its lower first sidelobe makes it useful when strong interfering signals exist.
Blackman Window
A three-term cosine sum with excellent sidelobe suppression:

$$w[n] = 0.42 - 0.5\cos\frac{2\pi n}{N-1} + 0.08\cos\frac{4\pi n}{N-1}$$
Properties:
- Main lobe width: 6 bins (3× rectangular)
- First sidelobe: ≈ −58 dB
- Sidelobe falloff: -18 dB/octave
WindowType::Blackman => {
let n1 = (n_fft - 1) as f64;
for (n, v) in w.iter_mut().enumerate() {
let a = 2.0 * PI * (n as f64) / n1;
*v = 0.42 - 0.5 * cos(a) + 0.08 * cos(2.0 * a);
}
}
Use when: You need minimal spectral leakage and can tolerate wider main lobes (reduced frequency resolution).
Kaiser Window and Bessel Functions
The Kaiser window provides a tunable parameter $\beta$ that controls the trade-off between main lobe width and sidelobe level:

$$w[n] = \frac{I_0\!\left(\beta\sqrt{1 - \left(\frac{2n}{N-1} - 1\right)^2}\right)}{I_0(\beta)}$$

The modified Bessel function of the first kind, order zero, is defined by:

$$I_0(x) = \sum_{k=0}^{\infty} \frac{(x/2)^{2k}}{(k!)^2}$$

For computational efficiency, a truncated power series is used. The series converges rapidly; typically 20–30 terms achieve machine-precision accuracy for the $\beta$ values used in practice.
Parameter effects:
- $\beta = 0$: rectangular window
- $\beta \approx 5$: similar to Hamming
- $\beta \approx 8.6$: similar to Blackman
- Larger $\beta$: lower sidelobes, wider main lobe
fn bessel_i0(x: f64) -> f64 {
let mut sum = 1.0; // k=0 term
let mut term = 1.0;
let half_x = x / 2.0;
// Truncated series: sufficient for Kaiser windows
for k in 1..30 {
term *= (half_x / k as f64).powi(2);
sum += term;
// Early termination if converged
if term < 1e-12 * sum {
break;
}
}
sum
}
WindowType::Kaiser { beta } => {
let denominator = bessel_i0(beta);
for i in 0..n_fft {
let n = i as f64;
let n_max = (n_fft - 1) as f64;
let alpha = (n - n_max / 2.0) / (n_max / 2.0);
let bessel_arg = beta * (1.0 - alpha * alpha).sqrt();
w[i] = bessel_i0(bessel_arg) / denominator;
}
}
The Kaiser window is named after James Kaiser, who introduced it in 1966 for optimal FIR filter design.
Gaussian Window
A Gaussian function centered at the window midpoint:

$$w[n] = \exp\left(-\frac{1}{2}\left(\frac{n - (N-1)/2}{\sigma}\right)^2\right)$$

where $\sigma$ (standard deviation) controls the width. Smaller $\sigma$ → narrower window → more time localization.
WindowType::Gaussian { std } => {
for i in 0..n_fft {
let n = i as f64;
let center = (n_fft - 1) as f64 / 2.0;
let exponent = -0.5 * ((n - center) / std).powi(2);
w[i] = exp(exponent);
}
}
The Gaussian window has a unique property: its Fourier transform is also Gaussian. This gives optimal time-frequency localization per the uncertainty principle:

$$\Delta t \cdot \Delta f \ge \frac{1}{4\pi}$$
Use for: Time-frequency analysis where minimizing uncertainty is critical (e.g., Gabor transforms).
Trade-offs: Frequency Resolution vs. Spectral Leakage
Comparison of the common windows (main-lobe width in DFT bins):
| Window | Main Lobe | Peak Sidelobe | Falloff Rate |
|---|---|---|---|
| Rectangular | 2 bins | -13 dB | -6 dB/oct |
| Hanning | 4 bins | -32 dB | -18 dB/oct |
| Hamming | 4 bins | -43 dB | -6 dB/oct |
| Blackman | 6 bins | -58 dB | -18 dB/oct |
| Kaiser ($\beta \approx 8.6$) | 6 bins | -60 dB | -6 dB/oct |
Selection guidelines:
- General purpose, good balance: Hanning
- Need low first sidelobe: Hamming
- Need very low sidelobes: Blackman
- Custom trade-off: Kaiser with tuned $\beta$
- Optimal uncertainty: Gaussian with appropriate $\sigma$
Constant Overlap-Add (COLA) Constraint
When using the STFT for signal modification or synthesis (e.g., time-stretching, pitch-shifting, noise reduction), the time-domain signal must be reconstructed from modified spectral frames. The Constant Overlap-Add (COLA) constraint ensures perfect reconstruction is possible.
For a window function $w[n]$ and hop size $H$, the COLA property states that the sum of overlapping windows must be constant:

$$\sum_{m=-\infty}^{\infty} w[n - mH] = C \quad \text{for all } n$$

where $C$ is a non-zero constant (typically normalized to 1).
Physical interpretation: When frames are overlapped by hop size and summed (overlap-add), every sample in the reconstructed signal receives equal weighting from the windows. This prevents amplitude modulation artifacts.
COLA conditions for common windows:
Rectangular window: COLA for any hop size $H$ that divides $N$ (including $H = N$, i.e., no overlap)
Hanning window (periodic): COLA for $H = N/2, N/4, N/8, \ldots$ (50%, 75%, 87.5% overlap, etc.)
- 50% overlap ($H = N/2$): Most common, $C = 1$
- 75% overlap ($H = N/4$): Higher redundancy, smoother synthesis ($C = 2$)
Hamming window (periodic): Approximately COLA, requires careful tuning; 50% overlap gives $C = 1.08$
Blackman window: COLA requires at least 2/3 overlap ($H \le N/3$)
Kaiser window: COLA depends on $\beta$; generally requires 50–75% overlap
Verification: To check if a window/hop combination satisfies COLA, compute:
fn verify_cola(window: &[f64], hop_size: usize) -> bool {
let n = window.len();
let n_overlaps = (n + hop_size - 1) / hop_size;
    // Check every offset position against a reference sum
    let mut reference = f64::NAN;
    for offset in 0..hop_size {
        let mut sum = 0.0;
        for m in 0..n_overlaps {
            let idx = offset + m * hop_size;
            if idx < n {
                sum += window[idx];
            }
        }
        if reference.is_nan() {
            reference = sum;
        } else if (sum - reference).abs() > 1e-10 {
            // All offsets must yield the same sum (within tolerance)
            return false;
        }
    }
    true
}
Implications for spectrogram inversion:
When using the STFT for analysis-modification-synthesis:
- Choose a window/hop combination that satisfies COLA
- Apply STFT to get frames
- Modify the spectrogram (e.g., filter, denoise, time-stretch)
- Apply inverse FFT to each modified frame
- Overlap-add the frames with hop size
- Divide by the COLA constant $C$ (if $C \neq 1$)
The reconstructed signal will equal the original if no modifications were made (perfect reconstruction).
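With no spectral modification, the FFT/inverse-FFT pair cancels and the synthesis loop above reduces to windowing plus overlap-add. A minimal sketch, assuming a periodic Hann window at 50% overlap ($C = 1$); `overlap_add_identity` is an illustrative name, not a library function.

```rust
use std::f64::consts::PI;

// Periodic Hann window (denominator N, not N-1), as required for exact COLA.
fn periodic_hann(n: usize) -> Vec<f64> {
    (0..n)
        .map(|i| 0.5 - 0.5 * (2.0 * PI * i as f64 / n as f64).cos())
        .collect()
}

// Identity analysis-synthesis: window each frame, (FFT -> no modification ->
// inverse FFT cancels out), then overlap-add at 50% overlap.
fn overlap_add_identity(signal: &[f64], n_fft: usize) -> Vec<f64> {
    let hop = n_fft / 2; // 50% overlap -> COLA constant C = 1 for periodic Hann
    let window = periodic_hann(n_fft);
    let mut out = vec![0.0; signal.len()];
    let mut start = 0;
    while start + n_fft <= signal.len() {
        for i in 0..n_fft {
            out[start + i] += signal[start + i] * window[i];
        }
        start += hop;
    }
    out
}
```

Interior samples reproduce the input exactly because the overlapped periodic-Hann windows sum to 1; samples near the edges are attenuated unless the signal is padded, which is what centering provides.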
Library convention: For analysis-only applications (spectrograms, feature extraction), COLA is not required. However, this library uses 50% overlap ($H = N/2$) by default, which satisfies COLA for Hanning windows and enables optional synthesis workflows.
Short-Time Fourier Transform (STFT)
Time-Frequency Analysis Motivation
The standard DFT provides frequency content but discards all temporal information: it indicates which frequencies are present, but not when they occur. For non-stationary signals (speech, music, transient events), time-frequency analysis is required.
The Short-Time Fourier Transform (STFT) solves this by applying the DFT to short, overlapping segments (frames) of the signal, creating a time-frequency representation.
STFT Definition and Computation
For a signal $x[n]$ with window function $w[n]$ of length $N$, the STFT is:

$$X[m, k] = \sum_{n=0}^{N-1} x[mH + n]\, w[n]\, e^{-i 2\pi k n / N}$$

where:
- $m$ is the frame index
- $H$ is the hop size (samples between consecutive frames)
- $k$ is the frequency bin
- $N$ is the FFT size (window length)
The result is a 2D complex matrix $X[m, k]$ with dimensions (time frames) × (frequency bins).
Framing and Overlap
The signal is divided into overlapping frames: frame $m$ spans samples $x[mH]$ through $x[mH + N - 1]$.
Key parameters:
- FFT size ($N$): Window length, determines frequency resolution: $\Delta f = f_s / N$
- Hop size ($H$): Samples between frames, determines time resolution: $\Delta t = H / f_s$
- Overlap: $(N - H)/N$. Common: 50% ($H = N/2$) or 75% ($H = N/4$)
Number of frames for a signal of length $L$ samples:

$$n_{\text{frames}} = \left\lfloor \frac{L - N}{H} \right\rfloor + 1$$

With centering (padding $N/2$ zeros on each end):

$$n_{\text{frames}} = \left\lfloor \frac{L}{H} \right\rfloor + 1$$
Why pad zeros? Without centering, the first window is positioned with its left edge at sample 0. This means the window's center is at sample $N/2$, causing a temporal shift in the resulting spectrogram.
By padding $N/2$ zeros before the signal, the first window's center aligns with sample 0. This ensures:
- Frame $m$ is centered at time $t_m = mH / f_s$ (no offset)
- Edge effects are symmetric at signal boundaries
- Improved temporal localization for transient events
The same padding is applied at the signal's end, ensuring symmetric treatment of boundaries. This is standard practice in audio analysis libraries (e.g., librosa's center=True).
Padding type specification: This library uses zero-padding (also called zero-extension) for both the initial and final $N/2$ samples:

$$\tilde{x}[n] = \begin{cases} x[n] & 0 \le n < L \\ 0 & \text{otherwise} \end{cases}$$

where $L$ is the original signal length.
Alternative: Reflective padding extends the signal by mirroring boundary values (e.g., $\tilde{x}[-1] = x[1]$, $\tilde{x}[-2] = x[2]$), which can reduce edge discontinuities for certain signals. However, zero-padding is the standard convention in most audio processing libraries (NumPy, SciPy, librosa) and is used exclusively in this implementation.
Implementation consistency: All implementations must use the same padding type to ensure identical spectrograms for the same input. Using reflective padding when zero-padding is expected (or vice versa) will cause boundary frames to diverge, particularly for the first and last frames where padding contributes to the windowed signal.
Implementation detail: The library uses centering by default to avoid edge effects:
fn frame_count(&self, n_samples: usize) -> usize {
let pad = if self.centre { self.n_fft / 2 } else { 0 };
let padded_len = n_samples + 2 * pad;
if padded_len < self.n_fft {
return 1; // Single frame, even for short signals
}
let remaining = padded_len - self.n_fft;
let n_frames = remaining / self.hop + 1;
n_frames
}
STFT Computation Pipeline
For each frame $m$:
- Extract frame: Get samples $x[mH], \ldots, x[mH + N - 1]$ (offset by $-N/2$ when centering)
- Apply window: Compute $y[n] = x[mH + n] \cdot w[n]$
- FFT: Compute $X[m, k] = \mathrm{FFT}(y)[k]$
- Store: Save as column $m$ of the output matrix
Optimizations:
- Pre-compute window samples once
- Reuse FFT plan across all frames
- Allocate workspace buffers once
To enable efficient STFT computation, implementations use a Workspace structure that holds reusable buffers and FFT state. This avoids repeated allocations during frame processing.
Minimum workspace requirements:
- Windowed frame buffer: Real-valued array of size $N$ to hold $x[n]\, w[n]$ before the FFT
- FFT output buffer: Complex-valued array of size $N/2 + 1$ (for R2C FFT) to hold frequency-domain values
- FFT plan state: Backend-specific data structure (e.g., FFTW plan, RealFFT planner) that stores twiddle factors and algorithm configuration
- Power/magnitude spectrum buffer: Real-valued array of size $N/2 + 1$ to hold $|X[k]|^2$ or $|X[k]|$ for the current frame
Total memory requirement: $N + (N/2 + 1)$ real values and $N/2 + 1$ complex values, plus backend-specific FFT plan overhead (typically $O(N)$ for precomputed twiddle factors).
Implementation example:
struct Workspace {
frame: Vec<f64>, // Size N (windowed input)
fft_out: Vec<Complex<f64>>, // Size N/2 + 1 (FFT output)
spectrum: Vec<f64>, // Size N/2 + 1 (power/magnitude)
}
impl Workspace {
fn new(n_fft: usize) -> Self {
let out_len = n_fft / 2 + 1;
Self {
frame: vec![0.0; n_fft],
fft_out: vec![Complex::zero(); out_len],
spectrum: vec![0.0; out_len],
}
}
}
Performance benefit: Reusing workspace buffers across frames reduces memory allocator pressure. For a 10-second audio file at 16 kHz with $N = 512$ and $H = 256$, this processes approximately 625 frames — avoiding 1,875 allocations (3 buffers × 625 frames).
fn compute_frame_spectrum(&mut self, samples: &[f64],
frame_idx: usize,
workspace: &mut Workspace) -> Result<()> {
let out = workspace.frame.as_mut_slice();
let pad = if self.centre { self.n_fft / 2 } else { 0 };
let start = frame_idx * self.hop;
// Fill windowed frame (with zero-padding)
for i in 0..self.n_fft {
let v_idx = start + i;
let s_idx = v_idx as isize - pad as isize;
let sample = if s_idx < 0 || s_idx >= samples.len() as isize {
0.0 // Padding
} else {
samples[s_idx as usize]
};
out[i] = sample * self.window[i];
}
// Compute FFT
self.fft.process(out, &mut workspace.fft_out)?;
// Convert to power spectrum: |X[k]|^2
for (i, c) in workspace.fft_out.iter().enumerate() {
workspace.spectrum[i] = c.norm_sqr();
}
Ok(())
}
Spectrogram as STFT Magnitude
A spectrogram is the magnitude (or power) of the STFT:

$$S[m, k] = |X[m, k]|$$

or

$$S[m, k] = |X[m, k]|^2$$
This discards phase information but provides an intuitive visualization of energy distribution over time and frequency.
Time-Frequency Uncertainty Principle
The fundamental trade-off in STFT is governed by the uncertainty principle:

$$\Delta t \cdot \Delta f \ge \frac{1}{4\pi}$$

where $\Delta t$ is temporal resolution and $\Delta f$ is frequency resolution.
This is not a limitation of the algorithm, but a fundamental property of Fourier analysis related to the Heisenberg uncertainty principle in quantum mechanics. You cannot achieve arbitrary precision in both domains simultaneously.
Implications:
- Large $N$ (long window): Good frequency resolution, poor time resolution
- Small $N$ (short window): Good time resolution, poor frequency resolution
For speech (16 kHz sample rate):
- $N = 512$ (32 ms): $\Delta f = 31.25$ Hz — good for rapid changes
- $N = 2048$ (128 ms): $\Delta f \approx 7.8$ Hz — better frequency detail
For music (44.1 kHz sample rate):
- $N = 2048$ (46 ms): $\Delta f \approx 21.5$ Hz — standard choice
- $N = 4096$ (93 ms): $\Delta f \approx 10.8$ Hz — for harmonic analysis
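These figures follow directly from $\Delta f = f_s / N$ and the frame duration $N / f_s$; a tiny sketch (illustrative helper names, not library API):

```rust
// Frequency resolution (bin spacing) for an N-point FFT at sample rate fs.
fn freq_resolution_hz(sample_rate_hz: f64, n_fft: usize) -> f64 {
    sample_rate_hz / n_fft as f64
}

// Window duration in milliseconds for an N-point frame.
fn window_duration_ms(sample_rate_hz: f64, n_fft: usize) -> f64 {
    1000.0 * n_fft as f64 / sample_rate_hz
}
```

For example, 512 points at 16 kHz gives 31.25 Hz bins and 32 ms frames; doubling $N$ halves $\Delta f$ and doubles the frame duration.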
Amplitude Scaling and Representations
Power Spectrum
The power spectrum represents energy at each frequency:

$$P[k] = |X[k]|^2 = \mathrm{Re}(X[k])^2 + \mathrm{Im}(X[k])^2$$

For real input signals, this is proportional to the energy in frequency bin $k$.
Properties:
- Always non-negative
- Units: (amplitude)
- Additive: energies from independent sources sum
The squared magnitude emphasizes strong components and suppresses weak ones, which can be both beneficial (noise reduction) and problematic (loss of weak signals).
Magnitude Spectrum
The magnitude spectrum is the absolute value:

$$M[k] = |X[k]| = \sqrt{\mathrm{Re}(X[k])^2 + \mathrm{Im}(X[k])^2}$$
This is more perceptually linear than power, as human perception of sound intensity is approximately logarithmic.
Relationship: $P[k] = M[k]^2$, i.e., $M[k] = \sqrt{P[k]}$.
Use magnitude when:
- You need perceptually meaningful amplitudes
- Comparing with other magnitude-based representations
- Visualization where extreme values would dominate
Decibel Scale
The decibel (dB) scale provides logarithmic amplitude scaling:

$$L_{\mathrm{dB}} = 10 \log_{10}\!\left(\frac{P}{P_{\mathrm{ref}}}\right)$$

where $P_{\mathrm{ref}}$ is a reference power. For power spectrograms without a specific reference:

$$L_{\mathrm{dB}} = 10 \log_{10}(P[k])$$

For magnitude-based signals:

$$L_{\mathrm{dB}} = 20 \log_{10}(M[k])$$

The factor of 20 (instead of 10) accounts for the squared relationship between magnitude and power.
Numerical considerations: To avoid $\log_{10}(0) = -\infty$, a floor is applied:

$$L_{\mathrm{dB}} = 10 \log_{10}(\max(P[k], \epsilon))$$

where $\epsilon$ is a small threshold (e.g., $10^{-10}$). This maps very small values to a large negative dB value instead of $-\infty$.
Dynamic range compression: The dB scale compresses the dynamic range, making it easier to visualize spectrograms with both loud and quiet components. A signal with power ratios of 1,000,000:1 becomes 60 dB, much more manageable.
Example:
- Power = 1.0 → 0 dB
- Power = 0.1 → -10 dB
- Power = 0.01 → -20 dB
- Power = 100.0 → 20 dB
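The floored conversion can be sketched as follows (illustrative helpers mirroring the formulas above, not the library's API):

```rust
// Power -> dB with an epsilon floor to avoid log10(0) = -infinity.
fn power_to_db(power: f64, epsilon: f64) -> f64 {
    10.0 * power.max(epsilon).log10()
}

// Magnitude -> dB (factor 20, because power = magnitude^2).
fn magnitude_to_db(magnitude: f64, epsilon: f64) -> f64 {
    20.0 * magnitude.max(epsilon).log10()
}
```

Power 0.01 maps to −20 dB, and power 0 is clamped to the floor (−100 dB for $\epsilon = 10^{-10}$) instead of $-\infty$.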
Parseval's Theorem and Energy Conservation
Parseval's theorem states that energy is conserved between time and frequency domains:

$$\sum_{n=0}^{N-1} |x[n]|^2 = \frac{1}{N} \sum_{k=0}^{N-1} |X[k]|^2$$
This fundamental result ensures that:
- The FFT doesn't create or destroy energy
- Total power in frequency domain equals total power in time domain
- Inverse FFT perfectly reconstructs energy (up to numerical precision)
For windowed signals in STFT with proper overlap-add, energy is preserved:

$$\sum_{m} \sum_{k} |X[m, k]|^2 \propto \sum_{n} |x[n]|^2$$

where the proportionality constant depends on the window normalization.
Implication for normalization: Different FFT libraries use different normalization conventions. The library uses the convention where the forward FFT has no normalization factor, and the inverse FFT divides by $N$. This is consistent with most signal processing literature.
Perceptual Frequency Scales
Motivation: Human Auditory Perception
Auditory frequency perception is not linear. Frequency differences are perceived logarithmically, especially at higher frequencies. For example:
- The difference between 100 Hz and 200 Hz is highly perceptible (one octave)
- The difference between 10,000 Hz and 10,100 Hz is barely noticeable (0.014 octaves)
Perceptual frequency scales map linear Hz to psychoacoustically-motivated scales that better match human hearing.
Mel Scale
The mel scale is based on pitch perception studies. The name comes from "melody," reflecting its musical origins. The conversion formulas are:

$$m = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right), \qquad f = 700\left(10^{m/2595} - 1\right)$$
The mel scale is approximately linear below 1000 Hz and logarithmic above.
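The conversion pair can be written directly from the common $2595 \log_{10}(1 + f/700)$ formulation. This is a sketch of the `hz_to_mel`/`mel_to_hz` helpers that the filterbank construction relies on; the exact signatures are assumptions.

```rust
// Hz -> mel.
fn hz_to_mel(hz: f64) -> f64 {
    2595.0 * (1.0 + hz / 700.0).log10()
}

// Mel -> Hz (exact inverse of hz_to_mel).
fn mel_to_hz(mel: f64) -> f64 {
    700.0 * (10f64.powf(mel / 2595.0) - 1.0)
}
```

The two functions are exact inverses, and 1000 Hz maps to approximately 1000 mel, the scale's anchor region.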
Mel Filterbank Construction:
To create mel-spaced filters:
- Convert frequency range to mel scale
- Create $n_{\text{mels}} + 2$ equally-spaced mel points (for triangular filter edges)
- Convert mel points back to Hz
- Map Hz to FFT bin indices
- Create triangular filters centered at these points
Each triangular filter is:

$$H_m[k] = \begin{cases} \dfrac{k - f[m-1]}{f[m] - f[m-1]} & f[m-1] \le k < f[m] \\[4pt] \dfrac{f[m+1] - k}{f[m+1] - f[m]} & f[m] \le k < f[m+1] \\ 0 & \text{otherwise} \end{cases}$$

where $f[0], \ldots, f[n_{\text{mels}}+1]$ are the FFT bin indices corresponding to the mel points.
Implementation detail: The filterbank is stored as a sparse matrix since each triangular filter only has non-zero values over a small range of bins:
fn build_mel_filterbank_matrix(sample_rate_hz: f64, n_fft: usize,
n_mels: usize, f_min: f64,
f_max: f64) -> SparseMatrix {
let out_len = r2c_output_size(n_fft);
let df = sample_rate_hz / n_fft as f64;
// Convert to mel scale
let mel_min = hz_to_mel(f_min);
let mel_max = hz_to_mel(f_max);
// Create n_mels + 2 mel points (for triangle edges)
let n_points = n_mels + 2;
let step = (mel_max - mel_min) / (n_points - 1) as f64;
let mel_points: Vec<_> = (0..n_points)
.map(|i| i as f64 * step + mel_min)
.collect();
// Convert back to Hz, then to FFT bins
let bin_points: Vec<_> = mel_points
.iter()
.map(|&mel| mel_to_hz(mel))
.map(|hz| (hz / df).floor() as usize)
.collect();
// Build sparse filterbank
let mut fb = SparseMatrix::new(n_mels, out_len);
for m in 0..n_mels {
let left = bin_points[m];
let centre = bin_points[m + 1];
let right = bin_points[m + 2];
// Rising slope: left -> centre
for k in left..centre {
let v = (k - left) as f64 / (centre - left) as f64;
fb.set(m, k, v);
}
// Falling slope: centre -> right
for k in centre..right {
let v = (right - k) as f64 / (right - centre) as f64;
fb.set(m, k, v);
}
}
fb
}
For typical speech processing ($N = 512$, 80 mel bands), the filterbank is approximately 95% sparse.
Numerical hygiene: To achieve this sparsity in practice, coefficients below a threshold (typically or machine epsilon for f64) are set to exactly zero during construction. This prevents accumulation of negligible numerical artifacts while maintaining filterbank fidelity.
ERB Scale: Glasberg and Moore Model
The Equivalent Rectangular Bandwidth (ERB) scale is based on psychoacoustic measurements of critical bandwidths. It models the frequency selectivity of the human auditory system.
The ERB in Hz as a function of center frequency $f$ (in Hz):

$$\mathrm{ERB}(f) = 24.7\left(\frac{4.37 f}{1000} + 1\right)$$

To convert frequency to ERB scale (ERB-rate):

$$E(f) = 21.4 \log_{10}\!\left(\frac{4.37 f}{1000} + 1\right)$$

Inverse transformation:

$$f = \frac{1000}{4.37}\left(10^{E/21.4} - 1\right)$$
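In code, the three Glasberg–Moore conversions read as follows (a sketch using the constants above; function names are illustrative, not library API):

```rust
// ERB bandwidth in Hz at center frequency f (Glasberg & Moore).
fn erb_bandwidth_hz(f_hz: f64) -> f64 {
    24.7 * (4.37 * f_hz / 1000.0 + 1.0)
}

// Hz -> ERB-rate (number of ERBs below f).
fn hz_to_erb_rate(f_hz: f64) -> f64 {
    21.4 * (4.37 * f_hz / 1000.0 + 1.0).log10()
}

// ERB-rate -> Hz (inverse of hz_to_erb_rate).
fn erb_rate_to_hz(erb: f64) -> f64 {
    1000.0 * (10f64.powf(erb / 21.4) - 1.0) / 4.37
}
```

At 1 kHz the ERB is about 133 Hz, noticeably narrower (relative to center frequency) than at low frequencies, reflecting the auditory filter widths the scale models.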
Gammatone Filters:
ERB spectrograms often use gammatone filters, which model the impulse response of the auditory filter:
$$g(t) = a\, t^{\,n-1} e^{-2\pi b t} \cos(2\pi f_c t + \phi)$$

where:
- $n$ is the filter order (typically 4)
- $b$ is the bandwidth parameter related to the ERB
- $f_c$ is the center frequency
- $\phi$ is the phase
The library implements gammatone filters in the frequency domain for efficiency.
Why ERB over Mel? The ERB scale more accurately models auditory filters at higher frequencies and is preferred for applications requiring high fidelity to human perception.
Logarithmic Frequency (Musical Scale)
Musical notes are logarithmically spaced: each octave doubles the frequency. For music analysis, a logarithmic frequency axis is natural.
Log-frequency bins:

$$f_k = f_{\min} \cdot 2^{k/B}$$

for $k = 0, 1, \ldots, n_{\text{bins}} - 1$, where $B$ is the number of bins per octave.
Alternatively, using equal spacing in log-space:

$$f_k = \exp\!\left(\ln f_{\min} + k \cdot \frac{\ln f_{\max} - \ln f_{\min}}{n_{\text{bins}} - 1}\right)$$

Linear Interpolation:
Since FFT bins are linearly spaced, linear interpolation is used between adjacent bins to get log-spaced values:

$$S_{\log}[k] = (1 - \alpha)\, S[\lfloor b_k \rfloor] + \alpha\, S[\lceil b_k \rceil]$$

where $b_k = f_k / \Delta f$ and $\alpha = b_k - \lfloor b_k \rfloor$.
fn build_loghz_matrix(sample_rate_hz: f64, n_fft: usize,
n_bins: usize, f_min: f64,
f_max: f64) -> SparseMatrix {
let df = sample_rate_hz / n_fft as f64;
let out_len = r2c_output_size(n_fft); // number of R2C FFT bins: N/2 + 1
// Logarithmically-spaced frequencies
let log_f_min = f_min.ln();
let log_f_max = f_max.ln();
let log_step = (log_f_max - log_f_min) / (n_bins - 1) as f64;
let log_frequencies: Vec<_> = (0..n_bins)
.map(|i| (i as f64 * log_step + log_f_min).exp())
.collect();
// Build interpolation matrix
let mut matrix = SparseMatrix::new(n_bins, out_len);
for (bin_idx, &target_freq) in log_frequencies.iter().enumerate() {
let exact_bin = target_freq / df;
let lower_bin = exact_bin.floor() as usize;
let upper_bin = exact_bin.ceil() as usize;
if lower_bin == upper_bin {
matrix.set(bin_idx, lower_bin, 1.0);
} else {
// Linear interpolation
let frac = exact_bin - lower_bin as f64;
matrix.set(bin_idx, lower_bin, 1.0 - frac);
matrix.set(bin_idx, upper_bin, frac);
}
}
matrix
}
This matrix is extremely sparse: only 1–2 non-zero values per row (>99% sparse).
Filterbank Design and Sparse Matrices
All perceptual scales use sparse matrix multiplication for efficiency. The operation is:

$$\mathbf{y} = F\,\mathbf{x}$$

where $F$ is the filterbank matrix (sparse), $\mathbf{x}$ is the linear spectrum, and $\mathbf{y}$ is the perceptually-scaled spectrum.
Sparse Matrix Storage:
The library uses row-wise sparse format:
struct SparseMatrix {
nrows: usize,
ncols: usize,
values: Vec<Vec<f64>>, // Non-zero values per row
indices: Vec<Vec<usize>>, // Column indices per row
}
Multiplication:
fn multiply_vec(&self, input: &[f64], out: &mut [f64]) {
for (row_idx, (row_values, row_indices)) in
self.values.iter().zip(&self.indices).enumerate() {
let mut acc = 0.0;
for (&value, &col_idx) in row_values.iter().zip(row_indices) {
acc += value * input[col_idx];
}
out[row_idx] = acc;
}
For matrix-vector multiplication where $F$ is $M \times K$:
Dense matrix: $O(MK)$
Sparse matrix with $Z$ non-zeros: $O(Z)$
where $Z$ is the number of non-zero elements.
For filterbanks with uniform sparsity of $z$ non-zeros per row: $O(Mz)$
Example: A mel filterbank with $M = 80$, $K = 257$, and $z = 20$ non-zeros/row needs $80 \times 20 = 1{,}600$ operations vs. 20,560 for dense multiplication.
For mel filterbanks with $M = 80$ bands and a 512-point FFT (257 bins):
- Dense matrix: $80 \times 257 = 20{,}560$ multiplications
- Sparse matrix: roughly 800–1,600 multiplications (10–20 non-zeros/row)
For log-frequency with linear interpolation:
- Only 2 multiplications per output bin (vs. 257 for dense)
Constant-Q Transform (CQT)
Motivation and Q Factor
The Short-Time Fourier Transform provides constant frequency resolution across all frequencies. But musical perception is logarithmic: relative frequency differences (ratios) are more relevant than absolute differences.
The Constant-Q Transform (CQT) provides logarithmically-spaced frequency bins with a constant quality factor Q:
Q = f_k / Δf_k
where f_k is the center frequency and Δf_k is the bandwidth of bin k.
For musical applications with b bins per octave:
Q = 1 / (2^(1/b) − 1)
For example, with b = 12 bins/octave (semitones), Q ≈ 16.8.
Higher Q → narrower bandwidth → better frequency resolution but worse time resolution.
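As a quick numeric check of these relations (plain Python, independent of the library; the values follow directly from Q = 1/(2^(1/b) − 1)):

```python
import math

def q_factor(bins_per_octave: int) -> float:
    """Constant quality factor for b logarithmically spaced bins per octave."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)

q = q_factor(12)                 # semitone spacing
print(f"Q = {q:.2f}")            # ~16.82
print(f"bandwidth at 440 Hz = {440.0 / q:.2f} Hz")  # delta_f = f / Q, ~26.2 Hz
```

Doubling bins_per_octave roughly doubles Q and halves each bin's bandwidth.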
Kernel-Based Implementation
Unlike STFT which uses a fixed-length window, CQT uses variable-length windows (kernels) for each frequency:
For frequency bin k with center frequency f_k, the kernel length is:
N_k = ⌈Q · f_s / f_k⌉
where f_s is the sample rate. Higher frequencies → shorter windows, lower frequencies → longer windows.
The CQT kernel for frequency f_k:
a_k[n] = w[n] · e^(i·2π·f_k·n / f_s), 0 ≤ n < N_k
where w[n] is a window function (typically Hanning or Hamming).
Implementation:
fn generate_kernel_bin(center_freq: f64, kernel_length: usize,
                       sample_rate: f64,
                       window_type: WindowType)
                       -> SpectrogramResult<Vec<Complex<f64>>> {
    let mut kernel = Vec::with_capacity(kernel_length);
    // Generate window (fallible, so this function returns a Result)
    let window = make_window(window_type, kernel_length)?;
    // Generate complex exponential kernel
    for (n, w) in window.iter().enumerate() {
        let t = n as f64 / sample_rate;
        let phase = 2.0 * PI * center_freq * t;
        // e^(i*2*pi*f*t) = cos(2*pi*f*t) + i*sin(2*pi*f*t)
        let exponential = Complex::new(phase.cos(), phase.sin());
        // Apply window function
        let windowed = exponential * w;
        kernel.push(windowed);
    }
    Ok(kernel)
}

Variable-Length Windows
The beauty of CQT is that window length scales with frequency:
- Low frequencies: Long windows → good frequency resolution, poor time resolution
- High frequencies: Short windows → good time resolution, adequate frequency resolution
This matches musical perception: precise pitch resolution is required for low notes (fundamental frequencies), while less precision is typically acceptable for high harmonics.
Example for f_s = 44,100 Hz, Q ≈ 16.8:
- f = 55 Hz (A1): N ≈ 13,485 samples (~305 ms)
- f = 440 Hz (A4): N ≈ 1,686 samples (~38 ms)
- f = 3,520 Hz (A7): N ≈ 211 samples (~4.8 ms)
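The window lengths above can be reproduced with a few lines of plain Python (a sketch, not the library's API):

```python
import math

def kernel_length(center_freq: float, sample_rate: float,
                  bins_per_octave: int = 12) -> int:
    """CQT window length for one bin: N_k = ceil(Q * fs / f_k)."""
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    return math.ceil(q * sample_rate / center_freq)

fs = 44_100.0
for note, freq in [("A1", 55.0), ("A4", 440.0), ("A7", 3520.0)]:
    n = kernel_length(freq, fs)
    print(f"{note} ({freq:>6.0f} Hz): N = {n:>6} samples (~{1000 * n / fs:.1f} ms)")
```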
Sparsity Optimization
CQT kernels naturally have many small-magnitude coefficients, especially near the edges due to windowing. This can be exploited with sparsity thresholding:
k[n] ← 0 if |k[n]| < θ · max_n |k[n]|
where θ is the sparsity threshold (e.g., 0.01 for 1%).
fn apply_sparsity_threshold(kernel: &mut [Complex<f64>],
threshold: f64) {
let max_magnitude = kernel.iter()
.map(|c| c.norm())
.fold(0.0, f64::max);
let absolute_threshold = max_magnitude * threshold;
for coefficient in kernel.iter_mut() {
if coefficient.norm() < absolute_threshold {
*coefficient = Complex::new(0.0, 0.0);
}
}
}

Typical sparsity levels: 60–80% of kernel coefficients can be zeroed with θ = 0.01 with negligible quality loss.
Computational benefit: Sparse kernels reduce the number of operations in time-domain convolution by skipping zero multiplications.
Warning: Aggressive sparsity thresholds (e.g., θ ≥ 0.05) can introduce audible artifacts:
- Spectral leakage: Truncating kernel tails creates discontinuities, causing frequency bleeding between adjacent bins.
- Musical noise: Sparse, time-varying kernels can produce tonal artifacts that vary frame-to-frame, perceived as "chirping" or "burbling" sounds.
- Time-smearing: Excessively truncated kernels lose their designed frequency selectivity.
Recommended practice: Use conservative thresholds (θ ≤ 0.01) for high-quality audio analysis. For real-time or resource-constrained applications, validate perceptual quality through listening tests before deploying aggressive sparsity.
Advanced Audio Features
Mel-Frequency Cepstral Coefficients (MFCC)
MFCCs are the most widely used features in speech recognition, speaker identification, and audio classification. Introduced by Davis and Mermelstein in 1980, they provide a compact representation of the spectral envelope.
Computation pipeline:
- Compute the mel-scaled power spectrogram M[m, t]
- Apply the logarithm: log M[m, t]
- Apply the Discrete Cosine Transform (DCT-II) to each frame
- Optionally apply cepstral liftering
- Keep the first N coefficients (typically 13)
Why DCT? The mel spectrogram has correlated bins due to overlapping triangular filters. The Discrete Cosine Transform (DCT) decorrelates them and compacts energy into the first few coefficients.
The DCT-II formula:
C[k] = Σ_{n=0}^{N−1} x[n] · cos(π k (n + 1/2) / N)
The DCT is closely related to the DFT but operates entirely on real numbers. It's used in JPEG image compression and many audio codecs due to its excellent energy compaction properties.
fn dct_ii(input: &[f64]) -> Vec<f64> {
let n = input.len();
let mut output = vec![0.0; n];
for k in 0..n {
let mut sum = 0.0;
for (i, &val) in input.iter().enumerate() {
sum += val * (PI * k as f64 * (i as f64 + 0.5) / n as f64).cos();
}
output[k] = sum;
}
output
}

Cepstral liftering:
Liftering applies a sinusoidal weighting to emphasize mid-range coefficients:
w[n] = 1 + (L/2) · sin(π n / L)
where L is the lifter parameter (commonly 22).
fn apply_liftering(mfcc: &mut Array2<f64>, lifter: usize) {
    let n_mfcc = mfcc.nrows();
    let n_frames = mfcc.ncols();
    // Compute lifter weights: w[i] = 1 + (L/2) * sin(pi * i / L)
    let weights: Vec<_> = (0..n_mfcc)
        .map(|i| 1.0 + (lifter as f64 / 2.0)
            * (PI * i as f64 / lifter as f64).sin())
        .collect();
    // Apply weights to each frame
    for frame_idx in 0..n_frames {
        for coeff_idx in 0..n_mfcc {
            mfcc[[coeff_idx, frame_idx]] *= weights[coeff_idx];
        }
    }
}

The C0 coefficient: The zeroth coefficient represents the overall energy of the frame. Some applications include it (speaker recognition), others exclude it (speech recognition) to reduce sensitivity to volume.
Why 13 coefficients? Empirically, 13 MFCCs capture most speech information. Higher coefficients represent fine spectral detail that's often noise.
Chroma Features for Music Analysis
Chroma features (also called pitch class profiles) represent the energy distribution across the 12 pitch classes of the Western musical scale, collapsing all octaves together.
The 12 pitch classes: C, C#, D, D#, E, F, F#, G, G#, A, A#, B
Construction:
Map FFT bins to musical notes using n = 12 · log2(f / f_ref), where f_ref is a reference frequency (e.g., A4 = 440 Hz)
Determine the pitch class: p = n mod 12
Sum energy from all octaves for each pitch class: c[p] = Σ_o E[p + 12·o]
Normalize (L1, L2, or max)
Normalization strategies:
- L1: Σ_p |c[p]| = 1 (probabilities)
- L2: Σ_p c[p]² = 1 (Euclidean norm)
- Max: max_p c[p] = 1 (relative strength)
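The note-to-pitch-class mapping can be sketched in plain Python (PITCH_CLASSES and pitch_class are illustrative names, not the library's API; A4 = 440 Hz sits at index 9 of the C-based list):

```python
import math

PITCH_CLASSES = ['C', 'C#', 'D', 'D#', 'E', 'F',
                 'F#', 'G', 'G#', 'A', 'A#', 'B']

def pitch_class(freq: float, f_ref: float = 440.0) -> int:
    """Semitones relative to A4, offset by 9 (A's index), folded mod 12."""
    n = round(12.0 * math.log2(freq / f_ref))
    return (n + 9) % 12

print(PITCH_CLASSES[pitch_class(440.0)])    # A
print(PITCH_CLASSES[pitch_class(261.63)])   # C (middle C)
print(PITCH_CLASSES[pitch_class(880.0)])    # A -- octaves collapse
```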
Applications: Chord recognition, key estimation, cover song identification, music similarity.
ERB Spectrograms and Gammatone Filters
ERB spectrograms using gammatone filters provide the library's most perceptually accurate time-frequency representation.
The gammatone impulse response:
g(t) = a · t^(n−1) · e^(−2π b t) · cos(2π f_c t + φ)
where:
- a is the amplitude
- n is the filter order (typically 4)
- b is the bandwidth parameter
- f_c is the center frequency
- φ is the phase
The bandwidth parameter relates to ERB:
b = 1.019 · ERB(f_c)
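Using the Glasberg & Moore ERB formula ERB(f) = 24.7 · (4.37 · f/1000 + 1) from the perceptual-scales section, the bandwidth parameter is easy to tabulate (a plain-Python sketch, independent of the library):

```python
def erb_bandwidth(fc: float) -> float:
    """Glasberg & Moore equivalent rectangular bandwidth in Hz."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_b(fc: float) -> float:
    """Bandwidth parameter b = 1.019 * ERB(fc) for the 4th-order filter."""
    return 1.019 * erb_bandwidth(fc)

for fc in (100.0, 1000.0, 4000.0):
    print(f"fc = {fc:>6.0f} Hz: ERB = {erb_bandwidth(fc):6.1f} Hz, "
          f"b = {gammatone_b(fc):6.1f}")
```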
Frequency domain implementation:
For efficiency, the library implements gammatone filtering in the frequency domain. The frequency-domain transfer function is:
G(f) = (1 + i (f − f_c) / b)^(−n)
where n is the filter order (typically 4), f_c is the center frequency, b is the bandwidth parameter, and i = √(−1). This fourth-order complex pole provides the characteristic asymmetric frequency response of the auditory filter.
Filtering is then performed via element-wise multiplication:
Y[k] = X[k] · G[k]
This avoids expensive time-domain convolution for each filter.
Image Processing via FFT
2D Convolution and the Convolution Theorem
The convolution theorem states that convolution in the spatial domain is equivalent to multiplication in the frequency domain:
F{f * g} = F{f} · F{g}
For images:
F{f * h} = F{f} ⊙ F{h}
where ⊙ denotes element-wise multiplication.
Complexity comparison:
Spatial convolution for an M × N image with a K × K kernel: O(M · N · K²)
FFT-based convolution: O(M · N · log(M · N))
Theoretical crossover: FFT has better asymptotic complexity when K² > log(M · N); in practice the crossover is typically around K ≈ 10–15.
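A rough operation-count comparison (plain Python; the FFT constant c is an assumption, so the exact crossover kernel size depends on the implementation):

```python
import math

def spatial_ops(m: int, n: int, k: int) -> int:
    """Direct convolution: one multiply per kernel tap per output pixel."""
    return m * n * k * k

def fft_ops(m: int, n: int, c: float = 5.0) -> float:
    """Approximate FFT-convolution cost: c * MN * log2(MN)."""
    return c * m * n * math.log2(m * n)

m = n = 512
for k in (3, 7, 15, 31):
    winner = "direct" if spatial_ops(m, n, k) < fft_ops(m, n) else "FFT"
    print(f"{k:>2}x{k:<2} kernel: {winner} wins")
```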
Kernel Padding and Phase Shifting
For correct FFT convolution, the kernel must be padded to image size with proper phase shifting. The kernel's center should map to position (0, 0) in the padded array:
fn pad_kernel_for_fft(kernel: &Array2<f64>,
                      target_shape: (usize, usize)) -> Array2<f64> {
    let (ker_rows, ker_cols) = kernel.dim();
    let (target_rows, target_cols) = target_shape;
    let mut result = Array2::<f64>::zeros(target_shape);
    // Kernel center position
    let ker_center_row = ker_rows / 2;
    let ker_center_col = ker_cols / 2;
    // Place kernel with center at (0, 0) using wraparound
    for i in 0..ker_rows {
        for j in 0..ker_cols {
            let row_offset = i as isize - ker_center_row as isize;
            let col_offset = j as isize - ker_center_col as isize;
            // Wrap around to place center at (0, 0)
            let target_row = row_offset.rem_euclid(target_rows as isize) as usize;
            let target_col = col_offset.rem_euclid(target_cols as isize) as usize;
            result[[target_row, target_col]] = kernel[[i, j]];
        }
    }
    result
}

This ensures the FFT interprets the kernel correctly for circular convolution.
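The wraparound placement can be verified with a small plain-Python model of the same logic (list-of-lists instead of Array2):

```python
def pad_kernel(kernel, target_shape):
    """Zero-pad a kernel so its centre lands at (0, 0), wrapping
    negative offsets around -- the circular-convolution convention."""
    ker_rows, ker_cols = len(kernel), len(kernel[0])
    rows, cols = target_shape
    out = [[0.0] * cols for _ in range(rows)]
    cr, cc = ker_rows // 2, ker_cols // 2
    for i in range(ker_rows):
        for j in range(ker_cols):
            out[(i - cr) % rows][(j - cc) % cols] = kernel[i][j]
    return out

k = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
padded = pad_kernel(k, (6, 6))
print(padded[0][0])   # 5 -- the kernel centre is now at (0, 0)
print(padded[5][5])   # 1 -- the top-left tap wrapped to the far corner
```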
Spatial Filtering in the Frequency Domain
Filtering is multiplication by a frequency-domain mask:
Y[u, v] = X[u, v] · H[u, v]
Low-pass filter: Keeps low frequencies, removes high frequencies (smoothing)
High-pass filter: Removes low frequencies, keeps high frequencies (sharpening/edges)
Band-pass filter: Keeps frequencies in a range
Gaussian Blur via FFT
The Gaussian kernel for blurring:
G(x, y) = (1 / (2π σ²)) · e^(−(x² + y²) / (2σ²))
Implementation:
pub fn gaussian_kernel_2d(size: NonZeroUsize, sigma: f64) -> SpectrogramResult<Array2<f64>> {
if size.get().is_multiple_of(2) {
return Err(SpectrogramError::invalid_input(
"kernel size must be odd and > 0",
));
}
if sigma <= 0.0 {
return Err(SpectrogramError::invalid_input("sigma must be > 0"));
}
let size = size.get();
let center = (size / 2) as f64;
let variance = sigma * sigma;
let coeff = 1.0 / (2.0 * PI * variance);
let mut kernel = Array2::<f64>::zeros((size, size));
for i in 0..size {
for j in 0..size {
let x = i as f64 - center;
let y = j as f64 - center;
let exponent = -(x * x + y * y) / (2.0 * variance);
kernel[[i, j]] = coeff * exponent.exp();
}
}
// Normalize to sum to 1.0
let sum: f64 = kernel.iter().sum();
kernel.mapv_inplace(|v| v / sum);
Ok(kernel)
}

The Gaussian has a useful property: its Fourier transform is also Gaussian, ensuring a smooth frequency response.
Edge Detection and High-Pass Filtering
Edges correspond to high-frequency components. A simple high-pass sharpening (unsharp mask) enhances edges:
sharpened = image + α · (image − blurred)
Or apply a high-pass mask directly in the frequency domain:
H(u, v) = 1 − e^(−(u² + v²) / (2σ²))
The Laplacian operator for edge detection:
∇²f = ∂²f/∂x² + ∂²f/∂y²
In the frequency domain, differentiation becomes multiplication:
F{∇²f}(u, v) = −4π² (u² + v²) F(u, v)
Computational Optimizations
Plan-Based Computation
FFT planning involves:
- Analyzing the transform size
- Selecting optimal algorithm (radix-2, split-radix, Bluestein, etc.)
- Precomputing twiddle factors: W_N^k = e^(−2πi k / N)
- Allocating scratch buffers
For a single FFT, planning overhead is negligible. But for spectrograms with hundreds of frames, planning each FFT is wasteful.
Solution: Reusable plans
pub struct StftPlan {
n_fft: usize,
hop: usize,
window: Vec<f64>, // Pre-computed window
fft: Box<dyn R2cPlan>, // Reusable FFT plan
fft_out: Vec<Complex<f64>>, // Reusable buffer
frame: Vec<f64>, // Reusable frame buffer
}
impl StftPlan {
pub fn new(params: &SpectrogramParams) -> Result<Self> {
let n_fft = params.stft().n_fft();
let window = make_window(params.stft().window(), n_fft)?;
// Create FFT plan once
let mut planner = RealFftPlanner::new();
let fft = planner.plan_r2c(n_fft)?;
Ok(Self {
n_fft,
window,
fft,
fft_out: vec![Complex::new(0.0, 0.0); n_fft/2 + 1],
frame: vec![0.0; n_fft],
})
}
// Reuse plan for each frame
fn compute_frame(&mut self, samples: &[f64],
frame_idx: usize) -> Result<()> {
// Window frame into self.frame buffer
// ...
// Reuse FFT plan and buffers
self.fft.process(&self.frame, &mut self.fft_out)?;
Ok(())
}
}

Computational benefit: For 1000-frame spectrograms, plan reuse amortizes the planning overhead across all frames.
Sparse Matrix Operations
For filterbank multiplication y = Mx where M is sparse:
Dense: O(M · N) for an M × N matrix
Sparse: O(nnz), where nnz is the number of non-zeros
For mel filterbanks with 80 bands and 257 FFT bins:
- Dense: 80 × 257 = 20,560 ops
- Sparse (~20 non-zeros/row): 80 × 20 = 1,600 ops
Row-wise sparse storage:
struct SparseMatrix {
nrows: usize,
ncols: usize,
values: Vec<Vec<f64>>, // Non-zero values per row
indices: Vec<Vec<usize>>, // Column indices per row
}
fn multiply_vec(&self, input: &[f64], out: &mut [f64]) {
for (row_idx, (row_values, row_indices)) in
self.values.iter().zip(&self.indices).enumerate() {
let mut acc = 0.0;
for (&value, &col_idx) in
row_values.iter().zip(row_indices) {
acc += value * input[col_idx];
}
out[row_idx] = acc;
}
}

Cache efficiency: Row-wise format ensures sequential access to both the input array and sparse values, maximizing cache hits.
Design Patterns
Minimize allocations in hot loops:
Bad: Allocate per iteration
for frame in 0..n_frames {
let mut windowed = vec![0.0; n_fft]; // Allocation!
// ... process frame
}

Good: Reuse workspace
let mut workspace = vec![0.0; n_fft]; // Single allocation
for frame in 0..n_frames {
// Reuse workspace
fill_frame(&mut workspace, samples, frame);
// ... process frame
}

Workspace pattern:
struct Workspace {
frame: Vec<f64>, // Windowed frame buffer
fft_out: Vec<Complex<f64>>, // FFT output buffer
spectrum: Vec<f64>, // Power spectrum buffer
mapped: Vec<f64>, // Filterbank output buffer
}
impl Workspace {
fn ensure_sizes(&mut self, n_fft: usize,
out_len: usize, n_bins: usize) {
self.frame.resize(n_fft, 0.0);
self.fft_out.resize(out_len, Complex::new(0.0, 0.0));
self.spectrum.resize(out_len, 0.0);
self.mapped.resize(n_bins, 0.0);
}
}

Memory Layout and Cache Efficiency
Row-major vs. column-major:
The library uses row-major layout (C-style) for compatibility with most FFT libraries:
// Row-major: rows are contiguous
let data = Array2::<f64>::zeros((nrows, ncols));
assert!(data.is_standard_layout()); // row-major

For column-wise operations (processing spectrogram frames), this can cause cache misses, but the benefit of contiguous memory for the FFT outweighs this.
Alignment: Modern FFT libraries (FFTW, RealFFT) benefit from aligned memory. The library ensures contiguous slices are passed to FFT routines.
FFTW Integration
FFTW (Fastest Fourier Transform in the West) is considered the gold standard for FFT performance. The library supports both:
- RealFFT: Pure Rust, portable, good performance
- FFTW: C library, excellent performance, requires system installation
FFTW wisdom:
FFTW learns optimal algorithms through runtime measurements. Plans can be saved and reused:
// FFTW planning flags:
// ESTIMATE: Fast planning, ok performance
// MEASURE: Slower planning, better performance
// PATIENT: Very slow planning, best performance
let plan = fftw::plan_r2c_1d(n_fft, FFTW_MEASURE);

For repeated transforms of the same size, MEASURE or PATIENT modes can provide better performance than ESTIMATE.
Numerical Considerations
Floating-Point Precision
The library uses f64 (double precision) throughout:
- 53 bits of precision (~16 decimal digits)
- Dynamic range: ~10⁻³⁰⁸ to ~10³⁰⁸
- Sufficient for all audio processing needs
Why not f32? While f32 has lower memory requirements, audio processing requires:
- Accurate accumulation over many samples (STFT frames)
- Stable recursive filters (if implemented)
- Wide dynamic range for decibel conversions
f64 ensures numerical stability without performance concerns on modern CPUs.
Normalization Conventions
Different conventions exist for FFT normalization. The library uses:
Forward FFT: No normalization
Inverse FFT: Divide by N
This ensures IFFT(FFT(x)) = x exactly.
Alternative conventions (used by some libraries):
- Both directions scaled by 1/√N (symmetric)
- Forward scaled by 1/N, inverse unscaled (rare)
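A toy DFT/IDFT pair in plain Python (not the library's FFT) makes the unnormalized-forward / 1/N-inverse convention concrete:

```python
import cmath

def dft(x):
    """Forward DFT, no normalization: X[k] = sum_n x[n] * e^(-2*pi*i*k*n/N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT scaled by 1/N, so idft(dft(x)) recovers x."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n for t in range(n)]

x = [1.0, 2.0, 3.0, 4.0]
roundtrip = idft(dft(x))
print(max(abs(a - b) for a, b in zip(x, roundtrip)))  # round-trip error ~1e-15
```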
Dynamic Range and Epsilon Thresholds
Logarithm floor:
To avoid log(0):
let epsilon = 1e-10;
let db = 10.0 * (power.max(epsilon)).log10();

Choosing ε:
- Too large: Loss of quiet signals
- Too small: Numerical instability
- Sweet spot: 10⁻¹² to 10⁻¹⁰ for f64
Sparse matrix threshold:
For considering values "zero":
if value.abs() > 1e-10 {
store(value);
}

This is much larger than machine epsilon (~2.2 × 10⁻¹⁶ for f64) to account for accumulated rounding errors.
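The headroom claim is easy to check in Python, whose float is the same IEEE 754 f64:

```python
import sys

eps = sys.float_info.epsilon          # machine epsilon for f64
print(eps)                            # ~2.22e-16

threshold = 1e-10                     # the sparse-storage cutoff above
print(threshold / eps)                # ~4.5e5: ample headroom over rounding error
```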
Conclusion and Further Reading
Summary of Key Concepts
This manual covered:
Core algorithms:
- DFT and FFT (Cooley-Tukey algorithm)
- Real-to-complex FFT with Hermitian symmetry
- 2D FFT via row-column decomposition
- STFT and time-frequency uncertainty
Signal processing methods:
- Window functions (Hanning, Hamming, Blackman, Kaiser, Gaussian)
- Amplitude scaling (power, magnitude, decibels)
- Convolution theorem for efficient filtering
Perceptual scales:
- Mel scale for speech processing
- ERB scale (Glasberg & Moore) for auditory modeling
- Logarithmic frequency for music
- CQT with variable-length kernels
Advanced features:
- MFCCs via DCT for speech recognition
- Chroma features for music analysis
- Gammatone filters for perceptual accuracy
Optimizations:
- Plan-based computation
- Sparse matrix operations
- Workspace patterns
- Cache-efficient memory layouts
Applications and Extensions
The techniques in this manual enable:
- Speech recognition and speaker identification
- Music information retrieval
- Audio classification and tagging
- Sound event detection
- Acoustic scene analysis
- Image denoising and enhancement
- Texture analysis
- Pattern recognition
Software:
- FFTW: www.fftw.org
- Librosa (Python): librosa.org
- NumPy (Python): numpy.org
- SciPy (Python): scipy.org
- This library: github.com/jmg049/Spectrograms
Python API Reference
Introduction
This section provides a complete reference for the Python API, demonstrating how to apply the mathematical concepts from previous sections in practice. All code examples use samples (audio data as ndarray[float64]) and image (image data as ndarray[float64]).
Installation and Imports
pip install spectrograms

import numpy as np
import spectrograms as sg

API Organization
The API consists of four components:
- Parameter Classes: Configuration objects
- Computation Functions: Single-use convenience functions
- Planner Classes: Reusable plans for batch processing
- Result Objects: Rich results with metadata
All computation functions release Python's GIL for parallel processing.
Error Handling
try:
    spec = sg.compute_mel_power_spectrogram(samples, params, mel_params)
except sg.InvalidInputError as e:
    print(f"Invalid parameters: {e}")
except sg.DimensionMismatchError as e:
    print(f"Array size mismatch: {e}")
except sg.FFTBackendError as e:
    print(f"FFT computation error: {e}")
except sg.SpectrogramError as e:
    print(f"Library error: {e}")

Basic FFT Operations
1D FFT
Compute the Discrete Fourier Transform:
n_fft: int = 512
# Compute FFT
spectrum: npt.NDArray[np.complex128] = sg.compute_fft(samples[:n_fft], n_fft)
print(f"Output shape: {spectrum.shape}") # (257,) due to Hermitian symmetry
print(f"Output dtype: {spectrum.dtype}") # complex128Inverse FFT
Perfect reconstruction:
spectrum: npt.NDArray[np.complex128] = sg.compute_fft(samples[:512], 512)
reconstructed: npt.NDArray[np.float64] = sg.compute_irfft(spectrum, 512)
error: float = np.max(np.abs(samples[:512] - reconstructed))
print(f"Reconstruction error: {error:.2e}") # < 1e-14Power and Magnitude Spectra
# Power spectrum: |X[k]|^2
power: npt.NDArray[np.float64] = sg.compute_power_spectrum(samples[:512], 512)
# Magnitude spectrum: |X[k]|
magnitude: npt.NDArray[np.float64] = sg.compute_magnitude_spectrum(samples[:512], 512)
# With windowing
power_windowed: npt.NDArray[np.float64] = sg.compute_power_spectrum(samples[:512], 512, window=sg.WindowType.hanning)

Window Functions
Predefined Windows
# WindowType class methods
rect: sg.WindowType = sg.WindowType.rectangular
hann: sg.WindowType = sg.WindowType.hanning
hamming: sg.WindowType = sg.WindowType.hamming
blackman: sg.WindowType = sg.WindowType.blackman()
# Use with compute functions
power: npt.NDArray[np.float64] = sg.compute_power_spectrum(samples[:512], 512, window=hann)

Parametric Windows
Kaiser and Gaussian:
# Kaiser window with beta parameter
kaiser: sg.WindowType = sg.WindowType.kaiser(beta=8.6)
power_kaiser: npt.NDArray[np.float64] = sg.compute_power_spectrum(samples[:512], 512, window=kaiser)
# Gaussian window with std
gaussian: sg.WindowType = sg.WindowType.gaussian(std=0.4)
power_gaussian: npt.NDArray[np.float64] = sg.compute_power_spectrum(samples[:512], 512, window=gaussian)

STFT and Spectrograms
Configuration
STFT parameters:
# Manual configuration
window: sg.WindowType = sg.WindowType.hanning
stft: sg.StftParams = sg.StftParams(
n_fft=512,
hop_size=160,
window=window,
centre=True
)
params: sg.SpectrogramParams = sg.SpectrogramParams(stft, sample_rate=16000)
# Presets
params_speech: sg.SpectrogramParams = sg.SpectrogramParams.speech_default(16000)
params_music: sg.SpectrogramParams = sg.SpectrogramParams.music_default(44100)

Linear Spectrograms
Three amplitude representations:
params: sg.SpectrogramParams = sg.SpectrogramParams.speech_default(16000)
# Power: |X[k]|^2
spec_power: sg.Spectrogram = sg.compute_linear_power_spectrogram(samples, params)
# Magnitude: |X[k]|
spec_magnitude: sg.Spectrogram = sg.compute_linear_magnitude_spectrogram(samples, params)
# Decibels: 10*log10(|X[k]|^2)
db_params: sg.LogParams = sg.LogParams(floor_db=-80.0)
spec_db: sg.Spectrogram = sg.compute_linear_db_spectrogram(samples, params, db_params)
print(f"Shape: {spec_power.shape}") # (257, n_frames)Spectrogram Result Object
spec: sg.Spectrogram = sg.compute_linear_db_spectrogram(samples, params, db_params)
# Access as ndarray via buffer protocol
data: npt.NDArray[np.float64] = np.asarray(spec) # Shape: (n_bins, n_frames)
# Direct property access
frequencies: npt.NDArray[np.float64] = spec.frequencies
times: npt.NDArray[np.float64] = spec.times
n_bins: int = spec.n_bins
n_frames: int = spec.n_frames
# Methods
duration: float = spec.duration()
freq_range: tuple[float, float] = spec.frequency_range()
db_range: tuple[float, float] | None = spec.db_range()

Perceptual Frequency Scales
Mel Scale
Mel-scale spectrograms:
mel_params: sg.MelParams = sg.MelParams(
n_mels=80,
f_min=0.0,
f_max=8000.0
)
# Three amplitude types
mel_power: sg.Spectrogram = sg.compute_mel_power_spectrogram(samples, params, mel_params)
mel_magnitude: sg.Spectrogram = sg.compute_mel_magnitude_spectrogram(samples, params, mel_params)
mel_db: sg.Spectrogram = sg.compute_mel_db_spectrogram(samples, params, mel_params, db_params)
print(f"Mel shape: {mel_power.shape}") # (80, n_frames)
print(f"Mel frequencies: {mel_power.frequencies[:5]}")ERB Scale
ERB-scale spectrograms:
erb_params: sg.ErbParams = sg.ErbParams(
n_filters=64,
f_min=50.0,
f_max=8000.0
)
erb_power: sg.Spectrogram = sg.compute_erb_power_spectrogram(samples, params, erb_params)
erb_magnitude: sg.Spectrogram = sg.compute_erb_magnitude_spectrogram(samples, params, erb_params)
erb_db: sg.Spectrogram = sg.compute_erb_db_spectrogram(samples, params, erb_params, db_params)

Logarithmic Hz Scale
Logarithmic frequency scale for music:
loghz_params: sg.LogHzParams = sg.LogHzParams(
n_bins=84,
f_min=55.0,
f_max=7040.0
)
loghz_power: sg.Spectrogram = sg.compute_loghz_power_spectrogram(samples, params, loghz_params)
loghz_magnitude: sg.Spectrogram = sg.compute_loghz_magnitude_spectrogram(samples, params, loghz_params)
loghz_db: sg.Spectrogram = sg.compute_loghz_db_spectrogram(samples, params, loghz_params, db_params)

Constant-Q Transform
cqt_params: sg.CqtParams = sg.CqtParams(
bins_per_octave=12,
n_octaves=7,
f_min=55.0
)
cqt_power: sg.Spectrogram = sg.compute_cqt_power_spectrogram(samples, params, cqt_params)
cqt_magnitude: sg.Spectrogram = sg.compute_cqt_magnitude_spectrogram(samples, params, cqt_params)
cqt_db: sg.Spectrogram = sg.compute_cqt_db_spectrogram(samples, params, cqt_params, db_params)

Audio Features
Raw STFT
Complex-valued STFT matrix:
stft_matrix: npt.NDArray[np.complex128] = sg.compute_stft(samples, params.stft)
print(f"STFT shape: {stft_matrix.shape}") # (n_bins, n_frames)
print(f"STFT dtype: {stft_matrix.dtype}") # complex128
# Extract magnitude and phase
magnitude: npt.NDArray[np.float64] = np.abs(stft_matrix)
phase: npt.NDArray[np.float64] = np.angle(stft_matrix)

Inverse STFT
Reconstruct signal from STFT:
stft_matrix: npt.NDArray[np.complex128] = sg.compute_stft(samples, params.stft)
reconstructed: npt.NDArray[np.float64] = sg.compute_istft(
stft_matrix,
n_fft=params.stft.n_fft,
hop_size=params.stft.hop_size,
window=params.stft.window,
centre=params.stft.centre
)
# Verify reconstruction
error: float = np.max(np.abs(samples[:len(reconstructed)] - reconstructed))
print(f"Error: {error:.2e}")MFCC
Mel-Frequency Cepstral Coefficients:
mfcc_params: sg.MfccParams = sg.MfccParams(
n_mfcc=13,
use_dct_ortho=True,
lifter_coeff=22.0
)
mfccs: npt.NDArray[np.float64] = sg.compute_mfcc(
samples,
params.stft,
sample_rate=16000,
n_mels=40,
mfcc_params=mfcc_params
)
print(f"MFCC shape: {mfccs.shape}") # (13, n_frames)Chromagram
Pitch class profiles:
chroma_params: sg.ChromaParams = sg.ChromaParams(
tuning=440.0,
f_min=32.7,
f_max=4186.0,
norm=sg.ChromaNorm.l2
)
chroma: sg.Chromagram = sg.compute_chromagram(
samples,
params.stft,
sample_rate=16000,
chroma_params=chroma_params
)
# Access as ndarray (12 pitch classes, n_frames)
chroma_data: npt.NDArray[np.float64] = np.asarray(chroma)
pitch_classes = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

Performance Optimization
Plan Reuse
Reuse FFT plans for batch processing:
# Single computation
spec1: sg.Spectrogram = sg.compute_mel_db_spectrogram(samples, params, mel_params, db_params)
# Batch with plan reuse
planner: sg.SpectrogramPlanner = sg.SpectrogramPlanner()
plan = planner.mel_db_plan(params, mel_params, db_params)
signals = [samples1, samples2, samples3]
spectrograms = [plan.compute(s) for s in signals]

Available Plans
All spectrogram types have corresponding plans:
planner = sg.SpectrogramPlanner()
# Linear (3 types)
linear_power_plan = planner.linear_power_plan(params)
linear_magnitude_plan = planner.linear_magnitude_plan(params)
linear_db_plan = planner.linear_db_plan(params, db_params)
# Mel (3 types)
mel_power_plan = planner.mel_power_plan(params, mel_params)
mel_magnitude_plan = planner.mel_magnitude_plan(params, mel_params)
mel_db_plan = planner.mel_db_plan(params, mel_params, db_params)
# ERB (3 types)
erb_power_plan = planner.erb_power_plan(params, erb_params)
erb_magnitude_plan = planner.erb_magnitude_plan(params, erb_params)
erb_db_plan = planner.erb_db_plan(params, erb_params, db_params)
# LogHz (3 types)
loghz_power_plan = planner.loghz_power_plan(params, loghz_params)
loghz_magnitude_plan = planner.loghz_magnitude_plan(params, loghz_params)
loghz_db_plan = planner.loghz_db_plan(params, loghz_params, db_params)
# CQT (3 types)
cqt_power_plan = planner.cqt_power_plan(params, cqt_params)
cqt_magnitude_plan = planner.cqt_magnitude_plan(params, cqt_params)
cqt_db_plan = planner.cqt_db_plan(params, cqt_params, db_params)
# All plans use same interface
spec = plan.compute(samples)

Frame-by-Frame Processing
Process individual frames without full spectrogram:
planner = sg.SpectrogramPlanner()
plan = planner.mel_power_plan(params, mel_params)
n_frames = (len(samples) + params.stft.n_fft - params.stft.hop_size) // params.stft.hop_size
for frame_idx in range(n_frames):
    frame = plan.compute_frame(samples, frame_idx)  # 1D array (n_mels,)
    # Process frame...

Streaming Processing
Real-time frame extraction:
planner = sg.SpectrogramPlanner()
plan = planner.mel_power_plan(params, mel_params)
buffer = np.array([], dtype=np.float64)
chunk_size = 160
for chunk in audio_stream:
    buffer = np.concatenate([buffer, chunk])
    while len(buffer) >= params.stft.n_fft:
        frame = plan.compute_frame(buffer, 0)
        # Process frame...
        buffer = buffer[params.stft.hop_size:]

2D FFT
Forward 2D FFT
image = np.load('test.png') # Shape: (height, width)
spectrum = sg.fft2d(image)
print(f"Spectrum shape: {spectrum.shape}") # (height, width//2 + 1)
print(f"Spectrum dtype: {spectrum.dtype}") # complex64Inverse 2D FFT
spectrum = sg.fft2d(image)
reconstructed = sg.ifft2d(spectrum, image.shape[1])
error = np.max(np.abs(image - reconstructed))
print(f"Reconstruction error: {error:.2e}")2D Power and Magnitude Spectra
power = sg.power_spectrum_2d(image) # Shape: (height, width//2 + 1)
magnitude = sg.magnitude_spectrum_2d(image)
# Log scale for visualization
power_db = 10 * np.log10(power + 1e-10)

FFT Shift
Center DC component:
power = sg.power_spectrum_2d(image)
# Shift DC to center
power_shifted = sg.fftshift(power)
# Reverse shift
power_restored = sg.ifftshift(power_shifted)

Batch 2D FFT
planner = sg.Fft2dPlanner()
# Forward plan
fft_plan = planner.fft2d_plan(nrows, ncols)
spectra = [fft_plan.forward(img) for img in images]
# Inverse plan
ifft_plan = planner.ifft2d_plan(nrows, ncols)
reconstructed = [ifft_plan.inverse(spec) for spec in spectra]
# Power spectrum plan
power_plan = planner.power_spectrum_plan(nrows, ncols)
powers = [power_plan.compute(img) for img in images]
# Magnitude spectrum plan
magnitude_plan = planner.magnitude_spectrum_plan(nrows, ncols)
magnitudes = [magnitude_plan.compute(img) for img in images]

Image Filtering
Gaussian Blur
kernel = sg.gaussian_kernel_2d(size=15, sigma=2.0)
blurred = sg.convolve_fft(image, kernel)

Frequency Filters
# Lowpass: smooth (keep low frequencies)
lowpass = sg.lowpass_filter(image, cutoff_fraction=0.1)
# Highpass: edges (keep high frequencies)
highpass = sg.highpass_filter(image, cutoff_fraction=0.1)
# Bandpass: specific frequency range
bandpass = sg.bandpass_filter(image, low_cutoff=0.1, high_cutoff=0.5)

Edge Detection and Sharpening
# Edge detection
edges = sg.detect_edges_fft(image)
# Sharpening
sharpened = sg.sharpen_fft(image, amount=1.5)

Complete API Reference
Parameter Classes
WindowType:
sg.WindowType.rectangular
sg.WindowType.hanning
sg.WindowType.hamming
sg.WindowType.blackman()
sg.WindowType.kaiser(beta: float)
sg.WindowType.gaussian(std: float)

StftParams:
stft = sg.StftParams(
n_fft: int,
hop_size: int,
window: str | WindowType,
center: bool = True
)

LogParams:
db_params = sg.LogParams(floor_db: float)

SpectrogramParams:
params = sg.SpectrogramParams(stft: StftParams, sample_rate: float)
params = sg.SpectrogramParams.speech_default(sample_rate: float)
params = sg.SpectrogramParams.music_default(sample_rate: float)

MelParams:
mel_params = sg.MelParams(
n_mels: int,
f_min: float = 0.0,
f_max: float | None = None
)

ErbParams:
erb_params = sg.ErbParams(
n_bands: int,
f_min: float = 50.0,
f_max: float | None = None,
use_gammatone: bool = True
)

LogHzParams:
loghz_params = sg.LogHzParams(
n_bins: int,
f_min: float,
f_max: float,
bins_per_octave: int = 12
)

CqtParams:
cqt_params = sg.CqtParams(
n_bins: int,
f_min: float,
bins_per_octave: int = 12,
sparsity_threshold: float = 0.01
)

ChromaParams:
chroma_params = sg.ChromaParams(
tuning_frequency: float = 440.0,
n_chroma: int = 12,
norm_type: str = "L2"
)
chroma_params = sg.ChromaParams.music_standard()

MfccParams:
mfcc_params = sg.MfccParams(
n_mfcc: int = 13,
use_energy: bool = True,
lifter_coefficient: float = 22.0,
normalize: bool = False
)
mfcc_params = sg.MfccParams.speech_standard()

Spectrogram Functions
All functions return Spectrogram objects exposing .data, .frequencies, .times, .duration(), .frequency_range(), and .db_range(); they can also act as an ndarray via the array protocol, with no additional function calls:
# Linear
sg.compute_linear_power_spectrogram(samples, params)
sg.compute_linear_magnitude_spectrogram(samples, params)
sg.compute_linear_db_spectrogram(samples, params, db_params)
# Mel
sg.compute_mel_power_spectrogram(samples, params, mel_params)
sg.compute_mel_magnitude_spectrogram(samples, params, mel_params)
sg.compute_mel_db_spectrogram(samples, params, mel_params, db_params)
# ERB
sg.compute_erb_power_spectrogram(samples, params, erb_params)
sg.compute_erb_magnitude_spectrogram(samples, params, erb_params)
sg.compute_erb_db_spectrogram(samples, params, erb_params, db_params)
# LogHz
sg.compute_loghz_power_spectrogram(samples, params, loghz_params)
sg.compute_loghz_magnitude_spectrogram(samples, params, loghz_params)
sg.compute_loghz_db_spectrogram(samples, params, loghz_params, db_params)
# CQT
sg.compute_cqt_power_spectrogram(samples, params, cqt_params)
sg.compute_cqt_magnitude_spectrogram(samples, params, cqt_params)
sg.compute_cqt_db_spectrogram(samples, params, cqt_params, db_params)

Audio Feature Functions
# STFT
stft_matrix = sg.compute_stft(samples, stft_params) # complex128, (n_bins, n_frames)
# Inverse STFT
samples = sg.compute_istft(stft_matrix, n_fft, hop_size, window, center)
# MFCC
mfccs = sg.compute_mfcc(samples, stft_params, sample_rate, n_mels, mfcc_params)
# Chromagram
chroma = sg.compute_chromagram(samples, stft_params, sample_rate, chroma_params)

FFT Functions
# 1D FFT
spectrum = sg.compute_fft(samples, n_fft) # complex128, (n_fft//2 + 1,)
samples = sg.compute_irfft(spectrum, n_fft)
# Power/magnitude
power = sg.compute_power_spectrum(samples, n_fft, window)
magnitude = sg.compute_magnitude_spectrum(samples, n_fft, window)

2D FFT Functions
# 2D FFT
spectrum = sg.fft2d(image) # complex64, (nrows, ncols//2 + 1)
image = sg.ifft2d(spectrum, output_ncols)
# Power/magnitude
power = sg.power_spectrum_2d(image)
magnitude = sg.magnitude_spectrum_2d(image)
# Shift
shifted = sg.fftshift(array)
restored = sg.ifftshift(shifted)

Image Processing Functions
# Convolution
kernel = sg.gaussian_kernel_2d(size, sigma)
result = sg.convolve_fft(image, kernel)
# Filters
lowpass = sg.lowpass_filter(image, cutoff_fraction)
highpass = sg.highpass_filter(image, cutoff_fraction)
bandpass = sg.bandpass_filter(image, low_cutoff, high_cutoff)
# Edge detection and sharpening
edges = sg.detect_edges_fft(image)
sharpened = sg.sharpen_fft(image, amount)