ECE 512 – Project Report

Raakesh Madhanagopal

CSU ID: 831020471

ECE512 – Digital Signal Processing

Separating Music and Voice Signals from an Audio Signal

Introduction:

Music is one of the most glorious creations of humankind; it touches the soul and the lives of human beings, and its effect on us is one of the things that separates humans from other animals. Music has a tremendous effect on human life: it helps unite people of different backgrounds and cultural heritages by breaking boundaries. Songs play a crucial part in our everyday life. A song generally consists of two basic components: the vocals and the background music. The background music is a mixture of different musical instruments, while the vocals are usually the voice of the singer.

Separating the music and the vocals from a song lets us study its characteristics, pattern, pitch, rhythm, texture, and lyrics, which is useful for analysis, teaching, and composition. The music and vocal separator system described here works on the principle of repetition in music and on blind source separation, and does not depend on prior training, hand-crafted features, or complex frameworks like existing separation models.

Repetition is an important principle in a song: there is typically a repeating background and a non-repeating foreground vocal. The music and vocal separation system exploits this principle to separate the music and the vocals from a given audio signal. Blind source separation (BSS) is the process of recovering a set of source signals from a given set of mixed signals with little or no information about the sources or the mixing process. This blind source separation technique can be applied effectively to multidimensional data.

The music and vocal separation system involves extracting the repeating structure in the song. The period of the repeating structure is found first; the spectrogram is then segmented at the period boundaries and averaged to create a repeating segment model. Each individual time-frequency bin in a segment is then compared with the model, and binary time-frequency masking is used to partition the mixture by labelling the bins similar to the model as the repeating background.

In

this project, a simple music and vocal separator system has been built by using

the REpeating Pattern Extraction Technique. Unlike

existing techniques, this is only based on self-similarity, and can be

implemented on any audio signal as long as there is a repeating structure. This

method is simple, fast, blind, and is also completely automatable.

Theory:

In general, the music and vocal separation involves three major parts: finding the repeating period, building the repeating segment model, and binary time-frequency masking. The repeating period p is found by estimating the period of the repeating musical structure, using the autocorrelation function to estimate periodicities and its peaks to select the period. The repeating segment model is estimated by using the estimated period p of the repeating musical structure and evenly segmenting the spectrogram into segments of length p. The binary time-frequency mask M is calculated by taking the logarithm of each bin to obtain a modified spectrogram, and a tolerance is added to the binary time-frequency mask.

Repeating Period:

The repeating period p can be estimated by first identifying the repeating segments in a song and then calculating the periodicities using the autocorrelation function, which measures the similarity between a segment and a lagged version of itself over successive time intervals. The spectrogram V of the mixture is calculated by taking the Short-Time Fourier Transform (STFT) of the signal with a Hamming window of length N and keeping its magnitude; the symmetric part is discarded while the DC component is retained. Autocorrelation is then performed on each frequency channel of V² to obtain the autocorrelation matrix B. Squaring V also helps to enhance the peaks in the autocorrelation matrix.

If the given audio signal has more than one channel (a stereo mixture), V² is obtained by averaging over the channels. The vector b, which estimates the self-similarity of the song as a function of time lag, is obtained by taking the mean over the rows of the autocorrelation matrix, and is then normalized. Thus a similarity matrix is not calculated explicitly, and the method provides a precise visualization of the beat pattern. After the calculation of the beat spectrum b, its first coefficient measures the overall similarity present in the signal; if a repeating pattern is present, peaks appear in b according to the repeating pattern.

The period p is defined as the period of the longest strong repeating pattern, and it is identified by the peaks in b with the largest level that repeat at the longest period.
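The beat-spectrum computation described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the report's exact implementation: the autocorrelation of each frequency channel of V² is computed via the FFT, averaged over frequency to give b, and the period is picked as the strongest peak in a given lag range (a simplification of the peak-picking rule described in the text).

```python
import numpy as np

def beat_spectrum(V):
    """Beat spectrum b: mean over frequency of the autocorrelation
    of each row of the squared magnitude spectrogram V**2."""
    P = V ** 2                          # element-wise square enhances the peaks
    n = P.shape[1]
    # linear autocorrelation of each frequency row via zero-padded FFT
    F = np.fft.rfft(P, n=2 * n, axis=1)
    acf = np.fft.irfft(F * np.conj(F), axis=1)[:, :n]
    acf /= (n - np.arange(n))           # unbiased: divide by overlap length
    b = acf.mean(axis=0)                # average over frequency channels
    return b / b[0]                     # normalize so that b[0] = 1

def repeating_period(b, min_lag=1, max_lag=None):
    """Pick the lag of the strongest peak of b in [min_lag, max_lag)."""
    if max_lag is None:
        max_lag = len(b)
    return min_lag + int(np.argmax(b[min_lag:max_lag]))
```

For a spectrogram that repeats every p frames, b shows peaks at lags p, 2p, 3p, and so on, and the search range keeps the estimate at the fundamental repetition lag.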

Repeating Segment Model:

The repeating segment model can be estimated using the obtained period p. The spectrogram is divided into r equal segments of length p. It is assumed that the time-frequency bins containing the repeating pattern have similar values at each period p, close to those of the repeating segment model. Observation of the obtained peaks shows that the geometric mean provides better extraction of the repeating musical structure than the arithmetic mean. The repeating segment model V̄ is therefore calculated as the element-wise geometric mean over the r segments:

V̄(i, j) = ( Π_{k=1..r} V(i, j + (k − 1)p) )^(1/r),  for i = 1, …, n and j = 1, …, p.
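A minimal sketch of this step in Python with NumPy follows. It assumes, as one possible convention, that trailing frames which do not fill a whole segment are simply dropped; the small floor added before the logarithm is also an assumption, used only to avoid log(0) on silent bins.

```python
import numpy as np

def repeating_segment_model(V, p):
    """Element-wise geometric mean over the r full segments of length p.

    V : magnitude spectrogram, shape (n_freq, n_frames)
    p : repeating period in frames
    """
    n_freq, n_frames = V.shape
    r = n_frames // p                        # number of full segments
    segs = V[:, :r * p].reshape(n_freq, r, p)
    # geometric mean = exp(mean(log)); small floor avoids log(0)
    return np.exp(np.log(segs + 1e-12).mean(axis=1))
```

When the spectrogram repeats exactly with period p, the model reproduces one repetition of the repeating structure.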

Binary Time-Frequency Masking:

The third important part is the calculation of the binary time-frequency mask M. Each time-frequency bin of the spectrogram V is divided by the corresponding bin of the repeating segment model V̄ (repeated to cover the full length of the spectrogram), and the absolute value of the logarithm of each ratio is taken to obtain a modified spectrogram Ṽ. The spectrogram V can then be partitioned by assigning the time-frequency bins with values near 0 in Ṽ to the repeating background, based on the assumption that the repeating background structure and the varying foreground sound have sparse and disjoint time-frequency representations. In practice, the time-frequency bins of music and voice can overlap, and furthermore the repeating musical structure generally involves variations. Therefore, a tolerance t is added to the binary time-frequency mask M: bins of Ṽ with values at most t are labelled as repeating background. A tolerance of t = 1 was found to give good separation results for both the repeating background (music) and the non-repeating foreground (voice).

After the computation of the binary time-frequency mask M, it is symmetrized and applied to the Short-Time Fourier Transform X of the audio signal x to obtain the STFT of the music; the complementary mask 1 − M applied to X gives the STFT of the voice. The estimated music and voice signals are finally obtained by taking the inverse Short-Time Fourier Transform back into the time domain.
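The masking and separation steps above can be sketched as follows (Python with NumPy). The model W is assumed to be already tiled to the full spectrogram length, the constant eps is an assumption added for numerical stability, and the symmetrization and inverse STFT that would follow are omitted.

```python
import numpy as np

def binary_mask(V, W, t=1.0, eps=1e-12):
    """Binary time-frequency mask M.

    V : magnitude spectrogram of the mixture, shape (n_freq, n_frames)
    W : repeating segment model tiled to the same shape as V
    t : tolerance on the absolute log-ratio (t = 1 per the report)
    """
    V_tilde = np.abs(np.log((V + eps) / (W + eps)))  # modified spectrogram
    return (V_tilde <= t).astype(float)              # near 0 -> repeating background

def separate(X, M):
    """Apply the mask to the complex STFT X of the mixture:
    M keeps the repeating background (music), 1 - M keeps the voice."""
    return M * X, (1.0 - M) * X
```

Bins whose magnitude stays within a factor of e^t of the model are kept in the music STFT; the remaining bins go to the voice STFT, and each masked STFT would then be inverted back to the time domain.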

Results and Discussions:

Conclusion: