Using TensorFlow to make a robot sing a song to you



Author | aliceyangxi

http://blog.csdn.net/aliceyangxi1987/article/details/70790405


Today, let's take a look at how AI composes music.


This article will use TensorFlow to write a music generator.


What happens when you say to a robot: I want a song that expresses hope and wonder?

The computer first converts your speech into text, extracts the keywords, and turns them into word vectors.


Next, we use a dataset of music that has already been tagged, where the tags are various human emotions. We train a model on this data; once trained, it can generate music that matches the requested keywords.


The final output of the program is a set of chords; it picks the chords that best match the emotional keywords you asked for.


Of course, you can do more than just listen to the result: you can also use it as a reference for your own compositions, making it easy to create music even if you haven't put in 10,000 hours of deliberate practice.


Machine learning is really about extending our minds and expanding our capabilities.


DeepMind published a paper called WaveNet, which introduced a model for generating raw audio, covering both music generation and text-to-speech.


Traditionally, speech synthesis models have been concatenative. This means that to generate speech from text, we need a very large database of speech fragments; pieces of them are cut out and reassembled to form complete sentences.



The same goes for generating music, but with one big difficulty: when you splice together static fragments, the resulting sound has to be natural and expressive, and that is very hard to achieve.


Ideally, we would store all the information needed to generate audio directly in the parameters of the model. That is exactly what the WaveNet paper proposes.



The output does not need to be passed through a signal-processing algorithm to obtain speech; the model works directly on the raw waveform of the audio signal.


The model they use is a convolutional neural network (CNN) built from dilated convolutions, where the dilation factor grows exponentially from one hidden layer to the next. The sample generated at each step is fed back into the network and used to generate the next one.
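
As a rough illustration of the idea (not the authors' implementation, and assuming a newer TensorFlow where the tf.keras layers are available), a stack of dilated causal 1-D convolutions with exponentially growing dilation factors can be sketched like this:

import tensorflow as tf

def dilated_conv_stack(inputs, filters=32, layers=6):
    # Each layer doubles the dilation factor (1, 2, 4, ...), so the receptive
    # field grows exponentially with depth while the number of layers stays small.
    x = inputs
    for i in range(layers):
        x = tf.keras.layers.Conv1D(filters=filters,
                                   kernel_size=2,
                                   padding='causal',      # only look at past samples
                                   dilation_rate=2 ** i,
                                   activation='relu')(x)
    return x

# example: a batch of raw audio with shape (batch, time, channels)
audio = tf.keras.layers.Input(shape=(16000, 1))
features = dilated_conv_stack(audio)

WaveNet itself adds gated activations, residual and skip connections, and a softmax over quantized amplitudes on top of this basic pattern.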



Let's look at a diagram of the model. The input enters as a raw sound wave, which first has to be preprocessed to make the following operations easier.



We then encode it to produce a tensor with sample and channel dimensions.


This is fed into the first layer of the CNN, which simplifies the number of channels for easier processing.


The outputs are then combined and their dimensionality increased, bringing it back up to the original number of channels.


This result is fed into a loss function, which measures how well our model is training.


Finally, the result is fed back into the network to generate the audio data needed for the next point in time.

Repeating this process generates more and more speech.


This network is huge: running on their GPU cluster, it takes about ninety minutes to generate just one second of audio.


Next, we will implement a simpler audio generator in TensorFlow.


1. Import the packages:


NumPy is for numerical computing, pandas for data analysis, and tqdm draws a progress bar that shows how training is going.


 
 
import numpy as np
import pandas as pd
import msgpack
import glob
import tensorflow as tf
from tensorflow.python.ops import control_flow_ops
from tqdm import tqdm
import midi_manipulation


We will use a Restricted Boltzmann Machine (RBM), a type of neural network, as our generative model.


It is a two-layer network: the first layer is the visible layer and the second is the hidden layer. There are no connections between nodes in the same layer; nodes in different layers are connected to each other. Each node decides whether to pass the data it receives on to the next layer, and that decision is made stochastically.
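
As a tiny, self-contained illustration (not from the original article) of that random decision, a single hidden node turns on with a probability given by a sigmoid of its weighted input:

import numpy as np

def hidden_node_fires(v, w, b):
    # standard RBM conditional: p(h = 1 | v) = sigmoid(v . w + b)
    p = 1.0 / (1.0 + np.exp(-(np.dot(v, w) + b)))
    # the node then "decides" stochastically: it fires with probability p
    return np.random.rand() < p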


2. Define hyperparameters:


First, define the range of notes the model needs to generate.



 
 
lowest_note = midi_manipulation.lowerBound    # the index of the lowest note on the piano roll
highest_note = midi_manipulation.upperBound   # the index of the highest note on the piano roll
note_range = highest_note - lowest_note       # the note range


Next, define the number of timesteps and the sizes of the visible and hidden layers.


 
 
num_timesteps = 15                            # the number of timesteps we create at a time
n_visible = 2 * note_range * num_timesteps    # the size of the visible layer
n_hidden = 50                                 # the size of the hidden layer


Then set the number of training epochs, the batch size, and the learning rate.


 
 
num_epochs = 200                          # the number of training epochs; each epoch goes through the entire data set
batch_size = 100                          # the number of training examples sent through the RBM at a time
lr = tf.constant(0.005, tf.float32)       # the learning rate of our model


3. Define variables:


x is the placeholder that holds the data we feed into the network.

W stores the weight matrix, that is, the connection strengths between the two layers.

In addition, we need two bias vectors: bh for the hidden layer and bv for the visible layer.


 
 
x  = tf.placeholder(tf.float32, [None, n_visible], name="x")               # placeholder that holds our data
W  = tf.Variable(tf.random_normal([n_visible, n_hidden], 0.01), name="W")  # weight matrix storing the edge weights
bh = tf.Variable(tf.zeros([1, n_hidden], tf.float32), name="bh")            # bias vector for the hidden layer
bv = tf.Variable(tf.zeros([1, n_visible], tf.float32), name="bv")           # bias vector for the visible layer


Next, use the helper method gibbs_sample to build samples from the input data x, as well as samples for the hidden layer:


gibbs_sample is based on Gibbs sampling, an algorithm for drawing samples from a joint probability distribution.

It builds a chain in which each state depends on the previous one, and the random samples it produces converge to the target distribution.



 
 
# the sample of x
x_sample = gibbs_sample(1)
# the sample of the hidden nodes, starting from the visible state of x
h = sample(tf.sigmoid(tf.matmul(x, W) + bh))
# the sample of the hidden nodes, starting from the visible state of x_sample
h_sample = sample(tf.sigmoid(tf.matmul(x_sample, W) + bh))

4. Update variables:


 
 
size_bt = tf.cast(tf.shape(x)[0], tf.float32)
W_adder  = tf.mul(lr/size_bt, tf.sub(tf.matmul(tf.transpose(x), h), tf.matmul(tf.transpose(x_sample), h_sample)))
bv_adder = tf.mul(lr/size_bt, tf.reduce_sum(tf.sub(x, x_sample), 0, True))
bh_adder = tf.mul(lr/size_bt, tf.reduce_sum(tf.sub(h, h_sample), 0, True))
# when we run sess.run(updt), TensorFlow will execute all three update steps
updt = [W.assign_add(W_adder), bv.assign_add(bv_adder), bh.assign_add(bh_adder)]
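
These three lines implement one-step contrastive divergence (CD-1). For intuition only, the same update written in plain NumPy (using the np imported earlier; this is not part of the TensorFlow graph) would be:

def cd1_update(x, x_sample, h, h_sample, lr, batch_size):
    # positive phase uses the data statistics (x, h); negative phase uses the Gibbs samples
    dW  = lr / batch_size * (x.T.dot(h) - x_sample.T.dot(h_sample))
    dbv = lr / batch_size * np.sum(x - x_sample, axis=0, keepdims=True)
    dbh = lr / batch_size * np.sum(h - h_sample, axis=0, keepdims=True)
    return dW, dbv, dbh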


5. Next, run the computation graph:


1. First, initialize the variables


 
 
with tf.Session() as sess:
    # first, we train the model
    # initialize the variables of the model
    init = tf.initialize_all_variables()
    sess.run(init)
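
Note that the training loop below iterates over a songs list that the article never shows being built. A minimal sketch of how it could be loaded from a folder of MIDI files is given here; it assumes midi_manipulation also provides a midiToNoteStateMatrix helper, uses a placeholder folder name, and would run before the training loop (for example, before the session is opened):

def get_songs(path):
    # load every MIDI file in `path` as a note-state matrix of shape (timesteps, 2*note_range)
    files = glob.glob('{}/*.mid*'.format(path))
    songs = []
    for f in tqdm(files):
        try:
            song = np.array(midi_manipulation.midiToNoteStateMatrix(f))
            if song.shape[0] > 50:  # skip songs that are too short to slice into timesteps
                songs.append(song)
        except Exception as e:
            print(f, e)
    return songs

songs = get_songs('Pop_Music_Midi')  # the folder name is a placeholder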


First, we reshape each song so that its vector representation is in a form the model can train on.


 
 
    for epoch in tqdm(range(num_epochs)):
        for song in songs:
            # the songs are stored in time x notes format; each song has size timesteps_in_song x 2*note_range
            # here we reshape each song so that every training example is a vector with num_timesteps x 2*note_range elements
            song = np.array(song)
            song = song[:int(np.floor(song.shape[0] / num_timesteps)) * num_timesteps]
            song = np.reshape(song, [song.shape[0] // num_timesteps, song.shape[1] * num_timesteps])
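
As a concrete (hypothetical) example of what this reshape does, suppose a song is 312 timesteps long and the note range is 78:

song = np.zeros((312, 2 * 78))  # hypothetical song: 312 timesteps x 156 note columns
trimmed = song[:int(np.floor(song.shape[0] / num_timesteps)) * num_timesteps]   # shape (300, 156)
batched = np.reshape(trimmed, [trimmed.shape[0] // num_timesteps,
                               trimmed.shape[1] * num_timesteps])               # shape (20, 2340)
# each of the 20 rows is one training example of length 2 * 78 * 15 = 2340, i.e. n_visible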


2. Next, let's train the RBM, one batch of examples at a time


 
 
            for i in range(1, len(song), batch_size):
                tr_x = song[i:i + batch_size]
                sess.run(updt, feed_dict={x: tr_x})


After the model is fully trained, it can be used to generate music.


3. Run the Gibbs chain to generate music


First, the visible nodes are initialized to zero and we generate some samples.

Then each sample vector is reshaped into a format that is better suited for playback.


 
 
    sample = gibbs_sample(1).eval(session=sess, feed_dict={x: np.zeros((10, n_visible))})
    for i in range(sample.shape[0]):
        if not any(sample[i, :]):
            continue
        # here we reshape the vector to be time x notes, and then save the vector as a MIDI file
        S = np.reshape(sample[i, :], (num_timesteps, 2 * note_range))


4. Finally, save the generated chords as MIDI files


 
 
        midi_manipulation.noteStateMatrixToMidi(S, "generated_chord_{}".format(i))


In summary: a CNN (as in WaveNet) can generate sound waves parametrically, an RBM can easily generate audio samples that resemble its training data, and Gibbs sampling lets us draw the samples we need from the model's probability distribution.

 


—  End —

