Molecular Generation
In materials science, the forward problem, i.e., given a material's structural information, predicting its properties (the structure-property relationship), is relatively straightforward thanks to well-established experiments, theory, and simulations. The inverse problem (mapping from property space back to material space), which arises in materials design when only materials with desired properties are sought, is considerably harder. Many generative models (GANs, etc.) can be employed for this goal; in this post, an RNN is applied to the molecular generation task.
The Problem and the Model
The objective of the RNN model is to generate a sequence that represents a polymer (specifically, its monomer unit) as a SMILES string [1]. Let a sequence be $S = (s_1, s_2, \dots, s_T)$; the probability of finding this sequence is:

$$P(S) = P(s_1) \prod_{t=2}^{T} P(s_t \mid s_{t-1}, \dots, s_1)$$
Unfortunately, little is known about this probability distribution analytically, so one can only resort to a large training database and learn it from data. The data seen by the model is therefore the model's "world".
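As a concrete illustration with an arbitrary toy fragment, for a short string such as CCO the factorization reads

$$P(\mathrm{CCO}) = P(\mathrm{C})\,P(\mathrm{C} \mid \mathrm{C})\,P(\mathrm{O} \mid \mathrm{CC}),$$

and training amounts to making each conditional $P(s_t \mid s_{t-1}, \dots, s_1)$ reproduce the character statistics of the training database.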
Here is the RNN molecular generation model:
(Figure: RNN molecular generation model.)
For simplicity, there are three layers in the model (a minimal code sketch follows the list):
- an embedding layer transforming input tokens into vectorized features
- an LSTM layer with hidden information
- a time-distributed dense (fully connected) output layer
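To make the architecture concrete, here is a minimal sketch of how these three layers could be assembled; the class name, the hyperparameter values, and the optimizer choice are illustrative assumptions rather than the exact ones used in this post.

import torch as tc
import torch.nn as nn

class GenerativeRNN(nn.Module):
    """Character-level SMILES generator: embedding -> LSTM -> dense output."""
    def __init__(self, char2int, embed_dim=64, hidden_dim=256, n_layers=2, device='cpu'):
        super().__init__()
        # token dictionaries, also used later in the generation step
        self.char2int = char2int
        self.int2char = {i: c for c, i in char2int.items()}
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.device = device
        # embedding layer: token index -> dense feature vector
        self.embedding = nn.Embedding(len(char2int), embed_dim)
        # LSTM layer carrying the hidden state across time steps
        self.lstm = nn.LSTM(embed_dim, hidden_dim, n_layers, batch_first=True)
        # dense layer applied at every time step, producing logits over the vocabulary
        self.fc = nn.Linear(hidden_dim, len(char2int))
        # optimizer used in the fit() step below (learning rate assumed)
        self.optimizer = tc.optim.Adam(self.parameters(), lr=1e-3)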
In the preprocessing step, all inputs (SMILES strings) are tokenized into integer vectors using a dictionary of all characters occurring in the strings, and the strings are padded to a common maximum length. The training process is essentially fitting a classifier that predicts the next character given the previous ones.
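As an illustration of this step, the tokenization and padding could look like the sketch below; the start ('<') and end ('>') tokens match the generation code further down, while the space padding character and the helper names are assumptions for this example.

# toy training set: monomer SMILES wrapped in start/end tokens
smiles_list = ['<CC(=O)O>', '<c1ccccc1>']

# dictionary of all characters in the strings, plus a padding character
chars = sorted(set(''.join(smiles_list))) + [' ']
char2int = {c: i for i, c in enumerate(chars)}
int2char = {i: c for c, i in char2int.items()}

max_len = max(len(s) for s in smiles_list)

def tokenize(smiles, max_len):
    """Pad a SMILES string and map each character to its integer index."""
    padded = smiles.ljust(max_len, ' ')
    return [char2int[c] for c in padded]

encoded = [tokenize(s, max_len) for s in smiles_list]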
The code snippets below are written in PyTorch, where tc stands for torch and F for torch.nn.functional. The forward pass of the model:
def forward(self, x, hidden):
    """Forward pass: character-level prediction model"""
    # embedding layer
    x1 = self.embedding(x)
    # LSTM layer
    x1, next_hidden = self.lstm(x1, hidden)
    # fully connected layer applied at every time step
    out = self.fc(x1)
    return out, next_hidden
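The init_hidden helper called in the training and generation steps below is not shown in the original snippet; a plausible implementation, reusing the attribute names from the constructor sketch above (n_layers, hidden_dim, device), is:

def init_hidden(self, batch_size, device=None):
    """Return zero-initialized hidden and cell states for the LSTM."""
    device = device if device is not None else self.device
    h0 = tc.zeros(self.n_layers, batch_size, self.hidden_dim, device=device)
    c0 = tc.zeros(self.n_layers, batch_size, self.hidden_dim, device=device)
    return (h0, c0)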
The training step:
def fit(self, dataloader, epochs, print_freq):
    """model training function"""
    losses = []
    for epoch in range(epochs):
        loss_avg = 0.0
        # loop over batches
        for step, (x_padded, y_padded, x_lens, y_lens) in enumerate(dataloader):
            # the input and the target
            inp = x_padded.to(self.device)
            target = y_padded.to(self.device)
            # hidden variables
            hidden = self.init_hidden(x_padded.shape[0])
            pred, hidden = self(inp, hidden)  # pred in [batch_size, seq_len, nb_tags]
            # sum up the loss over the samples in the batch
            batch_loss = 0.0
            for i in range(pred.size(0)):
                # ignore padded positions (token index 49) in the loss
                ce_loss = F.cross_entropy(pred[i], target[i], ignore_index=49)
                batch_loss += ce_loss
            loss_avg += batch_loss.item()  # accumulated loss over the epoch
            # back-propagation
            self.optimizer.zero_grad()
            batch_loss.backward()
            self.optimizer.step()
        losses.append(loss_avg / len(dataloader.dataset))  # average loss per sample
        if epoch % print_freq == 0:
            print("Epoch [{}/{}], Loss {:.4e}".format(epoch + 1, epochs, losses[-1]))
    return losses
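The dataloader passed to fit is expected to yield padded input/target batches together with the original sequence lengths. One possible way to build it, sketched here with a toy dataset and an assumed padding index of 49 (the value passed to ignore_index above), is a custom collate function:

import torch as tc
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_IDX = 49  # assumed index of the padding token, matching ignore_index in fit()

def collate_smiles(batch):
    """Pad a batch of (input, target) index sequences to a common length."""
    xs, ys = zip(*batch)
    x_lens = [len(x) for x in xs]
    y_lens = [len(y) for y in ys]
    x_padded = pad_sequence(list(xs), batch_first=True, padding_value=PAD_IDX)
    y_padded = pad_sequence(list(ys), batch_first=True, padding_value=PAD_IDX)
    return x_padded, y_padded, x_lens, y_lens

# toy dataset: each item is (input_indices, target_indices), the target shifted by one character
sequences = [tc.randint(0, PAD_IDX, (20,)), tc.randint(0, PAD_IDX, (15,))]
dataset = [(seq[:-1], seq[1:]) for seq in sequences]
loader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_smiles)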
After the training step, the generation step is:
def generation(self, tokens, max_seq_len=150):
    """generation step on CPU"""
    # start with the start token '<'
    start = tokens[0]
    newSMILES = ''.join(start)
    # prepare the input for generation
    inp = tc.tensor(self.char2int[start]).view(1, -1)
    # hidden variable
    hidden = self.init_hidden(1, device='cpu')
    for timestep in range(max_seq_len):
        # forward pass
        x1 = self.embedding(inp)
        x1, hidden = self.lstm(x1, hidden)
        output = self.fc(x1)
        # softmax turns the logits into a probability distribution
        probs = tc.softmax(output, dim=2).view(-1)
        # sample the next character index (multinomial sampling gives random generation)
        next_idx = tc.multinomial(probs, 1)[0].cpu().numpy()
        # look up the predicted character
        pred_char = self.int2char[int(next_idx)]
        # update the SMILES string
        newSMILES += pred_char
        # update the input
        inp = tc.tensor(self.char2int[pred_char]).view(1, -1)
        # stop at the end token '>'
        if pred_char == tokens[1]:
            break
    return newSMILES
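As a quick usage sketch (the model variable, the number of samples, and the RDKit validity check are illustrative additions, not part of the original code), the sampled strings can be screened for chemical validity:

from rdkit import Chem  # used here only to check whether a SMILES string parses

samples = [model.generation(['<', '>']) for _ in range(100)]
valid = []
for s in samples:
    core = s.strip('<>')                      # drop the start/end tokens
    if Chem.MolFromSmiles(core) is not None:  # None means the string is not valid SMILES
        valid.append(core)
print("{}/{} generated SMILES are chemically valid".format(len(valid), len(samples)))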
Note that this example only covers molecular generation; no materials property is involved in the training.
References
[1] SMILES: Simplified molecular-input line-entry system [Wiki].