Getting-started tutorial No. 2: Isolated methanol

In this tutorial, we give an comprehensive tutorial to prepare training data and train models, taking isolated methanol as an example.

Requirements

To perform the tutorial, you also need to insall CPMD package for DFT/MD calculations. See https://github.com/CPMD-code.

Tutorial data

The reference files of this tutorial are given in examples/tutorial/2_gasmethanol/ directory.

$cd examples/tutorial/2_gasmethanol
$tree
.
├── cpmd
│   ├── pseudo
│   └── tmp
├── make_xyz
│   ├── csv2xyz.py
│   └── methanol.csv
├── pred
├── train

Prepare training data

The acquisition of the training data requires two steps: generation of the structure and calculation of the Wannier centers. In the case of liquid structures, the acquisition of the structure is done with classical MD and only the Wannier center calculation is done with DFT due to computational cost considerations, while in the case of gaseous structures, the structure acquisition and Wannier center calculation can be done simultaneously with ab initio MD due to the lower computational cost. We will test the sequence using the CPMD package.

Convert smiles to xyz

Let us go to the make_xyz directory. To begin with, we build an initial structure for CPMD. Let us prepare a csv file containing smiles.

$cat methanol.csv
Smiles,Name,density
CO,METHANOL,0.791

The following simple python code csv2xyz.py will generate methaol.xyz (and methanol.mol for later usage).

csv2xyz.py

import shutil
import os
import pandas as pd
import rdkit.Chem
import rdkit.Chem.AllChem
import ase
import ase.io
# read csv
input_file:str = "methanol.csv" # read csv
poly = pd.read_csv(input_file)
print(" --------- ")
print(poly)
print(" --------- ")
# read smiles
smiles:str = poly["Smiles"].to_list()[0]
molname:str = poly["Name"].to_list()[0]
# build molecule
mol=rdkit.Chem.MolFromSmiles(smiles)
print(mol)
molH = rdkit.Chem.AddHs(mol) # add hydrogen
print(molH)
print(rdkit.Chem.MolToMolBlock(molH, includeStereo=False))

rdkit.Chem.AllChem.EmbedMolecule(molH, rdkit.Chem.AllChem.ETKDGv3())
print(molH)
print(rdkit.Chem.MolToMolBlock(molH, includeStereo=False))
rdkit.Chem.MolToMolFile(molH,'methanol.mol')
rdkit.Chem.MolToXYZFile(molH,'methanol.xyz')
# xyz = rdkit.Chem.MolToXYZBlock(molH)
# load xyz via ase to add cell parameters
data = ase.io.read("methanol.xyz")
data.set_cell([20,20,20])
ase.io.write("methanol.xyz",data)

It is important to add the cell parameters to xyz. Here we adopt L=20Ang. The generated xyz file can be visualized using various tools including nglview in python and VESTA.

Prepare input for CPMD

Let us go to the cpmd/ directory. CPmake.py will yield input files for CPMD from methanol.xyz as follows.

$CPmake.py cpmd workflow --i methanol.xyz -n 40000 -t 10
*****************************************************************
                        CPmake.py
                    Version. 0.0.1
*****************************************************************

---------
input geometry file ::  methanol.xyz
output georelax calculation        :: georelax.inp
output bomdrelax calculation       :: bomdrelax.inp
output bomd restart+wf calculation :: bomd-wan-restart.inp
output bomd restart+wf accumulator calculation :: bomd-wan-restart2.inp
# of steps for restart      ::  40000
timestep [a.u.] for restart ::  10
atomic arrangement type     ::  default

-n and -t specify the number of steps and the time step (in a.u.) for MD, respectively. Therefore, we will run 400,000 [a.u.] ~ 9.7 [ps] calculation.

Four input files are for 1: geometry optimization, 2: initial relaxation, and 3&4: production run.

Note

Generated inputs are just samples. You should tune parameters for serious calculations.

We slightly modify the inputs for later convenience. The line DIPOLE DYNAMICS WANNIER SAMPLE decides how often the structure will be calculated. Set it to 100 to reduce computational cost.

DIPOLE DYNAMICS WANNIER SAMPLE
100

We create tmp/ and pseudo/ directories to stock outputs and pseudo potentials, respectively. You also have to prepare C_MT_GIA_BLYP, O_MT_GIA_BLYP, and H_MT_BLYP.psp from CPMD pseudo potential directories and store them in pseudo/ directory.

Run CPMD

We execute three runs: geometry optimization, initial relaxation, and production Wannier run. They will take a few hours depending on your machine. We strongly recommend you to use supercomputers. Please be patient.

mpirun cpmd.x georelax.inp >> georelax.out
mpirun cpmd.x bomd-relax.inp >> bomd-relax.out
mpirun cpmd.x bomd-wan-restart.inp >> bomd-wan-restart.out

After the calculation, you will see IONS+CENTERS.xyz in the tmp/ directory, which contains atomic and WC coordinates.

Postprocess CPMD data

IONS+CENTERS.xyz does not include the lattice information, which we need to add manually. We can use CPextract.py to do this.

$CPextract.py extract -i IONS+CENTERS.xyz -s bomd-wan-restart.out IONS+CENTERS_cell.xyz

-s specifies the stdout file of the CPMD calculation. The output file IONS+CENTERS_cell.xyz is extended xyz format, and can be processed by ase package.

Train models

Let us go to the train/ directory.

Train models

The previously prepared IONS+CENTERS_cell.xyz and methanol.mol are used for training ML models. As methanol has CH, CO, OH bonds and O lone pair, we have to train four independent ML models. The input file for CPtrain.py is given in yaml format. The input file for the CH bond is as follows.

input.yaml

model:
modelname: model_ch  # specify name
nfeature:  48        # length of descriptor
M:         20        # M  (embedding matrix size)
Mb:        6         # Mb (embedding matrix size, smaller than M)

learning_rate:
type: fix

loss:
type: mse        # mean square error

data:
type: xyz
file:
    - "../cpmd/IONS+CENTERS+cell_sorted_merge.xyz"
itp_file: methanol.mol
bond_type: CH # CH, CO, OH, O

traininig:
device:     cpu # Torch device (cpu/mps/cuda)
batch_size: 32  # batch size for training
validation_vatch_size: 32 # batch size for validation
max_epochs: 50
learnint_rate: 1e-2 # starting learning rate
n_train:   9000    # the number of training data
n_val:     1000    # the number of validation data
modeldir:  model_ch # directory to save models
restart:   False    # If restart training

For gas systems, we can reduce the model size without losing accuracy. We chose nfeature=48 so that all the atoms are included in the descriptors.

Then, you can train the CH bond model

$CPtrain.py train -i input.yaml

After the training, RMSE should be about 0.001[D] to 0.01[D] for isolated systems.

Next, you can change modelname, bond_type, and modeldir to corresponding bonds, and re-run CPtrain.py to train other 4 models.

Test a model

We can check the quality of the trained model as follows.

$CPtrain.py test -m chmodel_test/model_ch_python.pt -x ../cpmd/IONS+CENTERS_cell.xyz -i ../make_xyz/methanol.mol

Calculate molecular dipole moment

Let us go to the pred/ directory. Finally, we will calculate the average molecular dipole moment of methanol. The experimental value is 1.62[D]. For this purpose, we invoke C++ interface with the following input. The calculation of molecular dipole moments is done without specifying any flag.

config.yaml

general:
    itpfilename: methanol.acpype/input_GMX.mol
    bondfilename: methanol.mol
    savedir: pred_dipole/
    temperature: 300
    timestep: 0.484
descriptor:
    calc: 1
    directory: ./
    xyzfilename: IONS+CENTERS+cell_sorted_merge.xyz
    savedir: pred_dipole/
    descmode: 2
    desctype: allinone
    haswannier: 1
    interval: 1
    desc_coh: 0
predict:
    calc: 1
    desc_dir: dipole_10ps/
    model_dir: model_rotate_methanol/
    modelmode: rotate
    bondspecies: 4
    save_truey: 0

We perform the calculation

dieltools.x

The corresponding output file is DIELCONST, which contains the mean molecular dipole moment, and molecule_dipole.txt, which involve all the molecular dipole moments along the MD trajectory. We can see the mean absolute dipole moment as

$cat DIELCONST

and we confirmed that the simulated value well agrees with the experimental one.