Getting-started tutorial No. 2: Isolated methanol
In this tutorial, we give an comprehensive tutorial to prepare training data and train models, taking isolated methanol as an example.
Requirements
To perform the tutorial, you also need to insall CPMD
package for DFT/MD calculations. See https://github.com/CPMD-code.
Tutorial data
The reference files of this tutorial are given in examples/tutorial/2_gasmethanol/
directory.
$cd examples/tutorial/2_gasmethanol
$tree
.
├── cpmd
│ ├── pseudo
│ └── tmp
├── make_xyz
│ ├── csv2xyz.py
│ └── methanol.csv
├── pred
├── train
Prepare training data
The acquisition of the training data requires two steps: generation of the structure and calculation of the Wannier centers. In the case of liquid structures, the acquisition of the structure is done with classical MD and only the Wannier center calculation is done with DFT due to computational cost considerations, while in the case of gaseous structures, the structure acquisition and Wannier center calculation can be done simultaneously with ab initio MD due to the lower computational cost. We will test the sequence using the CPMD package.
Convert smiles to xyz
Let us go to the make_xyz
directory. To begin with, we build an initial structure for CPMD. Let us prepare a csv file containing smiles.
$cat methanol.csv
Smiles,Name,density
CO,METHANOL,0.791
The following simple python code csv2xyz.py
will generate methaol.xyz
(and methanol.mol
for later usage).
import shutil
import os
import pandas as pd
import rdkit.Chem
import rdkit.Chem.AllChem
import ase
import ase.io
# read csv
input_file:str = "methanol.csv" # read csv
poly = pd.read_csv(input_file)
print(" --------- ")
print(poly)
print(" --------- ")
# read smiles
smiles:str = poly["Smiles"].to_list()[0]
molname:str = poly["Name"].to_list()[0]
# build molecule
mol=rdkit.Chem.MolFromSmiles(smiles)
print(mol)
molH = rdkit.Chem.AddHs(mol) # add hydrogen
print(molH)
print(rdkit.Chem.MolToMolBlock(molH, includeStereo=False))
rdkit.Chem.AllChem.EmbedMolecule(molH, rdkit.Chem.AllChem.ETKDGv3())
print(molH)
print(rdkit.Chem.MolToMolBlock(molH, includeStereo=False))
rdkit.Chem.MolToMolFile(molH,'methanol.mol')
rdkit.Chem.MolToXYZFile(molH,'methanol.xyz')
# xyz = rdkit.Chem.MolToXYZBlock(molH)
# load xyz via ase to add cell parameters
data = ase.io.read("methanol.xyz")
data.set_cell([20,20,20])
ase.io.write("methanol.xyz",data)
It is important to add the cell parameters to xyz
. Here we adopt L=20Ang
.
The generated xyz
file can be visualized using various tools including nglview
in python and VESTA
.
Prepare input for CPMD
Let us go to the cpmd/
directory. CPmake.py
will yield input files for CPMD
from methanol.xyz
as follows.
$CPmake.py cpmd workflow --i methanol.xyz -n 40000 -t 10
*****************************************************************
CPmake.py
Version. 0.0.1
*****************************************************************
---------
input geometry file :: methanol.xyz
output georelax calculation :: georelax.inp
output bomdrelax calculation :: bomdrelax.inp
output bomd restart+wf calculation :: bomd-wan-restart.inp
output bomd restart+wf accumulator calculation :: bomd-wan-restart2.inp
# of steps for restart :: 40000
timestep [a.u.] for restart :: 10
atomic arrangement type :: default
-n
and -t
specify the number of steps and the time step (in a.u.) for MD, respectively. Therefore, we will run 400,000 [a.u.] ~ 9.7 [ps] calculation.
Four input files are for 1: geometry optimization, 2: initial relaxation, and 3&4: production run.
Note
Generated inputs are just samples. You should tune parameters for serious calculations.
We slightly modify the inputs for later convenience. The line DIPOLE DYNAMICS WANNIER SAMPLE
decides how often the structure will be calculated. Set it to 100
to reduce computational cost.
DIPOLE DYNAMICS WANNIER SAMPLE
100
We create tmp/
and pseudo/
directories to stock outputs and pseudo potentials, respectively. You also have to prepare C_MT_GIA_BLYP
, O_MT_GIA_BLYP
, and H_MT_BLYP.psp
from CPMD pseudo potential directories and store them in pseudo/
directory.
Run CPMD
We execute three runs: geometry optimization, initial relaxation, and production Wannier run. They will take a few hours depending on your machine. We strongly recommend you to use supercomputers. Please be patient.
mpirun cpmd.x georelax.inp >> georelax.out
mpirun cpmd.x bomd-relax.inp >> bomd-relax.out
mpirun cpmd.x bomd-wan-restart.inp >> bomd-wan-restart.out
After the calculation, you will see IONS+CENTERS.xyz
in the tmp/
directory, which contains atomic and WC coordinates.
Postprocess CPMD data
IONS+CENTERS.xyz
does not include the lattice information, which we need to add manually. We can use CPextract.py
to do this.
$CPextract.py extract -i IONS+CENTERS.xyz -s bomd-wan-restart.out IONS+CENTERS_cell.xyz
-s
specifies the stdout file of the CPMD calculation. The output file IONS+CENTERS_cell.xyz
is extended xyz
format, and can be processed by ase
package.
Train models
Let us go to the train/
directory.
Train models
The previously prepared IONS+CENTERS_cell.xyz
and methanol.mol
are used for training ML models. As methanol has CH
, CO
, OH
bonds and O
lone pair, we have to train four independent ML models. The input file for CPtrain.py
is given in yaml
format.
The input file for the CH bond is as follows.
model:
modelname: model_ch # specify name
nfeature: 48 # length of descriptor
M: 20 # M (embedding matrix size)
Mb: 6 # Mb (embedding matrix size, smaller than M)
learning_rate:
type: fix
loss:
type: mse # mean square error
data:
type: xyz
file:
- "../cpmd/IONS+CENTERS+cell_sorted_merge.xyz"
itp_file: methanol.mol
bond_type: CH # CH, CO, OH, O
traininig:
device: cpu # Torch device (cpu/mps/cuda)
batch_size: 32 # batch size for training
validation_vatch_size: 32 # batch size for validation
max_epochs: 50
learnint_rate: 1e-2 # starting learning rate
n_train: 9000 # the number of training data
n_val: 1000 # the number of validation data
modeldir: model_ch # directory to save models
restart: False # If restart training
For gas systems, we can reduce the model size without losing accuracy. We chose nfeature=48
so that all the atoms are included in the descriptors.
Then, you can train the CH bond model
$CPtrain.py train -i input.yaml
After the training, RMSE should be about 0.001[D]
to 0.01[D]
for isolated systems.
Next, you can change modelname
, bond_type
, and modeldir
to corresponding bonds, and re-run CPtrain.py
to train other 4 models.
Test a model
We can check the quality of the trained model as follows.
$CPtrain.py test -m chmodel_test/model_ch_python.pt -x ../cpmd/IONS+CENTERS_cell.xyz -i ../make_xyz/methanol.mol
Calculate molecular dipole moment
Let us go to the pred/
directory. Finally, we will calculate the average molecular dipole moment of methanol. The experimental value is 1.62[D]
.
For this purpose, we invoke C++ interface with the following input. The calculation of molecular dipole moments is done without specifying any flag.
general:
itpfilename: methanol.acpype/input_GMX.mol
bondfilename: methanol.mol
savedir: pred_dipole/
temperature: 300
timestep: 0.484
descriptor:
calc: 1
directory: ./
xyzfilename: IONS+CENTERS+cell_sorted_merge.xyz
savedir: pred_dipole/
descmode: 2
desctype: allinone
haswannier: 1
interval: 1
desc_coh: 0
predict:
calc: 1
desc_dir: dipole_10ps/
model_dir: model_rotate_methanol/
modelmode: rotate
bondspecies: 4
save_truey: 0
We perform the calculation
dieltools.x
The corresponding output file is DIELCONST
, which contains the mean molecular dipole moment, and molecule_dipole.txt
, which involve all the molecular dipole moments along the MD trajectory.
We can see the mean absolute dipole moment as
$cat DIELCONST
and we confirmed that the simulated value well agrees with the experimental one.