#################################################### Getting-started tutorial No. 2: Isolated methanol #################################################### In this tutorial, we give an comprehensive tutorial to prepare training data and train models, taking isolated methanol as an example. ************************************* Requirements ************************************* To perform the tutorial, you also need to insall ``CPMD`` package for DFT/MD calculations. See https://github.com/CPMD-code. ************************************* Tutorial data ************************************* The reference files of this tutorial are given in ``examples/tutorial/2_gasmethanol/`` directory. .. code-block:: bash $cd examples/tutorial/2_gasmethanol $tree . ├── cpmd │   ├── pseudo │   └── tmp ├── make_xyz │   ├── csv2xyz.py │   └── methanol.csv ├── pred ├── train ************************************* Prepare training data ************************************* The acquisition of the training data requires two steps: generation of the structure and calculation of the Wannier centers. In the case of liquid structures, the acquisition of the structure is done with classical MD and only the Wannier center calculation is done with DFT due to computational cost considerations, while in the case of gaseous structures, the structure acquisition and Wannier center calculation can be done simultaneously with ab initio MD due to the lower computational cost. We will test the sequence using the CPMD package. Convert smiles to xyz ===================================== Let us go to the ``make_xyz`` directory. To begin with, we build an initial structure for CPMD. Let us prepare a csv file containing smiles. .. code-block:: bash $cat methanol.csv Smiles,Name,density CO,METHANOL,0.791 The following simple python code ``csv2xyz.py`` will generate ``methaol.xyz`` (and ``methanol.mol`` for later usage). .. code-block:: python :caption: csv2xyz.py import shutil import os import pandas as pd import rdkit.Chem import rdkit.Chem.AllChem import ase import ase.io # read csv input_file:str = "methanol.csv" # read csv poly = pd.read_csv(input_file) print(" --------- ") print(poly) print(" --------- ") # read smiles smiles:str = poly["Smiles"].to_list()[0] molname:str = poly["Name"].to_list()[0] # build molecule mol=rdkit.Chem.MolFromSmiles(smiles) print(mol) molH = rdkit.Chem.AddHs(mol) # add hydrogen print(molH) print(rdkit.Chem.MolToMolBlock(molH, includeStereo=False)) rdkit.Chem.AllChem.EmbedMolecule(molH, rdkit.Chem.AllChem.ETKDGv3()) print(molH) print(rdkit.Chem.MolToMolBlock(molH, includeStereo=False)) rdkit.Chem.MolToMolFile(molH,'methanol.mol') rdkit.Chem.MolToXYZFile(molH,'methanol.xyz') # xyz = rdkit.Chem.MolToXYZBlock(molH) # load xyz via ase to add cell parameters data = ase.io.read("methanol.xyz") data.set_cell([20,20,20]) ase.io.write("methanol.xyz",data) It is important to add the cell parameters to ``xyz``. Here we adopt ``L=20Ang``. The generated ``xyz`` file can be visualized using various tools including ``nglview`` in python and ``VESTA``. .. image:: ../image/methanol.png :width: 400 :align: center Prepare input for CPMD ===================================== Let us go to the ``cpmd/`` directory. ``CPmake.py`` will yield input files for ``CPMD`` from ``methanol.xyz`` as follows. .. code-block:: bash $CPmake.py cpmd workflow --i methanol.xyz -n 40000 -t 10 ***************************************************************** CPmake.py Version. 0.0.1 ***************************************************************** --------- input geometry file :: methanol.xyz output georelax calculation :: georelax.inp output bomdrelax calculation :: bomdrelax.inp output bomd restart+wf calculation :: bomd-wan-restart.inp output bomd restart+wf accumulator calculation :: bomd-wan-restart2.inp # of steps for restart :: 40000 timestep [a.u.] for restart :: 10 atomic arrangement type :: default ``-n`` and ``-t`` specify the number of steps and the time step (in a.u.) for MD, respectively. Therefore, we will run 400,000 [a.u.] ~ 9.7 [ps] calculation. Four input files are for 1: geometry optimization, 2: initial relaxation, and 3&4: production run. .. note:: Generated inputs are just samples. You should tune parameters for serious calculations. We slightly modify the inputs for later convenience. The line ``DIPOLE DYNAMICS WANNIER SAMPLE`` decides how often the structure will be calculated. Set it to ``100`` to reduce computational cost. .. code-block:: bash DIPOLE DYNAMICS WANNIER SAMPLE 100 We create ``tmp/`` and ``pseudo/`` directories to stock outputs and pseudo potentials, respectively. You also have to prepare ``C_MT_GIA_BLYP``, ``O_MT_GIA_BLYP``, and ``H_MT_BLYP.psp`` from CPMD pseudo potential directories and store them in ``pseudo/`` directory. Run CPMD ===================================== We execute three runs: geometry optimization, initial relaxation, and production Wannier run. They will take a few hours depending on your machine. We strongly recommend you to use supercomputers. Please be patient. .. code-block:: bash mpirun cpmd.x georelax.inp >> georelax.out mpirun cpmd.x bomd-relax.inp >> bomd-relax.out mpirun cpmd.x bomd-wan-restart.inp >> bomd-wan-restart.out After the calculation, you will see ``IONS+CENTERS.xyz`` in the ``tmp/`` directory, which contains atomic and WC coordinates. Postprocess CPMD data ===================================== ``IONS+CENTERS.xyz`` does not include the lattice information, which we need to add manually. We can use ``CPextract.py`` to do this. .. code-block:: bash $CPextract.py extract -i IONS+CENTERS.xyz -s bomd-wan-restart.out IONS+CENTERS_cell.xyz ``-s`` specifies the stdout file of the CPMD calculation. The output file ``IONS+CENTERS_cell.xyz`` is ``extended xyz`` format, and can be processed by ``ase`` package. ************************************* Train models ************************************* Let us go to the ``train/`` directory. Train models ================== The previously prepared ``IONS+CENTERS_cell.xyz`` and ``methanol.mol`` are used for training ML models. As methanol has ``CH``, ``CO``, ``OH`` bonds and ``O`` lone pair, we have to train four independent ML models. The input file for ``CPtrain.py`` is given in ``yaml`` format. The input file for the CH bond is as follows. .. code-block:: yaml :caption: input.yaml model: modelname: model_ch # specify name nfeature: 48 # length of descriptor M: 20 # M (embedding matrix size) Mb: 6 # Mb (embedding matrix size, smaller than M) learning_rate: type: fix loss: type: mse # mean square error data: type: xyz file: - "../cpmd/IONS+CENTERS+cell_sorted_merge.xyz" itp_file: methanol.mol bond_type: CH # CH, CO, OH, O traininig: device: cpu # Torch device (cpu/mps/cuda) batch_size: 32 # batch size for training validation_vatch_size: 32 # batch size for validation max_epochs: 50 learnint_rate: 1e-2 # starting learning rate n_train: 9000 # the number of training data n_val: 1000 # the number of validation data modeldir: model_ch # directory to save models restart: False # If restart training For gas systems, we can reduce the model size without losing accuracy. We chose ``nfeature=48`` so that all the atoms are included in the descriptors. Then, you can train the CH bond model .. code-block:: bash $CPtrain.py train -i input.yaml After the training, RMSE should be about ``0.001[D]`` to ``0.01[D]`` for isolated systems. Next, you can change ``modelname``, ``bond_type``, and ``modeldir`` to corresponding bonds, and re-run ``CPtrain.py`` to train other 4 models. Test a model ================== We can check the quality of the trained model as follows. .. code-block:: bash $CPtrain.py test -m chmodel_test/model_ch_python.pt -x ../cpmd/IONS+CENTERS_cell.xyz -i ../make_xyz/methanol.mol *********************************** Calculate molecular dipole moment *********************************** Let us go to the ``pred/`` directory. Finally, we will calculate the average molecular dipole moment of methanol. The experimental value is ``1.62[D]``. For this purpose, we invoke C++ interface with the following input. The calculation of molecular dipole moments is done without specifying any flag. .. code-block:: yaml :caption: config.yaml general: itpfilename: methanol.acpype/input_GMX.mol bondfilename: methanol.mol savedir: pred_dipole/ temperature: 300 timestep: 0.484 descriptor: calc: 1 directory: ./ xyzfilename: IONS+CENTERS+cell_sorted_merge.xyz savedir: pred_dipole/ descmode: 2 desctype: allinone haswannier: 1 interval: 1 desc_coh: 0 predict: calc: 1 desc_dir: dipole_10ps/ model_dir: model_rotate_methanol/ modelmode: rotate bondspecies: 4 save_truey: 0 We perform the calculation .. code-block:: bash dieltools.x The corresponding output file is ``DIELCONST``, which contains the mean molecular dipole moment, and ``molecule_dipole.txt``, which involve all the molecular dipole moments along the MD trajectory. We can see the mean absolute dipole moment as .. code-block:: bash $cat DIELCONST and we confirmed that the simulated value well agrees with the experimental one.