Disclaimer: All the information in this article comes from my own understanding after reading the original literatures. If something is wrong, that is totally on my foolishness. You should use this article with the literature to learn this important method.

# Key References

All the contents in this article came from several papers published by Jörg Behler and co-workers:

- Phys. Rev. Lett., 2007, 98, 146401 (First propose the idea of using symmetry functions as feature vector and combining neural network in creating NNP for Si)
- J. Chem. Phys., 2011, 134, 074106 (Detailed discussion about the properties of symmetry functions and how to select them in a systematic way)
- Phys. Rev. B, 2011, 83, 153101 (Treat long-range electrostatic interaction explicitly with environmental-based atomic charge, now can be used for constructing multi-component NNPs)
- J. Chem. Phys., 2012, 136, 194111 (Extend the decomposition of total energy from atom-centered to pair-centered, showing pair-centered NNP could achieve higher accuracy than atom-centered NNP.)

Besides those research articles, Behler et al. also published a series of really good review papers focused on HDNNP:

- J. Phys.: Condens. Matter, 2014, 26, 183001
- Int. J. Quan. Chem., 2015, 115, 1032-1050 (could be used as an introdutory material)
- J. Chem. Phys., 2016, 145, 170901
- Chem. Rev., 2021, 121(16), 10037-10072

If you want to go deeper into this interesting topic, after reading this article, you could immerse yourself in those papers, they are totally worth your time.

# What is Neural Network Potential?

When we are using molecular dynamics (MD) or Monte Carlo simulation (MC) to calculate the dynamic and statistical properties of the system, we need an accurate description of the potential energy surface (PES), which gives us the mapping between structure \(\{\vec{R}_I\}\) and total energy \(E_{tot}\). Nowadays people often use first-principle simulation such as density functional theory (DFT) to calculate the total energy with a given structure, but this approach is very time-consuming and cannot be used in large-scale long-trajectory molecular dynamics. So in order to accelerate the process, we need an efficient but yet accurate enough surrogate model for approximating PES.

Two approaches emerges from this demand: (1) physics-based force field and (2) mathematical force field. The first one is often called classical force field, which decompose the total energy into two-body (bond length), three-body (bond angle) and four-body terms (dihedral angle). But in here we are going to talk about the second option, which is constructing the potential not from physics, but from mathematics, to get the surrogate model for PES in an efficient and accurate way. And in this article, we focus ourselves in one specific type of potential: neural-network potential (NNP), which uses neural network as the model for PES.

# Framework for constructing NNP

Before going into the details, we need to understand NNP from a bird-eye view, especially how to construct one from scratch. This will help us understand the details in below (which will be a lot of math and discussions).

First we need to be clear of our goal: create an approximate mapping from chemical structure to its total energy. So our input is the Cartesian coordinates of all atoms in the system, represented as \(\{\vec{R}_I\}\), our results is the total energy of the system (\(E_{tot}\)) (or could also be forces \(\vec{f}_x, \vec{f}_y, \vec{f}_z\)). There are several steps in constructing a NNP:

** Step 1**: Convert the Cartesian coordinates into a series of descriptors (or representations) of the system, usually these descriptors have certain symmetry (e.g. permutational, translational, rotational symmetry), they are used as the input vector (or feature vector) for neural network (NN).

** Step 2**: Create an architecture of NN, with: (1) input layer (feature vector) (2) hidden layers (3) output layer (final results).

** Step 3**: Traning this NN to generate an accurate mapping from input to output for all training data.

** Step 4**: Test the accuracy of NN and the completeness of the training data (this could be done iteratively).

** Step 5**: Once the NN is accurate enough and training data covers all important areas of the PES, then we could use this NN in our production round. In this way we can accelerate the MD or MC calculation by 4-5 order of magnitude.

In the following we will tackle these steps one by one, first we will talk about symmetry functions, which is the foundation of HDNNP, then we will talk about the architecture of neural network and how to train it more effectively. Last we will briefly show some applications of HDNNP and directions for its further developments.

# Featurization: Cartesian coordinates to symmetry functions

Since our goal is to map cartesian coordinates of the system to its total energy, ideally the input and output should have similar symmetry, thus no consideration of symmetry needed to be considered in between. But clearly the symmetry of the total energy (permutate two identical atoms, translate, and rotate the system doesn't change its energy) is different from Cartesian coordinats. We need to find a way to symmetrize those coordinates. There are many ways to do this: dpgen uses inner coordination system, Gabor Csanyi et al. uses smooth overlap atomic position (SOAP). But today we will introduce the solution given by Behler and Parrinello in their seminal work in 2007 (Phys. Rev. Lett., 2007, 98, 146401), which is the symmetry functions.

## Atom-Centered Symmetry Function (ACSF)

First we need to understand one important aspect of HDNNP and other machine-learning-based methods, that is the usage of local environment, which means the total energy is decomposed into the contribution of each atoms:

\begin{equation}E_{tot}=\sum_{i=1}^{N_{atom}}E_i\end{equation}

For each atom, there is a corresponding neural network (NN), which calculate the contribution of that atom. Behler et al. assume that the contribution of one atom only depends on the atom itself and its environment, up to a certain cutoff radius (\(R_c\)), so that for atom \(i\), all the relevant coordinates are: \(\{\vec{R}_I; \vec{R}_{j}, \text{which} |\vec{R}_i-\vec{R}_j|\le R_c\}\).

For the discussion, first we will define a cutoff function (\(f_c(R_{ij})\)):

\begin{equation}f_c(R_{ij})=\begin{cases}\displaystyle 0.5\times[\cos(\frac{\pi R_{ij}}{R_c})+1] & \text{if } R_{ij}<R_c \\0 & \text{if } R_{ij}\ge R_c\end{cases}\end{equation}

Another form of cutoff function is:

\begin{equation}f_c(R_{ij})=\begin{cases}\displaystyle \tanh^3[1-\frac{R_{ij}}{R_c}] & \text{if } R_{ij}<R_c \\0 & \text{if } R_{ij}\ge R_c\end{cases}\end{equation}

Both functions can achieve energy conservation during molecular dynamics simulation (**This part I don't fully understand yet ...**)

Then we need to understand the purpose of the symmetry functions. They have two main purposes: (1) keep the same symmetry as the total energy (2) as the fingerprint for structures, which means if one atom has different local environment as other atom, they should have different feature vectors, which constitues multiple symmetry functions.

This could be a little bit hard to understand in the beginning, in mathematical language, we would transfer our local Cartesian coordinates \(\{\vec{R}_I; \vec{R}_{j}, \text{which} |\vec{R}_i-\vec{R}_j|\le R_c\}\) into a feature vector: \((a_1,a_2,\cdots,a_N)\), which each \(a_i\) is the value evaluated from a symmetry function considering all atoms in the local cluster. Then this vector is used as the feature vector for the neural network.

Now comes to the symmetry functions, there are two types of symmetry function: (1) radial and (2) angular. This is similar as the wave function for hydrogen atom, which also contains radial term and angular term. The formulation of symmetry function is not unique, but it contains some clever thinking. In the paper, Behler et al. proposed three different forms of symmetry function:

\begin{equation}G_i^1=\sum_jf_c(R_{ij})\end{equation}

\begin{equation}G_i^2=\sum_j e^{-\eta(R_{ij}-R_s)^2}\cdot f_c(R_{ij})\end{equation}

\begin{equation}G_i^3=\sum_j\cos(\kappa R_{ij})\cdot f_c(R_{ij})\end{equation}

For \(G_i^1\), there is no additional parameters. For \(G_i^2\), there are two parameters that we can adjust: (1) \(\eta\) (2) \(R_s\). For \(G_i^3\), one parameters: \(\kappa\). By choosing different parameters, we get different symmetry functions which can be used in describing a wide range of geometries in the local structure.

Besides radial symmetry function, we also have angular symmetry function, there are two forms proposed in the literature:

\begin{equation}G_i^4=2^{1-\zeta}\sum_{j,k\neq i}^{\text{all}}(1+\lambda\cos(\theta_{ijk}))^{\zeta}e^{-\eta(R_{ij}^2+R_{jk}^2+R_{ik}^2)}f_c(R_{ij})f_c(R_{jk})f_c(R_{ik})\end{equation}

\begin{equation}G_i^5=2^{1-\zeta}\sum_{j,k\neq i}^{\text{all}}(1+\lambda\cos(\theta_{ijk}))^{\zeta}e^{-\eta(R_{ij}^2+R_{ik}^2)}f_c(R_{ij})f_c(R_{ik})\end{equation}

The difference between \(G_i^4\) and \(G_i^5\) is that in \(G_i^5\), there is no term about \(R_{jk}\), which make sense because they are not related to the central atom. For \(G_i^4\) and \(G_i^5\) they all have two parameters: (1) \(\lambda\) and (2) \(\eta\).

Why does this form of function have same symmetry as total energy? The trick is that: (1) it sums up all the atoms in the local cluster in an unbaised way, which means if you switch the positions of two identical particles, the symmetry function doesn't change, this fits the permutational symmetry. (2) All the symmetry functions focus on the distance between two atoms, the distance doesn't change during translation and rotation, which fits the other criteria.

Now we have those different symmetry functions, now some important questions arise: (1) how many symmetry functions should we use? (2) what's the distribution between radial and angular symmetry functions? (3) What parameters should we use? We need to be clear that all those questions do not have definitive answer, in this field, usually it is the end justify the means. As long as your prediction is accurate, there is no point to discuss why you choose such parameters.

But as Behler has mentioned in his papers, there are some rule-of-thumb that we could use:

- The number of symmetry functions in total should be larger than the total degree of freedom in the system (\(3N_{\text{atom}}\)). Usually 50 symmetry functions were needed for each atom.
- The key criteria for choosing symmetry function is
**its ability to distinguish different local structures**. e.g. if two atoms have the same symmetry function but don't have the same environment, then we need to adjust parameters or incorporate in more symmetry functions. - The set of symmetry function should cover all range of bond lengths and angles (with different parameters). This can be done by really understanding the meaning and function of the parameters in the symmetry function, more details can be found in this paper: J. Chem. Phys., 2011, 134, 074106
- The minimum and maximum of symmetry functions can be used to understand the validity of the trained NNP. If in a structure there is a value out of the range, then we should be careful about the correctness of the extrapolation.

Either way, constructing symmetry functions is like choosing the basis set for quantum chemistry simulations, we need to compare different options in order to find the best fit, it is largely empirical knowledge.

## Pair-Centered Symmetry Function (PCSF)

In this paper (J. Chem. Phys., 2012, 136, 194111), Behler et al. discussed a new framework of decomposition of the total energy, not by individual atoms, but by individual pairs:

\begin{equation}\displaystyle E_{tot}=\frac{1}{2}\sum_{i\neq j}E_{ij}\end{equation}

This decomposition was inspired by the Tersoff potential.

But there is a problem with this approach, it is not scalable. Since for atomic-centered decomposition, the computational cost is linearly correlated with the number of atoms, it is easily scalable. But for pair-centered approach, the computational cost is \(O(N^2)\), which is not linear scaling. Yet it appears for some cases it can achieve slightly better results than ACSF.

In below we will give some forms of Pair-Centered Symmetry Functions (PCSF) proposed in the literature, since this approach is not as widely used as ACSF based on the reason I gave above (could be more reasons), we do not talk it in more details.

For the radial symmetry functions:

\begin{equation}G_{ij}^1=f_c(R_{ij})\end{equation}

\begin{equation}G_{ij}^2=f_c(R_{ij})[\sum_kf_c(R_{ik})+\sum_kf_c(R_{jk})]\end{equation}

\begin{equation}G_{ij}^3=f_c(R_{ij})e^{-\eta R_{ij}^2}[\sum_kf_c(R_{ik})e^{-\eta R_{ik}^2}+\sum_kf_c(R_{jk})e^{-\eta R_{jk}^2}]\end{equation}

(**My question: In here should we keep all three \(\eta\) the same? All we could add some flexibility there?**)

For the angular symmetry function:

\begin{equation}G_{ij}^4=f_c(R_{ij})e^{-\eta R_{ij}^2}\cdot 2^{1-\zeta}\sum_{\alpha}[(1+\lambda\cos\alpha)^{\zeta}e^{-\eta(R_{ik}^2+R_{jk}^2)}f_c(R_{ik})f_c(R_{jk})]\end{equation}

where the choice of \(\alpha\) is not trivial, for homogeneous and heterogeneous cases \(\alpha\) can have different meanings (more see the reference mentioned in the beginning of this section)

# Architecture of the Neural Network and the construction of training set

Once we have all the symmetry functions (radial or angular) with certain parameters, we could construct the feature vector and pass it to the neural network.

In here we will not discuss in detail about the formulation of neural network. The more important issue in here is the iterative process of training neural network and expanding the training set. This approach makes training NNP can be done automatically with minimal human intervention:

** Step 1**. First, we need to select several structures and calculate their corresponding energies and forces by using certain flavor of electronic structure methods. This is our initial training sets.

** Step 2**. Based on this training set, we training several neural network with different initial parameters (but the architecture should be the same for comparison)

** Step 3**. We use one of the trained NNs to calculate a trajectory in molecular dynamics. Then evaluate the structure and forces along this trajectory by other NNs, and compare their accuracy.

** Step 4**. If large deviation has been detected, that means there are less information in the training set about that region, then more structures from that area should be calculated and put in the training set. Then return to step 2 to proceed the next iteration.

** Step 5**. Once there are no significant deviation found (should set some convergence parameters), then we have achieved both training data and neural network potential (NNP) at the same time. This is an iterative, self-consistent approach which can be easily automated. DPGEN also adopted similar strategies.

This approach is referred as "multi-NNP" method in the literature.

Usuall in the end, we will have up to thousands or tens of thousands of DFT data as our training set and a very accurate (accuracy for energy around 5meV/atom, accuracy for forces around 0.02eV/Å), efficient potential (4-5 order of magnitude faster than DFT) that we can use for further molecular dynamics simulations to simulate free energies, activation barriers, phase transition and even dynamic properties such as diffusion coefficient.

# Application

The application of HDNNP is vast, examples can be found in the review papers mentioned above. One of the earliest example is constructing the NNP for Si.

I hope you have a good time reading through this article. This is a very preliminary article about HDNNP based on my own understanding. If you find any mistakes and incorrect part, please send to me an email to **zhengdahe.electrocatalysis@gmail.com**. I'm happy to communicate with you further.

Have a nice day!

Best regards,

Zhengda