HF and DFT
To prepare for the following sections it is essential to outline the key characteristics of Hartree-Fock (HF) and density functional theory (DFT). Both methods are single-reference approaches defined by molecular orbitals (MOs) and their occupation numbers; we focus on closed-shell systems, where the occupation numbers are 2 for the occupied MOs and 0 for the virtual MOs. The total electronic energy in HF and DFT comprises the one-electron term, the Coulomb interaction among the electrons, the HF exchange, and the exchange-correlation term specific to DFT.
The DFT expression applies in this form only to non-hybrid functionals; hybrid functionals additionally incorporate a portion of HF exchange. The evaluation of the exchange-correlation term is efficient and fast, since it is defined as a three-dimensional integral.
In practice, Eq. (3) is evaluated for the chosen functional by numerical integration (quadrature), using an efficient and numerically stable procedure implemented in TURBOMOLE. For larger molecules the required CPU time increases only linearly with molecular size, as demonstrated below.
For DFT it remains to consider the exchange-correlation energy E_xc, which is defined as
The evaluation of E_xc is typically the most demanding part of DFT treatments. With the usual expansion of the MOs in a set of basis functions,
φ_i = Σ_ν c_νi ν (6), one gets the density ρ(r) = Σ_νμ D_νμ ν(r) μ(r) (7) and the density matrix D_νμ = Σ_i n_i c_νi c_μi, where the occupation numbers n_i are given for completeness. The MOs are now specified by the coefficients c_νi
- and the chosen basis set, of course. Optimization of the coefficients within the variation principle yields the HF and Kohn-Sham (KS) equations to be solved, F c_i = ε_i S c_i (13), where S denotes the overlap matrix.
In HF theory the Coulomb and exchange contributions are evaluated together from the same four-center two-electron integrals. In non-hybrid DFT, by contrast, only the Coulomb term has to be treated in this way, which opens the door to a simplified and more efficient evaluation; procedures exploiting this have been used successfully since the early days of DFT.
RI technique
One of the successful procedures [16, 17] was to approximate the density ρ in terms of an auxiliary or fitting basis of functions P,
The free parameters c_P are obtained from a least-squares requirement,
It remains to specify the scalar product occurring in the last two equations. A careful analysis by Almlöf et al. has identified the best choice [18]: the Coulomb metric, ⟨f|g⟩ = ∫∫ f(r₁) g(r₂) / |r₁ - r₂| dr₁ dr₂ (17).
Eq. (18) involves the matrix elements of the inverse of B_PQ = ⟨P|Q⟩, with all scalar products defined as in (17). This formulation has given rise to the term "RI", which stands for resolution of the identity, since the approximation amounts to inserting an (approximate) resolution of the identity expressed in the auxiliary basis.
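As a toy illustration of these fit equations, the following sketch (with a random symmetric positive definite matrix standing in for B_PQ = ⟨P|Q⟩ and a random vector for ⟨P|ρ⟩; no real integrals are computed) solves B c = v via Cholesky decomposition and checks the least-squares stationarity condition:

```python
import numpy as np

# Toy illustration of the RI fit: the least-squares requirement for the
# expansion coefficients leads to the linear system  B c = v  with
# B_PQ = <P|Q> and v_P = <P|rho>.  All quantities here are synthetic
# stand-ins, not real integrals.

rng = np.random.default_rng(0)
n_aux = 6

# Build a symmetric positive definite metric B (as <P|Q> would be).
A = rng.standard_normal((n_aux, n_aux))
B = A @ A.T + n_aux * np.eye(n_aux)

v = rng.standard_normal(n_aux)          # stand-in for <P|rho>

# Solve the fit equations via Cholesky decomposition (B is SPD).
L = np.linalg.cholesky(B)
y = np.linalg.solve(L, v)
c = np.linalg.solve(L.T, y)             # fitting coefficients c_P

# Check: c satisfies the normal equations, i.e. the residual of the fit
# is orthogonal (in the chosen metric) to every auxiliary function.
assert np.allclose(B @ c, v)
```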
With the basis set expansion for ρ, Eq. (7), it then remains to compute ⟨ρ|P⟩ = Σ_νμ D_νμ ⟨νμ|P⟩ (19) as the essential term.
The formal O(N⁴) scaling of (9) is thus reduced to a formal O(N³) scaling in (19), resulting in significant CPU time savings. Since Gaussian basis functions are employed, products νμ can be neglected if the corresponding centers are sufficiently far apart; the number of significant products νμ therefore increases for large molecules only as O(N). This results in an asymptotic O(N²) scaling for RI and conventional treatments alike, with a much smaller prefactor for the RI technique.
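The O(N) growth of significant products can be illustrated numerically. The sketch below (with made-up Gaussian exponents and centers on a line) screens pairs by the Gaussian product prefactor and shows that doubling the number of centers roughly doubles, rather than quadruples, the surviving pair count:

```python
import numpy as np

# Sketch: the product of two Gaussians exp(-a r_A^2) * exp(-b r_B^2)
# carries a prefactor exp(-a*b/(a+b) * |A-B|^2), so pairs of distant
# centers can be screened out.  For centers on a line with fixed spacing
# the number of surviving pairs grows only linearly with N.

def significant_pairs(centers, a=1.0, b=1.0, threshold=1e-10):
    count = 0
    mu = a * b / (a + b)
    for i in range(len(centers)):
        for j in range(i, len(centers)):
            if np.exp(-mu * (centers[i] - centers[j]) ** 2) > threshold:
                count += 1
    return count

counts = [significant_pairs(np.arange(n, dtype=float)) for n in (50, 100, 200)]
# Doubling N roughly doubles the pair count (linear, not quadratic, growth).
ratios = [counts[1] / counts[0], counts[2] / counts[1]]
assert all(r < 2.5 for r in ratios)
```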
Although the RI procedure had been implemented in several DFT programs, its accuracy was not well documented: these programs can only compute the approximate Coulomb energy rather than the rigorous expression (9), and the optimization of the crucial auxiliary basis functions P had not been thoroughly addressed.
We therefore initiated a comprehensive effort to optimize auxiliary basis sets for all elements of the periodic table and to document the errors associated with the RI technique. This optimization enhances not only reliability but also efficiency: optimized sets yield more accurate results and are often smaller than the bases estimated previously. The Karlsruhe auxiliary basis sets are now available for various accuracy requirements in RI-DFT, RI-HF, RI-MP2, and RI-CC2 calculations, although the latter topics are outside the scope of this discussion. These high-quality bases were also available to the other projects within HPC-Chem; to our knowledge, no other auxiliary basis sets match their accuracy and efficiency.
Gradients
Single point calculations provide the molecular electronic structure and the electronic energy for given, fixed nuclear coordinates, but this alone is insufficient for the efficient determination of key molecular properties. Consider, for instance, molecular equilibrium geometries, i.e. the structures of one or more isomers of a molecule, which require analysis beyond single point evaluations.
Equilibrium structures fulfill ∂E/∂λ = 0 (20), where λ denotes structure parameters, e.g. the coordinates of the nuclei. An attempt to locate structures by single point calculations alone would hardly be feasible even for small molecules with ten degrees of freedom,
A solution to this problem was achieved by analytical gradient methods, which evaluate the gradient ∂E/∂λ simultaneously for all degrees of freedom [24]. The computation of ∂E/∂λ is surprisingly simple in principle, if one recalls that E depends explicitly only on λ (the locations of the nuclei, which include the centers of the basis functions) and on the density matrix, i.e.
The first term is easily dealt with: its evaluation has the same structure as a single HF or DFT iteration and requires only about three times the corresponding computational effort. The second term can be simplified by exploiting the fact that the HF or Kohn-Sham (KS) equations have already been solved, so that the MOs are optimized and orthonormal; this leads to
-Σ_νμ W_νμ S^λ_νμ (22), where S^λ denotes the derivative of the overlap matrix and W the 'energy-weighted' density matrix, W_νμ = Σ_i n_i ε_i c_νi c_μi.
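A minimal numerical sketch of this term, with random stand-ins for the MO coefficients and the overlap derivative (not an actual SCF result):

```python
import numpy as np

# Sketch of the 'energy-weighted' density matrix term of Eq. (22):
# W_mu_nu = sum_i n_i eps_i c_mu_i c_nu_i, contracted with the derivative
# of the overlap matrix S^lambda.  Orbitals, energies, and S^lambda are
# random stand-ins here, not the result of a real calculation.

rng = np.random.default_rng(1)
nbf, nocc = 5, 2

C = rng.standard_normal((nbf, nocc))   # occupied MO coefficients
eps = np.array([-0.9, -0.4])           # orbital energies
n = np.full(nocc, 2.0)                 # closed-shell occupation numbers

W = (C * (n * eps)) @ C.T              # energy-weighted density matrix

S_lam = rng.standard_normal((nbf, nbf))
S_lam = 0.5 * (S_lam + S_lam.T)        # overlap derivative is symmetric

grad_term = -np.sum(W * S_lam)         # the -Tr(W S^lambda) contribution

assert np.allclose(W, W.T)             # W is symmetric by construction
```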
With the capability to compute ∂E/∂λ it is a standard task to locate, in an iterative procedure, structures that fulfill (20):
1. choose a starting structure
2. solve the HF or DFT equations to get optimized MOs
3. compute the gradient ∂E/∂λ
4. update the structure accordingly and return to 2. until convergence is reached
The development of efficient gradient techniques and of stable relaxation procedures has been crucial for the success of quantum chemistry. Convergence of the relaxation is typically reached within a moderate number of cycles, and the computation of ∂E/∂λ requires less CPU time than the preceding solution of the HF or DFT equations. Structure determinations, which are fundamental to quantum chemistry, have thus become routine.
The MARI-J (Multipole Assisted RI-J) procedure
The RI-J method is an efficient procedure even for large molecules with over 100 atoms; there the evaluation of ⟨ρ|P⟩ dominates, since the other computational tasks scale as O(N). This project aimed to enhance the efficiency of the RI-J procedure further by utilizing the multipole expansion for the Coulomb interaction of non-overlapping charge distributions. For the detailed technical derivations we refer to our publication [25].
The multipole expansion deals with the Coulomb interaction of two charge distributions ρ_A and ρ_B, provided they do not overlap. Let ρ_A be centered around A and ρ_B around B. We then compute the moments of ρ_A as q^A_lm = ∫ ρ_A(r) |r - A|^l P_lm(cos θ) e^{imφ} dr (25), where the P_lm denote associated Legendre polynomials, and similarly the moments q^B_lm referring to ρ_B. The interaction energy can then be written as a sum over products of moments, E_AB = Σ_{lm,l'm'} q^A_lm T_{lm,l'm'}(R) q^B_{l'm'} (26), with an interaction tensor T (27), where R denotes the vector pointing from A to B, R = B - A, and the angles θ and φ of the respective vectors are defined in an arbitrary fixed coordinate system. Eq. (26) thus effects a separation of the Coulomb interactions between ρ_A and ρ_B if they do not overlap.
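The separation can be checked numerically. In the sketch below, point charges stand in for the two continuous distributions, and only monopole and dipole moments are kept; for well-separated clusters this already reproduces the exact interaction closely and improves on the monopole term alone:

```python
import numpy as np

# Numerical check of the multipole idea: for two non-overlapping charge
# distributions the exact Coulomb interaction is well reproduced by a few
# low-order moments.  Point charges stand in for continuous densities;
# only monopole and dipole terms are kept here.

def exact_interaction(qa, ra, qb, rb):
    return sum(q1 * q2 / np.linalg.norm(r1 - r2)
               for q1, r1 in zip(qa, ra) for q2, r2 in zip(qb, rb))

def moments(q, r, center):
    return np.sum(q), np.sum(q[:, None] * (r - center), axis=0)

A = np.zeros(3)
B = np.array([50.0, 0.0, 0.0])          # well separated centers
R = B - A
Rn = np.linalg.norm(R)
Rhat = R / Rn

qa = np.array([0.5, -0.3, 0.2])
ra = A + np.array([[0.3, 0.1, 0.0], [-0.2, 0.2, 0.1], [0.0, -0.3, 0.2]])
qb = np.array([-0.4, 0.3, 0.3])
rb = B + np.array([[0.1, -0.2, 0.3], [0.2, 0.1, -0.1], [-0.3, 0.0, 0.1]])

QA, muA = moments(qa, ra, A)            # monopole and dipole of cluster A
QB, muB = moments(qb, rb, B)

E_exact = exact_interaction(qa, ra, qb, rb)
# Monopole-monopole plus monopole-dipole terms of the expansion:
E_multi = (QA * QB / Rn
           + (QB * np.dot(muA, Rhat) - QA * np.dot(muB, Rhat)) / Rn**2)

# Truncation error is tiny and smaller than the monopole-only error.
assert abs(E_multi - E_exact) < 1e-2 * abs(E_exact)
assert abs(E_multi - E_exact) < abs(QA * QB / Rn - E_exact)
```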
The computation of ⟨ρ|P⟩, Eq. (19), is the only demanding task within RI-J, and we apply the multipole expansion to accelerate its evaluation. For this purpose we decompose ρ into contributions associated with the nuclei N, which are additionally characterized by an extension (28). We then compute the corresponding moments from (24), with the contribution of nucleus N in place of ρ_A and nucleus N as the center. The auxiliary functions P are constructed to be atom-centered and are chosen to have the angular behavior of spherical harmonics; the evaluation of the corresponding moments q^P_lm is thus trivial.
The crucial point of this procedure is the decomposition (28), which is based on a detailed consideration of products of basis functions. Each product νμ is assigned to the nucleus N at which the steeper of the two functions is centered, and an appropriate extension is attributed to it. A new, rigorous test then checks for each pair whether the charge distribution assigned to N and the auxiliary function P are well separated; if so, the multipole expansion can be applied, which yields the 'far field' (FF) contribution (29), where R points from the nucleus N to the center of P. For the remaining terms the multipole expansion cannot be utilized, since the charge distributions overlap; this 'near field' (NF) part is evaluated with the conventional integral code.
Our goal was to establish parameter sets for the MARI-J method that minimize CPU times while keeping the errors caused by the multipole expansions below those of the RI-J approximation itself. We present two sets of parameters, designated the high- and low-precision sets. For the high-precision set the errors stay below 1.0 × 10⁻¹⁰ Eh, corresponding to a maximum of 1.0 × 10⁻¹⁰ Eh per atom for the largest molecules analyzed. The smallest errors, under 1 × 10⁻¹¹ Eh, are found for the two-dimensional graphitic sheets and the lower-density zeolite clusters. The low-precision set also keeps the errors below acceptable thresholds.
The high-precision MARI-J calculation for the insulin molecule deviates from the full RI-J result by 1.3 × 10⁻ᵐ Eh in the total energy, the low-precision calculation by 2.2 × 10⁻ᵐ Eh. This indicates that low-precision MARI-J calculations are sufficiently accurate for most applications, in particular for systems resembling the zeolite fragments. For dense three-dimensional systems, or when very diffuse basis sets are used, the high-precision parameter set is advisable. In any case, the errors caused by the multipole expansions are minimal compared to the inaccuracies due to the RI-J approximation, incomplete basis sets, and numerical integrations.
Demonstrative tests
This section explores the application of the MARI-J method to various model systems: graphitic sheets, zeolite fragments, and the insulin molecule (Figure 1). We argue that these systems are closer to the typical problems treated with DFT methods than the one- and two-dimensional model systems commonly used for testing algorithms.
Figure 1: Schematic drawing of the insulin molecule used for our calculations.
All DFT calculations employ the Becke-Perdew (BP86) exchange-correlation functional. We use split-valence basis sets with polarization functions on all non-hydrogen atoms, denoted SV(P), along with the corresponding auxiliary bases. To assess the differences in total energies between the MARI-J method and the full RI-J treatment, the DFT energies were converged to better than 1 × 10⁻ⁿ.
The numerical integrations use grids m3 and m4; grid m3 is recommended for smaller molecules and grid m4 for larger ones. For the timing runs an energy convergence criterion of 1 × 10⁻ⁿ is applied. All calculations are run on an HP J6700 workstation with a PA-RISC HP785 processor (750 MHz) and 6 GB main memory; for further details we refer to the literature.
The 2-D series of graphitic sheets, C_{6n²}H_{6n}, employs models of D6h symmetry with C-C and C-H bond lengths of 1.42 Å and 1.0 Å, respectively. Such sheets were also used by Strain et al. [30] and by Pérez-Jordá and Yang [31] to evaluate the effectiveness of their multipole-based techniques. The largest sheet analyzed in this work is C864H72.
Table 1 presents selected results for the largest model systems analyzed: the number of atoms (N_at), of basis functions (N_bf), and of auxiliary basis functions (N_aux); CPU times per iteration for the NF (t_NF) and FF (t_FF) parts of the MARI-J calculations and for the full RI-J treatment (t_RI-J); and absolute errors in total energies relative to full RI-J calculations (ΔE_MA). Results are given for high-precision (hp) and low-precision (lp) MARI-J calculations; CPU timings for the grid construction (grid m4) are included for comparison.
              C864H72    zeolite fragment   insulin
Symmetry      D6h        C1                 C1
N_at          936        360                787
N_bf          12240      4848               6456
N_aux         32328      12072              16912
The pure-silica zeolite chabazite fragments are derived from an experimental crystal structure; the unit cell includes a double six-membered silica ring unit. The zeolite fragments comprise one to eight such units, treated in C1 symmetry, with dangling Si-O bonds saturated by hydrogen atoms. The insulin molecule, consisting of 787 atoms and 6456 basis functions, is also treated in C1 symmetry; its coordinates are taken from the PDB database. A summary of the largest molecules analyzed in this study is provided in Table 1, and the coordinates of all structures can be accessed at ftp://ftp.chemie.uni-karlsruhe.de/pub/marij.
The computational effort of the density fitting step of the RI-J method is sometimes discussed as problematic because of its formal scaling. The TURBOMOLE implementation, however, uses a fast Cholesky decomposition of the positive definite matrix of two-center repulsion integrals ⟨P|Q⟩, which makes this step uncritical. For symmetric molecules only the fully symmetric part of the two-center repulsion integrals is required, which further reduces the time for the Cholesky decomposition. Even for the insulin molecule, with C1 symmetry, 787 atoms, and 16912 auxiliary basis functions, this step takes approximately 20 min, and it is done only once at the start of the SCF procedure for both RI-J and MARI-J calculations. In practice the cost and scaling of the RI-J method are therefore determined by the computation of the three-center Coulomb integrals. For very large systems additional techniques could be employed to lower the cost of the density fitting step further.
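The linear-algebra pattern described here, a single Cholesky decomposition before the SCF iterations followed by cheap triangular solves in each iteration, can be sketched as follows (with random SPD data in place of ⟨P|Q⟩):

```python
import numpy as np

# Sketch of the density-fitting linear algebra described in the text:
# the SPD matrix of two-center repulsion integrals <P|Q> is Cholesky-
# decomposed once before the SCF iterations; each iteration then needs
# only two triangular solves to obtain the fitting coefficients.

rng = np.random.default_rng(2)
n_aux = 8

A = rng.standard_normal((n_aux, n_aux))
B = A @ A.T + n_aux * np.eye(n_aux)     # stand-in for <P|Q>, SPD

L = np.linalg.cholesky(B)               # done once, before the SCF loop

for _ in range(3):                      # mock "SCF iterations"
    v = rng.standard_normal(n_aux)      # stand-in for <rho|P>, new each iteration
    c = np.linalg.solve(L.T, np.linalg.solve(L, v))
    assert np.allclose(B @ c, v)        # fit equations solved exactly
```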
Figures 2 and 3 show CPU times per SCF iteration for the studied systems using the RI-J and MARI-J methods; Table 1 summarizes the results for the largest molecules. For comparison, the evaluation times for the exchange-correlation energy with grids m3 and m4 are included (excluding the cost of the initial grid formation). The MARI-J method significantly reduces the computational effort for the Coulomb term, making it comparable to that for the exchange-correlation energy. The method performs best for the two-dimensional graphitic sheets and the zeolite fragments: for the largest graphitic sheet the CPU time is reduced 4.7- and 5.9-fold with the high- and low-precision parameters, respectively, and for the largest zeolite fragment 4.6- and 6.5-fold. For the insulin molecule speedups of 3.8 and 5.3 are obtained.
For all systems studied, the "crossover point" with the full RI-J treatment is reached already for the smallest systems: for graphitic sheets and zeolite fragments the MARI-J calculations become faster at approximately 250-350 basis functions, depending on the required accuracy. Tests on smaller systems show that MARI-J causes no significant computational overhead compared to the full RI-J treatment.
The influence of the precision requirements on the CPU timings of the MARI-J method depends on the system studied. For the graphitic sheets, zeolite clusters, and diamond pieces the difference in CPU times between high- and low-precision MARI-J calculations is approximately 30%.
Table 1 also compares the CPU times for the near-field (NF) and far-field (FF) parts of the Coulomb calculations for the largest molecular systems. Although only a small percentage of the three-center electron repulsion integrals (ERIs) is evaluated analytically, the NF part remains the main contributor to the computation time. For molecules with C1 symmetry the FF part accounts for 10% or less of the total CPU time; for symmetric molecules this figure rises to 20-30%, since the current implementation does not fully exploit symmetry in the FF part. Implementing symmetry throughout all components of the MARI-J algorithm would decrease these times, albeit with minor impact on the total computation time.
All calculations in this study use the standard SCF procedure, i.e. with diagonalization of the Fock matrix in each iteration. For the insulin molecule the average CPU times per SCF iteration are 42 min for the diagonalization, 17 min for the exchange-correlation term, and 39 min or 28 min for the high- or low-precision MARI-J step, respectively.
MARI-J (high precision)   Total 1.44  1.54
                          NF    1.41  1.54
MARI-J (low precision)    Total 1.47  1.56
                          NF    1.45  1.57
Table 2: Scaling exponents for the various steps of the calculation of the Coulomb and exchange-correlation terms (grids m3 and m4), together with the scaling exponent for the number of significant shell pairs of basis functions.
CPU time per SCF iteration for the calculation of the Coulomb term versus the number of basis functions for the series of graphitic sheets C_{6n²}H_{6n} (n = 2, ..., 12): full RI-J calculations, and MARI-J with the high-precision (hp) and low-precision (lp) parameter sets. CPU times for the evaluation of the exchange-correlation energy with grids m3 and m4 are shown for comparison.
MARI-J gradient evaluation
Geometry optimization involves a series of energy and gradient calculations, so the multipole expansion must be exploited in the gradient evaluation as well. The implementation of the MARI-J gradient poses some technical challenges, which we do not detail here; instead we focus on the results. As illustrated in Figure 4, which compares CPU times for the different parts of the gradient calculation for the graphitic sheets, the timing for the Coulomb term is reduced dramatically, by a factor of 15, and is now comparable to that of the other terms involved.
First-order derivatives ∂E/∂λ can thus be calculated in less time than the energy itself, for HF as well as DFT, which is essential for routine theoretical studies of molecules. It would be even more desirable to have second derivatives available as well.
The Hessian is crucial for characterizing a stationary point: it distinguishes a local minimum, which corresponds to a specific isomer, from a saddle point with exactly one negative eigenvalue, which represents a transition state of a reaction. The Hessian also yields directly the frequencies of IR and Raman spectra within the harmonic approximation.
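The classification by Hessian eigenvalues can be made concrete with a small sketch (the 2×2 Hessians below are made up, not from a real molecule):

```python
import numpy as np

# Sketch: classify a stationary point from the eigenvalues of the
# Hessian, as described in the text.  A minimum has no negative
# eigenvalues; a transition state has exactly one.

def classify(hessian):
    n_neg = int(np.sum(np.linalg.eigvalsh(hessian) < 0.0))
    if n_neg == 0:
        return "local minimum (an isomer)"
    if n_neg == 1:
        return "first-order saddle point (a transition state)"
    return "higher-order saddle point"

H_min = np.array([[2.0, 0.3], [0.3, 1.0]])   # both eigenvalues positive
H_ts = np.array([[2.0, 0.0], [0.0, -0.5]])   # exactly one negative

assert classify(H_min) == "local minimum (an isomer)"
assert classify(H_ts) == "first-order saddle point (a transition state)"
```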
The Hessian ∂²E/∂λ∂λ' is best obtained by differentiating the gradient, Eqs. (21) to (23), once more with respect to a second coordinate λ'.
The detailed formulae, which are quite lengthy, need not concern us here; the most important aspect of (32) is that one now has to compute the perturbed MOs, i.e. the derivatives of the MO coefficients with respect to the nuclear coordinates.
The resulting coupled perturbed HF or KS equations, known as CPHF or CPKS, are conventionally solved by expressing the perturbed MOs in terms of the unperturbed ones by means of a transformation matrix U^λ (33), which is determined by HF-type equations (34).
One CPHF or CPKS equation has to be solved for each degree of freedom. The evaluation of the Hessian is therefore much more demanding than a single point or gradient calculation: it costs at least as much per degree of freedom, and the number of degrees of freedom grows linearly with molecular size. The Hessian thus requires O(N) times the effort of a gradient calculation, but it also yields O(N) times more information. As shown above, energies and gradients can be obtained with favorable, for large parts even O(N), effort; second derivatives, however, can presently only be treated for molecular sizes for which such reduced scaling does not yet apply.
The development of second-derivative implementations for DFT and HF is challenging not only because of the high computational demands but also because of the algorithmic complexity of the lengthy expressions involved. The original plan was to extend the existing TURBOMOLE code for HF second derivatives by DFT. It was ultimately decided to restructure the code completely, since the old implementation suffered from efficiency problems for systems with about 50 or more atoms.
Our publication [38] details the implementation of DFT second derivatives and documents its efficiency and accuracy for various applications. The code incorporates features also found in other programs such as GAUSSIAN, e.g. integral-direct and multi-grid techniques for the CPKS equations, as well as weight derivatives in the quadrature. In addition, TURBOMOLE offers features designed to further enhance computational efficiency.
The iterative solution of the CPKS equations uses a preconditioned conjugate gradient method with subspace acceleration: all solution vectors are generated within a single subspace that grows with each iteration. This ensures stable convergence; typically only four to six iterations are required to reduce the residual norm below the convergence threshold.
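A minimal sketch of a preconditioned conjugate-gradient solver of this general kind, here with a simple Jacobi preconditioner and without the subspace acceleration described in the text:

```python
import numpy as np

# Minimal preconditioned conjugate-gradient solver for an SPD system,
# of the kind used iteratively for the CPKS equations.  Jacobi (diagonal)
# preconditioning only; subspace acceleration is omitted in this sketch.

def pcg(A, b, tol=1e-10, max_iter=200):
    M_inv = 1.0 / np.diag(A)            # Jacobi preconditioner
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv * r
    p = z.copy()
    for _ in range(max_iter):
        Ap = A @ p
        alpha = (r @ z) / (p @ Ap)
        x += alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:     # residual-norm convergence test
            return x
        z_new = M_inv * r_new
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
    return x

rng = np.random.default_rng(3)
n = 20
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)             # SPD test matrix
b = rng.standard_normal(n)

x = pcg(A, b)
assert np.allclose(A @ x, b, atol=1e-8)
```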
We decompose the space of internal coordinates into subspaces transforming according to the irreducible representations of the molecular symmetry group. The resulting symmetry-blocked matrices significantly lower the memory and disk storage requirements, simplify the treatment of the CPKS equations, and increase overall efficiency. Moreover, the evaluation of the Hessian can be restricted to individual irreducible representations, e.g. those associated with infrared- or Raman-active modes.
The results presented were obtained from force constant calculations with the TURBOMOLE module AOFORCE, using the BP86 functional and an SV(P) basis set (split valence plus polarization, except on hydrogen). To illustrate the computational cost for systems of varying size and symmetry, Table 3 lists total CPU times; wall times differ by no more than 2%. The benchmark calculations cover several classes of hydrocarbons: n-alkanes
(C_nH_{2n+2}), planar graphitic sheets, and diamond-like carbon clusters (starting with adamantane and enlarged by adding further layers from one side). The alkanes were treated in their all-trans structures and the sheets in planar geometries; available point-group symmetry was exploited.
The CPKS solver required four iterations for each alkane and diamond cluster and five for each aromatic sheet, so the corresponding matrices had to be formed that many times. The total CPU time grows approximately quadratically with system size and is dominated by the evaluation of the Coulomb part. The effort for the weight derivatives is minimal in both sets of equations. For smaller molecules the DFT quadrature is more expensive than the differentiated four-center integrals, whereas for the largest systems these two contributions exhibit similar timings.
As a demonstration of the 'IR-only' and 'Raman-only' options we have treated a fullerene, again on the AMD Athlon (1.2 GHz); the timings include the contribution to the perturbed Fock matrix entering Eq. (34).
Table 3: CPU times (in hours) for BP86/SV(P) force constant calculations on several classes of hydrocarbons, run on an AMD Athlon (1.2 GHz, 768 MB RAM). Listed are the number of degrees of freedom, the number of basis functions (N_BF), and the total CPU times together with the most important individual contributions; the first block refers to the linear alkanes.
Implementation of RI-J for second derivatives
In the work on analytical second derivatives described above, the solution of the CPKS equations was identified as the most demanding step. Each CPKS iteration requires the evaluation of a Coulomb term and, for hybrid functionals, an additional HF exchange term; these dominate the CPU time. For non-hybrid functionals the RI-J technique can be used to reduce this burden. It applies to the first term in Eq. (35), which involves a Coulomb matrix
built from the matrix given in Eq. (38), where c is the MO coefficient matrix from Eq. (6). With RI-J we obtain instead the fitted approximation.
The replacement of J by its RI approximation J̃ requires 'only' importing the RI-J machinery into the AOFORCE module.
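The index structure of this replacement can be sketched as follows, with random stand-in tensors in place of the three-center integrals (νμ|P) and the metric ⟨P|Q⟩ (no real integrals are computed):

```python
import numpy as np

# Sketch of the RI-J replacement J -> J~: instead of contracting a
# density-type matrix with four-center integrals, one contracts with
# three-center quantities and fitted coefficients.  The tensors below
# are random stand-ins with the correct index structure only.

rng = np.random.default_rng(4)
nbf, naux = 6, 10

T = rng.standard_normal((nbf, nbf, naux))   # stand-in for (mu nu | P)
T = 0.5 * (T + T.transpose(1, 0, 2))        # symmetric in mu, nu
A = rng.standard_normal((naux, naux))
B = A @ A.T + naux * np.eye(naux)           # stand-in for <P|Q>, SPD

D = rng.standard_normal((nbf, nbf))
D = 0.5 * (D + D.T)                         # symmetric density-type matrix

gamma = np.einsum('mnP,mn->P', T, D)        # <rho|P>
c = np.linalg.solve(B, gamma)               # fitting coefficients
J_fit = np.einsum('mnP,P->mn', T, c)        # J~ in the AO basis

assert np.allclose(J_fit, J_fit.T)          # J~ inherits the symmetry
```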
Our implementation of RI-J for second analytical derivatives is described in a separate publication [39].
The reliability and efficiency of RI-J in this context are documented there: the CPU time for the evaluation of the Coulomb term is reduced to approximately 10%. The overall gain is less impressive than for energy or gradient calculations, because the CPKS solver treats the whole set of CPKS equations simultaneously, so that the two-electron integrals, normally the most expensive part, are evaluated only once for the entire set. Total timings are nevertheless reduced by a factor of 2.5. The main bottleneck is now the treatment of the second term in Eq. (35), which we believe to be efficiently implemented.
Demonstrative tests
For a more realistic test system, indinavir (C1 symmetry), we carried out BP86 (partial RI-J) DFT second nuclear derivative calculations. On Intel Xeon (2.4 GHz) computers we obtained the following timings:
Indinavir has some floppy modes with frequencies below 10 cm⁻¹. This requires a careful structure optimization, since otherwise the computed frequencies can come out imaginary. We recommend including the derivatives of the quadrature weights in the structure optimization, to make sure the energy minimum has been accurately located and to avoid spurious imaginary frequencies.
As a last example we report timings for the computation of the second derivatives of cyanocobalamin (vitamin B12, C63H88CoN14O14P). Using an SV(P) basis (1492 basis functions) and grid m4, the calculation took 18 days and 22 hours. The finer grid m4 was used instead of the coarser m3, as is generally recommended for systems with more than 50 atoms. The RI-J part consumed only 13% of the total computation time, and the matrix algebra connected with Eq. (38) only 3% of the CPU time.
We compare the experimental solid-state infrared absorption spectrum with our computed results; the computed lines are broadened by 30 cm⁻¹ and the intensities are globally scaled to match the experimental data. The peak at 2152 cm⁻¹ corresponds to the CN stretch of the central Co-CN group and agrees well with experiment. At 2731 cm⁻¹ we find an intramolecular O-H stretch. The region around 3170 cm⁻¹ comprises various N-H stretches localized at the molecular surface; intermolecular interactions shift and broaden these modes, as seen in the experimental spectrum.
The experimental solid-state infrared absorption spectrum of cyanocobalamin (solid line) is compared with the computed spectrum (dashed line). Modes localized at the surface of B12, such as the band at 1754 cm⁻¹, are affected by packing effects in the solid. The information that solid-state spectra provide for a detailed theoretical analysis of the vibrations is therefore limited; this conclusion applies quite generally.
IR spectra have also been reported in the polar solvents D₂O, ethanol, and 75% glycerol [42]. There are three peaks, denoted B, C, and D, between 1530 and 1680 cm⁻¹; in this range we find numerous modes, including surface modes affected by solvation.
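The construction of a broadened spectrum from a computed line spectrum, as used above for the comparison with experiment, can be sketched as follows (the line positions and intensities here are invented for illustration, and a Lorentzian line shape is assumed):

```python
import numpy as np

# Sketch: broaden a computed line spectrum (frequencies, intensities)
# with Lorentzians of 30 cm^-1 width to make it comparable to a measured
# spectrum.  The line data below are invented for illustration.

def broaden(freqs, intens, grid, width=30.0):
    spec = np.zeros_like(grid)
    for f, s in zip(freqs, intens):
        # Lorentzian normalized to peak height s at position f
        spec += s * (0.5 * width) ** 2 / ((grid - f) ** 2 + (0.5 * width) ** 2)
    return spec

grid = np.linspace(600.0, 3800.0, 3201)      # 1 cm^-1 resolution
freqs = np.array([1754.0, 2152.0, 3170.0])   # example line positions, cm^-1
intens = np.array([1.0, 0.6, 0.8])

spec = broaden(freqs, intens, grid)

# Each Lorentzian peaks at its line position with the full line height.
i_peak = np.argmin(np.abs(grid - 2152.0))
assert spec[i_peak] > 0.6
```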
We have implemented the MARI-J technique in the TURBOMOLE modules RIDFT and RDGRAD, which serve the optimization of wavefunctions and the computation of nuclear forces within DFT. This greatly reduces the computational effort for the interelectronic Coulomb repulsion, previously the most time-consuming step. Since DFT is primarily employed for larger molecules with over 100 atoms, and since RIDFT and RDGRAD are central to the iterative determination of molecular structures, these improvements have markedly increased the efficiency of TURBOMOLE. The gains are especially pronounced for large systems with up to 1000 atoms, as documented in Figures 2-4.
A further goal of the project was to enable the AOFORCE module to compute second analytical derivatives within DFT, a functionality not previously available. The module has been completely redesigned and is now efficient for closed- and open-shell states treated by HF or DFT. This makes the computation of IR and Raman frequencies feasible for molecules of considerable size, as illustrated in Figure 5.
The timings presented in this study and the related publications are conservative. After the end of the HPC-Chem project the integral routines used for computing ⟨ρ|P⟩ in RI-J energy and gradient calculations were redesigned; these routines are also employed by all other modules using the RI technique. This has further increased efficiency: CPU times for the NF part of MARI-J are reduced by 30% compared to the timings reported in Table 1.
[4] J. Almlöf, K. Faegri, and K. Korsell, J. Comput. Chem. 3, 385 (1982).
[5] M. Häser and R. Ahlrichs, J. Comput. Chem. 10, 104 (1989).
[6] R. Ahlrichs, M. Bär, M. Häser, H. Horn, and C. Kölmel, Chem. Phys. Lett. 162, 165 (1989).
[7] O. Treutler and R. Ahlrichs, J. Chem. Phys. 102, 346 (1995).
[8] F. Haase and R. Ahlrichs, J. Comput. Chem. 14, 907 (1993).
[9] F. Weigend and M. Häser, Theor. Chem. Acc. 97, 331 (1997).
[10] C. Hättig and F. Weigend, J. Chem. Phys. 113, 5154 (2000).
[12] R. Bauernschmitt and R. Ahlrichs, J. Chem. Phys. 104, 9047 (1996).
[13] R. Bauernschmitt and R. Ahlrichs, Chem. Phys. Lett. 256, 454 (1996).
[14] O. Christiansen, H. Koch, and P. Jørgensen, Chem. Phys. Lett. 243, 409 (1995).
[15] K. Eichkorn, O. Treutler, H. Öhm, M. Häser, and R. Ahlrichs, Chem. Phys. Lett. 240, 283 (1995).
[16] B. Dunlap, J. Connolly, and J. Sabin, J. Chem. Phys. 71, 3396 (1979).
[17] J. Mintmire and B. Dunlap, Phys. Rev. A 25, 88 (1982).
[18] O. Vahtras, J. Almlöf, and M. Feyereisen, Chem. Phys. Lett. 213, 514 (1993).
[19] K. Eichkorn, F. Weigend, O. Treutler, and R. Ahlrichs, Theor. Chim. Acta 97, 119 (1997).
[20] F. Weigend, Phys. Chem. Chem. Phys. 4, 4285 (2002).
[21] C. Hättig and A. Köhn, J. Chem. Phys. 117, 6939 (2002).
[22] F. Weigend, M. Häser, H. Patzelt, and R. Ahlrichs, Chem. Phys. Lett. 294, 143 (1998).
[23] F. Weigend, A. Köhn, and C. Hättig, J. Chem. Phys. 116, 3175 (2002).
[24] P. Pulay, G. Fogarasi, F. Pang, and J. E. Boggs, J. Am. Chem. Soc. 101, 2550 (1979).
[25] M. Sierka, A. Hogekamp, and R. Ahlrichs, J. Chem. Phys. 118, 9136 (2003).
[28] S. Vosko, L. Wilk, and M. Nusair, Can. J. Phys. 58, 1200 (1980).
[29] A. Schäfer, H. Horn, and R. Ahlrichs, J. Chem. Phys. 97, 2571 (1992).
[30] M. Strain, G. Scuseria, and M. Frisch, Science 271, 51 (1996).
[31] J. Pérez-Jordá and W. Yang, J. Chem. Phys. 107, 1218 (1997).
[32] C. Baerlocher, W. Meier, and D. Olson, Atlas of Zeolite Framework Types, Elsevier Science, Amsterdam, 2001.
[33] A. Wlodawer, H. Savage, and G. Dodson, Acta Crystallogr. B 45, 99 (1989).
[34] H. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. Bhat, H. Weissig, I. Shindyalov, and P. Bourne, Nucleic Acids Res. 28, 235 (2000).
[35] C. Fonseca Guerra, J. Snijders, G. te Velde, and E. Baerends, Theor. Chem. Acc. 99, 391 (1998).
[36] A. St-Amant and R. Gallant, Chem. Phys. Lett. 256, 569 (1996).
[38] P. Deglmann, F. Furche, and R. Ahlrichs, Chem. Phys. Lett. 362, 511 (2002).
[39] P. Deglmann, K. May, F. Furche, and R. Ahlrichs, Chem. Phys. Lett. 384, 103 (2004).
[40] A. Schäfer, C. Huber, and R. Ahlrichs, J. Chem. Phys. 100, 5829 (1994).
[41] Coblentz Society, Inc., Evaluated Infrared Reference Spectra, in NIST Chemistry WebBook, NIST Standard Reference Database Number 69, edited by P. J. Linstrom and W. G. Mallard, National Institute of Standards and Technology, Gaithersburg, March 2003 (http://webbook.nist.gov).
[42] K. Taraszka, E. Chen, T. Metzger, and M. Chance, Biochemistry 30, 1222 (1991).
QUICKSTEP: Make the Atoms Dance
Matthias Krack and Michele Parrinello
Computational Science, Department of Chemistry and Applied Biosciences
ETH Zürich, USI-Campus, via Giuseppe Buffi 13
E-mail: krack@phys.chem.ethz.ch
Over the past decade, density functional theory (DFT) has emerged as a powerful tool for electronic structure calculations across various fields, including materials science, chemistry, and biochemistry. DFT implementations typically employ either plane waves or Gaussian-type functions to expand the Kohn-Sham orbitals, each offering distinct advantages. Plane waves provide an orthogonal basis set, simplify force calculations, and benefit from the efficient computation of the Hartree potential via fast Fourier transformation (FFT). However, a large number of plane waves is required to describe the strong variation of the wavefunctions near the atomic nuclei, which makes them inefficient for low-density systems such as biological structures. In contrast, Gaussian-type functions offer a much more compact representation of atomic charge densities and eliminate the need for atomic pseudo potentials, although they complicate the force calculations and may introduce basis set superposition errors. The Gaussian plane waves (GPW) method seeks to merge the strengths of both approaches and allows for a construction of the Kohn-Sham operator matrix that scales linearly with system size.
Within the HPC-Chem project, a new implementation of the GPW method called QUICKSTEP was developed, designed for modularity and efficient parallelization. As part of the open-source CP2K project, QUICKSTEP will see ongoing development beyond the conclusion of the HPC-Chem project. The following sections outline the GPW method, describe the pseudo potentials and Gaussian basis sets employed by QUICKSTEP, and conclude with an assessment of the accuracy and efficiency of the new parallelized implementation.
2 Gaussian and plane waves method
The energy functional for molecular or crystalline systems, as defined by the Gaussian plane waves (GPW) method, utilizes the Kohn-Sham formulation of density functional theory (DFT).
E[n] = E^T[n] + E^V[n] + E^H[n] + E^XC[n] + E^II    (1)

where E^T is the kinetic energy, E^V the electronic interaction with the ionic cores, E^H the electronic Hartree (Coulomb) energy, E^XC the exchange–correlation energy, and E^II the interaction energy of the ionic cores. The electronic interaction with the ionic cores is described by norm-conserving pseudo potentials, which consist of a local part V^PP_loc(r) and a fully non-local part V^PP_nl(r, r') (see section 3).
The Kohn-Sham orbitals of Eq. (2) are expanded in a set of contracted Gaussian functions φ_μ(r),

φ_μ(r) = Σ_j d_jμ g_j(r),

which gives the electronic density

n(r) = Σ_μν P_μν φ_μ(r) φ_ν(r),

where P_μν is a density matrix element, g_j(r) is a primitive Gaussian function, and d_jμ is the corresponding contraction coefficient. The density matrix P fulfills normalization and idempotency conditions,

Tr(PS) = N,    (5)

where S is the overlap matrix of the Gaussian basis functions,

S_μν = ∫ φ_μ(r) φ_ν(r) dr,    (6)

and N is the number of electrons.
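As a quick numerical illustration of these conditions, the following sketch builds a toy closed-shell density matrix from S-orthonormal MO coefficients and checks Tr(PS) = N together with the idempotency relation (which picks up a factor 2 from double occupation). All matrices are random stand-ins, not taken from any real basis set.

```python
import numpy as np

rng = np.random.default_rng(0)
nbas, nocc = 6, 2                               # toy basis size, occupied MOs

# Random symmetric positive definite stand-in for the overlap matrix S
A = rng.standard_normal((nbas, nbas))
S = A @ A.T + nbas * np.eye(nbas)

# S-orthonormal MO coefficients via Loewdin orthogonalization: C^T S C = 1
w, V = np.linalg.eigh(S)
C = (V @ np.diag(w**-0.5) @ V.T)[:, :nocc]

# Closed-shell density matrix with occupation number 2 per MO
P = 2.0 * C @ C.T

N_elec = np.trace(P @ S)                        # normalization: Tr(PS) = N
idem_err = np.linalg.norm(P @ S @ P - 2.0 * P)  # idempotency: P S P = 2 P
```
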
In the original work of Lippert et al. [1], an auxiliary basis approximation was applied to both the Hartree and the exchange-correlation energy. This approach was later refined by relaxing that constraint, allowing two independent approximations of the density: one for the Hartree energy and one for the exchange-correlation energy. Both approximate electronic charge densities are functions of the density matrix P.
Like plane wave methods, the GPW method relies on atomic pseudo potentials, since an expansion of Gaussian functions with large exponents is numerically inefficient or even infeasible.
The GPW method currently employs only the pseudo potentials of Goedecker, Teter, and Hutter (GTH). These separable dual-space GTH pseudo potentials consist of a local part V^PP_loc(r) and a non-local part

V^PP_nl(r, r') = Σ_lm Σ_ij ⟨r | p_i^lm⟩ h_ij^l ⟨p_j^lm | r'⟩    (8)

with the Gaussian-type projectors p_i^lm(r); their separable dual-space form makes them computationally efficient in quantum simulations.
The GTH pseudo potentials are fully analytical and require only a small set of parameters for each element. They are transferable and norm-conserving, and have mostly been used with plane wave methods for reference calculations; there, however, they require relatively high cut-off values, i.e. many plane waves. The GPW method does not suffer from this limitation, since the corresponding contributions are calculated analytically as integrals over Gaussian functions. Consequently, GTH pseudo potentials are particularly well suited for QUICKSTEP, which exclusively supports this pseudo potential type. The parameters were optimized against all-electron wavefunctions from fully relativistic density functional calculations; this optimization includes scalar relativistic corrections via an averaged potential, which is essential for applications involving heavier elements. A comprehensive database of GTH pseudo potential parameter sets is available, covering nearly the entire periodic table within the local density approximation (LDA) and including sets optimized for generalized gradient approximation (GGA) functionals such as BLYP, BP, HCTH/120, HCTH/407, and PBE.
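The local part of a GTH pseudo potential can be evaluated in a few lines: an erf-screened Coulomb tail plus a Gaussian-damped even polynomial. The sketch below uses made-up parameters purely for illustration (they are NOT an optimized set from the GTH database), and checks that far from the core the potential reduces to the bare ionic Coulomb tail −Z_ion/r.

```python
import numpy as np
from math import erf

def gth_local(r, z_ion, r_loc, c):
    """Local part of a Goedecker-Teter-Hutter pseudo potential: an
    erf-screened Coulomb term plus a Gaussian times an even polynomial.
    The parameters are illustrative placeholders, not a fitted set."""
    x = r / r_loc
    v = -(z_ion / r) * erf(x / np.sqrt(2.0))
    v += np.exp(-0.5 * x * x) * sum(ck * x ** (2 * k) for k, ck in enumerate(c))
    return v

# Far from the core the potential approaches -Z_ion/r (here -4/10 = -0.4)
v_far = gth_local(10.0, 4.0, 0.35, [-7.0, 1.2])
```
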
Traditional diagonalization (TD)
The traditional diagonalization scheme uses an eigensolver from the standard parallel program library ScaLAPACK to solve the general eigenvalue problem

K C = S C ε,    (9)

where K is the Kohn-Sham matrix and S is the overlap matrix of the system. The eigenvectors C represent the orbital coefficients and the eigenvalues ε the corresponding orbital energies. The overlap matrix S deviates from the unit matrix, since QUICKSTEP uses a non-orthogonal Gaussian-type orbital basis set. It is therefore necessary to transform the eigenvalue problem into its special form:
K' C' = C' ε    (12)    (pdsyevx or pdsyevd)

using a Cholesky decomposition of the overlap matrix,

S = UᵀU,    (13)    (pdpotrf)

which is the default method. Eq. (12) with K' = U⁻ᵀ K U⁻¹ is solved by diagonalization of K', and the orbital coefficients C in the non-orthogonal basis are finally obtained by the back-transformation C = U⁻¹ C'. The names in brackets denote the ScaLAPACK routines employed by QUICKSTEP for the respective operation.
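The Cholesky route can be sketched with NumPy in a few lines; the Kohn-Sham and overlap matrices below are random symmetric stand-ins, and numpy's lower-triangular factor L corresponds to U = Lᵀ in the equations above. The residual of the generalized eigenproblem and the S-orthonormality of the eigenvectors serve as checks.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
K = rng.standard_normal((n, n)); K = 0.5 * (K + K.T)          # mock Kohn-Sham matrix
A = rng.standard_normal((n, n)); S = A @ A.T + n * np.eye(n)  # SPD overlap matrix

L = np.linalg.cholesky(S)            # S = L L^T, i.e. U = L^T (cf. pdpotrf)
Linv = np.linalg.inv(L)

Kp = Linv @ K @ Linv.T               # K' = U^{-T} K U^{-1}
eps, Cp = np.linalg.eigh(Kp)         # ordinary eigenproblem (cf. pdsyevd/pdsyevx)

C = Linv.T @ Cp                      # back-transformation C = U^{-1} C'
resid = np.linalg.norm(K @ C - S @ C @ np.diag(eps))   # K C = S C eps
ortho = np.linalg.norm(C.T @ S @ C - np.eye(n))        # C^T S C = 1
```
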
Alternatively, a symmetric orthogonalization can be applied instead of the Cholesky decomposition by using U = S^(1/2). The calculation of S^(1/2) involves a diagonalization of S, which is computationally more expensive than a Cholesky decomposition. However, the diagonalization of S makes it possible to detect linear dependencies in the basis set, which are often introduced by small Gaussian function exponents: eigenvalues below a certain threshold indicate significant dependencies, and filtering out the corresponding eigenvectors alleviates numerical difficulties during the self-consistent field (SCF) iteration. QUICKSTEP implements both orthogonalization schemes. The performance impact is small even for large systems, since the orthogonalization has to be performed only once per configuration, at the start of the SCF procedure. In contrast, the eigenvectors and eigenvalues of the full Kohn-Sham matrix have to be recalculated in each SCF iteration, using either a divide-and-conquer scheme (pdsyevd) or an expert driver (pdsyevx) that allows for the computation of selected eigenvectors. The divide-and-conquer approach is more efficient when all eigenvectors are needed, for instance to construct the new density matrix.
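A minimal sketch of the symmetric orthogonalization with eigenvalue filtering: the overlap matrix below is deliberately built with one nearly linearly dependent basis function, and eigenvalues under the threshold are projected out when forming S^(-1/2). Dimensions and threshold are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Gram (overlap) matrix of 6 basis vectors, the last one nearly a copy of
# the first -> one near-zero eigenvalue (a linear dependency)
Q, _ = np.linalg.qr(rng.standard_normal((10, 5)))
B = np.hstack([Q, Q[:, :1] + 1e-8 * rng.standard_normal((10, 1))])
S = B.T @ B

w, V = np.linalg.eigh(S)
keep = w > 1e-6                                 # filter dependent combinations
Vk = V[:, keep]
S_inv_sqrt = Vk @ np.diag(w[keep] ** -0.5) @ Vk.T

# S^{-1/2} S S^{-1/2} is a projector onto the filtered (kept) subspace
proj = S_inv_sqrt @ S @ S_inv_sqrt
```
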
If only the occupied orbitals are required, the expert driver performs better, since with standard basis sets merely 10–20% of the orbitals are occupied. However, the orthonormalization of the computed eigenvectors is a time-intensive operation, especially on parallel computers, where it demands significant communication between the processes.
The TD scheme is usually combined with methods designed to improve the convergence of the SCF iteration. The most efficient method to accelerate SCF convergence is the direct inversion in the iterative sub-space (DIIS), which exploits the commutator relation e = KPS − SPK between the Kohn-Sham and the density matrix as an error estimate that vanishes at convergence.
The TD/DIIS scheme is a well-established method for electronic structure calculations, and it is particularly effective close to convergence, i.e. starting from a pre-converged density. However, while it can significantly reduce the number of iterations, its computational cost scales cubically with the size of the basis set, which becomes a challenge as the number of basis functions grows. Moreover, the DIIS method may fail to converge for electronically difficult systems, such as spin-polarized systems or systems with a small energy gap between the highest occupied (HOMO) and the lowest unoccupied (LUMO) orbital, as is common for semiconductors and metals.
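The DIIS extrapolation itself is compact. The sketch below implements the Pulay scheme on hypothetical matrix histories: it solves the bordered least-squares system for mixing coefficients that sum to one and minimize the norm of the combined error, then mixes the stored matrices with the same coefficients.

```python
import numpy as np

def diis_extrapolate(mats, errs):
    """Pulay DIIS: find coefficients c with sum(c) = 1 that minimize the
    norm of the extrapolated error sum_i c_i e_i, then mix the stored
    matrices (e.g. Kohn-Sham matrices) with the same coefficients."""
    m = len(errs)
    B = np.zeros((m + 1, m + 1))
    for i, ei in enumerate(errs):
        for j, ej in enumerate(errs):
            B[i, j] = np.vdot(ei, ej)
    B[m, :m] = B[:m, m] = -1.0                  # Lagrange-multiplier border
    rhs = np.zeros(m + 1); rhs[m] = -1.0
    c = np.linalg.solve(B, rhs)[:m]
    return sum(ci * Mi for ci, Mi in zip(c, mats)), c

# Demo with two random "iterations" (stand-ins for K and e = KPS - SPK)
rng = np.random.default_rng(4)
K1, K2 = rng.standard_normal((5, 5)), rng.standard_normal((5, 5))
e1, e2 = rng.standard_normal((5, 5)), rng.standard_normal((5, 5))
K_mix, c = diis_extrapolate([K1, K2], [e1, e2])
e_mix = c[0] * e1 + c[1] * e2    # extrapolated error, never larger than inputs
```
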
Pseudo diagonalization (PD)
Alternatively to the TD scheme, a pseudo diagonalization can be applied as soon as a sufficiently pre-converged wavefunction is obtained. In each SCF step, the Kohn-Sham matrix in the atomic orbital (AO) basis is transformed into the molecular orbital (MO) basis using the MO coefficients from the preceding SCF step. At convergence, the transformed matrix K^MO is diagonal, with the eigenvalues as its diagonal elements. Already after a few SCF iterations, K^MO becomes diagonally dominant. Moreover, K^MO shows a natural blocking due to the two subsets of molecular orbitals, namely the occupied and the unoccupied (virtual) states.
During the SCF iteration, only the eigenvectors are needed to compute the new density matrix; the eigenvalues are not required. Furthermore, the total energy depends solely on the occupied MOs, so a block diagonalization that decouples the occupied from the unoccupied MOs is sufficient. The wavefunctions are converged as soon as the elements of the occupied–virtual (ov) block vanish, and since the matrix is symmetric, the transformation into the MO basis,

K^ov = C_oᵀ K^AO C_v,    (20)    (pdsymm and pdgemm)

has only to be performed for that matrix block. The decoupling can then be achieved iteratively by consecutive 2×2 Jacobi rotations.
The angle of each Jacobi rotation is determined by the difference of the approximate MO eigenvalues (the diagonal elements of K^MO) and the corresponding matrix element of the ov block, tan(2θ) = 2 K_pq / (K_qq − K_pp). The Jacobi rotations can be executed efficiently with the BLAS level 1 routines DSCAL and DAXPY. The ov block is small compared to the full matrix: since only 10–20% of the MOs are occupied with a standard basis set, the ov (or vo) block contains merely 10–20% of the matrix elements. Moreover, the costly re-orthonormalization of the MO eigenvectors is not needed, since the Jacobi rotations preserve their orthonormality.
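A single step of such a sweep can be sketched as follows: the rotation annihilates one occupied–virtual element of a symmetric mock Kohn-Sham matrix in the MO basis and applies the same rotation to the MO coefficients, so their orthonormality is preserved by construction. Matrix, indices, and dimensions are illustrative.

```python
import numpy as np

def jacobi_rotate(K, C, p, q):
    """Zero the element K[p, q] of the symmetric matrix K by a 2x2 Jacobi
    rotation in the (p, q) plane; rotate the MO coefficient columns of C
    accordingly (the column updates are DSCAL/DAXPY-like operations)."""
    theta = 0.5 * np.arctan2(2.0 * K[p, q], K[q, q] - K[p, p])
    c, s = np.cos(theta), np.sin(theta)
    G = np.array([[c, s], [-s, c]])
    idx = [p, q]
    K[:, idx] = K[:, idx] @ G                   # K <- K J
    K[idx, :] = G.T @ K[idx, :]                 # K <- J^T K J
    C[:, idx] = C[:, idx] @ G                   # keep MOs consistent: C <- C J

rng = np.random.default_rng(5)
K0 = rng.standard_normal((5, 5)); K0 = 0.5 * (K0 + K0.T)
K, C = K0.copy(), np.eye(5)
jacobi_rotate(K, C, 0, 3)    # decouple "occupied" MO 0 from "virtual" MO 3
```

The spectrum is untouched by the rotation, and the rotated coefficients reproduce the transformed matrix exactly, which is why no re-orthonormalization step is needed.
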
Orbital transformations (OT)
The QUICKSTEP implementation also provides an orbital transformation (OT) method, a direct minimization of the wavefunctions with guaranteed convergence. Its scaling depends on the choice of the preconditioner and is governed by the total number of basis functions and the number of occupied MOs, as detailed below. A comprehensive description of the OT method is given in Reference [19].
In the OT method the energy E[C(X)] is minimized under the orthonormality constraint

Cᵀ S C = 1,    (27)

where C, S, and 1 are the matrix of the orbital coefficients, the overlap matrix, and the identity matrix, respectively. Given constant start vectors C₀ with

C₀ᵀ S C₀ = 1,    (28)

a new set of vectors C(X) is obtained by

C(X) = C₀ cos(U) + X U⁻¹ sin(U),   U = (Xᵀ S X)^(1/2),

where the variable X is constrained to satisfy Xᵀ S C₀ = 0.
Since the permissible variables X span a linear space, the energy can be minimized using standard methods such as conjugate gradients combined with a line search. The OT method is thus a direct minimization approach that effectively avoids the shortcomings of the traditional schemes.
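The parametrization above can be checked numerically: with random stand-in matrices, any admissible X (projected so that XᵀSC₀ = 0) yields new vectors C(X) that satisfy CᵀSC = 1 exactly, which is what makes unconstrained minimization over X possible. The matrix functions of U are evaluated through the eigendecomposition of U² = XᵀSX.

```python
import numpy as np

rng = np.random.default_rng(3)
nbas, nocc = 7, 3

A = rng.standard_normal((nbas, nbas))
S = A @ A.T + nbas * np.eye(nbas)                  # SPD overlap matrix
w, V = np.linalg.eigh(S)
C0 = (V @ np.diag(w**-0.5) @ V.T)[:, :nocc]        # start vectors, C0^T S C0 = 1

X = rng.standard_normal((nbas, nocc))
X -= C0 @ (C0.T @ S @ X)                           # enforce X^T S C0 = 0

u2, W = np.linalg.eigh(X.T @ S @ X)                # U^2 = X^T S X
u = np.sqrt(np.clip(u2, 0.0, None))                # eigenvalues of U
cosU = W @ np.diag(np.cos(u)) @ W.T
sincU = W @ np.diag(np.sinc(u / np.pi)) @ W.T      # U^{-1} sin(U), safe for u -> 0

C = C0 @ cosU + X @ sincU                          # C(X) fulfills C^T S C = 1
```
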
The OT scheme converges with certainty, and its scalability depends on the choice of the preconditioner. Sparse–full matrix products such as S·X scale as O(N·n), and full–full matrix products such as Xᵀ(S·X) scale as O(N·n²), where N is the number of basis functions and n the number of occupied MOs. The dominant cost therefore varies with the preconditioner: with a sparse preconditioner the additional cost is roughly linear, whereas a non-sparse preconditioner is computationally more demanding. Whether TD/DIIS or OT performs better depends on several factors, among them the size of the system, the basis set, and the latency and bandwidth of the network.
As a first accuracy test for QUICKSTEP, the geometries of a set of small molecules were optimized with the newly introduced basis sets of section 4, employing the local density approximation (LDA). The CP2K geometry optimizer uses analytic first derivatives, while the second derivatives are approximated by a model Hessian that is improved during the optimization. The test set comprises 39 small molecules.
The molecules, among them HCl and LiCl, were optimized in Cartesian coordinates. Figure 1 shows the bond distances obtained with QUICKSTEP compared to the NUMOL results of Dickson and Becke, a purely numerical DFT code that is considered free of basis set effects. The smallest basis set, DZVP, tends to give slightly longer bond distances on average, while the TZVP basis set already performs well for most molecules. The TZV2P, QZV2P, and QZV3P basis sets show excellent agreement for all bond distances. Figure 2 presents the optimized bond and dihedral angles; here too the DZVP and TZVP basis sets agree remarkably well, with only one outlier, the dihedral angle of H₂O₂.
The dihedral angle is evidently sensitive to the polarization functions employed: a single set of polarization functions proves inadequate, as seen for the DZVP and TZVP basis sets, whereas the TZV2P basis set yields a dihedral angle close to the reference value and the QZV3P basis set gives a converged result. Table 1 summarizes the geometry optimization results, listing the maximum and root mean square deviations of 52 bond distances and 18 bond and dihedral angles with respect to the NUMOL results. As expected, the errors decrease with increasing basis set size: TZV2P already achieves excellent overall agreement, and with QZV3P most distances agree within the expected errors. Complete agreement with the NUMOL values cannot be expected, however, due to differences in the LDA implementation and the frozen core approximation used by NUMOL for elements beyond beryllium, in contrast to the GTH pseudo potentials employed by QUICKSTEP. These differences may cause small changes of the bond distances. This small residual error also shows that the effect of the pseudo potentials is negligible compared to basis set effects.
Table 1: Maximum (Δ_max) and root mean square (Δ_rms) deviation of the bond distances (Å), bond angles, and dihedral angles (°) from the NUMOL results for different basis sets.

basis set   distances [Å] Δ_max / Δ_rms   angles [°] Δ_max / Δ_rms
QZV3P       0.011 / 0.004                 0.7 / 0.3

The basis set can thus be chosen according to the accuracy requirements of the application at hand. The overall accuracy of QUICKSTEP, however, is ultimately determined by the error of the chosen exchange-correlation potential.
Figure 1: The optimized bond distances for 39 small molecules calculated with Q UICKSTEP using different basis sets are compared to the NUMOL results of Dickson and Becke [21].
Figure 2: The optimized bond angles and dihedral angles for 39 small molecules calculated with Q UICKSTEP using different basis sets are compared to the NUMOL results of Dickson and Becke [21].
This section demonstrates that QUICKSTEP not only achieves high accuracy but also excels in computational efficiency. As a benchmark we selected liquid water at ambient conditions, which showcases both the serial performance of QUICKSTEP and its scalability on parallel computers. In addition, performance results for geometry optimizations of various molecular and crystalline systems are presented.
Liquid water
Liquid water is a convenient benchmark system, since it scales easily: the system size can be adjusted by simply doubling the number of water molecules in the unit cell. It is used as a standard benchmark for the CPMD code to evaluate performance and scalability on different parallel computers. Moreover, water is the natural solvent in numerous biochemical applications, and molecular dynamics (MD) simulations are routinely conducted to investigate the properties and behavior of such systems.
Table 2: Detailed characteristics of the employed benchmark systems for liquid water at ambient conditions (300 K, 1 bar): edge length of the cubic simulation cell, number of atoms and electrons, number of Gaussian-type orbitals and occupied orbitals, and number of plane waves (grid points) used to expand the electronic density.
Molecular dynamics simulations of pure liquid water at ambient conditions (300 K, 1 bar) were performed to obtain benchmark numbers with realistic input parameters as used for production runs. All benchmark simulations employed GTH pseudo potentials and a TZV2P basis set for hydrogen and oxygen, corresponding to 40 contracted spherical Gaussian-type orbital functions per water molecule. The high accuracy of the TZV2P basis set was demonstrated in the preceding section. The detailed characteristics of the benchmark systems are summarized in Table 2; the configurations range from 32 water molecules in a cubic unit cell of 9.9 Å edge length to 1024 water molecules in a cubic unit cell of 31.3 Å edge length. A density cut-off of 280 Ry was employed in all benchmark calculations, i.e. the electronic density is expanded in plane waves (grid points) serving as the auxiliary basis set. The orbital basis set grows from 1280 to 40960 Gaussian-type orbital functions, and the matrices involved, such as the overlap and the Kohn-Sham matrix, grow quadratically with system size, making the Kohn-Sham matrix calculation for 1024 H₂O a formidable task.
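The number of density plane waves for a given cut-off follows from counting the reciprocal lattice vectors of the cell with kinetic energy |G|²/2 below the cut-off. The toy counter below (atomic units, cubic cell) illustrates this growth; it is a back-of-the-envelope sketch, not the actual grid logic of QUICKSTEP, and the 9.9 Å cell edge is taken from the 32 H₂O system above.

```python
import numpy as np

def n_plane_waves(edge_bohr, cutoff_ry):
    """Count G-vectors of a cubic cell with |G|^2/2 <= E_cut for the
    density expansion (1 Ry = 0.5 Hartree; atomic units throughout)."""
    e_cut_ha = 0.5 * cutoff_ry
    g_unit = 2.0 * np.pi / edge_bohr             # reciprocal lattice spacing
    n_max = int(np.sqrt(2.0 * e_cut_ha) / g_unit) + 1
    n = np.arange(-n_max, n_max + 1)
    i, j, k = np.meshgrid(n, n, n, indexing="ij")
    g2 = g_unit**2 * (i**2 + j**2 + k**2)
    return int(np.count_nonzero(0.5 * g2 <= e_cut_ha))

# 32 H2O cell (9.9 Angstrom ~ 18.7 bohr) at the 280 Ry density cut-off:
# already several hundred thousand plane waves for the smallest system
npw = n_plane_waves(18.7, 280.0)
```
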
To handle such large matrices efficiently, it is crucial to exploit the localized nature of the atomic interactions. Table 3 shows the occupation of the overlap matrix for each benchmark system, using the TZV2P basis set and a numerical threshold for the overlap integral of two primitive Gaussian functions. For the systems with 32 and 64 H₂O in the unit cell, each water molecule still interacts with all others; a water molecule interacts with roughly 200 other water molecules before its interaction sphere is exhausted. As the system size increases, more and more water molecules in the unit cell cease to interact with one another. This is reflected in the overlap matrix occupations: beginning with 256 H₂O, the occupation is halved for each doubling of the simulation cell. From there on the number of interactions grows only linearly with system size, and the sparsity of the matrices increases continuously. QUICKSTEP efficiently exploits this matrix sparsity, but the effect pays off only for simulations of more than roughly 200 water molecules in the unit cell. A further important point is the number of occupied orbitals.
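The halving of the occupation with each cell doubling can be reproduced with a toy screening model: overlaps of s-type Gaussians on a 1D chain decay exponentially with the squared distance, so beyond a fixed radius all pairs fall under the threshold and the occupied fraction scales as 1/N. Exponent, spacing, and threshold below are illustrative choices, not QUICKSTEP defaults.

```python
import numpy as np

def overlap_occupation(n_atoms, spacing, alpha=0.5, eps=1e-12):
    """Fraction of pair overlaps |S_ij| above the screening threshold for
    s-type Gaussians with common exponent alpha on a 1D chain (toy model):
    S_ij = (pi/(2 alpha))^{3/2} exp(-alpha R_ij^2 / 2)."""
    x = spacing * np.arange(n_atoms)
    r2 = (x[:, None] - x[None, :]) ** 2
    s = (np.pi / (2.0 * alpha)) ** 1.5 * np.exp(-0.5 * alpha * r2)
    return float(np.mean(np.abs(s) > eps))

occ_64 = overlap_occupation(64, 3.0)
occ_128 = overlap_occupation(128, 3.0)   # doubled "cell": occupation halves
```
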
Table 3: Occupation of the overlap matrix applying a numerical threshold for the overlap contribution of two primitive Gaussian orbital functions.

system   occupation
With the TZV2P basis set, only about 10% of the orbitals are occupied in these benchmarks, i.e. the number of occupied orbitals is much smaller than the total number of orbitals (see Table 2). Operations involving only the occupied orbitals are therefore much cheaper than operations on the full matrices, which is a key performance aspect when comparing the eigensolvers in QUICKSTEP. Timings for the benchmark systems were measured on the IBM Regatta p690+ system Jump at the Research Centre Jülich, which comprises 39 compute nodes with 32 interconnected Power4+ processors (1.7 GHz) each. The results, presented on a double logarithmic scale, illustrate the scaling of the TD and PD schemes. Each MD step involved a complete wavefunction optimization and the calculation of the forces on each atom; the total energy was converged to 10⁻⁶ a.u., and the deviation of the electron count for the converged density was smaller than 10⁻⁴. Ten MD steps were run for each benchmark system, except for the 1024 H₂O case.
A time step of 0.5 fs was used, and the CPU timings of the last five MD steps were averaged. Figure 3 shows the CPU time per MD step for various numbers of CPUs and system sizes, obtained with the traditional diagonalization (TD) and the pseudo diagonalization (PD) scheme. The available memory per CPU limited the minimum number of CPUs on which the larger systems could be run, whereas the small systems with 32 and 64 H₂O can efficiently be run on a small number of CPUs. 64 H₂O need roughly one CPU minute per MD step, i.e. 2 CPU minutes per fs of simulation time, when using 16 CPUs. The larger systems with 128 and 256 H₂O run efficiently on 32 and 64 CPUs, respectively. However, 14 minutes per MD step for 256 H₂O do not allow one to obtain appropriate MD trajectories in reasonable time. It was not possible to run 512 H₂O on 256 CPUs with the TD scheme, which is based on ScaLAPACK/BLACS, since the distribution of several full matrices has to be kept in memory during the SCF procedure, and this exceeded the available memory.
A direct comparison of the two panels of Figure 3 shows that the PD scheme scales slightly better than the TD scheme.¹ The small systems with 32 and 64 H₂O scale up to 32 CPUs, and the largest system with 256 H₂O scales up to 128 CPUs with the PD scheme. Nevertheless, the absolute CPU times per MD step of the PD scheme are close to those of the TD scheme, even though the PD scheme requires less communication; only for 256 H₂O does the PD scheme show significantly shorter CPU times per MD step. Moreover, the PD scheme can only be applied to sufficiently pre-converged wavefunctions: the TD scheme has to be employed until that convergence is reached, so there is no speed-up during the first SCF iteration steps. In addition, the switch to the PD scheme requires a one-time diagonalization of the Kohn-Sham matrix including the calculation of all eigenvectors, which is computationally expensive. The orthonormalization of such a large eigenvector set is particularly resource-intensive, typically consuming two to three times the CPU time of a normal TD SCF step, and becomes a bottleneck for larger systems. Once the PD scheme is established, however, the subsequent iteration steps are cheaper.

¹ This benchmark was run on the Jump system before the major software update (PTF7) in July 2004, which improved the MPI communication performance significantly.

Figure 3: CPU time per MD step for the liquid water benchmark systems of Table 2 using the traditional diagonalization (TD) and the pseudo diagonalization (PD) scheme, measured on an IBM Regatta p690+ system with 32 Power4+ processors (1.7 GHz) per node, interconnected by an IBM High Performance Switch (HPS).
The cost of the PD steps decreases further as the number of matrix elements to be processed by the Jacobi rotations diminishes. Typically, an MD step requires about eight SCF iterations, including two to three normal TD steps and one costly TD step for the complete eigenvector set. Consequently, only four to five SCF steps remain for the accelerated PD scheme, resulting in only a small advantage over the pure TD scheme for most of the test systems.
By contrast, the OT method shows a much better performance, as shown in Figure 4. The