# ParaBands

ParaBands is a code that builds and diagonalizes the DFT Hamiltonian. Instead of using iterative algorithms like most DFT codes, ParaBands builds the dense DFT Hamiltonian in a plane-wave basis set and calls a direct solver to obtain the requested number of empty bands. This approach is suitable when the desired number of bands is a large fraction (more than ~5%) of the size of the Hamiltonian.
ParaBands has the following features:

- Support for three different classes of solvers:
    - LAPACK/ScaLAPACK (Bisection, Divide and Conquer, and MR3)
    - ELPA
    - PRIMME
- Restart feature over k-points
- Fully parallel HDF5 output for the wave function file
- Hybrid MPI and OpenMP parallelism
ParaBands also hosts the stochastic pseudobands code. Stochastic pseudobands is a method to dramatically accelerate many-body perturbation theory (MBPT) calculations, such as the GW and GW-BSE calculations performed by BerkeleyGW. The method reduces the scaling of GW calculations from quartic to quasi-quadratic and reduces the algorithmic prefactor, with speedups of 10-1000× for typical system sizes of tens to hundreds of atoms, with negligible error. The use of stochastic pseudobands is HIGHLY RECOMMENDED. See the stochastic pseudobands documentation for details.
Note that you MUST compile BerkeleyGW with HDF5 support, otherwise ParaBands will not be able to write any wave function file. Also, ParaBands currently only works on top of calculations performed with Quantum ESPRESSO.
ParaBands was written by Felipe H. da Jornada, and it is inspired and motivated by the SAPO code written by Georgy Samsonidze. Stochastic pseudobands was written by Aaron R. Altman and Felipe H. da Jornada.
## Basic usage
In order to run ParaBands, you will first need to perform a mean-field calculation with Quantum ESPRESSO with the desired k-points and a small number of bands (containing at least one fully empty band), and export the following quantities:

- `WFN_in`: the wave function file (we need this to get the electron occupations);
- `VSC`: the self-consistent potential in G-space; and
- `VKB`: the Kleinman-Bylander nonlocal projectors in G-space.

Note that only QE can export the VSC and VKB files right now.
If you plan on doing self-energy calculations, you should also export the exchange-correlation potential in G-space (`VXC`), because degenerate wave functions generated by ParaBands will likely differ from those generated by the mean-field program, so the matrix elements in `vxc.dat`, $\langle n\mathbf{k}|V_{\mathrm{xc}}|n'\mathbf{k}\rangle$, will not be consistent.
The recommended input for `pw2bgw.inp` has the following structure:

```
&input_pw2bgw
  prefix = 'SYSTEM_NAME'
  wfng_flag = .true.
  wfng_file = 'WFN_in'
  real_or_complex = 2
  wfng_kgrid = .true.
  wfng_nk1 = 6
  wfng_nk2 = 6
  wfng_nk3 = 6
  wfng_dk1 = 0.0
  wfng_dk2 = 0.0
  wfng_dk3 = 0.0
  rhog_flag = .false.
  vxc_flag = .false.
  vxcg_flag = .true.
  vxcg_file = 'VXC'
  vscg_flag = .true.
  vscg_file = 'VSC'
  vkbg_flag = .true.
  vkbg_file = 'VKB'
/
```
The documentation for the ParaBands input file is in `parabands.inp`. Note that only the input and output file names are actually required to run ParaBands.
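For concreteness, a minimal `parabands.inp` could look like the sketch below. The keyword names here are assumptions based on typical ParaBands examples; verify them against the `parabands.inp` documentation before use:

```
# Hypothetical minimal input; check keyword names in the parabands.inp docs.
input_wfn_file WFN_in
vsc_file VSC
vkb_file VKB
output_wfn_file WFN.h5
```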
Note that the output wave function file is written in HDF5 format. However, only Fortran binary I/O of wave functions is supported in the rest of BerkeleyGW. You will need to use the utility `hdf2wfn.x` to convert to Fortran binary format.
## FAQs
Q) How many atoms can ParaBands handle?
A) ParaBands does not care (much) about the number of atoms. The bottleneck is the size of the Hamiltonian in G-space.
Q) Ok, what is the maximum size of the Hamiltonian ParaBands can handle?
A) This is equivalent to asking "how long does it take to diagonalize a matrix of size N-by-N?". Here are some benchmarks on the Cori 2 supercomputer at NERSC, using the ELPA solver, for a single k-point, and obtaining all bands:

| N      | Nodes | Time   |
|--------|-------|--------|
| 17000  | 4     | 10 s   |
| 60000  | 16    | 10 min |
| 100000 | 64    | 15 min |
| 150000 | 256   | 17 min |
| 200000 | 512   | 25 min |
| 250000 | 512   | 38 min |
| 320000 | 512   | 68 min |
| 416000 | 2560  | 80 min |
Note that I/O time is not included.
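As a back-of-the-envelope check before choosing a node count, you can estimate the memory footprint of the dense Hamiltonian: a complex double-precision N-by-N matrix takes 16 N² bytes, before any eigenvector storage or solver workspace. A minimal sketch (the function name and the workspace factor are illustrative, not ParaBands quantities):

```python
def dense_hamiltonian_gib(n: int, workspace_factor: float = 1.0) -> float:
    """Memory (GiB) to store a dense complex double-precision n-by-n matrix.

    Each complex(8) entry takes 16 bytes; `workspace_factor` is a
    hypothetical multiplier to account for eigenvectors and solver workspace.
    """
    return 16.0 * n * n * workspace_factor / 1024**3

# The N = 100000 benchmark row corresponds to ~149 GiB for the matrix
# alone, far more than the memory of a single typical compute node.
print(f"{dense_hamiltonian_gib(100_000):.0f} GiB")  # → 149 GiB
```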
Q) How do I determine the size of the Hamiltonian before I run ParaBands?
A) The easiest way is to run an SCF calculation with QE and look for the "Parallelization info" section at the very top of the output. Example:
```
Parallelization info
--------------------
sticks:   dense  smooth     PW     G-vecs:    dense   smooth      PW
Min         282     282     70               19098    19098    2384
Max         284     284     74               19112    19112    2396
Sum       18167   18167   4543             1222827  1222827  152845
```
If your calculation has a single k-point, the row "Sum", column "PW" gives the number of G-vectors needed to represent the wave function (152845), which is the size of the Hamiltonian constructed by ParaBands. If you have more k-points in your calculation, that number counts all possible G-vectors needed to represent any wave function at any k-point, which is typically ~10% larger than the number of G-vectors at a given k-point.
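If you script your workflow, this lookup can be automated. A minimal sketch in Python, using the sample output above (the parsing is generic: it just takes the last number on the "Sum" row):

```python
def hamiltonian_size(qe_output: str) -> int:
    """Return the wave function G-vector count (the dense Hamiltonian size)
    from the "Parallelization info" block of a pw.x output: the last number
    on the "Sum" row."""
    for line in qe_output.splitlines():
        if line.strip().startswith("Sum"):
            return int(line.split()[-1])
    raise ValueError("no 'Sum' row found in output")

sample = """\
Parallelization info
--------------------
sticks:   dense  smooth     PW     G-vecs:    dense   smooth      PW
Min         282     282     70               19098    19098    2384
Max         284     284     74               19112    19112    2396
Sum       18167   18167   4543             1222827  1222827  152845
"""
print(hamiltonian_size(sample))  # → 152845
```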
Q) Are there any other caveats?
A) ParaBands sometimes doesn't work with the real version of the code. If you see a warning that the difference between input and output energies is large, try running ParaBands again with the complex version of the code and complex input wave functions.
## Performance tips

- The best solver algorithm for ParaBands is ELPA. Make sure you compile BerkeleyGW with support for ELPA (see section below).
- If you are using a parallel file system, make sure to stripe the output directory before running ParaBands.
- Out of the ScaLAPACK solvers, Bisection is typically the best. But you might experience some performance gain with the Divide and Conquer algorithm if you ask for all bands from your DFT Hamiltonian.
- When using a direct solver (LAPACK or ScaLAPACK, which is the default), you won't save much run time on the diagonalization if you ask for a smaller number of bands. But you may save some run time overall because of the I/O.
- The performance of each solver depends a lot on the LAPACK/ScaLAPACK implementation, MPI stack, and interconnect. You are encouraged to test all solvers on a small system when you run ParaBands on a new architecture.
- Use the default `block_size` of 32 when running ParaBands without threads. If you increase the number of threads, you'll need to increase the block size to 64 or 128 to get any performance gain. Note that the memory required to run ScaLAPACK increases with the `block_size`!
## ELPA Solver

Using the ELPA solver is HIGHLY RECOMMENDED. To enable it, you should:

1. Download and compile ELPA VERSION 2017.05.001 OR ABOVE.
2. Add the flag `-DUSEELPA` to the `MATHFLAG` variable in your `arch.mk` file.
3. Add a variable `ELPALIB` to your `arch.mk` which points to `libelpa.a`.
4. Add a variable `ELPAINCLUDE` to your `arch.mk` which points to the directory that contains the modules generated by ELPA. This typically has the structure `ELPA_PREFIX/include/elpa-VERSION/modules`.
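For concreteness, the corresponding `arch.mk` lines could look like the sketch below; the `ELPA_PREFIX` path and the version directory are placeholders for your own installation:

```
# Hypothetical paths; adjust ELPA_PREFIX and the version to your install.
ELPA_PREFIX = /path/to/elpa
MATHFLAG += -DUSEELPA
ELPALIB = $(ELPA_PREFIX)/lib/libelpa.a
ELPAINCLUDE = $(ELPA_PREFIX)/include/elpa-2017.05.001/modules
```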
You can control the kernel that ELPA uses by setting the environment variables `COMPLEX_ELPA_KERNEL` and `REAL_ELPA_KERNEL`. Take a look at `ELPA/src/elpa2_utilities.F90` for the available kernels for your architecture. A somewhat safe and good option is to include the following lines in your submission script:

```
export REAL_ELPA_KERNEL=REAL_ELPA_KERNEL_SSE
export COMPLEX_ELPA_KERNEL=COMPLEX_ELPA_KERNEL_SSE
```
## MR3 Solver

Not all ScaLAPACK implementations support the algorithm of Multiple Relatively Robust Representations (MR3). You'll need to include the flag `-DUSEMR3` in the `MATHFLAG` variable in your `arch.mk` file to enable this solver. From our experience, this solver does not increase the performance of ParaBands by much, so we make the older Bisection the default solver for ScaLAPACK.