SyTen
FAQ

Table of Contents

Overview, Design and Comprehension Questions

What is a truncation threshold, weight and maximal number of states?

The toolkit supports four different criteria for discarding singular values. Any one of these criteria is sufficient to discard a particular singular value. Two are only sensible relative to the norm of the input tensor, which is why it is useful to call Truncation::scale() prior to an SVD, passing in the norm of the tensor.

The four criteria are:

  • maximal blocksize: truncMaxBlocksize or b in DMRG stage stanzas, sometimes the -b option: If the maximal size of a dense block exceeds this size, the smallest singular values are discarded.
  • maximal states: truncMaxStates or m in DMRG stage stanzas, usually -m option: If the sum of the sizes of all dense blocks exceeds this size, the smallest singular values are discarded.
  • threshold: truncThreshold or t in DMRG stage stanzas, usually -t option: If a singular value is smaller than this, it will be discarded.
  • discarded weight: truncWeight or w in DMRG stage stanzas, usually the -w option: The smallest singular values are discarded until the sum of their squares reaches just below this value. Note that this is usually a very small number. If you truncate a normalised MPS with specified weight \( \omega \) at each of its \( L \) bonds, you will incur an error \( \left|\left| \left|\psi\right\rangle - \left|\psi^\prime\right\rangle \right|\right| \) of approximately \( \sqrt{2 L \omega} \).

Also note that the error typically reported is calculated as \( \left( \sum_{i=1}^L \right) \sqrt{2 - 2 \sqrt{1 - \omega/N^2}} \) where \(N\) is the norm of the state and we either report a single truncation (without the sum) or the maximal accumulation of all truncations (with the sum). This is of roughly the same order as \( \sqrt{2 L \omega} \), but in the case of multiple truncations, both expressions are upper bounds, not the exact error. You can get the exact error from syten-overlap --norm --subtract original.mps truncated.mps which calculates \( 1 - \left\langle \psi^\prime \middle| \psi \right\rangle / \left(\left|\left| \left|\psi\right\rangle \right| \right| \cdot \left|\left| \left|\psi^\prime\right\rangle \right| \right|\right) \). Taking the square root of this value will give you the approximate error.
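As an illustration of how these criteria combine, here is a minimal, self-contained C++ sketch (the function name keepCount and all details are hypothetical, not the actual syten::Truncation implementation): given the singular values of a single bond, it applies the maximal-states, threshold and discarded-weight criteria; the blocksize criterion additionally needs the symmetry block structure and is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch, not the actual syten::Truncation code: return how
// many singular values of one bond survive truncation.
std::size_t keepCount(std::vector<double> sv, std::size_t maxStates,
                      double threshold, double weight)
{
    std::sort(sv.begin(), sv.end(), std::greater<double>());
    // truncMaxStates / -m: keep at most maxStates values.
    std::size_t keep = std::min(maxStates, sv.size());
    // truncThreshold / -t: discard values smaller than the threshold.
    while (keep > 0 && sv[keep - 1] < threshold)
        --keep;
    // truncWeight / -w: discard the smallest values while the sum of
    // their squares stays below the requested discarded weight.
    double discarded = 0.;
    while (keep > 0 && discarded + sv[keep - 1] * sv[keep - 1] < weight) {
        discarded += sv[keep - 1] * sv[keep - 1];
        --keep;
    }
    return keep;
}
```

For example, with singular values {0.9, 0.3, 0.01, 0.001} and a discarded weight of 1e-4, only the smallest value is dropped, since also discarding 0.01 would push the accumulated weight above 1e-4.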

How can I specify an operator on the command line? Which functions are available?

Many tools require a tensor product operator (at the moment either MPO or BTTO) as an input. Tensor product operators are stored in lattice files and are accessed using the generic <lattice file>:<operator expression> syntax. <lattice file> here is the usual name of the file, <operator expression> can be the name of a specific global operator or a more complicated expression. This expression is parsed using reverse prefix notation. That is, assuming two operands a and b and a binary operation f, the result f(a, b) is expressed in this syntax as b a f. RPN has the advantage that no parentheses are necessary to uniquely specify operation ordering and that writing a parser for it is extremely simple. For example, the expression I 3 * takes the identity operator from the lattice and multiplies it by three. The expression I 3 * H + does the same as above and then adds the result to the operator defined on the lattice as H.

The operand types of the operator parser are:

  • unevaluated strings, e.g. H or q3,0
  • numbers, e.g. 3 or (0,-2.123e-4)
  • sectors, e.g. the \( \mathrm{U}(1)_N = 3 ; \mathrm{SU}(2)_S = \frac{1}{2} \) sector
  • TPOs, e.g. Hamiltonians, constructed single-site operators etc.

By default, each element of the operator expression is interpreted as a number; if that conversion fails, the element is stored as a string. Strings can implicitly be converted to numbers, sectors and TPOs. If after evaluation of all operations the final operand is not a TPO, it will implicitly be converted into a TPO.
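The evaluation order can be illustrated with a toy stack evaluator (a hypothetical sketch, not the actual parser, restricted to plain numbers with + and *; the real parser additionally handles strings, sectors and TPOs):

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Toy sketch of reverse-prefix evaluation: operands push onto a stack, an
// operation f pops its arguments. For "b a f" the result is f(a, b): the
// element pushed last becomes the first argument.
double evalToy(const std::string& expr)
{
    std::vector<double> stack;
    std::istringstream in(expr);
    std::string tok;
    while (in >> tok) {
        if (tok == "+" || tok == "*") {
            double a = stack.back(); stack.pop_back(); // pushed last -> first argument
            double b = stack.back(); stack.pop_back();
            stack.push_back(tok == "+" ? a + b : a * b);
        } else {
            stack.push_back(std::stod(tok)); // everything else: a number
        }
    }
    return stack.back();
}
```

For example, evalToy("1 2 3 * +") mirrors the structure of I 3 * H + above: the inner product is evaluated first, then added to the remaining operand.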

Matrix Product Operators

For matrix product operators, the following functions/operators are available:

  • <x> ? or <x> _operator: explicitly interprets the string <x> as a global TPO. Example: H ?, N ?, both equivalent to just H and N.
  • <x> <y> @ or <x> <y> _at_site: if <y> can be parsed as a number, constructs the single-site operator <x>_<y>, otherwise constructs the single-site operator <y>_<x>. Example: sz 0 @, 17 s @, n 2 _at_site.
  • <x> # or <x> _sector: Converts <x> starting at its second position into a quantum number sector description with respect to the current lattice. Example: q0,3 or a15. Note that the first position is ignored but necessary to avoid interpreting e.g. 15 as a number (which cannot be converted into a sector any more).
  • <x> … or <x> _distribute. Builds the MPO \( x \mathbb{1} \), i.e. the identity times <x>, with the absolute value of the factor distributed evenly over all tensors. The phase is stored in the first tensor. Example: 15 _distribute.
  • <x> : or <x> _truncate. Explicitly truncates the MPO <x>, even if the parsing component was invoked with truncation disabled. Useful to (pre-)truncate the arguments for syten-truncate or syten-print, which normally don't truncate in the parser. Example: sz 0 @ sz 1 @ + _truncate.
  • <a> <b> + adds two numbers or two operators together.
  • <a> <b> * or <a> <b> _mult. If <a> and <b> are both numbers, calculates the product of the two numbers. If one is an operator and the other a number, scales the operator by the number. If both are operators, calculates \( a b \) (or, in SyTen C++, b * a). That is, <b> is applied first and <a> is applied second. Example: 2 3 * is 6, 4 I * is the identity scaled by four, A B * is \( A B (|\psi\rangle)\), chu 0 @ cu 0 @ * is nu 0 @.
  • <a> <b> · or <a> <b> _dot calculates the dot product of two operators <a> and <b> as dot(b, a), which applies b first and dual-space-conjugated a second. Example: c 0 @ c 1 @ _dot is \( c_0^\dagger \cdot c_1 \), if the dual-space conjugate of cu is chu, cu 1 @ cu 0 @ _dot is the same as chu 1 @ cu 0 @ *. If <b> is already a sector, the function forwards to _dotq, see below.
  • <a> <b> <q> _dotq calculates the dot product of two operators <a> and <b> projected into the sector <q> as dot(b, a, q). Example: c 0 @ c 0 @ q0,1 _dotq is proportional to s 0 @ in a Fermi-Hubbard system and hence c 0 @ c 0 @ q0,1 _dotq c 0 @ c 0 @ q0,1 _dotq _dot 0.75 * is the same as s 0 @ s 0 @ _dot or \( (\hat s_0)^2 \). The first position of <q> is ignored but potentially necessary to avoid interpretation as a number. The string following the first position is interpreted in exactly the same way as in e.g. syten-random -s or syten-apply-op -s.
  • <a> ^m2 or <a> _msquare is the same as <a> <a> *, i.e. calculates the square of a number or an operator. Example: H 1.2345 -1 * _distribute _msquare calculates \( \left( \hat H - 1.2345 \right)^2 \).
  • <a> ^d2 or <a> _dsquare calculates either dot(a, a) (for operators) or a * conj(a) (for numbers).

Binary Tree Tensor Operators

For binary tree tensor operators, the following operations are defined:

  • <x> ? or <x> _operator: explicitly interprets the string <x> as a global TPO. Example: H ?, N ?, both equivalent to just H and N.
  • <x> <y> @ or <x> <y> _at_site: if <y> can be parsed as a number, constructs the single-site operator <x>_<y>, otherwise constructs the single-site operator <y>_<x>. Example: sz c2:3 @, c17:1 s @, n c2:1 _at_site. For details of the accepted coordinates, see operator[].
  • <x> # or <x> _sector: Converts <x> starting at its second position into a quantum number sector description with respect to the current lattice. Example: q0,3 or a15. Note that the first position is ignored but necessary to avoid interpreting e.g. 15 as a number (which cannot be converted into a sector any more).
  • <a> <b> * or <a> <b> _mult. If <a> and <b> are both numbers, calculates the product of the two numbers. If one is an operator and the other a number, scales the operator by the number. If both are operators, calculates \( a b \) (or, in SyTen C++, b * a). That is, <b> is applied first and <a> is applied second. Example: 2 3 * is 6, 4 I * is the identity scaled by four, A B * is \( A B (|\psi\rangle)\), chu c0:0 @ cu c0:0 @ * is nu c0:0 @.
  • <a> <b> · or <a> <b> _dot calculates the dot product of two operators <a> and <b> as dot(b, a), which applies b first and dual-space-conjugated a second. Example: c c0:0 @ c c0:1 @ _dot is \( c_0^\dagger \cdot c_1 \), if the dual-space conjugate of cu is chu, cu c0:1 @ cu c0:2 @ _dot is the same as chu c0:1 @ cu c0:2 @ *. If <b> is already a sector, the function forwards to _dotq, see below.
  • <a> <b> <q> _dotq calculates the dot product of two operators <a> and <b> projected into the sector <q> as dot(b, a, q).
  • <a> <b> + adds two numbers or two operators together.

Common Examples

  • The Hamiltonian of a system can often be split into a fixed part and a part which one wants to scale by a scalar. Examples include the Hubbard- \( U \), i.e. \( \hat H = \hat H_{\mathrm{kin}} + U \hat H_{\mathrm{int}} \), dynamic laser fields and more. In such situations, it is advisable to store the individual parts, e.g. Hkin and Hint as global operators in the lattice. The full Hamiltonian can then be constructed on the command line as

    $ syten-dmrg -l fh:'Hkin Hint 4 *' …
    

    for \( \hat H_{\mathrm{kin}} + 4 \hat H_{\mathrm{int}} \). The parameter can then be changed as desired from the command line without regenerating the lattice.

  • Single-site operators include e.g. sp-u1:sz 2 @ for \( \hat s^z_2 \) on a \(\mathrm{U}(1)_{S^z}\)-symmetric spin lattice.
  • _dot may be used to dual-space conjugate an operator: A I _dot multiplies operators A and the identity I (defined on every lattice), but dual-space conjugates A in the process. The result is the proper \( A^\dagger \).
  • _dot and _dotq may also be used to project an operator onto a specific quantum number sector, e.g. I A q4 _dot effectively projects A onto the 4 sector.

The toolkit uses complex scalars by default? Why and how can I disable it?

The decision whether to use real scalars in tensors (double) or complex scalars (std::complex<double>) is currently a compile-time decision, i.e. there is no run-time introspection. This implies that once you have decided to use real-valued scalars, you cannot do certain things with your wave function, e.g. apply an operator which requires complex numbers or do real-time evolution.

In comparison, if you are using only complex-valued scalars, all these things are possible even if you did not initially think of them. Hence, by default, the toolkit uses complex scalars. If you are sure that you only want to use real-valued scalars in your tensors, set -DSYTEN_COMPLEX_SCALAR=0 (c.f. SYTEN_COMPLEX_SCALAR).

Which time-evolution algorithm should I use?

Short answer: Use syten-tdvp or syten-krylov (for gapped systems or if your error requirements are very stringent). If you only wish to do imaginary-time evolution, consider SYTEN_COMPLEX_SCALAR.

Long answer: There are currently five time evolution implementations in the toolkit:

  • syten-tebd: Implements the classical TEBD algorithm and a real-space parallelised variant thereof of orders 1, 2 and 4, but requires the time-evolution operator (usually the Hamiltonian) to be split by hand into an ‘even’ and an ‘odd’ part, with those parts saved as individual MPOs in the toolkit. Furthermore, it currently only works on spin chains, as it requires all local operators occurring in the Hamiltonian ( \( \hat S^z, \hat S^+, \hat S^-\) etc.) to have zero overlap with the identity, i.e. \( \mathrm{tr}\left( \hat O \hat I \right) = 0 \). Consider this to be currently in the testing phase, though it should work (at least the unparallelised method) reliably if your system permits it.
  • syten-tebd-swappable: Also implements the standard TEBD algorithm (1st and 2nd order) but handles long-range interactions and arbitrary bosonic systems (Fermions may also work, I don't know). Requires an explicit list of summands of the Hamiltonian to be written to a text file first, which is then parsed and each summand exponentiated individually. See commit 00b6ea17 for details regarding the terms file. Unfortunately due to the inherent TEBD error of O(t^2) at best, it requires smaller time-steps than those which would be possible using the Krylov method and hence is usually slower.
  • syten-tdvp: Based on the TDVP principle applied to MPS as described by Jutho Haegeman et al. Both the one-site and two-site variants are implemented. The one-site variant is very fast, but incurs a fairly large error, as it does not increase the bond dimension. The two-site variant is slower, but supposedly exact if there are only nearest-neighbour interactions. However, we currently do not quantify the projection error properly.
  • syten-taylor: Based on a simple Taylor series expansion of \( \hat U = \mathrm{exp}\left\{ -i \hat H t \right\} \) and then repeated application of the constructed \( \hat U \). Extremely fast if you need many very short time steps, but the error is not well-controlled – what may seem like a decent time evolution may veer off uncontrollably within just a few time steps. Worse, whether this happens can depend sensitively on the chosen time step: changing it only slightly may make the problem appear or vanish entirely. However, if the time step is very small and you need very many steps, the method has the advantage of only building \( \hat U \) once in the beginning.
  • syten-krylov: Based on the MPS-supported Krylov subspace method. At each step, the Krylov subspace is built up and the Hamiltonian exactly diagonalised within this subspace. The method is well-controlled and you can in principle push every error to zero. In principle, the method allows arbitrarily large time steps, but may require very many Krylov vectors (iterations) to achieve this result. Various options for the internal eigensolver are available, check the output of syten-krylov --help, but the defaults should work reliably, though not necessarily as fast as possible. If your iteration exits with "maxIter" (last column of output), decrease the size of the time step or increase the maximal number of Krylov vectors to achieve a converged result. If your initial state is reasonably complex, you may want to try the 'W' or 'V' orthonormalisation submodes instead of the default 'A' mode. The method works best in gapped systems with little entanglement.
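The idea behind syten-taylor can be sketched on a single 2x2 Hamiltonian (a hypothetical toy example with made-up helper names; the actual tool of course operates on MPOs): build \( \hat U \) once as a truncated Taylor series, then apply it repeatedly. The truncation order is the main error control.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <complex>

using Cplx = std::complex<double>;
using Mat2 = std::array<std::array<Cplx, 2>, 2>;

Mat2 matmul(const Mat2& a, const Mat2& b)
{
    Mat2 r{}; // value-initialised to zero
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            for (int k = 0; k < 2; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

// Toy version of the Taylor idea: U = sum_k (-i t)^k H^k / k! up to 'order'.
Mat2 taylorU(const Mat2& h, double t, int order)
{
    Mat2 u{};
    u[0][0] = 1.; u[1][1] = 1.; // zeroth-order term: the identity
    Mat2 term = u;
    for (int k = 1; k <= order; ++k) {
        term = matmul(term, h); // term_k = (-i t / k) * term_{k-1} * H
        for (auto& row : term)
            for (auto& x : row)
                x *= Cplx(0., -t) / double(k);
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j)
                u[i][j] += term[i][j];
    }
    return u;
}

// Example Hamiltonian used for illustration: sigma_z.
inline Mat2 sigmaZ()
{
    Mat2 h{};
    h[0][0] = 1.;
    h[1][1] = -1.;
    return h;
}
```

For \( \hat H = \sigma^z \) and a small step \( t = 0.1 \), a twelfth-order expansion reproduces the exact \( \mathrm{diag}(e^{-it}, e^{it}) \) to machine precision; for larger steps the remainder of the series is exactly the poorly controlled error mentioned above.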

DMRG is sloooooow!

DMRG and MPS in general are methods which require low entanglement in the system to efficiently represent states. This means that while e.g. for ED solvers matrix dimensions \( 400 \times 400\) may be no problem, DMRG visibly starts to struggle if the dimension of the dense subblocks of its tensors exceeds \( \approx 500\). There are two ways to reduce this size (given as the last-but-one column in DMRG output):

  • only look at nicely behaved systems, in particular nothing critical and no long time evolutions
  • use symmetries to split the dense tensors up into more, smaller blocks. \( \mathrm{U}(1) \) symmetries such as particle number or \( S^z \)-spin allow us to discard large percentages of the dense subblocks, resulting in much smaller sizes. If you have a \( \mathrm{SU}(2) \) symmetry in the system, there are relations which require that some elements of these tensors are identical, allowing us to only store one copy (rather than e.g. all five values associated to the five \( S^z = -2, -1, 0, 1, 2 \) states associated to a \( S = 2 \) spin).

Further, depending on your system, you may benefit from settings which are not the default settings:

  • if you observe your blocks getting very big very quickly (last-but-one column of the DMRG output) in DMRG3S, consider setting the expandBlocksize value (shortcut: eb) to something smaller like 10 or 20. Conversely, if you observe the number of blocks or the ratio between SU(2) and U(1) states getting very large, consider setting it to something larger. This value sets the maximal size by which each symmetry sector is enlarged during each iteration. A smaller value grows the sectors more slowly but also tends to prefer (e.g.) larger spin sectors, while a larger value causes the bond dimension to grow very quickly and favours the most important sectors. Note that the value here is always limited by the maximal number of states selected and the maximal blocksize allowed during truncation to avoid pointless expansion. If your state is sufficiently large, you may also set the expandRatio (shortcut: er) to something smaller than its default value of 0.5. This further limits the size of the expansion blocks, but in a more uniform way as it selects a fixed ratio of each block. However, for small states, it also over-estimates smaller blocks (due to a clipping of a minimal expansion of 1 required to get out of a product state).
  • SyTen currently implements two eigensolver routines: the standard Lanczos method and a variant of the Davidson method (without the preconditioning). The Lanczos method with Gram-Schmidt reorthogonalisation (mode K submode A) should be the default, but both the Davidson method (mode D) or the matrix-based reorthogonalisation of the old 'classic' Lanczos implementation (mode C submode M) may be faster. To set these, look at the eigensolver stanza in the stages configuration. Note however that in particular the matrix-based reorthogonalisation may be unstable. On some systems, the Davidson method converges much better and to the actual ground state, while the Lanczos method will get stuck in a highly-excited state.
  • There is also a 2DMRG method implemented. Some people say that this method may be faster or lead to better convergence for some systems, though I seriously doubt that. If you want to try it out, you can set the variant to "2DMRG" for a given stage, e.g. using (m 200 v 2DMRG) for m=200 states and a 2DMRG solver.
  • In some cases, using a different contraction sequence in DMRG may speed up your calculation, though a decent amount of effort has gone into making the default work as well as possible. Check the output of syten-dmrg --help for details.

How do I enable profile-guided optimisation and link-time optimisation?

Using both profile-guided optimisation and link-time optimisation helps a little bit with GCC (approx. 5%). To enable this, first add the line

CPPFLAGS_EXTRA+=-fprofile-generate

to your make.inc file, recompile and run a "typical" DMRG calculation. This will be slower than usual, so maybe restrict the number of sweeps. When completing, a number of *.gcda files will be created next to the compiled object files. Once this has been done, replace the line above by

CPPFLAGS_EXTRA+=-fprofile-use -flto -fno-fat-lto-objects

and recompile again. This will use the run-time timing information to slightly optimise your code. Especially in larger calculations where most time is spent inside BLAS/LAPACKE functions, the impact will be minimal, though.

What symmetries can be used to enhance DMRG?

In principle, a physical symmetry can be used in a MPS if it is possible to assign each state of the MPS a single quantum number. For example, particle number conservation may be used if it is possible to assign each local physical state a single particle number. Momentum conservation on a ring can only be used after a Fourier transformation into momentum space, since otherwise the real-space states do not have definite momentum quantum numbers.

Currently implemented are \( \mathrm{SU}(2) \), \( \mathrm{U}(1) \) and \( \mathbb{Z}_k \) symmetry groups (note that the latter in particular also includes parity). Any local physical symmetry associated to any of these groups can be used without additional effort. Furthermore, some work has already been done towards implementing \( \mathrm{SU}(3) \) groups. Other groups ( \( \mathrm{SO}(3), \mathrm{Sp}(6)\) etc.) should be implementable without too much effort.

I need to store/read an object from disk. How?

There are four alternative approaches to reading and writing data from and to disk:

  • Boost serialisation is implemented and results in a binary format which allows a fairly efficient and compressed representation of all objects defined in SyTen. It is however not human-readable and intended primarily for cases where only SyTen itself uses the data (e.g. wavefunction representations, caches etc.). You can use save(), syten::loadInput and syten::loadCache to store and load objects.
  • read() and write() are functions which are implemented for a small subset of objects (primarily scalars and dense tensors). They are intended to be a compromise between human readability and ease of parsing by SyTen. Consequently, write() results in text files which can also be generated with other tools (e.g. Python) to represent for example coefficient matrices. If a write() function exists for an object, the corresponding read() function will be able to parse the output of the write() function and give the original object back.
  • syten::operator<< is overloaded for many objects and will result in a nicely-formatted human-readable representation. However, there is no guarantee that this format is stable, nor is there always a corresponding and invertible syten::operator>> which would allow you to read data back in. Use these for output intended for the user.
  • Finally, for the specific case of dense tensors, there are syten::readTabulatedCmplx and syten::readTabulatedReal which require you to already know the dimensions of the dense tensor to be read in, but otherwise offer a very convenient way to specify individual values at certain positions of a tensor, in particular if there are only a few non-zero values.

Tensor Products

Remarks
This FAQ mostly applies to the standard rank-typed tensors syten::Tensor; it does not directly apply to the newer smart tensor class syten::STensor.

There are only pairwise tensor contractions, i.e. expressions C = A·B. The basic interface for these contractions is the prod<C>(a<F>, b<S>, c_a, c_b) function, which takes two tensors of ranks F and S, two integer arrays of those same sizes, and the number of contracted tensor legs as an explicit template parameter C. Legs i of a and j of b are contracted if c_a[i] == c_b[j] and c_a[i] > 0. A leg i of a becomes the k-th leg of the resulting tensor if c_a[i] == -k; similarly, a leg j of b becomes the k-th leg of the resulting tensor if c_b[j] == -k. For example, to implement \( C_{ik} = \sum_{j} A_{ij} B_{jk} \), write C = prod<1>(A, B, {-1, 1}, {1, -2}). Each tensor leg has an associated direction: one can only contract an outgoing with an incoming index and vice versa, not two outgoing or two incoming indices. prod<>() takes an additional boolean value which, if true, complex-conjugates the second argument and considers the directions of the legs of the second tensor to be inverted. By default, this flag is false.

Complete products (resulting in a scalar value) are available in two variants: prod<C>(a<C>, b<C>) calculates \( A \cdot B \) assuming that the order of legs matches between a and b and complex-conjugates b by default. Alternatively, it is also possible to specify all legs as contracted in the arrays c_a and c_b above which allows for products where the order of indices does not match between a and b. In this second form, b is not complex-conjugated by default. \( \mathrm{Tr}(A B^\dagger) \) can then be written as either prod<2>(a, b, {1, 2}, {1, 2}, true) or prod<2>(a, b[, true]).

There are two convenience wrappers around this basic prod<>() function:

The first, instead of the integer arrays c_a and c_b, takes two maps from indices to integers; this allows for an arbitrary order of these indices and hence does not require knowledge of the tensor order (only a global variable or function evaluating to the correct leg value). The matrix-matrix multiplication above could then be written as prod<1>(a, b, {{1, -1}, {2, 1}}, {{2, -2}, {1, 1}}).

The second takes either three strings or one string describing the tensor contraction using Einstein's summation convention. The variant with just one string immediately splits the string into three substrings at | characters and then behaves like the former. The index names can have arbitrary length and contain all characters except ,, which is used to separate the different indices. The matrix-matrix product above could be written as prod<1>(a, b, "i,j|j,k|i,k") or prod<1>(a, b, "i,j", "j,k", "i,k") or prod<1>(a, b, "first,summed|summed,second|first,second"). When using this form to initialise a complete product, the result index string has to be empty, i.e. prod<2>(a, b, "a,b|a,b|", true) or prod<2>(a, b, "first,second", "first,second", "", true).

Remarks
Outer products are possible, e.g. there are two possible ways to generate a four-fold identity tensor and fs and io2 are exactly identical in the following:
Tensor<4> fs;
{
    auto f = gM(aBasis, bBasis);
    auto s = gM(aBasis.f(), bBasis.f());
    fs = prod<1>(f, s, "a,b,x|af,bf,x|a,b,af,bf");
    fs.reduce(EliminateZeros::No, true);
}
Tensor<4> io2;
{
    auto ioA = genIONode(aBasis);
    auto ioB = genIONode(bBasis);
    io2 = prod<0>(ioA, ioB, "a,af|b,bf|a,b,af,bf");
}

Order of Tensor Contractions

The order of contraction of indices is rather important in tensor networks. For example, when calculating the norm of a matrix product state, it is paramount to contract from left to right all site tensors with their complex conjugates and the relevant transfer matrix, rather than first contracting all site tensors, then all complex conjugates of all site tensors and then the two now-massive tensors representing the state together. While the former runs in polynomial time (linear in the length of the MPS, cubic in its bond dimension, linear in its physical dimension), the latter runs in exponential time in the length of the MPS! Similar differences can often occur in smaller networks when it is possible to contract multiple tensors in different ways.

While for small networks, the optimal ordering is fairly easy to figure out by hand (simply check the cost of all possible orderings), for larger networks, this gets complicated quickly.
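For the smallest non-trivial case, a chain of three matrices, the cost comparison can be made explicit (a self-contained sketch with hypothetical helper names; each matrix-matrix product of an (m x n) by an (n x p) matrix is counted as m·n·p multiply-adds):

```cpp
#include <cassert>

// Cost of ((m x n)·(n x p)) · (p x q): first product plus second product.
long long chainCostLeft(long long m, long long n, long long p, long long q)
{
    return m * n * p + m * p * q;
}

// Cost of (m x n) · ((n x p)·(p x q)).
long long chainCostRight(long long m, long long n, long long p, long long q)
{
    return n * p * q + m * n * q;
}
```

For dimensions 10, 100, 5 and 50, contracting left-to-right costs 7500 operations while right-to-left costs 75000, a factor of ten from nothing but the ordering; this is the same effect, in miniature, as in the MPS-norm example above.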

In the source of arXiv:1304.6112v6, you will find a netcon.m file. Open this file in Matlab ($ module load matlab; matlab, then click it…). You then get a function netcon which is explained in more detail also in that paper. Essentially, a call:

netcon({[1,2,3],[4,5,6],[1,5,7,8,-1],[2,4,7,-2],[3,6,8,-3]},
       2,1,1,0,[1 w; 2 m; 3 m; 4 m; 5 w; 6 m; 7 d; 8 d; -1 v; -2 n; -3 n])

specifies that you have five tensors in your network (the five groups enclosed in [] in the first argument). The first tensor has indices labelled 1, 2 and 3; the third tensor has indices labelled 1, 5, 7, 8 and -1. Indices labelled with positive integers are contracted (e.g. the first indices of tensors one and three are to be contracted), while negative integers denote resulting tensor legs (the last leg of the third tensor is the first leg of the resulting tensor). The options in between shouldn't have to be changed, the last option denotes the dimensions of each index (labelled by positive and negative integers), e.g. index -1 has dimension v whereas index 2 has dimension m. You can change and assign to these variables as you please; for example

w = 3; v = 6; m = 100; n = 200; d = 2;

works for assignment. Once you call netcon(), it should give you a result relatively quickly, naming the indices in the order in which you should contract them, i.e.

Best sequence:  6 3 1 5 8 2 4 7
 Cost:           10032000000

means to first contract over index 6 (tensors 2 and 5), then index 3 (tensors 1 and (25)), followed by indices 1, 5 and 8 (tensor 3 and (125)) and finally indices 2, 4 and 7 (tensor 4 and (3125)).

There does not seem to be a symbolic option, but using hugely different orders of magnitude for the different index dimensions works just as well as specifying \( m \gg d \) in some symbolic language would.

Ordering of Tensor Legs

The above section deals with the order in which multiple tensors should be contracted together. This section deals with the ordering of legs within a tensor for optimal contraction. It is based on the fact that each dense tensor block is stored internally as an array of scalar numbers. When two dense tensor blocks are contracted together, the numbers in the two arrays have to be reshuffled such that the tensor contraction can be expressed as a matrix-matrix multiplication (transpose-transpose-gemm-transpose, TTGT). For example, given tensors \( A_{ajib} \) and \( B_{icdj} \), the contraction \( \sum_{ij} A_{ajib} B_{icdj} \) first transforms \( A_{ajib} \to A^\prime_{(ab)(ij)} \) and \( B_{icdj} \to B^\prime_{(ij)(cd)} \). It then calculates the matrix-matrix product \( A^\prime \cdot B^\prime = R^\prime_{(ab)(cd)} \) and then reorders the scalars in \( R^\prime \) according to the requested ordering of legs; e.g. if the call was prod<2>(a, b, {-2, 1, 2, -4}, {2, -3, -1, 1}), entries in \( R^\prime \) are sorted as \( R^\prime_{(ab)(cd)} \to R_{dacb} \).

This has the consequence that if \( A \), \( B \) or \( R \) are already in the same order as their primed counterparts, no rearranging of entries has to take place. Taking into account the column-major convention of BLAS, which is used to calculate the matrix-matrix product, no rearranging will take place:

  • For \( A \to A^\prime \) iff:
    • contracted indices come first, in ascending order
    • non-contracted indices come second, in ascending order
  • For \( B \to B^\prime \) iff:
    • contracted indices come second, in ascending order
    • non-contracted indices come first, in ascending order
  • For \( R^\prime \to R \) iff:
    • All non-contracted indices from \( B \) come before those of \( A \).
    • Non-contracted indices within \( A \) and \( B \) are in ascending order.

For example, for the contraction of two rank-3 tensors into a new rank-4 tensor, the following call is optimal:

prodD<1>(a, b, {1,-3,-4}, {-1,-2,1})

and if all dimensions of a and b are of size 15, it takes 0.93ms to complete. In comparison, the call

prodD<1>(a, b, {-1,1,-3}, {-4,-2,1})

with the same sizes takes 1.70ms to complete.

When contracting the same two tensors into a single rank-2 tensor, the call

prodD<2>(a, b, {1,2,-2}, {-1,1,2})

takes 0.055ms, while the call

prodD<2>(a, b, {-2,1,2}, {2,-1,1})

takes 0.122ms.

For larger tensors, the difference between the two calls becomes less and less noticeable, since re-arranging two matrices of size \( m \times n \) and size \( n \times p \) only takes \( O(m \cdot n + n \cdot p) \) operations, while their product takes \( O(m \cdot n \cdot p) \) operations. Nevertheless, it is useful to keep in mind.
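This scaling argument is easy to check numerically (a sketch with a hypothetical helper name):

```cpp
#include <cassert>
#include <cmath>

// Relative transposition overhead of a TTGT contraction: rearranging the
// (m x n) and (n x p) inputs costs ~ m*n + n*p element moves, while the
// product itself costs ~ m*n*p multiply-adds.
double transposeOverhead(long long m, long long n, long long p)
{
    return static_cast<double>(m * n + n * p)
         / static_cast<double>(m * n * p);
}
```

For cubic tensors with all dimensions equal to 10 the overhead is 20% of the product cost; at dimension 100 it has already dropped to 2%, which is why the difference between the calls above shrinks for larger tensors.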

One major exception to this formulation is the handling of fusing/merging tensors. When merging two legs of a tensor, the merged legs are first transposed to the front, merged and then transposed into the final position. As such, you ideally want to merge the first two legs of a tensor and then keep the merged leg in the front.

In general, it is more expensive to transpose legs in the result tensor than in the initial tensor, as they are transposed in every individual result tensor prior to the tensor reduction. Hence, if possible, avoid reordering tensor legs in the last transpose step of transpose-transpose-gemm-transpose.

Noncommuting multiplication: Is there any system to this madness?

Operator Multiplication

Sadly, C++ only knows operator*=, not operator=*. Hence, we can only write state *= op, which we want to be \( \hat O | \psi \rangle \). Consequently, state *= op1 *= op2 should be \( \hat O_2 \hat O_1 |\psi\rangle \). This in turn means that op1 * op2 should probably be \( \hat O_2 \hat O_1 \) (i.e. \( \hat O_1 \) applied first and then \( \hat O_2 \)). This extends to dot() products of operators, where we want dot(a, b) to apply a first and b second. Since in \( \hat O_2^\dagger \cdot \hat O_1 \) it is the second operator which is dual-space conjugated, dot(a, b) also dual-space-conjugates b, not a.
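The convention can be checked with plain 2x2 matrices (hypothetical helper names, not the SyTen API): opProd(x, y) models the C++ operator product x * y, so the first factor is applied first and the result is the mathematical product \( \hat y \hat x \).

```cpp
#include <array>
#include <cassert>

using OpMat = std::array<std::array<double, 2>, 2>;
using Vec2 = std::array<double, 2>;

OpMat matmul(const OpMat& x, const OpMat& y)
{
    OpMat r{};
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            for (int k = 0; k < 2; ++k)
                r[i][j] += x[i][k] * y[k][j];
    return r;
}

Vec2 apply(const OpMat& o, const Vec2& v)
{
    return {o[0][0] * v[0] + o[0][1] * v[1],
            o[1][0] * v[0] + o[1][1] * v[1]};
}

// Models the C++ expression x * y for operators: x is applied first,
// y second, so mathematically this is the matrix product y * x.
OpMat opProd(const OpMat& x, const OpMat& y)
{
    return matmul(y, x);
}

// Two non-commuting example operators.
inline OpMat raiseOp() { OpMat a{}; a[0][1] = 1.; return a; }
inline OpMat numOp()   { OpMat b{}; b[0][0] = 1.; b[1][1] = 2.; return b; }
```

With these definitions, applying opProd(a, b) to a vector gives the same result as applying a first and b second, and swapping the factors changes the result, as expected for non-commuting operators.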

Scalar Products

Use the function overlap(a, b) which complex-conjugates its first argument (a) and calculates the complete product between a and b.

The prod(a, b) family of functions for type-ranked tensors (syten::Tensor) has an option to complex-conjugate its second argument for arbitrary tensor products. To be consistent with this, in the special case that all legs are contracted, prod(a, b) also complex-conjugates its second argument. The same holds for syten::prodD() and syten::prodS(). Note the asymmetry: if a product specification is given to match particular legs together, the second argument is not conjugated by default; if the full product is calculated without giving a specification, the second argument is conjugated by default.

For states, dense tensors and tensors (but not syten::STensor), operator* also complex-conjugates the first argument.

syten::STensor objects are multiplied by operator* or syten::STensorImpl::prod(). None of the arguments are conjugated by default regardless of whether only some or all legs are contracted. Use syten::STensorImpl::overlap() to obtain overlaps or manually create a proxy using syten::STensorImpl::conj().

The Operator Parser

The command-line operator parser was until recently inconsistent, but now keeps arguments in mathematical order, i.e. operators further to the right are applied first.

Overview

Expression (via) Result
$ syten-overlap <a> <b> overlap(a, b) \( \langle a | b \rangle \)
'lat:a b *' \( \hat a \hat b \left(|\psi\rangle\right) \)
'lat:a b ·' dot(b, a) \( \hat a^\dagger \cdot \hat b \left(|\psi\rangle\right) \)
overlap(MPS::LBOState a, MPS::LBOState b) \( a^\dagger \cdot b \)
overlap(MPS::State a, MPS::State b) \( a^\dagger \cdot b \)
overlap(BTT::State a, BTT::State b) \( a^\dagger \cdot b \)
overlap(MPS::Operator a, MPS::Operator b) \( \mathrm{tr}\left(\hat a^\dagger \cdot \hat b \right) \)
overlap(Tensor a, Tensor b) prod(b, a, Conj::y()) \( a^\dagger \cdot b \)
overlap(STensor a, STensor b) prod(conj(a), b) \( a^\dagger \cdot b \)
overlap(StateWrapper a, StateWrapper b) \( a^\dagger \cdot b \)
operator*(STensor a, STensor b) \( a \circ b \) (connected over all matching legs)
operator*(DenseTensor a, DenseTensor b) prodD(b, a) \( a^\dagger \cdot b \)
operator*(Tensor a, Tensor b) prod(b, a) \( a^\dagger \cdot b \)
operator*(MPS::Operator a, MPS::Operator b) \( \hat b \hat a \left(|\psi\rangle\right) \)
prod(STensor a, STensor b) \( a \circ b \) (connected over all matching legs)
prod(Tensor a, Tensor b) \( b^\dagger \cdot a \)
prodD(DenseTensor a, DenseTensor b) \( b^\dagger \cdot a \)
prodS(SparseTensor a, SparseTensor b) \( b^\dagger \cdot a \)

Scalar types

There are numerous scalar types defined via typedefs. Here’s an overview (with default settings):

Name (default)  Underlying type (default)  Range/precision     Application
syten::Index    std::uint32_t              0 to 2^32 − 1       Tensor indices and dimensions
syten::Size     std::size_t                0 to 2^64 − 1       Indices into flat arrays
syten::Rank     std::size_t                0 to 2^64 − 1       Tensor ranks, compile-time constants, arrays
syten::CDef     syten::HighPrec            real, 76 digits     Clebsch-Gordan coefficient tensor element type
syten::MDef     syten::HighPrec            real, 76 digits     IREP representation matrices
syten::SDef     std::complex<double>       complex, 15 digits  Standard dense tensor element type, configurable with SYTEN_COMPLEX_SCALAR/SYTEN_SDEF_TYPE, _c (_i) user-defined literal
syten::SRDef    double                     real, 15 digits     Default real-only low-precision type, configurable with SYTEN_SRDEF_TYPE, _r user-defined literal

If you want to describe a standard tensor element or scalar type, use SDef. If you want to describe a real scalar type, use SRDef. Use SRDef also for thresholds and order-of-magnitude calculations. For Clebsch-Gordan coefficient tensors and the calculations leading up to them, use CDef.

One of the aims here is to be able to, with relatively little effort, use float instead of double, use double instead of std::complex<double> and potentially even use higher-precision types. As such, most code should use only the scalar typedefs listed above.

Overlaps of matrix-product states

Given two MPS \( |a\rangle \) and \( |b\rangle \) and the aim to evaluate the overlap very precisely:

  • if their overlap \(\langle a | b \rangle \) is small, there is probably nothing better to do (I think) than syten-overlap a.state b.state, which will give you this small number. Depending on the system length and the bond dimensions of the states, this overlap may come out around \( 10^{-10} \) even if it should be exactly zero.

    Maybe the stability of this calculation can be improved by bringing both states into left-canonical form first and then calculating the overlap. This is straightforward in pyten. The reasoning is that if the states are in canonical form, most site tensors \( A_{1…L-1} \) are "well-behaved" and their product should also be well-behaved. Since the syten::MPS::overlap() function proceeds from left to right, the non-well-behaved tensors \( A_L \) would be used last. But I'm not sure if this really makes a difference.

  • if their overlap \(\langle a | b \rangle \) is large and you want the deviation from 1, then instead of calculating \( \sqrt{2} \sqrt{1 - \langle a | b \rangle} \) (done by syten-overlap --root --subtract a.state b.state plus scaling by \( \sqrt{2} \)), it is better to evaluate \( || | a\rangle - |b \rangle || \) directly, which is done by syten-overlap --error --root a.state b.state. In this case, syten-overlap subtracts the normalised states \( |a\rangle \) and \( |b\rangle \) and calculates the norm of the difference. This evaluation is slightly more precise, resolving deviations about two orders of magnitude smaller (cf. the figure below; \( m = 256 \) was the maximal bond dimension used).
overlaps-mps.png
Overlaps close to 1 as calculated above with two different methods.

Why should I not copy syten::STensor objects?

Short answer: Use syten::STensorImpl::STensor::copy() if a copy is unavoidable, move otherwise.

Long answer: Unfortunately, it is not always clear whether an object is copied or moved from. In particular, constructing a vector from a std::initializer_list of pure r-values, e.g.

void f(Vec<STensor> x);
STensor factory();
f({std::move(a), factory()});

produces an unnecessary copy of the syten::STensor objects because the contents of a std::initializer_list are const, while doing the same with a tuple only produces one move. For a simpler example, see e.g. here; notice how f() contains references to X(X&&) whereas g() contains references to X(X const&).

Furthermore, while returning temporaries inside curly braces into a std::tuple (as done e.g. in syten::STensorImpl::svd) works and does not copy, returning named objects there creates an unnecessary copy. This is neither reasonable nor straightforward to see, given that returning a named object outside a tuple invokes copy elision.

To avoid those and other errors in the future, the copy ctor of syten::STensor logs all implicit copies made using either it or the copy assignment operator if tensor timings are enabled. The public member function syten::STensorImpl::STensor::copy() does not log this and should be used if you are certain that some copy is unavoidable.

Runtime Errors

boost::archive::archive_exception: array size too short

The following error is printed when attempting to read a state or lattice file:

terminate called after throwing an instance of 'boost::archive::archive_exception'
  what():  array size too short

This is most likely due to a mismatch between the scalar types used by the toolkit binary (run syten-binary --version to get a printout of the scalar types used) and those used to write the file. For example, if a lattice is created by syten-sql-mps-spin using real scalar values and subsequently read by syten-random using complex scalar values (or vice versa), this error occurs. See SYTEN_COMPLEX_SCALAR for details.

boost::archive::archive_exception: input stream error

When reading an input file (e.g. a state or lattice) fails, the following error may be printed:

terminate called after throwing an instance of 'boost::archive::archive_exception'
  what():  input stream error
  Aborted

In this case, you should make sure that you are reading a valid state or lattice file. Until recently, a common reason for this failure was reading a file which was still being written to. This failure mode should now be avoided (we serialise into a temporary file first and then move it into place), but similar problems (no space left on the device, for example) may also lead to such an error.

Error while loading shared libraries

The following error is printed when attempting to run a binary (file name may be different):

error while loading shared libraries: libmkl_intel_lp64.so:
  cannot open shared object file: No such file or directory

This error occurs if the system cannot find a shared library required by the program, most likely Boost or the Intel MKL. Make sure that the paths to both are contained in your LD_LIBRARY_PATH. On most clusters, the Intel MKL can be enabled with module load intel or module load icc (the version needs to match the one used to compile the binary). The Boost library can often be loaded similarly via module load or, alternatively, by directly adding the boost/lib directory to your LD_LIBRARY_PATH.

'Attempt to create unrepresentable HalfInteger quantum number…'

The toolkit stores quantum numbers as syten::HalfInteger values, which internally hold a standard integer of twice the represented value (so a quantum number Sz=3/2 is stored as a syten::HalfInteger with value 3/2, which in turn is stored as an integer with value 3). By default, a std::int16_t is used as the internal integer storage, resulting in representable quantum numbers ranging from approx. -16'000 to approx. +16'000. For most applications, this is enough.

However, if your lattice is very large and/or bosonic, this may not be enough. Note that in certain applications, such as generation of ‘complete’ random states, enumeration of all possible quantum numbers is necessary.

To alleviate this issue, either choose a smaller lattice or define the macro SYTEN_RDEF_BASE_TYPE to std::int32_t or std::int64_t. Note that the typical concerns apply here: binaries will not be compatible between different values of that type. To change the type, add a line CPPFLAGS_EXTRA+=-DSYTEN_RDEF_BASE_TYPE=std::int32_t to your make.inc file.

One of the executable arguments needs to be a complex number, but doesn't accept '(0, 0.01)'.

To pass a complex number as an argument on the command line, put single quotes around it to stop the shell from interpreting the brackets. Do not place any spaces inside the brackets, i.e., use -d '(0,0.01)', not -d '(0, 0.01)'.

DMRG runs forever and only prints energy zero

DMRG convergence is currently measured as the relative change of energy over the last two sweeps: something like \( (E_{\mathrm{last}} - E_{\mathrm{now}})/|E_{\mathrm{last}}| \) has to be small. If the energy is exactly zero (or very small), this check breaks down. To limit the number of sweeps in this case, use e.g. (m 50 l 10) for m=50 states and at most l=10 half-sweeps.

This problem can in particular occur if you orthogonalise against many orthogonal states or work in a small subspace; the eigensolver is then more likely to return a zero state.

`syten-random` with `-g s` takes forever to create a spin state

syten-random -g s first creates a near-vacuum state (N=0, S=0 etc.) and then applies single-site operators randomly until the state transforms as desired. However, if you attempt to create a S=0 state on five Heisenberg spins (or with e.g. five electrons in the Hubbard model), syten-random -g s will not notice that this is in fact impossible and just try to apply spin flip operators forever. If syten-random -g s takes suspiciously long, consider re-checking your arguments. Alternatively, the ‘complete’ generator -g c will generate a zero-norm state in this case (all tensors empty).

My programme gets killed by `SIGKILL` or throws `bad_alloc` errors

If your program needs more memory than the operating system can supply, it may either throw a std::bad_alloc exception (if the OS denies an allocation) or get killed by a SIGKILL signal (if the OS initially agreed to the allocation but doesn’t have the memory once we access it, cf. overcommit.)

To diagnose such problems, you should

  • set the environment variables SYTEN_TENSOR_TIME=t and SYTEN_LOG_TIME_LVL=6. The former will cause timing of all tensor products to be enabled, the latter will print out the times after each individual product. In this way, you can check which product in your code is the problematic one, the last ‘completed’ product will be printed out just before the crash. The output tells you the line of the product as well as the sizes (in blocks and memory) of the involved tensors.
  • set the environment variable SYTEN_MEMORY_SAMPLER=mysamplefile. This will cause memory usage to be printed to the file continuously. You can cross-check with the output of the tensor timer to see which products need the most memory and at which times.

Once you have identified the parts of the code taking the most memory, you should check whether it is only the temporary tensors arising during specific tensor contractions which exceed the memory budget of your computer. If (e.g.) individual MPS tensors already go beyond the available RAM, there is not much one can do. If you have identified some temporary tensors which need a lot of RAM, you might want to

  • consider changing the tensor contraction order to minimise the sizes of the temporary tensors. Often, there is a trade-off between memory and CPU usage.
  • if your tensors are using SU(2) symmetries, it may be useful to call syten::correct_cgc_tensors on high-rank tensors periodically to avoid a build-up of errors in the CGC tensors.
  • consider caching unused tensors to disk. Especially during MPS calculations on long chains, storing 100 MPS component tensors of 500 MB each to disk saves you 50 GB of RAM which may be spent on the current on-site contraction tensors.
  • consider dynamically changing the contraction order. If in a matrix-matrix product A*B*C sometimes A and sometimes C is bigger (and B in each case reduces the size considerably), it may be preferable to do (A*B)*C when A is bigger and A*(B*C) when C is bigger. syten::MemoryUsage::totalSize may help you determine the size of each tensor; alternatively, comparing the sizes of the relevant bonds may be more appropriate.
  • try to free unused memory by assigning zero tensors to temporary variables which are still in scope but no longer needed. E.g.
    auto x = a * b;
    auto y = x * c;
    auto z = y * d; // x still takes memory!
    may be changed to either
    auto x = a * b;
    auto y = x * c;
    x = STensor();
    auto z = y * d; // x still in scope but zero!
    or, alternatively, if using syten::STensor or having equal-rank syten::Tensor objects:
    auto x = a * b;
    x = x * c; // or x *= c;
    x = x * d; // or x *= d;
  • reconsider shared-memory parallelisation which necessarily incurs some memory overhead. This is particularly problematic if you are using top-level parallelisation such as operator-level or real-space parallelisation which causes the same large tensors to be needed in memory simultaneously. Consider disabling such parallelisation in these cases.

Caching warnings

If you get warnings like

AsyncCached<...>@...::wait_for_worker_completion() acquired mutex while worker was still running.
Will release mutex now and sleep for 100ms. See FAQ.

or

AsyncCached<...>@...::wait_for_worker_completion() called while worker is still busy caching the
object to disk. This is bad for performance. See FAQ.

these are due to unexpected situations in the asynchronous caching subsystem. This subsystem receives requests to cache or uncache specific objects and then dispatches workers to do so asynchronously. If a request to uncache an object comes in while this object is still being cached to disk, a performance hit is to be expected (as we have to wait until the caching process completes before we can undo it). You can enable/disable caching using the environment variables SYTEN_DO_CACHE and friends (see the List of Environment Variables); you may also adapt the number of workers used and the caching threshold applied (to avoid caching very small objects).

While this warning may be annoying, it normally does not hint at any kind of calculational error; the most likely cause is simply that the hard disk is much slower than expected.

I get an error or observe behaviour which doesn't happen anywhere else. Help?

When trying to reproduce an issue and failing, make sure that settings which may affect runtime behaviour are checked, in particular:

  • make sure that the same libraries/implementations (e.g. BLAS) are used
  • make sure that the make.inc file used to compile is identical or that all differences are understood
  • make sure that the output of env on both systems is identical, in particular with respect to the LD_LIBRARY_PATH and LD_PRELOAD variables
  • make sure that you are actually running the binaries you think you are running (check for example which syten-dmrg)
  • make sure that the git commits of the two executables match (using the --version and, if necessary, --version-diff switches).

Compiler Errors

'./inc/tensor/tensor_prod.h:572:8: error: expected ‘(’ before ‘constexpr’'

The compiler emits an error message complaining e.g. about

if constexpr (is_number<decltype(r)>::value) {
   ^~~~~~~~~

Starting with commit f366cfad, SyTen requires a C++17-compatible compiler, such as GCC 7.x. If your compiler is older and you cannot upgrade, consider using the last C++11-compatible commit, tagged as pre-C++17: git checkout pre-C++17.

'error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]'

GCC 7.1 incorrectly diagnoses a strict-aliasing violation (GCC 7.2 fixes this problem). To avoid the error message, add CPPFLAGS_EXTRA+=-Wno-strict-aliasing to your make.inc file.

Make: No rule to make target `X.h', needed by `Y.d'. Stop.

If you encounter such an error, it is most likely because you still have an old dependency file which refers to a no-longer-existing header file. Make then tries to re-make this header file, but does not know how (since headers are written by hand, not by make) and stops. Try to delete all old dependency files, e.g. via

find . -name '*.d' -delete

executed in the toolkit directory.

GCC/Clang: X.o:(.data+0x0): undefined reference to `__must_be_linked_with_icc_or_xild'

This error occurs if you compiled object files with the Intel compiler, but then attempted to link an executable using either GCC or Clang. Either use the Intel compiler to also build the executable or re-compile the object files with the other compiler. The easiest way to do the latter is to delete all old object files via

find . -name '*.o' -delete

GCC prior to 7.x: error: attributes at the beginning of statement are ignored [-Werror=attributes]

GCC 6.x does not yet know about the [[fallthrough]] attribute which is a part of C++17. If you encounter this error, it is best to silence the warning by adding the line

CPPFLAGS_EXTRA+=-Wno-attributes

to your make.inc file.

In function `_start': undefined reference to `main'

The following error is emitted by the linker:

/usr/lib/../lib64/crt1.o: In function `_start':
  /usr/src/packages/BUILD/glibc-2.11.3/csu/../sysdeps/x86_64/elf/start.S:109:
  undefined reference to `main'

This means that no function main() was found when attempting to build an executable. Most likely you only defined a function syten::main(). To be precise, an executable should follow either this model:

#include "header.h"
namespace syten {
int main(int argc, char** argv) {
// do stuff here
return 42;
}
}
int main(int argc, char** argv) { return syten::main(argc, argv); }

or

#include "header.h"
int main(int argc, char** argv) {
using namespace syten;
// do stuff here
return 42;
}

The first method is slightly preferred because it minimises the use of the global namespace and hence possibly name clashes with other functions.

How should I acknowledge using the toolkit?

Please see Contributors.