Overview, Design and Comprehension Questions

What is a truncation threshold, weight and maximal number of states?

The toolkit supports four different criteria based on which it discards singular values. Any of these criteria is sufficient to discard a particular singular value. Two are only sensible relative to the norm of the input tensor, which is why it is useful to call Truncation::scale() prior to a SVD, passing in the norm of the tensor.

The four criteria are:

maximal blocksize: truncMaxBlocksize or b in DMRG stage stanzas, somtimes -b option: If the maximal size of a dense block exceeds this size, the smallest singular values are discarded.
maximal states: truncMaxStates or m in DMRG stage stanzas, usually -m option: If the sum of the sizes of all dense blocks exceeds this size, the smallest singular values are discarded.
threshold: truncThreshold or t in DMRG stage stanzas, usually -t option: If a singular value is smaller than this, it will be discarded.
discarded weight: truncWeight or w in DMRG stage stanzas, usually -w option: The smallest singular values are discarded until the sum of their squares reaches just below this value. Note that this is usually a very small number. If you truncate a normalised MPS with specified weight $ \omega $, you will incur an error $ \sqrt{\left|\left| \left|\psi\right\rangle - \left|\psi^\prime\right\rangle \right|\right|} $ of approximately $ \sqrt{2 L \omega} $.

Also note that the error typically reported is calculated as $ \left( \sum_{i=1}^L \right) \sqrt{2 - 2 \sqrt{1 - \omega/N^2}} $ where $N$ is the norm of the state and we either report a single truncation (without the sum) or the maximal accumulation of all truncations (with the sum). This is of roughly the same order as $ \sqrt{2 L \omega} $, but in the case of multiple truncations, both expressions are upper bounds, not the exact error. You can get the exact error from syten-overlap --norm --subtract original.mps truncated.mps which calculates $ 1 - \left\langle \psi^\prime \middle| \psi \right\rangle / \left(\left|\left| \left|\psi\right\rangle \right| \right| \cdot \left|\left| \left|\psi^\prime\right\rangle \right| \right|\right) $. Taking the square root of this value will give you the approximate error.

How can I specify an operator on the command line? Which functions are available?

Many tools require a tensor product operator (at the moment either MPO or BTTO) as an input. Tensor product operators are stored in lattice files and are accessed using the generic <lattice file>:<operator expression> syntax. <lattice file> here is the usual name of the file, <operator expression> can be the name of a specific global operator or a more complicated expression. This expression is parsed using reverse prefix notation. That is, assuming two operands a and b and a binary operation f, the result f(a, b) is expressed in this syntax as b a f. RPN has the advantage that no parantheses are necessary to uniquely specify operation ordering and that writing a parser for it is extremely simple. For example, the expression I 3 * takes the identity operator from the lattice and multiplies it by three. The expression I 3 * H + does the same as above and then adds the result to the operator defined on the lattice as H.

The operand types of the operator parser are:

unevaluated strings, e.g. H or q3,0
numbers, e.g. 3 or (0,-2.123e-4)
sectors, e.g. the $ \mathrm{U}(1)_N = 3 ; \mathrm{SU}(2)_S = \frac{1}{2} $ sector
TPOs, e.g. Hamiltonians, constructed single-site operators etc.

By default, each element of the operator expression is interpreted as a number unless that conversion fails. If this conversion fails, the element is stored as a string. Strings can implicitly be converted to numbers, sectors and TPOs. If after evaluation of all operations the final operand is not a TPO, it will implicitly be converted into a TPO.

Matrix Product Operators

For matrix product operators, the following functions/operators are available:

<x> ? or <x> _operator: explicitly interprets the string <x> as a global TPO. Example: H ?, N ?, both equivalent to just H and N.
<x> <y> @ or <x> <y> _at_site: if <y> can be parsed as a number, constructs the single-site operator <x>_<y>, otherwise constructs the single-site operator <y>_<x>. Example: sz 0 @, 17 s @, n 2 _at_site.
<x> # or <x> _sector: Converts <x> starting at its second position into a quantum number sector description with respect to the current lattice. Example: q0,3 or a15. Note that the first position is ignored but necessary to avoid interpreting e.g. 15 as a number (which cannot be converted into a sector any more).
<x> … or <x> _distribute. Builds the MPO $ x \mathbb{1} $, i.e. the identity times <x>, with the absolute value of the factor distributed evenly over all tensors. The phase is stored in the first tensor. Example: 15 _distribute.
<x> : or <x> _truncate. Explicitly truncates the MPO <x>, even if the parsing component was invoked with truncation disabled. Useful to (pre-)truncate the arguments for syten-truncate or syten-print, which normally don't truncate in the parser. Example sz 0 @ sz 1 @ + _truncate.
<a> + adds two numbers or two operators together.
<a> * or <a> _mult. If <a> and  are both numbers, calculates the product of the two numbers. If one is an operator and the other a number, scales the operator by the number. If both are operators, calculates $ a b $ (or, in SyTen C++, b * a). That is,  is applied first and <a> is applied second. Example: 2 3 * is 6, 4 I * is the identity scaled by four, A B * is $ A B (|\psi\rangle)$, chu 0 @ cu 0 @ * is nu 0 @.
<a> · or <a> _dot calculates the dot product of two operators <a> and  as dot(b, a), which applies b first and dual-space-conjugated a second. Example: c 0 @ c 1 @ _dot is $ c_0^\dagger \cdot c_1 $, if the dual-space conjugate of cu is chu, cu 1 @ cu 0 @ _dot is the same as chu 1 @ cu 0 @ *. If  is already a sector, the function forwards to _dotq, see below.
<a> <q> _dotq calculates the dot product of two operators <a> and  projected into the sector <q> as dot(b, a, q). Example: c 0 @ c 0 @ q0,1 _dotq is proportional to s 0 @ in a Fermi-Hubbard system and hence c 0 @ c 0 @ q0,1 _dotq c 0 @ c 0 @ q0,1 _dotq _dot 0.75 * is the same as s 0 @ s 0 @ _dot or $ (\hat s_0)^2 $. The first position of <q> is ignored but potentially necessary to avoid interpretation as a number. The string following the first position is interpreted in exactly the same way as in e.g. syten-random -s or syten-apply-op -s.
<a> ^m2 or <a> _msquare is the same as <a> <a> *, i.e. calculates the square of a number or an operator. Example: H 1.2345 -1 * _distribute _msquare calculates $ \left( \hat H - 1.2345 \right)^2 $.
<a> ^d2 or <a> _dsquare calculates either dot(a, a) (for operators) or a * conj(a) (for numbers).

Binary Tree Tensor Operators

For binary tree tensor operators, the following operations are defined:

<x> ? or <x> _operator: explicitly interprets the string <x> as a global TPO. Example: H ?, N ?, both equivalent to just H and N.
<x> <y> @ or <x> <y> _at_site: if <y> can be parsed as a number, constructs the single-site operator <x>_<y>, otherwise constructs the single-site operator <y>_<x>. Example: sz c2:3 @, c17:1 s @, n c2:1 _at_site. For details of the accepted coordinates, see operator[].
<x> # or <x> _sector: Converts <x> starting at its second position into a quantum number sector description with respect to the current lattice. Example: q0,3 or a15. Note that the first position is ignored but necessary to avoid interpreting e.g. 15 as a number (which cannot be converted into a sector any more).
<a> * or <a> _mult. If <a> and  are both numbers, calculates the product of the two numbers. If one is an operator and the other a number, scales the operator by the number. If both are operators, calculates $ a b $ (or, in SyTen C++, b * a). That is,  is applied first and <a> is applied second. Example: 2 3 * is 6, 4 I * is the identity scaled by four, A B * is $ A B (|\psi\rangle)$, chu c0:0 @ cu c0:0 @ * is nu c0:0 @.
<a> · or <a> _dot calculates the dot product of two operators <a> and  as dot(b, a), which applies b first and dual-space-conjugated a second. Example: c c0:0 @ c c0:1 @ _dot is $ c_0^\dagger \cdot c_1 $, if the dual-space conjugate of cu is chu, cu c0:1 @ cu c0:2 @ _dot is the same as chu c0:1 @ cu c0:2 @ *. If  is already a sector, the function forwards to _dotq, see below.
<a> <q> _dotq calculates the dot product of two operators <a> and  projected into the sector <q> as dot(b, a, q).
<a> + adds two numbers or two operators together.

Common Examples

The Hamiltonian of a system can often be split into a fixed part and a part which one wants to scale by a scalar. Examples include the Hubbard- $ U $, i.e. $ \hat H = \hat H_{\mathrm{kin}} + U \hat H_{\mathrm{int}} $, dynamic laser fields and more. In such situations, it is advisable to store the individual parts, e.g. Hkin and Hint as global operators in the lattice. The full Hamiltonian can then be constructed on the command line as
```
$ syten-dmrg -l fh:'Hkin Hint 4 *' …
```
for $ \hat H_{\mathrm{kin}} + 4 \hat H_{\mathrm{int}} $. The parameter can then be changed as desired from the command line without regenerating the lattice.
Single-site operators include e.g. sp-u1:sz 2 @ for $ \hat s^z_2 $ on a $\mathrm{U}(1)_{S^z}$-symmetric spin lattice.
_dot may be used to dual-space conjugate an operator: A I _dot multiplies operators A and the identity I (defined on every lattice), but dual-space conjugates A in the process. The result is the proper $ A^\dagger $.
_dot and _dotq may also be used to project an operator onto a specific quantum number sector, e.g. I A q4 _dot effectively projects A onto the 4 sector.

The toolkit uses complex scalars by default? Why and how can I disable it?

The decision whether to use real scalars in tensors (double) or complex scalars (std::complex<double>) is currently a compile-time decision, i.e. there is no run-time introspection. This implies that once you decided to use real-valued scalars, you cannot do certain things to your wave function, e.g. apply an operator which requires complex numbers or do real-time evolution.

In comparison, if you are using only complex-valued scalars, all these things are possible even if you did not initially think of them. Hence, by default, the toolkit uses complex scalars. If you are sure that you only want to use real-valued scalars in your tensors, set -DSYTEN_COMPLEX_SCALAR=0 (c.f. SYTEN_COMPLEX_SCALAR).

Which time-evolution algorithm should I use?

Short answer: Use syten-tdvp or syten-krylov (for gapped systems or if your error requirements are very stringent). If you only wish to do imaginary-time evolution, consider SYTEN_COMPLEX_SCALAR.

Long answer: There are currently five time evolution implementations in the toolkit:

syten-tebd: Implements the classical TEBD algorithm and a real-space parallelised variant thereof of orders 1, 2 and 4, but requires the time-evolution operator (usually the Hamiltonian) to be split by hand into an ‘even’ and an ‘odd’ part, with those parts saved as individual MPOs in the toolkit. Furthermore, it currently only works on spin chains, as it requires all local operators occurring in the Hamiltonian ( $ \hat S^z, \hat S^+, \hat S^-$ etc.) to have zero overlap with the identity, i.e. $ \mathrm{tr}\left( \hat O \hat I \right) = 0 $. Consider this to be currently in the testing phase, though it should work (at least the unparallelised method) reliably if your system permits it.
syten-tebd-swappable: Also implements the standard TEBD algorithm (1st and 2nd order) but handles long-range interactions and arbitrary bosonic systems (Fermions may also work, I don't know). Requires an explicit list of summands of the Hamiltonian to be written to a text file first, which is then parsed and each summand exponentiated individually. See commit 00b6ea17 for details regarding the terms file. Unfortunately due to the inherent TEBD error of O(t^2) at best, it requires smaller time-steps than those which would be possible using the Krylov method and hence is usually slower.
syten-tdvp: Based on the TDVP principle applied to MPS as described by Jutho Haegemann et.al. Both the one-site and two-site variants are implemented. The one-site variant is very fast, but incurs a fairly large error, as it does not increase the bond dimension. The two-site variant is slower, but supposedly exact if there are only nearest-neighbour interactions. However, we currently do not quantify the projection error properly.
syten-taylor: Based on a simple Taylor series expansion of $ \hat U = \mathrm{exp}\left\{ -i \hat H t \right\} $ and then repeated application of the constructed $ \hat U $. Extremely fast if you need many very short time steps, but the error is not well-controlled – what may seem like a decent time evolution may veer of uncontrollably within just a few time steps. Worse, changing the timestep just slightly may completely alleviate the problem. However, if the time step is very small and you need very many steps, the method has the advantage of only building $ \hat U $ once in the beginning.
syten-krylov: Based on the MPS-supported Krylov subspace method. At each step, the Krylov subspace is built up and the Hamiltonian exactly diagonalised within this subspace. The method is well-controlled and you can in principle push every error to zero. In principle, the method allows arbitrarily large time steps, but may require very many Krylov vectors (iterations) to achieve this result. Various options for the internal eigensolver are available, check the output of syten-krylov --help, but the defaults should work reliably, though not necessarily as fast as possible. If your iteration exits with "maxIter" (last column of output), decrease the size of the time step or increase the maximal number of Krylov vectors to achieve a converged result. If your initial state is reasonably complex, you may want to try the 'W' or 'V' orthonormalisation submodes instead of the default 'A' mode. The method works best in gapped systems with little entanglement.

DMRG is sloooooow!

DMRG and MPS in general are methods which require low entanglement in the system to efficiently represent states. This means that while e.g. for ED solvers matrix dimensions $ 400 \times 400$ may be no problem, DMRG visibly starts to struggle if the dimension of the dense subblocks of its tensors exceeds $ \approx 500$. There are two ways to reduce this size (given as the last-but-one column in DMRG output):

only look at nicely behaved systems, in particular nothing critical and no long time evolutions
use symmetries to split the dense tensors up into more, smaller blocks. $ \mathrm{U}(1) $ symmetries such as particle number or $ S^z $-spin allow us to discard large percentages of the dense subblocks, resulting in much smaller sizes. If you have a $ \mathrm{SU}(2) $ symmetry in the system, there are relations which require that some elements of these tensors are identical, allowing us to only store one copy (rather than e.g. all five values associated to the five $ S^z = -2, -1, 0, 1, 2 $ states associated to a $ S = 2 $ spin).

Further, depending on your system, you may benefit from settings which are not the default settings:

if you observe your blocks getting very big very quickly (last but one row of DMRG output) in DMRG3S, consider setting the expandBlocksize value (shortcut: eb) to something smaller like 10 or 20. Conversely, if you observe the number of blocks or the ratio between SU(2) and U(1) states getting very large, consider setting it to something larger. This value sets the maximal size by which each symmetry sector is enlarged during each iteration. A smaller value grows the sectors more slowly but also tends to prefer (e.g.) larger spin sectors, while a larger value causes the bond dimension to grow very quickly and favours the most important sectors. Note that the value here is always limited by the maximal number of states selected and the maximal blocksize allowed during truncation to avoid pointless expansion. If your state is sufficiently large, you may also set the expandRatio (shortcut: er) to something smaller from its default value of 0.5. This further limits the size of the expansion blocks, but in a more uniform way as it selects a fixed ratio of each block. However, for small states, it also over-estimates smaller blocks (due to a clipping of a minimal expansion of 1 required to get out of a product state).
SyTen currently implements two eigensolver routines: the standard Lanczos method and a variant of the Davidson method (without the preconditioning). The Lanczos method with Gram-Schmidt reorthogonalisation (mode K submode A) should be the default, but both the Davidson method (mode D) or the matrix-based reorthogonalisation of the old 'classic' Lanczos implementation (mode C submode M) may be faster. To set these, look at the eigensolver stanza in the stages configuration. Note however that in particular the matrix-based reorthogonalisation may be unstable. On some systems, the Davidson method converges much better and to the actual ground state, while the Lanczos method will get stuck in a highly-excited state.
There is also a 2DMRG method implemented. Some people say that this method may be faster or lead to better convergence for some systems, though I seriously doubt that. If you want to try it out, you can set the variant to "2DMRG" for a given stage, e.g. using (m 200 v 2DMRG) for m=200 states and a 2DMRG solver.
In some cases, using a different contraction sequence in DMRG may speed up your calculation, though a decent amount of effort has gone into making the default work as well as possible. Check the output of syten-dmrg --help for details.

How do I enable profile-guided optimisation and link-time optimisation?

Using both profile-guided optimisation and link-time optimisation helps a little bit with GCC (approx. 5%). To enable this, first add the line

CPPFLAGS_EXTRA+=-fprofile-generate

to your make.inc file, recompile and run a "typical" DMRG calculation. This will be slower than usual, so maybe restrict the number of sweeps. When completing, a number of *.gcda files will be created next to the compiled object files. Once this has been done, replace the line above by

CPPFLAGS_EXTRA+=-fprofile-use -flto -fno-fat-lto-objects

and recompile again. This will use the run-time timing information to slightly optimise your code. Especially in larger calculations where most time is spent inside BLAS/LAPACKE functions, the impact will be minimal, though.

How can I use gdb to debug my code?

First, add -ggdb3 to the CPPFLAGS_EXTRA variable in your make.inc file and ensure that you also added -DDEBUG and -Og. Before running your program, set the environment variable SYTEN_USE_EXTERN_DEBUG to disable the built-in signal and termination handler (which normally prints backtraces) and not unwind the stack and exit once an error occurs.

Then run gdb as normal, e.g. gdb --args syten-dmrg -l lat:H.... To print C++ objects, you can define a cpp_print function e.g. as follows:

> define cpp_print
> call (void)operator<<(std::cerr, $arg0)
> printf "\n"
> end

which you can then use with any local variable whose operator<< is compiled in (at least if they are simple, I didn't manage to print tensors this way). To get a list of local variables, use info locals. Make sure to also read any of the GDB tutorials/documentation writeups available online.

What symmetries can be used to enhanced DMRG?

In principle, a physical symmetry can be used in a MPS if it is possible to assign each state of the MPS a single quantum number. For example, particle number conservation may be used if it is possible to assign each local physical state a single particle number. Momentum conservation on a ring can only be used after a Fourier transformation into momentum space, since otherwise the real-space states do not have definite momentum quantum numbers.

Currently implemented are $ \mathrm{SU}(3) $, $ \mathrm{SU}(2) $, $ \mathrm{U}(1) $ and $ \mathbb{Z}_k $ symmetry groups (note that the latter in particular also includes parity). Any local physical symmetry associated to any of these groups can be used without additional effort. Other groups ( $ \mathrm{SO}(3), \mathrm{Sp}(6)$ etc.) should be implementable without too much effort.

I need to store/read an object from disk. How?

There are three alternative approaches to storing and writing data to and from disks:

Boost serialisation is implemented and results in a binary format which allows a fairly efficient and compressed representation of all objects defined in SyTen. It is however not human-readable and intended primarily for cases where only SyTen itself uses the data (e.g. wavefunction representations, caches etc.). You can use save(), syten::loadInput and syten::loadCache to store and load objects.
read() and write() are functions which are implemented for a small subset of objects (primarily scalars and dense tensors). They are intended to be a compromise between human readability and ease of parsing by SyTen. Consequently, write() results in text files which can also be generated with other tools (e.g. Python) to represent for example coefficient matrices. If a write() function exists for an object, the corresponding read() function will be able to parse the output of the write() function and give the original object back.
syten::operator<< is overloaded for many objects and will result in a nicely-formatted human-readable representation. However, there is no guarantee that this format is stable, nor is there always a corresponding and invertible syten::operator>> which would allow you to read data back in. Use these for output intended for the user.
Finally, for the specific case of dense tensors, there are syten::readTabulatedCmplx and syten::readTabulatedReal which require you to already know the dimensions of the dense tensor to be read in, but otherwise offer a very convenient way to specify individual values at certain positions of a tensor, in particular if there are only a few non-zero values.

Tensor Products

Remarks: This FAQ mostly applies to the standard rank-typed tensors syten::Tensor, it does not directly apply to the newer smart tensor class syten::STensor.

There are only pairwise tensor contractions, i.e. expressions C = A·B. The basic interface for these contractions is the prod<C>(a<F>, b<S>, c_a, c_b) function which takes two tensors of ranks F and S and arrays of integers of the same size as well as the number of contracted tensor legs as an explicit template parameter C. Legs i of a and j of b are then contracted if c_a[i] == c_b[j] and c_a[i] > 0. A leg i of a becomes the k-th leg of the resulting tensor if c_a[i] == -k, similarly a leg j of b becomes the k-th leg of the resulting tensor if c_b[j] == -k. For example, to implement $ C_{ik} = \sum_{j} A_{ij} B_{jk} $, say C = prod<1>(A, B, {-1, 1}, {1, -2}). Each tensor leg has an associated direction, one can only contract an outgoing with an incoming index and vice versa, not two outgoing indices nor two incoming indices. prod<>() takes an additional boolean value which, if true, complex-conjugates the second argument and considers the directions of the legs of the second tensor to be inverted. By default, this flag is false.

Complete products (resulting in a scalar value) are available in two variants: prod<C>(a<C>, b<C>) calculates $ A \cdot B $ assuming that the order of legs matches between a and b and complex-conjugates b by default. Alternatively, it is also possible to specify all legs as contracted in the arrays c_a and c_b above which allows for products where the order of indices does not match between a and b. In this second form, b is not complex-conjugated by default. $ \mathrm{Tr}(A B^\dagger) $ can then be written as either prod<2>(a, b, {1, 2}, {1, 2}, true) or prod<2>(a, b[, true]).

There are two convenience wrappers around this basic prod<>() function:

The first, instead of integer arrays c_a and c_b takes two maps from indices to integers, this allows for an arbitrary order of these indices and hence does not require knowledge of the tensor order (only a global variable or function evaluating to the correct leg value). The matrix-matrix multiplication above could then be written as prod<1>(a, b, {{1, -1}, {2, 1}}, {{2, -2}, {1, 1}}).

The second takes either three strings or one string describing the tensor contraction using Einsteins summation convention. The variant with just one string immediately splits the string into three substrings at | characters and then behaves like the former. The index names can have arbitrary length and contain all characters but , which is used to separate the different indices. The matrix-matrix product above could be written as prod<1>(a, b, "i,j|j,k|i,k") or prod<1>(a, b, "i,j", "j,k", "i,k") or prod<1>(a, b, "first,summed|summed,second|first,second"). When using this form to initialise a complete product, the result index string has to be empty, i.e. prod<2>(a, b, "a,b|a,b|", true) or prod<2>(a, b, "first,second", "first,second", "", true).

Remarks: Outer products are possible, e.g. there are two possible ways to generate a four-fold identity tensor and fs and io2 are exactly identical in the following:

Tensor<4> fs;
{ auto f = gM(aBasis, bBasis);
  auto s = gM(aBasis.f(), bBasis.f());
  fs = prod<1>(f, s, "a,b,x|af,bf,x|a,b,af,bf");
  fs.reduce(EliminateZeros::No, true);
}
Tensor<4> io2;
{ auto ioA = genIONode(aBasis);
  auto ioB = genIONode(bBasis);
  io2 = prod<0>(ioA, ioB, "a,af|b,bf|a,b,af,bf");
}

Order of Tensor Contractions

The order of contraction of indices is rather important in tensor networks. For example, when calculating the norm of a matrix product state, it is paramount to contract from left to right all site tensors with their complex conjugate and the relevant transfer matrix, rather than first contracting all sites tensors, then all complex conjugates of all site tensors and then the two now-massive tensors representing the state together. While the former runs in polynomial time (linear in the length of the MPS, qubic in its bond dimension, linear in its physical dimension), the latter runs in exponential time in the length of the MPS! Similar differences can often occur in smaller networks when it is possible to contract multiple tensors in different ways.

While for small networks, the optimal ordering is fairly easy to figure out by hand (simply check the cost of all possible orderings), for larger networks, this gets complicated quickly.

In the source of arXiv:1304.6112v6, you will find a netcon.m file. Open this file in Matlab ($ module load matlab; matlab, then click it…). You then get a function netcon which is explained in more detail also in that paper. Essentially, a call:

netcon({[1,2,3],[4,5,6],[1,5,7,8,-1],[2,4,7,-2],[3,6,8,-3]},
       2,1,1,0,[1 w; 2 m; 3 m; 4 m; 5 w; 6 m; 7 d; 8 d; -1 v; -2 n; -3 n])

specifies that you have five tensors in your network (the five groups enclosed in [] in the first argument). The first tensor has indices labelled 1, 2 and 3; the third tensor has indices labelled 1, 5, 7, 8 and -1. Indices labelled with positive integers are contracted (e.g. the first indices of tensors one and three are to be contracted), while negative integers denote resulting tensor legs (the last leg of the third tensor is the first leg of the resulting tensor). The options in between shouldn't have to be changed, the last option denotes the dimensions of each index (labelled by positive and negative integers), e.g. index -1 has dimension v whereas index 2 has dimension m. You can change and assign to these variables as you please; for example

w = 3; v = 6; m = 100; n = 200; d = 2;

works for assignment. Once you call netcon(), it should give you a result relatively quickly, naming the indices in the order in which you should contract them, i.e.

Best sequence:  6 3 1 5 8 2 4 7
 Cost:           10032000000

means to first contract over index 6 (tensors 2 and 5), then index 3 (tensors 1 and (25)), followed by indices 1, 5 and 8 (tensor 3 and (125)) and finally indices 2, 4 and 7 (tensor 4 and (3125)).

There doesn’t seem to be an analytic option to this, but using hugely different orders of magnitude for different indices works equally well as specifying $ m \gg d $ in some imaginary language.

Ordering of Tensor Legs

The above section deals with the order in which multiple tensors should be contracted together. This section deals with the ordering of legs within a tensor for optimal contraction. It is based on the fact that each dense tensor block is stored internally as an array of scalar numbers. When two dense tensor blocks are contracted together, the numbers in the two arrays have to be reshuffled such that the tensor contraction can be expressed as a matrix-matrix multiplication (transpose-transpose-gemm-transpose, TTGT). For example, given tensors $ A_{ajib} $ and $ B_{icdj} $, the contraction $ \sum_{ij} A_{ajib} B_{icdj} $ first transforms $ A_{ajib} \to A^\prime_{(ab)(ij)} $ and $ B_{icdj} \to B^\prime_{(ij)(cd)} $. It then calculates the matrix-matrix product $ A^\prime \cdot B^\prime = R^\prime_{(ab)(cd)} $ and then reorders the scalars in $ R^\prime $ according to the requested ordering of legs; e.g. if the call was prod<2>(a, b, {-2, 1, 2, -4}, {2, -3, -1, 1}), entries in $ R^\prime $ are sorted as $ R^\prime_{(ab)(cd)} \to R_{dacb} $.

This has the consequence that if $ A $, $ B $ or $ R $ are already in the same order as their primed counterparts, no rearraging of entries has to take place. Taking into account the column-major convention of BLAS which is used to calculate the matrix-matrix product, no rearranging will take place:

For $ A \to A^\prime $ iff:
- contracted indices come first, in ascending order
- non-contracted indices come second, in ascending order
For $ B \to B^\prime $ iff:
- contracted indices come second, in ascending order
- non-contracted indices come first, in ascending order
For $ R^\prime \to R $ iff:
- All non-contracted indices from $ B $ come before those of $ A $.
- Non-contracted indices within $ A $ and $ B $ are in ascending order.

For example, for the contraction of two rank-3 tensors into a new rank-4 tensor, the following call is optimal:

prodD<1>(a, b, {1,-3,-4}, {-1,-2,1})

and if all dimensions of a and b are of size 15, it takes 0.93ms to complete. In comparison, the call

 prodD<1>(a, b, {-1,1,-3}, {-4,-2,1})

with the same sizes takes 1.70ms to complete.

When contracting the same two tensors into a single rank-2 tensor, the call

prodD<2>(a, b, {1,2,-2}, {-1,1,2})

takes 0.055ms, while the call

prodD<2>(a, b, {-2,1,2}, {2,-1,1})

takes 0.122ms.

For larger tensors, the difference between the two calls becomes less and less noticeable, since re-arranging two matrices of size $ m \times n $ and size $ n \times p $ only takes $ O(m \cdot n + n \cdot p) $ operations, while their product takes $ O(m \cdot n \cdot p) $ operations. Nevertheless, it is useful to keep in mind.

One major exception to this formulation is the handling of fusing/merging tensors. When merging two legs of a tensor, the merged legs are first transposed to the front, merged and then transposed into the final position. As such, you ideally want to merge the first two legs of a tensor and then keep the merged leg in the front.

In general, it is more expensive to transpose legs in the result tensor than in the initial tensor, as they are transposed in every individual result tensor prior to the tensor reduction. Hence, if possible, avoid reordering tensor legs in the last transpose step of transpose-transpose-gemm-transpose.

Noncommuting multiplication: Is there any system to this madness?

Operator Multiplication

Sadly, C++ only knows operator*=, not operator=*. Hence, we can only write state *= op, which we want to be $ \hat O | \psi \rangle $. Consequently, state *= op1 *= op2 should be $ \hat O_2 \hat O_1 |\psi\rangle $. This in turn means that op1 * op2 should probably be $ \hat O_2 \hat O_1 $ (i.e. $ \hat O_1 $ applied first and then $ \hat O_2 $). This extends to dot() products of operators, where we want dot(a, b) to apply a first and b second. Since in $ \hat O_2^\dagger \cdot \hat O_1 $ it is the second operator which is dual-space conjugated, dot(a, b) also dual-space-conjugates b, not a.

Scalar Products

Use the function overlap(a, b) which complex-conjugates its first argument (a) and calculates the complete product between a and b.

The prod(a, b) family of functions for type-ranked tensors (syten::Tensor) has an option to complex-conjugate its second argument for arbitrary tensor products. To be consistent with this, in the special case that all legs are contracted, prod(a, b) also complex-conjugates its second argument. This is also true for syten::prodD() and syten::prodS(). Note that while for products with a product specification to match different legs together, the second argument is not conjugated by default. In contrast, if one calculates the full product without giving a specification, the second argument is conjugated by default.

For states, dense tensors and tensors (but not syten::STensor), operator* also complex-conjugates the first argument.

syten::STensor objects are multiplied by operator* or syten::STensorImpl::prod(). None of the arguments are conjugated by default regardless of whether only some or all legs are contracted. Use syten::STensorImpl::overlap() to obtain overlaps or manually create a proxy using syten::STensorImpl::conj().

The Operator Parser

The command-line operator parser was until recently inconsistent, but now keeps arguments in mathematical order, i.e. operators further to the right are applied first.

Overview

| Expression | (via) | Result | |:------------------------------------------—|:--------------------—:|----------------------------------------------------------—:| | $ syten-overlap <a>  | overlap(a, b) | $ \langle a | b \rangle $ | | ‘'lat:a b *’| | @f$ \hat a \hat b \left(|\psi\rangle\right) @f$ | |'lat:a b ·'|dot(b, a)| @f$ \hat a^\dagger \cdot \hat b \left(|\psi\rangle\right) @f$| |overlap(MPS::LBOState a, MPS::LBOState b)| | @f$ a^\dagger \cdot b @f$ | |overlap(MPS::State a, MPS::State b)| | @f$ a^\dagger \cdot b @f$ | |overlap(BTT::State a, BTT:State b)| | @f$ a^\dagger \cdot b @f$ | |overlap(MPS::Operator a, MPS::Operator b)| | @f$ \mathrm{tr}\left(\hat a^\dagger \cdot \hat b \right) @f$ | |overlap(Tensor a, Tensor b)|prod(b, a, Conj::y())| @f$ a^\dagger \cdot b @f$ | |overlap(STensor a, STensor b)|prod(conj(a), b)| @f$ a^\dagger \cdot b @f$ | |overlap(StateWrapper a, StateWrapper b)| | @f$ a^\dagger \cdot b @f$ | |operator*(STensor a, STensor b)| | @f$ a \circ b @f$ (connected over all matching legs) | |operator*(DenseTensor a, DenseTensor b)|prodD(b, a)| @f$ a^\dagger \cdot b @f$ | |operator*(Tensor a, Tensor b)|prod(b, a)| @f$ a^\dagger \cdot b @f$ | |operator*(MPS::Operator a, MPS::Operator b)| | @f$ \hat b \hat a \left(|\psi\rangle\right) @f$ | |prod(STensor a, STensor b)| | @f$ a \circ b @f$ (connected over all matching legs) | |prod(Tensor a, Tensor b)| | @f$ b^\dagger \cdot a @f$ | |prodD(DenseTensor a, DenseTensor b)| | @f$ b^\dagger \cdot a @f$ | |prodS(SparseTensor a, SparseTensor b)` | | $ b^\dagger \cdot a $ |

Scalar types

There are numerous scalar types defined via typedefs. Here’s an overview (with default settings):

Name	(default) underlying type	(default) range/precision	Application
`syten::Index`	`std::uint32_t`	0 to 2^32	Tensor indices and dimensions
`syten::Size`	`std::size_t`	0 to 2^64	Indices into flat arrays
`syten::Rank`	`std::size_t`	0 to 2^64	Tensor ranks, compile-time constants, arrays
`syten::CDef`	`syten::HighPrec`	real, 76 digits	Clebsch-Gordan coefficient tensors element type
`syten::MDef`	`syten::HighPrec`	real, 76 digits	IREP representation matrices
`syten::SDef`	`std::complex<double>`	complex, 15 digits	Standard dense tensor element type, configurable with `SYTEN_COMPLEX_SCALAR`/`SYTEN_SDEF_TYPE`, `_c` (`_i`) user-defined literal
`syten::SRDef`	`double`	real, 15 digits	Default real-only low-precision type, configurable with `SYTEN_SRDEF_TYPE`, `_r` user-defined literal

If you want to describe a standard tensor element or scalar type, use SDef. If you want to describe a real scalar type, use SRDef. Use SRDef also for threshold and order-of-magnitude calculations. For Clebsch-Gordan-Coefficient tensors and calculations leading up to them, use CDef.

One of the aims here is to be able to, with relatively little effort, use float instead of double, use double instead of std::complex<double> and potentially even use higher-precision types. As such, most code should use only the scalar typedefs listed above.

Overlaps of matrix-product states

Given two MPS $ |a\rangle $ and $ |b\rangle $ and the aim to evaluate the overlap very precisely:

if their overlap $\langle a | b \rangle $ is small, there is probably nothing better to do (I think) than syten-overlap a.state b.state which will give you this small number. Depending on the length of the system and the bond dimension of the states, this overlap can probably be $ 10^{-10} $ even if it should be zero.

Maybe the stability of this calculation can be improved by bringing both states into left-canonical form first and then calculating the overlap. This is straightforward in pyten. The reasoning is that if the states are in canonical form, most site tensors $ A_{1…L-1} $ are "well-behaved" and their product should also be well-behaved. Since the syten::MPS::overlap() function proceeds from left to right, the non-well-behaved tensors $ A_L $ would be used last. But I'm not sure if this really makes a difference.
if their overlap $\langle a | b \rangle $ is large and you want the deviation from 1, then instead of calculating $ \sqrt{2} \sqrt{1 - \langle a | b \rangle} $ (done by syten-overlap --root --subtract a.state b.state plus scaling by $ \sqrt{2} $) it is indeed better to evaluate $ || | a\rangle - |b \rangle || $ which is done by syten-overlap --error --root a.state b.state. In this case, syten-overlap first subtracts the states $ |a\rangle $ and $ |b\rangle $, normalises the result and calculates its norm. This results in a slightly more precise evaluation allowing for about two orders of magnitude smaller overlaps (cf. the figure below; at $ m = 256 $ the maximal bond dimension was obtained).

Overlaps close to 1 as calculated above with two different methods.

Why should I not copy syten::STensor objects?

Short answer: Use syten::STensorImpl::STensor::copy() if a copy is unavoidable, move otherwise.

Long answer: Unfortunately, it is rarely quite clear whether an object is copied or moved-from. In particular, constructing a vector from a std::initializer_list of pure r-values, e.g.

void f(Vec<STensor> x);
 
STensor a;
 
STensor factory();
 
f({std::move(a), factory()});

produces an unnecessary copy of syten::STensor objects because the content of std::initializer_list is constant, while doing the same with a tuple only produces one move. For a simpler example, see e.g. here, notice how f() constains references to X(X&&) whereas g() contains references to X(X const&).

Furthermore, while returning temporaries inside curly braces into a std::tuple (as e.g. done in syten::STensorImpl::svd) works and does not copy, having named objects there also creates an unnecessary copy, which is not at all reasonable nor quite straightforward to see, as returning a named copy outside a tuple invokes copy elision.

To avoid those and other errors in the future, the copy ctor of syten::STensor logs all implicit copies made using either it or the copy assignment operator if tensor timings are enabled. The public member function syten::STensorImpl::STensor::copy() does not log this and should be used if you are certain that some copy is unavoidable.

Runtime Errors

boost::archive::archive_exception: array size too short

The following error is printed when attempting to read a state or lattice file:

terminate called after throwing an instance of 'boost::archive::archive_exception'
  what():  array size too short

This is most likely due to a mismatch between the scalar types used by the toolkit binary (run syten-binary --version to get a printout of its scalar types used) and the file read in. For example, if a lattice is created by syten-sql-mps-spin using real scalar values and then subsequently attempted to be read by syten-random using complex scalar values (or vice versa), the error will occur. See SYTEN_COMPLEX_SCALAR for details.

boost::archive::archive_exception: input stream error

When reading an input file (e.g. a state or lattice) fails, the following error may be printed:

terminate called after throwing an instance of 'boost::archive::archive_exception'
  what():  input stream error
  Aborted

In this case, you should make sure that you read in a valid state or lattice file. A common reason for failure until recently was to read a file which is still written-to. This failure mode should now be avoided (we create a temporary file first, serialise into that file and then move the temporary file into place), but similar reasons (no space left on device, for example) may also lead to such an error.

Error while loading shared libraries

The following error is printed when attempting to run a binary (file name may be different):

error while loading shared libraries: libmkl_intel_lp64.so:
  cannot open shared object file: No such file or directory

This error occurs if the system does not find a shared library required by the program. Most likely this will be Boost or the Intel MKL. Make sure that paths to both are contained in your LD_LIBRARY_PATH. At most clusters, the Intel MKL can be enabled by module load intel or module load icc (the version needs to match the version you used to compile the binary). The Boost library can often similarly be loaded via module load or, alternatively, by directly adding the boost/lib directory to your LD_LIBRARY_PATH.

'Attempt to create unrepresentable HalfInteger quantum number…'

The toolkit stores quantum numbers as syten::HalfInteger values which internally are stored as standard integers of twice the represented value (so a quantum number of Sz=3/2 is stored as an syten::HalfInteger with value 3/2 which in turn is stored as an integer with value 7). By default, a std::int16_t is used as the internal integer storage, resulting in representable quantum numbers ranging from approx. -16'000 to approx. +16'000. For most applications, this is enough.

However, if your lattice is very large and/or bosonic, this may not be enough. Note that in certain applications, such as generation of ‘complete’ random states, enumeration of all possible quantum numbers is necessary.

To alleviate this issue, either choose a smaller lattice or define the macro SYTEN_RDEF_BASE_TYPE to std::int32_t or std::int64_t. Note that the typical concerns apply here, binaries will not be compatible between different values of that type. To do so, add a line CPPFLAGS_EXTRA+=-DSYTEN_RDEF_BASE_TYPE=std::int32_t to your make.inc file.

One of the executable arguments needs to be a complex number, but doesn't accept '(0, 0.01)'.

To pass a complex number as an argument on the command line, use ' around it to stop the shell from intepreting the brackets. Do not place any spaces inside the brackets, i.e., use -d '(0,0.01)', not -d '(0, 0.01)'.

DMRG runs forever and only prints energy zero

DMRG convergence is currently measured as the relative change of energy over the last two sweeps, something like $ (E_{\mathrm{last}} - E_{\mathrm{now}})/(|E_{\mathrm{last}}|) $ has to be small. If the energy is exactly zero (or small), this check breaks down. To limit the number of sweeps then, use e.g. (m 50 l 10) for m=50 states and at most l=10 half-sweeps.

This problem can in particular occur if you try to orthogonalise against many orthogonal states or in a small subspace, the eigensolver is then more likely to return a zero state.

<tt>syten-random</tt> with <tt>-g s</tt> takes forever to create a spin state

syten-random -g s first creates a near-vacuum state (N=0, S=0 etc.) and then applies single-site operators randomly until the state transforms as desired. However, if you attempt to create a S=0 state on five Heisenberg spins (or with e.g. five electrons in the Hubbard model), syten-random -g s will not notice that this is in fact impossible and just try to apply spin flip operators forever. If syten-random -g s takes suspiciously long, consider re-checking your arguments. Alternatively, the ‘complete’ generator -g c will generate a zero-norm state in this case (all tensors empty).

My calculation gets killed by <tt>SIGKILL</tt> or throws <tt>bad_alloc</tt> errors

If your program needs more memory than the operating system can supply, it may either throw a std::bad_alloc exception (if the OS denies an allocation) or get killed by a SIGKILL signal (if the OS initially agreed to the allocation but doesn’t have the memory once we access it, cf. overcommit.)

To diagnose such problems, you should

set the environment variables SYTEN_TENSOR_TIME=t and SYTEN_LOG_TIME_LVL=6. The former will cause timing of all tensor products to be enabled, the latter will print out the times after each individual product. In this way, you can check which product in your code is the problematic one, the last ‘completed’ product will be printed out just before the crash. The output tells you the line of the product as well as the sizes (in blocks and memory) of the involved tensors.
set the environment variable SYTEN_MEMORY_SAMPLER=mysamplefile. This will cause memory usage to be printed to the file continously. You can cross-check with the output of the tensor timer to see which products and at which times your code needs the most memory.

Once you have identified the parts of the code taking the most memory, you should check that only temporary tensors necessary during specific tensor contractions exceed the memory budget of your computer. If (e.g.) individual MPS tensors already go beyond the RAM available, there is not much one can do. If you have identified some temporary tensors which need a lot of RAM, you might want to

consider changing the tensor contraction order to minimise the sizes of the temporary tensors. Often, there is a trade-off between memory and CPU usage.
if your tensors are using SU(2) symmetries, it may be useful to call syten::correct_cgc_tensors on high-rank tensors periodically to avoid a build-up of errors in the CGC tensors.
consider caching unused tensors to disk. Especially during MPS calculations on long chains, storing 100 MPS component tensors of 500 MB each to disk saves you 50 GB of RAM which may be spent on the current on-site contraction tensors.
consider dynamically changing the contraction order. If in a matrix-matrix product A*B*C, sometimes A and sometimes C is bigger (and B in each case reduces the size considerably), it may be preferably to sometimes do (A*B)*C (if A is bigger) and sometimes do A*(B*C) (if C is bigger). syten::MemoryUsage::totalSize may help you determine the size of each tensor, alternatively, comparing the sizes of the relevant bonds may be more appropriate.
try to free unused memory by assigning zero tensors to temporary variables which are still in scope but no longer needed. E.g.
auto x = a * b;

auto y = x * c;

auto z = y * d; // x still takes memory!

may be changed to either
auto x = a * b;

auto y = x * c;

x = STensor();

auto z = y * d; // x still in scope but zero!

or, alternatively, if using syten::STensor or having equal-rank syten::Tensor objects:
auto x = a * b;

x = x * c; // or x *= c;

x = x * d; // or x *= d;
reconsider shared-memory parallelisation which necessarily incurs some memory overhead. This is particularly problematic if you are using top-level parallelisation such as operator-level or real-space parallelisation which causes the same large tensors to be needed in memory simultaneously. Consider disabling such parallelisation in these cases.

Caching warnings

If you get warnings like

AsyncCached<...>@...::wait_for_worker_completion() acquired mutex while worker was still running.

Will release mutex now and sleep for 100ms. See FAQ.

or

AsyncCached<...>@...::wait_for_worker_completion() called while worker is still busy caching the

object to disk. This is bad for performance. See FAQ.

these are due to unexpected situations in the asynchronous caching subsystem. This subsystem receives requests to cache or uncache specific objects and then dispatches workers to do so asynchronously. If a request to uncache an object comes in while this object is still being cached to disk, a performance hit is to be expected (as we have to wait until the caching process completes before we can undo it). You can enable/disable caching using the List of Environment Variables SYTEN_DO_CACHE and friends and you may also adapt the number of workers used and the caching threshold applied (to avoid caching very small objects).

While this warning may be annoying, it normally should not hint towards any kind of calculational error, the most likely cause is that the hard disk drive is much slower than expected.

pthread_mutex_lock: Assertion failure and deadlocks

If you get error messages like

../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.

or observe a deadlock in the mutex protecting a CUDA dense tensor product:

(gdb) bt
#0  0x00002aaaaacdeb7c in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00002aaaaacd983b in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00002aaab82cce3b in ?? () from /mpcdf/soft/SLE_12_SP3/packages/sandybridge/cuda/10.0.130/lib64/libcublas.so.10.0
#3  0x00002aaab82c6c6d in ?? () from /mpcdf/soft/SLE_12_SP3/packages/sandybridge/cuda/10.0.130/lib64/libcublas.so.10.0
#4  0x00002aaab82c9899 in cublasZgemm_v2 () from /mpcdf/soft/SLE_12_SP3/packages/sandybridge/cuda/10.0.130/lib64/libcublas.so.10.0
#5  0x0000000000d18dc1 in syten::CudaDenseTensorImpl::cuda_mm_cm () at inc/dense/cuda_prod.cpp:33

and your glibc version is very old (e.g. 2.22 as shipped by OpenSUSE and used on some clusters), try updating the version of glibc used. At least in one case, it was (a) not possible to reproduce the issue with identical settings and compilers except for a newer glibc and (b) updating glibc on the same cluster solved the problem.

I get an error or observe behaviour which doesn't happen anywhere else. Help?

When trying to reproduce an issue and failing, make sure that settings which may affect runtime behaviour are checked, in particular:

make sure that the same libraries/implementations (e.g. BLAS) are used
make sure that the make.inc file used to compile is identical or that all differences are understood
make sure that the output of env on both systems is identical in particular with respect to LD_LIBRARY_PATH and LD_PRELOAD variables
make sure that you are actually running the binaries you think you are running (check for example which syten-dmrg)
make sure that the git commits of the two executables match (using the --version and, if necessary, --version-diff switches).

Python code produces pages and pages of unintelligible error messages

Linking the pyten module against libtcmalloc causes Python exceptions to produce very long and uncompletely unusable error messages. Try again without linking against libtcmalloc.

I get warnings about 'validateCGCContraction' or that 'Norm of CGC tensors encountered during Tensor::reduce() should never be very small.'

Clebsch-Gordan tensors are used to store the relation between multiplets of states with non-abelian symmetry. During a long calculation, it is possible that errors accumulate in those tensors which lead to non-physical results. Such errors are detected by (among others) contracting the tensor with itself over all but one leg, a proper CGC (of any rank) will give a result proportional to the identity matrix. On every SVD, this contraction is required anyways to ensure proper normalisation and it is straightforward to check for validity of the tensor. Furthermore, tensor products should either be of order one or exactly zero – if they are very small, this also indicates a loss of precision.

Under those circumstances, you will get warning messages likes

2019.09.11T12:14:02.299 6.29976e+03 W SYTEN_WARN() triggered at ./inc/tensor/tensor_add_reduce.h line 209:
   Condition '!underThreshold(nc,SYTEN_ZNORM_THRESHOLD_SQD)' failed.
   Additional information: Norm of CGC tensors encountered during Tensor::reduce() should never be very small.

or

2019.09.11T12:15:29.190 6.31263e+03 W validateCGCContraction: input ({Sparse<2> in {4, 4}:  [{1, 1} => 0.218941] [{2, 2} => 0.218941]}) has element [{1,1}] == 0.218941 != 0.
   This is potentially problematic and hints at numerical instability. The offending
   tensor block has likely been removed.
2019.09.11T12:15:37.924 6.31417e+03 E SYTEN_ASSERT() triggered at ./inc/tensor/tensor_svd.h line 141:
   Condition 'SYTEN_SINGLE_ARG(validateCGCContraction(y) && y[{0,0}] != 0.)' failed.
   Additional information: ({Sparse<3> in {1, 2, 4}:  [{0, 0, 1} => 6.607123e-01] [{0, 1, 2} => -6.607123e-01]})
2019.09.11T12:15:37.924 6.31417e+03 E Uncaught exception encountered: SYTEN_ASSERT() at ./inc/tensor/tensor_svd.h line 141:
   Condition SYTEN_SINGLE_ARG(validateCGCContraction(y) && y[{0,0}] != 0.) failed.
2019.09.11T12:15:37.924 6.31417e+03 E A critical error occured requiring program termination.

The first message indicates that a small CGC tensor was observed, the second message that a non-physical CGC tensor was found (here combining a one-dimensional singlet and a two-dimensional doublet into a four-dimensional quadruplet, which is impossible).

In connection with those error messages, you may also observe your program to get stuck without visible progress and/or extremely high memory usage. This is due to those imprecise CGC spaces no longer being parallel to each other and hence not allowing the reduction of parallel tensor blocks in a tensor. As a result, the number of blocks in each tensor and the memory usage explode.

The high-precision scalars used by SyTen mean that in most cases, you will never run into these problems. However, if your calculations are particularly long and/or your tensor rank is relatively high and/or you are working with large-spin states, you might exhaust the precision of those scalars eventually. Fortunately, it is possible to stabilise even long-running calculations by periodically re-constructing the CGC tensors from scratch. This is done by merging all incoming and outgoing legs into one large leg, ensuring that the resulting rank-2 tensor is proportional to the identity and de-merging the legs again (using exact rank-3 fusing/splitting tensors). The function syten::correct_cgc_tensors() takes care of this and you may wish to call it from time to time, ideally on relatively small tensors carrying the calculation forward. For example, you might want to call it on the tensors of a MPS during a long time evolution, but there is (typically) no reason to call it on the intermediate tensor of e.g. a local eigensolver problem.

For examples of this being done, see e.g. inc/mps/tdvp.cpp or the iPEPSv1 code.

Note that while the C++ function syten::correct_cgc_tensors() takes its argument by reference and works on it in-place, the Python function returns the corrected CGC tensor, taking its argument by-value.

'OMP: Info #270: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.'

In OpenMP 5.0, the function omp_set_nested() to enable nested parallelisation is deprecated. In OpenMP 4.5, it is necessary to call it to enable nested parallelisation. Unfortunately, while both GCC and Clang at the moment do not implement OpenMP 5.0 and hence define the _OPENMP macro to be smaller than 201611 (for OpenMP 5.0 Preview 1), the libiomp5 library shipped with Clang 9.0 already implements this preview and prints a warning if we use omp_set_nested().

This means that while we check the value of the _OPENMP macro at compile-time to only call omp_set_nested() if the OpenMP version is 4.5 or earlier, this does not suffice to ensure that the same version is used at runtime. Unfortunately, OpenMP does not define an interface to get the version of the linked runtime at runtime.

With GCC 8.2, it is recommended to link in libiomp5, as libgomp has performance issues; then using a recent libiomp5 will print the warning. With Clang 9.0, the same problem exists when using the shipped libiomp5. GCC 9.x throws up when used together with libiomp5, hence for this version of GCC, one should only link in libgomp and the problem will not occur.

In any case, this is only a warning which may be annoying but does not hurt. It will go away as soon as the compilers implement OpenMP 5.0 and hence set the _OPENMP macro correctly.

GCC 9.x throws lots of errors at runtime when parallelising

For GCC 8.x and earlier, it is recommended to link in both the Intel-provided libiomp5 and the GNU-provided libgomp as OpenMP runtime libraries, with libiomp5 linked in first and taking precedence. This is because libgomp suffers from some severe performance problems and does not appear to keep a thread pool. However, starting with GCC 9.x, the compiler produces erroneous code when linked with libiomp5, you should hence not link in libiomp5 when using GCC 9.x. Furthermore, when using then Intel MKL together with GCC 9.x, make sure to use -lmkl_gnu_thread and not -lmkl_intel_thread (cf. make.inc.example).

Compiler Errors

OpenBLAS headers require const float * but we are expecting const std::complex<float> *

OpenBLAS in version 0.2.20 defines its complex functions to take float and double pointers whereas all other BLAS libraries take void pointers (cf. here). This was fixed in a later release.

The problem is likely to surface while compiling e.g. inc/util/dynarray.cpp.

If possible, please install a later release of OpenBLAS. If you are stuck on a buggy OpenBLAS installation, please

use the generic Netlib BLAS header (on Debian, by updating the alternatives symlink /etc/alternatives/cblas.h-x86_64-linux-gnu to point to /usr/include/x86_64-linux-gnu/cblas-netlib.h)
and do not define the SYTEN_USE_OPENBLAS macro (the generic header does not have the OpenBLAS-specific threading functions).

'./inc/tensor/tensor_prod.h:572:8: error: expected ‘(’ before ‘constexpr’'

The compiler emits an error message complaining e.g. about

if constexpr (is_number<decltype(r)>::value) {
   ^~~~~~~~~

Starting at commit f366cfad, SyTen requires a C++17 compatible compiler, such as GCC 7.x. If your compiler is older and you cannot upgrade, consider the last C++11-commit only, tagged as pre-C++17: git checkout pre-C++17.

'error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]'

GCC 7.1 incorrectly diagnoses a strict-aliasing violation (GCC 7.2 fixes this problem). To avoid the error message, add CPPFLAGS_EXTRA+=-Wno-strict-aliasing to your make.inc file.

Make: No rule to make target ‘X.h’, needed by ‘Y.d’. Stop.

If you encounter such an error, it is most likely because you still have an old dependency file which refers to a no-longer existing header file. Make then tries to re-make this header file, but does not know how (since they are written by hand, not by make) and then stops. Try to delete all old header files, e.g. via

find . -name '*.d' -delete

executed in the toolkit directory.

GCC/Clang: X.o:(.data+0x0): undefined reference to ‘__must_be_linked_with_icc_or_xild’

This error occurs if you compiled object files with the Intel compiler, but then attempted to compile an executable using either GCC or Clang. Either use the Intel compiler to also compile the executable or re-compile the object files with the other compiler. This can be done easiest by deleting all old object files via

find . -name '*.o' -delete

GCC prior to 7.x: <tt>error: attributes at the beginning of statement are ignored [-Werror=attributes]</tt>

GCC 6.x does not yet know about the [[fallthrough]] attribute which is a part of C++17. If you encounter this error, it is best to silence the warning by adding the line

CPPFLAGS_EXTRA+=-Wno-attributes

to your make.inc file.

In function ‘_start’: undefined reference to ‘main’

The following error is emitted by the linker:

/usr/lib/../lib64/crt1.o: In function `_start':
  /usr/src/packages/BUILD/glibc-2.11.3/csu/../sysdeps/x86_64/elf/start.S:109:
  undefined reference to `main'

This complains that there is no function main() when attempting to build an executable. Most likely the error occurs because you only declared a function syten::main(). To be precise, an executable should follow either this model:

#include "header.h"
 
namespace syten {
  int main(int argc, char** argv) {
    // do stuff here
    return 42;
  }
}
 
int main(int argc, char** argv) { return syten::main(argc, argv); }

or

#include "header.h"
 
int main(int argc, char** argv) {
  using namespace syten;
  // do stuff here
  return 42;
}

The first method is slightly preferred because it minimises the use of the global namespace and hence possibly name clashes with other functions.

How should I acknowledge using the toolkit?

Please see Contributors.

Table of Contents