SyTen
Running

Each binary prints a short help text when invoked with --help and some build information when passed --version. Some environment variables are also parsed and may influence program behaviour.

OpenMP Threading

If OpenMP is enabled, threading will take place. There are four defined “levels” of threading, each corresponding to a set of locations at which threading is used, and each can be configured with its own number of threads. Each thread count can be changed via environment variables and options passed to the binary, with the option taking precedence over the environment variable(s). From top to bottom, these are

  • super: Used to parallelise over large-scale structures, e.g. all tensors in a state or MPO. Defaults to a single thread. Environment variable is SYTEN_THREADS_SUPER, option is --threads-super. Real-Space DMRG Parallelisation uses this option to specify the number of workers; sampled random state generation uses it to produce multiple sampled states at the same time.
  • tpo: Used to parallelise by considering a tensor-product operator (TPO) as a sum of smaller TPOs. Currently only used in MPS-DMRG when passing in multiple Hamiltonians by repeating -l.
  • tensor: Used to parallelise over symmetry-protected blocks in a single tensor. Environment variables are OMP_NUM_THREADS and SYTEN_THREADS_TENSOR, with the latter taking precedence. Option is --threads-tensor.
  • dense: Used to parallelise over the elements of dense reduced tensors, e.g. in tensor-tensor contractions. Environment variables are MKL_NUM_THREADS and SYTEN_THREADS_DENSE, with the latter taking precedence. Option is --threads-dense.
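The precedence rule described above (command-line option over SYTEN_* variable over generic variable over default) can be sketched as follows. This is an illustration of the documented rules, not SyTen's actual implementation; the function name and structure are invented for the example:

```python
import os

def resolve_threads(cli_option=None, specific_var="SYTEN_THREADS_TENSOR",
                    generic_var="OMP_NUM_THREADS", default=1, env=None):
    """Illustrative precedence: CLI option > SYTEN_* variable > generic variable > default."""
    env = os.environ if env is None else env
    if cli_option is not None:          # e.g. --threads-tensor wins outright
        return int(cli_option)
    if specific_var in env:             # SYTEN_THREADS_TENSOR comes next
        return int(env[specific_var])
    if generic_var in env:              # OMP_NUM_THREADS is consulted last
        return int(env[generic_var])
    return default

# Only the generic variable is set:
print(resolve_threads(env={"OMP_NUM_THREADS": "8"}))                # 8
# The SYTEN_* variable takes precedence over the generic one:
print(resolve_threads(env={"OMP_NUM_THREADS": "8",
                           "SYTEN_THREADS_TENSOR": "4"}))           # 4
```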

Speed-Up to Be Expected from OpenMP Threading
  • Super-level threading is largely only implemented in the context of real-space DMRG parallelisation. There, you can expect a close-to-ideal speed-up per sweep at the cost of slightly more sweeps, especially during the beginning of the calculation.
  • Tensor-product level threading allows parallelisation by treating a sum of tensor-product operators as one effective operator. This may eventually lead to a speed-up; in the cases tested so far, however, it has always resulted in worse parallelisation than the simple tensor-level parallelisation below. The sole exception is the case of no symmetries, where it is favourable to use this.
  • Tensor-level threading gives speed-ups up to roughly half the number of blocks in each individual tensor; to get an estimate, check the N column in the DMRG output. Allowing too many threads here hardly ever results in a slow-down.
  • Dense-level threading sometimes results in a speed-up for very large dense matrices, but very often leads to a serious slow-down, in particular if, due to the increased number of threads/cores, the tensor-level threads can no longer sit on a single blade.

As always, using more threads than cores available will kill performance.

List of Environment Variables

This list is an attempt to enumerate the environment variables affecting the behaviour of SyTen. It may not be complete, but every variable listed here should behave as described.

Behaviour Changes

  • SYTEN_MPO_PARSE_NOTRUNC: If set, truncation of MPOs during parsing is disabled. This has the advantage that you get exactly what you specify; however, it also leads to larger MPOs than necessary (e.g. for DMRG). Note that some tools (syten-print and syten-truncate) always disable truncation of MPOs during parsing.
  • SYTEN_DEFAULT_TPO_TRUNCATE: Can be set to s/svd, l/delinearise, p/deparallelise or d/default. If set, MPO truncation routines will only use the selected truncation method, unless specified in the function call (this is rare, however). If you observe numerical trouble from truncating MPOs, consider setting this to p.
  • SYTEN_NO_HISTORY: If set, disables recording of new history records for states. This is useful if you intend to run many iterations of command-line tools which would normally record the history.
  • SYTEN_PROD_EARLY_CHECK: If set, tensor products check the dimensions of dense tensors early, i.e. sequentially and outside of the OpenMP loop later tasked with the actual dense tensor products. The reason is that throwing exceptions out of OpenMP loops is forbidden and leads to program termination. In particular in ipython+pyten.so this is rather annoying, as it kills the entire notebook. The early check is enabled by default by pyten.so, but can also be enabled by setting this environment variable.
  • SYTEN_MINIMISE_RANKS: Integer values -1, 0 and 1 are currently used. 0: default behaviour; -1: always use asymptotically optimal algorithms with potentially higher-rank tensors; 1: always attempt to minimise tensor ranks at the cost of asymptotic computational efficiency.
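Most of the variables above follow plain "if set" semantics, so they are typically exported before launching a tool or set from Python before the library acts on them. A minimal sketch, using the variable names listed above (the helper function is illustrative and not part of SyTen; whether an empty value counts as "set" is an assumption here):

```python
import os

# Enable the early dimension check (any presence counts as "set"):
os.environ["SYTEN_PROD_EARLY_CHECK"] = "1"
# Ensure history recording stays enabled by removing the variable:
os.environ.pop("SYTEN_NO_HISTORY", None)

def flag_is_set(name, env=None):
    """Sketch of 'if set' semantics: presence, not value, is what matters."""
    env = os.environ if env is None else env
    return name in env

print(flag_is_set("SYTEN_PROD_EARLY_CHECK"))  # True
```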

Caching subsystem

  • SYTEN_DO_CACHE: If set to 1, process-wide caching is enabled; if set to 0, it is disabled. --cache also sets this option. If enabled, calls to syten::Cached::cache() (and the syten::AsyncCached versions) will work; if disabled (the default), they will return immediately as if the file were too small for the configured threshold.
  • SYTEN_CACHE_THRESHOLD: Adapts the caching threshold in bytes below which objects are not cached to disk. Defaults to 1 MiB (1048576 bytes).
  • SYTEN_CACHE_MAXWORKERS: Adapts the number of workers of the asynchronous caching infrastructure. Defaults to 10.
  • SYTEN_CACHE_DIR: If set, do not save cache files in the current working directory but in the specified directory instead.
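Taken together, the caching decision described above amounts to the following sketch. The function is illustrative; only the variable names and defaults come from the list above:

```python
import os

def should_cache(size_bytes, env=None):
    """Cache to disk only if caching is enabled and the object reaches the threshold."""
    env = os.environ if env is None else env
    if env.get("SYTEN_DO_CACHE") != "1":
        return False  # caching disabled (the default)
    # Objects below the threshold are not cached; default is 1 MiB.
    threshold = int(env.get("SYTEN_CACHE_THRESHOLD", 1048576))
    return size_bytes >= threshold

print(should_cache(2 * 1048576, env={"SYTEN_DO_CACHE": "1"}))  # True
print(should_cache(1024, env={"SYTEN_DO_CACHE": "1"}))         # False
```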

CUDA subsystem

  • SYTEN_CUDA_DEVICES: Sets the allowed CUDA devices to the supplied comma-separated list. Has no effect if CUDA support is not compiled in. SYTEN_CUDA_DEVICES=0,2,3 has the same effect as a call to Cuda::setup({0,2,3}). Note that using multiple CUDA devices is currently disabled due to a race condition somewhere.
  • SYTEN_CUDA_THRESHOLD: Threshold in bytes below which tensors are moved to the CPU rather than to the GPU when a tensor-tensor operation acts on one tensor on the GPU and one on the CPU. That is, when some operation op(A,B) is called on two dense tensors A and B, of which one lives on the CPU and the other on the GPU, and the tensor on the GPU is smaller (in bytes) than SYTEN_CUDA_THRESHOLD, that tensor is moved to the CPU and the operation is performed there.
  • SYTEN_CUDA_NUM_HANDLES: Sets the number of cuBLAS handles to be initialised per device up-front. Should be roughly equal to the total number of threads, possibly divided by the number of devices.
  • SYTEN_CUDA_ALLOC_MAX and SYTEN_CUDA_ALLOC_MIN: Set the logarithms of the maximal and minimal block sizes used by the CUDA allocator. That is, the CUDA allocator obtains blocks of size 2**SYTEN_CUDA_ALLOC_MAX from the system (and forwards requests for larger sizes directly to cudaMalloc()) and splits those blocks into pieces no smaller than 2**SYTEN_CUDA_ALLOC_MIN bytes. By default, blocks of 4 GiB are obtained and split into pieces of at least 256 bytes.
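The stated defaults correspond to exponents 32 and 8, since 2**32 bytes is 4 GiB and 2**8 is 256 bytes. A quick check (the exponent values are inferred from the quoted defaults, not read from the source code):

```python
ALLOC_MAX = 32   # inferred: 2**32 bytes = 4 GiB obtained from the system
ALLOC_MIN = 8    # inferred: 2**8 = 256 bytes, smallest piece after splitting

print(2**ALLOC_MAX)  # 4294967296 bytes, i.e. 4 GiB
print(2**ALLOC_MIN)  # 256 bytes
```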

Threading

  • SYTEN_NO_THREADS: If set, disables all threading. Overwritten by SYTEN_THREADS_*, OMP_NUM_THREADS, MKL_NUM_THREADS and command-line options.
  • OMP_NUM_THREADS: If set, sets the number of tensor-block level threads to this value. Overwritten by SYTEN_THREADS_TENSOR and the command-line option --threads-tensor.
  • MKL_NUM_THREADS: If set, sets the number of dense level threads to this value. Overwritten by SYTEN_THREADS_DENSE and the command-line option --threads-dense.
  • SYTEN_THREADS_{SUPER,TPO,TENSOR,DENSE}: If set, sets the number of threads on that level to this value. Overwritten by command-line options --threads-{super,tensor,dense,sub}.

Debugging information

  • SYTEN_INFO_DYNARRAY_ALLOC: If set, the size of dynamic allocations from syten::DynArray will be printed to cerr.
  • SYTEN_DISABLE_BACKTRACES: If set, disables backtraces from printing. Useful if you somehow generate many warnings.
  • SYTEN_VARAPPLYORTHO_TIMING: If set, enables detailed timing information in the apply_op_orthogonalise_fit() function used by the V mode of syten-krylov.
  • SYTEN_MEMORY_SAMPLER: If set, enables the memory sampler. The sampler collects S measurements T milliseconds apart and then prints the average and the maximal measured memory usage to the file named in the SYTEN_MEMORY_SAMPLER environment variable. T can be set via SYTEN_MEMORY_SAMPLER_INTERVAL, S via SYTEN_MEMORY_SAMPLER_SAMPLES; the default is to collect 100 samples 10 ms apart.
  • SYTEN_USE_EXTERN_DEBUG: If set, the termination handler and other printing of backtraces are disabled. In particular, the termination handler leads to clean program exits under error conditions, which is not helpful if one uses GDB or similar to debug the code.
  • SYTEN_TENSOR_TIME: If set, tensor products and decompositions are timed (if possible including their call-point) and collected into a big statistic which is printed either at program termination or when syten::TensorProd::timer_print_data() is called.
  • SYTEN_DBG_AD: If set, extra debugging information is printed for automatic differentiation.
  • SYTEN_STENSOR_REGISTER: If set, each created syten::STensor object is registered together with its creation time. A call to syten::STensorImpl::registry_printout() (or, in Python, stensor_registry_printout()) will cause all currently-existing syten::STensor objects to be listed together with their creation time, rank, location in memory, autodiff ID and basis IDs.

Logging levels

  • SYTEN_LOG_TIME_FNAME and SYTEN_LOG_GENERIC_FNAME: If set, the file names for generic and timing logs to use. Overwritten by command-line options --log-file and --log-file-timings respectively.
  • SYTEN_LOG_TIME_LVL, SYTEN_LOG_TIME_FLVL and SYTEN_LOG_TIME_TLVL: Timing log levels for both the file and the terminal, the file only, and the terminal only, respectively. _FLVL and _TLVL overwrite _LVL. Overwritten by the command-line options --log-level-timings, --log-level-timings-file and --log-level-timings-err.
  • SYTEN_LOG_GENERIC_LVL, SYTEN_LOG_GENERIC_FLVL and SYTEN_LOG_GENERIC_TLVL: Generic log levels for both the file and the terminal, the file only, and the terminal only, respectively. _FLVL and _TLVL overwrite _LVL. Overwritten by the command-line options --log-level, --log-level-file and --log-level-err.
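The overwrite rule for the _FLVL/_TLVL variables can be sketched as follows. The helper is illustrative, not SyTen code; only the environment variable names are taken from the list above:

```python
def effective_level(kind, target, env):
    """kind: 'TIME' or 'GENERIC'; target: 'F' (file) or 'T' (terminal).
    The target-specific _FLVL/_TLVL variable overwrites the shared _LVL one."""
    specific = f"SYTEN_LOG_{kind}_{target}LVL"
    shared = f"SYTEN_LOG_{kind}_LVL"
    return env.get(specific, env.get(shared))

env = {"SYTEN_LOG_TIME_LVL": "info", "SYTEN_LOG_TIME_TLVL": "warn"}
print(effective_level("TIME", "F", env))  # info  (falls back to _LVL)
print(effective_level("TIME", "T", env))  # warn  (_TLVL overwrites _LVL)
```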

Output control

In general, warnings, errors, trace logs, etc. are always printed to the standard error stream; only actual output data (e.g. calculated expectation values) is written to the standard output stream. The format and amount of the latter strongly depend on the tool used. The former, however, can be handled in a fairly general fashion:

All tools currently support the --quiet option, which completely suppresses all non-necessary output. There is currently a move towards generic logging functions which obey the options --log-level* and --log-file*. With these logging functions, an ‘informational’ level of output is the default; it can be reduced to only printing notices, warnings or errors. A log file can also be defined, in which case messages are stored there as well, and different levels can be set for the terminal output and the log file if required.
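The stream separation described above means that `tool > data.txt` captures only the data while diagnostics stay on the terminal. A generic illustration of the convention (the function and values are invented for the example, not SyTen output):

```python
import sys

def emit(value):
    """Write result data to stdout and a diagnostic to stderr,
    mirroring the convention described above."""
    print(value)                                    # data stream (stdout)
    print("note: value emitted", file=sys.stderr)   # diagnostic stream (stderr)

emit("0.4217")
```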