memdet#
- detkit.memdet(A, max_mem=inf, num_blocks=1, assume='gen', triangle=None, mixed_precision='float64', parallel_io=None, scratch_dir=None, overwrite=False, return_info=False, verbose=False)#
Compute log-determinant under memory constraint.
- Parameters:
- Anumpy.ndarray, numpy.memmap, zarr.core.Array, dask.array, tensotstore.array
Square matrix, which is either already loaded on the memory (such as
numpy.ndarray
), or cannot fit on the memory, and rather, stored on disk as Numpy’s memory map, or Zarr, Dask, or TensorStore array formats.- max_memfloat or str, default=
float('inf')
The maximum memory allowed to be allocated during the computation. This can be specified as an integer representing the number of bytes, or as a string such as
16.2GB
(a number immediately followed by the unit of memory such asB
,KB
,MB
,GB
,TB
, etc).The default value of
float('inf')
indicates infinite amount of memory is available, hence, no memory is constrained. This case falls back to conventional computation of log-determinant on memory without creation of any scratchpad space on disk.Note
To constrain memory, you can either set
max_mem
or directly set the number of matrix blocks (seenum_blocks
option). Ifmax_mem
is set,num_block
option is ignored.- num_blocksint, default=1
Number of memory blocks along rows and columns. This is used when the whole matrix cannot be loaded on the memory, rather, smaller blocks of the matrix are loaded on the memory.
If =1: the whole matrix is loaded to memory as one block. No scratchpad disk space is needed as all data is on memory.
If =2: matrix is decomposed to 2 by 2 memory blocks (four blocks), but three of these blocks will be loaded concurrently to memory. No scratchpad disk space is needed.
If >2: matrix is decomposed to a grid of (
num_blocks
,num_blocks
) blocks, but only 4 of these blocks will be loaded concurrently. Scratchpad disk space will be created (seescratch_dir
option).
The number of blocks may or may not be a divisor of the matrix size. If the number of blocks is not a divisor of the matrix size, the blocks on the last row-block and column-block will have smaller size.
Note
To constrain memory, you can either set
num_blocks
or directly set the amount of memory limit (seemax_mem
option). Ifmax_mem
is set, the givennum_block
option by the user is ignored, and instead computed automatically.- triangle
'l'
,'u'
, or None, default=None When the matrix is symmetric, this option indicates whether the full matrix is stored or only a triangular part of the matrix is given:
'l'
: assumes the lower-triangular part of the matrix is given.'u'
: assumes the upper-triangular part of the matrix is given.None
: indicates full matrix is given.
- assumestr {
'gen'
,'sym'
,'spd'
}, default='gen'
Assumption on the input matrix A:
'gen'
: generic square matrix'sym'
: symmetric matrix'spd'
: symmetric positive-definite matrix
The assumption on the matrix is not checked.
- mixed_precisionstr {
'float32'
,'float64'
}, or numpy.dtype, default='float64'
The precision at which the computations are performed. This may be different than the data type of the input matrix. It is recommended to set a precision equal or higher than the dtype of the input matrix. For instance, if the input matrix has
float32
data type, you may set this option tofloat64
.- parallel_iostr {
'multiproc'
,'dask'
,'tensorstore'
} or None, default=None Parallel data transfer (load and store operations of each block) from memory to scratchpad on the disk and vice-versa:
'multiproc'
: utilizes Python’s built-in multiprocessing.'dask'
: utilizes Dask’s multiprocessing. For this to work, the package dask should be installed.'tensorstore'
: utilizes TensorStore’s multiprocessing. For this to work, the packages tensorstore and zarr should be installed.None
: no parallel processing is performed. All data transfer is performed on a single CPU thread.
Note
The option
'tensorstore'
can only be used when the input matrix A is a zarr array. See zarr package.- scratch_dirstr, default=None
When
num_blocks
is greater than 2, the computations are performed on a scratchpad space on disk. This option determines the directory where memdet should create a temporary scratch file. IfNone
, the default OS’s tmp directory will be used. For instance, in UNIX, this is almost always'/tmp'
directory.Note
This directory should have enough space as much as the size of the input matrix (or half of the input matrix size if
triangle
option is set).- overwriteboolean, default=True
Uses the input matrix storage for intermediate computations. This will overwrite the input matrix.
- return_infobool, default=False
Returns a dictionary containing profiling information such as wall and process times, memory allocation, disk usage, etc. See
info
variable in the return section below.- verbosebool, default=False
Prints verbose output during computation.
- Returns:
- ldfloat
\(\mathrm{logabsdet}(\mathbf{A})\), which is the natural logarithm of the absolute value of the determinant of the input matrix.
- signint
Sign of determinant
- diagnumpy.array
An array of the size of the number rows (or columns) of the matrix, containing the diagonal elements of the matrix decomposition as follows:
For genetic matrix (when
assume='gen'
), this is the diagonal entries of the matrix \(\mathbf{U}\) in the LU decomposition \(\mathbf{P} \mathbf{A} = \mathbf{L} \mathbf{U}\).For symmetric matrix (when
assume='sym'
), this is the diagonal entries of the matrix \(\mathbf{D}\) in the LDL decomposition \(\mathbf{P} \mathbf{A} = \mathbf{U}^{\intercal} \mathbf{D} \mathbf{U}\) where \(\mathbf{U}\) is upper-triangular.For symmetric positive-definite matrix (when
assume='spd'
), this is the diagonal entries of the matrix \(\mathbf{L}\) in the Cholesky decomposition \(\mathbf{A} = \mathbf{U}^{\intercal} \mathbf{U}\) where \(\mathbf{U}\) is upper-triangular.
- if
return_info=True
: - infodict
A dictionary containing the following key-values:
'matrix'
: info about input matrix'dtype'
: the data type of the input matrix.'matrix_shape'
: shape of the input matrix.'triangle'
: in case of symmetric matrix, whether upper or lower triangle part of matrix is given (based ontriangle
option).'assume'
: whether matrix is generic, symmetric, or symmetric and positive-definite (based onassume
option).
'process'
: info about the computation process and profiling'processor'
: name of the CPU processor'tot_wall_time'
: total wall time of the process.'tot_proc_time'
: total process time of all CPU threads combined.'load_wall_time'
: wall time for only the load operation, which is the data transfer from disk to memory. This is relevant only if scratchpad space was used during the computation.'load_proc_time'
: process time of all CPU threads for only the load operation, which is the data transfer from disk to memory. This is relevant only if scratchpad space was used during the computation.'store_wall_time'
: wall time for only the store operation, which is the data transfer from memory to disk. This is relevant only if scratchpad space was used during the computation.'store_proc_time'
: process time of all CPU threads for only the store operation, which is the data transfer from memory to disk. This is relevant only if scratchpad space was used during the computation.
'block'
: info about matrix blocks'block_nbytes'
: number of bytes of each block allocated on the memory. When the number of blocks along row-block (or column-block) is not a divisor of the matrix size, some blocks may be smaller, however, this quantity reports the size of the largest block.'block_shape'
: shape of each memory block in array size. When the number of blocks along row-block (or column-block) is not a divisor of the matrix size, some blocks may be smaller, however, this quantity reports the size of the largest block.'matrix_blocks'
: the shape of the grid of blocks that decomposes the input matrix, which is (num_blocks
,num_blocks
).
'scratch'
: info about scratchpad space (relevant if used)'io_chunk'
: the size of data chunks for for input/output data transfer operations between disk and memory. This size is almost always equal to the size of number of rows/columns of each block (seeblock_shape
above).'num_scratch_blocks'
: number of blocks stored to the scratchpad space. Note that not all memory blocks are stored, hence, this quantity is smaller thannum_blocks * num_blocks
.'scratch_file'
: the scratch file that was created, and later deleted after termination of the algorithm. This file was in thescratch_dir
and it was a hidden file (for instance, in UNIX, it has a dot prefix).'scratch_nbytes'
: the size of scratchpad file in bytes.'num_block_loads'
: a counter of the number of times that blocks were read from disk to memory.'num_block_stores'
: a counter of the number of times that blocks were written from memory to disk.
'memory'
: info about memory allocation'alloc_mem'
: block memory allocated in bytes divided bymem_unit
.'alloc_mem_peak'
: block peak memory allocated in bytes divided bymem_unit
.'total_mem'
: total memory allocated in bytes divided bymem_unit
. This includes the memory of blocks and any extra memory required by the algorithm.'total_mem_peak'
: total peak memory allocated in bytes divided bymem_unit
. This includes the memory of blocks and any extra memory required by the algorithm.'mem_unit'
: the unit in which the above memory are reported with. This is usually the memory (in bytes) of one block, so it makes the above memory memory sizes relative to the memory size of one block.
'solver'
: info about the solver'version'
: version of detkit package'method'
: method of computation, such as LU decomposition , LDL decomposition, or Cholesky decomposition, respectively for generic, symmetric, or symmetric positive-definite matrices.'dtype'
: the data type used during computation (see'mixed_precision'
option).'order'
: order of array, such asC
for contiguous (row-major) ordering orF
for Fortran (column-major) ordering during computation.
- Raises:
- RuntimeError
Error raised when
assume='spd'
and matrix A is not symmetric positive-definite.
See also
Notes
How to Limit Memory Usage:
If the whole matrix cannot be loaded on the memory, this function chunks the matrix into smaller sub-matrices (blocks) and load three or four of these blocks concurrently to the memory.
For instance, if your matrix size is 100 GB, and your machine has 16 GB memory, you may need a grid of 5 by 5 blocks (25 blocks), each having 100 GB / 25 = 4 GB in size. Four of these blocks take 16 GB, which can fit your machine’s memory.
There are two ways to set the memory limit:
either directly, by setting
max_mem
argument (such as16GB
in the above example),or indirectly, by setting
num_blocks
argument (such as 5 in the above example).
You only need to set one of these arguments, but not both. However, if you set
max_mem
, the argumentnum_blocks
is ignored, and rather, recalculated frommax_mem
.What is Scratch:
When
num_blocks
is 1 or 2 (a grid of 1x1 or 2x2 blocks), all calculations are performed on the memory, even if the whole input matrix cannot be fit on the memory (in case of 2x2 blocks)!However, for larger number of blocks (when
num_blocks
is greater than 2), this function creates a temporary space on your disk to store the variables during the inner computations. This space (called scratchpad) is a hidden file created in thescratch_dir
directory, and will be automatically removed once this function returns.If you do not specify
scratch_dir
, the tmp directory in your operating system (such as/tmp
in UNIX) will be used.What is Parallel IO:
This function reads and writes to the scratchpad on your disk. For very large matrices (and hence, very large blocks) the read/write operations (io operations) can be time consuming. You can leverage the
parallel_io
argument to let all CPU threads performing these tasks in parallel. However, note that, depending on your hardware, your disk may throttle parallel file operations.Using Dask:
When using Dask (either if the input array
A
is a Dask array or whenparallel_io='dask'
), you should calldetkit.memdet()
function in a protected if-clause. See further details at multiprocessing-error-without-if-clause-protection.The “diag” Output Variable:
In addition to the log-abs-determinant (
ld
) and sign of determinant (sign
) variables, this function also returns thediag
variable. The variablediag
is a 1D array of size n (the number of rows or columns ofA
), and can be used to compute the log-abs-determinant of any sub-matrixA[:m, :m]
, all at once, wherem
can be 1 to n. If we denote the sub-matrixA[:m, :m]
as \(\mathbf{A}_{[:m, :m]}\) and the elementdiag[i]
as \(d_i\), we have:For generic and symmetric matrices (if
assume
is set to'gen'
or'sym'
):\[\log \vert \mathrm{det}(\mathbf{A}_{[:m, :m]}) \vert = \sum_{i=1}^{m} \log \vert d_i \vert.\]For symmetric and positive-definite matrices (if
assume
is set to'spd'
):\[\log \vert \mathrm{det}(\mathbf{A}_{[:m, :m]}) \vert = 2 \sum_{i=1}^{m} \log \vert d_i \vert.\]
In particular, the output variable
ld
is computed fromdiag
when \(m = n\).Note that computing
diag
is a by-product for free and does not require any additional cost.References
[1]Siavash Ameli, Chris van der Heide, Liam Hodgkinson, Fred Roosta, Michael W. Mahoney (2024). Determinant Estimation under Memory Constraints and Neural Scaling Law (under review)
Examples
In this example, we generate a random matrix
A
, and for test purposes, we store this matrix on the disk as a zarr arrayz
. You can either passA
orz
todetkit.memdet()
.>>> # Create a symmetric matrix >>> import numpy >>> n = 10000 >>> A = numpy.random.randn(n, n) >>> A = A.T + A >>> # Store matrix as a zarr array on disk (optional) >>> import zarr >>> z_path = 'matrix_file.zarr' >>> z = zarr.open(z_path, mode='w', shape=(n, n), dtype=A.dtype) >>> z[:, :] = A >>> # Compute log-determinant while limiting memory to 500 MB >>> from detkit import memdet >>> ld, sign, diag, info = memdet(z, max_mem='500MB', assume='sym', ... parallel_io='tensorstore', ... verbose=True, return_info=True) >>> # logarithm of absolute value of determinant >>> print(ld) 82104.567748 >>> # sign of determinant >>> print(sign) -1
By setting
verbose=True
, a detailed log is printed during the computation, as shown in the screenshot below.The above logs illustrate how the matrix is processed. For example, due to the memory limit of 500 MB, a matrix of size 762.9 MB is decomposed into smaller blocks (a grid of 3 by 3 blocks), where each block is 84.8 MB. At any time, only four of these blocks are concurrently loaded into memory: blocks A11, A12, A21, and A22. The allocated size of each block is shown.
In the above, the computation was performed in 7 steps. The number of steps vary depending the number of blocks. Each step may involve:
Loading a block from disk to memory (loading blk)
Storing a block from memory back to disk (storing blk)
Performing LU, LDL, or Cholesky decomposition (e.g. ldl decompo)
Solving an upper triangular system of equations (solve uptri)
Solving a lower triangular system of equations (solve lotri)
Computing the Schur complement (schur compl)
For each task, the proceeding columns in the verbose prints are as follows:
time: CPU process time
cpu: CPU utilization percentage (for all CPU threads combined)
alloc: Peak memory allocation during the task
read: Data read from scratchpad on disk
write: Data written to scratchpad on disk
Note that an efficient implementation should not allocate any new memory during any of the above tasks during the computation. The only memory allocation should be the creation of the blocks at the beginning. As seen in the screenshot above, all memory allocations (on the order of KB) are negligible compared to the size of a block (on the order of MB), indicating that no new array is created.
The above code also returns the
info
variable by settingreturn_info=True
. Here is a pretty-print ofinfo
dictionary:>>> # Print info results >>> from pprint import pprint >>> pprint(info)
which gives the following output:
{ 'block': { 'block_nbytes': 88924448, 'block_shape': (3334, 3334), 'matrix_blocks': (1, 1) }, 'matrix': { 'assume': 'sym', 'dtype': 'float64', 'matrix_shape': (10000, 10000), 'triangle': None }, 'memory': { 'alloc_mem': 4.003514331626776, 'alloc_mem_peak': 4.004212868434112, 'mem_unit': '88924448 bytes', 'total_mem': 0.0009245601389620096, 'total_mem_peak': 0.0027315772598329765 }, 'process': { 'load_proc_time': 4.354914859999929, 'load_wall_time': 2.3955368995666504, 'num_proc': 8, 'processor': ' Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz', 'store_proc_time': 1.6077849289999904, 'store_wall_time': 1.626847743988037, 'tot_proc_time': 600.250950974, 'tot_wall_time': 115.73805451393127 }, 'scratch': { 'io_chunk': 3334, 'num_block_loads': 11, 'num_block_stores': 2, 'num_scratch_blocks': 4, 'scratch_file': '/tmp/.detkit-memdet-va1k6eio.zarr', 'scratch_nbytes': 0 }, 'solver': { 'dtype': 'float64', 'method': 'ldl decomposition', 'order': 'F', 'version': '0.6.1' } }