mlx_rs::ops

Function quantize

pub fn quantize(
    w: impl AsRef<Array>,
    group_size: impl Into<Option<i32>>,
    bits: impl Into<Option<i32>>,
) -> Result<(Array, Array, Array)>

Quantize the matrix w using bits bits per element.

Note: each row of w is divided into groups of group_size consecutive elements, and every group is quantized together with its own scale and bias. The number of columns of w must therefore be divisible by group_size.
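
To make the grouping concrete, here is a minimal plain-Rust sketch of group-wise quantization for a single row. It assumes a common affine min/max scheme (scale = (max - min) / (2^bits - 1), bias = min); this is an illustration only, not the kernel mlx_rs runs, and the helper names quantize_row and dequantize_row are hypothetical. The codes are also left unpacked here, one per element.

```rust
/// Hypothetical helper: quantize one row in groups of `group_size` elements.
/// Returns one code per element plus one scale and one bias per group.
/// Assumes an affine min/max scheme, which may differ from the library's.
fn quantize_row(row: &[f32], group_size: usize, bits: u32) -> (Vec<u32>, Vec<f32>, Vec<f32>) {
    assert!(group_size > 0 && row.len() % group_size == 0,
        "row length must be divisible by group_size");
    assert!((1..=8).contains(&bits), "this sketch handles 1 to 8 bits");

    // Largest code representable with `bits` bits.
    let levels = ((1u32 << bits) - 1) as f32;
    let mut codes = Vec::with_capacity(row.len());
    let mut scales = Vec::new();
    let mut biases = Vec::new();

    for group in row.chunks(group_size) {
        let min = group.iter().copied().fold(f32::INFINITY, f32::min);
        let max = group.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        // Avoid dividing by zero when every element in the group is equal.
        let scale = if max > min { (max - min) / levels } else { 1.0 };
        for &x in group {
            codes.push(((x - min) / scale).round() as u32);
        }
        scales.push(scale);
        biases.push(min);
    }
    (codes, scales, biases)
}

/// Hypothetical helper: reconstruct approximate values from the codes.
fn dequantize_row(codes: &[u32], scales: &[f32], biases: &[f32], group_size: usize) -> Vec<f32> {
    codes
        .iter()
        .enumerate()
        .map(|(i, &q)| q as f32 * scales[i / group_size] + biases[i / group_size])
        .collect()
}
```

With group_size = 4, a row of 8 elements yields two groups and therefore two scales and two biases. Smaller groups track the local value range more closely at the cost of storing more scales and biases.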

quantize currently supports only 2D inputs whose dimensions are multiples of 32.

For details, please see this documentation

Params

  • w: The input matrix
  • group_size: The size of the group in w that shares a scale and bias. (default: 64)
  • bits: The number of bits occupied by each element of w in the returned quantized matrix. (default: 4)
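
The shape constraints and defaults described on this page can be checked up front, before any array is built. The sketch below is plain Rust and does not call mlx_rs; check_quantize_args and the two constants are hypothetical names introduced for illustration, and the rules it encodes are only the ones stated above (2D input, dimensions that are multiples of 32, column count divisible by group_size). The library may enforce further restrictions, for example on which bits values are accepted.

```rust
// Defaults as documented for group_size and bits.
const DEFAULT_GROUP_SIZE: i32 = 64;
const DEFAULT_BITS: i32 = 4;

/// Hypothetical pre-flight check mirroring the documented constraints.
/// Returns the effective (group_size, bits) after applying the defaults.
fn check_quantize_args(
    shape: &[i32],
    group_size: impl Into<Option<i32>>,
    bits: impl Into<Option<i32>>,
) -> Result<(i32, i32), String> {
    let group_size = group_size.into().unwrap_or(DEFAULT_GROUP_SIZE);
    let bits = bits.into().unwrap_or(DEFAULT_BITS);

    if shape.len() != 2 {
        return Err(format!("expected a 2D input, got {} dimension(s)", shape.len()));
    }
    if group_size <= 0 || bits <= 0 {
        return Err("group_size and bits must be positive".to_string());
    }
    if shape.iter().any(|&d| d <= 0 || d % 32 != 0) {
        return Err(format!("dimensions {:?} must be positive multiples of 32", shape));
    }
    if shape[1] % group_size != 0 {
        return Err(format!(
            "column count {} is not divisible by group_size {}",
            shape[1], group_size
        ));
    }
    Ok((group_size, bits))
}
```

Because both optional parameters take impl Into<Option<i32>>, the same style as the quantize signature, a caller can pass either None to get the default or a bare integer such as 32.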