llama.cpp GGML Quantization Type


  • 1. GGML Quantization Type
  • 2. `static const struct ggml_type_traits type_traits[GGML_TYPE_COUNT]`
  • 3. `Q#_K_M` and `Q#_K`
  • References


GGUF
https://huggingface.co/docs/hub/gguf

docs/hub/gguf.md
https://github.com/huggingface/hub-docs/blob/main/docs/hub/gguf.md

1. GGML Quantization Type

packages/gguf/src/quant-descriptions.ts
https://github.com/huggingface/huggingface.js/blob/main/packages/gguf/src/quant-descriptions.ts

import { GGMLQuantizationType } from "./types";

export const GGUF_QUANT_DESCRIPTIONS: Record<GGMLQuantizationType, { txt: string; src_url?: string }> = {
  [GGMLQuantizationType.F32]: { txt: "32-bit standard IEEE 754 single-precision floating-point number.", src_url: "https://en.wikipedia.org/wiki/Single-precision_floating-point_format" },
  [GGMLQuantizationType.F16]: { txt: "16-bit standard IEEE 754 half-precision floating-point number.", src_url: "https://en.wikipedia.org/wiki/Half-precision_floating-point_format" },
  [GGMLQuantizationType.Q8_0]: { txt: "8-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today).", src_url: "https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249" },
  [GGMLQuantizationType.Q8_1]: { txt: "8-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today).", src_url: "https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290" },
  [GGMLQuantizationType.Q8_K]: { txt: "8-bit quantization (q). Each block has 256 weights. Only used for quantizing intermediate results. All 2-6 bit dot products are implemented for this quantization type. Weight formula: w = q * block_scale.", src_url: "https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305" },
  [GGMLQuantizationType.Q6_K]: { txt: "6-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(8-bit), resulting in 6.5625 bits-per-weight.", src_url: "https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305" },
  [GGMLQuantizationType.Q5_0]: { txt: "5-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today).", src_url: "https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249" },
  [GGMLQuantizationType.Q5_1]: { txt: "5-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today).", src_url: "https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290" },
  [GGMLQuantizationType.Q5_K]: { txt: "5-bit quantization (q). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: w = q * block_scale(6-bit) + block_min(6-bit), resulting in 5.5 bits-per-weight.", src_url: "https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305" },
  [GGMLQuantizationType.Q4_0]: { txt: "4-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today).", src_url: "https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249" },
  [GGMLQuantizationType.Q4_1]: { txt: "4-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today).", src_url: "https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290" },
  [GGMLQuantizationType.Q4_K]: { txt: "4-bit quantization (q). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: w = q * block_scale(6-bit) + block_min(6-bit), resulting in 4.5 bits-per-weight.", src_url: "https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305" },
  [GGMLQuantizationType.Q3_K]: { txt: "3-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(6-bit), resulting in 3.4375 bits-per-weight.", src_url: "https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305" },
  [GGMLQuantizationType.Q2_K]: { txt: "2-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(4-bit) + block_min(4-bit), resulting in 2.5625 bits-per-weight.", src_url: "https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305" },
  [GGMLQuantizationType.IQ4_XS]: { txt: "4-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 4.25 bits-per-weight.", src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70" },
  [GGMLQuantizationType.IQ3_S]: { txt: "3-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 3.44 bits-per-weight.", src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70" },
  [GGMLQuantizationType.IQ3_XXS]: { txt: "3-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 3.06 bits-per-weight.", src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70" },
  [GGMLQuantizationType.IQ2_S]: { txt: "2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.5 bits-per-weight.", src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70" },
  [GGMLQuantizationType.IQ2_XS]: { txt: "2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.31 bits-per-weight.", src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70" },
  [GGMLQuantizationType.IQ2_XXS]: { txt: "2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.06 bits-per-weight.", src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70" },
  [GGMLQuantizationType.IQ1_S]: { txt: "1-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 1.56 bits-per-weight.", src_url: "https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70" },
  [GGMLQuantizationType.IQ4_NL]: { txt: "4-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix.", src_url: "https://github.com/ggerganov/llama.cpp/pull/5590" },
  [GGMLQuantizationType.I8]: { txt: "8-bit fixed-width integer number.", src_url: "https://github.com/ggerganov/llama.cpp/pull/6045" },
  [GGMLQuantizationType.I16]: { txt: "16-bit fixed-width integer number.", src_url: "https://github.com/ggerganov/llama.cpp/pull/6045" },
  [GGMLQuantizationType.I32]: { txt: "32-bit fixed-width integer number.", src_url: "https://github.com/ggerganov/llama.cpp/pull/6045" },
  [GGMLQuantizationType.I64]: { txt: "64-bit fixed-width integer number.", src_url: "https://github.com/ggerganov/llama.cpp/pull/6062" },
  [GGMLQuantizationType.F64]: { txt: "64-bit standard IEEE 754 double-precision floating-point number.", src_url: "https://en.wikipedia.org/wiki/Double-precision_floating-point_format" },
  [GGMLQuantizationType.IQ1_M]: { txt: "1-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 1.75 bits-per-weight.", src_url: "https://github.com/ggerganov/llama.cpp/pull/6302" },
  [GGMLQuantizationType.BF16]: { txt: "16-bit shortened version of the 32-bit IEEE 754 single-precision floating-point number.", src_url: "https://en.wikipedia.org/wiki/Bfloat16_floating-point_format" },
};
| type | source | description |
| --- | --- | --- |
| F64 | Wikipedia | 64-bit standard IEEE 754 double-precision floating-point number. |
| I64 | GH | 64-bit fixed-width integer number. |
| F32 | Wikipedia | 32-bit standard IEEE 754 single-precision floating-point number. |
| I32 | GH | 32-bit fixed-width integer number. |
| F16 | Wikipedia | 16-bit standard IEEE 754 half-precision floating-point number. |
| BF16 | Wikipedia | 16-bit shortened version of the 32-bit IEEE 754 single-precision floating-point number. |
| I16 | GH | 16-bit fixed-width integer number. |
| Q8_0 | GH | 8-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today). |
| Q8_1 | GH | 8-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today). |
| Q8_K | GH | 8-bit quantization (q). Each block has 256 weights. Only used for quantizing intermediate results. All 2-6 bit dot products are implemented for this quantization type. Weight formula: w = q * block_scale. |
| I8 | GH | 8-bit fixed-width integer number. |
| Q6_K | GH | 6-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(8-bit), resulting in 6.5625 bits-per-weight. |
| Q5_0 | GH | 5-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today). |
| Q5_1 | GH | 5-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today). |
| Q5_K | GH | 5-bit quantization (q). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: w = q * block_scale(6-bit) + block_min(6-bit), resulting in 5.5 bits-per-weight. |
| Q4_0 | GH | 4-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today). |
| Q4_1 | GH | 4-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today). |
| Q4_K | GH | 4-bit quantization (q). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: w = q * block_scale(6-bit) + block_min(6-bit), resulting in 4.5 bits-per-weight. |
| Q3_K | GH | 3-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(6-bit), resulting in 3.4375 bits-per-weight. |
| Q2_K | GH | 2-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(4-bit) + block_min(4-bit), resulting in 2.5625 bits-per-weight. |
| IQ4_NL | GH | 4-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix. |
| IQ4_XS | HF | 4-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 4.25 bits-per-weight. |
| IQ3_S | HF | 3-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 3.44 bits-per-weight. |
| IQ3_XXS | HF | 3-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 3.06 bits-per-weight. |
| IQ2_XXS | HF | 2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.06 bits-per-weight. |
| IQ2_S | HF | 2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.5 bits-per-weight. |
| IQ2_XS | HF | 2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.31 bits-per-weight. |
| IQ1_S | HF | 1-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 1.56 bits-per-weight. |
| IQ1_M | GH | 1-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 1.75 bits-per-weight. |

GH = GitHub, HF = Hugging Face
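
The legacy Q#_0 / Q#_1 block formats above are simple enough to decode by hand. Below is a minimal, self-contained C sketch of a Q4_0-style block and its dequantization (w = q * block_scale). It is illustrative only: the struct name is hypothetical and the scale is stored as a plain float for clarity (the real block_q4_0 in ggml uses a half-precision scale), but it shows the basic idea of packing two 4-bit quants per byte.

#include <stdint.h>
#include <stdio.h>

#define QK4_0 32  /* weights per block */

/* Simplified Q4_0-style block: one scale plus 32 packed 4-bit quants. */
typedef struct {
    float   d;              /* block_scale */
    uint8_t qs[QK4_0 / 2];  /* two 4-bit quants per byte */
} block_q4_0_demo;

/* Dequantize one block: w = q * block_scale. The stored nibble 0..15 is
 * offset by 8, so it represents the signed range -8..7. */
static void dequantize_block_q4_0_demo(const block_q4_0_demo *b, float *y) {
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int q0 = (b->qs[j] & 0x0F) - 8;  /* low nibble  -> element j           */
        const int q1 = (b->qs[j] >>   4) - 8;  /* high nibble -> element j + QK4_0/2 */
        y[j]             = q0 * b->d;
        y[j + QK4_0 / 2] = q1 * b->d;
    }
}

int main(void) {
    block_q4_0_demo b = { .d = 0.05f };
    for (int j = 0; j < QK4_0 / 2; ++j)
        b.qs[j] = (uint8_t)((j % 16) | ((15 - j % 16) << 4));  /* arbitrary demo data */
    float w[QK4_0];
    dequantize_block_q4_0_demo(&b, w);
    printf("w[0] = %f, w[16] = %f\n", w[0], w[16]);
    return 0;
}

For the Q#_1 variants the only difference in the formula is the added block_minimum: w = q * block_scale + block_minimum.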

2. static const struct ggml_type_traits type_traits[GGML_TYPE_COUNT]

https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-quants.h
https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-quants.c

https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml.c

static const struct ggml_type_traits type_traits[GGML_TYPE_COUNT] = {
    [GGML_TYPE_I8]      = { .type_name = "i8",      .blck_size = 1,      .type_size = sizeof(int8_t),         .is_quantized = false, },
    [GGML_TYPE_I16]     = { .type_name = "i16",     .blck_size = 1,      .type_size = sizeof(int16_t),        .is_quantized = false, },
    [GGML_TYPE_I32]     = { .type_name = "i32",     .blck_size = 1,      .type_size = sizeof(int32_t),        .is_quantized = false, },
    [GGML_TYPE_I64]     = { .type_name = "i64",     .blck_size = 1,      .type_size = sizeof(int64_t),        .is_quantized = false, },
    [GGML_TYPE_F64]     = { .type_name = "f64",     .blck_size = 1,      .type_size = sizeof(double),         .is_quantized = false, },
    [GGML_TYPE_F32]     = { .type_name = "f32",     .blck_size = 1,      .type_size = sizeof(float),          .is_quantized = false, },
    [GGML_TYPE_F16]     = { .type_name = "f16",     .blck_size = 1,      .type_size = sizeof(ggml_fp16_t),    .is_quantized = false, .to_float = (ggml_to_float_t) ggml_fp16_to_fp32_row,   .from_float_ref = (ggml_from_float_t) ggml_fp32_to_fp16_row, },
    [GGML_TYPE_Q4_0]    = { .type_name = "q4_0",    .blck_size = QK4_0,  .type_size = sizeof(block_q4_0),     .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_q4_0,     .from_float_ref = (ggml_from_float_t) quantize_row_q4_0_ref, },
    [GGML_TYPE_Q4_1]    = { .type_name = "q4_1",    .blck_size = QK4_1,  .type_size = sizeof(block_q4_1),     .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_q4_1,     .from_float_ref = (ggml_from_float_t) quantize_row_q4_1_ref, },
    [4]                 = { .type_name = "DEPRECATED", .blck_size = 0, .type_size = 0, .is_quantized = false, }, // GGML_TYPE_Q4_2
    [5]                 = { .type_name = "DEPRECATED", .blck_size = 0, .type_size = 0, .is_quantized = false, }, // GGML_TYPE_Q4_3
    [GGML_TYPE_Q5_0]    = { .type_name = "q5_0",    .blck_size = QK5_0,  .type_size = sizeof(block_q5_0),     .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_q5_0,     .from_float_ref = (ggml_from_float_t) quantize_row_q5_0_ref, },
    [GGML_TYPE_Q5_1]    = { .type_name = "q5_1",    .blck_size = QK5_1,  .type_size = sizeof(block_q5_1),     .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_q5_1,     .from_float_ref = (ggml_from_float_t) quantize_row_q5_1_ref, },
    [GGML_TYPE_Q8_0]    = { .type_name = "q8_0",    .blck_size = QK8_0,  .type_size = sizeof(block_q8_0),     .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_q8_0,     .from_float_ref = (ggml_from_float_t) quantize_row_q8_0_ref, },
    [GGML_TYPE_Q8_1]    = { .type_name = "q8_1",    .blck_size = QK8_1,  .type_size = sizeof(block_q8_1),     .is_quantized = true,                                                          .from_float_ref = (ggml_from_float_t) quantize_row_q8_1_ref, },
    [GGML_TYPE_Q2_K]    = { .type_name = "q2_K",    .blck_size = QK_K,   .type_size = sizeof(block_q2_K),     .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_q2_K,     .from_float_ref = (ggml_from_float_t) quantize_row_q2_K_ref, },
    [GGML_TYPE_Q3_K]    = { .type_name = "q3_K",    .blck_size = QK_K,   .type_size = sizeof(block_q3_K),     .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_q3_K,     .from_float_ref = (ggml_from_float_t) quantize_row_q3_K_ref, },
    [GGML_TYPE_Q4_K]    = { .type_name = "q4_K",    .blck_size = QK_K,   .type_size = sizeof(block_q4_K),     .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_q4_K,     .from_float_ref = (ggml_from_float_t) quantize_row_q4_K_ref, },
    [GGML_TYPE_Q5_K]    = { .type_name = "q5_K",    .blck_size = QK_K,   .type_size = sizeof(block_q5_K),     .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_q5_K,     .from_float_ref = (ggml_from_float_t) quantize_row_q5_K_ref, },
    [GGML_TYPE_Q6_K]    = { .type_name = "q6_K",    .blck_size = QK_K,   .type_size = sizeof(block_q6_K),     .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_q6_K,     .from_float_ref = (ggml_from_float_t) quantize_row_q6_K_ref, },
    [GGML_TYPE_IQ2_XXS] = { .type_name = "iq2_xxs", .blck_size = QK_K,   .type_size = sizeof(block_iq2_xxs),  .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_iq2_xxs,  .from_float_ref = NULL, },
    [GGML_TYPE_IQ2_XS]  = { .type_name = "iq2_xs",  .blck_size = QK_K,   .type_size = sizeof(block_iq2_xs),   .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_iq2_xs,   .from_float_ref = NULL, },
    [GGML_TYPE_IQ3_XXS] = { .type_name = "iq3_xxs", .blck_size = QK_K,   .type_size = sizeof(block_iq3_xxs),  .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_iq3_xxs,  .from_float_ref = (ggml_from_float_t) quantize_row_iq3_xxs_ref, },
    [GGML_TYPE_IQ3_S]   = { .type_name = "iq3_s",   .blck_size = QK_K,   .type_size = sizeof(block_iq3_s),    .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_iq3_s,    .from_float_ref = (ggml_from_float_t) quantize_row_iq3_s_ref, },
    [GGML_TYPE_IQ2_S]   = { .type_name = "iq2_s",   .blck_size = QK_K,   .type_size = sizeof(block_iq2_s),    .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_iq2_s,    .from_float_ref = (ggml_from_float_t) quantize_row_iq2_s_ref, },
    [GGML_TYPE_IQ1_S]   = { .type_name = "iq1_s",   .blck_size = QK_K,   .type_size = sizeof(block_iq1_s),    .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_iq1_s,    .from_float_ref = NULL, },
    [GGML_TYPE_IQ1_M]   = { .type_name = "iq1_m",   .blck_size = QK_K,   .type_size = sizeof(block_iq1_m),    .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_iq1_m,    .from_float_ref = NULL, },
    [GGML_TYPE_IQ4_NL]  = { .type_name = "iq4_nl",  .blck_size = QK4_NL, .type_size = sizeof(block_iq4_nl),   .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_iq4_nl,   .from_float_ref = (ggml_from_float_t) quantize_row_iq4_nl_ref, },
    [GGML_TYPE_IQ4_XS]  = { .type_name = "iq4_xs",  .blck_size = QK_K,   .type_size = sizeof(block_iq4_xs),   .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_iq4_xs,   .from_float_ref = (ggml_from_float_t) quantize_row_iq4_xs_ref, },
    [GGML_TYPE_Q8_K]    = { .type_name = "q8_K",    .blck_size = QK_K,   .type_size = sizeof(block_q8_K),     .is_quantized = true, },
    [GGML_TYPE_BF16]    = { .type_name = "bf16",    .blck_size = 1,      .type_size = sizeof(ggml_bf16_t),    .is_quantized = false, .to_float = (ggml_to_float_t) ggml_bf16_to_fp32_row,   .from_float_ref = (ggml_from_float_t) ggml_fp32_to_bf16_row_ref, },
    [31]                = { .type_name = "TYPE_Q4_0_4_4 REMOVED, use Q4_0 with runtime repacking",   .blck_size = 0, .type_size = 0, .is_quantized = false, }, // GGML_TYPE_Q4_0_4_4
    [32]                = { .type_name = "TYPE_Q4_0_4_8 REMOVED, use Q4_0 with runtime repacking",   .blck_size = 0, .type_size = 0, .is_quantized = false, }, // GGML_TYPE_Q4_0_4_8
    [33]                = { .type_name = "TYPE_Q4_0_8_8 REMOVED, use Q4_0 with runtime repacking",   .blck_size = 0, .type_size = 0, .is_quantized = false, }, // GGML_TYPE_Q4_0_8_8
    [GGML_TYPE_TQ1_0]   = { .type_name = "tq1_0",   .blck_size = QK_K,   .type_size = sizeof(block_tq1_0),    .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_tq1_0,    .from_float_ref = (ggml_from_float_t) quantize_row_tq1_0_ref, },
    [GGML_TYPE_TQ2_0]   = { .type_name = "tq2_0",   .blck_size = QK_K,   .type_size = sizeof(block_tq2_0),    .is_quantized = true,  .to_float = (ggml_to_float_t) dequantize_row_tq2_0,    .from_float_ref = (ggml_from_float_t) quantize_row_tq2_0_ref, },
    [36]                = { .type_name = "TYPE_IQ4_NL_4_4 REMOVED, use IQ4_NL with runtime repacking", .blck_size = 0, .type_size = 0, .is_quantized = false, }, // GGML_TYPE_IQ4_NL_4_4
    [37]                = { .type_name = "TYPE_IQ4_NL_4_8 REMOVED, use IQ4_NL with runtime repacking", .blck_size = 0, .type_size = 0, .is_quantized = false, }, // GGML_TYPE_IQ4_NL_4_8
    [38]                = { .type_name = "TYPE_IQ4_NL_8_8 REMOVED, use IQ4_NL with runtime repacking", .blck_size = 0, .type_size = 0, .is_quantized = false, }, // GGML_TYPE_IQ4_NL_8_8
};
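
The blck_size / type_size pair is what ggml uses for size accounting: a row of n weights occupies n / blck_size blocks of type_size bytes each, which is what the public helper ggml_row_size() computes. The stand-alone sketch below re-implements that arithmetic with hypothetical names (demo_type_traits, demo_row_size); the Q4_K example values (256 weights per 144-byte super-block, assuming QK_K = 256) match the 4.5 bits-per-weight figure from the table in section 1.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Minimal mirror of the fields needed for size computation
 * (a demo struct, not the full ggml_type_traits). */
struct demo_type_traits {
    const char *type_name;
    int64_t     blck_size;
    size_t      type_size;
};

/* Bytes needed for a row of ne elements: one block of type_size bytes
 * covers blck_size weights (ne is assumed to be a multiple of blck_size). */
static size_t demo_row_size(const struct demo_type_traits *t, int64_t ne) {
    return t->type_size * (size_t)(ne / t->blck_size);
}

int main(void) {
    struct demo_type_traits q4_k = { "q4_K", 256, 144 };
    const int64_t ne = 4096;  /* a typical row length */
    const size_t bytes = demo_row_size(&q4_k, ne);
    printf("%s: %zu bytes for %lld weights (%.3f bits per weight)\n",
           q4_k.type_name, bytes, (long long) ne, 8.0 * (double) bytes / (double) ne);
    return 0;
}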

/home/yongqiang/llm_work/llama_cpp_25_01_05/llama.cpp/ggml/include/ggml.h

    // NOTE: always add types at the end of the enum to keep backward compatibility
    enum ggml_type {
        GGML_TYPE_F32     = 0,
        GGML_TYPE_F16     = 1,
        GGML_TYPE_Q4_0    = 2,
        GGML_TYPE_Q4_1    = 3,
        // GGML_TYPE_Q4_2 = 4, support has been removed
        // GGML_TYPE_Q4_3 = 5, support has been removed
        GGML_TYPE_Q5_0    = 6,
        GGML_TYPE_Q5_1    = 7,
        GGML_TYPE_Q8_0    = 8,
        GGML_TYPE_Q8_1    = 9,
        GGML_TYPE_Q2_K    = 10,
        GGML_TYPE_Q3_K    = 11,
        GGML_TYPE_Q4_K    = 12,
        GGML_TYPE_Q5_K    = 13,
        GGML_TYPE_Q6_K    = 14,
        GGML_TYPE_Q8_K    = 15,
        GGML_TYPE_IQ2_XXS = 16,
        GGML_TYPE_IQ2_XS  = 17,
        GGML_TYPE_IQ3_XXS = 18,
        GGML_TYPE_IQ1_S   = 19,
        GGML_TYPE_IQ4_NL  = 20,
        GGML_TYPE_IQ3_S   = 21,
        GGML_TYPE_IQ2_S   = 22,
        GGML_TYPE_IQ4_XS  = 23,
        GGML_TYPE_I8      = 24,
        GGML_TYPE_I16     = 25,
        GGML_TYPE_I32     = 26,
        GGML_TYPE_I64     = 27,
        GGML_TYPE_F64     = 28,
        GGML_TYPE_IQ1_M   = 29,
        GGML_TYPE_BF16    = 30,
        // GGML_TYPE_Q4_0_4_4 = 31, support has been removed from gguf files
        // GGML_TYPE_Q4_0_4_8 = 32,
        // GGML_TYPE_Q4_0_8_8 = 33,
        GGML_TYPE_TQ1_0   = 34,
        GGML_TYPE_TQ2_0   = 35,
        // GGML_TYPE_IQ4_NL_4_4 = 36,
        // GGML_TYPE_IQ4_NL_4_8 = 37,
        // GGML_TYPE_IQ4_NL_8_8 = 38,
        GGML_TYPE_COUNT   = 39,
    };

    // precision
    enum ggml_prec {
        GGML_PREC_DEFAULT,
        GGML_PREC_F32,
    };

    // model file types
    enum ggml_ftype {
        GGML_FTYPE_UNKNOWN        = -1,
        GGML_FTYPE_ALL_F32        = 0,
        GGML_FTYPE_MOSTLY_F16     = 1,  // except 1d tensors
        GGML_FTYPE_MOSTLY_Q4_0    = 2,  // except 1d tensors
        GGML_FTYPE_MOSTLY_Q4_1    = 3,  // except 1d tensors
        GGML_FTYPE_MOSTLY_Q4_1_SOME_F16 = 4, // tok_embeddings.weight and output.weight are F16
        GGML_FTYPE_MOSTLY_Q8_0    = 7,  // except 1d tensors
        GGML_FTYPE_MOSTLY_Q5_0    = 8,  // except 1d tensors
        GGML_FTYPE_MOSTLY_Q5_1    = 9,  // except 1d tensors
        GGML_FTYPE_MOSTLY_Q2_K    = 10, // except 1d tensors
        GGML_FTYPE_MOSTLY_Q3_K    = 11, // except 1d tensors
        GGML_FTYPE_MOSTLY_Q4_K    = 12, // except 1d tensors
        GGML_FTYPE_MOSTLY_Q5_K    = 13, // except 1d tensors
        GGML_FTYPE_MOSTLY_Q6_K    = 14, // except 1d tensors
        GGML_FTYPE_MOSTLY_IQ2_XXS = 15, // except 1d tensors
        GGML_FTYPE_MOSTLY_IQ2_XS  = 16, // except 1d tensors
        GGML_FTYPE_MOSTLY_IQ3_XXS = 17, // except 1d tensors
        GGML_FTYPE_MOSTLY_IQ1_S   = 18, // except 1d tensors
        GGML_FTYPE_MOSTLY_IQ4_NL  = 19, // except 1d tensors
        GGML_FTYPE_MOSTLY_IQ3_S   = 20, // except 1d tensors
        GGML_FTYPE_MOSTLY_IQ2_S   = 21, // except 1d tensors
        GGML_FTYPE_MOSTLY_IQ4_XS  = 22, // except 1d tensors
        GGML_FTYPE_MOSTLY_IQ1_M   = 23, // except 1d tensors
        GGML_FTYPE_MOSTLY_BF16    = 24, // except 1d tensors
    };
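
The same information is reachable without touching the internal table: ggml.h also exposes accessors such as ggml_type_name(), ggml_blck_size(), ggml_type_size() and ggml_is_quantized(). Assuming those signatures are current, a small program linked against ggml can dump the whole enum and recompute the bits-per-weight column from section 1:

#include <stdio.h>
#include "ggml.h"

/* Print name, block size, bytes per block and bits per weight for every
 * type id in the enum, skipping removed/deprecated slots. */
int main(void) {
    for (int i = 0; i < GGML_TYPE_COUNT; ++i) {
        const enum ggml_type t = (enum ggml_type) i;
        const size_t  ts = ggml_type_size(t);
        const int64_t bs = ggml_blck_size(t);
        if (ts == 0 || bs == 0) continue;  /* DEPRECATED / REMOVED entries */
        printf("%-8s blck_size=%-4lld type_size=%-4zu %s %8.4f bits/weight\n",
               ggml_type_name(t), (long long) bs, ts,
               ggml_is_quantized(t) ? "quantized  " : "unquantized",
               8.0 * (double) ts / (double) bs);
    }
    return 0;
}

This is a quick way to check how a given GGUF tensor type will be laid out on disk for the ggml revision you actually built against.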

3. Q#_K_M and Q#_K

https://netraneupane.medium.com/hands-on-llms-quantization-a4c7ab1421c2

In the context of llama.cpp, Q4_K_M refers to a specific variant of the k-quant quantization scheme (the block-wise, super-block quantization introduced in PR #1684, often loosely described as k-means quantization). The naming convention is as follows:

  • Q stands for Quantization.
  • 4 indicates the number of bits used in the quantization process.
  • K marks the k-quant (super-block) scheme, as opposed to the legacy Q4_0/Q4_1 formats.
  • M represents the size variant of the quantized model (S = Small, M = Medium, L = Large).

Similarly, Q2_K also refers to a specific k-quant type. The naming convention is as follows:

  • Q stands for Quantization.
  • 2 indicates the number of bits used in the quantization process.
  • K marks the k-quant (super-block) scheme.

References

[1] Yongqiang Cheng, https://yongqiang.blog.csdn.net/
[2] huggingface/gguf, https://github.com/huggingface/huggingface.js/tree/main/packages/gguf
[3] llama.cpp, https://github.com/ggerganov/llama.cpp
[4] k-quants, https://github.com/ggerganov/llama.cpp/pull/1684

