Parsing mmCIF files with Biopython: assemblies, chains, sequences, atomic coordinates, and transformation matrices

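Parsing an mmCIF (.cif) file with Biopython lets you extract information about a protein structure: its models, chains, sequences, atomic coordinates, and any transformation matrices the file contains. The complete example below uses Biopython's MMCIFParser to read the structure and MMCIF2Dict to read the transformation matrices.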

Example code

from Bio.PDB import MMCIFParser
from Bio.PDB.MMCIF2Dict import MMCIF2Dict
from Bio.SeqUtils import seq1
import numpy as np


# Parse the structure in an mmCIF file
def parse_mmcif(file_path):
    parser = MMCIFParser(QUIET=True)
    structure = parser.get_structure('structure', file_path)
    models_data = []
    for model in structure:
        model_data = {'model_id': model.id, 'chains': []}
        for chain in model:
            chain_data = {'chain_id': chain.id, 'residues': [], 'atoms': []}
            for residue in chain:
                if residue.id[0] == ' ':  # standard residues only (no hetero groups or waters)
                    # Convert the three-letter code to a one-letter code.
                    # seq1 returns 'X' for codes it does not recognize.
                    chain_data['residues'].append(seq1(residue.resname))
                    for atom in residue:
                        # Extract the atomic coordinates
                        atom_data = {'atom_name': atom.name, 'coord': atom.coord}
                        chain_data['atoms'].append(atom_data)
            model_data['chains'].append(chain_data)
        models_data.append(model_data)
    return models_data


# Parse the transformation matrices
def parse_transformation_matrices(file_path):
    # Read the mmCIF file into a dictionary with MMCIF2Dict
    cif_dict = MMCIF2Dict(file_path)
    matrices = []
    # If the file contains transformation matrices, read them from _pdbx_struct_oper_list
    if '_pdbx_struct_oper_list.matrix[1][1]' in cif_dict:
        n_matrices = len(cif_dict['_pdbx_struct_oper_list.matrix[1][1]'])  # number of operators
        for i in range(n_matrices):
            # Collect the 3x3 matrix elements row by row
            matrix = np.array([
                [float(cif_dict[f'_pdbx_struct_oper_list.matrix[{row}][{col}]'][i])
                 for col in (1, 2, 3)]
                for row in (1, 2, 3)
            ])
            # Collect the translation vector
            translation = np.array([
                float(cif_dict[f'_pdbx_struct_oper_list.vector[{row}]'][i])
                for row in (1, 2, 3)
            ])
            matrices.append({'matrix': matrix, 'translation': translation})
    return matrices


# Example usage
file_path = '/path/to/.cif/file'

# Parse the structure information
structure_data = parse_mmcif(file_path)
for model in structure_data:
    print(f"Model ID: {model['model_id']}")
    for chain in model['chains']:
        print(f"  Chain ID: {chain['chain_id']}")
        print(f"  Sequence: {''.join(chain['residues'])}")
        for atom in chain['atoms'][:5]:  # print the first 5 atoms
            print(f"    Atom: {atom['atom_name']}, Coord: {atom['coord']}")

# Parse the transformation matrices
transformation_matrices = parse_transformation_matrices(file_path)
for i, op in enumerate(transformation_matrices):
    print(f"Transformation Matrix {i+1}:\n{op['matrix']}")
    print(f"Translation: {op['translation']}")

Code notes:

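Structure parsing:
- MMCIFParser parses the mmCIF file. The code iterates over every model and chain and extracts the residue sequence and the atomic coordinates. The sequence is built from the residues that have coordinates in the file, so unresolved residues do not appear in it.
- The seq1 function from the Bio.SeqUtils module converts three-letter amino acid codes to one-letter codes: seq1(residue.resname).
- The coordinates are obtained by iterating over the atoms of each residue and reading atom.coord.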
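
The per-atom dictionaries collected during structure parsing are easy to print but awkward to compute with. As a minimal sketch, the snippet below stacks them into a single NumPy array and computes the centroid of a chain. The helper name chain_coords and the example record are hypothetical; the record only imitates the shape of one entry in model_data['chains'] and does not come from a real file.

```python
import numpy as np


def chain_coords(chain_data):
    """Stack the per-atom coordinates of one chain into an (N, 3) array."""
    coords = [atom['coord'] for atom in chain_data['atoms']]
    if not coords:
        # Keep a consistent 2-D shape even for a chain without atoms
        return np.empty((0, 3), dtype=float)
    return np.asarray(coords, dtype=float)


# Hypothetical chain record with the same layout as the entries built by parse_mmcif
example_chain = {
    'chain_id': 'A',
    'residues': ['G'],
    'atoms': [
        {'atom_name': 'N', 'coord': np.array([0.0, 0.0, 0.0])},
        {'atom_name': 'CA', 'coord': np.array([1.0, 2.0, 3.0])},
        {'atom_name': 'C', 'coord': np.array([2.0, 4.0, 6.0])},
    ],
}

coords = chain_coords(example_chain)
centroid = coords.mean(axis=0)  # geometric center of the chain
print(coords.shape)
print(centroid)
```

With real data you would call chain_coords on each entry of model['chains'] instead of the hand-written record.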
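Transformation matrix parsing:
- The file is read into a dictionary with MMCIF2Dict, and the matrices are looked up by the _pdbx_struct_oper_list.matrix[i][j] keys. Each operator is given as a 3x3 matrix, and the code also reads its translation vector from the _pdbx_struct_oper_list.vector[i] keys.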
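
Once an operator has been read, a coordinate x is transformed as x' = R x + t, where R is the 3x3 matrix and t the translation vector. The sketch below shows one way to do this with NumPy. The helper names build_operator and apply_operator are hypothetical, and mock_cif_dict is a hand-written stand-in for the dictionary returned by MMCIF2Dict (it assumes, as in the code above, that each key maps to a list of strings with one entry per operator).

```python
import numpy as np

PREFIX = '_pdbx_struct_oper_list'


def build_operator(cif_dict, i):
    """Return (rotation, translation) for the i-th operator in the dictionary."""
    rotation = np.array([
        [float(cif_dict[f'{PREFIX}.matrix[{row}][{col}]'][i]) for col in (1, 2, 3)]
        for row in (1, 2, 3)
    ])
    translation = np.array([
        float(cif_dict[f'{PREFIX}.vector[{row}]'][i]) for row in (1, 2, 3)
    ])
    return rotation, translation


def apply_operator(coords, rotation, translation):
    """Apply x' = R x + t to every row of an (N, 3) coordinate array."""
    coords = np.asarray(coords, dtype=float)
    # Row vectors are multiplied by the transpose of R
    return coords @ rotation.T + translation


# Hand-written stand-in for MMCIF2Dict output with two operators:
# index 0 is the identity, index 1 is a 90-degree rotation about z plus a shift of 10 along x
mock_cif_dict = {
    f'{PREFIX}.matrix[1][1]': ['1.0', '0.0'],
    f'{PREFIX}.matrix[1][2]': ['0.0', '-1.0'],
    f'{PREFIX}.matrix[1][3]': ['0.0', '0.0'],
    f'{PREFIX}.matrix[2][1]': ['0.0', '1.0'],
    f'{PREFIX}.matrix[2][2]': ['1.0', '0.0'],
    f'{PREFIX}.matrix[2][3]': ['0.0', '0.0'],
    f'{PREFIX}.matrix[3][1]': ['0.0', '0.0'],
    f'{PREFIX}.matrix[3][2]': ['0.0', '0.0'],
    f'{PREFIX}.matrix[3][3]': ['1.0', '1.0'],
    f'{PREFIX}.vector[1]': ['0.0', '10.0'],
    f'{PREFIX}.vector[2]': ['0.0', '0.0'],
    f'{PREFIX}.vector[3]': ['0.0', '0.0'],
}

points = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
rotation, translation = build_operator(mock_cif_dict, 1)
moved = apply_operator(points, rotation, translation)
print(moved)
```

To build a full assembly you would also need to know which operators apply to which chains. In mmCIF files this is typically recorded in the _pdbx_struct_assembly_gen category, which the example code in this post does not read.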