First of all, congratulations to DeepGlint (格林深瞳) on its successful listing on the STAR Market on March 18.
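1. Introduction
The Glint360K dataset contains about 18 million images across 360K identity classes. Both the number of classes and the number of images are far larger than in the MS1MV2 dataset.
It is billed as the largest and cleanest face dataset in the world.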
Download link (my own upload): https://pan.baidu.com/s/1K3UDER9u352oNIyph-FI1w?pwd=3o3i
Extraction code: 3o3i
2. Extracting and decoding
Once the download finishes, concatenate the archive parts and extract them:
cat glint360k_* | tar -xzvf -
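If `cat` and `tar` are not available (for example on Windows), the same step can be sketched in Python. This is only a minimal sketch: `ChainedReader` and `extract_split_archive` are hypothetical names introduced here, and it assumes the parts match `glint360k_*` and sort correctly in plain lexicographic order, as the shell command above implies.

```python
import glob
import io
import tarfile


class ChainedReader(io.RawIOBase):
    """Read several files back to back as if they were one stream."""

    def __init__(self, paths):
        super().__init__()
        self._paths = list(paths)
        self._current = None

    def readable(self):
        return True

    def readinto(self, buffer):
        while True:
            if self._current is None:
                if not self._paths:
                    return 0  # no parts left: end of stream
                self._current = open(self._paths.pop(0), "rb")
            n = self._current.readinto(buffer)
            if n:
                return n
            # The current part is exhausted; move on to the next one.
            self._current.close()
            self._current = None

    def close(self):
        if self._current is not None:
            self._current.close()
            self._current = None
        super().close()


def extract_split_archive(pattern, dest):
    """Extract a gzip-compressed tar archive that was split into parts."""
    parts = sorted(glob.glob(pattern))  # lexicographic order (assumption)
    if not parts:
        raise FileNotFoundError("no archive parts match %r" % pattern)
    stream = io.BufferedReader(ChainedReader(parts))
    try:
        # "r|gz" reads the archive as a forward-only stream, like `tar -xzf -`.
        with tarfile.open(fileobj=stream, mode="r|gz") as tar:
            tar.extractall(dest)
    finally:
        stream.close()
    return parts
```

Calling `extract_split_archive("glint360k_*", ".")` from the download directory should then behave like the shell pipeline, without first writing a joined copy of the archive to disk.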
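The extracted data is in MXNet's .rec (RecordIO) format, so next we decode it into individual image files.
First, set up a small environment: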
conda create -n glint python=3.7  # the Python version is an assumption; without one the env gets no pip of its own
conda activate glint
pip install mxnet -i https://pypi.douban.com/simple
pip install opencv-python -i https://pypi.douban.com/simple
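Before moving on, it can help to confirm that both packages landed in the active environment. The check below is a small sketch; `missing_modules` is a hypothetical helper name. It only looks the modules up without importing them, so it will not catch an install that is present but broken.

```python
import importlib.util


def missing_modules(names):
    """Return the modules from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # opencv-python is imported as cv2.
    missing = missing_modules(["mxnet", "cv2"])
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("environment looks ready")
```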
Write the processing script and save it as process.py:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import os
import cv2
import mxnet as mx


def main(args):
    # Open every dataset directory passed via --include (comma-separated).
    include_datasets = args.include.split(',')
    rec_list = []
    for ds in include_datasets:
        path_imgrec = os.path.join(ds, 'train.rec')
        path_imgidx = os.path.join(ds, 'train.idx')
        imgrec = mx.recordio.MXIndexedRecordIO(path_imgidx, path_imgrec, 'r')  # pylint: disable=redefined-variable-type
        rec_list.append(imgrec)
    if not os.path.exists(args.output):
        os.makedirs(args.output)
    imgid = 0
    for ds_id in range(len(rec_list)):
        imgrec = rec_list[ds_id]
        # Record 0 is a meta record: its label holds the range of identity records.
        s = imgrec.read_idx(0)
        header, _ = mx.recordio.unpack(s)
        assert header.flag > 0
        seq_identity = range(int(header.label[0]), int(header.label[1]))
        for identity in seq_identity:
            # Each identity record holds the range of that identity's image records.
            s = imgrec.read_idx(identity)
            header, _ = mx.recordio.unpack(s)
            for _idx in range(int(header.label[0]), int(header.label[1])):
                s = imgrec.read_idx(_idx)
                _header, _img = mx.recordio.unpack(s)
                label = int(_header.label[0])
                # One folder per identity: <output>/id_<label>
                class_path = os.path.join(args.output, "id_%d" % label)
                if not os.path.exists(class_path):
                    os.makedirs(class_path)
                _img = mx.image.imdecode(_img).asnumpy()[:, :, ::-1]  # to bgr
                image_path = os.path.join(class_path, "%d_%d.jpg" % (label, imgid))
                cv2.imwrite(image_path, _img)
                imgid += 1
                if imgid % 10000 == 0:
                    print(imgid)  # progress: number of images written so far


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='do dataset merge')
    # general
    parser.add_argument('--include', default='', type=str,
                        help='comma-separated dataset directories containing train.rec and train.idx')
    parser.add_argument('--output', default='', type=str,
                        help='directory to write the decoded images to')
    args = parser.parse_args()
    main(args)
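The nested loops in the script walk what appears to be a two-level index: record 0 stores the range of identity records, and each identity record stores the range of its image records. The sketch below isolates that traversal so it can be tried on a toy index without MXNet. `iter_image_indices` and `mxnet_label_reader` are hypothetical names, and the layout is inferred from the script rather than from a format specification.

```python
def iter_image_indices(read_label):
    """Yield (identity_index, image_index) pairs from a two-level index.

    `read_label(i)` must return the header label of record i as a
    (start, end) pair, where `end` is exclusive.
    """
    id_start, id_end = (int(v) for v in read_label(0))
    for identity in range(id_start, id_end):
        img_start, img_end = (int(v) for v in read_label(identity))
        for image_index in range(img_start, img_end):
            yield identity, image_index


def mxnet_label_reader(imgrec):
    """Adapt an mx.recordio.MXIndexedRecordIO object to the read_label interface."""
    import mxnet as mx  # imported lazily so the traversal itself needs no MXNet

    def read_label(i):
        header, _ = mx.recordio.unpack(imgrec.read_idx(i))
        return header.label[0], header.label[1]

    return read_label


if __name__ == "__main__":
    # Toy index: records 1-3 are images, records 4-5 are identities.
    toy_index = {0: (4, 6), 4: (1, 3), 5: (3, 4)}
    print(list(iter_image_indices(toy_index.__getitem__)))
    # prints [(4, 1), (4, 2), (5, 3)]
```

With a real dataset, `iter_image_indices(mxnet_label_reader(imgrec))` should visit image records in the same order as the script, which makes it easy to count images or decode only a few identities.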
Run it:
python process.py --include=/glint360k/glint360k --output=/glint360k/output
This creates one folder per identity (named id_<label>), and each folder contains only photos of the same person.
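A quick sanity check on the result is to count folders and images. The sketch below assumes the `id_<label>` folder names and `.jpg` extension produced by the script above; `count_images_per_identity` is a hypothetical helper name.

```python
import os


def count_images_per_identity(output_dir):
    """Map each id_* folder name to the number of .jpg files it contains."""
    counts = {}
    for entry in os.scandir(output_dir):
        if entry.is_dir() and entry.name.startswith("id_"):
            counts[entry.name] = sum(
                1 for name in os.listdir(entry.path) if name.endswith(".jpg")
            )
    return counts
```

For the run above, `counts = count_images_per_identity("/glint360k/output")` followed by `len(counts)` and `sum(counts.values())` gives the number of identities and images, which can be compared against the figures quoted in the introduction.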