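1 Hugging Face datasets
First install the datasets library:
pip install datasets
Taking the COCO dataset as an example: search for "coco" on the Hugging Face Hub, then use "Use this dataset" or "Clone repository" on the right side of the dataset page to get the data.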
https://huggingface.co/datasets/phiyodr/coco2017

Hugging Face Dataset format
from datasets import load_dataset

ds = load_dataset("phiyodr/coco2017")
print(ds)
Generating train split: 100%|██████████| 118287/118287 [00:00<00:00, 1413307.31 examples/s]
Generating validation split: 100%|██████████| 5000/5000 [00:00<00:00, 1229064.06 examples/s]
DatasetDict({
    train: Dataset({
        features: ['license', 'file_name', 'coco_url', 'height', 'width', 'date_captured', 'flickr_url', 'image_id', 'ids', 'captions'],
        num_rows: 118287
    })
    validation: Dataset({
        features: ['license', 'file_name', 'coco_url', 'height', 'width', 'date_captured', 'flickr_url', 'image_id', 'ids', 'captions'],
        num_rows: 5000
    })
})
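The datasets library stores the loaded data in Arrow (.arrow) format in its local cache, while the repository at the address above hosts Parquet files.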
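Each split of the loaded dataset can be read record by record. The following is a minimal sketch, assuming the `phiyodr/coco2017` repository and the column names printed above; the helper names `peek_split` and `describe` are hypothetical, and `streaming=True` is used only so that a few rows can be read without downloading the whole split first.

```python
from itertools import islice


def peek_split(repo_id="phiyodr/coco2017", split="validation", n=2):
    """Return the first n records of one split as plain dicts (needs network access)."""
    # Imported here so that defining the helper does not itself require the library.
    from datasets import load_dataset

    # streaming=True yields records one at a time instead of materializing the split.
    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(islice(stream, n))


def describe(record, keys=("file_name", "height", "width", "captions")):
    """Keep only the listed columns of a record; columns that are absent are skipped."""
    return {k: record[k] for k in keys if k in record}
```

A typical call would be `for row in peek_split(): print(describe(row))`. The default `keys` are taken from the feature list shown in the printed output.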
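pandas DataFrame format
import pandas as pd

# Parquet files for each split inside the dataset repository
splits = {
    "train": "data/train-00000-of-00001-0084e041f1902997.parquet",
    "validation": "data/validation-00000-of-00001-e3c37e369512a3aa.parquet",
}
# Reading hf:// paths requires the huggingface_hub package to be installed
df = pd.read_parquet("hf://datasets/phiyodr/coco2017/" + splits["train"])
print(df)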
Download with git
Click "Clone repository" on the right side of the dataset page.
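The clone step can also be scripted. The sketch below assumes the clone URL follows the same pattern as the dataset page URL given earlier, and that git is available on the machine; `clone_command` and `clone_dataset` are hypothetical helper names. Large files on the Hub are usually tracked with Git LFS, so `git lfs install` may be needed beforehand.

```python
import shutil
import subprocess


def clone_command(repo_id, target_dir=None):
    """Build the git command that clones a Hugging Face dataset repository."""
    # Assumption: the clone URL mirrors the dataset page URL.
    cmd = ["git", "clone", f"https://huggingface.co/datasets/{repo_id}"]
    if target_dir is not None:
        cmd.append(target_dir)
    return cmd


def clone_dataset(repo_id, target_dir=None):
    """Run the clone; raises if git is missing or the command fails."""
    if shutil.which("git") is None:
        raise RuntimeError("git is not installed or not on PATH")
    subprocess.run(clone_command(repo_id, target_dir), check=True)
```

For example, `clone_dataset("phiyodr/coco2017", "coco2017")` would clone the repository into a local `coco2017` directory.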
2 Kaggle datasets
Find Open Datasets and Machine Learning Projects | Kaggle
https://www.kaggle.com/datasets
3 Download with Xunlei (Thunder)
You need to find the dataset download URLs yourself.
COCO
COCO2017 training data: http://images.cocodataset.org/zips/train2017.zip
http://images.cocodataset.org/annotations/annotations_trainval2017.zip
COCO2017 validation data: http://images.cocodataset.org/zips/val2017.zip
http://images.cocodataset.org/annotations/stuff_annotations_trainval2017.zip
COCO2017 test data: http://images.cocodataset.org/zips/test2017.zip
http://images.cocodataset.org/annotations/image_info_test2017.zip
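If no download manager is at hand, the same URLs can be fetched with a short script. This is a minimal sketch using only the Python standard library; the function names `filename_from_url` and `download` are hypothetical, and the archives are large (several gigabytes), so resuming and checksum verification, which this sketch omits, may matter in practice.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url):
    """Use the last path component of the URL as the local file name."""
    return os.path.basename(urlparse(url).path)


def download(url, dest_dir="."):
    """Download one file into dest_dir and return its path.

    A file that already exists is not downloaded again.
    """
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, filename_from_url(url))
    if os.path.exists(dest):
        return dest
    # Write to a temporary name first so an interrupted run
    # never leaves a truncated file under the final name.
    tmp = dest + ".part"
    with urllib.request.urlopen(url) as resp, open(tmp, "wb") as out:
        shutil.copyfileobj(resp, out, length=1 << 20)  # copy in 1 MiB chunks
    os.replace(tmp, dest)
    return dest
```

For example, `download("http://images.cocodataset.org/zips/val2017.zip", "coco")` would save `coco/val2017.zip`.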
4 Introduction to classic datasets
WIT
Hugging Face
https://huggingface.co/datasets/google/wit
GitHub
GitHub - google-research-datasets/wit: WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.
https://github.com/google-research-datasets/wit
Dataset paper
https://arxiv.org/pdf/2103.01913