Subjects200K¶
Details¶
- Hub link: Yuanshi/Subjects200K
- Description: 200K+ high-quality composite images with paired descriptions and quality assessments. Each sample contains two side-by-side images of the same subject in different contexts.
- Caption format(s): Structured JSON with separate descriptions for each image half
- Special features: Quality assessment scores, collection tags, composite images
Dataset Structure¶
The Subjects200K dataset is unique because:
- Each image field contains two images combined side-by-side into a single wide image
- Each sample has two separate captions - one for the left image (description.description_0) and one for the right (description.description_1)
- Quality assessment metadata allows filtering by image quality metrics
- Images are pre-organized into collections
Example data structure:
{
    'image': PIL.Image,  # Combined image (e.g., 1056x528 for two 528x528 images)
    'collection': 'collection_1',
    'quality_assessment': {
        'compositeStructure': 5,
        'objectConsistency': 5,
        'imageQuality': 5
    },
    'description': {
        'item': 'Eames Lounge Chair',
        'description_0': 'The Eames Lounge Chair is placed in a modern city living room...',
        'description_1': 'Nestled in a cozy nook of a rustic cabin...',
        'category': 'Furniture',
        'description_valid': True
    }
}
Direct Usage (No Preprocessing Required)¶
Unlike datasets that require extraction and preprocessing, Subjects200K can be used directly from Hugging Face; the dataset is already properly formatted and hosted.
Simply ensure you have the datasets library installed (pip install datasets).
Dataloader Configuration¶
Since each sample contains two images, we need to configure two separate dataset entries - one for each half of the composite image:
[
    {
        "id": "subjects200k-left",
        "type": "huggingface",
        "dataset_name": "Yuanshi/Subjects200K",
        "caption_strategy": "huggingface",
        "metadata_backend": "huggingface",
        "resolution": 512,
        "resolution_type": "pixel_area",
        "cache_dir_vae": "cache/vae/subjects-left",
        "huggingface": {
            "caption_column": "description.description_0",
            "image_column": "image",
            "composite_image_config": {
                "enabled": true,
                "image_count": 2,
                "select_index": 0
            },
            "filter_func": {
                "collection": "collection_1",
                "quality_thresholds": {
                    "compositeStructure": 4.5,
                    "objectConsistency": 4.5,
                    "imageQuality": 4.5
                }
            }
        }
    },
    {
        "id": "subjects200k-right",
        "type": "huggingface",
        "dataset_name": "Yuanshi/Subjects200K",
        "caption_strategy": "huggingface",
        "metadata_backend": "huggingface",
        "resolution": 512,
        "resolution_type": "pixel_area",
        "cache_dir_vae": "cache/vae/subjects-right",
        "huggingface": {
            "caption_column": "description.description_1",
            "image_column": "image",
            "composite_image_config": {
                "enabled": true,
                "image_count": 2,
                "select_index": 1
            },
            "filter_func": {
                "collection": "collection_1",
                "quality_thresholds": {
                    "compositeStructure": 4.5,
                    "objectConsistency": 4.5,
                    "imageQuality": 4.5
                }
            }
        }
    },
    {
        "id": "text-embed-cache",
        "dataset_type": "text_embeds",
        "default": true,
        "type": "local",
        "cache_dir": "cache/text/flux"
    }
]
Configuration Explained¶
Composite Image Configuration¶
- composite_image_config.enabled: Activates composite image handling
- composite_image_config.image_count: Number of images in the composite (2 for side-by-side)
- composite_image_config.select_index: Which image to extract (0 = left, 1 = right)
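The crop implied by select_index can be sketched in a few lines of Python. This is a minimal sketch assuming the equal-width side-by-side layout described above; split_composite is a hypothetical helper for illustration, not part of any library:

```python
# Minimal sketch: crop a side-by-side composite into `image_count`
# equal-width parts; `select_index` would then pick one of them.
from PIL import Image


def split_composite(image: Image.Image, image_count: int = 2) -> list:
    """Return the individual images packed side-by-side in `image`."""
    width, height = image.size
    part_width = width // image_count
    return [
        image.crop((i * part_width, 0, (i + 1) * part_width, height))
        for i in range(image_count)
    ]


# A 1056x528 composite yields two 528x528 halves:
composite = Image.new("RGB", (1056, 528))
left, right = split_composite(composite)
```

Here "select_index": 0 corresponds to the left half and 1 to the right half.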
Quality Filtering¶
The filter_func allows you to filter samples based on:
- collection: Only use images from specific collections
- quality_thresholds: Minimum scores for quality metrics:
- compositeStructure: How well the two images work together
- objectConsistency: Consistency of the subject across both images
- imageQuality: Overall image quality
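Written as a plain Python predicate, the filtering above amounts to the following. This is a sketch: passes_filter is a hypothetical function and the actual backend implementation may differ, but the field names mirror the dataset's metadata:

```python
# Sketch of the `filter_func` semantics: keep a sample only if it belongs
# to the requested collection and meets every quality threshold.
def passes_filter(sample, collection="collection_1", thresholds=None):
    thresholds = thresholds or {
        "compositeStructure": 4.5,
        "objectConsistency": 4.5,
        "imageQuality": 4.5,
    }
    if sample.get("collection") != collection:
        return False
    qa = sample.get("quality_assessment") or {}
    return all(qa.get(metric, 0) >= minimum for metric, minimum in thresholds.items())
```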
Caption Selection¶
- Left image uses "caption_column": "description.description_0"
- Right image uses "caption_column": "description.description_1"
Customization Options¶
- Adjust quality thresholds: lower values (e.g., 4.0) include more images, higher values (e.g., 4.8) are more selective
- Use different collections: change "collection": "collection_1" to other available collections in the dataset
- Change resolution: adjust the resolution value based on your training needs
- Disable filtering: remove the filter_func section to use all images
- Use item names as captions: change the caption column to "description.item" to use just the subject name
Tips¶
- The dataset will be automatically downloaded and cached on first use
- Each "half" is treated as an independent dataset, effectively doubling your training samples
- Consider using different quality thresholds for each half if you want variety
- The VAE cache directories should be different for each half to avoid conflicts