DALLE-3¶
Details¶
- Hub link: ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions
- Description: 1 million+ DALLE-3 images combined with a small number of Midjourney and other AI-sourced images.
- Caption format(s): JSON files containing multiple styles of caption.
Required preprocessing steps¶
The DALLE-3 dataset contains files in the following structure:
We will use two scripts to prepare:
- A parquet table containing the image metadata, eg. width, height, and filename
- A .txt file for each sample containing its caption, in case loading captions from parquet is too slow
Extracting the dataset¶
- Retrieve the dataset from the hub via your chosen retrieval method, or:
huggingface-cli login
huggingface-cli download --repo-type=dataset ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions --local-dir=/home/user/training/data/dalle3
- Enter the DALLE-3 data directory and extract all entries:
cd /home/user/training/data/dalle3
mkdir data
for tar_file_path in *.tar; do
tar xf "$tar_file_path" -C data/
done
.png and .json files in the data/ subdirectory.
Compiling a parquet table¶
In the dalle3 directory, create a file named compile.py with the following contents:
import glob
import json, os
import pandas as pd
from tqdm import tqdm
# Glob for all JSON files in the folder
json_files = glob.glob('data/*.json')
data = []
# Process each JSON file
for file_path in tqdm(json_files):
with open(file_path, 'r') as file:
content = json.load(file)
#print(f"Content: {content}")
if "width" not in content:
continue
# Extract the necessary information
text_path = os.path.splitext(content['image_name'])[0] + ".txt"
width = int(content['width'])
height = int(content['height'])
caption = content['short_caption']
filename = content['image_name']
# Append to data list
data.append({'width': width, 'height': height, 'caption': caption, 'filename': filename})
# Create a DataFrame
df = pd.DataFrame(data, columns=['width', 'height', 'caption', 'filename'])
# Save the DataFrame to a Parquet file
df.to_parquet('dalle3.parquet', index=False)
print("Data has been successfully compiled and saved as 'dalle3.parquet'")
Execute the compilation script, being sure to have your python venv active:
Check that the parquet file contains the resulting rows. You may need to run pip install parquet-tools first:
This will print the first ten rows of the DALLE3 dataset. Don't worry if it takes a while, we're crunching more than 1 million rows from a columnar format.
Extracting image captions into textfiles¶
In the dalle3 directory, create a new script called extract-captions.py containing the following:
import glob
import json, os
import pandas as pd
from tqdm import tqdm
# Glob for all JSON files in the folder
json_files = glob.glob('data/*.json')
data = []
caption_field = 'short_caption'
# Process each JSON file
for file_path in tqdm(json_files, desc="Extracting text captions from JSON"):
with open(file_path, 'r') as file:
content = json.load(file)
if "width" not in content:
continue
text_path = "data/" + os.path.splitext(content['image_name'])[0] + ".txt"
# write content to text path
with open(text_path, 'w') as text_file:
text_file.write(content[caption_field])
This script will retrieve caption_field from each JSON file in the data/ subdirectory, and write that value to a .txt file with the same filename as the image.
If you wish to use a different caption field from the DALLE-3 set, update the value for caption_field before executing it.
Now, execute the script, being sure to have the venv active:
You can verify that there are the correct number of JSON files in the directory now. Beware that this may take a while to run:
You should see the correct value of 1,161,957.
Dataloader entry:¶
The following configuration entry will correctly locate the filenames and captions for your newly-created DALLE-3 dataset:
{
"id": "dalle3",
"type": "local",
"instance_data_dir": "/home/user/training/data/dalle3/data",
"resolution": 1.0,
"maximum_image_size": 2.0,
"minimum_image_size": 0.75,
"target_downsample_size": 1.75,
"resolution_type": "area",
"cache_dir_vae": "/path/to/cache/vae/",
"caption_strategy": "textfile",
"metadata_backend": "parquet",
"parquet": {
"path": "/home/user/training/data/dalle3/dalle3.parquet",
"caption_column": "caption",
"filename_column": "filename",
"width_column": "width",
"height_column": "height",
"identifier_includes_extension": true
}
},
Note: You can skip the extract-captions.py script and simply use caption_strategy=parquet if you wish to save on disk inodes.