structure_data_extractor.StructureDataExtractor
- class structure_data_extractor.StructureDataExtractor(hdf5_path, filters, batch_size)[source]
Manage parallel extraction of raw CSD data into an HDF5 container.
This class:
- Loads a list of refcodes from CSV (clustered or unique).
- Initializes or overwrites an HDF5 file with a ‘/structures’ group.
- Batches refcodes and invokes _extract_one() in parallel.
- Writes each structure’s raw fields into its own HDF5 subgroup.
- hdf5_path
File path for the output HDF5 container.
- Type:
Path
- filters
ExtractionConfig.filters dictionary (e.g., data_directory, data_prefix, center_molecule, etc.).
- Type:
Dict[str, Any]
- reader
CCDC EntryReader instance used by _extract_one.
- Type:
io.EntryReader
- run : () -> None
Execute the full extraction pipeline: overwrite HDF5, initialize, load refcode list, and process each batch.
- _load_refcodes : () -> List[str]
Read the CSV of refcodes to extract.
- _process_batch : (batch: List[str], h5: h5py.File) -> None
Extract raw data for every refcode in one batch and write the results to HDF5.
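The CSV-loading step attributed to _load_refcodes() might look roughly like the sketch below. It is not the class's actual implementation: the function name load_refcodes, the column name "refcode", and the choice to strip whitespace and skip blank cells are all assumptions made for illustration, since the page does not document the CSV layout.

```python
import csv
import io
from typing import List, TextIO


def load_refcodes(handle: TextIO, column: str = "refcode") -> List[str]:
    """Return the refcodes found in one CSV column, in file order.

    Hypothetical helper: the real CSV schema is not documented here,
    so the column name is an assumption.
    """
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"CSV has no {column!r} column")

    refcodes: List[str] = []
    for row in reader:
        # Short rows yield None for missing cells; treat them as blank.
        code = (row[column] or "").strip()
        if code:
            refcodes.append(code)
    return refcodes


# In-memory sample with placeholder refcodes; the last row has a blank cell.
sample = io.StringIO("refcode,cluster\nABACOX,1\n ABEBUF ,1\n,2\n")
codes = load_refcodes(sample)
```

When reading from disk, opening the file with open(path, newline="") raises FileNotFoundError if the CSV is missing, which matches the exception documented for run().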
Methods
__init__(hdf5_path, filters, batch_size) – Initialize a StructureDataExtractor.
run() – Perform the full raw-data extraction for all refcodes into HDF5.
- run()[source]
Perform the full raw-data extraction for all refcodes into HDF5.
This method:
- Deletes the existing HDF5 file at hdf5_path if it exists.
- Calls initialize_hdf5_file to create the ‘/structures’ group.
- Loads all refcodes via _load_refcodes() and writes the ‘/refcode_list’ dataset.
- Processes refcodes in batches of size batch_size:
  - Submits each refcode to ProcessPoolExecutor running _extract_one().
  - Collects (refcode, raw_data) tuples.
  - Writes raw fields under /structures/<refcode>/… as typed datasets.
- Closes the HDF5 file.
- Raises:
FileNotFoundError – If the refcode list CSV is missing.
Exception – If any CCDC call or HDF5 write operation fails.
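The batch loop described for run() can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the class's own code: iter_batches, process_batch and fake_extract_one are hypothetical names, fake_extract_one stands in for _extract_one() (which needs the CCDC API), and the HDF5 write step is left out. The default executor is ProcessPoolExecutor, as in the description above, but the demonstration swaps in ThreadPoolExecutor so the snippet runs in any interpreter session without pickling concerns.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterator, List, Tuple

RawResult = Tuple[str, Dict[str, Any]]


def iter_batches(refcodes: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size refcodes."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(refcodes), batch_size):
        yield refcodes[start:start + batch_size]


def fake_extract_one(refcode: str) -> RawResult:
    """Hypothetical stand-in for _extract_one(): returns (refcode, raw_data)."""
    return refcode, {"n_chars": len(refcode)}


def process_batch(
    batch: List[str],
    extract_one: Callable[[str], RawResult],
    executor_factory: Callable[[], Executor] = ProcessPoolExecutor,
) -> List[RawResult]:
    """Run extract_one over one batch and collect the result tuples."""
    with executor_factory() as pool:
        # Executor.map keeps input order and re-raises worker exceptions
        # when the results are consumed.
        return list(pool.map(extract_one, batch))


# Placeholder refcodes, processed two at a time with a thread pool.
refcodes = ["ABACOX", "ABEBUF", "ACSALA", "BENZEN", "UREAXX"]
batches = list(iter_batches(refcodes, batch_size=2))
results = [
    item
    for batch in batches
    for item in process_batch(batch, fake_extract_one, ThreadPoolExecutor)
]
```

Passing the executor as a factory keeps the batching logic independent of the worker type. With a real ProcessPoolExecutor, the worker function must be importable at module level so it can be pickled.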