structure_data_extractor.StructureDataExtractor

class structure_data_extractor.StructureDataExtractor(hdf5_path, filters, batch_size)

Manage parallel extraction of raw CSD data into an HDF5 container.

This class:
  • Loads a list of refcodes from CSV (clustered or unique).

  • Initializes or overwrites an HDF5 file with a ‘/structures’ group.

  • Batches refcodes and invokes _extract_one() in parallel.

  • Writes each structure’s raw fields into its own HDF5 subgroup.

hdf5_path

File path for the output HDF5 container.

Type:

Path

filters

The ExtractionConfig.filters dictionary, with keys such as data_directory, data_prefix, and center_molecule.

Type:

Dict[str, Any]

batch_size

Number of refcodes to process concurrently per batch.

Type:

int

reader

CCDC EntryReader instance used by _extract_one.

Type:

io.EntryReader

run : () -> None

Execute the full extraction pipeline: delete any existing HDF5 file, initialize a new one, load the refcode list, and process each batch.

_load_refcodes : () -> List[str]

Read the CSV of refcodes to extract.

_process_batch : (batch: List[str], h5: h5py.File) -> None

Extract raw data for one batch of refcodes and write it to the open HDF5 file.
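The per-structure layout described above (one subgroup per refcode under ‘/structures’) can be sketched with h5py. This is a minimal illustration, not the class's actual implementation: the helper name write_structure, the refcode "ABCDEF", and the field names cell_lengths and space_group are all hypothetical, and the real set of raw fields is not documented here.

```python
import os
import tempfile

import h5py
import numpy as np


def write_structure(h5, refcode, raw_data):
    """Write one structure's fields under /structures/<refcode>/ (hypothetical helper)."""
    group = h5["structures"].create_group(refcode)
    for name, value in raw_data.items():
        if isinstance(value, str):
            # Store text as a variable-length UTF-8 string dataset.
            group.create_dataset(name, data=value, dtype=h5py.string_dtype())
        else:
            # Store numeric fields as typed arrays.
            group.create_dataset(name, data=np.asarray(value))


def demo(path):
    """Create a file, write one made-up structure, and list its datasets."""
    with h5py.File(path, "w") as h5:
        h5.create_group("structures")
        # Field names and values below are invented for illustration.
        write_structure(
            h5,
            "ABCDEF",
            {"cell_lengths": [10.0, 11.5, 12.25], "space_group": "P21/c"},
        )
    with h5py.File(path, "r") as h5:
        return sorted(h5["structures/ABCDEF"].keys())


if __name__ == "__main__":
    print(demo(os.path.join(tempfile.mkdtemp(), "demo.h5")))
```

Opening the file with mode "w" truncates any existing file at that path, which matches the "create or overwrite" behaviour described for hdf5_path.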

__init__(hdf5_path, filters, batch_size)

Initialize a StructureDataExtractor.

Parameters:
  • hdf5_path (Union[str, Path]) – Path for the HDF5 file to create or overwrite.

  • filters (Dict[str, Any]) – ExtractionConfig.filters, containing ‘data_directory’, ‘data_prefix’, ‘center_molecule’, and other keys.

  • batch_size (int) – Number of structures to extract concurrently per batch.

Methods

__init__(hdf5_path, filters, batch_size)

Initialize a StructureDataExtractor.

run()

Perform the full raw-data extraction for all refcodes into HDF5.

run()

Perform the full raw-data extraction for all refcodes into HDF5.

This method:
  • Deletes the existing HDF5 file at hdf5_path if it exists.

  • Calls initialize_hdf5_file to create the ‘/structures’ group.

  • Loads all refcodes via _load_refcodes() and writes the ‘/refcode_list’ dataset.

  • Processes refcodes in batches of size batch_size. For each batch, it:

      • Submits each refcode to a ProcessPoolExecutor running _extract_one().

      • Collects (refcode, raw_data) tuples.

      • Writes raw fields under /structures/<refcode>/… as typed datasets.

  • Closes the HDF5 file.
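The batching step can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not the method's actual source: chunked, extract_one, and run_batches are hypothetical names, extract_one is a stand-in that returns made-up data in place of the real _extract_one(), and the HDF5 write is reduced to collecting results in a list.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Any, Dict, Iterator, List, Tuple


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def extract_one(refcode: str) -> Tuple[str, Dict[str, Any]]:
    """Hypothetical stand-in for _extract_one(): returns (refcode, raw_data)."""
    return refcode, {"n_chars": len(refcode)}


def run_batches(refcodes, batch_size, worker=extract_one,
                executor_cls=ProcessPoolExecutor):
    """Run `worker` over refcodes batch by batch and collect the results.

    ThreadPoolExecutor can be passed as executor_cls for quick debugging.
    """
    results = []
    with executor_cls() as pool:
        for batch in chunked(refcodes, batch_size):
            # map() returns results in submission order; the real class
            # would write each (refcode, raw_data) pair to HDF5 here.
            results.extend(pool.map(worker, batch))
    return results


if __name__ == "__main__":
    # The guard is needed on platforms that spawn worker processes.
    for refcode, raw in run_batches(["AAAAAA", "BBBBBB01", "CCCCCC"], 2):
        print(refcode, raw)
```

In this sketch the workers only return data and the parent process collects it, so a single process would own the output file; whether the real class follows the same division of labour is an assumption.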

Raises: