crystal_analyzer.CrystalAnalyzer

class crystal_analyzer.CrystalAnalyzer(extraction_config, csd_ops_cls=<class 'csd_operations.CSDOperations'>, extractor_cls=<class 'structure_data_extractor.StructureDataExtractor'>)[source]

Orchestrates the end-to-end extraction and processing pipeline for molecular-crystal data from the CSD.

extraction_config

Controls which extraction substeps to run, batch sizes, file paths, and CSD filtering criteria.

Type:

ExtractionConfig

csd_ops

Handles direct interactions with the CSD (refcode families, downloads, etc.).

Type:

CSDOperations

extractor

Performs detailed per-structure data extraction and parsing into HDF5.

Type:

StructureDataExtractor

data_dir

Directory where intermediate and output data (CSV, HDF5) are stored.

Type:

pathlib.Path

__init__(extraction_config, csd_ops_cls=<class 'csd_operations.CSDOperations'>, extractor_cls=<class 'structure_data_extractor.StructureDataExtractor'>)[source]

Initialize the CrystalAnalyzer pipeline with specified configurations.

Parameters:
  • extraction_config (ExtractionConfig) – Configuration object controlling which extraction substeps to run, batch sizes, file paths, and CSD filtering criteria.

  • csd_ops_cls (Type[CSDOperations], optional) – Class implementing CSD operations. Default is CSDOperations.

  • extractor_cls (Type[StructureDataExtractor], optional) – Class for extracting structure-specific data. Default is StructureDataExtractor.

Raises:
  • RuntimeError – If any batched computation fails (e.g., OOM) or if shape mismatches occur when writing back to HDF5.

  • IOError – If appending to or reading from the HDF5 file fails.

Notes

Data flows:

raw_HDF5 -> (load into torch tensors on CPU) -> run batch computations
         -> append new datasets to HDF5 -> log memory utilization and
         batch progress
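
The csd_ops_cls and extractor_cls parameters make the two collaborators replaceable, which can be useful for exercising the pipeline in tests without CSD access. The following is a minimal sketch of that pattern, not the actual implementation: AnalyzerSketch, FakeCSDOperations, FakeExtractor, and the data_directory config field are all hypothetical names, and the sketch assumes the constructor instantiates each injected class with the config object, which this page does not confirm.

```python
from pathlib import Path
from types import SimpleNamespace


class FakeCSDOperations:
    """Hypothetical stand-in that stores its config instead of contacting the CSD."""

    def __init__(self, config):
        self.config = config


class FakeExtractor:
    """Hypothetical stand-in for the per-structure extractor."""

    def __init__(self, config):
        self.config = config


class AnalyzerSketch:
    """Illustrative mirror of the documented constructor signature only."""

    def __init__(self, extraction_config,
                 csd_ops_cls=FakeCSDOperations,
                 extractor_cls=FakeExtractor):
        self.extraction_config = extraction_config
        # Assumption: the config exposes the output directory as `data_directory`.
        self.data_dir = Path(extraction_config.data_directory)
        # Assumption: each injected class is instantiated with the config.
        self.csd_ops = csd_ops_cls(extraction_config)
        self.extractor = extractor_cls(extraction_config)


# Build the sketch with explicit stand-ins, as a test might do.
config = SimpleNamespace(data_directory="data", actions={})
analyzer = AnalyzerSketch(config,
                          csd_ops_cls=FakeCSDOperations,
                          extractor_cls=FakeExtractor)
```

Passing classes rather than instances lets the analyzer decide when and how to construct its collaborators, while still letting callers swap in alternatives.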

Methods

__init__(extraction_config[, csd_ops_cls, ...])

Initialize the CrystalAnalyzer pipeline with specified configurations.

extract_data()

Execute all data-extraction substeps specified by extraction_config.actions.

extract_data()[source]

Execute all data-extraction substeps specified by extraction_config.actions.

The sequence of substeps is:

  1. _extract_refcode_families (if actions.get("get_refcode_families") is True)
  2. _cluster_refcode_families (if actions.get("cluster_refcode_families") is True)
  3. _extract_unique_structures (if actions.get("get_unique_structures") is True)
  4. _extract_structure_data (if actions.get("get_structure_data") is True)
  5. _post_extraction_process (if actions.get("post_extraction_process") is True)

During each substep, corresponding CSV/HDF5 files are generated (refcode lists, clustered families, per-structure atom lists, fragment datasets, etc.). The elapsed time for the entire pipeline is logged at INFO level.

Raises:
  • Exception – If any substep fails (e.g., network error fetching from CSD, parsing error).
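
The flag-gated ordering described for extract_data() can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: run_enabled_substeps, the handlers mapping, and the logger name are hypothetical, and the sketch treats a missing or falsy flag as "skip". Only the action keys, the substep names, their order, and the INFO-level timing message come from the description above.

```python
import logging
import time

logger = logging.getLogger("crystal_analyzer_sketch")  # hypothetical logger name

# (action key, substep name) pairs in the documented order.
SUBSTEPS = [
    ("get_refcode_families", "_extract_refcode_families"),
    ("cluster_refcode_families", "_cluster_refcode_families"),
    ("get_unique_structures", "_extract_unique_structures"),
    ("get_structure_data", "_extract_structure_data"),
    ("post_extraction_process", "_post_extraction_process"),
]


def run_enabled_substeps(actions, handlers):
    """Call each enabled handler in order and return the names that ran.

    Exceptions raised by a handler are not caught, so a failing substep
    stops the pipeline and propagates to the caller.
    """
    start = time.perf_counter()
    executed = []
    for key, name in SUBSTEPS:
        # A missing key yields None from .get(), so the substep is skipped.
        if actions.get(key):
            handlers[name]()
            executed.append(name)
    logger.info("Pipeline finished in %.2f s", time.perf_counter() - start)
    return executed


# Usage: enable two substeps and record which handlers are called.
actions = {"get_refcode_families": True, "get_structure_data": True}
calls = []
handlers = {name: (lambda n=name: calls.append(n)) for _, name in SUBSTEPS}
executed = run_enabled_substeps(actions, handlers)
```

Because the order lives in a single list, disabling an early substep (for example, when its CSV output already exists from a previous run) does not change the relative order of the remaining ones.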