crystal_analyzer.CrystalAnalyzer
- class crystal_analyzer.CrystalAnalyzer(extraction_config, csd_ops_cls=<class 'csd_operations.CSDOperations'>, extractor_cls=<class 'structure_data_extractor.StructureDataExtractor'>)
Orchestrates the end-to-end extraction and processing pipeline for molecular-crystal data from the CSD.
- extraction_config
Controls which extraction substeps to run, batch sizes, file paths, and CSD filtering criteria.
- Type: ExtractionConfig
- csd_ops
Handles direct interactions with the CSD (refcode families, downloads, etc.).
- Type: CSDOperations
- extractor
Performs detailed per-structure data extraction and parsing into HDF5.
- Type: StructureDataExtractor
- data_dir
Directory where intermediate and output data (CSV, HDF5) are stored.
- Type:
- __init__(extraction_config, csd_ops_cls=<class 'csd_operations.CSDOperations'>, extractor_cls=<class 'structure_data_extractor.StructureDataExtractor'>)
Initialize the CrystalAnalyzer pipeline with specified configurations.
- Parameters:
extraction_config (ExtractionConfig) – Configuration object controlling which extraction substeps to run, batch sizes, file paths, and CSD filtering criteria.
csd_ops_cls (Type[CSDOperations], optional) – Class implementing CSD operations. Default is CSDOperations.
extractor_cls (Type[StructureDataExtractor], optional) – Class for extracting structure-specific data. Default is StructureDataExtractor.
- Raises:
RuntimeError – If any batched computation fails (e.g., OOM) or if shape mismatches occur when writing back to HDF5.
IOError – If appending to or reading from the HDF5 file fails.
Notes
Data flows:
raw_HDF5 -> (load into torch tensors on CPU) -> run batch computations -> append new datasets to HDF5 -> log memory utilization and batch progress
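The csd_ops_cls and extractor_cls parameters make it possible to swap in lightweight stand-ins, for example in unit tests that must not contact the CSD. The sketch below illustrates that injection pattern only. Every class in it is a hypothetical stand-in, the data_directory field is an assumed name, and how the real constructor instantiates its collaborators is not shown on this page.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ExtractionConfig:
    """Hypothetical stand-in; the real ExtractionConfig fields are not documented here."""
    data_directory: str = "data"          # assumed field name
    actions: dict = field(default_factory=dict)


class FakeCSDOperations:
    """Test double that stores the config instead of talking to the CSD."""
    def __init__(self, config):
        self.config = config


class FakeStructureDataExtractor:
    """Test double for the per-structure extractor."""
    def __init__(self, config):
        self.config = config


class CrystalAnalyzerSketch:
    """Mirrors only the documented constructor signature, not the real class."""
    def __init__(self, extraction_config,
                 csd_ops_cls=FakeCSDOperations,
                 extractor_cls=FakeStructureDataExtractor):
        self.extraction_config = extraction_config
        # Assumption: data_dir is derived from the configuration object.
        self.data_dir = Path(extraction_config.data_directory)
        # Assumption: collaborators are built by calling the injected classes.
        self.csd_ops = csd_ops_cls(extraction_config)
        self.extractor = extractor_cls(extraction_config)


config = ExtractionConfig(data_directory="out")
analyzer = CrystalAnalyzerSketch(
    config,
    csd_ops_cls=FakeCSDOperations,
    extractor_cls=FakeStructureDataExtractor,
)
```

Passing classes rather than instances lets the analyzer decide when and with which arguments the collaborators are created, while callers still control which implementation is used.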
Methods
__init__(extraction_config[, csd_ops_cls, ...]) – Initialize the CrystalAnalyzer pipeline with specified configurations.
extract_data() – Execute all data-extraction substeps specified by extraction_config.actions.
- extract_data()
Execute all data-extraction substeps specified by extraction_config.actions.
The sequence of substeps is:
1. _extract_refcode_families (if actions.get("get_refcode_families") is True)
2. _cluster_refcode_families (if actions.get("cluster_refcode_families") is True)
3. _extract_unique_structures (if actions.get("get_unique_structures") is True)
4. _extract_structure_data (if actions.get("get_structure_data") is True)
5. _post_extraction_process (if actions.get("post_extraction_process") is True)
During each substep, corresponding CSV/HDF5 files are generated (refcode lists, clustered families, per-structure atom lists, fragment datasets, etc.). The elapsed time for the entire pipeline is logged at INFO level.
- Raises:
Exception – If any substep fails (e.g., network error fetching from CSD, parsing error).
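The action-gated sequencing described for extract_data() can be sketched as a small dispatch loop. This is a minimal illustration, not the library's implementation: PipelineSketch and _run_substep are hypothetical names, the substeps only record that they ran, and only the action keys, method names, and their order are taken from the description above.

```python
import logging
import time

logger = logging.getLogger("crystal_analyzer_sketch")

# (action key, substep method name) pairs in the documented order.
SUBSTEPS = [
    ("get_refcode_families", "_extract_refcode_families"),
    ("cluster_refcode_families", "_cluster_refcode_families"),
    ("get_unique_structures", "_extract_unique_structures"),
    ("get_structure_data", "_extract_structure_data"),
    ("post_extraction_process", "_post_extraction_process"),
]


class PipelineSketch:
    """Hypothetical stand-in that records which substeps would run."""

    def __init__(self, actions):
        self.actions = actions
        self.executed = []

    def _run_substep(self, method_name):
        # Placeholder for the real work (CSD queries, CSV/HDF5 writes).
        self.executed.append(method_name)

    def extract_data(self):
        start = time.perf_counter()
        for key, method_name in SUBSTEPS:
            # Missing keys are treated as disabled because dict.get returns None.
            if self.actions.get(key):
                self._run_substep(method_name)
        logger.info("Pipeline finished in %.2f s", time.perf_counter() - start)
        return self.executed


pipeline = PipelineSketch({"get_refcode_families": True, "get_structure_data": True})
print(pipeline.extract_data())
# prints ['_extract_refcode_families', '_extract_structure_data']
```

Keeping the order in a single list means the substeps always run in the documented sequence, regardless of the order in which keys appear in the actions mapping.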