crystal_analyzer module
Module: crystal_analyzer.py
Main orchestration logic for extracting and processing molecular-crystal data from the Cambridge Structural Database (CSD).
This module defines the CrystalAnalyzer class, which orchestrates the end-to-end pipeline for:
- Extraction of refcode families
- Clustering of structures
- Extraction of structure-specific data
- Post-extraction processing (e.g., computing fragment properties)
Dependencies
- pandas
- torch
- csa_config
- csd_operations
- structure_data_extractor
- structure_post_extraction_processor
- class crystal_analyzer.CrystalAnalyzer(extraction_config, csd_ops_cls=<class 'csd_operations.CSDOperations'>, extractor_cls=<class 'structure_data_extractor.StructureDataExtractor'>)[source]
Bases: object
Orchestrates the end-to-end extraction and processing pipeline for molecular-crystal data from the CSD.
- extraction_config
Controls which extraction substeps to run, batch sizes, file paths, and CSD filtering criteria.
- Type: ExtractionConfig
- csd_ops
Handles direct interactions with the CSD (refcode families, downloads, etc.).
- Type: CSDOperations
- extractor
Performs detailed per-structure data extraction and parsing into HDF5.
- Type: StructureDataExtractor
- data_dir
Directory where intermediate and output data (CSV, HDF5) are stored.
- Type: pathlib.Path
- __init__(extraction_config, csd_ops_cls=<class 'csd_operations.CSDOperations'>, extractor_cls=<class 'structure_data_extractor.StructureDataExtractor'>)[source]
Initialize the CrystalAnalyzer pipeline with specified configurations.
- Parameters:
extraction_config (ExtractionConfig) – Configuration object controlling which extraction substeps to run, batch sizes, file paths, and CSD filtering criteria.
csd_ops_cls (Type[CSDOperations], optional) – Class implementing CSD operations. Default is CSDOperations.
extractor_cls (Type[StructureDataExtractor], optional) – Class for extracting structure-specific data. Default is StructureDataExtractor.
- Raises:
RuntimeError – If any batched computation fails (e.g., OOM) or if shape mismatches occur when writing back to HDF5.
IOError – If appending to or reading from the HDF5 file fails.
Notes
Data flows:
raw_HDF5 -> (load into torch tensors on CPU) -> run batch computations -> append new datasets to HDF5 -> log memory utilization and batch progress
- extract_data()[source]
Execute all data-extraction substeps specified by extraction_config.actions.
The sequence of substeps is:
1. _extract_refcode_families (if actions.get("get_refcode_families") is True)
2. _cluster_refcode_families (if actions.get("cluster_refcode_families") is True)
3. _extract_unique_structures (if actions.get("get_unique_structures") is True)
4. _extract_structure_data (if actions.get("get_structure_data") is True)
5. _post_extraction_process (if actions.get("post_extraction_process") is True)
During each substep, corresponding CSV/HDF5 files are generated (refcode lists, clustered families, per-structure atom lists, fragment datasets, etc.). The elapsed time for the entire pipeline is logged at INFO level.
- Raises:
Exception – If any substep fails (e.g., network error fetching from CSD, parsing error).
CrystalAnalyzer Class
The main orchestration class for the CSA pipeline. This class coordinates all five stages of the analysis workflow and manages the flow of data between components.
Key Responsibilities:
Pipeline orchestration and stage management
Configuration validation and setup
Resource management and cleanup
Error handling and recovery
Progress monitoring and logging
Usage Example:
    from crystal_analyzer import CrystalAnalyzer
    from csa_config import load_config

    # Load configuration
    config = load_config('analysis_config.json')

    # Initialize analyzer
    analyzer = CrystalAnalyzer(extraction_config=config)

    # Run complete pipeline
    analyzer.extract_data()
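The csd_ops_cls and extractor_cls constructor arguments suggest that substitute classes can be injected, for example to exercise the pipeline in tests without a CSD installation. The following is a minimal sketch of that pattern only. AnalyzerSketch, FakeConfig, FakeCSDOperations and FakeExtractor are hypothetical stand-ins, the arguments passed to the injected classes are assumed, and the refcodes are made up.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class FakeConfig:
    """Hypothetical stand-in for ExtractionConfig (only the fields used here)."""
    data_directory: Path = Path("./output")
    data_prefix: str = "analysis"
    actions: dict = field(default_factory=dict)


class FakeCSDOperations:
    """Hypothetical test double; records which methods were called."""

    def __init__(self, *args, **kwargs):
        self.calls = []

    def get_refcode_families_df(self):
        self.calls.append("get_refcode_families_df")
        # Made-up (family_id, refcode) rows instead of a real CSD query.
        return [("ABCDEF", "ABCDEF"), ("ABCDEF", "ABCDEF01")]


class FakeExtractor:
    """Hypothetical test double for StructureDataExtractor."""

    def __init__(self, *args, **kwargs):
        pass


class AnalyzerSketch:
    """Mimics only the documented constructor signature of CrystalAnalyzer."""

    def __init__(self, extraction_config,
                 csd_ops_cls=FakeCSDOperations, extractor_cls=FakeExtractor):
        self.extraction_config = extraction_config
        self.data_dir = Path(extraction_config.data_directory)
        # Assumption: the injected classes are instantiated with the config.
        self.csd_ops = csd_ops_cls(extraction_config)
        self.extractor = extractor_cls(extraction_config)


analyzer = AnalyzerSketch(FakeConfig(), csd_ops_cls=FakeCSDOperations)
families = analyzer.csd_ops.get_refcode_families_df()
```

Passing classes rather than instances lets the analyzer decide how and when to construct its collaborators, while tests can still swap them out.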
Methods
extract_data()
- CrystalAnalyzer.extract_data()[source]
Execute all data-extraction substeps specified by extraction_config.actions.
The sequence of substeps is:
1. _extract_refcode_families (if actions.get("get_refcode_families") is True)
2. _cluster_refcode_families (if actions.get("cluster_refcode_families") is True)
3. _extract_unique_structures (if actions.get("get_unique_structures") is True)
4. _extract_structure_data (if actions.get("get_structure_data") is True)
5. _post_extraction_process (if actions.get("post_extraction_process") is True)
During each substep, corresponding CSV/HDF5 files are generated (refcode lists, clustered families, per-structure atom lists, fragment datasets, etc.). The elapsed time for the entire pipeline is logged at INFO level.
- Raises:
Exception – If any substep fails (e.g., network error fetching from CSD, parsing error).
Executes the complete five-stage CSA pipeline:
1. Family Extraction - Query CSD for structure families
2. Similarity Clustering - Group similar crystal packings
3. Representative Selection - Choose optimal structures
4. Data Extraction - Extract detailed structural data
5. Feature Engineering - Compute advanced descriptors
Each stage can be enabled/disabled via the configuration file.
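The action-gated ordering described for extract_data() can be sketched as a small dispatch loop. This is an illustration under assumptions, not the module's implementation: run_enabled_substeps, SUBSTEPS and RecordingTarget are hypothetical names, and only the action keys and method names come from the documentation above.

```python
import logging
import time

logger = logging.getLogger("crystal_analyzer_sketch")

# (action key, method name) pairs in the documented order.
SUBSTEPS = [
    ("get_refcode_families", "_extract_refcode_families"),
    ("cluster_refcode_families", "_cluster_refcode_families"),
    ("get_unique_structures", "_extract_unique_structures"),
    ("get_structure_data", "_extract_structure_data"),
    ("post_extraction_process", "_post_extraction_process"),
]


def run_enabled_substeps(target, actions):
    """Call each enabled substep on `target` in order; return the names run."""
    executed = []
    start = time.perf_counter()
    for action_key, method_name in SUBSTEPS:
        # Missing keys yield None, so a substep runs only when set to True.
        if actions.get(action_key) is True:
            getattr(target, method_name)()
            executed.append(method_name)
    logger.info("Pipeline finished in %.2f s", time.perf_counter() - start)
    return executed


class RecordingTarget:
    """Hypothetical stand-in whose substeps do nothing."""

    def __getattr__(self, name):
        return lambda: None


ran = run_enabled_substeps(
    RecordingTarget(),
    {"get_refcode_families": True, "get_structure_data": True},
)
```

Keeping the order in one table makes it straightforward to resume a partially completed run by disabling the stages whose output files already exist.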
Private Methods
_extract_refcode_families()
- CrystalAnalyzer._extract_refcode_families()[source]
Query CSD to retrieve all refcode families, save to disk, and return.
This method performs the following steps:
- Invoke self.csd_ops.get_refcode_families_df()
- Receive a DataFrame with columns ['family_id', 'refcode']
- Write the DataFrame to disk at: extraction_config.data_directory / f"{extraction_config.data_prefix}_refcode_families.csv"
- Log the number of families retrieved at INFO level
- Returns:
DataFrame with columns:
- family_id : Unique integer or string ID for each refcode family
- refcode : CSD refcode belonging to that family
- Return type: pandas.DataFrame
Queries the CSD and groups structures into families based on refcode prefixes.
Returns:
CSV file with family mappings
Statistics on family sizes and composition
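Grouping by refcode prefix and writing the family mapping can be sketched as follows. CSD refcodes consist of six letters optionally followed by two digits, so the sketch takes the first six characters as the family identifier. It uses the standard csv module instead of pandas to stay dependency-free. group_by_family and write_family_csv are hypothetical helpers, and the refcodes are made up.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def group_by_family(refcodes):
    """Map each six-letter family prefix to its sorted member refcodes."""
    families = defaultdict(list)
    for refcode in sorted(refcodes):
        families[refcode[:6]].append(refcode)
    return dict(families)


def write_family_csv(families, data_directory, data_prefix):
    """Write (family_id, refcode) rows using the documented file name."""
    path = Path(data_directory) / f"{data_prefix}_refcode_families.csv"
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["family_id", "refcode"])
        for family_id, members in families.items():
            for refcode in members:
                writer.writerow([family_id, refcode])
    return path


families = group_by_family(["ABCDEF01", "ABCDEF", "GHIJKL", "ABCDEF02"])
with tempfile.TemporaryDirectory() as tmp:
    csv_path = write_family_csv(families, tmp, "analysis")
    rows = csv_path.read_text().splitlines()
```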
_cluster_refcode_families()
- CrystalAnalyzer._cluster_refcode_families()[source]
Group structures within each refcode family according to packing similarity.
This method performs the following steps:
- Read the CSV produced by _extract_refcode_families()
- For each family_id, call self.csd_ops.cluster_families(family_id, output_path) to perform clustering of atomic coordinates.
- Save clustering results to: extraction_config.data_directory / f"{extraction_config.data_prefix}_clustered_families.csv"
- Log the number of clusters and cluster sizes at INFO level.
- Raises:
RuntimeError – If clustering fails for any family (e.g., insufficient data, corrupted CIF).
Performs packing similarity clustering within each family using CCDC algorithms.
Process:
Validates structures against filter criteria
Computes 3D packing similarity for all pairs
Builds similarity graphs and identifies clusters
Outputs clustered family assignments
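One way to turn pairwise similarity results into clusters is to treat each similar pair as an edge and take the connected components of the resulting graph. The sketch below does this with a small union-find structure. It assumes connected components are an acceptable clustering rule, which the documentation does not confirm, and it replaces the CCDC packing-similarity comparison with a hypothetical is_similar callback.

```python
from itertools import combinations


def cluster_by_similarity(refcodes, is_similar):
    """Return connected components of the pairwise-similarity graph."""
    parent = {refcode: refcode for refcode in refcodes}

    def find(node):
        # Follow parent links to the root, halving the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Compare all pairs; merge the two components of every similar pair.
    for a, b in combinations(refcodes, 2):
        if is_similar(a, b):
            parent[find(a)] = find(b)

    clusters = {}
    for refcode in refcodes:
        clusters.setdefault(find(refcode), []).append(refcode)
    return sorted(sorted(members) for members in clusters.values())


# Made-up similarity result: only the first two structures pack alike.
similar_pairs = {frozenset(("ABCDEF", "ABCDEF01"))}
clusters = cluster_by_similarity(
    ["ABCDEF", "ABCDEF01", "ABCDEF02"],
    lambda a, b: frozenset((a, b)) in similar_pairs,
)
```

The all-pairs loop is quadratic in family size, which is why clustering cost grows quickly for large families.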
_extract_unique_structures()
- CrystalAnalyzer._extract_unique_structures()[source]
Retrieve unique crystal structures for each cluster representative.
This method performs the following steps:
- Read the clustered families CSV to identify one representative refcode per cluster.
- For each representative refcode:
  - Use self.csd_ops.get_unique_structures() to fetch atomic coordinates, symmetry operators, and other metadata.
  - Save the raw CIF to: extraction_config.data_directory / f"{extraction_config.data_prefix}_structures/{refcode}.cif"
- Update and log status (total structures fetched, failures, retries).
- Raises:
IOError – If any CIF fails to download or write to disk.
Selects one representative structure per cluster using the vdWFV metric.
Selection Criteria:
Minimum van der Waals free volume (1 - packing coefficient)
Lexicographic tie-breaking for identical values
Structure quality validation
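The first two selection criteria can be expressed as a single sort key: the van der Waals free volume first, then the refcode itself for lexicographic tie-breaking. The sketch below is illustrative only. select_representative is a hypothetical helper, the packing coefficients are made up, and structure quality validation is not modelled.

```python
def vdw_free_volume(packing_coefficient):
    """vdWFV as defined above: 1 - packing coefficient."""
    return 1.0 - packing_coefficient


def select_representative(cluster):
    """Pick the refcode with minimum vdWFV; ties go to the smaller refcode.

    `cluster` maps refcode -> packing coefficient.
    """
    return min(
        cluster,
        key=lambda refcode: (vdw_free_volume(cluster[refcode]), refcode),
    )


# Made-up packing coefficients; two members tie for the lowest vdWFV.
cluster = {"ABCDEF02": 0.70, "ABCDEF": 0.72, "ABCDEF01": 0.72}
representative = select_representative(cluster)
```

Using a tuple key keeps the selection deterministic regardless of the order in which cluster members are listed.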
_extract_structure_data()
- CrystalAnalyzer._extract_structure_data()[source]
Parse each downloaded CIF and extract fundamental structure data into HDF5.
For each CIF in extraction_config.data_directory:
- Use StructureDataExtractor to read atomic labels, fractional coordinates, symmetry operations, lattice parameters, and partial charges.
- Organize the extracted data into a pandas DataFrame.
- Batch-write the data to: extraction_config.data_directory / f"{extraction_config.data_prefix}_structure_data.h5"
- Log the total number of structures processed and any parse errors.
This method ensures that all per-structure numerics (coords, masks, labels) are stored in GPU-friendly formats for further GPU processing.
- Raises:
ValueError – If CIF parsing yields inconsistent shapes (e.g., mismatched atom count vs. mask).
IOError – If HDF5 write fails due to disk space or file permissions.
Extracts detailed molecular and crystal data for selected representatives.
Data Extracted:
Atomic coordinates, labels, and properties
Bond connectivity and rotatability
Intermolecular contacts and hydrogen bonds
Crystal parameters and symmetry operations
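Structures differ in atom count, so storing them in fixed-shape arrays usually means padding to a common length and keeping a boolean mask of the real atoms. The sketch below shows that idea together with the kind of consistency check that would raise the ValueError described above. The padded layout and the pad_structures helper are assumptions, since the documentation does not specify the on-disk format, and the coordinates are made up.

```python
def pad_structures(structures, pad_value=0.0):
    """Pad per-structure coordinates to a common atom count.

    Each structure is a dict with 'labels' (list of str) and 'coords'
    (list of (x, y, z) tuples). Returns (coords, masks), where a mask
    entry is True for a real atom and False for padding.
    """
    max_atoms = max(len(s["labels"]) for s in structures)
    coords, masks = [], []
    for index, s in enumerate(structures):
        n_atoms = len(s["labels"])
        if len(s["coords"]) != n_atoms:
            raise ValueError(
                f"structure {index}: {n_atoms} labels but "
                f"{len(s['coords'])} coordinate rows"
            )
        padding = max_atoms - n_atoms
        coords.append(
            [list(xyz) for xyz in s["coords"]]
            + [[pad_value] * 3 for _ in range(padding)]
        )
        masks.append([True] * n_atoms + [False] * padding)
    return coords, masks


# Made-up fractional coordinates for two small structures.
batch = [
    {"labels": ["C1", "O1"], "coords": [(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)]},
    {"labels": ["C1", "N1", "H1"],
     "coords": [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (0.9, 0.1, 0.2)]},
]
coords, masks = pad_structures(batch)
```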
_post_extraction_process()
- CrystalAnalyzer._post_extraction_process()[source]
Perform all post-extraction computations on the raw structure data.
This step typically includes:
- Fragment identification (rigid-fragment or molecular fragment detection)
- Computation of fragment centers of mass (Cartesian & fractional)
- Computation of fragment inertia tensors, eigenvalues, and quaternions
- Computation of all intermolecular contacts and hydrogen-bond identification
- Computation of distances/vectors from each contact atom to fragment COM
- Augmentation of HDF5 datasets with new variable-length datasets for fragment-related properties
Notes
Data flows:
raw_HDF5 -> (load into torch tensors on CPU) -> run batch computations -> append new datasets to HDF5 -> log memory utilization and batch progress
- Raises:
RuntimeError – If any batched computation fails (e.g., OOM) or if shape mismatches occur when writing back to HDF5.
IOError – If appending to or reading from the HDF5 file fails.
Computes advanced features and descriptors using GPU acceleration.
Features Computed:
Fragment identification and properties
Geometric descriptors (angles, torsions, planarity)
Contact mapping and interaction analysis
Statistical order parameters
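Two of the fragment properties named above, the center of mass and the inertia tensor, follow directly from their textbook definitions. The sketch below computes them for a single fragment in plain Python for readability. The module itself is described as running batched torch computations, so this illustrates the formulas only. The function names are hypothetical and the two-atom fragment is made up.

```python
def center_of_mass(masses, coords):
    """Mass-weighted mean position of a fragment (Cartesian coordinates)."""
    total = sum(masses)
    return tuple(
        sum(m * xyz[axis] for m, xyz in zip(masses, coords)) / total
        for axis in range(3)
    )


def inertia_tensor(masses, coords):
    """3x3 inertia tensor about the center of mass.

    I[i][j] = sum_k m_k * (|r_k|^2 * delta_ij - r_k[i] * r_k[j]),
    where r_k is the position of atom k relative to the center of mass.
    """
    com = center_of_mass(masses, coords)
    tensor = [[0.0] * 3 for _ in range(3)]
    for m, xyz in zip(masses, coords):
        r = [xyz[axis] - com[axis] for axis in range(3)]
        r_squared = sum(component * component for component in r)
        for i in range(3):
            for j in range(3):
                delta = 1.0 if i == j else 0.0
                tensor[i][j] += m * (r_squared * delta - r[i] * r[j])
    return tensor


# Two unit masses on the x axis, one unit either side of the origin.
masses = [1.0, 1.0]
coords = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
com = center_of_mass(masses, coords)
tensor = inertia_tensor(masses, coords)
```

For this linear fragment the moment about the x axis vanishes and the two perpendicular moments are equal. The eigenvectors of the tensor give the principal axes from which an orientation quaternion can be derived.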
Private Attributes
- extraction_config : ExtractionConfig
Configuration object controlling pipeline behavior
- csd_ops : CSDOperations
Handler for CSD database operations
- extractor : StructureDataExtractor
Component for raw data extraction
- data_dir : pathlib.Path
Directory for intermediate and output files
Error Handling
The CrystalAnalyzer includes comprehensive error handling:
- Validation Errors
Configuration validation failures, missing files, invalid parameters
- Database Errors
CSD connectivity issues, license problems, corrupted entries
- Resource Errors
Insufficient memory, disk space, or GPU resources
- Processing Errors
Structure validation failures, computation errors, file I/O issues
All errors are logged with detailed context information to facilitate debugging.
Configuration Dependencies
The CrystalAnalyzer requires a properly configured ExtractionConfig object:
{
  "extraction": {
    "data_directory": "./output",
    "data_prefix": "analysis",
    "actions": {
      "get_refcode_families": true,
      "cluster_refcode_families": true,
      "get_unique_structures": true,
      "get_structure_data": true,
      "post_extraction_process": true
    },
    "filters": {
      "target_z_prime_values": [1],
      "crystal_type": ["homomolecular"],
      "molecule_weight_limit": 500.0,
      "target_species": ["C", "H", "N", "O"]
    },
    "extraction_batch_size": 32,
    "post_extraction_batch_size": 16
  }
}
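A configuration of this shape can be inspected with the standard json module before a run, for instance to confirm which stages are enabled and to catch misspelled action keys. The sketch below parses a trimmed copy of the example and is not a substitute for csa_config. enabled_actions and KNOWN_ACTIONS are hypothetical names, and rejecting unknown keys is an assumed policy.

```python
import json

# Trimmed copy of the example configuration shown above.
CONFIG_TEXT = """
{
  "extraction": {
    "data_directory": "./output",
    "data_prefix": "analysis",
    "actions": {
      "get_refcode_families": true,
      "cluster_refcode_families": false,
      "get_structure_data": true
    },
    "extraction_batch_size": 32,
    "post_extraction_batch_size": 16
  }
}
"""

# Action keys in the documented pipeline order.
KNOWN_ACTIONS = (
    "get_refcode_families",
    "cluster_refcode_families",
    "get_unique_structures",
    "get_structure_data",
    "post_extraction_process",
)


def enabled_actions(config):
    """Return the enabled action keys in pipeline order.

    Raises KeyError for action names that are not part of the pipeline.
    """
    actions = config["extraction"]["actions"]
    unknown = sorted(set(actions) - set(KNOWN_ACTIONS))
    if unknown:
        raise KeyError(f"unknown action(s): {', '.join(unknown)}")
    return [name for name in KNOWN_ACTIONS if actions.get(name) is True]


config = json.loads(CONFIG_TEXT)
enabled = enabled_actions(config)
```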
Performance Considerations
- Memory Usage
Peak memory usage occurs during post-extraction processing and scales with:
Batch size settings
Structure complexity (atoms, contacts)
GPU memory availability
- Processing Time
Pipeline duration depends on:
Dataset size (number of families/structures)
Similarity clustering complexity
Available computational resources
- Storage Requirements
Output file sizes scale with:
Number of selected structures
Average structure complexity
Feature completeness
Optimization Tips
Use GPU acceleration for stages 4-5
Optimize batch sizes for available memory
Use SSD storage for HDF5 operations
Monitor resource usage during processing
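Because peak memory scales with the batch size, lowering extraction_batch_size or post_extraction_batch_size trades throughput for a smaller working set. The sketch below shows the underlying chunking idea. iter_batches is a hypothetical helper and the item count is arbitrary.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 40 placeholder structure indices processed 16 at a time.
structure_indices = list(range(40))
batches = list(iter_batches(structure_indices, 16))
sizes = [len(batch) for batch in batches]
```

Only one batch needs to be resident at a time, so halving the batch size roughly halves the per-batch memory footprint at the cost of more iterations.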
See Also
- csa_config module : Configuration management
- csd_operations module : CSD database operations
- structure_data_extractor module : Raw data extraction
- structure_post_extraction_processor module : Feature engineering