crystal_analyzer module

Module: crystal_analyzer.py

Main orchestration logic for extracting and processing molecular-crystal data from the Cambridge Structural Database (CSD).

This module defines the CrystalAnalyzer class, which orchestrates the end-to-end pipeline for:

  • Extraction of refcode families

  • Clustering of structures

  • Extraction of structure-specific data

  • Post-extraction processing (e.g., computing fragment properties)

Dependencies

  • pandas

  • torch

  • csa_config

  • csd_operations

  • structure_data_extractor

  • structure_post_extraction_processor

class crystal_analyzer.CrystalAnalyzer(extraction_config, csd_ops_cls=<class 'csd_operations.CSDOperations'>, extractor_cls=<class 'structure_data_extractor.StructureDataExtractor'>)[source]

Bases: object

Orchestrates the end-to-end extraction and processing pipeline for molecular-crystal data from the CSD.

extraction_config

Controls which extraction substeps to run, batch sizes, file paths, and CSD filtering criteria.

Type:

ExtractionConfig

csd_ops

Handles direct interactions with the CSD (refcode families, downloads, etc.).

Type:

CSDOperations

extractor

Performs detailed per-structure data extraction and parsing into HDF5.

Type:

StructureDataExtractor

data_dir

Directory where intermediate and output data (CSV, HDF5) are stored.

Type:

pathlib.Path

__init__(extraction_config, csd_ops_cls=<class 'csd_operations.CSDOperations'>, extractor_cls=<class 'structure_data_extractor.StructureDataExtractor'>)[source]

Initialize the CrystalAnalyzer pipeline with specified configurations.

Parameters:
  • extraction_config (ExtractionConfig) – Configuration object controlling which extraction substeps to run, batch sizes, file paths, and CSD filtering criteria.

  • csd_ops_cls (Type[CSDOperations], optional) – Class implementing CSD operations. Default is CSDOperations.

  • extractor_cls (Type[StructureDataExtractor], optional) – Class for extracting structure-specific data. Default is StructureDataExtractor.

Raises:
  • RuntimeError – If any batched computation fails (e.g., OOM) or if shape mismatches occur when writing back to HDF5.

  • IOError – If appending to or reading from the HDF5 file fails.

Notes

Data flows:

raw_HDF5 -> (load into torch tensors on CPU) -> run batch computations
         -> append new datasets to HDF5 -> log memory utilization and
         batch progress
extract_data()[source]

Execute all data-extraction substeps specified by extraction_config.actions.

The sequence of substeps is:

  1. _extract_refcode_families (if actions.get("get_refcode_families") is True)

  2. _cluster_refcode_families (if actions.get("cluster_refcode_families") is True)

  3. _extract_unique_structures (if actions.get("get_unique_structures") is True)

  4. _extract_structure_data (if actions.get("get_structure_data") is True)

  5. _post_extraction_process (if actions.get("post_extraction_process") is True)

During each substep, corresponding CSV/HDF5 files are generated (refcode lists, clustered families, per-structure atom lists, fragment datasets, etc.). The elapsed time for the entire pipeline is logged at INFO level.

Raises:

Exception – If any substep fails (e.g., network error fetching from CSD, parsing error).

CrystalAnalyzer Class

The main orchestration class for the CSA pipeline. This class coordinates all five stages of the analysis workflow and manages the flow of data between components.

Key Responsibilities:

  • Pipeline orchestration and stage management

  • Configuration validation and setup

  • Resource management and cleanup

  • Error handling and recovery

  • Progress monitoring and logging

Usage Example:

from crystal_analyzer import CrystalAnalyzer
from csa_config import load_config

# Load configuration
config = load_config('analysis_config.json')

# Initialize analyzer
analyzer = CrystalAnalyzer(extraction_config=config)

# Run complete pipeline
analyzer.extract_data()
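
Because csd_ops_cls and extractor_cls are passed as classes rather than instances, it should be possible to substitute lightweight stand-ins, for example to exercise the pipeline without a CSD installation. The sketch below illustrates that idea only. SketchAnalyzer and FakeCSDOperations are hypothetical, and the assumption that the injected class is instantiated with the configuration object is not confirmed by this page.

```python
from pathlib import Path
from types import SimpleNamespace


class FakeCSDOperations:
    """Hypothetical test double that returns canned rows instead of querying the CSD."""

    def __init__(self, config):
        self.config = config

    def get_refcode_families_df(self):
        # The real method is documented to return a DataFrame; plain tuples keep
        # this sketch free of third-party dependencies.
        return [("ABCDEF", "ABCDEF"), ("ABCDEF", "ABCDEF01")]


class SketchAnalyzer:
    """Stand-in that mirrors the documented constructor signature (not the real class)."""

    def __init__(self, extraction_config, csd_ops_cls, extractor_cls=None):
        self.extraction_config = extraction_config
        self.data_dir = Path(extraction_config.data_directory)
        # Assumption: the injected class is instantiated with the config object.
        self.csd_ops = csd_ops_cls(extraction_config)
        self.extractor = extractor_cls(extraction_config) if extractor_cls else None


config = SimpleNamespace(data_directory="./output", data_prefix="analysis")
analyzer = SketchAnalyzer(config, csd_ops_cls=FakeCSDOperations)
rows = analyzer.csd_ops.get_refcode_families_df()
```

Passing classes keeps construction details inside the analyzer while still letting callers swap the collaborator type.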

Methods

extract_data()

Executes the complete five-stage CSA pipeline:

  1. Family Extraction - Query CSD for structure families

  2. Similarity Clustering - Group similar crystal packings

  3. Representative Selection - Choose optimal structures

  4. Data Extraction - Extract detailed structural data

  5. Feature Engineering - Compute advanced descriptors

Each stage can be enabled/disabled via the configuration file.
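
The flag-driven dispatch described above can be sketched as follows. This is a minimal illustration, assuming the five documented action keys and method names. RecordingTarget is a hypothetical stand-in for the analyzer, and the logger name is arbitrary.

```python
import logging
import time

logger = logging.getLogger("crystal_analyzer_sketch")

# Action flag -> method name, in the documented execution order.
SUBSTEPS = [
    ("get_refcode_families", "_extract_refcode_families"),
    ("cluster_refcode_families", "_cluster_refcode_families"),
    ("get_unique_structures", "_extract_unique_structures"),
    ("get_structure_data", "_extract_structure_data"),
    ("post_extraction_process", "_post_extraction_process"),
]


def select_substeps(actions):
    """Return the method names whose action flag is exactly True."""
    return [method for flag, method in SUBSTEPS if actions.get(flag) is True]


def run_pipeline(target, actions):
    """Call each enabled substep on `target` in order and log the elapsed time."""
    start = time.perf_counter()
    executed = []
    for method in select_substeps(actions):
        getattr(target, method)()
        executed.append(method)
    logger.info("Pipeline finished in %.2f s", time.perf_counter() - start)
    return executed


class RecordingTarget:
    """Hypothetical stand-in: every substep name resolves to a call recorder."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, i.e. the substep names.
        return lambda: self.calls.append(name)


target = RecordingTarget()
executed = run_pipeline(
    target,
    {"get_refcode_families": True, "cluster_refcode_families": False,
     "post_extraction_process": True},
)
```

A missing flag behaves like a disabled one here, because actions.get(...) returns None and the check requires True.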

Private Methods

_extract_refcode_families()

CrystalAnalyzer._extract_refcode_families()[source]

Query CSD to retrieve all refcode families, save to disk, and return.

This method performs the following steps:

  • Invoke self.csd_ops.get_refcode_families_df()

  • Receive a DataFrame with columns ['family_id', 'refcode']

  • Write the DataFrame to disk at: extraction_config.data_directory / f"{extraction_config.data_prefix}_refcode_families.csv"

  • Log the number of families retrieved at INFO level

Returns:

DataFrame with columns:

  • family_id : Unique integer or string ID for each refcode family

  • refcode : CSD refcode belonging to that family

Return type:

pandas.DataFrame

Queries the CSD and groups structures into families based on refcode prefixes.

Outputs:

  • CSV file with family mappings

  • Statistics on family sizes and composition
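
Grouping by refcode prefix and writing the family CSV can be sketched as follows. The sketch assumes the usual CSD convention that a refcode is six letters optionally followed by two digits, so the first six characters identify the family. It uses the standard-library csv module instead of pandas to stay dependency-free, and the helper names are hypothetical.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def group_by_family(refcodes):
    """Map each six-letter family prefix to its sorted member refcodes."""
    families = defaultdict(list)
    for refcode in refcodes:
        families[refcode[:6]].append(refcode)
    return {family: sorted(members) for family, members in sorted(families.items())}


def write_families_csv(families, data_directory, data_prefix):
    """Write (family_id, refcode) rows using the documented file-name pattern."""
    path = Path(data_directory) / f"{data_prefix}_refcode_families.csv"
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["family_id", "refcode"])
        for family_id, members in families.items():
            for refcode in members:
                writer.writerow([family_id, refcode])
    return path


families = group_by_family(["ACSALA01", "ACSALA", "BENZEN", "BENZEN02"])
family_sizes = {family: len(members) for family, members in families.items()}

with tempfile.TemporaryDirectory() as tmp:
    csv_path = write_families_csv(families, tmp, "analysis")
    lines = csv_path.read_text().splitlines()
```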

_cluster_refcode_families()

CrystalAnalyzer._cluster_refcode_families()[source]

Group structures within each refcode family according to packing similarity.

This method performs the following steps:

  • Read the CSV produced by _extract_refcode_families()

  • For each family_id, call self.csd_ops.cluster_families(family_id, output_path) to perform clustering of atomic coordinates.

  • Save clustering results to: extraction_config.data_directory / f"{extraction_config.data_prefix}_clustered_families.csv"

  • Log the number of clusters and cluster sizes at INFO level.

Raises:

RuntimeError – If clustering fails for any family (e.g., insufficient data, corrupted CIF).

Performs packing similarity clustering within each family using CCDC algorithms.

Process:

  1. Validates structures against filter criteria

  2. Computes 3D packing similarity for all pairs

  3. Builds similarity graphs and identifies clusters

  4. Outputs clustered family assignments
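
Steps 3 and 4 can be illustrated with a small union-find over pairs already judged similar. This is only a sketch: it assumes clusters are the connected components of the similarity graph, which this page does not state, and it takes the similar pairs as given rather than computing packing similarity with the CCDC tools.

```python
def cluster_by_similarity(refcodes, similar_pairs):
    """Group refcodes into connected components of the similarity graph."""
    parent = {refcode: refcode for refcode in refcodes}

    def find(item):
        # Walk to the root, compressing the path along the way.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for left, right in similar_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    clusters = {}
    for refcode in refcodes:
        clusters.setdefault(find(refcode), []).append(refcode)
    return sorted(sorted(members) for members in clusters.values())


# Hand-written similarity pairs for one hypothetical family.
family = ["ACSALA", "ACSALA01", "ACSALA02", "ACSALA03"]
pairs = [("ACSALA", "ACSALA01"), ("ACSALA01", "ACSALA03")]
clusters = cluster_by_similarity(family, pairs)
```

With connected components, two structures land in the same cluster even if they are linked only through a third one. A stricter definition would need a different grouping rule.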

_extract_unique_structures()

CrystalAnalyzer._extract_unique_structures()[source]

Retrieve unique crystal structures for each cluster representative.

This method performs the following steps:

  • Read the clustered families CSV to identify one representative refcode per cluster.

  • For each representative refcode:

    • Use self.csd_ops.get_unique_structures() to fetch atomic coordinates, symmetry operators, and other metadata.

    • Save the raw CIF to: extraction_config.data_directory / f"{extraction_config.data_prefix}_structures/{refcode}.cif"

  • Update and log status (total structures fetched, failures, retries).

Raises:

IOError – If any CIF fails to download or write to disk.

Selects one representative structure per cluster using the vdWFV metric.

Selection Criteria:

  • Minimum van der Waals free volume (1 - packing coefficient)

  • Lexicographic tie-breaking for identical values

  • Structure quality validation
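
The first two selection criteria can be sketched as a single min() over a composite key. The record layout (dictionaries with refcode and vdwfv keys) and the packing coefficients are made up for illustration, and structure quality validation is not modeled.

```python
def vdw_free_volume(packing_coefficient):
    """vdWFV as defined above: 1 - packing coefficient."""
    return 1.0 - packing_coefficient


def select_representative(cluster):
    """Pick the smallest vdWFV; ties go to the lexicographically first refcode."""
    best = min(cluster, key=lambda entry: (entry["vdwfv"], entry["refcode"]))
    return best["refcode"]


# Illustrative values only; two members tie on vdWFV.
cluster = [
    {"refcode": "ACSALA03", "vdwfv": vdw_free_volume(0.72)},
    {"refcode": "ACSALA", "vdwfv": vdw_free_volume(0.70)},
    {"refcode": "ACSALA01", "vdwfv": vdw_free_volume(0.72)},
]
representative = select_representative(cluster)
```

Sorting on the tuple (vdwfv, refcode) makes the tie-break deterministic regardless of input order.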

_extract_structure_data()

CrystalAnalyzer._extract_structure_data()[source]

Parse each downloaded CIF and extract fundamental structure data into HDF5.

For each CIF in extraction_config.data_directory:

  • Use StructureDataExtractor to read atomic labels, fractional coordinates, symmetry operations, lattice parameters, and partial charges.

  • Organize the extracted data into a pandas DataFrame.

  • Batch-write the data to: extraction_config.data_directory / f"{extraction_config.data_prefix}_structure_data.h5"

  • Log the total number of structures processed and any parse errors.

This method ensures that all per-structure numerics (coords, masks, labels) are stored in formats suited to later GPU processing.

Raises:
  • ValueError – If CIF parsing yields inconsistent shapes (e.g., mismatched atom count vs. mask).

  • IOError – If HDF5 write fails due to disk space or file permissions.
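
One common way to make variable-size structures batch-friendly is to pad coordinates to a common atom count and keep a boolean mask of the real atoms. The sketch below assumes that layout, which this page does not document, and uses plain lists rather than tensors. The shape check mirrors the ValueError condition described above.

```python
def pad_structures(structures, pad_value=0.0):
    """Pad per-structure coordinate lists to a common length and build masks."""
    max_atoms = max(len(coords) for coords in structures)
    padded, masks = [], []
    for coords in structures:
        n_missing = max_atoms - len(coords)
        padded.append(
            [list(xyz) for xyz in coords]
            + [[pad_value] * 3 for _ in range(n_missing)]
        )
        masks.append([True] * len(coords) + [False] * n_missing)
    return padded, masks


def check_shapes(padded, masks):
    """Raise ValueError when a coordinate block and its mask disagree in length."""
    for index, (coords, mask) in enumerate(zip(padded, masks)):
        if len(coords) != len(mask):
            raise ValueError(
                f"structure {index}: {len(coords)} atoms vs. mask of {len(mask)}"
            )


# Two hypothetical structures with three atoms and one atom.
structures = [
    [(0.0, 0.0, 0.0), (0.1, 0.2, 0.3), (0.4, 0.5, 0.6)],
    [(0.25, 0.25, 0.25)],
]
padded, masks = pad_structures(structures)
check_shapes(padded, masks)
```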

Extracts detailed molecular and crystal data for selected representatives.

Data Extracted:

  • Atomic coordinates, labels, and properties

  • Bond connectivity and rotatability

  • Intermolecular contacts and hydrogen bonds

  • Crystal parameters and symmetry operations

_post_extraction_process()

CrystalAnalyzer._post_extraction_process()[source]

Perform all post-extraction computations on the raw structure data.

This step typically includes:

  • Fragment identification (rigid-fragment or molecular fragment detection)

  • Computation of fragment centers of mass (Cartesian & fractional)

  • Computation of fragment inertia tensors, eigenvalues, and quaternions

  • Computation of all intermolecular contacts and hydrogen-bond identification

  • Computation of distances/vectors from each contact atom to fragment COM

  • Augmentation of HDF5 datasets with new variable-length datasets for fragment-related properties

Notes

Data flows:

raw_HDF5 -> (load into torch tensors on CPU) -> run batch computations
         -> append new datasets to HDF5 -> log memory utilization and
         batch progress
Raises:
  • RuntimeError – If any batched computation fails (e.g., OOM) or if shape mismatches occur when writing back to HDF5.

  • IOError – If appending to or reading from the HDF5 file fails.
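
For reference, the center-of-mass and inertia-tensor computations named above follow the standard formulas. The sketch below evaluates them for a single fragment in plain Python. The actual pipeline is described as running batched torch computations, so this shows only the arithmetic, not the implementation.

```python
def center_of_mass(masses, coords):
    """Mass-weighted mean position in Cartesian coordinates."""
    total = sum(masses)
    return tuple(
        sum(mass * xyz[axis] for mass, xyz in zip(masses, coords)) / total
        for axis in range(3)
    )


def inertia_tensor(masses, coords):
    """3x3 inertia tensor about the center of mass, as nested lists."""
    com = center_of_mass(masses, coords)
    tensor = [[0.0] * 3 for _ in range(3)]
    for mass, xyz in zip(masses, coords):
        r = [xyz[axis] - com[axis] for axis in range(3)]
        r_squared = sum(component * component for component in r)
        for i in range(3):
            for j in range(3):
                delta = 1.0 if i == j else 0.0
                # I_ij = sum_k m_k * (|r_k|^2 * delta_ij - r_ki * r_kj)
                tensor[i][j] += mass * (r_squared * delta - r[i] * r[j])
    return tensor


# Two unit masses on the x axis: no inertia about x, equal inertia about y and z.
masses = [1.0, 1.0]
coords = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
com = center_of_mass(masses, coords)
tensor = inertia_tensor(masses, coords)
```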

Computes advanced features and descriptors using GPU acceleration.

Features Computed:

  • Fragment identification and properties

  • Geometric descriptors (angles, torsions, planarity)

  • Contact mapping and interaction analysis

  • Statistical order parameters

Private Attributes

extraction_config : ExtractionConfig

Configuration object controlling pipeline behavior

csd_ops : CSDOperations

Handler for CSD database operations

extractor : StructureDataExtractor

Component for raw data extraction

data_dir : pathlib.Path

Directory for intermediate and output files

Error Handling

Errors raised during a CrystalAnalyzer run fall into four categories:

Validation Errors

Configuration validation failures, missing files, invalid parameters

Database Errors

CSD connectivity issues, license problems, corrupted entries

Resource Errors

Insufficient memory, disk space, or GPU resources

Processing Errors

Structure validation failures, computation errors, file I/O issues

All errors are logged with detailed context information to facilitate debugging.
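
Logging an error with its context before letting it propagate can be sketched with the standard logging module. run_stage is a hypothetical helper, not part of the documented API, and the context keys are arbitrary.

```python
import logging

logger = logging.getLogger("crystal_analyzer_sketch")


def run_stage(name, func, **context):
    """Run one stage, logging its context and traceback before re-raising."""
    try:
        return func()
    except Exception:
        # logger.exception records the message at ERROR level with the traceback.
        logger.exception("Stage %r failed (context: %s)", name, context)
        raise


result = run_stage("demo", lambda: 42, refcode="ABCDEF")
```

Re-raising keeps the original exception type visible to the caller, so the documented Raises sections still apply.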

Configuration Dependencies

The CrystalAnalyzer requires a properly configured ExtractionConfig object:

{
  "extraction": {
    "data_directory": "./output",
    "data_prefix": "analysis",
    "actions": {
      "get_refcode_families": true,
      "cluster_refcode_families": true,
      "get_unique_structures": true,
      "get_structure_data": true,
      "post_extraction_process": true
    },
    "filters": {
      "target_z_prime_values": [1],
      "crystal_type": ["homomolecular"],
      "molecule_weight_limit": 500.0,
      "target_species": ["C", "H", "N", "O"]
    },
    "extraction_batch_size": 32,
    "post_extraction_batch_size": 16
  }
}
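
Reading the actions block from such a file can be sketched as follows. read_actions is a hypothetical helper, not the behavior of csa_config.load_config. Defaulting a missing flag to False is an assumption, chosen to match the "is True" checks that extract_data() is documented to perform.

```python
import json

KNOWN_ACTIONS = (
    "get_refcode_families",
    "cluster_refcode_families",
    "get_unique_structures",
    "get_structure_data",
    "post_extraction_process",
)


def read_actions(config_text):
    """Return a complete flag mapping; reject unknown names and non-boolean values."""
    actions = json.loads(config_text)["extraction"].get("actions", {})
    unknown = set(actions) - set(KNOWN_ACTIONS)
    if unknown:
        raise ValueError(f"Unknown actions: {sorted(unknown)}")
    for name, value in actions.items():
        if not isinstance(value, bool):
            raise ValueError(f"Action {name!r} must be true or false, got {value!r}")
    # Assumption: a flag that is absent from the file counts as disabled.
    return {name: actions.get(name, False) for name in KNOWN_ACTIONS}


example = """
{
  "extraction": {
    "actions": {
      "get_refcode_families": true,
      "cluster_refcode_families": false
    }
  }
}
"""
actions = read_actions(example)
```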

Performance Considerations

Memory Usage

Peak memory usage occurs during post-extraction processing and scales with:

  • Batch size settings

  • Structure complexity (atoms, contacts)

  • GPU memory availability

Processing Time

Pipeline duration depends on:

  • Dataset size (number of families/structures)

  • Similarity clustering complexity

  • Available computational resources

Storage Requirements

Output file sizes scale with:

  • Number of selected structures

  • Average structure complexity

  • Feature completeness

Optimization Tips

  • Use GPU acceleration for stages 4-5

  • Optimize batch sizes for available memory

  • Use SSD storage for HDF5 operations

  • Monitor resource usage during processing
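
The effect of the batch-size settings can be pictured with a simple chunking helper. iter_batches is a hypothetical name and the refcodes are placeholders. Smaller batches lower peak memory at the cost of more iterations.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


refcodes = ["AAAAAA", "BBBBBB", "CCCCCC", "DDDDDD", "EEEEEE"]
batches = list(iter_batches(refcodes, batch_size=2))
```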

See Also

  • csa_config module : Configuration management

  • csd_operations module : CSD database operations

  • structure_data_extractor module : Raw data extraction

  • structure_post_extraction_processor module : Feature engineering