csa_config module

Module: csa_config.py

Configuration objects and loader for the Crystal Structure Analysis pipeline.

This module defines:

  • ExtractionConfig: dataclass controlling extraction parameters.

  • load_config: utility to construct ExtractionConfig from a JSON file.

class csa_config.ExtractionConfig(data_directory, data_prefix, actions, filters, extraction_batch_size, post_extraction_batch_size)[source]

Bases: object

Configuration settings for the data-extraction pipeline.

Parameters:
  • data_directory (Path) – Directory under which all raw and intermediate extraction outputs will be stored. Subdirectories (e.g. “structures/”, “csv/”) are created automatically.

  • data_prefix (str) – Prefix used when naming output files, for example "{data_prefix}_refcode_families.csv".

  • actions (Dict[str, bool]) – Flags to enable or skip individual extraction substeps:
      - get_refcode_families
      - cluster_refcode_families
      - get_unique_structures
      - get_structure_data
      - post_extraction_process

  • filters (Dict[str, Any]) – Criteria for filtering CSD entries, for example:
      - elements (List[str]): only structures containing these elements
      - min_resolution (float): only structures with resolution ≤ this value
      - space_groups (List[str]): only structures in these space groups

  • extraction_batch_size (int) – Number of structures or refcode families to process per batch during extraction

  • post_extraction_batch_size (int) – Number of structures to process per batch during post-extraction

from_json(cls, json_path)[source]

Load and validate fields from the “extraction” section of a JSON file.

data_directory: Path
data_prefix: str
actions: Dict[str, bool]
filters: Dict[str, Any]
extraction_batch_size: int
post_extraction_batch_size: int

classmethod from_json(json_path)[source]

Load an ExtractionConfig from a JSON file.

Parameters:

json_path (Union[str, Path]) – Path to the JSON configuration file.

Returns:

Instance populated from the “extraction” section.

Return type:

ExtractionConfig

__init__(data_directory, data_prefix, actions, filters, extraction_batch_size, post_extraction_batch_size)

csa_config.load_config(config_path)[source]

Read a JSON configuration file and return an ExtractionConfig instance.

Parameters:

config_path (Union[str, Path]) – Path to the JSON configuration file.

Returns:

Dataclass instance loaded from the “extraction” section.

Return type:

ExtractionConfig

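Raises:

FileNotFoundError – If the configuration file does not exist.

KeyError – If the “extraction” section is missing.

json.JSONDecodeError – If the file is not valid JSON.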

Configuration Management for CSA Pipeline

The csa_config module provides robust configuration management for the Crystal Structure Analysis pipeline through the ExtractionConfig dataclass and associated loading utilities.

ExtractionConfig Class

Configuration dataclass controlling all aspects of the CSA extraction pipeline.

Core Configuration Parameters:

  • data_directory (Path) - Base directory for all extraction outputs

  • data_prefix (str) - Filename prefix for generated files

  • actions (Dict[str, bool]) - Pipeline stage enable/disable flags

  • filters (Dict[str, Any]) - Structure filtering and validation criteria

  • extraction_batch_size (int) - Batch size for raw data extraction

  • post_extraction_batch_size (int) - Batch size for feature computation
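
As a rough illustration of how these six parameters fit together, the sketch below declares a stand-in dataclass with the same field names and types as the documented signature. The class name ExtractionConfigSketch is hypothetical and is not part of csa_config; the values are taken from the JSON example in this document.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict


@dataclass
class ExtractionConfigSketch:
    """Illustrative stand-in for csa_config.ExtractionConfig (not the real class)."""

    data_directory: Path
    data_prefix: str
    actions: Dict[str, bool]
    filters: Dict[str, Any]
    extraction_batch_size: int
    post_extraction_batch_size: int


config = ExtractionConfigSketch(
    data_directory=Path("./analysis_output"),
    data_prefix="my_analysis",
    actions={"get_refcode_families": True},
    filters={"target_z_prime_values": [1]},
    extraction_batch_size=32,
    post_extraction_batch_size=16,
)

# Output file names are derived from the prefix, following the documented
# "{data_prefix}_refcode_families.csv" pattern.
families_csv_name = f"{config.data_prefix}_refcode_families.csv"
print(families_csv_name)
```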

Pipeline Actions Control:

The actions dictionary controls which pipeline stages execute:

actions = {
    "get_refcode_families": True,      # Stage 1: Family extraction
    "cluster_refcode_families": True,  # Stage 2: Similarity clustering
    "get_unique_structures": True,     # Stage 3: Representative selection
    "get_structure_data": True,        # Stage 4: Raw data extraction
    "post_extraction_process": True    # Stage 5: Feature engineering
}
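
One way to picture how such flags could be consumed is a helper that returns the enabled stages in the documented order. This is a sketch only: enabled_stages and STAGE_ORDER are hypothetical names, and treating a missing flag as disabled is an assumption rather than documented behaviour.

```python
from typing import Dict, List

# Stage names in the order listed in the actions dictionary above.
STAGE_ORDER = [
    "get_refcode_families",
    "cluster_refcode_families",
    "get_unique_structures",
    "get_structure_data",
    "post_extraction_process",
]


def enabled_stages(actions: Dict[str, bool]) -> List[str]:
    """Return the enabled stage names in pipeline order (hypothetical helper)."""
    # Assumption: a flag that is absent from the dictionary counts as disabled.
    return [name for name in STAGE_ORDER if actions.get(name, False)]


example_actions = {
    "get_refcode_families": True,
    "cluster_refcode_families": False,
    "get_unique_structures": True,
    "get_structure_data": True,
    "post_extraction_process": False,
}
print(enabled_stages(example_actions))
```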

Filter Criteria Examples:

Quality filters ensure reliable structural data:

filters = {
    "target_z_prime_values": [1],           # Z' constraint
    "crystal_type": ["homomolecular"],      # Single molecule type
    "molecule_weight_limit": 500.0,         # Dalton upper limit
    "target_species": ["C", "H", "N", "O"], # Allowed elements
    "min_resolution": 1.5,                  # Angstrom resolution
    "max_r_factor": 0.05,                   # R-factor quality
    "exclude_disorder": True,               # Structural quality
    "exclude_polymers": True,
    "exclude_solvates": True
}
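
To make the meaning of a few of these keys concrete, the sketch below checks a plain dictionary record against three of them. The record fields (z_prime, molecular_weight, elements) and the function passes_filters are hypothetical and only illustrate the intent described in the comments above; the pipeline's actual filtering logic is not shown in this document.

```python
from typing import Any, Dict


def passes_filters(entry: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """Check a hypothetical entry record against a subset of the filter keys."""
    # Z' must be one of the requested values, if the filter is present.
    allowed_z = filters.get("target_z_prime_values")
    if allowed_z is not None and entry["z_prime"] not in allowed_z:
        return False
    # Molecular weight must not exceed the upper limit (in Daltons).
    limit = filters.get("molecule_weight_limit")
    if limit is not None and entry["molecular_weight"] > limit:
        return False
    # Every element in the structure must belong to the allowed set.
    species = filters.get("target_species")
    if species is not None and not set(entry["elements"]) <= set(species):
        return False
    return True


example_filters = {
    "target_z_prime_values": [1],
    "molecule_weight_limit": 500.0,
    "target_species": ["C", "H", "N", "O"],
}
accepted = {"z_prime": 1, "molecular_weight": 180.2, "elements": ["C", "H", "O"]}
too_heavy = {"z_prime": 1, "molecular_weight": 612.7, "elements": ["C", "H", "O"]}
has_sulfur = {"z_prime": 1, "molecular_weight": 240.3, "elements": ["C", "H", "S"]}

print(passes_filters(accepted, example_filters))
print(passes_filters(too_heavy, example_filters))
print(passes_filters(has_sulfur, example_filters))
```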

Performance Tuning:

Batch sizes should be optimized for available hardware:

# For systems with 16GB+ GPU memory
extraction_batch_size = 64
post_extraction_batch_size = 32

# For systems with 8GB GPU memory
extraction_batch_size = 32
post_extraction_batch_size = 16
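
The effect of a batch size can be sketched as simple chunking of a list of identifiers. The helper iter_batches and the placeholder identifiers are hypothetical; the pipeline's own batching code is not shown here and may differ.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 70 placeholder identifiers split with extraction_batch_size = 32.
identifiers = [f"ENTRY{i:02d}" for i in range(70)]
batch_lengths = [len(batch) for batch in iter_batches(identifiers, 32)]
print(batch_lengths)  # → [32, 32, 6]
```

A smaller batch size lowers the number of items held in memory at once, at the cost of more batches.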

classmethod from_json(json_path)[source]

Load the configuration from the “extraction” section of a JSON file.

JSON Structure Expected:

{
  "extraction": {
    "data_directory": "./analysis_output",
    "data_prefix": "my_analysis",
    "actions": {
      "get_refcode_families": true,
      "cluster_refcode_families": true,
      "get_unique_structures": true,
      "get_structure_data": true,
      "post_extraction_process": true
    },
    "filters": {
      "target_z_prime_values": [1],
      "crystal_type": ["homomolecular"],
      "molecule_weight_limit": 500.0,
      "target_species": ["C", "H", "N", "O"]
    },
    "extraction_batch_size": 32,
    "post_extraction_batch_size": 16
  }
}

Validation Performed:

  • File existence and readability

  • Valid JSON syntax

  • Presence of “extraction” section

  • Parameter type validation

  • Path conversion for data_directory

Returns:

ExtractionConfig instance with validated parameters
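
The validation steps listed above can be approximated with the standard library alone. The function below is a minimal sketch, not the implementation of from_json: read_extraction_section is a hypothetical name, the error messages are modelled on those quoted in this document, and per-field type validation is left out for brevity.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Union


def read_extraction_section(json_path: Union[str, Path]) -> Dict[str, Any]:
    """Hypothetical helper mirroring the documented validation steps."""
    path = Path(json_path)
    # File existence
    if not path.is_file():
        raise FileNotFoundError(f"Config file not found: {path}")
    # Valid JSON syntax (json.load raises json.JSONDecodeError otherwise)
    with path.open("r", encoding="utf-8") as handle:
        raw = json.load(handle)
    # Presence of the "extraction" section
    if "extraction" not in raw:
        raise KeyError(f"'extraction' section missing in {path.name}")
    section = dict(raw["extraction"])
    # Path conversion for data_directory
    section["data_directory"] = Path(section["data_directory"])
    return section


# Usage: write a small config to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_file = Path(tmp_dir) / "my_analysis.json"
    payload = {"extraction": {"data_directory": "./analysis_output",
                              "data_prefix": "my_analysis"}}
    config_file.write_text(json.dumps(payload), encoding="utf-8")
    section = read_extraction_section(config_file)

print(section["data_directory"], section["data_prefix"])
```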


Configuration Loading Functions

csa_config.load_config(config_path)[source]

Primary entry point for loading CSA configurations.

Usage Pattern:

from csa_config import load_config
from crystal_analyzer import CrystalAnalyzer

# Load configuration
config = load_config('my_analysis.json')

# Initialize analyzer with config
analyzer = CrystalAnalyzer(extraction_config=config)

# Execute pipeline
analyzer.extract_data()

Configuration Templates:

CSA provides template configurations for common use cases:

  • templates/pharmaceutical.json - Drug crystal analysis

  • templates/materials.json - Materials science applications

  • templates/organic.json - General organic crystal analysis

  • templates/high_throughput.json - Large-scale screening

Parameters:
  • config_path (Union[str, Path]) - Path to JSON configuration file

Returns:

ExtractionConfig instance ready for pipeline execution

Configuration Validation

Pre-Flight Validation

The configuration system performs comprehensive validation at load time:

import json

from csa_config import load_config

try:
    config = load_config('analysis.json')
    print("✓ Configuration valid")
except FileNotFoundError:
    print("✗ Configuration file not found")
except KeyError as e:
    print(f"✗ Missing configuration section: {e}")
except json.JSONDecodeError as e:
    print(f"✗ Invalid JSON syntax: {e}")

Field Validation

Each configuration parameter is validated for:

  • Type correctness - String, number, boolean, array types

  • Required presence - Essential fields must be specified

  • Value ranges - Numeric parameters within valid bounds

  • Path validity - Directory paths must be accessible
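
A check along these lines could look like the sketch below, which covers required presence, type correctness, and a lower bound on the batch sizes. It is an assumption-laden illustration: validate_section and REQUIRED_TYPES are hypothetical names, the minimum of 1 for batch sizes is assumed, and path accessibility is not checked.

```python
from pathlib import Path
from typing import Any, Dict

# Expected type(s) per field, taken from the documented parameter list.
REQUIRED_TYPES = {
    "data_directory": (str, Path),
    "data_prefix": str,
    "actions": dict,
    "filters": dict,
    "extraction_batch_size": int,
    "post_extraction_batch_size": int,
}


def validate_section(section: Dict[str, Any]) -> None:
    """Raise KeyError, TypeError, or ValueError on the first problem found."""
    for name, expected in REQUIRED_TYPES.items():
        if name not in section:
            raise KeyError(f"Missing required field: {name}")
        value = section[name]
        types = expected if isinstance(expected, tuple) else (expected,)
        label = " or ".join(t.__name__ for t in types)
        # bool is a subclass of int, so reject it explicitly for integer fields.
        wrong_bool = expected is int and isinstance(value, bool)
        if wrong_bool or not isinstance(value, expected):
            raise TypeError(
                f"Expected {label} for {name}, got {type(value).__name__}"
            )
    for name in ("extraction_batch_size", "post_extraction_batch_size"):
        # Assumption: a batch must contain at least one item.
        if section[name] < 1:
            raise ValueError(f"{name} must be at least 1")


quoted_number = {
    "data_directory": "./analysis_output",
    "data_prefix": "my_analysis",
    "actions": {},
    "filters": {},
    "extraction_batch_size": "32",  # quoted in JSON, so parsed as a string
    "post_extraction_batch_size": 16,
}
try:
    validate_section(quoted_number)
    message = ""
except TypeError as err:
    message = str(err)
print(message)
```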

Common Configuration Errors

Missing Required Fields:

KeyError: 'extraction' section missing in config.json

Solution: Ensure the JSON file contains a top-level “extraction” object.

Invalid Path Specifications:

FileNotFoundError: Config file not found: /invalid/path/config.json

Solution: Verify file paths and permissions

Type Mismatches:

TypeError: Expected int for extraction_batch_size, got str

Solution: Check that numeric fields are not quoted in the JSON file.
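
The difference between a quoted and an unquoted number is visible with the standard json module alone:

```python
import json

quoted = json.loads('{"extraction_batch_size": "32"}')
unquoted = json.loads('{"extraction_batch_size": 32}')

print(type(quoted["extraction_batch_size"]).__name__)    # → str
print(type(unquoted["extraction_batch_size"]).__name__)  # → int
```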

Examples

Basic Pharmaceutical Analysis:

from csa_config import load_config
from crystal_analyzer import CrystalAnalyzer

# Load pharmaceutical-focused configuration
config = load_config('templates/pharmaceutical.json')

# Customize for specific drug class
config.filters.update({
    "target_species": ["C", "H", "N", "O", "S", "Cl"],
    "molecule_weight_limit": 800.0,
    "target_z_prime_values": [1, 2]
})

# Run analysis
analyzer = CrystalAnalyzer(extraction_config=config)
analyzer.extract_data()

High-Throughput Materials Screening:

from csa_config import load_config
from crystal_analyzer import CrystalAnalyzer

# Load high-throughput template
config = load_config('templates/high_throughput.json')

# Optimize for speed over completeness
config.actions.update({
    "cluster_refcode_families": False,  # Skip clustering for speed
    "post_extraction_process": False    # Skip advanced features
})

# Increase the extraction batch size for throughput
config.extraction_batch_size = 128

# Execute streamlined pipeline
analyzer = CrystalAnalyzer(extraction_config=config)
analyzer.extract_data()

Custom Local CIF Analysis:

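from csa_config import load_config
from crystal_analyzer import CrystalAnalyzer

# Start from the organic template, then configure it for local CIF files
config = load_config('templates/organic.json')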

# Disable CSD querying stages
config.actions.update({
    "get_refcode_families": False,
    "cluster_refcode_families": False,
    "get_unique_structures": False
})

# Point to local CIF directory
config.filters["structure_list"] = ["cif", "/path/to/cif/files"]

# Process local structures
analyzer = CrystalAnalyzer(extraction_config=config)
analyzer.extract_data()

See Also

  • crystal_analyzer module: Main pipeline orchestration

  • csa_main module: Command-line interface

  • ../getting_started/configuration: Configuration guide