csa_config module
Module: csa_config.py
Configuration objects and loader for the Crystal Structure Analysis pipeline.
This module defines:
- ExtractionConfig: dataclass controlling extraction parameters.
- load_config: utility to construct ExtractionConfig from a JSON file.
- class csa_config.ExtractionConfig(data_directory, data_prefix, actions, filters, extraction_batch_size, post_extraction_batch_size)
Bases: object
Configuration settings for the data-extraction pipeline.
- Parameters:
data_directory (Path) – Directory under which all raw and intermediate extraction outputs will be stored. Subdirectories (e.g. “structures/”, “csv/”) are created automatically.
data_prefix (str) – Prefix used when naming output files, for example "{data_prefix}_refcode_families.csv".
actions (Dict[str, bool]) – Flags to enable or skip individual extraction substeps:
  - get_refcode_families
  - cluster_refcode_families
  - get_unique_structures
  - get_structure_data
  - post_extraction_process
filters (Dict[str, Any]) – Criteria for filtering CSD entries, for example:
  - elements (List[str]): only structures containing these elements
  - min_resolution (float): only structures with resolution ≤ this value
  - space_groups (List[str]): only structures in these space groups
extraction_batch_size (int) – Number of structures or refcode families to process per batch during extraction.
post_extraction_batch_size (int) – Number of structures to process per batch during post-extraction.
- from_json(cls, json_path)
Load and validate fields from the “extraction” section of a JSON file.
- classmethod from_json(json_path)
Load an ExtractionConfig from a JSON file.
- Parameters:
json_path (Union[str, Path]) – Path to the JSON configuration file.
- Returns:
Instance populated from the “extraction” section.
- Return type:
ExtractionConfig
- Raises:
FileNotFoundError – If the file does not exist.
KeyError – If the “extraction” section is missing.
json.JSONDecodeError – If the file contains invalid JSON.
- __init__(data_directory, data_prefix, actions, filters, extraction_batch_size, post_extraction_batch_size)
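The constructor takes the six documented fields directly, so a configuration can also be built in code rather than loaded from JSON. The sketch below uses a stand-in dataclass that mirrors the documented signature; it is an assumption for illustration, not the real csa_config class, and the field values are placeholders.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict


@dataclass
class ExtractionConfig:
    """Hypothetical stand-in mirroring the documented fields of csa_config.ExtractionConfig."""
    data_directory: Path
    data_prefix: str
    actions: Dict[str, bool]
    filters: Dict[str, Any]
    extraction_batch_size: int
    post_extraction_batch_size: int


# Placeholder values; only the field names and types come from the documentation.
config = ExtractionConfig(
    data_directory=Path("./analysis_output"),
    data_prefix="my_analysis",
    actions={"get_refcode_families": True, "post_extraction_process": False},
    filters={"target_species": ["C", "H", "N", "O"]},
    extraction_batch_size=32,
    post_extraction_batch_size=16,
)

# Output file names are built from the prefix, following the documented pattern.
families_name = f"{config.data_prefix}_refcode_families.csv"
print(families_name)
```

With the real module installed, the same keyword arguments would presumably be passed to csa_config.ExtractionConfig instead of the stand-in.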
- csa_config.load_config(config_path)
Read a JSON configuration file and return an ExtractionConfig instance.
- Parameters:
config_path (Union[str, Path]) – Path to the JSON configuration file.
- Returns:
Dataclass instance loaded from the “extraction” section.
- Return type:
ExtractionConfig
- Raises:
FileNotFoundError – If the file does not exist.
KeyError – If the “extraction” section is missing.
json.JSONDecodeError – If the file contains invalid JSON.
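The loading behaviour described above (read the file, pick out the “extraction” section, raise one of three exceptions on failure) can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's actual source: the dataclass is a stand-in, and the delegation from load_config to from_json is a guess based on the matching signatures.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, Union


@dataclass
class ExtractionConfig:
    """Hypothetical stand-in for csa_config.ExtractionConfig."""
    data_directory: Path
    data_prefix: str
    actions: Dict[str, bool]
    filters: Dict[str, Any]
    extraction_batch_size: int
    post_extraction_batch_size: int

    @classmethod
    def from_json(cls, json_path: Union[str, Path]) -> "ExtractionConfig":
        path = Path(json_path)
        if not path.is_file():
            raise FileNotFoundError(f"Config file not found: {path}")
        with path.open(encoding="utf-8") as handle:
            raw = json.load(handle)  # raises json.JSONDecodeError on invalid syntax
        if "extraction" not in raw:
            raise KeyError(f"'extraction' section missing in {path.name}")
        section = raw["extraction"]
        return cls(
            data_directory=Path(section["data_directory"]),  # str -> Path
            data_prefix=section["data_prefix"],
            actions=section["actions"],
            filters=section["filters"],
            extraction_batch_size=section["extraction_batch_size"],
            post_extraction_batch_size=section["post_extraction_batch_size"],
        )


def load_config(config_path: Union[str, Path]) -> ExtractionConfig:
    """Assumed to be a thin wrapper around ExtractionConfig.from_json."""
    return ExtractionConfig.from_json(config_path)


# Round trip through a temporary file to show the expected JSON layout.
with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "my_analysis.json"
    payload = {
        "extraction": {
            "data_directory": "./analysis_output",
            "data_prefix": "my_analysis",
            "actions": {"get_refcode_families": True},
            "filters": {"target_z_prime_values": [1]},
            "extraction_batch_size": 32,
            "post_extraction_batch_size": 16,
        }
    }
    config_file.write_text(json.dumps(payload), encoding="utf-8")
    config = load_config(config_file)

print(config.data_prefix, config.extraction_batch_size)
```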
Configuration Management for CSA Pipeline
The csa_config module manages configuration for the Crystal Structure Analysis pipeline through the ExtractionConfig dataclass and the load_config loading function.
ExtractionConfig Class
- class csa_config.ExtractionConfig(data_directory, data_prefix, actions, filters, extraction_batch_size, post_extraction_batch_size)
Configuration dataclass controlling all aspects of the CSA extraction pipeline.
Core Configuration Parameters:
data_directory (Path) - Base directory for all extraction outputs
data_prefix (str) - Filename prefix for generated files
actions (Dict[str, bool]) - Pipeline stage enable/disable flags
filters (Dict[str, Any]) - Structure filtering and validation criteria
extraction_batch_size (int) - Batch size for raw data extraction
post_extraction_batch_size (int) - Batch size for feature computation
Pipeline Actions Control:
The actions dictionary controls which pipeline stages execute:
actions = {
    "get_refcode_families": True,       # Stage 1: Family extraction
    "cluster_refcode_families": True,   # Stage 2: Similarity clustering
    "get_unique_structures": True,      # Stage 3: Representative selection
    "get_structure_data": True,         # Stage 4: Raw data extraction
    "post_extraction_process": True     # Stage 5: Feature engineering
}
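As a quick check before a run, the enabled stages can be listed from the dictionary itself. The sketch below assumes only that a stage whose flag is False is skipped, as the parameter description states; the variable names are illustrative.

```python
actions = {
    "get_refcode_families": True,
    "cluster_refcode_families": False,  # disabled for this illustration
    "get_unique_structures": True,
    "get_structure_data": True,
    "post_extraction_process": False,   # disabled for this illustration
}

# Dicts preserve insertion order (Python 3.7+), so the stage order is kept.
enabled_stages = [name for name, enabled in actions.items() if enabled]
skipped_stages = [name for name, enabled in actions.items() if not enabled]

print("Will run:", enabled_stages)
print("Will skip:", skipped_stages)
```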
Filter Criteria Examples:
Quality filters ensure reliable structural data:
filters = {
    "target_z_prime_values": [1],             # Z' constraint
    "crystal_type": ["homomolecular"],        # Single molecule type
    "molecule_weight_limit": 500.0,           # Dalton upper limit
    "target_species": ["C", "H", "N", "O"],   # Allowed elements
    "min_resolution": 1.5,                    # Angstrom resolution
    "max_r_factor": 0.05,                     # R-factor quality
    "exclude_disorder": True,                 # Structural quality
    "exclude_polymers": True,
    "exclude_solvates": True
}
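To make the meaning of a few of these criteria concrete, the sketch below applies three of them to a structure record. It is an assumption-laden illustration: the record layout (z_prime, molecular_weight, elements) and the passes_filters helper are hypothetical, and the real pipeline may interpret the criteria differently, for example in whether the weight limit is inclusive.

```python
from typing import Any, Dict


def passes_filters(record: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """Hypothetical check of one structure record against three filter keys."""
    # Z' must be one of the requested values.
    if record["z_prime"] not in filters["target_z_prime_values"]:
        return False
    # Molecular weight is treated here as an inclusive upper limit (assumption).
    if record["molecular_weight"] > filters["molecule_weight_limit"]:
        return False
    # Every element in the structure must be in the allowed set.
    if not set(record["elements"]) <= set(filters["target_species"]):
        return False
    return True


filters = {
    "target_z_prime_values": [1],
    "molecule_weight_limit": 500.0,
    "target_species": ["C", "H", "N", "O"],
}

# Made-up records for illustration only.
accepted = {"z_prime": 1, "molecular_weight": 180.16, "elements": ["C", "H", "O"]}
rejected = {"z_prime": 1, "molecular_weight": 180.16, "elements": ["C", "H", "Cl"]}

print(passes_filters(accepted, filters))
print(passes_filters(rejected, filters))
```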
Performance Tuning:
Batch sizes should be optimized for available hardware:
# For systems with 16GB+ GPU memory
extraction_batch_size = 64
post_extraction_batch_size = 32
# For systems with 8GB GPU memory
extraction_batch_size = 32
post_extraction_batch_size = 16
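The effect of a batch size is easiest to see on a plain list. The helper below is a generic chunking sketch, not the pipeline's own batching code; iter_batches and the placeholder identifiers are hypothetical.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Placeholder identifiers standing in for refcodes.
refcodes = [f"REFCODE{i:02d}" for i in range(70)]
batches = list(iter_batches(refcodes, batch_size=32))
batch_lengths = [len(batch) for batch in batches]
print(batch_lengths)  # → [32, 32, 6]
```

A larger batch size means fewer, larger batches, so peak memory per batch grows while the number of passes shrinks.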
- classmethod from_json(json_path)
Load the configuration from the “extraction” section of a JSON file.
JSON Structure Expected:
{
    "extraction": {
        "data_directory": "./analysis_output",
        "data_prefix": "my_analysis",
        "actions": {
            "get_refcode_families": true,
            "cluster_refcode_families": true,
            "get_unique_structures": true,
            "get_structure_data": true,
            "post_extraction_process": true
        },
        "filters": {
            "target_z_prime_values": [1],
            "crystal_type": ["homomolecular"],
            "molecule_weight_limit": 500.0,
            "target_species": ["C", "H", "N", "O"]
        },
        "extraction_batch_size": 32,
        "post_extraction_batch_size": 16
    }
}
Validation Performed:
File existence and readability
Valid JSON syntax
Presence of “extraction” section
Parameter type validation
Path conversion for data_directory
- Returns:
ExtractionConfig instance with validated parameters
- Raises:
FileNotFoundError - Configuration file not found
KeyError - Missing “extraction” section
json.JSONDecodeError - Invalid JSON syntax
Configuration Loading Functions
- csa_config.load_config(config_path)
Primary entry point for loading CSA configurations.
Usage Pattern:
from csa_config import load_config
from crystal_analyzer import CrystalAnalyzer
# Load configuration
config = load_config('my_analysis.json')
# Initialize analyzer with config
analyzer = CrystalAnalyzer(extraction_config=config)
# Execute pipeline
analyzer.extract_data()
Configuration Templates:
CSA provides template configurations for common use cases:
templates/pharmaceutical.json - Drug crystal analysis
templates/materials.json - Materials science applications
templates/organic.json - General organic crystal analysis
templates/high_throughput.json - Large-scale screening
- Parameters:
config_path (Union[str, Path]) - Path to JSON configuration file
- Returns:
ExtractionConfig instance ready for pipeline execution
- Raises:
FileNotFoundError - Configuration file not found
KeyError - Missing “extraction” section
json.JSONDecodeError - Invalid JSON syntax
Configuration Validation
Pre-Flight Validation
The configuration is validated when it is loaded, and each failure mode raises its own exception:
import json
from csa_config import load_config

try:
config = load_config('analysis.json')
print("✓ Configuration valid")
except FileNotFoundError:
print("✗ Configuration file not found")
except KeyError as e:
print(f"✗ Missing configuration section: {e}")
except json.JSONDecodeError as e:
print(f"✗ Invalid JSON syntax: {e}")
Field Validation
Each configuration parameter is validated for:
Type correctness - String, number, boolean, array types
Required presence - Essential fields must be specified
Value ranges - Numeric parameters within valid bounds
Path validity - Directory paths must be accessible
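The type, presence, and range checks listed above could look roughly like the sketch below. This is an illustration under assumptions, not the module's validation code: validate_extraction_section and EXPECTED_TYPES are hypothetical names, the positive-integer range rule is a guess, and only the TypeError message format is taken from the error example in this page.

```python
from typing import Any, Dict

# Expected JSON-level types for each documented field (assumption).
EXPECTED_TYPES = {
    "data_directory": str,
    "data_prefix": str,
    "actions": dict,
    "filters": dict,
    "extraction_batch_size": int,
    "post_extraction_batch_size": int,
}


def validate_extraction_section(section: Dict[str, Any]) -> None:
    """Raise KeyError, TypeError, or ValueError for the first invalid field."""
    for name, expected in EXPECTED_TYPES.items():
        # Required presence
        if name not in section:
            raise KeyError(f"Missing required field: {name}")
        value = section[name]
        # Type correctness; bool is a subclass of int, so reject it for int fields.
        wrong_type = not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        )
        if wrong_type:
            raise TypeError(
                f"Expected {expected.__name__} for {name}, got {type(value).__name__}"
            )
        # Value range (assumed rule: batch sizes must be positive)
        if expected is int and value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value}")


section = {
    "data_directory": "./analysis_output",
    "data_prefix": "my_analysis",
    "actions": {"get_refcode_families": True},
    "filters": {},
    "extraction_batch_size": "32",  # quoted number: wrong type on purpose
    "post_extraction_batch_size": 16,
}

message = ""
try:
    validate_extraction_section(section)
except TypeError as exc:
    message = str(exc)
print(message)
```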
Common Configuration Errors
Missing Required Fields:
KeyError: 'extraction' section missing in config.json
Solution: Ensure the JSON file contains a top-level “extraction” object.
Invalid Path Specifications:
FileNotFoundError: Config file not found: /invalid/path/config.json
Solution: Verify the file path and its read permissions.
Type Mismatches:
TypeError: Expected int for extraction_batch_size, got str
Solution: Check that numeric fields are not quoted in the JSON file.
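The type mismatch comes from how JSON is parsed: a quoted number is a string, an unquoted one is a number. The standard-library json module shows the difference directly.

```python
import json

quoted = json.loads('{"extraction_batch_size": "32"}')
unquoted = json.loads('{"extraction_batch_size": 32}')

quoted_type = type(quoted["extraction_batch_size"]).__name__
unquoted_type = type(unquoted["extraction_batch_size"]).__name__

print(quoted_type)    # → str
print(unquoted_type)  # → int
```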
Examples
Basic Pharmaceutical Analysis:
from csa_config import load_config
from crystal_analyzer import CrystalAnalyzer
# Load pharmaceutical-focused configuration
config = load_config('templates/pharmaceutical.json')
# Customize for specific drug class
config.filters.update({
"target_species": ["C", "H", "N", "O", "S", "Cl"],
"molecule_weight_limit": 800.0,
"target_z_prime_values": [1, 2]
})
# Run analysis
analyzer = CrystalAnalyzer(extraction_config=config)
analyzer.extract_data()
High-Throughput Materials Screening:
from csa_config import load_config
from crystal_analyzer import CrystalAnalyzer
# Load high-throughput template
config = load_config('templates/high_throughput.json')
# Optimize for speed over completeness
config.actions.update({
"cluster_refcode_families": False, # Skip clustering for speed
"post_extraction_process": False # Skip advanced features
})
# Increase the extraction batch size for throughput
config.extraction_batch_size = 128
# Execute streamlined pipeline
analyzer = CrystalAnalyzer(extraction_config=config)
analyzer.extract_data()
Custom Local CIF Analysis:
from csa_config import load_config
from crystal_analyzer import CrystalAnalyzer
# Configure for local CIF files
config = load_config('templates/organic.json')
# Disable CSD querying stages
config.actions.update({
"get_refcode_families": False,
"cluster_refcode_families": False,
"get_unique_structures": False
})
# Point to local CIF directory
config.filters["structure_list"] = ["cif", "/path/to/cif/files"]
# Process local structures
analyzer = CrystalAnalyzer(extraction_config=config)
analyzer.extract_data()
See Also
crystal_analyzer module : Main pipeline orchestration
csa_main module : Command-line interface
../getting_started/configuration : Configuration guide