Configuration
This guide covers advanced configuration strategies for CSA workflows. While the Configuration guide covers basic setup, this section focuses on optimizing configurations for specific research scenarios and performance requirements.
Note
This guide assumes familiarity with basic CSA configuration. Review the Configuration guide first if you’re new to CSA.
Research-Driven Configuration
Configuring for Specific Research Questions
Polymorphism Studies
Analyzing different crystal forms of the same molecule:
{
  "extraction": {
    "data_directory": "./polymorphism_study",
    "data_prefix": "polymorphs",
    "actions": {
      "get_refcode_families": true,
      "cluster_refcode_families": false,
      "get_unique_structures": false,
      "get_structure_data": true,
      "post_extraction_process": true
    },
    "filters": {
      "target_z_prime_values": [1, 2, 3, 4],
      "crystal_type": ["homomolecular"],
      "molecule_formal_charges": [0],
      "target_species": ["C", "H", "N", "O"],
      "structure_list": ["csd-unique"]
    }
  }
}
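The `actions` block works as a set of on/off switches for pipeline stages. As a quick check before a run, the following minimal sketch lists which stages a configuration would enable. It uses only the standard library; the helper name `enabled_actions` is hypothetical and not part of CSA.

```python
import json

# Inline copy of the "actions" block from the polymorphism example.
CONFIG_TEXT = """
{
  "extraction": {
    "actions": {
      "get_refcode_families": true,
      "cluster_refcode_families": false,
      "get_unique_structures": false,
      "get_structure_data": true,
      "post_extraction_process": true
    }
  }
}
"""


def enabled_actions(config):
    """Return the names of actions set to true, in file order.

    Hypothetical helper for illustration; not part of the CSA API.
    """
    actions = config.get("extraction", {}).get("actions", {})
    return [name for name, enabled in actions.items() if enabled is True]


config = json.loads(CONFIG_TEXT)
for name in enabled_actions(config):
    print("enabled:", name)
```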
Hydrogen Bonding Analysis
Focusing on structures with strong H-bond donors/acceptors:
{
  "extraction": {
    "data_directory": "./hbond_analysis",
    "data_prefix": "hydrogen_bonds",
    "filters": {
      "target_z_prime_values": [1],
      "crystal_type": ["homomolecular"],
      "molecule_formal_charges": [0],
      "molecule_weight_limit": 400.0,
      "target_species": ["C", "H", "N", "O"],
      "has_hbond_donors": true,
      "has_hbond_acceptors": true
    },
    "extraction_batch_size": 48,
    "post_extraction_batch_size": 24
  }
}
Conformational Flexibility
Studying molecules with rotatable bonds:
{
  "extraction": {
    "data_directory": "./flexibility_study",
    "data_prefix": "conformers",
    "filters": {
      "target_z_prime_values": [1],
      "crystal_type": ["homomolecular"],
      "molecule_formal_charges": [0],
      "min_rotatable_bonds": 2,
      "max_rotatable_bonds": 8,
      "molecule_weight_limit": 600.0
    }
  }
}
Metal-Organic Systems
Including coordination compounds:
{
  "extraction": {
    "data_directory": "./metal_organic",
    "data_prefix": "coordination",
    "filters": {
      "target_z_prime_values": [1, 2],
      "crystal_type": ["homomolecular", "organometallic"],
      "molecule_formal_charges": [0, 1, -1, 2, -2],
      "target_species": ["C", "H", "N", "O", "Fe", "Cu", "Zn", "Ni"],
      "exclude_disorder": true
    }
  }
}
Performance-Driven Configuration
Hardware-Optimized Settings
High-Memory Systems (64GB+ RAM)
Maximize throughput with large batches:
{
  "extraction": {
    "extraction_batch_size": 256,
    "post_extraction_batch_size": 128,
    "parallel_workers": 16,
    "cache_intermediate_results": true
  }
}
GPU-Accelerated Workstations
Balance GPU memory and compute:
{
  "extraction": {
    "extraction_batch_size": 128,
    "post_extraction_batch_size": 64,
    "use_gpu_acceleration": true,
    "gpu_memory_fraction": 0.8
  }
}
Cluster/HPC Environments
Optimize for distributed processing:
{
  "extraction": {
    "extraction_batch_size": 64,
    "post_extraction_batch_size": 32,
    "parallel_workers": 8,
    "checkpoint_frequency": 1000,
    "restart_on_failure": true
  }
}
Limited Resource Systems
Conservative settings for laptops/workstations:
{
  "extraction": {
    "extraction_batch_size": 16,
    "post_extraction_batch_size": 8,
    "parallel_workers": 4,
    "memory_optimization": "aggressive"
  }
}
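The hardware profiles above can be folded into a small selection helper that picks a starting point from the machine's RAM and core count. This is a sketch under stated assumptions: `suggest_batch_settings` is a hypothetical name, the batch and worker values are copied from the examples in this section, and the 32 GB boundary for the middle tier is an assumption rather than a CSA recommendation.

```python
import os


def suggest_batch_settings(total_ram_gb, cpu_count=None):
    """Suggest starting batch settings for a machine (hypothetical helper).

    Batch sizes and worker counts are taken from the hardware examples in
    this guide; the 32 GB boundary for the middle tier is an assumption.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1

    if total_ram_gb >= 64:
        batch, post, workers = 256, 128, 16   # high-memory example
    elif total_ram_gb >= 32:
        batch, post, workers = 64, 32, 8      # cluster/HPC example
    else:
        batch, post, workers = 16, 8, 4       # limited-resource example

    return {
        "extraction": {
            "extraction_batch_size": batch,
            "post_extraction_batch_size": post,
            # Never request more workers than there are CPU cores.
            "parallel_workers": min(workers, cpu_count),
        }
    }


print(suggest_batch_settings(total_ram_gb=16, cpu_count=8))
```

Treat the result as a first guess to profile against, not as a tuned setting.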
Workflow-Specific Configurations
Iterative Development
Rapid Prototyping
Quick testing with small datasets:
{
  "extraction": {
    "data_directory": "./prototype",
    "data_prefix": "test",
    "filters": {
      "target_z_prime_values": [1],
      "crystal_type": ["homomolecular"],
      "molecule_weight_limit": 200.0,
      "max_structures": 100
    },
    "extraction_batch_size": 32,
    "post_extraction_batch_size": 16
  }
}
Development Validation
Testing pipeline changes:
{
  "extraction": {
    "data_directory": "./validation",
    "data_prefix": "dev_test",
    "actions": {
      "get_refcode_families": false,
      "cluster_refcode_families": false,
      "get_unique_structures": false,
      "get_structure_data": true,
      "post_extraction_process": true
    },
    "filters": {
      "structure_list": ["cif", "/path/to/test/structures"]
    },
    "debug_mode": true,
    "verbose_logging": true
  }
}
Production Workflows
Large-Scale Screening
High-throughput analysis of thousands of structures:
{
  "extraction": {
    "data_directory": "./large_scale_screening",
    "data_prefix": "hts_run_001",
    "filters": {
      "target_z_prime_values": [1],
      "crystal_type": ["homomolecular"],
      "molecule_weight_limit": 1000.0
    },
    "extraction_batch_size": 128,
    "post_extraction_batch_size": 64,
    "checkpoint_frequency": 500,
    "compression": "lz4",
    "backup_results": true
  }
}
Reproducible Research
Ensuring consistent results across runs:
{
  "extraction": {
    "data_directory": "./reproducible_analysis",
    "data_prefix": "paper_dataset_v1",
    "random_seed": 42,
    "deterministic_clustering": true,
    "version_metadata": {
      "csa_version": "2.0.0",
      "csd_version": "2024.1",
      "analysis_date": "2025-01-15",
      "description": "Dataset for publication XYZ"
    }
  }
}
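One way to tie results to the exact configuration that produced them is to store a fingerprint of the configuration next to the output. The sketch below hashes a canonical JSON form of the configuration with the standard library; `config_fingerprint` is a hypothetical helper, not a CSA function.

```python
import hashlib
import json


def config_fingerprint(config):
    """Return a SHA-256 hex digest of the config in canonical JSON form."""
    # Sorting keys and fixing separators makes the text independent of
    # key order and whitespace in the original file.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = {"extraction": {"random_seed": 42, "data_prefix": "paper_dataset_v1"}}
b = {"extraction": {"data_prefix": "paper_dataset_v1", "random_seed": 42}}
c = {"extraction": {"random_seed": 43, "data_prefix": "paper_dataset_v1"}}

print(config_fingerprint(a) == config_fingerprint(b))  # same settings, different key order
print(config_fingerprint(a) == config_fingerprint(c))  # different seed
```

Recording the digest in a results manifest makes it easy to detect when two runs that should be identical were launched with different settings.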
Advanced Filter Strategies
Chemical Space Sampling
Diverse Chemical Sampling
Ensuring broad coverage of chemical space:
{
  "extraction": {
    "filters": {
      "target_z_prime_values": [1],
      "crystal_type": ["homomolecular"],
      "molecule_formal_charges": [0],
      "sampling_strategy": "diverse",
      "molecular_descriptors": {
        "weight_range": [100, 800],
        "logp_range": [-2, 6],
        "hbd_range": [0, 5],
        "hba_range": [0, 10],
        "rotatable_bonds_range": [0, 15]
      }
    }
  }
}
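To make the descriptor ranges concrete, the sketch below shows one plausible reading of them: each `*_range` entry is a `[min, max]` pair and a molecule passes only if every descriptor falls inside its pair. The inclusive bounds, the helper name `within_ranges`, and the mapping from `weight_range` to a `weight` descriptor are all assumptions for illustration; CSA's own filtering may differ.

```python
# Ranges copied from the "molecular_descriptors" block above.
RANGES = {
    "weight_range": [100, 800],
    "logp_range": [-2, 6],
    "hbd_range": [0, 5],
    "hba_range": [0, 10],
    "rotatable_bonds_range": [0, 15],
}


def within_ranges(descriptors, ranges):
    """Check descriptor values against [min, max] bounds (assumed inclusive).

    `ranges` uses keys like "weight_range"; the matching descriptor key is
    assumed to be the same name without the "_range" suffix.
    """
    for range_key, (low, high) in ranges.items():
        name = range_key.removesuffix("_range")
        value = descriptors.get(name)
        # A missing descriptor is treated as a failed check.
        if value is None or not (low <= value <= high):
            return False
    return True


molecule = {"weight": 180.2, "logp": 1.2, "hbd": 1, "hba": 4, "rotatable_bonds": 3}
print(within_ranges(molecule, RANGES))
```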
Focused Chemical Series
Analyzing structurally related compounds:
{
  "extraction": {
    "filters": {
      "target_z_prime_values": [1],
      "crystal_type": ["homomolecular"],
      "structural_patterns": [
        "benzene_ring",
        "carboxylic_acid",
        "amide_group"
      ],
      "similarity_threshold": 0.7,
      "scaffold_filtering": true
    }
  }
}
Quality Control Filters
High-Quality Structures Only
Strict quality criteria for reliable analysis:
{
  "extraction": {
    "filters": {
      "min_resolution": 1.5,
      "max_r_factor": 0.05,
      "exclude_disorder": true,
      "exclude_polymers": true,
      "exclude_solvates": true,
      "min_temperature": 100,
      "max_temperature": 300,
      "quality_flags": ["high_precision", "complete_structure"]
    }
  }
}
Experimental Condition Filters
Controlling for experimental variables:
{
  "extraction": {
    "filters": {
      "temperature_range": [90, 120],
      "pressure_range": ["ambient"],
      "radiation_type": ["Mo_Ka", "Cu_Ka"],
      "exclude_neutron": false,
      "min_completeness": 0.95,
      "min_observed_reflections": 1000
    }
  }
}
Configuration Templates
Template Library
CSA includes pre-defined configuration templates for common use cases:
Pharmaceutical Template
# Copy pharmaceutical template
cp templates/pharmaceutical.json my_pharma_config.json
# Edit specific parameters
nano my_pharma_config.json
Materials Science Template
# Copy materials template
cp templates/materials.json my_materials_config.json
Organic Chemistry Template
# Copy organic template
cp templates/organic.json my_organic_config.json
Custom Template Creation
Creating Project Templates
# create_template.py
import json
from pathlib import Path


def create_project_template(project_name, base_template="organic"):
    """Create a custom configuration template."""
    template_dir = Path("templates")
    base_config = json.loads((template_dir / f"{base_template}.json").read_text())

    # Customize for project
    base_config["extraction"]["data_directory"] = f"./{project_name}"
    base_config["extraction"]["data_prefix"] = project_name

    # Save custom template
    output_path = template_dir / f"{project_name}.json"
    output_path.write_text(json.dumps(base_config, indent=2))
    return output_path
Configuration Validation
Pre-Flight Checks
Validate Before Running
from csa_config import ExtractionConfig, validate_configuration

# estimate_memory_usage, estimate_runtime and check_common_issues are
# project helpers assumed to be defined elsewhere; they are not shown here.


def validate_config_file(config_path):
    """Comprehensive configuration validation."""
    try:
        # Load and parse
        config = ExtractionConfig.from_json(config_path)

        # Check resource requirements
        estimated_memory = estimate_memory_usage(config)
        estimated_time = estimate_runtime(config)
        print("Configuration valid!")
        print(f"Estimated memory: {estimated_memory:.1f} GB")
        print(f"Estimated runtime: {estimated_time:.1f} hours")

        # Check for common issues
        warnings = check_common_issues(config)
        if warnings:
            print("Warnings:")
            for warning in warnings:
                print(f"  - {warning}")
    except Exception as e:
        print(f"Configuration error: {e}")
        return False
    return True
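The validation function above calls `check_common_issues` without showing it. A minimal sketch of what such a helper might look like follows. It operates on a plain dictionary loaded from the JSON file rather than on an `ExtractionConfig` object, and the specific checks are illustrative assumptions, not the rules CSA itself applies.

```python
def check_common_issues(config):
    """Return a list of warning strings for a plain-dict configuration.

    Hypothetical sketch: the checks below are illustrative assumptions,
    not the rules CSA itself applies.
    """
    warnings = []
    extraction = config.get("extraction", {})

    # Results need somewhere to go.
    if not extraction.get("data_directory"):
        warnings.append("data_directory is not set")

    # Batch sizes, when present, should be positive integers.
    for key in ("extraction_batch_size", "post_extraction_batch_size"):
        size = extraction.get(key)
        if size is not None and (not isinstance(size, int) or size <= 0):
            warnings.append(f"{key} should be a positive integer, got {size!r}")

    # An empty list filter is likely a mistake and may match nothing.
    for name, value in extraction.get("filters", {}).items():
        if isinstance(value, list) and not value:
            warnings.append(f"filter '{name}' is an empty list")

    return warnings


example = {
    "extraction": {
        "extraction_batch_size": 0,
        "filters": {"target_species": []},
    }
}
for message in check_common_issues(example):
    print("-", message)
```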
Resource Estimation
def estimate_resources(config):
    """Estimate computational requirements."""
    # Estimate dataset size (estimate_structure_count is assumed to be
    # defined elsewhere)
    estimated_structures = estimate_structure_count(config.filters)

    # Memory requirements: assuming one batch is resident at a time, peak
    # usage scales with the batch size rather than the dataset size
    memory_per_structure = 2.5  # MB average
    peak_memory = memory_per_structure * config.extraction_batch_size / 1024  # GB

    # Runtime estimation
    structures_per_hour = 1000  # Typical throughput
    estimated_hours = estimated_structures / structures_per_hour

    return {
        'structures': estimated_structures,
        'peak_memory_gb': peak_memory,
        'estimated_hours': estimated_hours
    }
Configuration Management
Version Control
Track Configuration Changes
# Initialize git repository for configs
mkdir csa_configurations
cd csa_configurations
git init
# Add configurations
cp ../my_analysis.json ./
git add my_analysis.json
git commit -m "Initial analysis configuration"
# Track changes
git log --oneline my_analysis.json
Configuration Branches
# Create branches for different experiments
git checkout -b experiment_1
# Modify configuration...
git commit -am "Experiment 1: increased batch size"
git checkout -b experiment_2
# Different modifications...
git commit -am "Experiment 2: additional filters"
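When comparing experiment branches, a text diff of two JSON files can be noisy if key order or whitespace changed. The following sketch compares two loaded configurations setting by setting instead. The helper names `flatten` and `diff_configs` are hypothetical, and a setting missing on one side is reported as `None`.

```python
def flatten(config, prefix=""):
    """Flatten nested dicts into {"a.b.c": value} form."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_configs(old, new):
    """Return {path: (old_value, new_value)} for every setting that differs."""
    old_flat, new_flat = flatten(old), flatten(new)
    changed = {}
    for path in sorted(old_flat.keys() | new_flat.keys()):
        before, after = old_flat.get(path), new_flat.get(path)
        if before != after:
            changed[path] = (before, after)
    return changed


baseline = {"extraction": {"extraction_batch_size": 64,
                           "filters": {"target_z_prime_values": [1]}}}
experiment = {"extraction": {"extraction_batch_size": 128,
                             "filters": {"target_z_prime_values": [1]}}}

for path, (before, after) in diff_configs(baseline, experiment).items():
    print(f"{path}: {before} -> {after}")
```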
Environment-Specific Configs
Development vs Production
# config_manager.py
import os
import json
from pathlib import Path


def load_environment_config(base_config_path, environment="development"):
    """Load configuration with environment-specific overrides."""
    base_config = json.loads(Path(base_config_path).read_text())

    # Look for environment-specific overrides,
    # e.g. analysis.json -> analysis.development.json
    env_config_path = Path(base_config_path).with_suffix(f'.{environment}.json')
    if env_config_path.exists():
        env_overrides = json.loads(env_config_path.read_text())
        base_config = merge_configs(base_config, env_overrides)

    # Apply environment variables (apply_env_vars is assumed to be
    # defined elsewhere in this module)
    base_config = apply_env_vars(base_config)
    return base_config


def merge_configs(base, overrides):
    """Deep merge configuration dictionaries (modifies and returns base)."""
    for key, value in overrides.items():
        # Recurse only when both sides are dicts; otherwise the override wins
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            base[key] = merge_configs(base[key], value)
        else:
            base[key] = value
    return base
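`load_environment_config` calls `apply_env_vars`, which is not shown. One possible shape for it is sketched below. The naming convention is an assumption made for this example, not something CSA defines: variables start with `CSA_`, nesting levels are separated by a double underscore, and values are parsed as JSON when possible so that numbers and booleans keep their types.

```python
import copy
import json
import os

ENV_PREFIX = "CSA_"  # assumed prefix; not defined by CSA


def apply_env_vars(config, environ=None):
    """Return a copy of config with overrides taken from environment variables.

    Assumed convention: CSA_EXTRACTION__DATA_DIRECTORY overrides
    config["extraction"]["data_directory"]; "__" separates nesting levels.
    """
    environ = os.environ if environ is None else environ
    result = copy.deepcopy(config)
    for name, raw in environ.items():
        if not name.startswith(ENV_PREFIX):
            continue
        path = name[len(ENV_PREFIX):].lower().split("__")
        try:
            value = json.loads(raw)   # numbers, booleans, lists
        except json.JSONDecodeError:
            value = raw               # plain strings such as paths
        # Walk down to the parent dict, creating levels as needed.
        node = result
        for key in path[:-1]:
            if not isinstance(node.get(key), dict):
                node[key] = {}
            node = node[key]
        node[path[-1]] = value
    return result


base = {"extraction": {"data_directory": "./dev", "extraction_batch_size": 16}}
fake_env = {
    "CSA_EXTRACTION__DATA_DIRECTORY": "/scratch/run1",
    "CSA_EXTRACTION__EXTRACTION_BATCH_SIZE": "128",
    "HOME": "/home/user",  # ignored: no CSA_ prefix
}
print(apply_env_vars(base, fake_env))
```

Passing the environment as an argument keeps the function easy to test; in normal use it falls back to `os.environ`.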
Best Practices
Configuration Organization
Use descriptive naming: Include project, date, and version in config names
Document parameters: Add comments explaining non-standard settings
Version control: Track configuration changes alongside code
Environment separation: Maintain different configs for dev/test/prod
Template usage: Start from validated templates rather than from scratch
Performance Optimization
Profile first: Measure before optimizing batch sizes
Incremental scaling: Gradually increase batch sizes to find optimal settings
Monitor resources: Watch memory and GPU utilization during runs
Cache strategies: Use appropriate caching for repeated analyses
Checkpoint frequently: Save progress for long-running analyses
Reproducibility
Pin versions: Document CSA, PyTorch, and CCDC versions
Set random seeds: Ensure deterministic behavior
Save complete configs: Store exact configuration with results
Document environment: Record hardware and software environment
Validate consistency: Test configurations across different systems
Troubleshooting
Common Configuration Issues
Memory Problems
ERROR: CUDA out of memory
Solutions:
- Reduce extraction_batch_size and post_extraction_batch_size
- Enable memory optimization: "memory_optimization": "aggressive"
- Use CPU processing: "use_gpu_acceleration": false
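The first suggestion, reducing batch sizes, can be automated with a retry loop that halves both batch sizes after each out-of-memory failure. In the sketch below, `run` is a hypothetical callable standing in for whatever launches the extraction, and the exception types to catch are passed in because the out-of-memory error raised on a GPU depends on the framework in use.

```python
import copy


def run_with_backoff(config, run, oom_errors=(MemoryError,), min_batch=1):
    """Call run(config), halving both batch sizes after each out-of-memory error.

    `run` is a hypothetical callable standing in for the extraction launcher.
    The input config is not modified.
    """
    config = copy.deepcopy(config)
    extraction = config["extraction"]
    while True:
        try:
            return run(config)
        except oom_errors:
            if extraction["extraction_batch_size"] <= min_batch:
                raise  # cannot shrink further; surface the original error
            for key in ("extraction_batch_size", "post_extraction_batch_size"):
                if key in extraction:
                    extraction[key] = max(min_batch, extraction[key] // 2)


def fake_run(config):
    """Stand-in runner that 'fits in memory' only at batch size 32 or below."""
    size = config["extraction"]["extraction_batch_size"]
    if size > 32:
        raise MemoryError(f"batch size {size} too large")
    return size


start = {"extraction": {"extraction_batch_size": 128, "post_extraction_batch_size": 64}}
print("succeeded with batch size", run_with_backoff(start, fake_run))
```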
Performance Issues
WARNING: Very slow processing detected
Solutions:
- Increase batch sizes if memory allows
- Enable GPU acceleration
- Reduce dataset size with more restrictive filters
- Use faster storage (SSD) for data directory
Filter Problems
ERROR: No structures match the specified filters
Solutions:
- Relax restrictive filters gradually
- Check filter syntax and valid values
- Use filter validation tools before running
- Start with broader filters and refine iteratively
File System Issues
ERROR: Permission denied writing to data directory
Solutions:
- Check directory permissions: chmod 755 /path/to/data
- Ensure sufficient disk space
- Use absolute paths in configuration
- Verify write access: touch /path/to/data/test_file
Next Steps
Once you are comfortable with advanced configuration, you can:
Optimize for your hardware: Find the best performance settings
Develop analysis templates: Create reusable configurations
Automate workflows: Script configuration generation and validation
Share configurations: Collaborate with standardized templates
Monitor and improve: Continuously optimize based on usage patterns
See Also
Configuration: Basic configuration guide
Basic Analysis: Apply configurations in analysis workflows
../technical_details/performance: Performance optimization details
csa_config module: Configuration API reference