csa_main module

Module: csa_main.py

Entry-point script for the Crystal Structure Analysis pipeline.

This module provides a command-line interface to execute the full CSD data extraction pipeline via CrystalAnalyzer.extract_data().

Examples

python csa_main.py --config /path/to/config.json

Dependencies

crystal_analyzer.CrystalAnalyzer
csa_config.load_config
argparse
logging
pathlib
sys

csa_main.setup_logging()

Configure the root logger for console output.

This function performs the following actions:

Remove any existing handlers from the root logger.
Create a StreamHandler to stderr with format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
Attach the handler to the root logger.
Set the root logger level to INFO.

csa_main.parse_args()

Parse and validate command-line arguments.

Returns: Namespace object with the following attribute: config (Path), the path to the JSON configuration file.
Return type: argparse.Namespace
Raises: SystemExit – If argument parsing fails.

csa_main.run_extraction(config_path)

Run the CSD data extraction pipeline.

This function:

Loads the extraction configuration from the provided JSON file.
Instantiates a CrystalAnalyzer with the configuration.
Invokes analyzer.extract_data() to perform all extraction steps.

Parameters: config_path (Path) – Path to the JSON configuration file. Must exist and be readable.
Raises: Exception – If any error occurs during extraction, it is logged and re-raised.

csa_main.main()

Entry point for the csa_main script.

Workflow

Configure logging by calling setup_logging().
Parse command-line arguments via parse_args().
Invoke run_extraction() with the resolved config path.
Log success or catch and log exceptions before exiting.

Raises: SystemExit – If argument parsing fails.
Raises: Exception – If run_extraction() raises an error; it is logged before propagation.

Command-Line Interface for CSA Pipeline

The csa_main module provides the primary command-line interface for executing the Crystal Structure Analysis pipeline. It handles argument parsing, logging configuration, and orchestrates the complete extraction workflow.

Main Functions

csa_main.main()

Entry Point for CSA Execution

Main entry point that orchestrates the complete CSA pipeline execution. This function coordinates logging setup, argument parsing, and pipeline execution with comprehensive error handling.

Execution Workflow:

Logging Configuration - Sets up structured console logging
Argument Parsing - Processes command-line arguments and validates inputs
Pipeline Execution - Loads configuration and runs extraction pipeline
Error Handling - Captures and logs any execution failures
Success Reporting - Confirms successful completion
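
The workflow above can be sketched roughly as follows. This is a minimal illustration rather than the module's actual source: main_sketch, its run and argv parameters, and the logger name are hypothetical and exist only so the flow can be exercised without the real pipeline.

```python
import argparse
import logging
import sys
from pathlib import Path

logger = logging.getLogger("csa_main_sketch")  # hypothetical logger name


def main_sketch(run, argv=None):
    """Illustrative stand-in for main(); `run` and `argv` are assumptions."""
    # 1. Logging configuration (simplified compared with setup_logging())
    logging.basicConfig(
        stream=sys.stderr,
        level=logging.INFO,
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )

    # 2. Argument parsing; argparse raises SystemExit on invalid input
    parser = argparse.ArgumentParser(description="Run CSD data extraction pipeline")
    parser.add_argument("-c", "--config", type=Path,
                        default=Path("../config/csa_config.json"))
    args = parser.parse_args(argv)

    # 3-4. Pipeline execution; failures are logged with a traceback, then re-raised
    try:
        run(args.config)
    except Exception:
        logger.exception("Data extraction failed with an error.")
        raise

    # 5. Success reporting
    logger.info("Data extraction completed successfully.")
    return args.config
```

Passing a stand-in callable for run, for example main_sketch(print, ["--config", "analysis.json"]), exercises the sequence without touching the CSD.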

Usage from Command Line:

# Basic execution with an explicit configuration file
python csa_main.py --config analysis.json

# Using alternative configuration path
python csa_main.py --config /path/to/custom/config.json

# With output redirection and logging
python csa_main.py --config analysis.json 2>&1 | tee analysis.log

Usage from Python:

# Direct function call
import sys
sys.argv = ['csa_main.py', '--config', 'my_analysis.json']

from csa_main import main
main()

Exit Codes:

0 - Successful completion
1 - Configuration or parsing error
2 - Pipeline execution failure

Raises:

SystemExit - On argument parsing failures
Exception - On pipeline execution errors (logged and re-raised)

csa_main.run_extraction(config_path)

Execute CSA Data Extraction Pipeline

Coordinates the complete CSA extraction workflow by loading configuration, initializing the analyzer, and executing all enabled pipeline stages.

Pipeline Stages Executed:

Refcode Family Extraction - Query CSD for structure families
Similarity Clustering - Group similar crystal packings
Representative Selection - Choose optimal structures per cluster
Raw Data Extraction - Extract atomic coordinates and contacts
Feature Engineering - Compute advanced structural descriptors

Parameters:

config_path (Path) - Path to JSON configuration file

Configuration Loading:

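from pathlib import Path

from crystal_analyzer import CrystalAnalyzer
from csa_config import load_config

config_path = Path("/path/to/config.json")

# Configuration is loaded and validated
extraction_cfg = load_config(config_path)

# CrystalAnalyzer is initialized with config
analyzer = CrystalAnalyzer(extraction_config=extraction_cfg)

# Complete pipeline is executed
analyzer.extract_data()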

Progress Monitoring:

The function provides detailed logging of pipeline progress:

INFO - Loading configuration from /path/to/config.json
INFO - Starting extraction step...
INFO - Extracting refcode families into DataFrame...
INFO - Extracted 45,823 structures across 12,456 families
INFO - Clustering refcode families...
INFO - Refcode families clustered into 8,234 groups
INFO - Selecting unique structures...
INFO - Unique structures selected: 8,234 structures
INFO - Extracting detailed structure data...
INFO - Raw data extraction complete
INFO - Starting post-extraction processing...
INFO - Post-extraction processing complete

Error Recovery:

Errors are logged with full context and re-raised:

import logging

try:
    analyzer.extract_data()
except Exception:
    # logging.exception records the message together with the full traceback
    logging.exception("Data extraction failed with an error.")
    raise  # Re-raise for calling code

Output Files Generated:

{prefix}_refcode_families.csv - Initial family assignments
{prefix}_refcode_families_clustered.csv - Clustered families
{prefix}_refcode_families_unique.csv - Selected representatives
{prefix}.h5 - Raw structural data in HDF5 format
{prefix}_processed.h5 - Computed features and descriptors
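
After a run, the presence of these files can be checked with a few lines of Python. The sketch below is an assumption-laden helper, not part of csa_main: the function names, the output_dir argument, and the idea that all five files share one directory are hypothetical; only the file-name suffixes come from the list above.

```python
from pathlib import Path

# Suffixes taken from the list of output files; everything else is assumed.
OUTPUT_SUFFIXES = (
    "_refcode_families.csv",
    "_refcode_families_clustered.csv",
    "_refcode_families_unique.csv",
    ".h5",
    "_processed.h5",
)


def expected_outputs(output_dir, prefix):
    """Return the paths the pipeline is documented to write for `prefix`."""
    return [Path(output_dir) / f"{prefix}{suffix}" for suffix in OUTPUT_SUFFIXES]


def missing_outputs(output_dir, prefix):
    """Return the expected output paths that do not exist as regular files."""
    return [p for p in expected_outputs(output_dir, prefix) if not p.is_file()]
```

An empty list from missing_outputs() suggests that every stage wrote its file; it says nothing about whether the contents are complete.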

Raises:

Exception - Any error during extraction (logged and re-raised)

Configuration and Logging

csa_main.setup_logging()

Configure Structured Console Logging

Establishes standardized logging configuration for the CSA pipeline with appropriate formatting and output handling.

Logging Configuration:

Handler Cleanup - Removes any existing root logger handlers
Stream Handler - Directs output to stderr for proper console display
Formatter - Structured format with timestamps and module identification
Log Level - Set to INFO for operational visibility

Log Format:

%(asctime)s - %(name)s - %(levelname)s - %(message)s
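
A configuration matching this description could look like the sketch below. It uses only the standard logging module; the function name setup_logging_sketch and the LOG_FORMAT constant are hypothetical, and the real setup_logging() may differ in detail.

```python
import logging
import sys

LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"


def setup_logging_sketch():
    """Reset the root logger to a single stderr handler at INFO level."""
    root = logging.getLogger()

    # Iterate over a copy, because removeHandler() mutates root.handlers
    for handler in list(root.handlers):
        root.removeHandler(handler)

    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    root.addHandler(handler)
    root.setLevel(logging.INFO)
    return root
```

Removing existing handlers first keeps repeated calls from producing duplicate log lines.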

Example Output:

2024-01-15 10:30:15 - crystal_analyzer - INFO - Starting data extraction pipeline
2024-01-15 10:30:16 - csd_operations - INFO - Extracting refcode families
2024-01-15 10:31:45 - structure_data_extractor - INFO - Processing batch 1/256
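
Because every line follows the same four-field layout, captured logs are easy to post-process. The following sketch parses one line into its fields; the LOG_LINE pattern and parse_log_line() are hypothetical helpers, and the pattern assumes logger names contain no spaces. The optional milliseconds group covers the default asctime rendering, which appends ",mmm".

```python
import re

# timestamp - logger name - level - message
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(?:,\d{3})?)"
    r" - (?P<name>\S+) - (?P<level>[A-Z]+) - (?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match the format."""
    match = LOG_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None
```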

Usage in Scripts:

from csa_main import setup_logging

# Configure logging for custom scripts
setup_logging()

import logging
logger = logging.getLogger(__name__)
logger.info("Custom script starting...")

csa_main.parse_args()

Parse and Validate Command-Line Arguments

Processes command-line arguments for the CSA pipeline with validation and default value handling.

Supported Arguments:

-c, --config

Path to JSON configuration file

Type: Path object
Default: ../config/csa_config.json (relative to script location)
Required: No (uses default if not specified)

Argument Processing:

import argparse
from pathlib import Path

parser = argparse.ArgumentParser(
    description="Run CSD data extraction pipeline"
)
parser.add_argument(
    '-c', '--config',
    type=Path,
    default=Path('../config/csa_config.json').expanduser(),
    help="Path to the JSON configuration file"
)
args = parser.parse_args()  # args.config is a Path

Usage Examples:

# Use default configuration
python csa_main.py

# Specify custom configuration
python csa_main.py --config my_analysis.json

# Full path specification
python csa_main.py --config /home/user/projects/config.json

Validation:

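Values are converted to Path objects by argparse (type=Path)
In the snippet above, expanduser() is applied to the default value; a ~ typed unquoted on the command line is normally expanded by the shell before the script sees it
A plain relative path, such as the default in the snippet above, is interpreted relative to the current working directory unless the script resolves it against its own location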

Returns: argparse.Namespace with validated config attribute
Raises: SystemExit - On argument parsing failures or help requests
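
For testing, it can help to parse an explicit argument list instead of sys.argv. The sketch below adds a hypothetical argv parameter for that purpose; the documented parse_args() takes no arguments, so treat this as an illustration of the argparse behavior, not the module's API.

```python
import argparse
from pathlib import Path


def parse_args_sketch(argv=None):
    """Parse `argv` (or sys.argv[1:] when None) into a Namespace with `config`."""
    parser = argparse.ArgumentParser(description="Run CSD data extraction pipeline")
    parser.add_argument(
        "-c", "--config",
        type=Path,
        default=Path("../config/csa_config.json"),
        help="Path to the JSON configuration file",
    )
    return parser.parse_args(argv)
```

With argparse's default behavior, an unrecognized option such as --unknown ends in SystemExit with code 2, and -h or --help ends in SystemExit with code 0.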

Command-Line Usage

Basic Execution Patterns

# Standard execution
python src/csa_main.py --config analysis.json

# With logging to file
python src/csa_main.py --config analysis.json 2>&1 | tee analysis.log

# Background execution
nohup python src/csa_main.py --config analysis.json > analysis.log 2>&1 &

Configuration File Management

# Validate configuration before running
python -c "from csa_config import load_config; load_config('analysis.json')"

# Copy and modify template
cp templates/pharmaceutical.json my_analysis.json
nano my_analysis.json

# Run with custom configuration
python src/csa_main.py --config my_analysis.json
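
When csa_config is not importable (for example on a machine without the CSD Python API), a lighter pre-flight check can still catch missing files and malformed JSON. The helper below is a hypothetical sketch using only the standard library; it does not replicate whatever validation load_config performs, and the requirement that the top-level value be a JSON object is an assumption.

```python
import json
from pathlib import Path


def check_config_file(path):
    """Return the parsed JSON object, raising if the file is missing or invalid."""
    path = Path(path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Configuration file not found: {path}")

    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)  # raises json.JSONDecodeError on malformed JSON

    # Assumption: the configuration is a JSON object (a dict), not a list or scalar
    if not isinstance(data, dict):
        raise ValueError("Top-level JSON value must be an object")
    return data
```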

Resource Monitoring

# Monitor resource usage during execution
python src/csa_main.py --config analysis.json &
watch -n 5 'ps aux | grep csa_main'

# Track GPU usage if applicable
watch -n 2 nvidia-smi

Error Handling and Recovery

# Capture full error traces
python src/csa_main.py --config analysis.json 2>&1 | tee full_log.txt

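# Resume from checkpoint (only if a --resume flag is implemented; parse_args() as documented above accepts only -c/--config)
python src/csa_main.py --config analysis.json --resume

# Unbuffered output (-u) so log lines appear immediately while debugging
python -u src/csa_main.py --config analysis.json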

Integration Examples

Batch Processing Scripts

#!/usr/bin/env python3
"""Batch process multiple configurations."""

import subprocess
import sys
from pathlib import Path

def run_csa_batch(config_files):
    """Execute CSA for multiple configurations."""
    for config_file in config_files:
        print(f"Processing {config_file}...")

        result = subprocess.run([
            sys.executable, 'src/csa_main.py',
            '--config', str(config_file)
        ], capture_output=True, text=True)

        if result.returncode == 0:
            print(f"✓ {config_file} completed successfully")
        else:
            print(f"✗ {config_file} failed:")
            print(result.stderr)

# Process all configurations in directory
config_dir = Path('configurations')
configs = list(config_dir.glob('*.json'))
run_csa_batch(configs)

Workflow Integration

"""Integrate CSA into larger analysis workflow."""

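import logging
from pathlib import Path

from csa_main import run_extraction


# Placeholder hooks: replace these with project-specific implementations.
def prepare_analysis_environment():
    """Hypothetical pre-processing step (e.g., creating output directories)."""


def analyze_extracted_data():
    """Hypothetical post-processing step operating on the extracted files."""


def generate_reports():
    """Hypothetical reporting step."""


def complete_analysis_workflow(base_config):
    """Execute complete analysis with pre/post processing."""

    # Pre-processing steps
    logging.info("Starting pre-processing...")
    prepare_analysis_environment()

    # CSA execution
    logging.info("Running CSA extraction...")
    config_path = Path(base_config)
    run_extraction(config_path)

    # Post-processing steps
    logging.info("Starting post-processing...")
    analyze_extracted_data()
    generate_reports()

    logging.info("Complete workflow finished")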

Error Recovery and Monitoring

"""Monitor and recover from CSA execution errors."""

import subprocess
import sys
import time

def monitored_csa_execution(config_path, max_retries=3):
    """Execute CSA with automatic retry on failure."""

    for attempt in range(max_retries):
        try:
            # sys.executable runs the script with the current interpreter
            result = subprocess.run([
                sys.executable, 'src/csa_main.py',
                '--config', str(config_path)
            ], check=True, capture_output=True, text=True)

            print("CSA execution completed successfully")
            return result

        except subprocess.CalledProcessError as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                print("Retrying in 60 seconds...")
                time.sleep(60)
            else:
                print("All retry attempts exhausted")
                raise

Performance Considerations

Memory Management

The command-line interface provides monitoring for memory usage:

# Memory usage is logged during execution
INFO - Peak memory usage: 8.2 GB
INFO - GPU memory allocated: 3.1 GB
INFO - Processing batch 45/128 (35% complete)

Execution Time Estimation

INFO - Stage 1 (Family Extraction): 2.3 minutes
INFO - Stage 2 (Clustering): 15.7 minutes
INFO - Stage 3 (Selection): 1.1 minutes
INFO - Stage 4 (Data Extraction): 45.2 minutes
INFO - Stage 5 (Feature Engineering): 23.8 minutes
INFO - Total execution time: 88.1 minutes
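
Per-stage timings of this kind can be collected by wrapping each stage in a small timer. The context manager below is a hypothetical sketch of that technique; it is not how csa_main is known to produce the lines above, and the logger name and the timings dictionary are assumptions.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("csa_timing_sketch")  # hypothetical logger name


@contextmanager
def timed_stage(name, timings):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the stage raises, so failed stages are timed too
        elapsed = time.perf_counter() - start
        timings[name] = elapsed
        logger.info("%s: %.1f minutes", name, elapsed / 60)
```

A caller would write, for example, with timed_stage("Stage 2 (Clustering)", timings): followed by the stage's code, and could sum timings.values() for the total.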

Resource Requirements

CPU: Multi-core recommended for parallel processing
Memory: 8GB+ RAM for typical datasets
GPU: CUDA-compatible GPU recommended for stages 4-5
Storage: 10GB+ free space for intermediate files
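
Some of these requirements can be checked before a long run starts. The sketch below uses only the standard library to report the CPU count and free disk space; the function names and the required_gb default are hypothetical, and RAM and GPU checks are left out because they need platform-specific or third-party tools.

```python
import os
import shutil


def resource_summary(path="."):
    """Report the CPU count and the free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return {
        "cpu_count": os.cpu_count(),          # may be None on unusual platforms
        "free_disk_gb": usage.free / 1024**3,
    }


def has_free_space(path=".", required_gb=10):
    """Return True if at least `required_gb` GiB are free at `path`."""
    return resource_summary(path)["free_disk_gb"] >= required_gb
```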
