Configuration
CSA uses JSON configuration files to control all aspects of the analysis pipeline. This guide covers the essential configuration concepts you need to get started quickly.
Note
This guide covers basic configuration for getting started. For advanced research-driven configurations, see Configuration.
Quick Start Configuration
Minimal Configuration
The simplest CSA configuration requires only two parameters:
{
  "extraction": {
    "data_directory": "./my_analysis",
    "data_prefix": "my_first_run"
  }
}
This will use default settings for all other parameters.
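If you prefer to generate the file rather than type it by hand, the sketch below builds the same minimal configuration with Python's standard json module and checks that both required keys are present. It is an illustration only, not part of CSA: the names `minimal_config`, `REQUIRED_KEYS`, and `missing_required` are hypothetical.

```python
import json

# The two parameters this guide lists as required.
REQUIRED_KEYS = ("data_directory", "data_prefix")

# Minimal configuration from the example above, as a Python dict.
minimal_config = {
    "extraction": {
        "data_directory": "./my_analysis",
        "data_prefix": "my_first_run",
    }
}


def missing_required(config):
    """Return the required extraction keys that are absent from `config`."""
    extraction = config.get("extraction", {})
    return [key for key in REQUIRED_KEYS if key not in extraction]


# json.dumps emits syntactically valid JSON (straight quotes, correct commas).
config_text = json.dumps(minimal_config, indent=2)
print(config_text)
print("missing keys:", missing_required(minimal_config))
```

Serializing from a dict avoids the comma and quoting mistakes that are easy to make when editing JSON manually.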
Complete Basic Configuration
For your first analysis, use this template:
{
  "extraction": {
    "data_directory": "./csa_output",
    "data_prefix": "organic_analysis",
    "actions": {
      "get_refcode_families": true,
      "cluster_refcode_families": true,
      "get_unique_structures": true,
      "get_structure_data": true,
      "post_extraction_process": true
    },
    "filters": {
      "structure_list": ["csd-unique"],
      "crystal_type": ["homomolecular"],
      "target_species": ["C", "H", "N", "O"],
      "target_space_groups": ["P1", "P-1", "P21", "C2", "Pc", "Cc", "P21/m", "C2/m", "P2/c", "P21/c", "P21/n", "C2/c", "P21212", "P212121", "Pca21", "Pna21", "Pbcn", "Pbca", "Pnma", "R-3", "I41/a"],
      "target_z_prime_values": [1],
      "molecule_weight_limit": 300.0,
      "molecule_formal_charges": [0],
      "unique_structures_clustering_method": "vdWFV"
    },
    "extraction_batch_size": 32,
    "post_extraction_batch_size": 32
  }
}
Essential Parameters
Required Settings
- data_directory
  Where CSA will save all output files
  Example: "./my_analysis_output"
- data_prefix
  Prefix for all generated filenames
  Example: "pharma_study" → pharma_study_structures.h5
Pipeline Control
actions - Controls which pipeline stages to run:
"actions": {
  "get_refcode_families": true,      // Query CSD for structures
  "cluster_refcode_families": true,  // Group similar packings
  "get_unique_structures": true,     // Select representatives
  "get_structure_data": true,        // Extract coordinates
  "post_extraction_process": true    // Compute features
}
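Set any action to false to skip that stage. The // annotations above are explanatory only: standard JSON has no comment syntax, so leave them out of your configuration file.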
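As a rough illustration of toggling stages, the following sketch builds an actions block in Python with selected stages switched off. The helper `build_actions` is hypothetical and not a CSA function; only the five action names come from this guide.

```python
import json

# The five pipeline stages listed in this guide, in order.
PIPELINE_ACTIONS = (
    "get_refcode_families",
    "cluster_refcode_families",
    "get_unique_structures",
    "get_structure_data",
    "post_extraction_process",
)


def build_actions(skip=()):
    """Return an actions dict with every stage on except those in `skip`."""
    unknown = set(skip) - set(PIPELINE_ACTIONS)
    if unknown:
        # Catch typos early instead of silently ignoring them.
        raise ValueError(f"unknown action(s): {sorted(unknown)}")
    return {name: name not in skip for name in PIPELINE_ACTIONS}


# Example: run everything except feature computation.
actions = build_actions(skip=["post_extraction_process"])
print(json.dumps({"actions": actions}, indent=2))
```

Rejecting unknown names is a design choice of this sketch; it guards against a misspelled stage name leaving a stage enabled by accident.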
Basic Filters
- structure_list
  Structure database to use
  Options: ["csd-unique","all"], ["cif","path-to-cif-files"]
- crystal_type
  Type of crystal structures to include
  Options: ["homomolecular"], ["co-crystal"], ["organometallic"]
- target_species
  Required chemical elements
  Examples:
  - ["C", "H", "N", "O"] - Basic organics
  - ["C", "H", "N", "O", "S", "F", "Cl"] - Pharmaceuticals
- target_space_groups
  Required space groups
  Examples:
  - ["P-1", "P21/c", "P212121", "C2/c", "P21"] - Five most common space groups
  - ["P-1", "P1"] - Triclinic space groups
- target_z_prime_values
  Number of molecules per asymmetric unit
  Common values: [1] (most structures), [1, 2] (include Z’=2)
- molecule_weight_limit
  Maximum molecular weight in Daltons
  Examples: 300.0 (small molecules), 500.0 (larger molecules)
- molecule_formal_charges
  Allowed molecular charges
  Typical: [0] (neutral), [0, 1, -1] (include ions)
- unique_structures_clustering_method
  Clustering method used to determine the unique structures in a cluster
  Options: vdWFV
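A quick shape check can catch filter values of the wrong type before a run. The sketch below is a hypothetical helper, not CSA's own validation: the expected types are inferred from the examples in this guide (list-valued filters and a positive numeric weight limit) and may not match every rule CSA enforces.

```python
# Filters that appear as JSON arrays in this guide's examples.
LIST_FILTERS = (
    "structure_list",
    "crystal_type",
    "target_species",
    "target_space_groups",
    "target_z_prime_values",
    "molecule_formal_charges",
)


def check_filters(filters):
    """Return a list of human-readable problems found in `filters`."""
    problems = []
    for name in LIST_FILTERS:
        if name in filters and not isinstance(filters[name], list):
            problems.append(f"{name} should be a list")
    limit = filters.get("molecule_weight_limit")
    if limit is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(limit, (int, float)) and not isinstance(limit, bool)
        if not is_number or limit <= 0:
            problems.append("molecule_weight_limit should be a positive number")
    return problems


print(check_filters({"target_species": "C", "molecule_weight_limit": 300.0}))
```

An empty result means only that the shapes look plausible; it does not confirm that the values are accepted by CSA.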
Element Species Filtering
Automatic Isotope Handling
CSA automatically normalizes isotopes for structural analysis:
- Deuterium (D) → Hydrogen (H): All deuterium atoms are treated as hydrogen
- Chemical equivalence: For crystal packing analysis, isotopic differences are negligible
"filters": {
  "target_species": ["C", "H"]  // Accepts both H and D atoms
}
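The idea can be sketched in a few lines of Python. This is an illustration of the behavior described above, not CSA's implementation: the names `ISOTOPE_ALIASES`, `normalize_species`, and `passes_species_filter` are hypothetical, and the filter is assumed to accept a structure when every normalized element is in target_species.

```python
# Assumption: only the deuterium-to-hydrogen mapping described above is applied.
ISOTOPE_ALIASES = {"D": "H"}


def normalize_species(symbols):
    """Replace isotope labels with their parent element symbol."""
    return [ISOTOPE_ALIASES.get(symbol, symbol) for symbol in symbols]


def passes_species_filter(symbols, target_species):
    """True if every normalized element is in target_species (assumed semantics)."""
    return set(normalize_species(symbols)) <= set(target_species)


# A deuterated hydrocarbon is accepted by a ["C", "H"] filter.
print(passes_species_filter(["C", "H", "D"], ["C", "H"]))
# A structure containing nitrogen is not.
print(passes_species_filter(["C", "H", "N"], ["C", "H"]))
```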
Performance Settings
Batch Sizes
Control memory usage and speed:
- extraction_batch_size
  Structures processed together during data extraction
  Start with: 32
  If you have lots of RAM/GPU memory: 64 or 128
  If you get memory errors: 16 or 8
- post_extraction_batch_size
  Structures processed together during feature computation
  Start with: 16
  If you have lots of RAM/GPU memory: 32 or 64
  If you get memory errors: 8 or 4
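The recommended fallbacks follow a simple halving pattern. The sketch below lists the batch sizes to try in order after a memory error; `fallback_batch_sizes` is a hypothetical helper for planning retries, not a CSA feature.

```python
def fallback_batch_sizes(start, minimum):
    """List batch sizes to try, halving from `start` down to `minimum`."""
    if start < 1 or minimum < 1:
        raise ValueError("batch sizes must be at least 1")
    sizes = []
    size = start
    while size >= minimum:
        sizes.append(size)
        size //= 2
    return sizes


# Values suggested in this guide for each setting.
print("extraction_batch_size:", fallback_batch_sizes(32, 8))
print("post_extraction_batch_size:", fallback_batch_sizes(16, 4))
```

Halving keeps the number of retries small while quickly reaching a size that fits in memory.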
Common Configurations
Small Organic Molecules
{
  "extraction": {
    "data_directory": "../small_organics",
    "data_prefix": "small_molecules",
    "filters": {
      "structure_list": ["csd-unique"],
      "crystal_type": ["homomolecular"],
      "target_species": ["C", "H", "N", "O"],
      "target_space_groups": ["P1", "P-1", "P21", "C2", "Pc", "Cc", "P21/m", "C2/m", "P2/c", "P21/c", "P21/n", "C2/c", "P21212", "P212121", "Pca21", "Pna21", "Pbcn", "Pbca", "Pnma", "R-3", "I41/a"],
      "target_z_prime_values": [1],
      "molecule_weight_limit": 300.0,
      "molecule_formal_charges": [0],
      "unique_structures_clustering_method": "vdWFV"
    },
    "extraction_batch_size": 32,
    "post_extraction_batch_size": 16
  }
}
Drug-Like Molecules
{
  "extraction": {
    "data_directory": "../pharmaceuticals",
    "data_prefix": "drug_molecules",
    "filters": {
      "structure_list": ["csd-unique"],
      "crystal_type": ["homomolecular"],
      "target_species": ["C", "H", "N", "O", "S", "F", "Cl", "Br"],
      "target_space_groups": ["P1", "P-1", "P21", "C2", "Pc", "Cc", "P21/m", "C2/m", "P2/c", "P21/c", "P21/n", "C2/c", "P21212", "P212121", "Pca21", "Pna21", "Pbcn", "Pbca", "Pnma", "R-3", "I41/a"],
      "target_z_prime_values": [1],
      "molecule_weight_limit": 300.0,
      "molecule_formal_charges": [0],
      "unique_structures_clustering_method": "vdWFV"
    },
    "extraction_batch_size": 32,
    "post_extraction_batch_size": 16
  }
}
Creating Your Configuration
Step-by-Step Process
- Copy a template from the examples above
- Modify the basics:
  - Change data_directory to your desired output location
  - Set data_prefix to describe your analysis
- Adjust filters for your research:
  - Set appropriate molecular weight limit
  - Choose relevant chemical elements
  - Decide on charge states and Z’ values
- Set batch sizes based on your hardware:
  - Start with the defaults (32, 16)
  - Reduce if you get memory errors
  - Increase if you have powerful hardware
- Save as a .json file
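The same process can be scripted. The sketch below copies an abbreviated template, changes the basics, adjusts filters and batch sizes, and saves the result as a .json file. The output location, prefix, and file name are hypothetical, and a temporary directory is used so the example does not touch your working directory.

```python
import copy
import json
import tempfile
from pathlib import Path

# Copy a template (abbreviated from the examples in this guide).
template = {
    "extraction": {
        "data_directory": "./csa_output",
        "data_prefix": "organic_analysis",
        "filters": {
            "target_species": ["C", "H", "N", "O"],
            "target_z_prime_values": [1],
            "molecule_weight_limit": 300.0,
            "molecule_formal_charges": [0],
        },
        "extraction_batch_size": 32,
        "post_extraction_batch_size": 16,
    }
}
config = copy.deepcopy(template)  # leave the template itself untouched
extraction = config["extraction"]

# Modify the basics (hypothetical values).
extraction["data_directory"] = "./fluorinated_run"
extraction["data_prefix"] = "fluorinated"

# Adjust filters for the study at hand.
extraction["filters"]["target_species"].append("F")
extraction["filters"]["molecule_weight_limit"] = 500.0

# Set batch sizes for modest hardware.
extraction["extraction_batch_size"] = 16
extraction["post_extraction_batch_size"] = 8

# Save as a .json file.
config_path = Path(tempfile.mkdtemp()) / "my_config.json"
config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
print("wrote", config_path)
```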
Validation and Testing
Check Your JSON
Before running CSA, validate your JSON syntax:
- Use an online JSON validator (search “JSON validator”)
- Check for common errors:
  - Missing commas between items
  - Missing quotes around strings
  - Mismatched brackets or braces
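If Python is installed, its standard json module can serve as an offline validator. The sketch below reports the line and column of the first syntax error; `check_json` is a hypothetical helper name, and the check covers JSON syntax only, not whether CSA accepts the parameter values.

```python
import json


def check_json(text):
    """Return None if `text` is valid JSON, else a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


good = '{"data_directory": "./output", "data_prefix": "analysis"}'
missing_comma = '{"data_directory": "./output" "data_prefix": "analysis"}'
trailing_comma = '{"data_directory": "./output",}'

for label, text in [("good", good), ("missing comma", missing_comma),
                    ("trailing comma", trailing_comma)]:
    print(label, "->", check_json(text) or "valid")
```

To check a file on disk, read it first, for example with `check_json(open("my_config.json", encoding="utf-8").read())`, where the file name is a placeholder.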
Common Beginner Mistakes
JSON Syntax Errors
Missing Commas
❌ Wrong:
{
  "data_directory": "./output"
  "data_prefix": "analysis"
}
✅ Correct:
{
  "data_directory": "./output",
  "data_prefix": "analysis"
}
Quotes Around Strings
❌ Wrong:
{
  "target_species": [C, H, N, O]
}
✅ Correct:
{
  "target_species": ["C", "H", "N", "O"]
}
Configuration Issues
Too Restrictive Filters
If CSA finds no structures, your filters might be too strict:
- Increase molecule_weight_limit
- Add more elements to target_species
- Include more charge states or Z’ values
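As a sketch of how these relaxations might be applied in a script, the example below widens a strict filter set step by step. The helper `extend_unique` and the specific values are illustrative assumptions; which relaxation is appropriate depends on your research question.

```python
def extend_unique(values, extra):
    """Return a new list with items from `extra` appended if not already present."""
    result = list(values)
    for item in extra:
        if item not in result:
            result.append(item)
    return result


strict = {
    "target_species": ["C", "H", "N", "O"],
    "target_z_prime_values": [1],
    "molecule_weight_limit": 300.0,
    "molecule_formal_charges": [0],
}

relaxed = dict(strict)
relaxed["molecule_weight_limit"] = 500.0  # increase the weight limit
relaxed["target_species"] = extend_unique(strict["target_species"], ["S", "F", "Cl"])
relaxed["target_z_prime_values"] = extend_unique(strict["target_z_prime_values"], [2])
relaxed["molecule_formal_charges"] = extend_unique(strict["molecule_formal_charges"], [1, -1])

print(relaxed)
```

Changing one filter at a time makes it easier to see which restriction was excluding your structures.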
Memory Problems
If you get “out of memory” errors:
- Reduce extraction_batch_size to 16 or 8
- Reduce post_extraction_batch_size to 8 or 4
- Use fewer structures for testing
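For a quick test run, the "max_structures" filter mentioned under Getting Help can be added programmatically. The sketch below assumes the key belongs under extraction.filters, as that tip suggests; `limit_structures` is a hypothetical helper, not a CSA function.

```python
import copy


def limit_structures(config, max_structures=50):
    """Return a copy of `config` with a max_structures filter set for testing."""
    limited = copy.deepcopy(config)
    # Create the nested sections if the configuration does not have them yet.
    filters = limited.setdefault("extraction", {}).setdefault("filters", {})
    filters["max_structures"] = max_structures
    return limited


full_config = {
    "extraction": {
        "data_directory": "./my_analysis",
        "data_prefix": "my_first_run",
        "extraction_batch_size": 16,
        "post_extraction_batch_size": 8,
    }
}
test_config = limit_structures(full_config)
print(test_config["extraction"]["filters"])
```

Working on a copy keeps the full configuration intact for the production run.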
Very Slow Processing
If CSA runs very slowly:
- Check that you have GPU acceleration enabled
- Reduce the dataset size for initial testing
- Consider using a more powerful computer
Getting Help
When Things Go Wrong
- Check the error message - CSA provides detailed error information
- Validate your JSON - Use an online JSON validator
- Try a simpler configuration - Start with the minimal example and add complexity
- Test with fewer structures - Add "max_structures": 50 to your filters
- Check the examples - Compare your configuration to the working examples above
Next Steps
Once you have a working configuration:
Run your first analysis - Follow the Quickstart Guide
Explore the results - Learn what CSA produces
Try different filters - Experiment with different molecular systems
Learn advanced configuration - Check Configuration for research-specific setups
See Also
- Quickstart Guide : Run your first analysis with your configuration
- Configuration : Advanced configuration strategies
- Basic Analysis : Understanding and analyzing your results