Quickstart Guide
This guide will get you running your first CSA analysis in under 15 minutes. You’ll go from a basic configuration to analyzing crystal structure data.
Prerequisites
Before starting, ensure you have:
✅ CSA installed (see Installation)
✅ CCDC license and CSD database access
✅ Virtual environment activated
# Activate your CSA environment
source csa_env/bin/activate # macOS/Linux
csa_env\Scripts\activate # Windows
The Five-Stage CSA Pipeline
CSA transforms CSD data through five stages:
Family Extraction - Query CSD for structure families
Similarity Clustering - Group similar crystal packings
Representative Selection - Choose optimal structures
Data Extraction - Extract detailed structural data
Feature Engineering - Compute advanced descriptors
Let’s run through each stage with your first analysis.
Step 1: Create Your Configuration
Create a Simple Configuration
Create a file named my_first_analysis.json:
{
"extraction": {
"data_directory": "../my_first_csa_run/",
"data_prefix": "small_hydrocarbons",
"actions": {
"get_refcode_families": true,
"cluster_refcode_families": true,
"get_unique_structures": true,
"get_structure_data": true,
"post_extraction_process": true
},
"filters": {
"structure_list": ["csd-unique"],
"crystal_type": ["homomolecular"],
"target_species": ["C", "H"],
"target_space_groups": ["P21/c","P-1"],
"target_z_prime_values": [1.0],
"molecule_weight_limit": 300.0,
"molecule_formal_charges": [0],
"unique_structures_clustering_method": "vdWFV"
},
"extraction_batch_size": 32,
"post_extraction_batch_size": 32
}
}
Configuration Explained
Step 2: Run Your First Analysis
Execute the Pipeline
Navigate to your CSA directory and run:
cd /path/to/crystal-structure-analysis
python src/csa_main.py --config my_first_analysis.json
Monitor Progress
You’ll see output like this:
2025-05-03 20:02:39,843 - root - INFO - Loading configuration from csa_config.json
2025-05-03 20:02:39,846 - root - INFO - Starting extraction step...
2025-05-03 20:02:39,846 - crystal_analyzer - INFO - Starting data extraction pipeline...
2025-05-03 20:02:39,846 - crystal_analyzer - INFO - Extracting refcode families into DataFrame...
2025-05-03 20:20:04,663 - crystal_analyzer - INFO - Extracted 1284316 structures across 1151944 families
2025-05-03 20:20:04,717 - crystal_analyzer - INFO - Clustering refcode families...
2025-05-03 20:47:49,881 - csd_operations - INFO - Saved clustered families to ..\my_first_csa_run\small_hydrocarbons_refcode_families_clustered.csv
2025-05-03 20:47:50,014 - crystal_analyzer - INFO - Refcode families clustered into 407 groups.
2025-05-03 20:47:50,023 - crystal_analyzer - INFO - Selecting unique structures …
2025-05-03 20:47:58,430 - csd_operations - INFO - Saved unique structures to ..\my_first_csa_run\small_hydrocarbons_refcode_families_unique.csv
2025-05-03 20:47:58,431 - crystal_analyzer - INFO - Unique structures selected: 310 structures across 309 families
2025-05-03 20:47:58,431 - crystal_analyzer - INFO - Extracting detailed structure data into ..\my_first_csa_run\small_hydrocarbons.h5 …
2025-05-03 20:47:58,439 - structure_data_extractor - INFO - 310 structures to extract (batch size 1024)
2025-05-03 20:47:58,441 - structure_data_extractor - INFO - Extracting batch 1 (size 310)
2025-05-03 20:48:27,091 - structure_data_extractor - INFO - Raw data extraction complete; HDF5 file closed.
2025-05-03 20:48:27,091 - crystal_analyzer - INFO - Detailed structure data extracted and saved to ..\my_first_csa_run\small_hydrocarbons.h5
2025-05-03 20:48:27,236 - structure_post_extraction_processor - INFO - Found 310 structures to process.
2025-05-03 20:48:27,236 - structure_post_extraction_processor - INFO - Processing structures 1 to 310
2025-05-03 20:48:49,495 - structure_post_extraction_processor - INFO - Post-extraction fast processing complete.
2025-05-03 20:48:49,495 - crystal_analyzer - INFO - Data extraction completed in 0:46:09.649176
2025-05-03 20:48:49,581 - root - INFO - Data extraction completed successfully.
Total time: Typically less than 1 hour for this configuration.
Understanding the Stages
- Stage 1 (~5 minutes)
Queries the CSD database using your filters
- Stage 2 (~30 minutes)
Groups structures with similar crystal packing
- Stage 3 (~10 minutes)
Picks the best representative from each cluster
- Stage 4 (~2 minutes)
Extracts atomic coordinates and basic properties
- Stage 5 (~2 minutes)
Computes advanced molecular descriptors
Step 3: Explore Your Results
Check Generated Files
After completion, examine your output directory:
ls -la my_first_csa_run/
You should see:
my_first_csa_run/
├── small_hydrocarbons_refcode_families.csv # Stage 1 output
├── small_hydrocarbons_clustered_families.csv # Stage 2 output
├── small_hydrocarbons_unique_structures.csv # Stage 3 output
├── small_hydrocarbons.h5 # Stage 4 output
└── small_hydrocarbons_processed.h5 # Stage 5 output
Quick Data Overview
Use this Python script to inspect your results:
import h5py
import pandas as pd
import numpy as np
# Basic dataset information
with h5py.File('../my_first_csa_run/small_hydrocarbons_processed.h5', 'r') as f:
refcodes = f['refcode_list'][...].astype(str)
n_structures = len(refcodes)
print(f"🎉 Successfully processed {n_structures:,} crystal structures!")
print(f"📝 First 5 refcodes: {refcodes[:5].tolist()}")
# Crystal properties overview
space_groups = [f['space_group'][i].decode() for i in range(min(n_structures, 1000))]
unique_sgs = set(space_groups)
print(f"🔬 Found {len(unique_sgs)} different space groups")
cell_volumes = f['cell_volume'][...]
print(f"📏 Cell volume range: {cell_volumes.min():.1f} - {cell_volumes.max():.1f} ")
n_atoms = f['n_atoms'][...]
print(f"⚛️ Molecular size: {n_atoms.min()}-{n_atoms.max()} atoms (avg: {n_atoms.mean():.1f})")
# Fragment analysis
n_fragments = f['n_fragments'][...]
print(f"🧩 Fragments per molecule: {n_fragments.min()}-{n_fragments.max()} (avg: {n_fragments.mean():.1f})")
# Contact analysis
n_contacts = f['inter_cc_n_contacts'][...]
structures_with_contacts = np.sum(n_contacts > 0)
print(f"🤝 {structures_with_contacts:,} structures have intermolecular contacts ({structures_with_contacts/n_structures*100:.1f}%)")
Save this as inspect_results.py and run:
python inspect_results.py
Expected output:
🎉 Successfully processed 310 crystal structures!
📝 First 5 refcodes: ['ACAMAT', 'ANANTH01', 'ANNULE10', 'ANOKUM', 'ATAKOV']
🔬 Found 2 different space groups
📏 Cell volume range: 252.6 - 1944.4
⚛️ Molecular size: 9-104 atoms (avg: 40.0)
🧩 Fragments per molecule: 1-14 (avg: 2.2)
🤝 310 structures have intermolecular contacts (100.0%)
Step 4: Your First Analysis
Crystal Property Analysis
Let’s analyze the crystal properties you just extracted:
import h5py
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Load data
with h5py.File('../my_first_csa_run/small_hydrocarbons_processed.h5', 'r') as f:
data = {
'refcode': f['refcode_list'][...].astype(str),
'space_group': [f['space_group'][i].decode() for i in range(len(f['refcode_list']))],
'cell_volume': f['cell_volume'][...],
'cell_density': f['cell_density'][...],
'n_atoms': f['n_atoms'][...],
'packing_coefficient': f['packing_coefficient'][...]
}
df = pd.DataFrame(data)
# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# 1. Density distribution
axes[0,0].hist(df['cell_density'], bins=50, alpha=0.7, color='skyblue', edgecolor='black')
axes[0,0].set_xlabel('Crystal Density (g/cm³)')
axes[0,0].set_ylabel('Number of Structures')
axes[0,0].set_title('Crystal Density Distribution')
axes[0,0].axvline(df['cell_density'].mean(), color='red', linestyle='--',
label=f'Mean: {df["cell_density"].mean():.2f}')
axes[0,0].legend()
# 2. Volume vs molecular size
axes[0,1].scatter(df['n_atoms'], df['cell_volume'], alpha=0.6, color='orange')
axes[0,1].set_xlabel('Number of Atoms')
axes[0,1].set_ylabel('Cell Volume (Ų)')
axes[0,1].set_title('Cell Volume vs Molecular Size')
# 3. Top 10 space groups
top_sgs = df['space_group'].value_counts().head(10)
axes[1,0].barh(range(len(top_sgs)), top_sgs.values, color='lightgreen')
axes[1,0].set_yticks(range(len(top_sgs)))
axes[1,0].set_yticklabels(top_sgs.index)
axes[1,0].set_xlabel('Number of Structures')
axes[1,0].set_title('Most Common Space Groups')
# 4. Packing efficiency
axes[1,1].hist(df['packing_coefficient'], bins=50, alpha=0.7, color='purple', edgecolor='black')
axes[1,1].set_xlabel('Packing Coefficient')
axes[1,1].set_ylabel('Number of Structures')
axes[1,1].set_title('Crystal Packing Efficiency')
axes[1,1].axvline(df['packing_coefficient'].mean(), color='red', linestyle='--',
label=f'Mean: {df["packing_coefficient"].mean():.3f}')
axes[1,1].legend()
plt.tight_layout()
plt.savefig('my_first_csa_run/crystal_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
print(f"📊 Analysis complete! Plot saved to: my_first_csa_run/crystal_analysis.png")
print(f"📈 Key findings:")
print(f" • Average density: {df['cell_density'].mean():.2f} g/cm³")
print(f" • Most common space group: {df['space_group'].mode()[0]} ({df['space_group'].value_counts().iloc[0]} structures)")
print(f" • Average packing efficiency: {df['packing_coefficient'].mean():.3f}")
Fragment Shape Analysis
Explore molecular fragment shapes:
import h5py
import matplotlib.pyplot as plt
import numpy as np
# Load fragment data
fragment_data = []
with h5py.File('../my_first_csa_run/small_hydrocarbons_processed.h5', 'r') as f:
for i in range(min(1000, len(f['refcode_list']))): # First 1000 structures
refcode = f['refcode_list'][i].decode()
n_frags = f['n_fragments'][i]
if n_frags > 0:
# Get inertia eigenvalues for shape analysis
inertia_flat = f['fragment_inertia_eigvals'][i]
inertia_eigvals = inertia_flat.reshape(n_frags, 3)
for j in range(n_frags):
# Calculate shape descriptors
asphericity = inertia_eigvals[j, 2] - 0.5*(inertia_eigvals[j, 0] + inertia_eigvals[j, 1])
acylindricity = inertia_eigvals[j, 1] - inertia_eigvals[j, 0]
fragment_data.append({
'refcode': refcode,
'asphericity': asphericity,
'acylindricity': acylindricity
})
# Classify shapes
shapes = []
for frag in fragment_data:
if frag['asphericity'] < 0.1 and frag['acylindricity'] < 0.1:
shapes.append('spherical')
elif frag['acylindricity'] < 0.1:
shapes.append('oblate')
elif frag['asphericity'] > 0.3:
shapes.append('prolate')
else:
shapes.append('intermediate')
# Plot results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Shape distribution
shape_counts = pd.Series(shapes).value_counts()
ax1.pie(shape_counts.values, labels=shape_counts.index, autopct='%1.1f%%')
ax1.set_title('Molecular Fragment Shapes')
# Shape parameter space
asph = [frag['asphericity'] for frag in fragment_data]
acyl = [frag['acylindricity'] for frag in fragment_data]
scatter = ax2.scatter(asph, acyl, c=[{'spherical': 0, 'oblate': 1, 'prolate': 2, 'intermediate': 3}[s] for s in shapes],
alpha=0.6, cmap='viridis')
ax2.set_xlabel('Asphericity')
ax2.set_ylabel('Acylindricity')
ax2.set_title('Fragment Shape Parameter Space')
plt.tight_layout()
plt.savefig('my_first_csa_run/fragment_shapes.png', dpi=300, bbox_inches='tight')
plt.show()
print(f"🧩 Fragment shape analysis complete!")
print(f" • Analyzed {len(fragment_data)} molecular fragments")
print(f" • Shape distribution: {dict(shape_counts)}")
Step 5: Understanding Your Results
What You’ve Created
Your CSA analysis has generated:
Structure Database: 310 carefully selected, non-redundant crystal structures
Molecular Properties: Comprehensive geometric and chemical descriptors
Fragment Analysis: Rigid molecular fragment identification and characterization
Contact Networks: Detailed intermolecular interaction data
Crystal Properties: Unit cell, symmetry, and packing information
Key Insights from Your Data
From this analysis, you can now investigate:
Packing Preferences: Which space groups are most common for organic molecules?
Size-Property Relationships: How does molecular size affect crystal density?
Shape Analysis: What molecular shapes are most prevalent?
Packing Efficiency: How efficiently do organic molecules pack in crystals?
Common Issues and Solutions
“No structures found”
Cause: Filters too restrictive Solution: Increase molecular weight limit or add more chemical elements
"filters": {
"molecule_weight_limit": 600.0, // Increase from 400
"target_species": ["C", "H", "N", "O", "S", "F"] // Add sulfur and fluorine
}
“Out of memory” errors
Cause: Batch sizes too large for your system Solution: Reduce batch sizes
"extraction_batch_size": 16, // Reduce from 32
"post_extraction_batch_size": 8 // Reduce from 16
“Very slow processing”
Cause: CPU-only processing Solutions: - Enable GPU acceleration (see Installation) - Use smaller test dataset first - Consider cloud computing for large analyses
Next Steps: Expanding Your Analysis
Try Different Chemical Systems
Pharmaceutical molecules:
"filters": {
"target_species": ["C", "H", "N", "O", "S", "F", "Cl", "Br"],
"molecule_weight_limit": 600.0
}
Coordination compounds:
"filters": {
"target_species": ["C", "H", "N", "O", "Fe", "Cu", "Zn"],
"crystal_type": ["organometallic"]
}
Explore Advanced Features
Now that you have working CSA installation and data:
Learn the full data model - Data Model
Try advanced analysis workflows - Basic Analysis
Explore domain-specific tutorials - Tutorials
Optimize for your research - Configuration
Scale Up Your Research
Remove size limits for comprehensive surveys
Add performance optimizations for larger datasets
Integrate with your existing analysis workflows
Explore machine learning applications with your data
Congratulations!
🎉 You’ve successfully completed your first CSA analysis!
You now have: - ✅ A working CSA installation - ✅ Understanding of the five-stage pipeline - ✅ Your first crystal structure dataset - ✅ Basic analysis and visualization skills - ✅ Knowledge to expand to your research questions
Ready for More?
Continue your CSA journey:
Understand your data better → Basic Analysis
Learn advanced configuration → Configuration
Try domain-specific examples → Tutorials
Explore all CSA features → User Guide
Welcome to the CSA community! 🚀