Basic Analysis
This guide walks through fundamental analysis workflows using CSA, from running your first pipeline to extracting meaningful insights from crystal structure data. You’ll learn to execute the five-stage pipeline, access results, and perform common analyses.
Note
This guide assumes you have CSA installed and configured. See Installation and Configuration if you need setup help.
Getting Started: Your First Analysis
Complete Pipeline Workflow
The CSA pipeline transforms raw CSD queries into analysis-ready datasets through five sequential stages. Here’s how to execute the complete workflow:
Step 1: Create Configuration
{
"extraction": {
"data_directory": "../my_first_csa_run/",
"data_prefix": "small_hydrocarbons",
"actions": {
"get_refcode_families": true,
"cluster_refcode_families": true,
"get_unique_structures": true,
"get_structure_data": true,
"post_extraction_process": true
},
"filters": {
"structure_list": ["csd-unique"],
"crystal_type": ["homomolecular"],
"target_species": ["C", "H"],
"target_space_groups": ["P21/c","P-1"],
"target_z_prime_values": [1.0],
"molecule_weight_limit": 300.0,
"molecule_formal_charges": [0],
"unique_structures_clustering_method": "vdWFV"
},
"extraction_batch_size": 32,
"post_extraction_batch_size": 32
}
}
Step 2: Execute Pipeline
# Navigate to CSA directory
cd /path/to/crystal-structure-analysis
# Run complete pipeline
python src/csa_main.py --config my_first_csa_run.json
Step 3: Verify Output
After successful completion, you should see these files:
my_first_csa_run/
├── small_hydrocarbons_refcode_families.csv
├── small_hydrocarbons_clustered_families.csv
├── small_hydrocarbons_unique_structures.csv
├── small_hydrocarbons.h5
└── small_hydrocarbons_processed.h5
Understanding Pipeline Output
Each stage creates specific outputs that feed into subsequent analyses:
- Stage 1: Family Extraction →
small_hydrocarbons_refcode_families.csv Groups structures into chemical families
- Stage 2: Similarity Clustering →
small_hydrocarbons_clustered_families.csv Groups similar crystal packings within families
- Stage 3: Representative Selection →
small_hydrocarbons_unique_structures.csv Selects one representative per cluster
- Stage 4: Structure Data Extraction →
small_hydrocarbons_structures.h5 Raw structural data with coordinates and properties
- Stage 5: Feature Engineering →
small_hydrocarbons_structures_processed.h5 Analysis-ready data with computed features
Quick validation of your results:
import pandas as pd
import h5py
# Check family extraction results
families = pd.read_csv('../my_first_csa_run/small_hydrocarbons_refcode_families.csv')
print(f"Found {len(families)} structures across {families['family'].nunique()} families")
# Check processed data
with h5py.File('../my_first_csa_run/small_hydrocarbons_processed.h5', 'r') as f:
n_structures = len(f['refcode_list'])
print(f"Successfully processed {n_structures} representative structures")
Accessing Your Data
Loading and Exploring Results
CSA stores results in HDF5 format for efficient access. Here’s how to load and explore your data:
Basic Data Loading
import numpy as np
import pandas as pd
import h5py
def load_crystal_data(hdf5_file):
"""Load basic crystal properties into a pandas DataFrame."""
with h5py.File(hdf5_file, 'r') as f:
data = {
'refcode': f['refcode_list'][...].astype(str),
'space_group': [f['space_group'][i].decode() for i in range(len(f['refcode_list']))],
'z_prime': f['z_prime'][...],
'cell_volume': f['cell_volume'][...],
'cell_density': f['cell_density'][...],
'packing_coefficient': f['packing_coefficient'][...],
'n_atoms': f['n_atoms'][...],
'n_fragments': f['n_fragments'][...]
}
return pd.DataFrame(data)
# Load your data
crystal_df = load_crystal_data('../my_first_csa_run/small_hydrocarbons_processed.h5')
print(crystal_df.head())
Data Summary and Statistics
def summarize_dataset(crystal_df):
"""Generate a comprehensive dataset summary."""
print("="*50)
print("DATASET SUMMARY")
print("="*50)
print(f"Total structures: {len(crystal_df)}")
print(f"Unique space groups: {crystal_df['space_group'].nunique()}")
print("\nMolecular size distribution:")
print(crystal_df['n_atoms'].describe())
print("\nZ' distribution:")
print(crystal_df['z_prime'].value_counts().sort_index())
print("\nTop 10 space groups:")
print(crystal_df['space_group'].value_counts().head(10))
return crystal_df.describe()
summary_stats = summarize_dataset(crystal_df)
print(summary_stats)
Essential Analysis Workflows
Crystal Property Analysis
Distribution Analysis
import matplotlib.pyplot as plt
import seaborn as sns
def plot_crystal_properties(crystal_df):
"""Plot key crystal property distributions."""
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()
# Cell volume distribution
axes[0].hist(crystal_df['cell_volume'], bins=50, alpha=0.7, color='skyblue')
axes[0].set_xlabel('Cell Volume (ų)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Cell Volume Distribution')
# Density distribution
axes[1].hist(crystal_df['cell_density'], bins=50, alpha=0.7, color='lightgreen')
axes[1].set_xlabel('Density (g/cm³)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Crystal Density Distribution')
# Packing coefficient distribution
axes[2].hist(crystal_df['packing_coefficient'], bins=50, alpha=0.7, color='orange')
axes[2].set_xlabel('Packing coefficient (%)')
axes[2].set_ylabel('Frequency')
axes[2].set_title('Packing Coefficient Distribution')
# Z' distribution
z_counts = crystal_df['z_prime'].value_counts().sort_index()
axes[3].bar(z_counts.index, z_counts.values, color='coral')
axes[3].set_xlabel("Z'")
axes[3].set_ylabel('Count')
axes[3].set_title("Z' Distribution")
# Molecular size
axes[4].hist(crystal_df['n_atoms'], bins=50, alpha=0.7, color='gold')
axes[4].set_xlabel('Number of Atoms')
axes[4].set_ylabel('Frequency')
axes[4].set_title('Molecular Size Distribution')
# Fragment count
axes[5].hist(crystal_df['n_fragments'], bins=max(crystal_df['n_fragments'])+1,
alpha=0.7, color='mediumpurple')
axes[5].set_xlabel('Number of Fragments')
axes[5].set_ylabel('Frequency')
axes[5].set_title('Fragment Count Distribution')
plt.tight_layout()
plt.savefig('crystal_property_distributions.png', dpi=300, bbox_inches='tight')
plt.show()
plot_crystal_properties(crystal_df)
Correlation Analysis
def analyze_property_correlations(crystal_df):
"""Analyze correlations between crystal properties."""
# Select numeric properties for correlation analysis
numeric_cols = ['z_prime', 'cell_volume', 'cell_density', 'packing_coefficient',
'n_atoms', 'n_fragments']
corr_data = crystal_df[numeric_cols]
correlation_matrix = corr_data.corr()
# Create correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
square=True, fmt='.3f', cbar_kws={'label': 'Correlation'})
plt.title('Crystal Property Correlations')
plt.tight_layout()
plt.savefig('property_correlations.png', dpi=300, bbox_inches='tight')
plt.show()
# Identify strong correlations
print("Strong correlations (|r| > 0.5):")
for i in range(len(correlation_matrix.columns)):
for j in range(i+1, len(correlation_matrix.columns)):
corr_val = correlation_matrix.iloc[i, j]
if abs(corr_val) > 0.5:
prop1 = correlation_matrix.columns[i]
prop2 = correlation_matrix.columns[j]
print(f"{prop1} vs {prop2}: r = {corr_val:.3f}")
return correlation_matrix
correlations = analyze_property_correlations(crystal_df)
Fragment Analysis
Fragment Properties and Shapes
def analyze_molecular_fragments(hdf5_file):
"""Analyze molecular fragment properties and shapes."""
fragment_data = []
with h5py.File(hdf5_file, 'r') as f:
n_structures = len(f['refcode_list'])
for i in range(n_structures):
refcode = f['refcode_list'][i].decode()
n_frags = f['n_fragments'][i]
if n_frags > 0:
# Fragment formulas
formulas = f['fragment_formula'][i].astype(str)
# Centers of mass coordinates
com_flat = f['fragment_com_coords'][i]
com_coords = com_flat.reshape(n_frags, 3)
# Inertia eigenvalues for shape analysis
inertia_flat = f['fragment_inertia_eigvals'][i]
inertia_eigvals = inertia_flat.reshape(n_frags, 3)
for j in range(n_frags):
# Calculate shape descriptors
asphericity = inertia_eigvals[j, 2] - 0.5*(inertia_eigvals[j, 0] + inertia_eigvals[j, 1])
acylindricity = inertia_eigvals[j, 1] - inertia_eigvals[j, 0]
fragment_data.append({
'refcode': refcode,
'fragment_id': j,
'formula': formulas[j],
'com_x': com_coords[j, 0],
'com_y': com_coords[j, 1],
'com_z': com_coords[j, 2],
'inertia_1': inertia_eigvals[j, 0],
'inertia_2': inertia_eigvals[j, 1],
'inertia_3': inertia_eigvals[j, 2],
'asphericity': asphericity,
'acylindricity': acylindricity
})
return pd.DataFrame(fragment_data)
# Analyze fragments
fragment_df = analyze_molecular_fragments('../my_first_csa_run/small_hydrocarbons_processed.h5')
print(f"Analyzed {len(fragment_df)} fragments")
print("\nMost common fragment formulas:")
print(fragment_df['formula'].value_counts().head(10))
Fragment Shape Classification
def classify_fragment_shapes(fragment_df):
"""Classify molecular fragments by shape."""
def classify_shape(row):
"""Classify shape based on asphericity and acylindricity."""
if row['asphericity'] < 0.1 and row['acylindricity'] < 0.1:
return 'spherical'
elif row['acylindricity'] < 0.1:
return 'oblate'
elif row['asphericity'] > 0.3:
return 'prolate'
else:
return 'intermediate'
fragment_df['shape_class'] = fragment_df.apply(classify_shape, axis=1)
# Plot shape distribution
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
shape_counts = fragment_df['shape_class'].value_counts()
plt.pie(shape_counts.values, labels=shape_counts.index, autopct='%1.1f%%')
plt.title('Fragment Shape Distribution')
plt.subplot(1, 2, 2)
colors = {'spherical': 'blue', 'oblate': 'green', 'prolate': 'red', 'intermediate': 'orange'}
for shape in fragment_df['shape_class'].unique():
data = fragment_df[fragment_df['shape_class'] == shape]
plt.scatter(data['asphericity'], data['acylindricity'],
c=colors[shape], label=shape, alpha=0.6)
plt.xlabel('Asphericity')
plt.ylabel('Acylindricity')
plt.title('Shape Parameter Space')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('fragment_shapes.png', dpi=300, bbox_inches='tight')
plt.show()
print("Shape classification results:")
print(fragment_df['shape_class'].value_counts())
return fragment_df
fragment_df = classify_fragment_shapes(fragment_df)
Intermolecular Contact Analysis
Contact Detection and Classification
def analyze_intermolecular_contacts(hdf5_file):
"""Analyze intermolecular contacts and hydrogen bonds."""
contact_data = []
with h5py.File(hdf5_file, 'r') as f:
n_structures = len(f['refcode_list'])
for i in range(n_structures):
refcode = f['refcode_list'][i].decode()
n_contacts = f['inter_cc_n_contacts'][i]
atom_symbols = f['atom_symbol'][i].astype(str)
if n_contacts > 0:
# Contact information
central_atoms = atom_symbols[f['inter_cc_central_atom_symbol'][i]]
contact_atoms = atom_symbols[f['inter_cc_contact_atom_symbol'][i]]
lengths = f['inter_cc_length'][i]
is_hbond = f['inter_cc_is_hbond'][i]
for j in range(n_contacts):
contact_data.append({
'refcode': refcode,
'central_atom': central_atoms[j],
'contact_atom': contact_atoms[j],
'length': lengths[j],
'is_hbond': bool(is_hbond[j])
})
return pd.DataFrame(contact_data)
# Analyze contacts
contact_df = analyze_intermolecular_contacts('../my_first_csa_run/small_hydrocarbons_processed.h5')
print(f"Found {len(contact_df)} intermolecular contacts")
print(f"Hydrogen bonds: {contact_df['is_hbond'].sum()} ({contact_df['is_hbond'].mean()*100:.1f}%)")
Contact Distance Analysis
def plot_contact_analysis(contact_df):
"""Plot contact distance distributions and types."""
# Separate hydrogen bonds from other contacts
hbonds = contact_df[contact_df['is_hbond'] == True]
other_contacts = contact_df[contact_df['is_hbond'] == False]
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Distance distributions
axes[0,0].hist(other_contacts['length'], bins=50, alpha=0.7,
label='Other contacts', color='blue', density=True)
axes[0,0].hist(hbonds['length'], bins=50, alpha=0.7,
label='Hydrogen bonds', color='red', density=True)
axes[0,0].set_xlabel('Contact Distance (Å)')
axes[0,0].set_ylabel('Density')
axes[0,0].set_title('Contact Distance Distributions')
axes[0,0].legend()
axes[0,0].grid(True, alpha=0.3)
# Contact type frequencies
contact_types = []
for _, row in contact_df.iterrows():
contact_type = f"{row['central_atom']}-{row['contact_atom']}"
contact_types.append(contact_type)
contact_type_df = pd.DataFrame({'type': contact_types, 'distance': contact_df['length']})
top_types = contact_type_df['type'].value_counts().head(10)
axes[0,1].barh(range(len(top_types)), top_types.values)
axes[0,1].set_yticks(range(len(top_types)))
axes[0,1].set_yticklabels(top_types.index)
axes[0,1].set_xlabel('Count')
axes[0,1].set_title('Most Common Contact Types')
# Box plot of distances by contact type
top_types_list = top_types.index[:8].tolist()
filtered_df = contact_type_df[contact_type_df['type'].isin(top_types_list)]
if len(filtered_df) > 0:
sns.boxplot(data=filtered_df, x='type', y='distance', ax=axes[1,0])
axes[1,0].set_xticklabels(axes[1,0].get_xticklabels(), rotation=45)
axes[1,0].set_title('Distance Distributions by Contact Type')
# H-bond vs other contact statistics
stats_data = {
'Contact Type': ['Hydrogen Bonds', 'Other Contacts'],
'Count': [len(hbonds), len(other_contacts)],
'Mean Distance': [hbonds['length'].mean() if len(hbonds) > 0 else 0,
other_contacts['length'].mean()],
'Std Distance': [hbonds['length'].std() if len(hbonds) > 0 else 0,
other_contacts['length'].std()]
}
stats_df = pd.DataFrame(stats_data)
axes[1,1].axis('tight')
axes[1,1].axis('off')
table = axes[1,1].table(cellText=stats_df.round(3).values,
colLabels=stats_df.columns,
cellLoc='center',
loc='center')
table.auto_set_font_size(False)
table.set_fontsize(10)
axes[1,1].set_title('Contact Statistics Summary')
plt.tight_layout()
plt.savefig('contact_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
# Print summary statistics
print("\nContact Analysis Summary:")
print(f"Total contacts: {len(contact_df)}")
print(f"Hydrogen bonds: {len(hbonds)} ({len(hbonds)/len(contact_df)*100:.1f}%)")
print(f"Average contact distance: {contact_df['length'].mean():.2f} ± {contact_df['length'].std():.2f} Å")
if len(hbonds) > 0:
print(f"Average H-bond distance: {hbonds['length'].mean():.2f} ± {hbonds['length'].std():.2f} Å")
if len(other_contacts) > 0:
print(f"Average other contact distance: {other_contacts['length'].mean():.2f} ± {other_contacts['length'].std():.2f} Å")
plot_contact_analysis(contact_df)
Advanced Analysis Examples
Comparative Studies
Space Group Effects
def compare_space_groups(crystal_df, contact_df, top_n=5):
"""Compare crystal properties across different space groups."""
# Focus on most common space groups
top_sgs = crystal_df['space_group'].value_counts().head(top_n).index
sg_data = crystal_df[crystal_df['space_group'].isin(top_sgs)]
# Statistical comparison
print("Space Group Comparison:")
print("="*50)
for sg in top_sgs:
sg_subset = sg_data[sg_data['space_group'] == sg]
print(f"\n{sg} (n={len(sg_subset)}):")
print(f" Volume: {sg_subset['cell_volume'].mean():.1f} ± {sg_subset['cell_volume'].std():.1f} ų")
print(f" Density: {sg_subset['cell_density'].mean():.2f} ± {sg_subset['cell_density'].std():.2f} g/cm³")
print(f" Z': {sg_subset['z_prime'].mode()[0] if len(sg_subset['z_prime'].mode()) > 0 else 'N/A'} (mode)")
# Visualize comparisons
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Volume comparison
sns.boxplot(data=sg_data, x='space_group', y='cell_volume', ax=axes[0])
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=45)
axes[0].set_title('Cell Volume by Space Group')
# Density comparison
sns.boxplot(data=sg_data, x='space_group', y='cell_density', ax=axes[1])
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=45)
axes[1].set_title('Density by Space Group')
# Z' distribution
zprime_counts = sg_data.groupby(['space_group', 'z_prime']).size().unstack(fill_value=0)
zprime_counts.plot(kind='bar', ax=axes[2], stacked=True)
axes[2].set_title("Z' Distribution by Space Group")
axes[2].set_xticklabels(axes[2].get_xticklabels(), rotation=45)
axes[2].legend(title="Z'")
plt.tight_layout()
plt.savefig('space_group_comparison.png', dpi=300, bbox_inches='tight')
plt.show()
compare_space_groups(crystal_df, contact_df)
Z’ Effects Analysis
def analyze_zprime_effects(crystal_df, contact_df):
"""Analyze how Z' affects crystal properties and intermolecular contacts."""
# Merge contact data with crystal data
contact_with_zprime = contact_df.merge(
crystal_df[['refcode', 'z_prime']], on='refcode'
)
print("Z' Effects Analysis:")
print("="*50)
# Crystal property effects
zprime_stats = crystal_df.groupby('z_prime').agg({
'cell_volume': ['mean', 'std', 'count'],
'cell_density': ['mean', 'std'],
'n_atoms': ['mean', 'std']
}).round(2)
print("\nCrystal Properties by Z':")
print(zprime_stats)
# Contact effects
contact_stats = contact_with_zprime.groupby('z_prime').agg({
'length': ['mean', 'std', 'count'],
'is_hbond': 'mean'
}).round(3)
print("\nContact Properties by Z':")
print(contact_stats)
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Density vs Z'
sns.boxplot(data=crystal_df, x='z_prime', y='cell_density', ax=axes[0,0])
axes[0,0].set_title("Density vs Z'")
# Volume vs Z'
sns.boxplot(data=crystal_df, x='z_prime', y='cell_volume', ax=axes[0,1])
axes[0,1].set_title("Volume vs Z'")
# Contact distances vs Z'
sns.boxplot(data=contact_with_zprime, x='z_prime', y='length', ax=axes[1,0])
axes[1,0].set_title("Contact Distances vs Z'")
# H-bond frequency vs Z'
hbond_freq = contact_with_zprime.groupby('z_prime')['is_hbond'].mean()
axes[1,1].bar(hbond_freq.index, hbond_freq.values)
axes[1,1].set_xlabel("Z'")
axes[1,1].set_ylabel('H-bond Frequency')
axes[1,1].set_title("H-bond Frequency vs Z'")
plt.tight_layout()
plt.savefig('zprime_effects.png', dpi=300, bbox_inches='tight')
plt.show()
analyze_zprime_effects(crystal_df, contact_df)
Data Export and Sharing
Export Results
CSV Export for External Analysis
def export_analysis_results(crystal_df, fragment_df, contact_df, output_dir='analysis_exports'):
"""Export analysis results for external use."""
import os
os.makedirs(output_dir, exist_ok=True)
# Crystal properties summary
crystal_summary = crystal_df.groupby('space_group').agg({
'cell_volume': ['mean', 'std', 'min', 'max'],
'cell_density': ['mean', 'std', 'min', 'max'],
'n_atoms': ['mean', 'std', 'min', 'max'],
'n_fragments': ['mean', 'std', 'min', 'max']
}).round(3)
crystal_summary.to_csv(f'{output_dir}/crystal_properties_by_space_group.csv')
# Fragment analysis summary
if len(fragment_df) > 0:
fragment_summary = fragment_df.groupby('formula').agg({
'asphericity': ['mean', 'std'],
'acylindricity': ['mean', 'std'],
'inertia_1': ['mean', 'std'],
'inertia_2': ['mean', 'std'],
'inertia_3': ['mean', 'std']
}).round(3)
fragment_summary.to_csv(f'{output_dir}/fragment_properties_by_formula.csv')
# Contact statistics
if len(contact_df) > 0:
contact_summary = contact_df.groupby(['central_atom', 'contact_atom']).agg({
'length': ['mean', 'std', 'min', 'max', 'count'],
'is_hbond': 'mean'
}).round(3)
contact_summary.to_csv(f'{output_dir}/contact_statistics.csv')
# Export raw data
crystal_df.to_csv(f'{output_dir}/crystal_data.csv', index=False)
if len(fragment_df) > 0:
fragment_df.to_csv(f'{output_dir}/fragment_data.csv', index=False)
if len(contact_df) > 0:
contact_df.to_csv(f'{output_dir}/contact_data.csv', index=False)
print(f"Analysis results exported to {output_dir}/")
print("Files created:")
for file in os.listdir(output_dir):
print(f" - {file}")
export_analysis_results(crystal_df, fragment_df, contact_df, output_dir='../my_first_csa_run/analysis_exports/')
Create Analysis Report
def generate_analysis_report(crystal_df, fragment_df, contact_df, output_file='analysis_report.txt'):
"""Generate a comprehensive analysis report."""
with open(output_file, 'w') as f:
f.write("CSA Analysis Report\n")
f.write("=" * 50 + "\n\n")
# Dataset overview
f.write("DATASET OVERVIEW\n")
f.write("-" * 20 + "\n")
f.write(f"Total structures analyzed: {len(crystal_df)}\n")
f.write(f"Unique space groups: {crystal_df['space_group'].nunique()}\n")
f.write(f"Temperature range: {crystal_df['temperature'].min():.0f} - {crystal_df['temperature'].max():.0f} K\n")
f.write(f"Total molecular fragments: {len(fragment_df)}\n")
f.write(f"Total intermolecular contacts: {len(contact_df)}\n\n")
# Crystal properties
f.write("CRYSTAL PROPERTIES\n")
f.write("-" * 20 + "\n")
f.write(f"Average cell volume: {crystal_df['cell_volume'].mean():.1f} ± {crystal_df['cell_volume'].std():.1f} ų\n")
f.write(f"Average density: {crystal_df['cell_density'].mean():.2f} ± {crystal_df['cell_density'].std():.2f} g/cm³\n")
f.write(f"Average molecular size: {crystal_df['n_atoms'].mean():.1f} ± {crystal_df['n_atoms'].std():.1f} atoms\n\n")
# Most common space groups
f.write("TOP SPACE GROUPS\n")
f.write("-" * 20 + "\n")
top_sgs = crystal_df['space_group'].value_counts().head(10)
for sg, count in top_sgs.items():
f.write(f"{sg}: {count} structures ({count/len(crystal_df)*100:.1f}%)\n")
f.write("\n")
# Contact analysis
if len(contact_df) > 0:
hbonds = contact_df[contact_df['is_hbond'] == True]
f.write("INTERMOLECULAR CONTACTS\n")
f.write("-" * 25 + "\n")
f.write(f"Total contacts: {len(contact_df)}\n")
f.write(f"Hydrogen bonds: {len(hbonds)} ({len(hbonds)/len(contact_df)*100:.1f}%)\n")
f.write(f"Average contact distance: {contact_df['length'].mean():.2f} ± {contact_df['length'].std():.2f} Å\n")
if len(hbonds) > 0:
f.write(f"Average H-bond distance: {hbonds['length'].mean():.2f} ± {hbonds['length'].std():.2f} Å\n")
f.write("\n")
# Fragment analysis
if len(fragment_df) > 0:
f.write("FRAGMENT ANALYSIS\n")
f.write("-" * 20 + "\n")
f.write(f"Average asphericity: {fragment_df['asphericity'].mean():.3f} ± {fragment_df['asphericity'].std():.3f}\n")
f.write(f"Average acylindricity: {fragment_df['acylindricity'].mean():.3f} ± {fragment_df['acylindricity'].std():.3f}\n")
if 'shape_class' in fragment_df.columns:
f.write("\nShape distribution:\n")
for shape, count in fragment_df['shape_class'].value_counts().items():
f.write(f" {shape}: {count} ({count/len(fragment_df)*100:.1f}%)\n")
print(f"Analysis report saved to {output_file}")
generate_analysis_report(crystal_df, fragment_df, contact_df, output_file='../my_first_csa_run/analysis_exports/analysis_report.txt')
Troubleshooting and Optimization
Performance Monitoring
import time
import psutil
import os
def monitor_analysis_performance(analysis_function, *args, **kwargs):
"""Monitor memory and time usage of analysis functions."""
# Initial memory measurement
process = psutil.Process(os.getpid())
initial_memory = process.memory_info().rss / 1024 / 1024 # MB
# Time the analysis
start_time = time.time()
result = analysis_function(*args, **kwargs)
end_time = time.time()
# Final memory measurement
final_memory = process.memory_info().rss / 1024 / 1024 # MB
# Report performance
print(f"Analysis Performance:")
print(f" Execution time: {end_time - start_time:.2f} seconds")
print(f" Initial memory: {initial_memory:.1f} MB")
print(f" Final memory: {final_memory:.1f} MB")
print(f" Memory increase: {final_memory - initial_memory:.1f} MB")
return result
# Example usage
crystal_df = monitor_analysis_performance(load_crystal_data,
'../my_first_csa_run/small_hydrocarbons_processed.h5')
Common Issues and Solutions
Data Access Problems
def validate_data_file(hdf5_file):
"""Check if HDF5 file is valid and contains expected data."""
try:
with h5py.File(hdf5_file, 'r') as f:
# Check required datasets
required_datasets = ['refcode_list', 'space_group', 'cell_volume', 'cell_density']
missing_datasets = [ds for ds in required_datasets if ds not in f]
if missing_datasets:
print(f"ERROR: Missing datasets: {missing_datasets}")
return False
n_structures = len(f['refcode_list'])
print(f"✓ File is valid with {n_structures} structures")
# Check for common optional datasets
optional_datasets = ['n_fragments', 'inter_cc_n_contacts', 'fragment_formula']
available_optional = [ds for ds in optional_datasets if ds in f]
print(f"✓ Available optional datasets: {available_optional}")
return True
except Exception as e:
print(f"ERROR reading file: {e}")
return False
# Validate your data file
validate_data_file('../my_first_csa_run/small_hydrocarbons_processed.h5')
Memory Issues with Large Datasets
def load_data_efficiently(hdf5_file, chunk_size=1000):
"""Load large datasets efficiently using chunking."""
results = []
with h5py.File(hdf5_file, 'r') as f:
n_structures = len(f['refcode_list'])
for start in range(0, n_structures, chunk_size):
end = min(start + chunk_size, n_structures)
chunk_data = {
'refcode': f['refcode_list'][start:end].astype(str),
'cell_volume': f['cell_volume'][start:end],
'cell_density': f['cell_density'][start:end]
}
results.append(pd.DataFrame(chunk_data))
return pd.concat(results, ignore_index=True)
# Use for very large datasets
# crystal_df = load_data_efficiently('your_large_dataset.h5')
Next Steps
With basic analysis mastery, you’re ready to:
Explore Domain-Specific Tutorials: Learn analysis techniques for your research area - ../tutorials/pharmaceutical_analysis for drug development - ../tutorials/materials_science for materials research - ../tutorials/organic_chemistry for synthetic chemistry
Try Advanced Workflows: Experiment with sophisticated analysis techniques - advanced_features for complex analysis workflows - custom_workflows for specialized research questions
Scale Your Analysis: Apply techniques to larger datasets - batch_processing for high-throughput analysis - ../technical_details/performance for optimization strategies
Share and Collaborate: Export results for publication and collaboration - data_export for sharing data and results - Examples for ready-to-run analysis scripts
Develop Custom Analysis: Create your own analysis functions - API Reference for programmatic access - ../technical_details/architecture for extending CSA
See Also
Tutorials : Step-by-step tutorials for specific analyses Examples : Ready-to-run analysis examples Data Model : Understanding CSA’s data organization Configuration : Advanced configuration strategies ../technical_details/performance : Performance optimization techniques