Your First Analysis
Note
Duration: ~45 minutes | Prerequisites: CSA installation complete
Downloads: benzene_tutorial.json | benzene_tutorial_families.csv
Welcome to your first hands-on experience with Crystal Structure Analysis (CSA)! This tutorial will guide you through the complete five-stage CSA pipeline using a focused dataset of benzene structures from the CSD.
Learning Objectives
By the end of this tutorial, you will:
Successfully execute all five stages of the CSA pipeline
Understand the data flow between pipeline stages
Navigate and interpret CSA output files
Validate results and troubleshoot common issues
Customize basic configuration parameters for your research
Prerequisites
Before starting, ensure you have:
✅ CSA installed and tested (Installation)
✅ Valid CCDC license and CSD database access
✅ At least 8GB available disk space
✅ Basic familiarity with command line and Python
Tip
If you encounter issues, check the troubleshooting section at the bottom of this tutorial.
Tutorial Overview
We’ll analyze benzene structures from the CSD to demonstrate CSA’s complete workflow. This tutorial uses a pre-defined family list to focus on a manageable dataset perfect for learning:
Focused dataset: Benzene family (BENZEN) structures from CSD
Clear relationships: All structures share the benzene core
Manageable size: ~25-30 structures for reasonable processing time
Well-characterized: Extensively studied polymorphic system
The five pipeline stages will transform our initial family list into rich, analysis-ready data:
Family Extraction - Normally catalogs every refcode family in the CSD; skipped in this tutorial because we supply a pre-defined benzene family list
Similarity Clustering - Group benzene structures by crystal packing similarity
Representative Selection - Choose optimal structures from each cluster
Data Extraction - Extract atomic coordinates, bonds, and properties
Feature Engineering - Compute geometric descriptors and contact maps
Understanding the CSA Pipeline
Before diving into the tutorial, it’s important to understand how CSA actually works:
- Stage 1: Family Extraction (`get_refcode_families`)
Queries the entire CSD database to find all refcode families
No filters applied - this stage simply catalogs what’s available
Output: Complete list of family_id and refcode pairs
For focused studies, users can provide a pre-made CSV instead
- Stage 2: Similarity Clustering (`cluster_refcode_families`)
Groups structures within each family by 3D packing similarity
Applies most filters (except structure_list) to validate structures
Uses CCDC packing similarity algorithms
Output: Cluster assignments for each valid structure
- Stage 3: Representative Selection (`get_unique_structures`)
Selects one representative per cluster using vdWFV (van der Waals Fit Volume)
Chooses structure closest to cluster median packing density
Output: List of unique representative structures
- Stage 4: Data Extraction (`get_structure_data`)
Extracts detailed structural data from CSD or local CIF files
`structure_list` filter determines source: “csd-unique” (default) or “cif”
Processes representatives into HDF5 format
Output: Raw structural data with coordinates, bonds, contacts
- Stage 5: Feature Engineering (`post_extraction_process`)
Computes advanced descriptors and geometric features
GPU-accelerated tensor operations
Output: Analysis-ready feature datasets
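The Stage 3 rule described above (one representative per cluster, chosen closest to the cluster median) can be sketched in a few lines. This is a minimal illustration and not CSA's actual implementation: the function name `pick_representatives`, the tuple layout, the placeholder refcodes, and the metric values are all hypothetical, and the sketch assumes a single scalar metric per structure.

```python
from collections import defaultdict
from statistics import median


def pick_representatives(rows):
    """Pick, per cluster, the refcode whose metric is closest to the cluster median.

    rows: iterable of (cluster_id, refcode, metric) tuples (hypothetical layout).
    Ties are broken alphabetically by refcode so the result is deterministic.
    """
    clusters = defaultdict(list)
    for cluster_id, refcode, metric in rows:
        clusters[cluster_id].append((refcode, metric))

    representatives = {}
    for cluster_id, members in clusters.items():
        target = median(metric for _, metric in members)
        refcode, _ = min(members, key=lambda m: (abs(m[1] - target), m[0]))
        representatives[cluster_id] = refcode
    return representatives


# Made-up refcodes and metric values, for illustration only
rows = [
    (1, "XXXXXX", 0.30), (1, "XXXXXX01", 0.34), (1, "XXXXXX02", 0.31),
    (2, "XXXXXX03", 0.40), (2, "XXXXXX04", 0.45), (2, "XXXXXX05", 0.41),
]
print(pick_representatives(rows))
```

Selecting the member nearest the median, rather than the minimum or maximum, keeps an outlier in a cluster from being chosen as its representative.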
Step 1: Setup and Configuration
Create Tutorial Directory
First, let’s set up a dedicated workspace for this tutorial:
# Create tutorial directory
mkdir csa_first_analysis
cd csa_first_analysis
# Create subdirectories for organization
mkdir configs
mkdir scripts
mkdir results
Download Tutorial Files
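For this focused tutorial, we’ll use a pre-defined family list for the benzene family. Create a file named benzene_tutorial_refcode_families.csv in the data directory (../benzene_tutorial/, the data_directory value in the configuration below) with the following content. This is the file name that the loading code in Step 3 reads: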
family_id,refcode
BENZEN,BENZEN
BENZEN,BENZEN01
BENZEN,BENZEN02
BENZEN,BENZEN03
BENZEN,BENZEN04
BENZEN,BENZEN05
BENZEN,BENZEN06
BENZEN,BENZEN07
BENZEN,BENZEN08
BENZEN,BENZEN09
BENZEN,BENZEN10
BENZEN,BENZEN11
BENZEN,BENZEN12
BENZEN,BENZEN13
BENZEN,BENZEN14
BENZEN,BENZEN15
BENZEN,BENZEN16
BENZEN,BENZEN17
BENZEN,BENZEN18
BENZEN,BENZEN19
BENZEN,BENZEN20
BENZEN,BENZEN21
BENZEN,BENZEN22
BENZEN,BENZEN23
BENZEN,BENZEN24
BENZEN,BENZEN25
BENZEN,BENZEN26
BENZEN,BENZEN27
BENZEN,BENZEN28
Note
This CSV has the exact format that CSA expects: family_id,refcode with the benzene family containing all available BENZEN refcodes from the CSD.
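Typing 29 rows by hand is error-prone, so the list can also be generated and checked programmatically. The sketch below uses only the Python standard library. The helper names (`benzene_family_rows`, `write_family_csv`, `check_family_csv`) are hypothetical and not part of CSA; the header and refcode pattern are taken from the listing above.

```python
import csv
import io


def benzene_family_rows(last_index=28):
    """Return (family_id, refcode) pairs: BENZEN, BENZEN01, ..., BENZEN<last_index>."""
    refcodes = ["BENZEN"] + [f"BENZEN{i:02d}" for i in range(1, last_index + 1)]
    return [("BENZEN", refcode) for refcode in refcodes]


def write_family_csv(handle, rows):
    """Write the two-column family list with the header used in this tutorial."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["family_id", "refcode"])
    writer.writerows(rows)


def check_family_csv(handle):
    """Return the parsed rows; raise ValueError on a wrong header or duplicates."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != ["family_id", "refcode"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    parsed = list(reader)
    refcodes = [row["refcode"] for row in parsed]
    if len(refcodes) != len(set(refcodes)):
        raise ValueError("duplicate refcodes in family list")
    return parsed


# Round-trip through an in-memory buffer instead of touching the disk
buffer = io.StringIO()
write_family_csv(buffer, benzene_family_rows())
buffer.seek(0)
family_rows = check_family_csv(buffer)
print(len(family_rows), family_rows[0]["refcode"], family_rows[-1]["refcode"])
```

To write the real file, open it with `open(path, "w", newline="")` and pass that handle to `write_family_csv` in place of the buffer.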
Create Configuration File
Create a file named configs/benzene_tutorial.json with the following configuration:
{
  "extraction": {
    "data_directory": "../benzene_tutorial/",
    "data_prefix": "benzene_tutorial",
    "actions": {
      "get_refcode_families": false,
      "cluster_refcode_families": true,
      "get_unique_structures": true,
      "get_structure_data": true,
      "post_extraction_process": true
    },
    "filters": {
      "structure_list": ["csd-unique"],
      "crystal_type": ["homomolecular"],
      "target_species": ["C", "H"],
      "target_space_groups": ["P21/c", "Pbca"],
      "target_z_prime_values": [0.5],
      "molecule_weight_limit": 100.0,
      "molecule_formal_charges": [0],
      "unique_structures_clustering_method": "vdWFV"
    },
    "extraction_batch_size": 32,
    "post_extraction_batch_size": 32
  }
}
Configuration Explanation
Let’s understand the key parameters in our configuration:
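| Parameter | Value | Purpose |
|---|---|---|
| `get_refcode_families` | `false` | Skip CSD-wide family extraction (using pre-made list) |
| `structure_list` | `["csd-unique"]` | Use CSD database (not local CIF files) |
| `crystal_type` | `["homomolecular"]` | Single molecular species crystals |
| `target_species` | `["C", "H"]` | Simple hydrocarbons only |
| `target_space_groups` | `["P21/c", "Pbca"]` | Use only the two available space groups for the known benzene structures |
| `target_z_prime_values` | `[0.5]` | The available Z′ value |
| `molecule_weight_limit` | `100.0` | Focus on benzene (78 Da) and simple derivatives |
| `molecule_formal_charges` | `[0]` | Neutral molecules |
| `unique_structures_clustering_method` | `"vdWFV"` | Metric used to select the unique structure from a cluster |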
Note
These parameters create a focused, high-quality dataset perfect for learning CSA fundamentals. The filters are applied during clustering, not during family extraction.
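Before launching a run, it can help to confirm that the configuration parses and to see which stages it enables. The following is a rough sketch using the standard `json` module. The helper `enabled_stages` and the `STAGE_ACTIONS` list are hypothetical conveniences, not part of CSA; the action names and their order come from the pipeline description earlier in this tutorial, and only the `actions` part of the configuration is reproduced here.

```python
import json

# Action names in pipeline order, as listed in "Understanding the CSA Pipeline"
STAGE_ACTIONS = [
    "get_refcode_families",
    "cluster_refcode_families",
    "get_unique_structures",
    "get_structure_data",
    "post_extraction_process",
]


def enabled_stages(config):
    """Return the action names set to true, in pipeline order."""
    actions = config["extraction"]["actions"]
    missing = [name for name in STAGE_ACTIONS if name not in actions]
    if missing:
        raise KeyError(f"missing action flags: {missing}")
    return [name for name in STAGE_ACTIONS if actions[name]]


# Only the "actions" section of the tutorial configuration
config_text = """
{
  "extraction": {
    "actions": {
      "get_refcode_families": false,
      "cluster_refcode_families": true,
      "get_unique_structures": true,
      "get_structure_data": true,
      "post_extraction_process": true
    }
  }
}
"""
config = json.loads(config_text)
for name in enabled_stages(config):
    print("will run:", name)
```

For the real file, replace `json.loads(config_text)` with `json.load(open_file)`. Strict JSON does not allow a trailing comma after the last entry of an object, so `json.load` raises `json.JSONDecodeError` if one slips in while editing.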
Step 2: Pipeline Execution
Running the Complete Pipeline
Now let’s execute the CSA pipeline with our configuration:
# Navigate to CSA installation directory
cd /path/to/crystal-structure-analysis
# Run the pipeline (adjust path to your tutorial directory)
python src/csa_main.py --config /path/to/csa_first_analysis/configs/benzene_tutorial.json
Expected Progress Output
You should see output similar to this:
2025-05-04 17:21:30,846 - root - INFO - Loading configuration from csa_config.json
2025-05-04 17:21:30,846 - root - INFO - Starting extraction step...
2025-05-04 17:21:30,846 - crystal_analyzer - INFO - Starting data extraction pipeline...
2025-05-04 17:21:30,846 - crystal_analyzer - INFO - Clustering refcode families...
2025-05-04 17:21:56,171 - csd_operations - INFO - Saved clustered families to ..\benzene_tutorial\benzene_tutorial_refcode_families_clustered.csv
2025-05-04 17:21:56,171 - crystal_analyzer - INFO - Refcode families clustered into 23 groups.
2025-05-04 17:21:56,171 - crystal_analyzer - INFO - Selecting unique structures …
2025-05-04 17:21:58,029 - csd_operations - INFO - Saved unique structures to ..\benzene_tutorial\benzene_tutorial_refcode_families_unique.csv
2025-05-04 17:21:58,029 - crystal_analyzer - INFO - Unique structures selected: 2 structures across 1 families
2025-05-04 17:21:58,029 - crystal_analyzer - INFO - Extracting detailed structure data into ..\benzene_tutorial\benzene_tutorial.h5 …
2025-05-04 17:21:58,029 - structure_data_extractor - INFO - Overwriting existing HDF5 file: ..\benzene_tutorial\benzene_tutorial.h5
2025-05-04 17:21:58,037 - structure_data_extractor - INFO - 2 structures to extract (batch size 1000)
2025-05-04 17:21:58,037 - structure_data_extractor - INFO - Extracting batch 1 (size 2)
2025-05-04 17:21:59,893 - structure_data_extractor - INFO - Raw data extraction complete; HDF5 file closed.
2025-05-04 17:21:59,893 - crystal_analyzer - INFO - Detailed structure data extracted and saved to ..\benzene_tutorial\benzene_tutorial.h5
2025-05-04 17:21:59,893 - structure_post_extraction_processor - INFO - Removing existing processed file: ..\benzene_tutorial\benzene_tutorial_processed.h5
2025-05-04 17:21:59,906 - structure_post_extraction_processor - INFO - Found 2 structures to process.
2025-05-04 17:21:59,906 - structure_post_extraction_processor - INFO - Processing structures 1 to 2
2025-05-04 17:22:00,292 - structure_post_extraction_processor - INFO - Post-extraction fast processing complete.
2025-05-04 17:22:00,292 - crystal_analyzer - INFO - Data extraction completed in 0:00:29.445523
2025-05-04 17:22:00,292 - root - INFO - Data extraction completed successfully.
Performance Expectations
Expected performance for this tutorial:
- Stage 1 (not performed)
- Stage 2 (<2 minutes)
Groups structures with similar crystal packing
- Stage 3 (<1 minute)
Picks the best representative from each cluster
- Stage 4 (<1 minute)
Extracts atomic coordinates and basic properties
- Stage 5 (<1 minute)
Computes advanced molecular descriptors
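To compare a run against these expectations, stage durations can be read from the log timestamps. The sketch below assumes the `YYYY-MM-DD HH:MM:SS,mmm - logger - LEVEL - message` layout seen in the example output; the helper names are hypothetical, and the sample lines are shortened copies of lines from that output.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(line):
    """Parse the leading timestamp of a log line."""
    stamp = line.split(" - ", 1)[0]
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


def elapsed_seconds(start_line, end_line):
    """Seconds between the timestamps of two log lines."""
    delta = parse_log_time(end_line) - parse_log_time(start_line)
    return delta.total_seconds()


# Shortened lines from the example output in Step 2
clustering_start = "2025-05-04 17:21:30,846 - crystal_analyzer - INFO - Clustering refcode families..."
clustering_end = "2025-05-04 17:21:56,171 - csd_operations - INFO - Saved clustered families"
run_end = "2025-05-04 17:22:00,292 - root - INFO - Data extraction completed successfully."

print(f"Stage 2: {elapsed_seconds(clustering_start, clustering_end):.1f} s")
print(f"Whole run: {elapsed_seconds(clustering_start, run_end):.1f} s")
```

In the example run, clustering accounts for most of the roughly 29 seconds, which is consistent with Stage 2 having the largest time budget in the list above.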
Step 3: Exploring the Results
Output File Structure
After successful completion, the data directory (../benzene_tutorial/, set by data_directory in the configuration) should contain the following files, named as in the log output above:
benzene_tutorial/
├── benzene_tutorial_refcode_families.csv # Pre-made family list (input)
├── benzene_tutorial_refcode_families_clustered.csv # Stage 2 output
├── benzene_tutorial_refcode_families_unique.csv # Stage 3 output
├── benzene_tutorial.h5 # Stage 4 output
└── benzene_tutorial_processed.h5 # Stage 5 output
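A quick way to confirm that every stage wrote its output is to check for the files programmatically. This sketch uses `pathlib`. The helper `missing_outputs` is hypothetical, and the file-name suffixes follow the example log in Step 2 and the loading code later in this step; adjust them if your run writes different names.

```python
from pathlib import Path

# Suffixes appended to data_prefix, following the example log output
EXPECTED_SUFFIXES = [
    "_refcode_families.csv",            # input family list
    "_refcode_families_clustered.csv",  # Stage 2
    "_refcode_families_unique.csv",     # Stage 3
    ".h5",                              # Stage 4
    "_processed.h5",                    # Stage 5
]


def missing_outputs(data_directory, data_prefix):
    """Return the expected files that are absent or empty."""
    directory = Path(data_directory)
    expected = [directory / f"{data_prefix}{suffix}" for suffix in EXPECTED_SUFFIXES]
    return [p for p in expected if not p.is_file() or p.stat().st_size == 0]


if __name__ == "__main__":
    # Values from the tutorial configuration (data_directory, data_prefix)
    problems = missing_outputs("../benzene_tutorial", "benzene_tutorial")
    for path in problems:
        print(f"missing or empty: {path}")
    if not problems:
        print("all expected output files are present")
```

An empty file is treated the same as a missing one, since a stage that failed midway may leave a zero-byte file behind.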
Understanding CSV Outputs
1. Input Family List
import pandas as pd
# Load and examine the input family list
families_df = pd.read_csv('../benzene_tutorial/benzene_tutorial_refcode_families.csv')
print(f"Input structures: {len(families_df)}")
print(f"Families: {families_df['family_id'].nunique()}")
# Show the family structure
print(families_df.head(10))
Expected output:
Input structures: 29
Families: 1
family_id refcode
0 BENZEN BENZEN
1 BENZEN BENZEN01
2 BENZEN BENZEN02
3 BENZEN BENZEN03
4 BENZEN BENZEN04
5 BENZEN BENZEN05
6 BENZEN BENZEN06
7 BENZEN BENZEN07
8 BENZEN BENZEN08
9 BENZEN BENZEN09
2. Clustered Families (Stage 2)
# Load clustering results
clustered_df = pd.read_csv('../benzene_tutorial/benzene_tutorial_refcode_families_clustered.csv')
print(f"Structures after filtering: {len(clustered_df)}")
print(f"Total clusters formed: {clustered_df['cluster_id'].nunique()}")
# Analyze cluster sizes
cluster_sizes = clustered_df.groupby('cluster_id').size()
print(f"Average cluster size: {cluster_sizes.mean():.2f}")
print(f"Largest cluster: {cluster_sizes.max()} structures")
print("Cluster size distribution:")
print(cluster_sizes.value_counts().sort_index())
Expected output:
Structures after filtering: 23
Total clusters formed: 2
Average cluster size: 11.50
Largest cluster: 16 structures
Cluster size distribution:
7 1
16 1
3. Representative Structures (Stage 3)
# Load final structure selection
unique_df = pd.read_csv('../benzene_tutorial/benzene_tutorial_refcode_families_unique.csv')
print(f"Representative structures selected: {len(unique_df)}")
# Show selected representatives
print("Selected representative structures:")
print(unique_df[['family_id', 'refcode']].to_string(index=False))
Expected output:
Representative structures selected: 2
Selected representative structures:
family_id refcode
BENZEN BENZEN22
BENZEN BENZEN24
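The three CSV files can also be cross-checked against each other to validate the run. The sketch below is one possible set of sanity checks and not something CSA ships. The function name `check_pipeline_csvs` is hypothetical, it relies only on the `family_id`, `refcode`, and `cluster_id` columns used above, and the small demo frames contain made-up placeholder refcodes.

```python
import pandas as pd


def check_pipeline_csvs(families_df, clustered_df, unique_df):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    input_refcodes = set(families_df["refcode"])
    clustered_refcodes = set(clustered_df["refcode"])
    unique_refcodes = set(unique_df["refcode"])

    # Filtering can only remove structures, never add new ones
    if not clustered_refcodes <= input_refcodes:
        problems.append("clustered file contains refcodes missing from the input list")
    # Every representative must come from a clustered structure
    if not unique_refcodes <= clustered_refcodes:
        problems.append("a representative is not present in the clustered file")

    # Expect exactly one representative per (family, cluster) pair
    n_clusters = int(clustered_df.groupby("family_id")["cluster_id"].nunique().sum())
    if len(unique_df) != n_clusters:
        problems.append(f"{len(unique_df)} representatives for {n_clusters} clusters")
    return problems


# Made-up demo data with placeholder refcodes
demo_families = pd.DataFrame({"family_id": ["FAM"] * 4,
                              "refcode": ["FAM", "FAM01", "FAM02", "FAM03"]})
demo_clustered = pd.DataFrame({"family_id": ["FAM"] * 3,
                               "refcode": ["FAM", "FAM01", "FAM02"],
                               "cluster_id": [1, 1, 2]})
demo_unique = pd.DataFrame({"family_id": ["FAM", "FAM"],
                            "refcode": ["FAM01", "FAM02"]})
print(check_pipeline_csvs(demo_families, demo_clustered, demo_unique))
```

With the tutorial data, you would pass `families_df`, `clustered_df`, and `unique_df` as loaded above; 2 representatives for 2 clusters should yield an empty list.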
Congratulations!
🎉 Your first CSA data extraction is complete!
You have successfully:
✅ Executed the complete CSA pipeline from clustering to feature engineering
✅ Generated analysis-ready datasets with 2 representative benzene structures
✅ Created HDF5 files containing atomic coordinates, molecular descriptors, and contact maps
✅ Understood the data flow between all five pipeline stages
✅ Learned to interpret CSV outputs and validate results
What You’ve Accomplished
Your tutorial has produced:
2 representative benzene structures selected from 23 valid CSD entries
Complete structural data including atomic coordinates and bond connectivity
Advanced molecular descriptors like fragment properties and shape parameters
Intermolecular contact maps identifying hydrogen bonds and close contacts
Analysis-ready HDF5 datasets optimized for computational analysis
Next Steps: Analyzing Your Data
Now that you have working CSA datasets, it’s time to explore and analyze your results:
- Start with Data Access
📖 Basic Analysis → “Accessing Your Data” section
Learn how to load and navigate your HDF5 files, extract crystal properties, and understand the data structure CSA has created.
- Explore Analysis Workflows
📊 Basic Analysis → “Essential Analysis Workflows” section
Discover practical analysis patterns including property distributions, fragment analysis, and contact network exploration.
Recommended Learning Path
Immediate next step: Accessing Your Data to load and inspect your benzene dataset
Then explore: Crystal Property Analysis to visualize your results
Advanced analysis: Fragment Analysis to study benzene molecular shapes
Finally try: Contact Analysis to map intermolecular interactions
Ready for More?
Try different chemical systems → Modify your configuration to study other molecular families
Scale up your analysis → Remove size restrictions and analyze larger datasets
Explore domain-specific tutorials → ../tutorials/organic_chemistry for hydrocarbon-specific workflows
Learn advanced configuration → Configuration for research-optimized setups
Welcome to the CSA community! 🚀 You’re now ready to tackle real crystallographic research questions with confidence.