A Scientific Approach to Understanding Cannabis
For years, cannabis has been categorized using broad labels like Indica, Sativa, and Hybrid—but these classifications don’t fully capture a strain’s chemical makeup or effects. Without standardization, strain names are often inconsistent, leading to confusion for consumers and industry professionals alike.
A more precise way to classify cannabis is by its chemical composition, primarily its terpene profile. Terpenes are the aromatic compounds responsible for flavor, scent, and potential effects. By analyzing thousands of lab-tested strains, we’ve developed a terpene-based classification system that better represents cannabis varieties.
To create a meaningful classification system, we conducted a large-scale study using state-certified lab testing on thousands of cannabis samples. Here’s how:
Advanced Lab Testing: We used gas chromatography with mass spectrometry and flame ionization detection (GC-MS/FID) to measure terpenes in over 2,400 unique strains, identifying both which terpenes were present and at what concentration.
Data Refinement: The dataset was standardized and cleaned, focusing only on Type I (high-THC, low-CBD) strains to ensure consistency.
Principal Component Analysis (PCA): This method highlighted key terpenes that differentiate strains, showing which compounds drive the most variation.
Cluster Analysis: We grouped strains based on similar terpene profiles, revealing six (6) categories.
Breaking Strains into Clusters
Through K-Means Clustering, we identified six primary strain categories, each with distinct terpene profiles. These groupings help consumers understand strains not just by name, but by their actual chemical makeup.
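As a rough illustration of what k-means does with terpene profiles, here is a minimal NumPy sketch of the standard algorithm (Lloyd's iterations). It is not the code used in the study: the function name `kmeans`, the random initialization, and the synthetic four-terpene data are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm. Returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct samples chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its members
        # (an empty cluster keeps its previous centroid).
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # Final assignment so the labels match the returned centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids

# Synthetic stand-in data (NOT real lab results): 300 "strains", 4 "terpenes".
rng = np.random.default_rng(42)
true_centers = rng.uniform(0.0, 1.0, size=(6, 4))
X = np.vstack([c + rng.normal(0.0, 0.02, size=(50, 4)) for c in true_centers])

labels, centroids = kmeans(X, k=6)
print(np.bincount(labels, minlength=6))  # number of strains per category
```

Each row of `centroids` is the average terpene profile of one category, which is what lets a category be described by its chemistry rather than by strain names.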
----------
A principal component analysis (PCA) was used to reduce the complexity of the terpene dataset and highlight patterns. It creates new variables called principal components (PCs) that summarize the dataset by explaining the most variation in the data. This helps us understand the relationships between terpenes and identify key trends.
The scree plot shows how much variance each PC explains. The first five PCs account for 89.5% of the variation in the dataset, so they capture most of the chemical differences between strains.
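The numbers behind a scree plot can be sketched in a few lines of NumPy. This example assumes each terpene column is standardized before PCA (the study does not say which scaling it used) and runs on synthetic data, so its output will not match the 89.5% figure. The function name `pca_explained_variance` is hypothetical.

```python
import numpy as np

def pca_explained_variance(X):
    """Fraction of total variance explained by each principal component."""
    # Assumption: standardize each column to zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Singular values of the standardized matrix give the PC variances.
    s = np.linalg.svd(Z, compute_uv=False)
    variances = s**2 / (len(Z) - 1)
    return variances / variances.sum()

# Synthetic stand-in data (NOT real lab results): 200 samples, 8 correlated columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))

ratios = pca_explained_variance(X)   # heights of the scree plot bars
cumulative = np.cumsum(ratios)       # running total across PCs
print("Share explained by the first five PCs:", cumulative[4])
```

Reading `cumulative[4]` is the same operation as adding up the first five bars of the scree plot.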
----------
The heatmap shows how much each terpene contributes to each PC. More intense colors mean a stronger correlation, and the sign of a value shows whether a terpene rises or falls along that PC.
This highlights which terpenes are the most influential for each component. Terpenes were ranked based on how much they contribute to the variance explained by all PCs. The top-ranked terpenes are the most important for understanding differences between the samples.
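The values in a loading heatmap can be computed as the correlation between each terpene and each PC. The sketch below assumes standardized data, for which that correlation equals the eigenvector entry scaled by the square root of the PC's variance. The terpene names are used only as illustrative column labels on synthetic data, and `pca_loadings` is a hypothetical helper, not the study's code.

```python
import numpy as np

def pca_loadings(X):
    """Correlation between each original column and each PC.

    Returns an array of shape (n_columns, n_pcs).
    """
    # Assumption: columns are standardized before PCA.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    eigenvalues = s**2 / (len(Z) - 1)
    # Scale each eigenvector (column of vt.T) by the standard deviation of its PC.
    return vt.T * np.sqrt(eigenvalues)

# Illustrative labels on synthetic data (NOT real lab results).
terpenes = ["myrcene", "limonene", "caryophyllene", "pinene", "linalool", "terpinolene"]
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))

loadings = pca_loadings(X)
# Most influential terpene for each component: largest absolute loading.
top_per_pc = [terpenes[i] for i in np.abs(loadings).argmax(axis=0)]
for pc, name in enumerate(top_per_pc, start=1):
    print(f"PC{pc}: {name}")
```

Plotting `loadings` with a diverging color scale would reproduce the kind of heatmap described above, with color intensity tracking the strength of the correlation and hue tracking its sign.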
----------
The Silhouette Method helps identify the best number of clusters.
This method, which measures how well samples fit within their clusters, found the best grouping at three clusters. A second candidate grouping at six clusters was also noted, offering an alternative with finer granularity.
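To make the procedure concrete: for each sample, the silhouette coefficient is (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the mean distance to the nearest other cluster. The sketch below scores several values of k on synthetic data. The helper names and the compact k-means routine are assumptions for illustration, not the study's implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Compact Lloyd's algorithm; returns cluster labels only."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)

def silhouette_samples(X, labels):
    """Silhouette coefficient of every sample, each in [-1, 1]."""
    D = np.linalg.norm(X[:, None] - X[None], axis=2)  # pairwise distances
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # convention: a singleton cluster scores 0
        a = D[i, own].sum() / (own.sum() - 1)  # excludes the sample itself
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Synthetic stand-in data (NOT real lab results): three loose groups.
rng = np.random.default_rng(7)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([c + rng.normal(0.0, 0.3, size=(40, 2)) for c in centers])

# Average silhouette score for each candidate number of clusters.
scores = {k: silhouette_samples(X, kmeans(X, k)).mean() for k in range(2, 9)}
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(k, round(float(score), 3))
```

The k with the highest average score is the method's first choice. A second local peak at a larger k, like the six-cluster grouping noted above, marks a reasonable alternative when more detail is wanted.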
----------
A Silhouette plot visually shows how well samples fit into their clusters. In this analysis, the six-cluster solution distributed strains more evenly across clusters than the three-cluster solution did.
All six clusters had Silhouette Coefficients above the average (marked by the vertical red line), indicating that most samples fit their assigned clusters well, with clear separation and compactness.
----------
The Euclidean Distance Heatmap illustrates how closely related strains are to one another. Because a screen cannot legibly display every strain in the dataset, a random subset of 100 strains from each of the six categories was selected.
Pixel intensity reflects similarity: brighter pixels indicate more similar terpene profiles, and darker pixels indicate greater differences. Strains are grouped by category and ordered by proximity to their cluster centroids.
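The matrix behind such a heatmap can be assembled as follows. This is a sketch under stated assumptions: `ordered_distance_matrix` is a hypothetical helper, centroids are taken as the mean profile of each full category, and the data are synthetic.

```python
import numpy as np

def ordered_distance_matrix(X, labels, per_cluster=100, seed=0):
    """Pairwise Euclidean distances for a per-category random subset,
    grouped by category and sorted by distance to the category centroid."""
    rng = np.random.default_rng(seed)
    order = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        # Random subset of at most `per_cluster` strains from this category.
        if len(idx) > per_cluster:
            idx = rng.choice(idx, size=per_cluster, replace=False)
        # Assumption: the centroid is the mean profile of the full category.
        centroid = X[labels == c].mean(axis=0)
        to_centroid = np.linalg.norm(X[idx] - centroid, axis=1)
        order.extend(idx[np.argsort(to_centroid)])  # closest to centroid first
    order = np.array(order)
    S = X[order]
    D = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)
    return D, labels[order]

# Synthetic stand-in data (NOT real lab results): 3 categories of 150 strains.
rng = np.random.default_rng(3)
labels = np.repeat(np.arange(3), 150)
X = rng.normal(size=(450, 5)) + labels[:, None]

D, ordered_labels = ordered_distance_matrix(X, labels, per_cluster=100)
print(D.shape)
```

Because brighter pixels mean more similar profiles, a plot of `D` would map small distances to bright colors, for example with a reversed color scale. With strains grouped by category, well-separated categories show up as bright square blocks along the diagonal.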
OBSERVATIONS
----------
The Parallel Coordinate Plot is a tool for visualizing complex, high-dimensional data.
Each line represents a strain. The axes are terpenes, and the line’s position shows each strain’s values for these terpenes. The colors represent the different strain categories.
INSIGHTS
Patterns emerge between terpene levels and strain categories.
For example, samples with high myrcene levels and low limonene levels tend to belong to the same category. This suggests a possible inverse relationship between these two terpenes.
----------
To determine whether our six terpene-based classifications align with traditional Sativa, Indica, and Hybrid labels, we conducted an additional analysis:
Key Findings:
By combining this statistical validation with our Silhouette Analysis, we derived a Venn diagram that shows the overlap between traditional classifications and our terpene-based categories.