# Results

## Experimental Results

### Dataset

The experiments use the Amazon Reviews 2023 dataset from Hugging Face, specifically the Appliances category:
- Total records: 2,128,605 reviews
- Verified purchases: Only reviews flagged as verified purchases are retained
- Transactions: 8,655 transactions (after preprocessing with min_transaction_size=2 and filtering)
- Unique items: 18,384 unique products (using Parent ASIN)
- Average transaction size: 2.80 items
- Preprocessing: Transactions filtered by minimum size and infrequent items removed
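The preprocessing pipeline above can be sketched as follows. This is an illustrative reconstruction, not the notebook's actual code: the field names (`user_id`, `parent_asin`, `verified_purchase`) mirror the Amazon Reviews 2023 schema, and `min_item_count` is an assumed name for the infrequent-item cutoff.

```python
from collections import Counter

def preprocess(reviews, min_transaction_size=2, min_item_count=2):
    """Build market-basket transactions from raw review records."""
    # 1. Keep verified purchases only.
    verified = [r for r in reviews if r["verified_purchase"]]

    # 2. Group each user's purchases into one transaction (a "cart"),
    #    identifying products by Parent ASIN.
    carts = {}
    for r in verified:
        carts.setdefault(r["user_id"], set()).add(r["parent_asin"])

    # 3. Drop infrequent items, then drop transactions that fall
    #    below the minimum size.
    counts = Counter(item for items in carts.values() for item in items)
    transactions = []
    for items in carts.values():
        kept = {i for i in items if counts[i] >= min_item_count}
        if len(kept) >= min_transaction_size:
            transactions.append(kept)
    return transactions
```

Filtering items before enforcing the minimum transaction size matters: removing rare items can shrink a cart below the size threshold, so the order of these two steps changes the final transaction count.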
### Data Exploration
Comprehensive data exploration visualization showing dataset statistics, user/product patterns, transaction configurations, item frequency analysis, and transaction size distributions. This analysis guides preprocessing decisions and parameter selection.
### Performance Comparison

The implementation includes comprehensive runtime tracking and comparison capabilities:

#### Runtime Metrics Tracked

- Total Runtime: Complete algorithm execution time (using `time.perf_counter()` for microsecond-level precision)
- Frequent Itemsets Time: Time spent mining frequent itemsets
- Association Rules Time: Time spent generating association rules
- Improved Apriori Specific: Initial scan time, candidate generation time, support calculation time
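Per-phase timing of this kind can be captured with a small helper built on `time.perf_counter()`. The `PhaseTimer` class below is an illustrative sketch, not the notebook's actual tracking code:

```python
import time

class PhaseTimer:
    """Accumulate per-phase wall-clock times using time.perf_counter()."""

    def __init__(self):
        self.phases = {}  # phase name -> elapsed seconds

    def time_phase(self, name, fn, *args, **kwargs):
        """Run fn, record its elapsed time under `name`, return its result."""
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.phases[name] = time.perf_counter() - start
        return result

    @property
    def total(self):
        """Total runtime as the sum of all recorded phases."""
        return sum(self.phases.values())
```

A typical use would be `itemsets = timer.time_phase("frequent_itemsets", mine_itemsets, transactions)` followed by `timer.time_phase("association_rules", generate_rules, itemsets)`, giving exactly the phase breakdown listed above.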
#### Comparison Framework

The comparison cell in the notebook provides:
- Side-by-side performance metrics for all three algorithms
- Speedup calculations (Improved Apriori vs Apriori, FP-Growth vs Apriori, FP-Growth vs Improved Apriori)
- Itemset count comparisons
- Detailed itemset matching verification
- Runtime vs minimum support threshold analysis
### Key Findings

#### Execution Time

- Improved Apriori is approximately 10x faster than traditional Apriori for frequent itemset mining
- FP-Growth is significantly faster than both Apriori variants (175x faster than Apriori, 17x faster than Improved Apriori)
- Speedup varies based on dataset characteristics and support threshold
- Association rule generation time is similar across all algorithms
- The single database scan optimization in Improved Apriori eliminates repeated scans
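One plausible reading of the single-scan, intersection-based counting mentioned above is a TID-list (transaction-ID list) approach: one pass over the database records which transactions contain each item, and the support of any candidate itemset is then the size of the intersection of its items' TID sets. This sketch is illustrative; the notebook's Improved Apriori may differ in detail:

```python
def build_tid_lists(transactions):
    """Single database scan: map each item to the set of IDs of the
    transactions that contain it."""
    tids = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tids.setdefault(item, set()).add(tid)
    return tids

def support(itemset, tid_lists, n_transactions):
    """Relative support of an itemset via TID-set intersection --
    no further passes over the database are needed."""
    common = set.intersection(*(tid_lists[i] for i in itemset))
    return len(common) / n_transactions
```

Candidate generation proceeds as in classic Apriori, but every support check becomes a set intersection instead of a full database scan, which is where the speedup comes from.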
#### Algorithm Correctness

- All three algorithms produce identical frequent itemsets when run on the same data
- Apriori and Improved Apriori match exactly (validating the optimization correctness)
- Association rules match between implementations
- Results verified through set comparison of discovered itemsets
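A set comparison of this kind can be written by normalising each discovered itemset to a `frozenset` so that item order and container type do not matter. This is a minimal illustrative check, not the notebook's verification code:

```python
def itemsets_match(found_a, found_b):
    """Order-insensitive equality check between two collections of
    frequent itemsets (each itemset given as any iterable of items)."""
    return {frozenset(s) for s in found_a} == {frozenset(s) for s in found_b}
```

Because the comparison is between sets of frozensets, `[("a", "b")]` and `[["b", "a"]]` compare equal, which is exactly the property needed when different algorithms emit itemsets in different orders.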
#### Scalability

- FP-Growth shows best scalability with increasing dataset size
- Improved Apriori provides moderate scalability improvements over traditional Apriori
- Traditional Apriori performance degrades more quickly with lower support thresholds
- All algorithms handle large transaction counts effectively
### Algorithm Comparison
Comprehensive three-way comparison visualization showing runtime comparisons, frequent itemsets by size, performance metrics, and runtime vs minimum support threshold analysis.
#### Performance Characteristics

| Aspect | Apriori | Improved Apriori | FP-Growth |
|---|---|---|---|
| Database Scans | Multiple (k scans for k-itemsets) | Single scan (intersection-based counting) | Two (counting + tree building) |
| Candidate Generation | Yes (explicit) | Yes (explicit) | No (pattern growth) |
| Support Calculation | Database scan for each k | Intersection of subset transactions | Tree traversal |
| Memory Usage | Higher (stores all candidates) | Moderate (stores support transactions) | Lower (compressed tree) |
| Best For | Sparse datasets, interpretability | Moderate datasets, balance | Dense datasets, performance |
| Runtime Tracking | ✅ Implemented | ✅ Implemented (detailed breakdown) | ✅ Implemented |
| Key Optimization | None | Single scan + intersection counting | Tree compression |
#### Runtime vs Minimum Support Threshold
Scalability analysis showing how algorithm performance changes with different minimum support thresholds. The visualization demonstrates that FP-Growth maintains superior performance across all support levels, while Improved Apriori provides consistent speedup over traditional Apriori.
#### Dataset Characteristics Impact

- Dense datasets: FP-Growth typically performs best; Improved Apriori provides moderate improvement over Apriori
- Sparse datasets: Improved Apriori and FP-Growth both outperform traditional Apriori
- Low support thresholds: FP-Growth advantage increases; Improved Apriori maintains ~10x speedup over Apriori
- High support thresholds: Performance gaps narrow but remain significant
## Charts and Visualizations

The implementation includes comprehensive visualizations that are automatically saved when running the notebook:
### Apriori Algorithm Analysis
The Apriori analysis visualization includes:
- Runtime Breakdown - Bar chart showing time spent in each phase
- Frequent Itemsets by Size - Log-scale bar chart of itemsets by k
- Support Distribution - Histogram with KDE showing support values
- Confidence Distribution - Histogram with KDE for association rules
- Top 20 Frequent Itemsets - Horizontal bar chart of top itemsets
- Transaction Size Distribution - Histogram of transaction sizes
### Improved Apriori Algorithm Analysis
The Improved Apriori analysis visualization includes:
- Detailed Runtime Breakdown - Shows initial scan, candidate generation, support calculation, and association rules phases
- Frequent Itemsets by Size - Log-scale bar chart of itemsets by k
- Support Distribution - Histogram with KDE showing support values
- Confidence Distribution - Histogram with KDE for association rules
- Top 20 Frequent Itemsets - Horizontal bar chart of top itemsets
- Transaction Size Distribution - Histogram of transaction sizes
Key insight: The detailed runtime breakdown highlights the optimization benefits, showing that support calculation uses intersection-based counting instead of repeated database scans.
### FP-Growth Algorithm Analysis
The FP-Growth analysis visualization includes the same six charts as Apriori, allowing for direct comparison of algorithm behavior and results.
### Algorithm Comparison
The comprehensive comparison visualization includes:
- Total Runtime Comparison - Side-by-side bar chart with speedup annotations for all three algorithms
- Frequent Itemsets Mining Time - Comparison of core mining phase
- Frequent Itemsets by Size - Side-by-side comparison for each k
- Runtime Breakdown by Phase - Comparison of time spent in each phase
- Total Frequent Itemsets Found - Comparison of total counts
- Association Rules Generated - Comparison of rules generated
- Runtime vs Minimum Support - Line graph showing how performance scales with different support thresholds
The 3x3 grid layout provides comprehensive insights into algorithm performance, correctness, and scalability.
## Analysis

### When FP-Growth Outperforms

- Dense datasets: FP-Growth’s tree compression is more effective
- Lower support thresholds: FP-Growth avoids the candidate explosion that slows Apriori as thresholds drop
- Large transaction databases: Tree structure scales better
- Memory-constrained environments: Compressed representation helps
- All scenarios: FP-Growth consistently outperforms both Apriori variants
### When Improved Apriori May Be Preferred

- Moderate performance needs: Provides ~10x speedup over traditional Apriori
- Interpretability required: Maintains explicit candidate generation
- Single-scan optimization: Eliminates repeated database scans
- Balance between simplicity and performance: Easier to understand than FP-Growth while providing significant speedup
### When Traditional Apriori May Be Preferred

- Educational purposes: Easier to understand and modify
- Reference implementation: Provides baseline for comparison
- Simple datasets: Where optimization overhead is unnecessary
### Trade-offs

Traditional Apriori:
- ✅ Easier to understand and implement
- ✅ Explicit candidate generation
- ❌ Multiple database scans
- ❌ Higher memory usage
- ❌ Slowest performance
Improved Apriori:
- ✅ Single database scan optimization
- ✅ ~10x faster than traditional Apriori
- ✅ Maintains explicit candidate generation
- ✅ Detailed runtime breakdown
- ❌ Still slower than FP-Growth
- ❌ Requires storing support transactions
FP-Growth:
- ✅ Fewest database scans
- ✅ Best performance on dense datasets
- ✅ Lowest memory usage
- ✅ Best scalability
- ❌ More complex implementation
- ❌ Tree construction overhead
- ❌ Less interpretable than Apriori variants
## Programmatic Minimum Support

The implementation includes intelligent minimum support calculation:
### Methods Available

- Percentile Method: Uses percentile of item frequencies (default: 75th percentile)
- Absolute Method: Uses absolute minimum count (default: 1% of transactions)
- Desired Items Method: Calculates support to get desired number of frequent items
- Statistical Method: Uses the mean minus one standard deviation of item frequencies as the threshold
### Recommendation System

The `suggest_min_support()` function provides:
- Multiple method suggestions
- Recommended value (75th percentile vs 1% absolute, whichever is smaller)
- Detailed statistics about item frequencies
- Items at different support thresholds
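The suggestion logic described above might be sketched as follows. Only the function name `suggest_min_support` comes from this project; the parameter names and internals here are assumptions covering three of the four methods (the desired-items method is omitted for brevity), so the real implementation may differ:

```python
import statistics

def suggest_min_support(item_counts, n_transactions,
                        percentile=75, absolute_fraction=0.01):
    """Suggest minimum-support thresholds (as relative supports)
    from a list of per-item occurrence counts."""
    counts = sorted(item_counts)

    # Percentile method: the item count at the given percentile.
    idx = min(len(counts) - 1, int(len(counts) * percentile / 100))
    percentile_support = counts[idx] / n_transactions

    # Absolute method: a fixed fraction of the transaction count.
    absolute_support = absolute_fraction

    # Statistical method: mean - std of counts, floored at 1 occurrence.
    mean, std = statistics.mean(counts), statistics.pstdev(counts)
    statistical_support = max(mean - std, 1) / n_transactions

    # Recommended: the smaller of percentile and absolute, as described above.
    return {
        "percentile": percentile_support,
        "absolute": absolute_support,
        "statistical": statistical_support,
        "recommended": min(percentile_support, absolute_support),
    }
```

Taking the smaller of the two candidate thresholds errs toward including more items, which is usually the safer default when the goal is to avoid missing frequent itemsets.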
## Implementation Features

### Runtime Tracking

All algorithms include comprehensive runtime tracking:

- Microsecond-level precision using `time.perf_counter()` for accurate measurements
- Separate tracking for each phase
- Improved Apriori includes detailed breakdown: initial scan, candidate generation, support calculation
- Easy benchmarking and comparison across algorithms
### Visualization Pipeline

- Automated visualization generation
- Consistent styling across algorithms
- Summary statistics tables
- Comparative analysis charts
### Data Preprocessing

- Verified purchase filtering
- User cart creation
- Transaction size filtering
- Programmatic parameter calculation
## Conclusions

- Improved Apriori provides significant speedup (~10x) over traditional Apriori through single database scan and intersection-based counting
- FP-Growth remains fastest overall (175x faster than Apriori, 17x faster than Improved Apriori)
- All algorithms produce identical results, validating correctness of all implementations
- Single database scan optimization effectively eliminates repeated scans in Improved Apriori
- Microsecond-level runtime tracking enables accurate performance comparison
- Comprehensive visualization framework provides clear insights into algorithm behavior and scalability
- Runtime vs support analysis demonstrates consistent performance characteristics across support thresholds
## Performance Summary

Based on experimental results with the Appliances dataset (`min_support=0.0005`):

| Algorithm | Total Runtime | Frequent Itemsets Time | Speedup vs Apriori |
|---|---|---|---|
| Traditional Apriori | ~2.45s | ~2.45s | 1.00x (baseline) |
| Improved Apriori | ~0.24s | ~0.24s | ~10.00x |
| FP-Growth | ~0.01s | ~0.01s | ~176.00x |
## Future Work

- Additional Optimizations:
  - Hash-based candidate generation for Improved Apriori
  - Transaction reduction optimization
  - Parallel processing support
  - Memory-efficient implementations
  - Incremental mining capabilities
- Extended Experiments:
  - Multiple dataset types (dense vs sparse)
  - More comprehensive support threshold analysis
  - Scalability studies with larger datasets
  - Memory usage profiling
  - CPU utilization analysis
- Enhanced Visualizations:
  - Interactive dashboards
  - Real-time performance monitoring
  - Algorithm selection guides
  - Performance prediction models
  - Comparative memory usage charts