Skip to content

Implementation

  • Language: Python 3.9+
  • Package Manager: uv (unified Python package manager)
  • Dependencies:
    • numpy, pandas, polars (data processing)
    • matplotlib, seaborn (visualization)
    • requests (dataset downloading)
    • ipykernel (Jupyter notebook support)
  • Environment: Jupyter notebooks for exploration and analysis
  • Directorycos-781/
    • Directoryexploration/ - Main implementation and exploration
      • apriori.py - Traditional Apriori implementation
      • fp_growth.py - FP-Growth algorithm implementation
      • improved_apriori.py - Improved Apriori implementation (skeleton)
      • dataset_loader.py - Dataset downloading and caching utilities
      • preprocessing.py - Data preprocessing and transaction utilities
      • exploration.ipynb - Main Jupyter notebook with experiments
      • pyproject.toml - Python project configuration
    • Directorydocs/ - Documentation website
    • Directoryreport/ - LaTeX report
    • pyproject.toml - Root project configuration

Complete implementation of the standard Apriori algorithm with:

  • Candidate generation using join step
  • Candidate pruning using Apriori property
  • Support counting
  • Association rule generation
  • Runtime tracking for benchmarking

Key Features:

  • Apriori class with fit() and generate_association_rules() methods
  • get_runtime_stats() for performance analysis
  • Support for configurable minimum support and confidence thresholds

Complete implementation of the FP-Growth algorithm including:

  • FPNode class for FP-tree nodes
  • FPTree class with tree construction and mining
  • Header table management with node linking
  • Conditional pattern base generation
  • Recursive mining algorithm
  • Runtime tracking for benchmarking

Key Features:

  • FPGrowth class with same interface as Apriori
  • Efficient tree-based pattern mining
  • No candidate generation required
  • get_runtime_stats() for performance analysis

Complete implementation of the Weighted Apriori algorithm with intersection-based counting:

  • Single database scan optimization
  • Intersection-based support calculation
  • Support transaction storage for efficient operations
  • Association rule generation using stored support transactions
  • Comprehensive runtime tracking

Key Features:

  • ImprovedApriori class with fit() and generate_association_rules() methods
  • _apriori_gen() for candidate generation with join and prune steps
  • _calculate_support_by_intersection() for efficient support calculation
  • get_runtime_stats() with detailed metrics including initial scan time, candidate generation time, and support calculation time
  • Support for configurable minimum support and confidence thresholds

Optimization Details:

  • Stores support transactions (transaction IDs) for each itemset
  • Calculates support by intersecting (k-1)-subset support transactions
  • Eliminates repeated database scans
  • Uses efficient set operations for intersection calculations

Utilities for downloading and caching datasets:

  • DatasetCache class for managing cached datasets
  • Automatic caching with MD5-based keys
  • Support for JSONL file downloads
  • Metadata tracking for cached files

Data preprocessing utilities:

  • filter_verified_purchases() - Filter for verified purchases only
  • create_user_carts() - Group products by user to create transactions
  • prepare_transactions() - Complete preprocessing pipeline
  • calculate_min_support() - Programmatic minimum support calculation
  • suggest_min_support() - Intelligent support threshold suggestions
  • get_transaction_stats() - Dataset statistics

All algorithms include comprehensive runtime tracking:

  • fit_time - Total time for fitting
  • frequent_itemsets_time - Time to find frequent itemsets
  • association_rules_time - Time to generate association rules
  • get_runtime_stats() method to retrieve statistics

See the Code Documentation page for detailed usage examples and API reference.

All implementations are available in the exploration/ directory. The main Jupyter notebook (exploration.ipynb) contains:

  • Dataset loading and preprocessing
  • Algorithm execution with runtime tracking
  • Comprehensive visualizations
  • Performance comparisons

See the Code Documentation page for detailed instructions on running the implementations.