Implementation
Implementation Details
Technology Stack
- Language: Python 3.9+
- Package Manager: uv (unified Python package manager)
- Dependencies:
- numpy, pandas, polars (data processing)
- matplotlib, seaborn (visualization)
- requests (dataset downloading)
- ipykernel (Jupyter notebook support)
- Environment: Jupyter notebooks for exploration and analysis
Project Structure
- cos-781/
  - exploration/ - Main implementation and exploration
    - apriori.py - Traditional Apriori implementation
    - fp_growth.py - FP-Growth algorithm implementation
    - improved_apriori.py - Improved Apriori implementation (skeleton)
    - dataset_loader.py - Dataset downloading and caching utilities
    - preprocessing.py - Data preprocessing and transaction utilities
    - exploration.ipynb - Main Jupyter notebook with experiments
    - pyproject.toml - Python project configuration
  - docs/ - Documentation website
    - …
  - report/ - LaTeX report
    - …
  - pyproject.toml - Root project configuration
Key Components
1. Traditional Apriori (apriori.py)
Complete implementation of the standard Apriori algorithm with:
- Candidate generation using join step
- Candidate pruning using Apriori property
- Support counting
- Association rule generation
- Runtime tracking for benchmarking
Key Features:
- Apriori class with fit() and generate_association_rules() methods
- get_runtime_stats() for performance analysis
- Support for configurable minimum support and confidence thresholds
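A minimal usage sketch follows; the constructor keyword names (min_support, min_confidence) and the list-of-lists transaction format are assumptions for illustration, not the documented API:

```python
from apriori import Apriori

# Toy transactions as lists of item identifiers (format assumed for illustration).
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]

# Thresholds are configurable; the keyword names here are assumed.
model = Apriori(min_support=0.5, min_confidence=0.6)
model.fit(transactions)

rules = model.generate_association_rules()
print(rules)
print(model.get_runtime_stats())
```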
2. FP-Growth (fp_growth.py)
Complete implementation of the FP-Growth algorithm including:
- FPNode class for FP-tree nodes
- FPTree class with tree construction and mining
- Header table management with node linking
- Conditional pattern base generation
- Recursive mining algorithm
- Runtime tracking for benchmarking
Key Features:
- FPGrowth class with the same interface as Apriori
- Efficient tree-based pattern mining
- No candidate generation required
- get_runtime_stats() for performance analysis
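Because FPGrowth exposes the same interface as Apriori, it can be swapped in without changing the surrounding code; the keyword names below are the same assumptions as in the Apriori sketch:

```python
from fp_growth import FPGrowth

# Same toy transactions and call sequence as the Apriori sketch above.
transactions = [["bread", "milk"], ["bread", "diapers", "beer"],
                ["milk", "diapers", "beer"], ["bread", "milk", "diapers"]]

model = FPGrowth(min_support=0.5, min_confidence=0.6)  # keyword names assumed
model.fit(transactions)
print(model.generate_association_rules())
print(model.get_runtime_stats())
```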
3. Improved Apriori (improved_apriori.py)
Complete implementation of the Weighted Apriori algorithm with intersection-based counting:
- Single database scan optimization
- Intersection-based support calculation
- Support transaction storage for efficient operations
- Association rule generation using stored support transactions
- Comprehensive runtime tracking
Key Features:
- ImprovedApriori class with fit() and generate_association_rules() methods
- _apriori_gen() for candidate generation with join and prune steps
- _calculate_support_by_intersection() for efficient support calculation
- get_runtime_stats() with detailed metrics, including initial scan time, candidate generation time, and support calculation time
- Support for configurable minimum support and confidence thresholds
Optimization Details:
- Stores support transactions (transaction IDs) for each itemset
- Calculates support by intersecting (k-1)-subset support transactions
- Eliminates repeated database scans
- Uses efficient set operations for intersection calculations
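The key idea is that once each itemset carries the IDs of the transactions that support it, the support of a larger candidate can be obtained by intersecting those ID sets instead of rescanning the database. A self-contained sketch of the idea (not the actual ImprovedApriori internals):

```python
# Single pass over the database: record which transactions contain each item.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]
tid_sets = {}
for tid, basket in enumerate(transactions):
    for item in basket:
        tid_sets.setdefault(item, set()).add(tid)

def support_by_intersection(candidate):
    """Support of an itemset via set intersection of stored transaction IDs."""
    items = iter(candidate)
    common = set(tid_sets[next(items)])
    for item in items:
        common &= tid_sets[item]          # no further database scans
    return len(common) / len(transactions), common

support, tids = support_by_intersection({"a", "b"})
print(support, tids)  # 0.5 {0, 2}
```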
4. Dataset Loader (dataset_loader.py)
Utilities for downloading and caching datasets:
- DatasetCache class for managing cached datasets
- Automatic caching with MD5-based keys
- Support for JSONL file downloads
- Metadata tracking for cached files
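A sketch of the caching approach, where the MD5 hash of the source URL serves as the cache key; the function below is illustrative and not the DatasetCache API itself:

```python
import hashlib
from pathlib import Path

import requests

def cached_download(url: str, cache_dir: str = ".dataset_cache") -> Path:
    """Download a JSONL file once and reuse it on later calls,
    keyed by the MD5 hash of its URL (illustrative sketch)."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    key = hashlib.md5(url.encode("utf-8")).hexdigest()
    target = cache / f"{key}.jsonl"
    if not target.exists():
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        target.write_bytes(response.content)
    return target
```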
5. Preprocessing (preprocessing.py)
Data preprocessing utilities:
- filter_verified_purchases() - Filter for verified purchases only
- create_user_carts() - Group products by user to create transactions
- prepare_transactions() - Complete preprocessing pipeline
- calculate_min_support() - Programmatic minimum support calculation
- suggest_min_support() - Intelligent support threshold suggestions
- get_transaction_stats() - Dataset statistics
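A sketch of how these utilities chain together; the input column names and argument lists are assumptions for illustration (the actual signatures are documented on the Code Documentation page):

```python
import pandas as pd

from preprocessing import (
    prepare_transactions,
    suggest_min_support,
    get_transaction_stats,
)

# Raw review records; the column names here are assumed for illustration.
reviews = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3"],
    "asin": ["p1", "p2", "p1", "p3", "p2"],
    "verified_purchase": [True, True, True, False, True],
})

# prepare_transactions() is described as the complete pipeline
# (verified-purchase filtering plus grouping into user carts).
transactions = prepare_transactions(reviews)

print(get_transaction_stats(transactions))
print(suggest_min_support(transactions))
```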
Runtime Tracking
All algorithms include comprehensive runtime tracking:
- fit_time - Total time for fitting
- frequent_itemsets_time - Time to find frequent itemsets
- association_rules_time - Time to generate association rules
- get_runtime_stats() method to retrieve statistics
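A sketch of comparing the tracked timings across the three implementations; the keyword names and the assumption that get_runtime_stats() returns a dictionary keyed by the attribute names above are illustrative:

```python
from apriori import Apriori
from fp_growth import FPGrowth
from improved_apriori import ImprovedApriori

transactions = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b", "c"]]

# Identical (assumed) thresholds so the timings are comparable.
for name, cls in [("Apriori", Apriori), ("FP-Growth", FPGrowth),
                  ("Improved Apriori", ImprovedApriori)]:
    model = cls(min_support=0.5, min_confidence=0.6)
    model.fit(transactions)
    model.generate_association_rules()
    stats = model.get_runtime_stats()
    print(name, stats.get("fit_time"), stats.get("association_rules_time"))
```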
Usage Examples
See the Code Documentation page for detailed usage examples and API reference.
Running the Code
All implementations are available in the exploration/ directory. The main Jupyter notebook (exploration.ipynb) contains:
- Dataset loading and preprocessing
- Algorithm execution with runtime tracking
- Comprehensive visualizations
- Performance comparisons
See the Code Documentation page for detailed instructions on running the implementations.