
Code Documentation

  1. Clone the repository:
git clone <repository-url>
cd cos-781
  2. Install dependencies using uv:
cd exploration
uv sync

This will:

  • Create a virtual environment in .venv
  • Install all required dependencies (numpy, pandas, matplotlib, seaborn, requests, etc.)
  • Set up the workspace configuration
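
To confirm the environment resolved correctly (optional; this assumes the dependency list above), import the core packages through uv:

uv run python -c "import numpy, pandas, matplotlib, seaborn, requests"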
  3. Start the Jupyter notebook:
uv run jupyter notebook exploration.ipynb

The codebase includes implementations of:

  • Traditional Apriori algorithm (apriori.py)
  • FP-Growth algorithm (fp_growth.py)
  • Improved Apriori algorithm (improved_apriori.py) - Weighted Apriori with intersection-based counting
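
All three classes expose the same fit/generate/get interface and take transactions as a collection of item lists (the exact container type may vary). A toy input for illustration (hypothetical item IDs, not drawn from the real dataset):

# Each inner list is one transaction (e.g. one user's purchased items)
transactions = [
    ["item_a", "item_b", "item_c"],
    ["item_a", "item_c"],
    ["item_b", "item_c", "item_d"],
]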

Traditional Apriori:

from apriori import Apriori
# Initialize algorithm
apriori_model = Apriori(min_support=0.01, min_confidence=0.5)
# Fit on transactions
apriori_model.fit(transactions)
# Generate association rules
apriori_model.generate_association_rules()
# Get results
frequent_itemsets = apriori_model.get_frequent_itemsets()
association_rules = apriori_model.get_association_rules()
runtime_stats = apriori_model.get_runtime_stats()

FP-Growth:

from fp_growth import FPGrowth
# Initialize algorithm
fp_model = FPGrowth(min_support=0.01, min_confidence=0.5)
# Fit on transactions
fp_model.fit(transactions)
# Generate association rules
fp_model.generate_association_rules()
# Get results
frequent_itemsets = fp_model.get_frequent_itemsets()
association_rules = fp_model.get_association_rules()
runtime_stats = fp_model.get_runtime_stats()

Improved Apriori:

from improved_apriori import ImprovedApriori
# Initialize algorithm
improved_model = ImprovedApriori(min_support=0.01, min_confidence=0.5)
# Fit on transactions
improved_model.fit(transactions)
# Generate association rules
improved_model.generate_association_rules()
# Get results
frequent_itemsets = improved_model.get_frequent_itemsets()
association_rules = improved_model.get_association_rules()
runtime_stats = improved_model.get_runtime_stats()
# Enhanced runtime statistics include:
# - initial_scan_time: Time for single database scan
# - candidate_generation_time: Total time for candidate generation
# - support_calculation_time: Time for intersection-based support calculation

Dataset Loading:

from dataset_loader import load_jsonl_dataset
# Download and load dataset
url = "https://huggingface.co/datasets/..."
data = load_jsonl_dataset(url)

Preprocessing:

from preprocessing import (
    prepare_transactions,
    suggest_min_support,
    get_transaction_stats,
)

# Prepare transactions from raw data
transactions = prepare_transactions(
    data,
    use_parent_asin=False,
    min_transaction_size=4,
)

# Calculate a recommended minimum support
support_suggestions = suggest_min_support(transactions)
min_support = support_suggestions['recommended_min_support']

# Get transaction statistics
stats = get_transaction_stats(transactions)
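
The returned stats object is handy for a sanity check before mining; assuming it is a plain dict of summary values, it can be printed like this:

# Inspect transaction statistics (assumed to be a dict of summary values)
for key, value in stats.items():
    print(f"{key}: {value}")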

Quick Start:

  1. Load Dataset:
from dataset_loader import load_jsonl_dataset
data = load_jsonl_dataset("https://huggingface.co/datasets/...")
  2. Preprocess Data:
from preprocessing import prepare_transactions, suggest_min_support
transactions = prepare_transactions(data, min_transaction_size=4)
min_support = suggest_min_support(transactions)['recommended_min_support']
  3. Run Algorithm:

Traditional Apriori:

from apriori import Apriori
model = Apriori(min_support=min_support, min_confidence=0.5)
model.fit(transactions)
model.generate_association_rules()

Improved Apriori (recommended):

from improved_apriori import ImprovedApriori
model = ImprovedApriori(min_support=min_support, min_confidence=0.5)
model.fit(transactions)
model.generate_association_rules()

FP-Growth:

from fp_growth import FPGrowth
model = FPGrowth(min_support=min_support, min_confidence=0.5)
model.fit(transactions)
model.generate_association_rules()
  4. Analyze Results:
frequent_itemsets = model.get_frequent_itemsets()
association_rules = model.get_association_rules()
runtime_stats = model.get_runtime_stats()
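
To surface the strongest rules, the results can be sorted by confidence. The snippet below is a sketch: it assumes each rule is a dict-like record with hypothetical 'antecedent', 'consequent', and 'confidence' keys, which may differ from the actual return type.

# Show the ten highest-confidence rules (field names are assumptions)
top_rules = sorted(association_rules, key=lambda r: r['confidence'], reverse=True)[:10]
for rule in top_rules:
    print(rule['antecedent'], '=>', rule['consequent'], f"(confidence {rule['confidence']:.2f})")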

Programmatic Minimum Support Calculation:

from preprocessing import calculate_min_support

# Use the percentile method
min_support = calculate_min_support(
    transactions,
    method='percentile',
    percentile=75.0,
)

# Use the absolute count method
min_support = calculate_min_support(
    transactions,
    method='absolute',
    min_absolute_count=100,
)

Runtime Benchmarking:

For Traditional Apriori and FP-Growth:

import time
start = time.time()
model.fit(transactions)
model.generate_association_rules()
total_time = time.time() - start
runtime_stats = model.get_runtime_stats()
print(f"Total time: {total_time:.2f}s")
print(f"Frequent itemsets time: {runtime_stats['frequent_itemsets_time']:.2f}s")
print(f"Association rules time: {runtime_stats['association_rules_time']:.2f}s")

For Improved Apriori (with enhanced metrics):

import time
start = time.time()
model.fit(transactions)
model.generate_association_rules()
total_time = time.time() - start
runtime_stats = model.get_runtime_stats()
print(f"Total time: {total_time:.2f}s")
print(f"Frequent itemsets time: {runtime_stats['frequent_itemsets_time']:.2f}s")
print(f"Initial scan time: {runtime_stats['initial_scan_time']:.4f}s")
print(f"Candidate generation time: {runtime_stats['candidate_generation_time']:.4f}s")
print(f"Support calculation time: {runtime_stats['support_calculation_time']:.4f}s")
print(f"Association rules time: {runtime_stats['association_rules_time']:.2f}s")

Key Parameters:

  • min_support (float): Minimum support threshold (0.0 to 1.0)
  • min_confidence (float): Minimum confidence threshold (0.0 to 1.0)
  • use_parent_asin (bool): Use parent_asin instead of asin for grouping
  • min_transaction_size (int): Minimum number of items in a transaction

Dataset Caching:

  • Cache directory: data/cache/ (configurable)
  • Automatic caching with MD5-based keys (see the sketch below)
  • Metadata tracking in metadata.json
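
The exact key scheme lives in dataset_loader; a representative MD5-based approach (an illustrative sketch, not necessarily the implementation) hashes the dataset URL into a cache filename:

import hashlib

url = "https://huggingface.co/datasets/..."
cache_key = hashlib.md5(url.encode("utf-8")).hexdigest()
cache_path = f"data/cache/{cache_key}.jsonl"  # stored under the cache directory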

The main Jupyter notebook (exploration.ipynb) contains comprehensive examples including:

  1. Dataset Loading and Preprocessing

    • Downloading datasets from Hugging Face
    • Filtering verified purchases
    • Creating transaction format
  2. Algorithm Execution

    • Running Apriori with runtime tracking
    • Running Improved Apriori with runtime tracking
    • Running FP-Growth with runtime tracking
    • Comparing all three algorithms
  3. Visualization

    • Data exploration visualizations
    • Runtime breakdown charts
    • Frequent itemsets by size
    • Support and confidence distributions
    • Top frequent itemsets
    • Transaction size distributions
    • Runtime vs minimum support threshold analysis
    • Auto-saved images: All visualizations are automatically saved to docs/src/assets/images/ as PNG and PDF files
  4. Performance Comparison

    • Three-way algorithm comparison (Apriori, Improved Apriori, FP-Growth)
    • Speedup calculations
    • Itemset matching verification
    • Scalability analysis across support thresholds

Project Structure:

exploration/
├── apriori.py # Traditional Apriori implementation
├── fp_growth.py # FP-Growth implementation
├── improved_apriori.py # Improved Apriori with intersection-based counting
├── dataset_loader.py # Dataset utilities
├── preprocessing.py # Preprocessing utilities
└── exploration.ipynb # Main notebook with all experiments

When contributing code:

  1. Follow Python PEP 8 style guidelines
  2. Include docstrings for all public functions and classes
  3. Add type hints for better code documentation
  4. Update this documentation when adding new features
  5. Include runtime tracking in new algorithm implementations (see the sketch below)
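
A minimal runtime-tracking pattern consistent with the get_runtime_stats() interface used above (a sketch; the existing implementations may structure this differently):

import time

class NewAlgorithm:
    """Skeleton showing where to record phase timings (hypothetical class)."""

    def __init__(self, min_support, min_confidence):
        self.min_support = min_support
        self.min_confidence = min_confidence
        self.runtime_stats = {}

    def fit(self, transactions):
        start = time.time()
        # ... mine frequent itemsets here ...
        self.runtime_stats['frequent_itemsets_time'] = time.time() - start

    def generate_association_rules(self):
        start = time.time()
        # ... derive rules from the frequent itemsets here ...
        self.runtime_stats['association_rules_time'] = time.time() - start

    def get_runtime_stats(self):
        return self.runtime_stats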