Code Documentation
Installation
Prerequisites
- Python 3.9 or higher
- uv package manager (install from https://github.com/astral-sh/uv)

- Clone the repository:

    git clone <repository-url>
    cd cos-781

- Install dependencies using uv:

    cd exploration
    uv sync

  This will:
  - Create a virtual environment in .venv
  - Install all required dependencies (numpy, pandas, matplotlib, seaborn, requests, etc.)
  - Set up the workspace configuration
- Start Jupyter notebook:

    uv run jupyter notebook exploration.ipynb

API Reference
The codebase includes implementations of:
- Traditional Apriori algorithm (apriori.py)
- FP-Growth algorithm (fp_growth.py)
- Improved Apriori algorithm (improved_apriori.py): Weighted Apriori with intersection-based counting
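
As background for "intersection-based counting", the general idea can be sketched as follows: index each item by the set of transaction IDs that contain it, then obtain an itemset's support by intersecting those sets instead of rescanning every transaction. This is a minimal illustration of the technique under that assumption; the function names (`build_tid_sets`, `support_by_intersection`) and the toy data are hypothetical and are not taken from improved_apriori.py.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set


def build_tid_sets(transactions: Iterable[Iterable[Hashable]]) -> Dict[Hashable, Set[int]]:
    """Map each item to the set of transaction indices (TIDs) containing it."""
    tid_sets: Dict[Hashable, Set[int]] = defaultdict(set)
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tid_sets[item].add(tid)
    return dict(tid_sets)


def support_by_intersection(
    itemset: Iterable[Hashable],
    tid_sets: Dict[Hashable, Set[int]],
    n_transactions: int,
) -> float:
    """Support = size of the intersection of the items' TID sets / N."""
    item_tids = [tid_sets.get(item, set()) for item in itemset]
    if not item_tids:
        return 0.0
    common = set.intersection(*item_tids)
    return len(common) / n_transactions


# Toy data (hypothetical), one set of items per transaction.
toy_transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
tid_sets = build_tid_sets(toy_transactions)
pair_support = support_by_intersection(
    {"bread", "milk"}, tid_sets, len(toy_transactions)
)
print(pair_support)  # 0.5 (transactions 0 and 1 contain both items)
```

The TID sets are built in one pass over the data; every later support query touches only the sets of the items involved.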
Apriori Algorithm

    from apriori import Apriori

    # Initialize algorithm
    apriori_model = Apriori(min_support=0.01, min_confidence=0.5)

    # Fit on transactions
    apriori_model.fit(transactions)

    # Generate association rules
    apriori_model.generate_association_rules()

    # Get results
    frequent_itemsets = apriori_model.get_frequent_itemsets()
    association_rules = apriori_model.get_association_rules()
    runtime_stats = apriori_model.get_runtime_stats()

FP-Growth Algorithm

    from fp_growth import FPGrowth

    # Initialize algorithm
    fp_model = FPGrowth(min_support=0.01, min_confidence=0.5)

    # Fit on transactions
    fp_model.fit(transactions)

    # Generate association rules
    fp_model.generate_association_rules()

    # Get results
    frequent_itemsets = fp_model.get_frequent_itemsets()
    association_rules = fp_model.get_association_rules()
    runtime_stats = fp_model.get_runtime_stats()

Improved Apriori Algorithm

    from improved_apriori import ImprovedApriori

    # Initialize algorithm
    improved_model = ImprovedApriori(min_support=0.01, min_confidence=0.5)

    # Fit on transactions
    improved_model.fit(transactions)

    # Generate association rules
    improved_model.generate_association_rules()

    # Get results
    frequent_itemsets = improved_model.get_frequent_itemsets()
    association_rules = improved_model.get_association_rules()
    runtime_stats = improved_model.get_runtime_stats()

    # Enhanced runtime statistics include:
    # - initial_scan_time: Time for single database scan
    # - candidate_generation_time: Total time for candidate generation
    # - support_calculation_time: Time for intersection-based support calculation

Dataset Loading

    from dataset_loader import load_jsonl_dataset

    # Download and load dataset
    url = "https://huggingface.co/datasets/..."
    data = load_jsonl_dataset(url)

Preprocessing

    from preprocessing import (
        prepare_transactions,
        suggest_min_support,
        get_transaction_stats,
    )

    # Prepare transactions from raw data
    transactions = prepare_transactions(
        data,
        use_parent_asin=False,
        min_transaction_size=4,
    )

    # Calculate optimal minimum support
    support_suggestions = suggest_min_support(transactions)
    min_support = support_suggestions['recommended_min_support']

    # Get statistics
    stats = get_transaction_stats(transactions)

Usage Guide
Basic Usage
- Load Dataset:

    from dataset_loader import load_jsonl_dataset

    data = load_jsonl_dataset("https://huggingface.co/datasets/...")

- Preprocess Data:

    from preprocessing import prepare_transactions, suggest_min_support

    transactions = prepare_transactions(data, min_transaction_size=4)
    min_support = suggest_min_support(transactions)['recommended_min_support']

- Run Algorithm:

  Traditional Apriori:

    from apriori import Apriori

    model = Apriori(min_support=min_support, min_confidence=0.5)
    model.fit(transactions)
    model.generate_association_rules()

  Improved Apriori (recommended):

    from improved_apriori import ImprovedApriori

    model = ImprovedApriori(min_support=min_support, min_confidence=0.5)
    model.fit(transactions)
    model.generate_association_rules()

  FP-Growth:

    from fp_growth import FPGrowth

    model = FPGrowth(min_support=min_support, min_confidence=0.5)
    model.fit(transactions)
    model.generate_association_rules()

- Analyze Results:

    frequent_itemsets = model.get_frequent_itemsets()
    association_rules = model.get_association_rules()
    runtime_stats = model.get_runtime_stats()

Advanced Usage
Programmatic Minimum Support Calculation:

    from preprocessing import calculate_min_support

    # Use percentile method
    min_support = calculate_min_support(
        transactions,
        method='percentile',
        percentile=75.0,
    )

    # Use absolute count method
    min_support = calculate_min_support(
        transactions,
        method='absolute',
        min_absolute_count=100,
    )

Runtime Benchmarking:
For Traditional Apriori and FP-Growth:

    import time

    # perf_counter() is a monotonic clock, which suits interval timing
    start = time.perf_counter()
    model.fit(transactions)
    model.generate_association_rules()
    total_time = time.perf_counter() - start

    runtime_stats = model.get_runtime_stats()
    print(f"Total time: {total_time:.2f}s")
    print(f"Frequent itemsets time: {runtime_stats['frequent_itemsets_time']:.2f}s")
    print(f"Association rules time: {runtime_stats['association_rules_time']:.2f}s")

For Improved Apriori (with enhanced metrics):

    import time

    # perf_counter() is a monotonic clock, which suits interval timing
    start = time.perf_counter()
    model.fit(transactions)
    model.generate_association_rules()
    total_time = time.perf_counter() - start

    runtime_stats = model.get_runtime_stats()
    print(f"Total time: {total_time:.2f}s")
    print(f"Frequent itemsets time: {runtime_stats['frequent_itemsets_time']:.2f}s")
    print(f"Initial scan time: {runtime_stats['initial_scan_time']:.4f}s")
    print(f"Candidate generation time: {runtime_stats['candidate_generation_time']:.4f}s")
    print(f"Support calculation time: {runtime_stats['support_calculation_time']:.4f}s")
    print(f"Association rules time: {runtime_stats['association_rules_time']:.2f}s")

Configuration
Algorithm Parameters
- min_support (float): Minimum support threshold (0.0 to 1.0)
- min_confidence (float): Minimum confidence threshold (0.0 to 1.0)
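
To make the two thresholds concrete, here is a minimal sketch using the standard definitions: support is the fraction of transactions containing an itemset, and the confidence of a rule A -> B is count(A ∪ B) / count(A). The helper names (`count`, `support`, `confidence`) and the toy transactions are illustrative assumptions, not functions from this codebase.

```python
from typing import List, Set


def count(itemset: Set[str], transactions: List[Set[str]]) -> int:
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset: Set[str], transactions: List[Set[str]]) -> float:
    """Fraction of transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(
    antecedent: Set[str], consequent: Set[str], transactions: List[Set[str]]
) -> float:
    """count(antecedent ∪ consequent) / count(antecedent)."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)


# Toy data (hypothetical).
toy = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b"}]

rule_support = support({"a", "b"}, toy)          # 3 of 5 transactions
rule_confidence = confidence({"a"}, {"b"}, toy)  # 3 of the 4 transactions with "a"
print(rule_support)     # 0.6
print(rule_confidence)  # 0.75
```

With min_support=0.5 and min_confidence=0.7, the rule a -> b in this toy data would pass both thresholds.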
Preprocessing Parameters
- use_parent_asin (bool): Use parent_asin instead of asin for grouping
- min_transaction_size (int): Minimum number of items in a transaction
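
The sketch below shows one plausible reading of these two parameters: items are collected into one basket per user, identified by either `asin` or `parent_asin`, and baskets smaller than the minimum size are dropped. It is not the implementation of prepare_transactions; the function name `group_into_transactions`, the `user_id` field, and the toy records are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Set


def group_into_transactions(
    records: Iterable[Mapping[str, str]],
    use_parent_asin: bool = False,
    min_transaction_size: int = 2,
) -> List[Set[str]]:
    """Group item IDs per user and keep only sufficiently large baskets."""
    # Assumption: each record carries "user_id", "asin", and "parent_asin".
    item_key = "parent_asin" if use_parent_asin else "asin"
    baskets: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        baskets[record["user_id"]].add(record[item_key])
    return [items for items in baskets.values() if len(items) >= min_transaction_size]


# Toy records (hypothetical).
toy_records = [
    {"user_id": "u1", "asin": "A1", "parent_asin": "P1"},
    {"user_id": "u1", "asin": "A2", "parent_asin": "P1"},
    {"user_id": "u1", "asin": "A3", "parent_asin": "P2"},
    {"user_id": "u2", "asin": "A1", "parent_asin": "P1"},
]

by_asin = group_into_transactions(toy_records, use_parent_asin=False)
by_parent = group_into_transactions(toy_records, use_parent_asin=True)
print(by_asin)    # one basket: A1, A2, A3
print(by_parent)  # one basket: P1, P2 (variants A1 and A2 collapse into P1)
```

In this reading, switching to `parent_asin` merges product variants into one item, which shrinks baskets and can push some below the size threshold.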
Dataset Cache Configuration
- Cache directory: data/cache/ (configurable)
- Automatic caching with MD5-based keys
- Metadata tracking in metadata.json
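
As an illustration of what an MD5-based cache key can look like, the sketch below hashes the dataset URL and uses the digest as the cached file name. How dataset_loader.py actually derives its keys is not documented here, so the function name `cache_path_for`, the `.jsonl` suffix, and the example URL are assumptions.

```python
import hashlib
from pathlib import Path


def cache_path_for(url: str, cache_dir: str = "data/cache") -> Path:
    """Derive a stable cache file path from a URL (hypothetical scheme)."""
    # MD5 serves only as a stable file-name key here, not for security.
    key = hashlib.md5(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{key}.jsonl"


path = cache_path_for("https://example.com/reviews.jsonl")
print(path.parent)     # data/cache
print(len(path.stem))  # 32 hex characters
```

The same URL always maps to the same file, so a repeated load can check whether that path exists before downloading again.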
Examples
The main Jupyter notebook (exploration.ipynb) contains comprehensive examples including:
- Dataset Loading and Preprocessing
  - Downloading datasets from Hugging Face
  - Filtering verified purchases
  - Creating transaction format
- Algorithm Execution
  - Running Apriori with runtime tracking
  - Running Improved Apriori with runtime tracking
  - Running FP-Growth with runtime tracking
  - Comparing all three algorithms
- Visualization
  - Data exploration visualizations
  - Runtime breakdown charts
  - Frequent itemsets by size
  - Support and confidence distributions
  - Top frequent itemsets
  - Transaction size distributions
  - Runtime vs minimum support threshold analysis
  - Auto-saved images: All visualizations are automatically saved to docs/src/assets/images/ as PNG and PDF files
- Performance Comparison
  - Three-way algorithm comparison (Apriori, Improved Apriori, FP-Growth)
  - Speedup calculations
  - Itemset matching verification
  - Scalability analysis across support thresholds
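
The speedup and itemset-matching checks listed above can be sketched as follows. This is a hedged illustration rather than the notebook's code: it assumes frequent itemsets can be represented as a dict mapping a frozenset of items to its support, which may differ from what get_frequent_itemsets() returns, and the helper names and numbers are hypothetical.

```python
from typing import Dict, FrozenSet


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate ran than the baseline."""
    return baseline_seconds / candidate_seconds


def itemsets_match(
    a: Dict[FrozenSet[str], float],
    b: Dict[FrozenSet[str], float],
    tol: float = 1e-9,
) -> bool:
    """True if both results contain the same itemsets with equal supports."""
    if a.keys() != b.keys():
        return False
    return all(abs(a[itemset] - b[itemset]) <= tol for itemset in a)


# Toy results (hypothetical values, not measured).
apriori_result = {frozenset({"a"}): 0.8, frozenset({"a", "b"}): 0.6}
fp_growth_result = {frozenset({"b", "a"}): 0.6, frozenset({"a"}): 0.8}

print(speedup(2.0, 0.5))                                 # 4.0
print(itemsets_match(apriori_result, fp_growth_result))  # True
```

Using frozensets as keys makes the comparison independent of item order within an itemset and of the order in which each algorithm emits its results.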
File Structure

    exploration/
    ├── apriori.py            # Traditional Apriori implementation
    ├── fp_growth.py          # FP-Growth implementation
    ├── improved_apriori.py   # Improved Apriori with intersection-based counting
    ├── dataset_loader.py     # Dataset utilities
    ├── preprocessing.py      # Preprocessing utilities
    └── exploration.ipynb     # Main notebook with all experiments

Contributing
When contributing code:
- Follow Python PEP 8 style guidelines
- Include docstrings for all public functions and classes
- Add type hints for better code documentation
- Update this documentation when adding new features
- Include runtime tracking in new algorithm implementations
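
A contribution that follows these guidelines might be shaped like the sketch below: docstrings, type hints, and a runtime entry recorded around the mining step. `SingleItemMiner` is a hypothetical toy class that only finds frequent single items; the stats key `frequent_itemsets_time` mirrors the key used in the benchmarking examples, and everything else is an assumption rather than a required interface.

```python
import time
from collections import Counter
from typing import Dict, FrozenSet, Hashable, Iterable, List


class SingleItemMiner:
    """Toy miner that reports frequent single items only (hypothetical)."""

    def __init__(self, min_support: float) -> None:
        self.min_support = min_support
        self._frequent: Dict[FrozenSet[Hashable], float] = {}
        self._runtime_stats: Dict[str, float] = {}

    def fit(self, transactions: List[Iterable[Hashable]]) -> "SingleItemMiner":
        """Count single items and record how long the pass took."""
        start = time.perf_counter()
        # set(t) ensures an item repeated within a transaction counts once.
        counts = Counter(item for t in transactions for item in set(t))
        n = len(transactions)
        self._frequent = {
            frozenset([item]): c / n
            for item, c in counts.items()
            if c / n >= self.min_support
        }
        self._runtime_stats["frequent_itemsets_time"] = time.perf_counter() - start
        return self

    def get_frequent_itemsets(self) -> Dict[FrozenSet[Hashable], float]:
        """Return the frequent itemsets mapped to their support."""
        return dict(self._frequent)

    def get_runtime_stats(self) -> Dict[str, float]:
        """Return the timings collected during fit()."""
        return dict(self._runtime_stats)


miner = SingleItemMiner(min_support=0.5).fit([["a", "b"], ["a"], ["b", "c"]])
print(sorted(sorted(s) for s in miner.get_frequent_itemsets()))  # [['a'], ['b']]
```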