Data Preprocessing
Overview
Data preprocessing is a crucial step that transforms raw Amazon review data into a format suitable for frequent itemset mining algorithms. The preprocessing pipeline handles missing data, performs feature engineering, and applies filtering to ensure data quality and algorithm efficiency.
Preprocessing Pipeline
The preprocessing pipeline consists of several sequential steps:
- Data Loading: Load JSONL dataset files
- Missing Data Handling: Filter and handle null values
- Feature Engineering: Create transactions from user-product relationships
- Data Filtering: Remove infrequent items and small transactions
- Format Conversion: Convert to algorithm-ready format
Handling Missing Data
Missing Value Types
The Amazon Reviews dataset contains several types of missing values that require different handling strategies:
1. Missing parent_asin Values
Problem: Many products don’t have a parent ASIN (null values)
Solution: Fallback strategy
```python
# Use parent_asin if available, otherwise use asin
df = df.with_columns(
    pl.when(pl.col("parent_asin").is_not_null())
    .then(pl.col("parent_asin"))
    .otherwise(pl.col("asin"))
    .alias("product_id")
)
```
Rationale:
- Parent ASIN groups product variants together (e.g., different colors/sizes)
- When parent ASIN is unavailable, individual ASIN is used
- This ensures no data loss while maximizing product grouping benefits
Impact:
- Reduces unique item count when parent ASINs are available
- Creates more meaningful product associations
- Improves algorithm performance by reducing sparsity
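As a quick illustration of the fallback, the snippet below applies the same expression to a tiny hand-made DataFrame; the ASIN values are invented for illustration:

```python
import polars as pl

# Toy review rows; ASIN values are invented for illustration
df = pl.DataFrame({
    "asin": ["B001", "B002", "B003"],
    "parent_asin": ["P100", None, "P100"],
})

df = df.with_columns(
    pl.when(pl.col("parent_asin").is_not_null())
    .then(pl.col("parent_asin"))
    .otherwise(pl.col("asin"))
    .alias("product_id")
)

print(df["product_id"].to_list())  # ['P100', 'B002', 'P100']
```

Note how the two variants sharing parent ASIN "P100" map to the same `product_id`, while the row with a null parent falls back to its own ASIN.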
2. Missing verified_purchase Values
Problem: Some records may have null or False values for verified_purchase
Solution: Filter to only include verified purchases
```python
def filter_verified_purchases(data):
    df = _to_dataframe(data)
    # Filter for verified purchases, handle null values
    filtered_df = df.filter(
        pl.col("verified_purchase").fill_null(False) == True
    )
    return filtered_df.to_dicts()
```
Rationale:
- Verified purchases ensure data quality
- Unverified reviews may not represent actual purchases
- Reduces noise in transaction data
Impact:
- Typically reduces dataset size by 20-40%
- Improves data quality and result reliability
- Focuses analysis on actual purchase behavior
3. Missing user_id or asin Values
Problem: Critical fields may be missing, preventing transaction creation
Solution: Filter out records with missing critical fields
```python
# Filter out rows with missing user_id or product_id
df = df.filter(
    pl.col("user_id").is_not_null()
    & pl.col("product_id").is_not_null()
)
```
Rationale:
- Both `user_id` and `asin` are required for transaction creation
- Records without these fields cannot be used
- Better to exclude than impute (no meaningful default values)
Impact:
- Minimal data loss (these fields are rarely missing)
- Ensures data integrity for transaction creation
Missing Data Summary
| Field | Missing Strategy | Impact |
|---|---|---|
| `parent_asin` | Fallback to `asin` | No data loss, maximizes grouping |
| `verified_purchase` | Filter (keep only True) | 20-40% data reduction, quality improvement |
| `user_id` | Filter (exclude) | Minimal loss, ensures integrity |
| `asin` | Filter (exclude) | Minimal loss, ensures integrity |
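The three strategies compose naturally into a single Polars pass. The sketch below is one way to combine the snippets above into a helper; the column names follow the dataset schema, but the helper itself (`handle_missing_data`) is illustrative rather than part of the project code:

```python
import polars as pl

def handle_missing_data(df: pl.DataFrame) -> pl.DataFrame:
    """Illustrative combination of the three missing-data strategies."""
    return (
        df
        # Keep only verified purchases (nulls treated as unverified)
        .filter(pl.col("verified_purchase").fill_null(False))
        # Fall back to asin when parent_asin is null
        .with_columns(
            pl.when(pl.col("parent_asin").is_not_null())
            .then(pl.col("parent_asin"))
            .otherwise(pl.col("asin"))
            .alias("product_id")
        )
        # Drop rows missing the fields required for transaction creation
        .filter(
            pl.col("user_id").is_not_null()
            & pl.col("product_id").is_not_null()
        )
    )
```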
Feature Engineering
Feature engineering transforms raw review data into a transaction format suitable for association rule mining.
Transaction Creation
The core feature engineering step is creating transactions from user-product relationships:
Concept
- Transaction: A set of items (products) purchased by a single user
- Item: A product (ASIN or Parent ASIN)
- User Cart: All products purchased by a user form one transaction
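For example, with invented user and product IDs, three reviews by user A and two by user B collapse into two transactions:

```python
# Toy reviews: (user_id, product_id) pairs; IDs are invented
reviews = [
    ("A", "P1"), ("A", "P2"), ("A", "P2"),  # duplicate review of P2
    ("B", "P2"), ("B", "P3"),
]

# Each user's unique products form one transaction (cart)
transactions = {}
for user, product in reviews:
    transactions.setdefault(user, set()).add(product)

print(list(transactions.values()))
# e.g. [{'P1', 'P2'}, {'P2', 'P3'}] (set ordering may vary)
```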
Implementation
```python
def create_user_carts(data, use_parent_asin=True):
    """
    Group products by user_id to create transactions.

    Each user's purchases form a transaction (cart), where items
    are the products (ASINs) they purchased.
    """
    df = _to_dataframe(data)

    # Select product_id based on use_parent_asin
    if use_parent_asin and "parent_asin" in df.columns:
        df = df.with_columns(
            pl.when(pl.col("parent_asin").is_not_null())
            .then(pl.col("parent_asin"))
            .otherwise(pl.col("asin"))
            .alias("product_id")
        )
    else:
        df = df.with_columns(pl.col("asin").alias("product_id"))

    # Group by user_id and collect unique product_ids
    user_carts_df = (
        df.select(["user_id", "product_id"])
        .unique(subset=["user_id", "product_id"])  # Remove duplicates
        .group_by("user_id")
        .agg(pl.col("product_id").unique().alias("products"))
    )

    return user_carts_df
```
Key Design Decisions
- Product Grouping Strategy:
  - Option 1: Use individual ASINs (more granular)
  - Option 2: Use Parent ASINs (groups variants)
  - Choice: Parent ASIN with fallback (best of both worlds)
- Duplicate Handling:
  - Issue: Same user may review same product multiple times
  - Solution: Use `.unique()` to ensure one item per transaction
  - Rationale: Association rules don’t benefit from duplicate items
- Transaction Format:
  - Internal: Dictionary mapping `user_id` to set of products
  - Output: List of transactions (list of item lists)
  - Alternative: Polars DataFrame in long format `(transaction_id, item)`
Feature Transformations
Section titled “Feature Transformations”1. Product ID Selection
Transformation: Choose between ASIN and Parent ASIN
- When to use Parent ASIN: When available, for better grouping
- When to use ASIN: Fallback when Parent ASIN is null
- Benefit: Maximizes product grouping while avoiding data loss
2. User-Product Aggregation
Transformation: Group multiple reviews by the same user-product pair
- Method: Use `.unique()` to remove duplicates
- Benefit: Clean transaction data, one item per user-product pair
3. Transaction Format Conversion
Transformation: Convert between formats
- Dictionary → List: `[list(cart) for cart in user_carts.values()]`
- List → DataFrame: Long format `(transaction_id, item)` for efficient Polars operations
- Benefit: Algorithm compatibility and performance optimization
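A minimal sketch of both conversions, assuming `user_carts` is a dictionary mapping each user to a set of products as described above (the toy values are invented):

```python
import polars as pl

# Assumed input: user_id -> set of product_ids (toy values)
user_carts = {"A": {"P1", "P2"}, "B": {"P2", "P3"}}

# Dictionary -> list of transactions
transactions = [list(cart) for cart in user_carts.values()]

# List of transactions -> long-format DataFrame (transaction_id, item)
long_df = pl.DataFrame(
    [
        {"transaction_id": tid, "item": item}
        for tid, cart in enumerate(transactions)
        for item in cart
    ]
)
print(long_df.shape)  # (4, 2)
```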
Feature Engineering Benefits
- Dimensionality Reduction:
  - Groups product variants together
  - Reduces unique item count
  - Improves algorithm scalability
- Meaningful Associations:
  - Captures product family relationships
  - More interpretable association rules
  - Better for business insights
- Data Quality:
  - Removes duplicate entries
  - Ensures transaction integrity
  - Focuses on verified purchases
Data Filtering
Data filtering removes low-quality or uninformative data to improve algorithm performance and result quality.
Filtering Strategies
1. Minimum Transaction Size Filtering
Purpose: Remove transactions with too few items
Implementation:
```python
def prepare_transactions(data, min_transaction_size=1):
    # ... create transactions ...

    # Filter by minimum transaction size
    filtered_transactions = [
        t for t in transactions
        if len(t) >= min_transaction_size
    ]

    return filtered_transactions
```
Rationale:
- Single-item transactions don’t contribute to association rules
- Association rules require at least 2 items
- Reduces noise and computational overhead
Typical Values:
- `min_transaction_size=1`: Keep all transactions (for exploration)
- `min_transaction_size=2`: Standard for association rule mining
- `min_transaction_size=3`: More restrictive, higher quality transactions
Impact:
- Reduces transaction count (typically 10-30% reduction for `min_size=2`)
- Improves average transaction size
- Faster algorithm execution
2. Infrequent Item Filtering
Purpose: Remove items that appear too rarely
Implementation:
```python
def filter_infrequent_items(
    transactions,
    min_item_frequency=2,
    min_transaction_size=2,
):
    # Calculate item frequencies
    item_frequencies = calculate_item_frequencies(transactions)

    # Find frequent items
    frequent_items = {
        item
        for item, count in item_frequencies.items()
        if count >= min_item_frequency
    }

    # Filter transactions: keep only frequent items
    filtered_transactions = [
        [item for item in transaction if item in frequent_items]
        for transaction in transactions
    ]

    # Remove transactions that become too small
    filtered_transactions = [
        t for t in filtered_transactions
        if len(t) >= min_transaction_size
    ]

    return filtered_transactions
```
Rationale:
- Items appearing in very few transactions are unlikely to form frequent itemsets
- Reduces computational overhead (fewer candidates)
- Removes noise from rare products
Typical Values:
- `min_item_frequency=1`: Keep all items (no filtering)
- `min_item_frequency=2`: Remove singletons
- `min_item_frequency=3`: More aggressive filtering (used in experiments)
Impact:
- Reduces unique item count significantly (30-50% reduction)
- Faster candidate generation
- Cleaner, more meaningful results
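The `calculate_item_frequencies()` helper used above lives in `preprocessing.py` and is not reproduced here; a minimal stand-in that counts, for each item, the number of transactions containing it could look like the sketch below (the project's own implementation may differ):

```python
from collections import Counter

def calculate_item_frequencies(transactions):
    """Sketch: count how many transactions contain each item."""
    counts = Counter()
    for transaction in transactions:
        counts.update(set(transaction))  # count each item once per transaction
    return dict(counts)

# Toy usage
print(calculate_item_frequencies([["P1", "P2"], ["P2", "P3"], ["P2"]]))
# {'P1': 1, 'P2': 3, 'P3': 1}
```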
3. Support-Based Filtering
Purpose: Programmatically determine filtering thresholds
Implementation:
```python
def suggest_min_support(transactions, show_stats=True):
    """
    Suggest minimum support value based on item frequency distribution.
    """
    item_frequencies = calculate_item_frequencies(transactions)
    frequencies = list(item_frequencies.values())
    num_transactions = len(transactions)

    # Calculate suggestions using different methods
    suggestions = {
        'percentile_75': calculate_min_support(
            transactions, method='percentile', percentile=75.0
        ),
        'absolute_1pct': calculate_min_support(
            transactions, method='absolute',
            min_absolute_count=int(num_transactions * 0.01),
        ),
        'statistical': calculate_min_support(
            transactions, method='statistical'
        ),
    }

    return suggestions
```
Methods:
- Percentile: Use percentile of item frequencies (e.g., 75th percentile)
- Absolute: Use percentage of total transactions (e.g., 1%)
- Statistical: Use mean ± standard deviation
- Desired Items: Calculate to get desired number of frequent items
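As an illustration of the percentile method only, a simplified calculation might look like the sketch below; the real `calculate_min_support()` presumably covers the other methods and edge cases as well:

```python
def percentile_min_support(transactions, percentile=75.0):
    """Sketch: derive a relative min support from a percentile of item frequencies."""
    frequencies = sorted(calculate_item_frequencies(transactions).values())
    # Simplified nearest-rank percentile (assumes at least one item)
    idx = min(int(len(frequencies) * percentile / 100), len(frequencies) - 1)
    threshold_count = frequencies[idx]
    # Convert the absolute count into a support value in [0, 1]
    return threshold_count / len(transactions)
```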
Impact:
- Guides minimum support threshold selection
- Balances completeness vs. computational cost
- Ensures meaningful frequent itemsets
Filtering Pipeline
The complete filtering pipeline:
```python
# Step 1: Filter verified purchases
verified_data = filter_verified_purchases(data)

# Step 2: Create transactions
transactions_df = prepare_transactions(
    verified_data,
    use_parent_asin=True,
    min_transaction_size=2,
    return_format="dataframe",
)

# Step 3: Filter infrequent items
filtered_transactions = filter_infrequent_items(
    transactions_df,
    min_item_frequency=3,
    min_transaction_size=2,
    return_format="dataframe",
)

# Step 4: Convert to algorithm format
transactions_list = (
    filtered_transactions
    .group_by("transaction_id")
    .agg(pl.col("item").unique().alias("items"))
    .select("items")
    .to_series()
    .to_list()
)
```
Filtering Impact Summary
| Filter | Parameter | Typical Reduction | Benefit |
|---|---|---|---|
| Verified Purchases | `verified_purchase=True` | 20-40% records | Data quality |
| Min Transaction Size | `min_size=2` | 10-30% transactions | Rule quality |
| Infrequent Items | `min_freq=3` | 30-50% items | Performance |
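To verify these reductions on a concrete run, simple before/after counts are enough; the project's `get_transaction_stats()` serves this purpose, and the snippet below is an independent sketch of the same idea:

```python
def basic_stats(transactions):
    """Sketch: headline numbers for a list-of-lists transaction set."""
    num_transactions = len(transactions)
    unique_items = {item for t in transactions for item in t}
    avg_size = (
        sum(len(t) for t in transactions) / num_transactions
        if num_transactions else 0.0
    )
    return {
        "transactions": num_transactions,
        "unique_items": len(unique_items),
        "avg_transaction_size": round(avg_size, 2),
    }

print(basic_stats([["P1", "P2"], ["P2", "P3", "P4"]]))
# {'transactions': 2, 'unique_items': 4, 'avg_transaction_size': 2.5}
```

Comparing these numbers before and after each filtering step makes the cost/benefit of a given threshold explicit.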
Standardization/Normalization
For frequent itemset mining, traditional standardization/normalization (e.g., z-score, min-max scaling) is not applicable because:
- Categorical Data: Items are categorical (product IDs), not numerical
- Binary Presence: Items either appear in a transaction or they don’t
- Support-Based: Support is already normalized (count / total transactions)
However, conceptual normalization occurs through:
Support Normalization
Support values are inherently normalized:
- Support = Count / Total Transactions
- Range: [0, 1]
- Already standardized across different dataset sizes
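Concretely, with invented numbers:

```python
# Toy numbers: an item appearing in 50 of 100,000 transactions
count, total_transactions = 50, 100_000
support = count / total_transactions
print(support)  # 0.0005 -- the same scale regardless of dataset size
```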
Frequency Normalization
Item frequencies are normalized by total transactions:
- Enables comparison across datasets
- Allows meaningful support thresholds
- Independent of dataset size
Preprocessing Best Practices
1. Start with Exploration
- Run EDA before preprocessing
- Understand data distributions
- Make informed filtering decisions
2. Progressive Filtering
- Apply filters incrementally
- Measure impact at each step
- Balance quality vs. quantity
3. Parameter Tuning
- Experiment with different thresholds
- Use programmatic suggestions
- Validate with algorithm results
4. Data Quality First
- Filter verified purchases
- Handle missing values appropriately
- Ensure transaction integrity
5. Performance Considerations
- Use Polars for efficient operations
- Filter early to reduce data size
- Cache preprocessed data when possible
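One simple way to cache preprocessed data is to persist the long-format DataFrame to Parquet and reload it on later runs. The sketch below reuses the `prepare_transactions()` call shown elsewhere in this document; the cache path and helper name are invented:

```python
import pathlib
import polars as pl

CACHE_PATH = pathlib.Path("cache/transactions.parquet")  # invented path

def load_or_preprocess(raw_data) -> pl.DataFrame:
    """Reuse cached preprocessed transactions when available (sketch)."""
    if CACHE_PATH.exists():
        return pl.read_parquet(CACHE_PATH)
    transactions_df = prepare_transactions(
        raw_data,
        use_parent_asin=True,
        min_transaction_size=2,
        return_format="dataframe",
    )
    CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
    transactions_df.write_parquet(CACHE_PATH)
    return transactions_df
```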
Implementation Reference
The preprocessing pipeline is implemented in `preprocessing.py` with the following key functions:
- `filter_verified_purchases()`: Filter to verified purchases only
- `create_user_carts()`: Create user-product transactions
- `prepare_transactions()`: Complete preprocessing pipeline
- `filter_infrequent_items()`: Remove rare items
- `get_transaction_stats()`: Calculate statistics
- `calculate_item_frequencies()`: Count item frequencies
- `calculate_min_support()`: Programmatic support calculation
- `suggest_min_support()`: Suggest optimal support thresholds
Example Usage
```python
from preprocessing import prepare_transactions, filter_infrequent_items
from dataset_loader import load_jsonl_dataset_polars

# Load dataset
data = load_jsonl_dataset_polars(url)

# Complete preprocessing pipeline
transactions_df = prepare_transactions(
    data,
    use_parent_asin=True,
    min_transaction_size=2,
    return_format="dataframe",
)

# Filter infrequent items
filtered_transactions = filter_infrequent_items(
    transactions_df,
    min_item_frequency=3,
    min_transaction_size=2,
    return_format="dataframe",
)

# Ready for algorithm input
from apriori import Apriori

algorithm = Apriori(min_support=0.0005)
algorithm.fit(filtered_transactions)
```
Conclusion
Data preprocessing transforms raw Amazon review data into a clean, structured format suitable for frequent itemset mining. By handling missing data appropriately, engineering meaningful features, and applying strategic filtering, we ensure:
- Data Quality: Verified purchases, complete transactions
- Algorithm Efficiency: Reduced dimensionality, filtered noise
- Meaningful Results: Product grouping, appropriate support thresholds
- Scalability: Efficient Polars-based operations
The preprocessing pipeline is designed to be flexible, allowing experimentation with different strategies while maintaining data integrity and algorithm compatibility.