Data Preprocessing

Data preprocessing is a crucial step that transforms raw Amazon review data into a format suitable for frequent itemset mining algorithms. The preprocessing pipeline handles missing data, performs feature engineering, and applies filtering to ensure data quality and algorithm efficiency.

The preprocessing pipeline consists of several sequential steps:

  1. Data Loading: Load JSONL dataset files
  2. Missing Data Handling: Filter and handle null values
  3. Feature Engineering: Create transactions from user-product relationships
  4. Data Filtering: Remove infrequent items and small transactions
  5. Format Conversion: Convert to algorithm-ready format

The Amazon Reviews dataset contains several types of missing values that require different handling strategies:

Problem: Many products don’t have a parent ASIN (null values)

Solution: Fallback strategy

# Use parent_asin if available, otherwise fall back to asin
df = df.with_columns(
    pl.when(pl.col("parent_asin").is_not_null())
    .then(pl.col("parent_asin"))
    .otherwise(pl.col("asin"))
    .alias("product_id")
)

Rationale:

  • Parent ASIN groups product variants together (e.g., different colors/sizes)
  • When parent ASIN is unavailable, individual ASIN is used
  • This ensures no data loss while maximizing product grouping benefits

Impact:

  • Reduces unique item count when parent ASINs are available
  • Creates more meaningful product associations
  • Improves algorithm performance by reducing sparsity
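
To make the fallback concrete, here is a minimal sketch on toy rows (the ASIN values are invented for illustration):

import polars as pl

# Two variants share a parent ASIN; the third product has none
df = pl.DataFrame({
    "asin": ["B001", "B002", "B003"],
    "parent_asin": ["P100", "P100", None],
})

df = df.with_columns(
    pl.when(pl.col("parent_asin").is_not_null())
    .then(pl.col("parent_asin"))
    .otherwise(pl.col("asin"))
    .alias("product_id")
)
# product_id -> ["P100", "P100", "B003"]: the variants collapse into one item,
# while the product without a parent keeps its own ASIN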

Problem: Some records may have null or False values for verified_purchase

Solution: Filter to only include verified purchases

def filter_verified_purchases(data):
    df = _to_dataframe(data)
    # Keep only verified purchases; nulls are treated as unverified
    filtered_df = df.filter(
        pl.col("verified_purchase").fill_null(False)
    )
    return filtered_df.to_dicts()

Rationale:

  • Verified purchases ensure data quality
  • Unverified reviews may not represent actual purchases
  • Reduces noise in transaction data

Impact:

  • Typically reduces dataset size by 20-40%
  • Improves data quality and result reliability
  • Focuses analysis on actual purchase behavior
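
A small usage sketch, assuming filter_verified_purchases accepts a list of dicts (as the _to_dataframe call above suggests); the records are invented for illustration:

records = [
    {"user_id": "U1", "asin": "B001", "verified_purchase": True},
    {"user_id": "U2", "asin": "B002", "verified_purchase": False},
    {"user_id": "U3", "asin": "B003", "verified_purchase": None},
]

verified = filter_verified_purchases(records)
# Only U1's record remains: False and null both count as unverified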

Problem: Critical fields may be missing, preventing transaction creation

Solution: Filter out records with missing critical fields

# Filter out rows with missing user_id or product_id
df = df.filter(
    pl.col("user_id").is_not_null() &
    pl.col("product_id").is_not_null()
)

Rationale:

  • Both user_id and product_id (derived from asin or parent_asin) are required for transaction creation
  • Records without these fields cannot be used
  • Better to exclude than impute (no meaningful default values)

Impact:

  • Minimal data loss (these fields are rarely missing)
  • Ensures data integrity for transaction creation

Field | Missing Strategy | Impact
parent_asin | Fallback to asin | No data loss, maximizes grouping
verified_purchase | Filter (keep only True) | 20-40% data reduction, quality improvement
user_id | Filter (exclude) | Minimal loss, ensures integrity
asin | Filter (exclude) | Minimal loss, ensures integrity
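
The three strategies can also be applied in a single Polars pass. The sketch below uses a hypothetical helper name (handle_missing_values) and assumes the raw columns listed in the table; it is not the project's actual code:

import polars as pl

def handle_missing_values(df: pl.DataFrame) -> pl.DataFrame:
    """Apply the three missing-data strategies from the table above (sketch)."""
    return (
        df
        # Keep only verified purchases; nulls are treated as unverified
        .filter(pl.col("verified_purchase").fill_null(False))
        # Fall back to asin when parent_asin is null
        .with_columns(
            pl.coalesce(pl.col("parent_asin"), pl.col("asin")).alias("product_id")
        )
        # Drop rows missing the fields required for transaction creation
        .filter(pl.col("user_id").is_not_null() & pl.col("product_id").is_not_null())
    )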

Feature engineering transforms raw review data into transaction format suitable for association rule mining.

The core feature engineering step is creating transactions from user-product relationships:

  • Transaction: A set of items (products) purchased by a single user
  • Item: A product (ASIN or Parent ASIN)
  • User Cart: All products purchased by a user form one transaction

def create_user_carts(data, use_parent_asin=True):
    """
    Group products by user_id to create transactions.

    Each user's purchases form a transaction (cart), where items are
    the products (ASINs) they purchased.
    """
    df = _to_dataframe(data)
    # Select product_id based on use_parent_asin
    if use_parent_asin and "parent_asin" in df.columns:
        df = df.with_columns(
            pl.when(pl.col("parent_asin").is_not_null())
            .then(pl.col("parent_asin"))
            .otherwise(pl.col("asin"))
            .alias("product_id")
        )
    else:
        df = df.with_columns(pl.col("asin").alias("product_id"))
    # Group by user_id and collect unique product_ids
    user_carts_df = (
        df.select(["user_id", "product_id"])
        .unique(subset=["user_id", "product_id"])  # Remove duplicate reviews
        .group_by("user_id")
        .agg(pl.col("product_id").unique().alias("products"))
    )
    return user_carts_df

Key design decisions in this step:

  1. Product Grouping Strategy:

    • Option 1: Use individual ASINs (more granular)
    • Option 2: Use Parent ASINs (groups variants)
    • Choice: Parent ASIN with fallback (best of both worlds)
  2. Duplicate Handling:

    • Issue: Same user may review same product multiple times
    • Solution: Use .unique() to ensure one item per transaction
    • Rationale: Association rules don’t benefit from duplicate items
  3. Transaction Format:

    • Internal: Dictionary mapping user_id to set of products
    • Output: List of transactions (list of item lists)
    • Alternative: Polars DataFrame in long format (transaction_id, item)

Transformation: Choose between ASIN and Parent ASIN

  • When to use Parent ASIN: When available, for better grouping
  • When to use ASIN: Fallback when Parent ASIN is null
  • Benefit: Maximizes product grouping while avoiding data loss

Transformation: Group multiple reviews by same user-product pair

  • Method: Use .unique() to remove duplicates
  • Benefit: Clean transaction data, one item per user-product

Transformation: Convert between formats

  • Dictionary → List: [list(cart) for cart in user_carts.values()]
  • List → DataFrame: Long format (transaction_id, item) for efficient Polars operations
  • Benefit: Algorithm compatibility and performance optimization
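
A minimal sketch of these conversions on toy carts (the names user_carts, transactions, and long_df are illustrative):

import polars as pl

# Dictionary form: user_id -> set of products
user_carts = {"U1": {"P100", "P200"}, "U2": {"P200", "P300"}}

# Dictionary -> list of transactions (list of item lists)
transactions = [list(cart) for cart in user_carts.values()]

# List -> long-format DataFrame (transaction_id, item)
long_df = pl.DataFrame({
    "transaction_id": [tid for tid, items in enumerate(transactions) for _ in items],
    "item": [item for items in transactions for item in items],
})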

Together, these feature engineering steps provide:

  1. Dimensionality Reduction:

    • Groups product variants together
    • Reduces unique item count
    • Improves algorithm scalability
  2. Meaningful Associations:

    • Captures product family relationships
    • More interpretable association rules
    • Better for business insights
  3. Data Quality:

    • Removes duplicate entries
    • Ensures transaction integrity
    • Focuses on verified purchases

Data filtering removes low-quality or uninformative data to improve algorithm performance and result quality.

Purpose: Remove transactions with too few items

Implementation:

def prepare_transactions(data, min_transaction_size=1):
    # ... create transactions ...
    # Filter by minimum transaction size
    filtered_transactions = [
        t for t in transactions
        if len(t) >= min_transaction_size
    ]
    return filtered_transactions
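
On toy data, the size filter simply drops carts below the threshold (values invented for illustration):

carts = [["P1", "P2"], ["P3"], ["P1", "P4", "P5"]]
kept = [t for t in carts if len(t) >= 2]
# [["P1", "P2"], ["P1", "P4", "P5"]] -- the single-item cart is removed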

Rationale:

  • Single-item transactions don’t contribute to association rules
  • Association rules require at least 2 items
  • Reduces noise and computational overhead

Typical Values:

  • min_transaction_size=1: Keep all transactions (for exploration)
  • min_transaction_size=2: Standard for association rule mining
  • min_transaction_size=3: More restrictive, higher quality transactions

Impact:

  • Reduces transaction count (typically 10-30% reduction for min_size=2)
  • Improves average transaction size
  • Faster algorithm execution

Purpose: Remove items that appear too rarely

Implementation:

def filter_infrequent_items(
    transactions,
    min_item_frequency=2,
    min_transaction_size=2
):
    # Calculate item frequencies
    item_frequencies = calculate_item_frequencies(transactions)
    # Find frequent items
    frequent_items = {
        item for item, count in item_frequencies.items()
        if count >= min_item_frequency
    }
    # Filter transactions: keep only frequent items
    filtered_transactions = [
        [item for item in transaction if item in frequent_items]
        for transaction in transactions
    ]
    # Remove transactions that become too small
    filtered_transactions = [
        t for t in filtered_transactions
        if len(t) >= min_transaction_size
    ]
    return filtered_transactions
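
calculate_item_frequencies is referenced above but not shown; a minimal counting sketch (not necessarily the project's implementation) could look like this:

from collections import Counter

def calculate_item_frequencies(transactions):
    """Count the number of transactions each item appears in (sketch only)."""
    counts = Counter()
    for transaction in transactions:
        # Deduplicate first so an item is counted at most once per transaction
        counts.update(set(transaction))
    return dict(counts)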

Rationale:

  • Items appearing in very few transactions are unlikely to form frequent itemsets
  • Reduces computational overhead (fewer candidates)
  • Removes noise from rare products

Typical Values:

  • min_item_frequency=1: Keep all items (no filtering)
  • min_item_frequency=2: Remove singletons
  • min_item_frequency=3: More aggressive filtering (used in experiments)

Impact:

  • Reduces unique item count significantly (30-50% reduction)
  • Faster candidate generation
  • Cleaner, more meaningful results

Purpose: Programmatically determine filtering thresholds

Implementation:

def suggest_min_support(transactions, show_stats=True):
    """
    Suggest minimum support value based on item frequency distribution.
    """
    item_frequencies = calculate_item_frequencies(transactions)
    frequencies = list(item_frequencies.values())
    num_transactions = len(transactions)
    # Calculate suggestions using different methods
    suggestions = {
        'percentile_75': calculate_min_support(
            transactions, method='percentile', percentile=75.0
        ),
        'absolute_1pct': calculate_min_support(
            transactions, method='absolute',
            min_absolute_count=int(num_transactions * 0.01)
        ),
        'statistical': calculate_min_support(
            transactions, method='statistical'
        ),
    }
    return suggestions

Methods:

  • Percentile: Use percentile of item frequencies (e.g., 75th percentile)
  • Absolute: Use percentage of total transactions (e.g., 1%)
  • Statistical: Use mean ± standard deviation
  • Desired Items: Calculate to get desired number of frequent items
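
The body of calculate_min_support is not shown in this section; the sketch below is one hypothetical way to map the first three methods onto a relative support value, reusing the calculate_item_frequencies sketch from earlier. The project's actual implementation in preprocessing.py may differ:

import statistics

def calculate_min_support(transactions, method="percentile",
                          percentile=75.0, min_absolute_count=None):
    """Turn an item-frequency statistic into a relative support threshold (sketch only)."""
    n = len(transactions)
    frequencies = sorted(calculate_item_frequencies(transactions).values())

    if method == "percentile":
        # Frequency at the requested percentile of the distribution
        idx = min(int(len(frequencies) * percentile / 100), len(frequencies) - 1)
        count = frequencies[idx]
    elif method == "absolute":
        # Caller supplies an absolute transaction count (e.g., 1% of n)
        count = min_absolute_count or max(1, int(n * 0.01))
    elif method == "statistical":
        # Mean plus one standard deviation of the item frequencies
        count = statistics.mean(frequencies) + statistics.pstdev(frequencies)
    else:
        raise ValueError(f"Unknown method: {method}")

    return count / n  # Normalize to a support value in [0, 1]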

Impact:

  • Guides minimum support threshold selection
  • Balances completeness vs. computational cost
  • Ensures meaningful frequent itemsets

The complete filtering pipeline:

# Step 1: Filter verified purchases
verified_data = filter_verified_purchases(data)

# Step 2: Create transactions
transactions_df = prepare_transactions(
    verified_data,
    use_parent_asin=True,
    min_transaction_size=2,
    return_format="dataframe"
)

# Step 3: Filter infrequent items
filtered_transactions = filter_infrequent_items(
    transactions_df,
    min_item_frequency=3,
    min_transaction_size=2,
    return_format="dataframe"
)

# Step 4: Convert to algorithm format
transactions_list = (
    filtered_transactions
    .group_by("transaction_id")
    .agg(pl.col("item").unique().alias("items"))
    .select("items")
    .to_series()
    .to_list()
)

Filter | Parameter | Typical Reduction | Benefit
Verified Purchases | verified_purchase=True | 20-40% of records | Data quality
Min Transaction Size | min_size=2 | 10-30% of transactions | Rule quality
Infrequent Items | min_freq=3 | 30-50% of items | Performance

For frequent itemset mining, traditional standardization/normalization (e.g., z-score, min-max scaling) is not applicable because:

  1. Categorical Data: Items are categorical (product IDs), not numerical
  2. Binary Presence: Items either appear in a transaction or they don’t
  3. Support-Based: Support is already normalized (count / total transactions)

However, two forms of conceptual normalization are already built in.

Support values are inherently normalized:

  • Support = Count / Total Transactions
  • Range: [0, 1]
  • Already standardized across different dataset sizes
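
A quick worked example with toy numbers:

count = 50                             # item appears in 50 transactions
total_transactions = 10_000
support = count / total_transactions   # 0.005, always in [0, 1]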

Item frequencies are normalized by total transactions:

  • Enables comparison across datasets
  • Allows meaningful support thresholds
  • Independent of dataset size

Practical guidelines for preprocessing:

  • Run EDA before preprocessing
  • Understand data distributions
  • Make informed filtering decisions
  • Apply filters incrementally
  • Measure impact at each step
  • Balance quality vs. quantity
  • Experiment with different thresholds
  • Use programmatic suggestions
  • Validate with algorithm results
  • Filter verified purchases
  • Handle missing values appropriately
  • Ensure transaction integrity
  • Use Polars for efficient operations
  • Filter early to reduce data size
  • Cache preprocessed data when possible

The preprocessing pipeline is implemented in preprocessing.py with the following key functions:

  • filter_verified_purchases(): Filter to verified purchases only
  • create_user_carts(): Create user-product transactions
  • prepare_transactions(): Complete preprocessing pipeline
  • filter_infrequent_items(): Remove rare items
  • get_transaction_stats(): Calculate statistics
  • calculate_item_frequencies(): Count item frequencies
  • calculate_min_support(): Programmatic support calculation
  • suggest_min_support(): Suggest optimal support thresholds

Example usage:

from preprocessing import prepare_transactions, filter_infrequent_items
from dataset_loader import load_jsonl_dataset_polars

# Load dataset
data = load_jsonl_dataset_polars(url)

# Complete preprocessing pipeline
transactions_df = prepare_transactions(
    data,
    use_parent_asin=True,
    min_transaction_size=2,
    return_format="dataframe"
)

# Filter infrequent items
filtered_transactions = filter_infrequent_items(
    transactions_df,
    min_item_frequency=3,
    min_transaction_size=2,
    return_format="dataframe"
)

# Ready for algorithm input
from apriori import Apriori
algorithm = Apriori(min_support=0.0005)
algorithm.fit(filtered_transactions)

Data preprocessing transforms raw Amazon review data into a clean, structured format suitable for frequent itemset mining. By handling missing data appropriately, engineering meaningful features, and applying strategic filtering, we ensure:

  • Data Quality: Verified purchases, complete transactions
  • Algorithm Efficiency: Reduced dimensionality, filtered noise
  • Meaningful Results: Product grouping, appropriate support thresholds
  • Scalability: Efficient Polars-based operations

The preprocessing pipeline is designed to be flexible, allowing experimentation with different strategies while maintaining data integrity and algorithm compatibility.