MF-Fitter

A Rust application that performs matrix factorization on user-item interaction data from PostgreSQL using libmf, and saves the resulting item embeddings back to the database.

Project Structure

.
├── .devcontainer/          # Development container configuration
│   ├── .env               # Container environment variables
│   └── docker-compose.yml # Docker services configuration
├── src/
│   ├── main.rs           # Main matrix factorization application
│   └── bin/
│       └── generate_test_data.rs  # Test data generator
├── Dockerfile.postgres    # PostgreSQL container with pgvector
├── init-pgvector.sql     # PostgreSQL initialization script
├── .env                  # Local environment variables
├── Cargo.toml            # Rust project dependencies
└── README.md            # This file

Features

  • Async I/O with tokio for efficient database operations
  • Batch processing for handling large datasets
  • Graceful shutdown with Ctrl+C
  • Configurable number of factors and batch size
  • Automatic creation of the target embedding table
  • Environment variable based configuration
  • Test data generation with realistic patterns
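
The batch processing and Ctrl+C handling listed above can be pictured roughly as follows. This is a std-only sketch, not the code in src/main.rs: `process_in_batches` and the `shutdown` flag are hypothetical names, and the real application presumably sets such a flag from an async signal handler (for example tokio's `signal::ctrl_c`).

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Feeds `rows` to `handle_batch` in slices of at most `batch_size` items.
/// Stops before the next batch once `shutdown` is set and returns how many
/// rows were handled. (Hypothetical helper, for illustration only.)
fn process_in_batches<T>(
    rows: &[T],
    batch_size: usize,
    shutdown: &AtomicBool,
    mut handle_batch: impl FnMut(&[T]),
) -> usize {
    let mut processed = 0;
    // `chunks` panics on a size of zero, so clamp to at least one row.
    for batch in rows.chunks(batch_size.max(1)) {
        if shutdown.load(Ordering::Relaxed) {
            break; // finish cleanly between batches instead of mid-batch
        }
        handle_batch(batch);
        processed += batch.len();
    }
    processed
}
```

Checking the flag only between batches means a batch is never left half-processed, at the cost of waiting for the current batch to finish before exiting.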

Prerequisites

  • Docker and Docker Compose
  • VS Code with the Dev Containers extension, formerly named Remote - Containers (for development)
  • Rust (if developing outside container)

Quick Start

  1. Clone the repository:

    git clone <repository-url>
    cd mf-fitter
    
  2. Start the development container in VS Code:

    • Open the project in VS Code
    • When prompted, click "Reopen in Container"
    • Or use the Command Palette: "Dev Containers: Reopen in Container" (older releases of the extension label it "Remote-Containers: Reopen in Container")

  3. Generate test data:

    cargo run --bin generate_test_data -- \
        --num-users 1000 \
        --num-items 5000 \
        --user-clusters 5 \
        --item-clusters 10 \
        --avg-interactions 20
    
  4. Run matrix factorization:

    cargo run -- \
        --source-table user_interactions \
        --user-id-column user_id \
        --item-id-column item_id \
        --target-table item_embeddings \
        --factors 32
    

Configuration

Environment Variables

Create a .env file in the project root:

POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=postgres
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
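
As a rough illustration of how these variables could be turned into a connection string, here is a std-only sketch. `DbConfig` and its methods are hypothetical names, and falling back to the example values when a variable is unset is an assumption, not documented behavior of this project. The output uses the keyword/value format understood by libpq and tokio-postgres.

```rust
use std::env;

/// Connection settings mirroring the POSTGRES_* variables above.
#[derive(Debug, Clone, PartialEq)]
struct DbConfig {
    host: String,
    port: u16,
    dbname: String,
    user: String,
    password: String,
}

impl DbConfig {
    /// Builds the config from any key lookup, which keeps the logic testable.
    /// ASSUMPTION: unset variables fall back to the example values shown above.
    fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Result<Self, String> {
        let get = |key: &str, default: &str| lookup(key).unwrap_or_else(|| default.to_string());
        let port_raw = get("POSTGRES_PORT", "5432");
        let port = port_raw
            .parse::<u16>()
            .map_err(|e| format!("invalid POSTGRES_PORT {:?}: {}", port_raw, e))?;
        Ok(Self {
            host: get("POSTGRES_HOST", "localhost"),
            port,
            dbname: get("POSTGRES_DB", "postgres"),
            user: get("POSTGRES_USER", "postgres"),
            password: get("POSTGRES_PASSWORD", "postgres"),
        })
    }

    /// Reads the real process environment.
    fn from_env() -> Result<Self, String> {
        Self::from_lookup(|key| env::var(key).ok())
    }

    /// Keyword/value connection string. Values containing spaces or quotes
    /// would need escaping, which this sketch does not attempt.
    fn connection_string(&self) -> String {
        format!(
            "host={} port={} dbname={} user={} password={}",
            self.host, self.port, self.dbname, self.user, self.password
        )
    }
}
```

Calling `DbConfig::from_env()` after the .env file has been loaded into the process environment would yield the settings for the running container.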

Test Data Generator Arguments

  • --num-users: Number of users to generate (default: 1000)
  • --num-items: Number of items to generate (default: 5000)
  • --user-clusters: Number of user clusters (default: 5)
  • --item-clusters: Number of item clusters (default: 10)
  • --avg-interactions: Average interactions per user (default: 20)
  • --interactions-table: Source table name (default: "user_interactions")
  • --embeddings-table: Target table name (default: "item_embeddings")
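
To give a feel for what the cluster arguments might control, the sketch below generates clustered interactions with the standard library only. Everything in it is an assumption made for illustration: the rule that a user's cluster picks a favourite item cluster, the 80/20 split, the fixed number of draws per user (rather than an average), and the names `XorShift64` and `generate_interactions`. The actual generator in src/bin/generate_test_data.rs may work quite differently.

```rust
use std::collections::BTreeSet;

/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Integer in 0..n; the slight modulo bias is acceptable for test data.
    fn below(&mut self, n: u64) -> u64 {
        self.next_u64() % n
    }
}

/// Returns unique (user_id, item_id) pairs. Expects every count to be
/// non-zero and `item_clusters <= num_items`.
fn generate_interactions(
    num_users: u64,
    num_items: u64,
    user_clusters: u64,
    item_clusters: u64,
    draws_per_user: u64,
    seed: u64,
) -> BTreeSet<(u64, u64)> {
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    let mut pairs = BTreeSet::new();
    for user in 0..num_users {
        // Hypothetical rule: the user's cluster decides a favourite item cluster.
        let favourite = (user % user_clusters) % item_clusters;
        // Items belong to cluster `item % item_clusters`; count the favourites.
        let slots = (num_items - favourite + item_clusters - 1) / item_clusters;
        for _ in 0..draws_per_user {
            // About 80% of draws come from the favourite cluster.
            let item = if rng.below(10) < 8 {
                favourite + item_clusters * rng.below(slots)
            } else {
                rng.below(num_items)
            };
            pairs.insert((user, item)); // the set drops duplicate pairs
        }
    }
    pairs
}
```

Because duplicates collapse in the set, each user ends up with at most `draws_per_user` interactions, which also keeps the output compatible with a (user_id, item_id) primary key.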

Matrix Factorization Arguments

  • --source-table: Name of the table containing user-item interactions
  • --user-id-column: Name of the column containing user IDs (must be integer)
  • --item-id-column: Name of the column containing item IDs (must be integer)
  • --target-table: Name of the table where item embeddings will be saved
  • --factors: Number of factors for matrix factorization (default: 8)
  • --batch-size: Number of rows to load in each batch (default: 10000)

Database Schema

Input Table (user_interactions)

CREATE TABLE user_interactions (
    user_id INTEGER,
    item_id INTEGER,
    PRIMARY KEY (user_id, item_id)
);

Output Table (item_embeddings)

CREATE TABLE item_embeddings (
    item_id INTEGER PRIMARY KEY,
    embedding FLOAT[]
);
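
For orientation, the sketch below shows how one item's factor vector maps onto the embedding column when written as a PostgreSQL array literal, which is handy for psql or COPY. `embedding_literal` is a hypothetical helper; the application itself may bind the vector as a query parameter instead. It uses f64 because FLOAT without a precision means double precision in PostgreSQL.

```rust
/// Renders one embedding as a PostgreSQL array literal such as `{0.5,-1.25,3}`.
/// Non-finite values (NaN, infinities) are rejected here as a precaution.
fn embedding_literal(values: &[f64]) -> Result<String, String> {
    let mut parts = Vec::with_capacity(values.len());
    for (index, value) in values.iter().enumerate() {
        if !value.is_finite() {
            return Err(format!("non-finite value at index {}", index));
        }
        // Rust's Display output for finite floats is valid PostgreSQL float input.
        parts.push(value.to_string());
    }
    // `{{` and `}}` emit literal braces around the comma-separated values.
    Ok(format!("{{{}}}", parts.join(",")))
}
```

With `--factors 32`, each row of item_embeddings would hold a literal with 32 comma-separated values.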

Development

The project uses a VS Code devcontainer setup with:

  • PostgreSQL 16 with pgvector extension
  • Rust development environment
  • Automatic environment configuration

To modify the PostgreSQL setup:

  1. Edit Dockerfile.postgres for database customization
  2. Edit init-pgvector.sql for initialization scripts
  3. Rebuild the container: docker-compose up -d --build

Dependencies

  • Rust 2021 edition
  • PostgreSQL 16
  • libmf 0.3
  • tokio for async I/O
  • clap for CLI argument parsing
  • deadpool-postgres for connection pooling

License

This project is licensed under the MIT License - see the LICENSE file for details.
