Dylan Knutson b3ba58723c Enhance fit model functionality and argument handling
- Updated `fit_model_args.rs` to allow optional factors for matrix factorization and added an index name argument for index management.
- Modified `fit_model.rs` to handle index creation and dropping during data upsert, improving database interaction.
- Adjusted schema validation to infer vector dimensions and validate against specified factors.
- Enhanced `generate_test_data.rs` to create an IVFFlat index on the embeddings column.

These changes improve the flexibility and robustness of the fit model process, allowing for better management of database indices and more intuitive argument handling.
2024-12-28 20:37:12 +00:00

MF-Fitter

A Rust application that performs matrix factorization on user-item interaction data from PostgreSQL using libmf, and saves the resulting item embeddings back to the database.
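To make "matrix factorization" and "item embeddings" concrete, here is a minimal, std-only sketch of the idea. It is not libmf and not the project's code: it uses plain SGD, treats every observed (user, item) pair as a rating of 1.0, and ignores negative sampling, which a real implicit-feedback model would need. The names `Lcg` and `fit_item_embeddings`, the learning rate, and the regularization value are all hypothetical.

```rust
/// Tiny deterministic pseudo-random generator (an LCG) so the sketch needs no crates.
struct Lcg(u64);

impl Lcg {
    /// Returns a value in [0, 1).
    fn next_f32(&mut self) -> f32 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 40) as f32) / ((1u64 << 24) as f32)
    }
}

/// Factorizes (user, item) pairs with naive SGD; each observed pair counts as 1.0.
/// Returns one embedding of length `factors` per item (hypothetical helper, not libmf).
fn fit_item_embeddings(
    interactions: &[(usize, usize)],
    num_users: usize,
    num_items: usize,
    factors: usize,
    epochs: usize,
) -> Vec<Vec<f32>> {
    let mut rng = Lcg(42);
    // Small random initial values for every latent factor.
    let mut init = |rows: usize| -> Vec<Vec<f32>> {
        (0..rows)
            .map(|_| (0..factors).map(|_| rng.next_f32() * 0.1).collect())
            .collect()
    };
    let mut users = init(num_users);
    let mut items = init(num_items);
    let (lr, reg) = (0.05_f32, 0.01_f32); // assumed hyperparameters

    for _ in 0..epochs {
        for &(u, i) in interactions {
            // Predicted score is the dot product of the user and item vectors.
            let pred: f32 = users[u].iter().zip(&items[i]).map(|(p, q)| p * q).sum();
            let err = 1.0 - pred;
            for f in 0..factors {
                let (p, q) = (users[u][f], items[i][f]);
                users[u][f] += lr * (err * q - reg * p);
                items[i][f] += lr * (err * p - reg * q);
            }
        }
    }
    items
}
```

Calling `fit_item_embeddings(&pairs, num_users, num_items, 32, 20)` would yield one 32-element vector per item, which is the shape of data the application writes to the target table.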

Project Structure

.
├── .devcontainer/          # Development container configuration
│   ├── .env               # Container environment variables
│   └── docker-compose.yml # Docker services configuration
├── src/
│   ├── main.rs           # Main matrix factorization application
│   └── bin/
│       └── generate_test_data.rs  # Test data generator
├── Dockerfile.postgres    # PostgreSQL container with pgvector
├── init-pgvector.sql     # PostgreSQL initialization script
├── .env                  # Local environment variables
├── Cargo.toml            # Rust project dependencies
└── README.md            # This file

Features

  • Async I/O with tokio for efficient database operations
  • Batch processing for handling large datasets
  • Graceful shutdown on Ctrl+C
  • Configurable number of factors and batch size
  • Automatic creation of the target embedding table
  • Environment variable based configuration
  • Test data generation with realistic patterns

Prerequisites

  • Docker and Docker Compose
  • VS Code with the Dev Containers extension, formerly Remote - Containers (for development)
  • Rust (if developing outside container)

Quick Start

  1. Clone the repository:

    git clone <repository-url>
    cd mf-fitter
    
  2. Start the development container in VS Code:

    • Open the project in VS Code
    • When prompted, click "Reopen in Container"
    • Or run "Dev Containers: Reopen in Container" from the Command Palette (named "Remote-Containers: Reopen in Container" in older releases)
  3. Generate test data:

    cargo run --bin generate_test_data -- \
        --num-users 1000 \
        --num-items 5000 \
        --user-clusters 5 \
        --item-clusters 10 \
        --avg-interactions 20
    
  4. Run matrix factorization:

    cargo run -- \
        --source-table user_interactions \
        --user-id-column user_id \
        --item-id-column item_id \
        --target-table item_embeddings \
        --factors 32
    

Configuration

Environment Variables

Create a .env file in the project root:

POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=postgres
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
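As a rough illustration of how these variables could be turned into a connection string, the sketch below reads them through a lookup function and falls back to the sample values above. `PgConfig` and its methods are hypothetical names, and the fallback values are an assumption; the project may require the variables to be set or load them differently.

```rust
use std::env;

/// Connection settings mirroring the POSTGRES_* variables (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct PgConfig {
    host: String,
    port: u16,
    dbname: String,
    user: String,
    password: String,
}

impl PgConfig {
    /// Builds the config from any key lookup. Fallbacks are assumptions taken
    /// from the sample `.env`, not confirmed project defaults.
    fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Result<Self, String> {
        let or = |key: &str, default: &str| get(key).unwrap_or_else(|| default.to_string());
        let port = or("POSTGRES_PORT", "5432");
        Ok(Self {
            host: or("POSTGRES_HOST", "localhost"),
            port: port
                .parse::<u16>()
                .map_err(|e| format!("invalid POSTGRES_PORT {port:?}: {e}"))?,
            dbname: or("POSTGRES_DB", "postgres"),
            user: or("POSTGRES_USER", "postgres"),
            password: or("POSTGRES_PASSWORD", "postgres"),
        })
    }

    /// Reads the values from the process environment.
    fn from_env() -> Result<Self, String> {
        Self::from_lookup(|key| env::var(key).ok())
    }

    /// Key/value connection string; values containing spaces would need quoting.
    fn connection_string(&self) -> String {
        format!(
            "host={} port={} dbname={} user={} password={}",
            self.host, self.port, self.dbname, self.user, self.password
        )
    }
}
```

Taking the lookup as a parameter keeps the parsing logic testable without touching the real environment; `PgConfig::from_env()` is the thin wrapper a binary would call.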

Test Data Generator Arguments

  • --num-users: Number of users to generate (default: 1000)
  • --num-items: Number of items to generate (default: 5000)
  • --user-clusters: Number of user clusters (default: 5)
  • --item-clusters: Number of item clusters (default: 10)
  • --avg-interactions: Average interactions per user (default: 20)
  • --interactions-table: Source table name (default: "user_interactions")
  • --embeddings-table: Target table name (default: "item_embeddings")

Matrix Factorization Arguments

  • --source-table: Name of the table containing user-item interactions
  • --user-id-column: Name of the column containing user IDs (must be an integer type)
  • --item-id-column: Name of the column containing item IDs (must be an integer type)
  • --target-table: Name of the table where item embeddings will be saved
  • --factors: Number of factors for matrix factorization (default: 8)
  • --batch-size: Number of rows to load in each batch (default: 10000)
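The arguments above, with their documented defaults, can be pictured as the struct below. The real binary parses its arguments with clap; this sketch uses only the standard library so it stays self-contained. `FitArgs` and its `parse` function are hypothetical, and treating the four table and column flags as required is an assumption.

```rust
/// Std-only mirror of the documented arguments (hypothetical; the project uses clap).
#[derive(Debug, Clone, PartialEq)]
struct FitArgs {
    source_table: String,
    user_id_column: String,
    item_id_column: String,
    target_table: String,
    factors: usize,
    batch_size: usize,
}

impl FitArgs {
    /// Parses `--flag value` pairs; unknown flags and missing values are errors.
    fn parse<I: IntoIterator<Item = String>>(argv: I) -> Result<Self, String> {
        let mut args = argv.into_iter();
        let (mut source, mut user, mut item, mut target) = (None, None, None, None);
        // Defaults as documented in the list above.
        let (mut factors, mut batch_size) = (8usize, 10_000usize);
        while let Some(flag) = args.next() {
            let value = args
                .next()
                .ok_or_else(|| format!("missing value for {flag}"))?;
            match flag.as_str() {
                "--source-table" => source = Some(value),
                "--user-id-column" => user = Some(value),
                "--item-id-column" => item = Some(value),
                "--target-table" => target = Some(value),
                "--factors" => {
                    factors = value
                        .parse::<usize>()
                        .map_err(|_| format!("bad --factors: {value}"))?
                }
                "--batch-size" => {
                    batch_size = value
                        .parse::<usize>()
                        .map_err(|_| format!("bad --batch-size: {value}"))?
                }
                other => return Err(format!("unknown flag {other}")),
            }
        }
        Ok(Self {
            // Assumption: these four flags are mandatory.
            source_table: source.ok_or("--source-table is required")?,
            user_id_column: user.ok_or("--user-id-column is required")?,
            item_id_column: item.ok_or("--item-id-column is required")?,
            target_table: target.ok_or("--target-table is required")?,
            factors,
            batch_size,
        })
    }
}
```

With the Quick Start invocation, `factors` would be 32 and `batch_size` would keep its default of 10000.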

Database Schema

Input Table (user_interactions)

CREATE TABLE user_interactions (
    user_id INTEGER,
    item_id INTEGER,
    PRIMARY KEY (user_id, item_id)
);

Output Table (item_embeddings)

CREATE TABLE item_embeddings (
    item_id INTEGER PRIMARY KEY,
    embedding FLOAT[]
);
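Given the output schema above, saving an embedding for an item that may already exist is naturally an upsert. The sketch below builds such a statement for a configurable target table. `quote_ident` and `upsert_sql` are hypothetical helpers, and the exact statement the application issues is an assumption; only the table and column names come from the schema shown here.

```rust
/// Double-quotes an SQL identifier, escaping embedded quotes.
/// Note that quoted identifiers are case-sensitive in PostgreSQL.
fn quote_ident(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}

/// Builds an upsert keyed on `item_id`: insert a new row, or overwrite the
/// stored embedding when the item already has one.
fn upsert_sql(target_table: &str) -> String {
    format!(
        "INSERT INTO {} (item_id, embedding) VALUES ($1, $2) \
         ON CONFLICT (item_id) DO UPDATE SET embedding = EXCLUDED.embedding",
        quote_ident(target_table)
    )
}
```

The `$1` and `$2` placeholders are meant to be bound by the database client, so only the table name is interpolated into the string, and it is quoted first.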

Development

The project uses a VS Code devcontainer setup with:

  • PostgreSQL 16 with pgvector extension
  • Rust development environment
  • Automatic environment configuration

To modify the PostgreSQL setup:

  1. Edit Dockerfile.postgres for database customization
  2. Edit init-pgvector.sql for initialization scripts
  3. Rebuild the container: docker-compose up -d --build

Dependencies

  • Rust 2021 edition
  • PostgreSQL 16
  • libmf 0.3
  • tokio for async I/O
  • clap for CLI argument parsing
  • deadpool-postgres for connection pooling

License

This project is licensed under the MIT License - see the LICENSE file for details.
