# MF-Fitter

A Rust application that performs matrix factorization on user-item interaction data from PostgreSQL using libmf and saves the resulting item embeddings back to the database.

## Project Structure

```
.
├── .devcontainer/                 # Development container configuration
│   ├── .env                       # Container environment variables
│   └── docker-compose.yml         # Docker services configuration
├── src/
│   ├── main.rs                    # Main matrix factorization application
│   └── bin/
│       └── generate_test_data.rs  # Test data generator
├── Dockerfile.postgres            # PostgreSQL container with pgvector
├── init-pgvector.sql              # PostgreSQL initialization script
├── .env                           # Local environment variables
├── Cargo.toml                     # Rust project dependencies
└── README.md                      # This file
```

## Features

- Async I/O with tokio for efficient database operations
- Batch processing for handling large datasets
- Graceful shutdown on Ctrl+C
- Configurable number of factors and batch size
- Automatic creation of the target embedding table
- Configuration through environment variables
- Test data generation with realistic patterns

## Prerequisites

- Docker and Docker Compose
- VS Code with Remote Containers extension (for development)
- Rust (if developing outside the container)

## Quick Start

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd mf-fitter
   ```

2. Start the development container in VS Code:
   - Open the project in VS Code
   - When prompted, click "Reopen in Container"
   - Or use the Command Palette: "Remote-Containers: Reopen in Container"

3. Generate test data:

   ```bash
   cargo run --bin generate_test_data -- \
     --num-users 1000 \
     --num-items 5000 \
     --user-clusters 5 \
     --item-clusters 10 \
     --avg-interactions 20
   ```

4. Run matrix factorization:

   ```bash
   cargo run -- \
     --source-table user_interactions \
     --user-id-column user_id \
     --item-id-column item_id \
     --target-table item_embeddings \
     --factors 32
   ```

## Configuration

### Environment Variables

Create a `.env` file in the project root:

```env
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=postgres
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
```

### Test Data Generator Arguments

- `--num-users`: Number of users to generate (default: 1000)
- `--num-items`: Number of items to generate (default: 5000)
- `--user-clusters`: Number of user clusters (default: 5)
- `--item-clusters`: Number of item clusters (default: 10)
- `--avg-interactions`: Average interactions per user (default: 20)
- `--interactions-table`: Source table name (default: "user_interactions")
- `--embeddings-table`: Target table name (default: "item_embeddings")

### Matrix Factorization Arguments

- `--source-table`: Name of the table containing user-item interactions
- `--user-id-column`: Name of the column containing user IDs (must be integer)
- `--item-id-column`: Name of the column containing item IDs (must be integer)
- `--target-table`: Name of the table where item embeddings will be saved
- `--factors`: Number of factors for matrix factorization (default: 8)
- `--batch-size`: Number of rows to load in each batch (default: 10000)

## Database Schema

### Input Table (user_interactions)

```sql
CREATE TABLE user_interactions (
    user_id INTEGER,
    item_id INTEGER,
    PRIMARY KEY (user_id, item_id)
);
```

### Output Table (item_embeddings)

```sql
CREATE TABLE item_embeddings (
    item_id INTEGER PRIMARY KEY,
    embedding FLOAT[]
);
```

## Development

The project uses a VS Code devcontainer setup with:

- PostgreSQL 16 with the pgvector extension
- Rust development environment
- Automatic environment configuration

To modify the PostgreSQL setup:

1. Edit `Dockerfile.postgres` for database customization
2. Edit `init-pgvector.sql` for initialization scripts
3. Rebuild the container: `docker-compose up -d --build`

## Dependencies

- Rust 2021 edition
- PostgreSQL 16
- libmf 0.3
- tokio for async I/O
- clap for CLI argument parsing
- deadpool-postgres for connection pooling

## License

This project is licensed under the MIT License. See the LICENSE file for details.
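
As an illustration of the environment-variable configuration described under Configuration, the `POSTGRES_*` variables could be turned into a connection string roughly as follows. This is a minimal sketch, not the project's actual code: the `PgSettings` type, its method names, and the fallback defaults (copied from the sample `.env`) are all hypothetical. It uses only the standard library and produces the key/value connection string format that libpq and tokio-postgres accept.

```rust
use std::env;

/// Hypothetical holder for the POSTGRES_* settings from the `.env` file.
#[derive(Debug, Clone, PartialEq)]
pub struct PgSettings {
    pub host: String,
    pub port: u16,
    pub dbname: String,
    pub user: String,
    pub password: String,
}

impl PgSettings {
    /// Build settings from any key -> value lookup, so the parsing logic
    /// can be exercised without touching the real process environment.
    pub fn from_lookup<F>(lookup: F) -> Result<Self, String>
    where
        F: Fn(&str) -> Option<String>,
    {
        // Assumption: fall back to the sample `.env` values when a variable is unset.
        let get = |key: &str, default: &str| lookup(key).unwrap_or_else(|| default.to_string());

        // The port is the only value that needs parsing; report bad input clearly.
        let port_raw = get("POSTGRES_PORT", "5432");
        let port = port_raw
            .parse::<u16>()
            .map_err(|e| format!("invalid POSTGRES_PORT {:?}: {}", port_raw, e))?;

        Ok(Self {
            host: get("POSTGRES_HOST", "localhost"),
            port,
            dbname: get("POSTGRES_DB", "postgres"),
            user: get("POSTGRES_USER", "postgres"),
            password: get("POSTGRES_PASSWORD", "postgres"),
        })
    }

    /// Read the settings from the process environment.
    pub fn from_env() -> Result<Self, String> {
        Self::from_lookup(|key| env::var(key).ok())
    }

    /// Render a key/value connection string.
    /// Note: values containing spaces or quotes would need escaping,
    /// which this sketch does not handle.
    pub fn to_conn_string(&self) -> String {
        format!(
            "host={} port={} dbname={} user={} password={}",
            self.host, self.port, self.dbname, self.user, self.password
        )
    }
}
```

Taking the lookup as a closure keeps the parsing testable without mutating the process environment. A binary would call `PgSettings::from_env()` and pass the result of `to_conn_string()` to whichever client or pool it uses.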