MF-Fitter
A Rust application that performs matrix factorization on user-item interaction data from PostgreSQL using libmf, and saves the resulting item embeddings back to the database.
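To make "item embeddings" concrete: matrix factorization learns one vector of `factors` numbers per item, and items with similar interaction patterns tend to end up with vectors pointing in similar directions. The sketch below compares two such vectors with cosine similarity. The function name and the way the vectors are obtained are illustrative assumptions, not code from this project.

```rust
/// Cosine similarity between two embedding vectors of equal length.
/// Returns None if the lengths differ or either vector has zero norm.
/// Hypothetical helper for illustration; not taken from this repository.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    // Dot product of the two vectors.
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    // Euclidean norms of each vector.
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

A value close to 1 suggests two items are interacted with by similar users; a value near 0 suggests little overlap.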
Project Structure
.
├── .devcontainer/                 # Development container configuration
│   ├── .env                       # Container environment variables
│   └── docker-compose.yml         # Docker services configuration
├── src/
│   ├── main.rs                    # Main matrix factorization application
│   └── bin/
│       └── generate_test_data.rs  # Test data generator
├── Dockerfile.postgres            # PostgreSQL container with pgvector
├── init-pgvector.sql              # PostgreSQL initialization script
├── .env                           # Local environment variables
├── Cargo.toml                     # Rust project dependencies
└── README.md                      # This file
Features
- Async I/O with tokio for efficient database operations
- Batch processing for handling large datasets
- Graceful shutdown with Ctrl+C
- Configurable number of factors and batch size
- Automatic creation of the target embedding table
- Environment variable based configuration
- Test data generation with realistic patterns
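The batch processing and graceful shutdown features combine naturally: a shutdown flag can be checked between batches so that no batch is abandoned halfway. The following is a minimal sketch of that pattern using only the standard library. The function name is hypothetical, and the real application may be structured differently; in a tokio program the flag would typically be set by a task waiting on `tokio::signal::ctrl_c()`, which is omitted here.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Runs up to `total_batches` units of work, checking a shared shutdown
/// flag before each one. Returns how many batches completed.
/// Hypothetical helper; not taken from this repository.
pub fn run_batches<F>(total_batches: usize, shutdown: &AtomicBool, mut work: F) -> usize
where
    F: FnMut(usize),
{
    let mut completed = 0;
    for batch in 0..total_batches {
        // Stop between batches so no batch is left half-processed.
        if shutdown.load(Ordering::SeqCst) {
            break;
        }
        work(batch);
        completed += 1;
    }
    completed
}
```

The batch that is running when the flag is set still finishes; only the following batches are skipped.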
Prerequisites
- Docker and Docker Compose
- VS Code with Remote Containers extension (for development)
- Rust (if developing outside container)
Quick Start
- Clone the repository:
  git clone <repository-url>
  cd mf-fitter
- Start the development container in VS Code:
  - Open the project in VS Code
  - When prompted, click "Reopen in Container"
  - Or use Command Palette: "Remote-Containers: Reopen in Container"
- Generate test data:
  cargo run --bin generate_test_data -- \
    --num-users 1000 \
    --num-items 5000 \
    --user-clusters 5 \
    --item-clusters 10 \
    --avg-interactions 20
- Run matrix factorization:
  cargo run -- \
    --source-table user_interactions \
    --user-id-column user_id \
    --item-id-column item_id \
    --target-table item_embeddings \
    --factors 32
Configuration
Environment Variables
Create a .env file in the project root:
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=postgres
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
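As a rough illustration of how these variables could be turned into a connection string, the sketch below builds a key/value string in the libpq style. The function names are hypothetical, and the fallback values simply mirror the sample values above; whether the application itself applies such defaults is an assumption.

```rust
use std::env;

/// Builds a libpq-style connection string from the POSTGRES_* variables.
/// `lookup` abstracts over the environment so the function can be tested.
/// The fallback values mirror the sample .env and are an assumption.
/// Values containing spaces or quotes would need escaping, which this
/// sketch does not handle.
pub fn connection_string<F>(lookup: F) -> String
where
    F: Fn(&str) -> Option<String>,
{
    let get = |key: &str, default: &str| lookup(key).unwrap_or_else(|| default.to_string());
    format!(
        "host={} port={} dbname={} user={} password={}",
        get("POSTGRES_HOST", "localhost"),
        get("POSTGRES_PORT", "5432"),
        get("POSTGRES_DB", "postgres"),
        get("POSTGRES_USER", "postgres"),
        get("POSTGRES_PASSWORD", "postgres"),
    )
}

/// Convenience wrapper that reads the real process environment.
pub fn connection_string_from_env() -> String {
    connection_string(|key| env::var(key).ok())
}
```

Passing the lookup as a closure keeps the logic testable without mutating the process environment.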
Test Data Generator Arguments
- --num-users: Number of users to generate (default: 1000)
- --num-items: Number of items to generate (default: 5000)
- --user-clusters: Number of user clusters (default: 5)
- --item-clusters: Number of item clusters (default: 10)
- --avg-interactions: Average interactions per user (default: 20)
- --interactions-table: Source table name (default: "user_interactions")
- --embeddings-table: Target table name (default: "item_embeddings")
Matrix Factorization Arguments
- --source-table: Name of the table containing user-item interactions
- --user-id-column: Name of the column containing user IDs (must be integer)
- --item-id-column: Name of the column containing item IDs (must be integer)
- --target-table: Name of the table where item embeddings will be saved
- --factors: Number of factors for matrix factorization (default: 8)
- --batch-size: Number of rows to load in each batch (default: 10000)
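Because the table and column names arrive as command-line arguments, they have to be spliced into the SQL text as identifiers rather than bound as parameters. The sketch below shows one way to quote them safely and to build a query for a single batch. Both function names are hypothetical, and LIMIT/OFFSET paging is an assumption made for illustration; the actual loader may use a cursor or keyset pagination instead.

```rust
/// Quotes a PostgreSQL identifier: wraps it in double quotes and doubles
/// any embedded double quote. Quoted identifiers are case-sensitive, and a
/// schema-qualified name such as `public.t` would be treated as one name.
pub fn quote_ident(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}

/// Builds the SELECT for one batch of interactions.
/// `batch_index` is zero-based; the ORDER BY keeps paging deterministic.
/// Hypothetical helper; not taken from this repository.
pub fn batch_query(
    table: &str,
    user_col: &str,
    item_col: &str,
    batch_size: u64,
    batch_index: u64,
) -> String {
    let u = quote_ident(user_col);
    let i = quote_ident(item_col);
    format!(
        "SELECT {u}, {i} FROM {t} ORDER BY {u}, {i} LIMIT {limit} OFFSET {offset}",
        u = u,
        i = i,
        t = quote_ident(table),
        limit = batch_size,
        offset = batch_size * batch_index
    )
}
```

With the default batch size of 10000, the third batch (index 2) starts at offset 20000.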
Database Schema
Input Table (user_interactions)
CREATE TABLE user_interactions (
    user_id INTEGER,
    item_id INTEGER,
    PRIMARY KEY (user_id, item_id)
);
Output Table (item_embeddings)
CREATE TABLE item_embeddings (
    item_id INTEGER PRIMARY KEY,
    embedding FLOAT[]
);
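Since item_id is the primary key of the output table, re-running the fit can overwrite existing rows with an upsert. The sketch below shows a statement of that shape together with a helper that renders an embedding as a PostgreSQL array literal, which is handy for psql, COPY, or logging. The constant and function names are hypothetical, the table name is hard-coded to the example above, and how the `$2` parameter is bound depends on the database driver.

```rust
/// Upsert for the example output table. `$1` is the item id and `$2` the
/// embedding array. Assumes the default table name from this README.
pub const UPSERT_SQL: &str = "INSERT INTO item_embeddings (item_id, embedding) \
     VALUES ($1, $2) \
     ON CONFLICT (item_id) DO UPDATE SET embedding = EXCLUDED.embedding";

/// Formats an embedding as a PostgreSQL array literal such as `{0.5,-1,2.25}`.
/// Non-finite values (NaN, infinity) are not given special handling here.
/// Hypothetical helper; not taken from this repository.
pub fn to_array_literal(embedding: &[f32]) -> String {
    let parts: Vec<String> = embedding.iter().map(|v| v.to_string()).collect();
    format!("{{{}}}", parts.join(","))
}
```

`EXCLUDED` refers to the row that was proposed for insertion, so the existing embedding is replaced by the newly computed one.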
Development
The project uses a VS Code devcontainer setup with:
- PostgreSQL 16 with pgvector extension
- Rust development environment
- Automatic environment configuration
To modify the PostgreSQL setup:
- Edit Dockerfile.postgres for database customization
- Edit init-pgvector.sql for initialization scripts
- Rebuild the container:
  docker-compose up -d --build
Dependencies
- Rust 2021 edition
- PostgreSQL 16
- libmf 0.3
- tokio for async I/O
- clap for CLI argument parsing
- deadpool-postgres for connection pooling
License
This project is licensed under the MIT License - see the LICENSE file for details.