MF-Fitter
A Rust application that performs matrix factorization on user-item interaction data from PostgreSQL using libmf, and saves the resulting item embeddings back to the database.
Project Structure
.
├── .devcontainer/                 # Development container configuration
│   ├── .env                       # Container environment variables
│   └── docker-compose.yml         # Docker services configuration
├── src/
│   ├── main.rs                    # Main matrix factorization application
│   └── bin/
│       └── generate_test_data.rs  # Test data generator
├── Dockerfile.postgres            # PostgreSQL container with pgvector
├── init-pgvector.sql              # PostgreSQL initialization script
├── .env                           # Local environment variables
├── Cargo.toml                     # Rust project dependencies
└── README.md                      # This file
Features
- Async I/O with tokio for efficient database operations
- Batch processing for handling large datasets
- Graceful shutdown with Ctrl+C
- Configurable number of factors and batch size
- Automatic creation of the target embedding table
- Environment variable based configuration
- Test data generation with realistic patterns
Prerequisites
- Docker and Docker Compose
- VS Code with Remote Containers extension (for development)
- Rust (if developing outside container)
Quick Start
- Clone the repository:
  git clone <repository-url>
  cd mf-fitter
- Start the development container in VS Code:
  - Open the project in VS Code
  - When prompted, click "Reopen in Container"
  - Or use Command Palette: "Remote-Containers: Reopen in Container"
- Generate test data:
  cargo run --bin generate_test_data -- \
    --num-users 1000 \
    --num-items 5000 \
    --user-clusters 5 \
    --item-clusters 10 \
    --avg-interactions 20
- Run matrix factorization:
  cargo run -- \
    --source-table user_interactions \
    --user-id-column user_id \
    --item-id-column item_id \
    --target-table item_embeddings \
    --factors 32
Configuration
Environment Variables
Create a .env file in the project root:
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=postgres
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
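As an illustration of how these variables could be combined, here is a minimal sketch that assembles a key-value PostgreSQL connection string. The function name `connection_string` is hypothetical, and falling back to the values listed above when a variable is unset is an assumption; the application's actual handling may differ.

```rust
use std::collections::HashMap;

/// Builds a key-value PostgreSQL connection string from POSTGRES_* variables.
/// Assumption: a missing variable falls back to the value shown in this README.
fn connection_string(vars: &HashMap<String, String>) -> String {
    // Look up a variable, or use the given default when it is absent.
    let get = |key: &str, default: &str| -> String {
        vars.get(key).cloned().unwrap_or_else(|| default.to_string())
    };
    format!(
        "host={} port={} dbname={} user={} password={}",
        get("POSTGRES_HOST", "localhost"),
        get("POSTGRES_PORT", "5432"),
        get("POSTGRES_DB", "postgres"),
        get("POSTGRES_USER", "postgres"),
        get("POSTGRES_PASSWORD", "postgres"),
    )
}
```

Calling `connection_string(&std::env::vars().collect())` would read the values from the current process environment. Values containing spaces or quotes would need escaping, which this sketch leaves out.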
Test Data Generator Arguments
- --num-users: Number of users to generate (default: 1000)
- --num-items: Number of items to generate (default: 5000)
- --user-clusters: Number of user clusters (default: 5)
- --item-clusters: Number of item clusters (default: 10)
- --avg-interactions: Average interactions per user (default: 20)
- --interactions-table: Source table name (default: "user_interactions")
- --embeddings-table: Target table name (default: "item_embeddings")
Matrix Factorization Arguments
- --source-table: Name of the table containing user-item interactions
- --user-id-column: Name of the column containing user IDs (must be integer)
- --item-id-column: Name of the column containing item IDs (must be integer)
- --target-table: Name of the table where item embeddings will be saved
- --factors: Number of factors for matrix factorization (default: 8)
- --batch-size: Number of rows to load in each batch (default: 10000)
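Because the table and column names arrive as command-line arguments, any query built from them has to treat them as identifiers rather than as values. The sketch below shows one way a per-batch SELECT could be assembled. The helpers `quote_ident` and `batch_query` are hypothetical, and the LIMIT/OFFSET paging strategy is an assumption, not necessarily how MF-Fitter loads its batches.

```rust
/// Quotes a PostgreSQL identifier: wraps it in double quotes and doubles
/// any embedded double quote.
fn quote_ident(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}

/// Builds the SELECT for one batch of interactions.
/// Assumption: batches are paged with LIMIT/OFFSET over a stable ordering.
fn batch_query(
    table: &str,
    user_col: &str,
    item_col: &str,
    batch_size: u64,
    offset: u64,
) -> String {
    let table = quote_ident(table);
    let user = quote_ident(user_col);
    let item = quote_ident(item_col);
    // ORDER BY keeps the row order stable so consecutive batches do not overlap.
    format!(
        "SELECT {user}, {item} FROM {table} ORDER BY {user}, {item} LIMIT {batch_size} OFFSET {offset}"
    )
}
```

With the defaults above, the first batch would be requested with `batch_query("user_interactions", "user_id", "item_id", 10000, 0)`. Quoting makes identifiers case-sensitive, and a schema-qualified name such as `public.user_interactions` would need each part quoted separately.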
Database Schema
Input Table (user_interactions)
CREATE TABLE user_interactions (
user_id INTEGER,
item_id INTEGER,
PRIMARY KEY (user_id, item_id)
);
Output Table (item_embeddings)
CREATE TABLE item_embeddings (
item_id INTEGER PRIMARY KEY,
embedding FLOAT[]
);
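Once the embeddings are stored, a typical use is comparing items by the angle between their factor vectors. The following sketch computes cosine similarity for two embeddings read from the `embedding` column. It uses `f64` because `FLOAT` without a precision is double precision in PostgreSQL. The function `cosine_similarity` is illustrative and not part of this project.

```rust
/// Cosine similarity between two embeddings.
/// Returns None when the lengths differ, a slice is empty, or either
/// vector has zero norm (the similarity is undefined in those cases).
fn cosine_similarity(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

Two items with identical embeddings yield `Some(1.0)`, and orthogonal embeddings yield `Some(0.0)`.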
Development
The project uses a VS Code devcontainer setup with:
- PostgreSQL 16 with pgvector extension
- Rust development environment
- Automatic environment configuration
To modify the PostgreSQL setup:
- Edit Dockerfile.postgres for database customization
- Edit init-pgvector.sql for initialization scripts
- Rebuild the container:
  docker-compose up -d --build
Dependencies
- Rust 2021 edition
- PostgreSQL 16
- libmf 0.3
- tokio for async I/O
- clap for CLI argument parsing
- deadpool-postgres for connection pooling
License
This project is licensed under the MIT License - see the LICENSE file for details.