# MF-Fitter

A Rust application that reads user-item interaction data from PostgreSQL, factorizes it with libmf, and saves the resulting item embeddings back to the database.

## Project Structure

```
.
├── .devcontainer/          # Development container configuration
│   ├── .env               # Container environment variables
│   └── docker-compose.yml # Docker services configuration
├── src/
│   ├── main.rs           # Main matrix factorization application
│   └── bin/
│       └── generate_test_data.rs  # Test data generator
├── Dockerfile.postgres    # PostgreSQL container with pgvector
├── init-pgvector.sql     # PostgreSQL initialization script
├── .env                  # Local environment variables
├── Cargo.toml            # Rust project dependencies
└── README.md            # This file
```

## Features

- Async I/O with tokio for efficient database operations
- Batch processing for handling large datasets
- Graceful shutdown with Ctrl+C
- Configurable number of factors and batch size
- Automatic creation of the target embedding table
- Environment variable based configuration
- Test data generation with realistic patterns
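
How batch processing and graceful shutdown can fit together is easiest to see in code. The following is a minimal sketch using only the standard library, not the code in `src/main.rs`: `process_batches` and the `stop` flag are hypothetical names. It assumes the flag would be set by a task that waits on `tokio::signal::ctrl_c()`, so that the batch in progress finishes before the loop exits.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Processes `rows` in batches of `batch_size`, checking `stop` before each
/// batch. Returns the number of batches that were fully handled.
/// (Hypothetical helper for illustration.)
fn process_batches<T>(
    rows: &[T],
    batch_size: usize,
    stop: &AtomicBool,
    mut handle: impl FnMut(&[T]),
) -> usize {
    let mut completed = 0;
    // `chunks` panics on a size of zero, so clamp to at least one row per batch.
    for batch in rows.chunks(batch_size.max(1)) {
        // A shutdown request is honored between batches, never mid-batch.
        if stop.load(Ordering::SeqCst) {
            break;
        }
        handle(batch);
        completed += 1;
    }
    completed
}
```

Checking the flag only between batches keeps each batch atomic from the caller's point of view, at the cost of a shutdown delay of up to one batch.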

## Prerequisites

- Docker and Docker Compose
- VS Code with the Dev Containers extension (formerly "Remote - Containers"), for development
- Rust toolchain (only if developing outside the container)

## Quick Start

1. Clone the repository:
   ```bash
   git clone <repository-url>
   cd mf-fitter
   ```

2. Start the development container in VS Code:
   - Open the project in VS Code
   - When prompted, click "Reopen in Container"
   - Or run "Dev Containers: Reopen in Container" from the Command Palette (older versions of the extension list it as "Remote-Containers: Reopen in Container")

3. Generate test data:
   ```bash
   cargo run --bin generate_test_data -- \
       --num-users 1000 \
       --num-items 5000 \
       --user-clusters 5 \
       --item-clusters 10 \
       --avg-interactions 20
   ```

4. Run matrix factorization:
   ```bash
   cargo run -- \
       --source-table user_interactions \
       --user-id-column user_id \
       --item-id-column item_id \
       --target-table item_embeddings \
       --factors 32
   ```

## Configuration

### Environment Variables

Create a `.env` file in the project root:

```env
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=postgres
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
```
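
As an illustration of how these variables could be turned into a connection string, here is a minimal sketch. The function names are hypothetical, and the fallback values simply mirror the example above; they are an assumption, not necessarily the defaults the application uses. The output is the `key=value` format that libpq-style clients accept.

```rust
use std::env;

/// Builds a libpq-style connection string from the POSTGRES_* variables.
/// `lookup` abstracts the environment so the logic can be exercised without
/// touching the real process environment. (Hypothetical helper.)
fn connection_string(lookup: impl Fn(&str) -> Option<String>) -> String {
    // Assumed fallbacks, copied from the example `.env` file.
    let get = |key: &str, default: &str| lookup(key).unwrap_or_else(|| default.to_string());
    format!(
        "host={} port={} dbname={} user={} password={}",
        get("POSTGRES_HOST", "localhost"),
        get("POSTGRES_PORT", "5432"),
        get("POSTGRES_DB", "postgres"),
        get("POSTGRES_USER", "postgres"),
        get("POSTGRES_PASSWORD", "postgres"),
    )
}

/// Convenience wrapper that reads from the real process environment.
fn connection_string_from_env() -> String {
    connection_string(|key| env::var(key).ok())
}
```

Values containing spaces or quotes would need escaping before being placed in this format; the sketch leaves that out for brevity.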

### Test Data Generator Arguments

- `--num-users`: Number of users to generate (default: 1000)
- `--num-items`: Number of items to generate (default: 5000)
- `--user-clusters`: Number of user clusters (default: 5)
- `--item-clusters`: Number of item clusters (default: 10)
- `--avg-interactions`: Average interactions per user (default: 20)
- `--interactions-table`: Source table name (default: "user_interactions")
- `--embeddings-table`: Target table name (default: "item_embeddings")

### Matrix Factorization Arguments

- `--source-table`: Name of the table containing user-item interactions
- `--user-id-column`: Name of the column containing user IDs (must be integer)
- `--item-id-column`: Name of the column containing item IDs (must be integer)
- `--target-table`: Name of the table where item embeddings will be saved
- `--factors`: Number of factors for matrix factorization (default: 8)
- `--batch-size`: Number of rows to load in each batch (default: 10000)
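
Because table and column names arrive as command-line arguments, they cannot be bound as query parameters and have to be interpolated into the SQL text. The sketch below shows one way these arguments might map to a per-batch query. It is an assumption about the approach, not the application's actual query: `quote_ident` and `batch_query` are hypothetical helpers, and `LIMIT`/`OFFSET` paging is only one possible batching strategy.

```rust
/// Quotes a PostgreSQL identifier: wraps it in double quotes and doubles any
/// embedded double quotes. (Hypothetical helper.)
fn quote_ident(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}

/// Builds the query for one batch of interactions. The ORDER BY clause makes
/// the paging deterministic across batches.
fn batch_query(
    source_table: &str,
    user_col: &str,
    item_col: &str,
    batch_size: u64,
    offset: u64,
) -> String {
    format!(
        "SELECT {u}, {i} FROM {t} ORDER BY {u}, {i} LIMIT {n} OFFSET {o}",
        u = quote_ident(user_col),
        i = quote_ident(item_col),
        t = quote_ident(source_table),
        n = batch_size,
        o = offset,
    )
}
```

Quoted identifiers are case-sensitive in PostgreSQL, and a schema-qualified name such as `public.user_interactions` would be treated here as a single identifier, so each part would need to be quoted separately.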

## Database Schema

### Input Table (user_interactions)
```sql
CREATE TABLE user_interactions (
    user_id INTEGER,
    item_id INTEGER,
    PRIMARY KEY (user_id, item_id)
);
```
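
IDs in a table like this are often sparse, while factorization works on matrix row and column indices. A common preprocessing step, shown here as a sketch and not necessarily what this application does, is to map each raw ID to a dense zero-based index and keep the reverse mapping for writing results back. `IdIndexer` is a hypothetical type.

```rust
use std::collections::HashMap;

/// Maps arbitrary integer IDs to dense, zero-based indices and back.
/// (Hypothetical helper for illustration.)
#[derive(Default)]
struct IdIndexer {
    to_index: HashMap<i32, usize>,
    to_id: Vec<i32>,
}

impl IdIndexer {
    /// Returns the index for `id`, assigning the next free index on first sight.
    fn index_of(&mut self, id: i32) -> usize {
        if let Some(&idx) = self.to_index.get(&id) {
            return idx;
        }
        let idx = self.to_id.len();
        self.to_index.insert(id, idx);
        self.to_id.push(id);
        idx
    }

    /// Returns the original ID for a dense index, if that index was assigned.
    fn id_of(&self, index: usize) -> Option<i32> {
        self.to_id.get(index).copied()
    }

    /// Number of distinct IDs seen so far.
    fn len(&self) -> usize {
        self.to_id.len()
    }
}
```

One indexer per axis (users and items) would give the matrix dimensions directly via `len()`.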

### Output Table (item_embeddings)
```sql
CREATE TABLE item_embeddings (
    item_id INTEGER PRIMARY KEY,
    embedding FLOAT[]
);
```
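
Each row of this table holds one item's factor vector, whose length equals `--factors`. The sketch below shows how such rows could be assembled, assuming the item factors are available as a single flat, row-major slice. That layout and the `embedding_rows` helper are assumptions for illustration, not the libmf crate's API.

```rust
/// Splits a flat, row-major factor matrix into one `(item_id, embedding)`
/// pair per item. (Hypothetical helper; the flat layout is an assumption.)
fn embedding_rows(
    item_ids: &[i32],
    flat: &[f32],
    factors: usize,
) -> Result<Vec<(i32, Vec<f32>)>, String> {
    if factors == 0 {
        return Err("factors must be greater than zero".to_string());
    }
    if flat.len() != item_ids.len() * factors {
        return Err(format!(
            "expected {} values for {} items with {} factors, got {}",
            item_ids.len() * factors,
            item_ids.len(),
            factors,
            flat.len()
        ));
    }
    // Pair each item ID with its slice of `factors` consecutive values.
    Ok(item_ids
        .iter()
        .copied()
        .zip(flat.chunks_exact(factors).map(|row| row.to_vec()))
        .collect())
}
```

Validating the length up front avoids silently writing truncated or misaligned embeddings.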

## Development

The project uses a VS Code devcontainer setup with:
- PostgreSQL 16 with pgvector extension
- Rust development environment
- Automatic environment configuration

To modify the PostgreSQL setup:
1. Edit `Dockerfile.postgres` for database customization
2. Edit `init-pgvector.sql` for initialization scripts
3. Rebuild the container: run `docker-compose up -d --build` from `.devcontainer/`, where `docker-compose.yml` lives

## Dependencies

- Rust 2021 edition
- PostgreSQL 16
- libmf 0.3
- tokio for async I/O
- clap for CLI argument parsing
- deadpool-postgres for connection pooling

## License

This project is licensed under the MIT License - see the LICENSE file for details.