# MF-Fitter

A Rust application that performs matrix factorization on user-item interaction data from PostgreSQL using libmf, and saves the resulting item embeddings back to the database.

## Project Structure

```
.
├── .devcontainer/                 # Development container configuration
│   ├── .env                       # Container environment variables
│   └── docker-compose.yml         # Docker services configuration
├── src/
│   ├── main.rs                    # Main matrix factorization application
│   └── bin/
│       └── generate_test_data.rs  # Test data generator
├── Dockerfile.postgres            # PostgreSQL container with pgvector
├── init-pgvector.sql              # PostgreSQL initialization script
├── .env                           # Local environment variables
├── Cargo.toml                     # Rust project dependencies
└── README.md                      # This file
```

## Features

- Async I/O with tokio for efficient database operations
- Batch processing for handling large datasets
- Graceful shutdown with Ctrl+C
- Configurable number of factors and batch size
- Automatic creation of the target embedding table
- Environment variable based configuration
- Test data generation with realistic patterns

## Prerequisites

- Docker and Docker Compose
- VS Code with the Dev Containers extension, formerly named Remote - Containers (for development)
- Rust (if developing outside the container)

## Quick Start

1. Clone the repository:
   ```bash
   git clone <repository-url>
   cd mf-fitter
   ```

2. Start the development container in VS Code:
   - Open the project in VS Code
   - When prompted, click "Reopen in Container"
   - Or use the Command Palette: "Dev Containers: Reopen in Container" (shown as "Remote-Containers: Reopen in Container" in older versions of the extension)

3. Generate test data:
   ```bash
   cargo run --bin generate_test_data -- \
     --num-users 1000 \
     --num-items 5000 \
     --user-clusters 5 \
     --item-clusters 10 \
     --avg-interactions 20
   ```

4. Run matrix factorization:
   ```bash
   cargo run -- \
     --source-table user_interactions \
     --user-id-column user_id \
     --item-id-column item_id \
     --target-table item_embeddings \
     --factors 32
   ```

## Configuration

### Environment Variables

Create a `.env` file in the project root:

```env
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=postgres
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
```

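To illustrate how these five variables can be combined, here is a minimal sketch that assembles a key/value connection string from them. The function names and the fallback defaults (copied from the sample `.env` above) are assumptions for illustration; the application itself may configure its connection pool differently.

```rust
use std::env;

/// Builds a key/value PostgreSQL connection string from the POSTGRES_* variables.
/// `lookup` abstracts over the environment so the function is easy to test.
/// The fallback values mirror the sample `.env` file and are an assumption.
fn connection_string(lookup: impl Fn(&str) -> Option<String>) -> String {
    let get = |key: &str, default: &str| lookup(key).unwrap_or_else(|| default.to_string());
    format!(
        "host={} port={} dbname={} user={} password={}",
        get("POSTGRES_HOST", "localhost"),
        get("POSTGRES_PORT", "5432"),
        get("POSTGRES_DB", "postgres"),
        get("POSTGRES_USER", "postgres"),
        get("POSTGRES_PASSWORD", "postgres"),
    )
}

/// Convenience wrapper (hypothetical) that reads the real process environment.
fn connection_string_from_env() -> String {
    connection_string(|key| env::var(key).ok())
}
```

The sketch does not quote values, so a password containing spaces or quotes would need escaping before being placed in the string.
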
### Test Data Generator Arguments

- `--num-users`: Number of users to generate (default: 1000)
- `--num-items`: Number of items to generate (default: 5000)
- `--user-clusters`: Number of user clusters (default: 5)
- `--item-clusters`: Number of item clusters (default: 10)
- `--avg-interactions`: Average interactions per user (default: 20)
- `--interactions-table`: Source table name (default: "user_interactions")
- `--embeddings-table`: Target table name (default: "item_embeddings")

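The cluster arguments above suggest that users in the same cluster tend to interact with the same group of items. The following sketch shows one way such data could be produced using only the standard library. It is not the generator's actual algorithm: the round-robin cluster assignment, the 80% in-cluster ratio, the `seed` parameter, and the small xorshift generator are all assumptions made for illustration.

```rust
use std::collections::BTreeSet;

/// Tiny deterministic xorshift64 generator so the sketch needs no external crates.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }

    /// Integer in `0..n`; the slight modulo bias is acceptable for test data.
    fn below(&mut self, n: u64) -> u64 {
        self.next_u64() % n
    }
}

/// Generates unique (user_id, item_id) pairs. Users and items are assigned to
/// clusters round-robin by ID, and each user draws roughly 80% of its
/// interactions from one preferred item cluster (an assumed ratio).
fn generate_interactions(
    num_users: u32,
    num_items: u32,
    user_clusters: u32,
    item_clusters: u32,
    interactions_per_user: u32,
    seed: u64,
) -> BTreeSet<(u32, u32)> {
    assert!(user_clusters > 0 && item_clusters > 0 && num_items >= item_clusters);
    let mut rng = XorShift(seed.max(1)); // xorshift state must be non-zero
    let mut pairs = BTreeSet::new();
    for user_id in 0..num_users {
        let preferred = (user_id % user_clusters) % item_clusters;
        // Number of item IDs congruent to `preferred` modulo `item_clusters`.
        let slots = (num_items - preferred + item_clusters - 1) / item_clusters;
        for _ in 0..interactions_per_user {
            let item_id = if rng.below(10) < 8 {
                // In-cluster draw: pick one of the items in the preferred cluster.
                preferred + item_clusters * rng.below(u64::from(slots)) as u32
            } else {
                // Out-of-cluster noise: any item at all.
                rng.below(u64::from(num_items)) as u32
            };
            // The set drops duplicates, matching a (user_id, item_id) primary key.
            pairs.insert((user_id, item_id));
        }
    }
    pairs
}
```

Because duplicate pairs are collapsed, each user ends up with at most `interactions_per_user` rows, and the same seed always yields the same data set.
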
### Matrix Factorization Arguments

- `--source-table`: Name of the table containing user-item interactions
- `--user-id-column`: Name of the column containing user IDs (must be integer)
- `--item-id-column`: Name of the column containing item IDs (must be integer)
- `--target-table`: Name of the table where item embeddings will be saved
- `--factors`: Number of factors for matrix factorization (default: 8)
- `--batch-size`: Number of rows to load in each batch (default: 10000)

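Table and column names supplied on the command line cannot be passed as SQL bind parameters, so they have to be quoted before being spliced into a query. The sketch below shows one way the batch query could be built; the `LIMIT`/`OFFSET` paging strategy and all function names are assumptions, since this README does not describe how batches are actually fetched.

```rust
/// Quotes a PostgreSQL identifier: wrap it in double quotes and double any
/// embedded double quote.
fn quote_ident(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}

/// Builds the query for one batch. `$1` and `$2` are bind parameters for the
/// batch size (LIMIT) and the starting row (OFFSET). The ORDER BY keeps the
/// paging stable between batches.
fn batch_query(table: &str, user_col: &str, item_col: &str) -> String {
    let (t, u, i) = (quote_ident(table), quote_ident(user_col), quote_ident(item_col));
    format!("SELECT {u}, {i} FROM {t} ORDER BY {u}, {i} LIMIT $1 OFFSET $2")
}

/// Starting offsets of every batch needed to cover `total_rows` rows.
fn batch_offsets(total_rows: u64, batch_size: u64) -> Vec<u64> {
    assert!(batch_size > 0, "batch size must be positive");
    (0..total_rows).step_by(batch_size as usize).collect()
}
```

With the default batch size of 10000, a table of 25000 rows would be read in three batches starting at offsets 0, 10000, and 20000. Note that quoted identifiers are case-sensitive, and a schema-qualified name such as `public.user_interactions` would need each part quoted separately.
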
## Database Schema

### Input Table (user_interactions)
```sql
CREATE TABLE user_interactions (
    user_id INTEGER,
    item_id INTEGER,
    PRIMARY KEY (user_id, item_id)
);
```

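Each row of this table records that a user interacted with an item, with no rating attached. Matrix factorization code usually addresses rows and columns by dense zero-based indices, whereas the IDs in the table may be sparse. The sketch below shows one possible conversion, assuming every interaction gets the value `1.0`; the `IdIndex` type and `to_triples` function are hypothetical, and whether MF-Fitter remaps IDs this way is not stated here.

```rust
use std::collections::HashMap;

/// Maps arbitrary integer IDs to dense zero-based indices in first-seen order.
#[derive(Default)]
struct IdIndex {
    to_index: HashMap<i32, usize>,
    to_id: Vec<i32>, // to_id[index] recovers the original ID
}

impl IdIndex {
    /// Returns the dense index for `id`, assigning a new one on first sight.
    fn index_of(&mut self, id: i32) -> usize {
        if let Some(&idx) = self.to_index.get(&id) {
            return idx;
        }
        let idx = self.to_id.len();
        self.to_index.insert(id, idx);
        self.to_id.push(id);
        idx
    }
}

/// Converts (user_id, item_id) rows into (row, column, value) triples,
/// using 1.0 for every observed interaction (an assumed convention).
fn to_triples(rows: &[(i32, i32)]) -> (Vec<(usize, usize, f32)>, IdIndex, IdIndex) {
    let mut users = IdIndex::default();
    let mut items = IdIndex::default();
    let triples: Vec<(usize, usize, f32)> = rows
        .iter()
        .map(|&(u, i)| (users.index_of(u), items.index_of(i), 1.0))
        .collect();
    (triples, users, items)
}
```

Keeping the reverse mapping (`to_id`) matters because the learned item factors are indexed by column, and each one has to be written back under its original `item_id`.
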
### Output Table (item_embeddings)
```sql
CREATE TABLE item_embeddings (
    item_id INTEGER PRIMARY KEY,
    embedding FLOAT[]
);
```

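In PostgreSQL, `FLOAT` without a precision means `double precision`, so `FLOAT[]` holds 8-byte floats. Matrix factorization libraries commonly return 32-bit factors in one flat, row-major buffer, so the values need to be split per item and widened before insertion. The sketch below assumes that layout; the function name and the `item_ids` slice (original IDs in column order) are hypothetical.

```rust
/// Splits a flat row-major factor buffer into one `Vec<f64>` per item.
/// `item_ids[k]` is assumed to be the ID of the item whose factors occupy
/// `flat_factors[k * factors..(k + 1) * factors]`.
fn embeddings_per_item(
    item_ids: &[i32],
    flat_factors: &[f32],
    factors: usize,
) -> Vec<(i32, Vec<f64>)> {
    assert!(factors > 0, "factors must be positive");
    assert_eq!(
        flat_factors.len(),
        item_ids.len() * factors,
        "buffer size must equal items * factors"
    );
    item_ids
        .iter()
        .zip(flat_factors.chunks_exact(factors))
        // Widening f32 to f64 is lossless.
        .map(|(&id, chunk)| (id, chunk.iter().map(|&x| f64::from(x)).collect()))
        .collect()
}
```

Each resulting pair corresponds to one row of `item_embeddings`, with the vector length equal to the `--factors` value used for the run.
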
## Development

The project uses a VS Code devcontainer setup with:
- PostgreSQL 16 with pgvector extension
- Rust development environment
- Automatic environment configuration

To modify the PostgreSQL setup:
1. Edit `Dockerfile.postgres` for database customization
2. Edit `init-pgvector.sql` for initialization scripts
3. Rebuild the container: `docker-compose up -d --build`

## Dependencies

- Rust 2021 edition
- PostgreSQL 16
- libmf 0.3
- tokio for async I/O
- clap for CLI argument parsing
- deadpool-postgres for connection pooling

## License

This project is licensed under the MIT License - see the LICENSE file for details.