initial commit

Dylan Knutson
2024-12-28 01:28:33 +00:00
commit f7bb5b0cdd
15 changed files with 2061 additions and 0 deletions

README.md (new file, 147 lines)

@@ -0,0 +1,147 @@
# MF-Fitter
A Rust application that reads user-item interaction data from PostgreSQL, performs matrix factorization on it with libmf, and saves the resulting item embeddings back to the database.
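To make "item embeddings" concrete, here is a minimal, dependency-free sketch of matrix factorization trained with plain SGD on squared error. It is not libmf's solver and does not reflect how this project calls libmf; the function names, learning rate, regularization, and the deterministic initializer are all assumptions chosen for illustration. Each observed interaction is treated as a target value of 1.0, and the score for a user-item pair is the dot product of their factor vectors.

```rust
// Illustrative only: plain SGD on squared error, not libmf's solver.
// All names and hyperparameters here are hypothetical.

/// Dot product of two equal-length factor vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Mean squared error over observed (user, item, value) triples.
fn mse(p: &[Vec<f32>], q: &[Vec<f32>], data: &[(usize, usize, f32)]) -> f32 {
    let total: f32 = data
        .iter()
        .map(|&(u, i, r)| (r - dot(&p[u], &q[i])).powi(2))
        .sum();
    total / data.len() as f32
}

/// Deterministic initial factors in [0.1, 0.5), from a simple LCG.
fn init(rows: usize, factors: usize, seed: u32) -> Vec<Vec<f32>> {
    let mut state = seed;
    let mut out = Vec::with_capacity(rows);
    for _ in 0..rows {
        let mut row = Vec::with_capacity(factors);
        for _ in 0..factors {
            state = state.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
            row.push(0.1 + 0.4 * ((state >> 8) as f32 / (1u32 << 24) as f32));
        }
        out.push(row);
    }
    out
}

/// One SGD pass: nudge user and item factors toward each observed value.
fn sgd_epoch(
    p: &mut [Vec<f32>],
    q: &mut [Vec<f32>],
    data: &[(usize, usize, f32)],
    lr: f32,
    reg: f32,
) {
    for &(u, i, r) in data {
        let err = r - dot(&p[u], &q[i]);
        for k in 0..p[u].len() {
            let (pu, qi) = (p[u][k], q[i][k]);
            p[u][k] += lr * (err * qi - reg * pu);
            q[i][k] += lr * (err * pu - reg * qi);
        }
    }
}

/// Returns (user factors, item factors); the item factors are the embeddings.
fn train(
    data: &[(usize, usize, f32)],
    users: usize,
    items: usize,
    factors: usize,
    epochs: usize,
) -> (Vec<Vec<f32>>, Vec<Vec<f32>>) {
    let mut p = init(users, factors, 1);
    let mut q = init(items, factors, 2);
    for _ in 0..epochs {
        sgd_epoch(&mut p, &mut q, data, 0.05, 0.01);
    }
    (p, q)
}
```

Calling `train(&data, num_users, num_items, 8, 200)` returns one vector of length 8 per item; those per-item vectors play the role of the embeddings this tool stores.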
## Project Structure
```
.
├── .devcontainer/             # Development container configuration
│   ├── .env                   # Container environment variables
│   └── docker-compose.yml     # Docker services configuration
├── src/
│   ├── main.rs                # Main matrix factorization application
│   └── bin/
│       └── generate_test_data.rs  # Test data generator
├── Dockerfile.postgres        # PostgreSQL container with pgvector
├── init-pgvector.sql          # PostgreSQL initialization script
├── .env                       # Local environment variables
├── Cargo.toml                 # Rust project dependencies
└── README.md                  # This file
```
## Features
- Async I/O with tokio for efficient database operations
- Batch processing for handling large datasets
- Graceful shutdown on Ctrl+C
- Configurable number of factors and batch size
- Automatic creation of the target embedding table
- Configuration via environment variables
- Test data generation with clustered users and items
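The batch-processing and graceful-shutdown features can be pictured together: work proceeds one batch at a time, and a shared stop flag is checked between batches. The sketch below uses only the standard library and assumes a hypothetical `run_batches` helper; how the application actually wires Ctrl+C to such a flag (for example through tokio's signal handling) is not shown in this README and is left out here.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Splits `total_rows` into (offset, len) windows of at most `batch_size` rows.
fn batch_windows(total_rows: usize, batch_size: usize) -> Vec<(usize, usize)> {
    assert!(batch_size > 0, "batch_size must be positive");
    (0..total_rows)
        .step_by(batch_size)
        .map(|offset| (offset, batch_size.min(total_rows - offset)))
        .collect()
}

/// Hypothetical driver: handles batches until all are done or `stop` is set.
/// Returns the number of rows handled before stopping.
fn run_batches(
    total_rows: usize,
    batch_size: usize,
    stop: &AtomicBool,
    mut handle: impl FnMut(usize, usize),
) -> usize {
    let mut done = 0;
    for (offset, len) in batch_windows(total_rows, batch_size) {
        // Check the flag between batches so an in-flight batch always finishes.
        if stop.load(Ordering::SeqCst) {
            break;
        }
        handle(offset, len);
        done += len;
    }
    done
}
```

Checking the flag only at batch boundaries means a shutdown request never interrupts a half-loaded batch, at the cost of waiting for the current batch to complete.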
## Prerequisites
- Docker and Docker Compose
- VS Code with the Dev Containers extension, formerly Remote - Containers (for development)
- Rust toolchain (only if developing outside the container)
## Quick Start
1. Clone the repository:
```bash
git clone <repository-url>
cd mf-fitter
```
2. Start the development container in VS Code:
   - Open the project in VS Code
   - When prompted, click "Reopen in Container"
   - Or run "Dev Containers: Reopen in Container" from the Command Palette (older versions of the extension label it "Remote-Containers: Reopen in Container")
3. Generate test data:
```bash
cargo run --bin generate_test_data -- \
  --num-users 1000 \
  --num-items 5000 \
  --user-clusters 5 \
  --item-clusters 10 \
  --avg-interactions 20
```
4. Run matrix factorization:
```bash
cargo run -- \
  --source-table user_interactions \
  --user-id-column user_id \
  --item-id-column item_id \
  --target-table item_embeddings \
  --factors 32
```
## Configuration
### Environment Variables
Create a `.env` file in the project root:
```env
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=postgres
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
```
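As a rough illustration of how these variables might be turned into a connection string, the sketch below builds a libpq-style keyword/value string. The helper names are hypothetical, and falling back to the values shown above when a variable is unset is an assumption, not documented behavior of this project. The lookup is passed in as a closure so the logic can be exercised without touching the real environment.

```rust
use std::env;

/// Builds a keyword/value connection string from a variable lookup.
/// Fallback defaults mirror the sample `.env` above (an assumption).
fn connection_string_from(get: impl Fn(&str) -> Option<String>) -> String {
    let var = |key: &str, default: &str| get(key).unwrap_or_else(|| default.to_string());
    format!(
        "host={} port={} dbname={} user={} password={}",
        var("POSTGRES_HOST", "localhost"),
        var("POSTGRES_PORT", "5432"),
        var("POSTGRES_DB", "postgres"),
        var("POSTGRES_USER", "postgres"),
        var("POSTGRES_PASSWORD", "postgres"),
    )
}

/// Same as above, reading from the process environment.
fn connection_string() -> String {
    connection_string_from(|key| env::var(key).ok())
}
```

This simple version does not quote values, so a password containing spaces or quotes would need escaping before being placed in the string.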
### Test Data Generator Arguments
- `--num-users`: Number of users to generate (default: 1000)
- `--num-items`: Number of items to generate (default: 5000)
- `--user-clusters`: Number of user clusters (default: 5)
- `--item-clusters`: Number of item clusters (default: 10)
- `--avg-interactions`: Average interactions per user (default: 20)
- `--interactions-table`: Source table name (default: "user_interactions")
- `--embeddings-table`: Target table name (default: "item_embeddings")
### Matrix Factorization Arguments
- `--source-table`: Name of the table containing user-item interactions
- `--user-id-column`: Name of the column containing user IDs (must be an integer column)
- `--item-id-column`: Name of the column containing item IDs (must be an integer column)
- `--target-table`: Name of the table where item embeddings will be saved
- `--factors`: Number of factors for matrix factorization (default: 8)
- `--batch-size`: Number of rows to load in each batch (default: 10000)
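Because table and column names arrive as command-line arguments, they cannot be bound as ordinary query parameters and have to be spliced into the SQL text. The sketch below shows one way that could look, with hypothetical helpers `quote_ident` and `batch_query`; the actual query this tool issues, and whether it pages with `LIMIT`/`OFFSET` or a cursor, is not stated in the README.

```rust
/// Quotes a PostgreSQL identifier, doubling any embedded double quotes.
fn quote_ident(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}

/// Builds a query for one batch of (user, item) pairs.
/// A stable ORDER BY keeps OFFSET paging consistent between batches.
fn batch_query(
    table: &str,
    user_col: &str,
    item_col: &str,
    batch_size: usize,
    offset: usize,
) -> String {
    format!(
        "SELECT {u}, {i} FROM {t} ORDER BY {u}, {i} LIMIT {limit} OFFSET {offset}",
        u = quote_ident(user_col),
        i = quote_ident(item_col),
        t = quote_ident(table),
        limit = batch_size,
        offset = offset,
    )
}
```

Quoting the identifiers keeps an unusual table name from breaking the statement; it also makes the names case-sensitive, which matters if the table was created with unquoted mixed-case names.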
## Database Schema
### Input Table (user_interactions)
```sql
CREATE TABLE user_interactions (
    user_id INTEGER,
    item_id INTEGER,
    PRIMARY KEY (user_id, item_id)
);
```
### Output Table (item_embeddings)
```sql
CREATE TABLE item_embeddings (
    item_id INTEGER PRIMARY KEY,
    embedding FLOAT[]
);
```
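For orientation, a row in this table could be written with an upsert keyed on `item_id`. The sketch below shows a hypothetical statement and a helper that renders an embedding as PostgreSQL array literal text; the application may instead bind the vector directly through its database driver, and the exact statement it runs is not given in this README.

```rust
/// Hypothetical upsert for one embedding row; `$2` is array literal text.
const UPSERT_SQL: &str = "INSERT INTO item_embeddings (item_id, embedding) \
VALUES ($1, $2::float8[]) \
ON CONFLICT (item_id) DO UPDATE SET embedding = EXCLUDED.embedding";

/// Renders finite floats as a PostgreSQL array literal such as `{0.5,-2.25}`.
fn array_literal(embedding: &[f32]) -> String {
    let parts: Vec<String> = embedding.iter().map(|v| v.to_string()).collect();
    format!("{{{}}}", parts.join(","))
}
```

Using `ON CONFLICT ... DO UPDATE` lets a rerun overwrite earlier embeddings for the same item instead of failing on the primary key.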
## Development
The project uses a VS Code devcontainer setup with:
- PostgreSQL 16 with pgvector extension
- Rust development environment
- Automatic environment configuration
To modify the PostgreSQL setup:
1. Edit `Dockerfile.postgres` for database customization
2. Edit `init-pgvector.sql` for initialization scripts
3. Rebuild the container from the `.devcontainer/` directory, where `docker-compose.yml` lives: `docker-compose up -d --build`
## Dependencies
- Rust 2021 edition
- PostgreSQL 16
- libmf 0.3
- tokio for async I/O
- clap for CLI argument parsing
- deadpool-postgres for connection pooling
## License
This project is licensed under the MIT License - see the LICENSE file for details.