initial commit

README.md (new file, 147 lines)

# MF-Fitter

A Rust application that performs matrix factorization on user-item interaction data from PostgreSQL using libmf, and saves the resulting item embeddings back to the database.

## Project Structure

```
.
├── .devcontainer/                 # Development container configuration
│   ├── .env                       # Container environment variables
│   └── docker-compose.yml         # Docker services configuration
├── src/
│   ├── main.rs                    # Main matrix factorization application
│   └── bin/
│       └── generate_test_data.rs  # Test data generator
├── Dockerfile.postgres            # PostgreSQL container with pgvector
├── init-pgvector.sql              # PostgreSQL initialization script
├── .env                           # Local environment variables
├── Cargo.toml                     # Rust project dependencies
└── README.md                      # This file
```

## Features

- Async I/O with tokio for efficient database operations
- Batch processing for handling large datasets
- Graceful shutdown with Ctrl+C
- Configurable number of factors and batch size
- Automatic creation of the target embedding table
- Configuration through environment variables
- Test data generator that produces clustered user-item interactions

## Prerequisites

- Docker and Docker Compose
- VS Code with the Dev Containers extension, formerly Remote - Containers (for development)
- Rust (if developing outside the container)

## Quick Start

1. Clone the repository:
   ```bash
   git clone <repository-url>
   cd mf-fitter
   ```

2. Start the development container in VS Code:
   - Open the project in VS Code
   - When prompted, click "Reopen in Container"
   - Or run "Dev Containers: Reopen in Container" from the Command Palette (listed as "Remote-Containers: Reopen in Container" in older versions of the extension)

3. Generate test data:
   ```bash
   cargo run --bin generate_test_data -- \
     --num-users 1000 \
     --num-items 5000 \
     --user-clusters 5 \
     --item-clusters 10 \
     --avg-interactions 20
   ```

4. Run matrix factorization:
   ```bash
   cargo run -- \
     --source-table user_interactions \
     --user-id-column user_id \
     --item-id-column item_id \
     --target-table item_embeddings \
     --factors 32
   ```

## Configuration

### Environment Variables

Create a `.env` file in the project root:

```env
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=postgres
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
```

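As a rough illustration of how these five variables might be turned into a connection string, here is a minimal std-only sketch. The `PgConfig` type, the `from_lookup` helper, and the fallback defaults are assumptions made for this example; the README does not say how `main.rs` actually reads its configuration or whether it applies defaults at all.

```rust
/// Hypothetical settings struct mirroring the five variables in the `.env` example.
#[derive(Debug, PartialEq)]
struct PgConfig {
    host: String,
    port: u16,
    db: String,
    user: String,
    password: String,
}

impl PgConfig {
    /// Build a config from any key lookup, so the same code works with
    /// `std::env::var` at runtime and with a stub in tests.
    /// Assumption: missing variables fall back to the values shown in `.env`.
    fn from_lookup<F>(lookup: F) -> Result<PgConfig, String>
    where
        F: Fn(&str) -> Option<String>,
    {
        let get = |key: &str, default: &str| -> String {
            lookup(key).unwrap_or_else(|| default.to_string())
        };
        let port_raw = get("POSTGRES_PORT", "5432");
        let port = port_raw
            .parse::<u16>()
            .map_err(|e| format!("invalid POSTGRES_PORT {:?}: {}", port_raw, e))?;
        Ok(PgConfig {
            host: get("POSTGRES_HOST", "localhost"),
            port,
            db: get("POSTGRES_DB", "postgres"),
            user: get("POSTGRES_USER", "postgres"),
            password: get("POSTGRES_PASSWORD", "postgres"),
        })
    }

    /// Keyword/value connection string in the libpq style.
    /// Values containing spaces or quotes would need escaping, which this sketch omits.
    fn connection_string(&self) -> String {
        format!(
            "host={} port={} dbname={} user={} password={}",
            self.host, self.port, self.db, self.user, self.password
        )
    }
}

fn main() {
    let config = PgConfig::from_lookup(|key| std::env::var(key).ok())
        .expect("invalid database configuration");
    let conn = config.connection_string();
    // Avoid printing the password; report only the target.
    println!("connecting to {}:{}/{} ({} chars)", config.host, config.port, config.db, conn.len());
}
```

Taking the lookup as a closure keeps the parsing logic independent of the process environment, which makes a bad `POSTGRES_PORT` easy to test for.
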
### Test Data Generator Arguments

- `--num-users`: Number of users to generate (default: 1000)
- `--num-items`: Number of items to generate (default: 5000)
- `--user-clusters`: Number of user clusters (default: 5)
- `--item-clusters`: Number of item clusters (default: 10)
- `--avg-interactions`: Average interactions per user (default: 20)
- `--interactions-table`: Source table name (default: "user_interactions")
- `--embeddings-table`: Target table name (default: "item_embeddings")

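The README does not describe how the generator uses the cluster arguments. The sketch below shows one way clustered interactions could be produced, purely as an illustration: the cluster assignment by modulo, the 80/20 split between preferred and random items, the fixed per-user count, and the `XorShift` helper are all assumptions of this example, not the behavior of `generate_test_data.rs`.

```rust
use std::collections::BTreeSet;

/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Integer in 0..n (n must be > 0); modulo bias is ignored in this sketch.
    fn below(&mut self, n: u64) -> u64 {
        self.next_u64() % n
    }
}

/// Hypothetical generator: user `u` belongs to cluster `u % user_clusters`,
/// which prefers item cluster `(u % user_clusters) % item_clusters`.
/// Roughly 80% of draws come from the preferred item cluster, the rest are random.
/// Returns unique pairs, matching a PRIMARY KEY (user_id, item_id) constraint.
fn generate(
    num_users: u64,
    num_items: u64,
    user_clusters: u64,
    item_clusters: u64,
    per_user: u64,
    seed: u64,
) -> BTreeSet<(u64, u64)> {
    assert!(user_clusters > 0 && item_clusters > 0 && num_items >= item_clusters);
    let mut rng = XorShift(seed.max(1)); // xorshift state must be non-zero
    let mut pairs = BTreeSet::new();
    for user in 0..num_users {
        let preferred = (user % user_clusters) % item_clusters;
        // Number of items i in 0..num_items with i % item_clusters == preferred.
        let slots = (num_items - preferred + item_clusters - 1) / item_clusters;
        for _ in 0..per_user {
            let item = if rng.below(10) < 8 {
                preferred + item_clusters * rng.below(slots)
            } else {
                rng.below(num_items)
            };
            pairs.insert((user, item)); // duplicates collapse in the set
        }
    }
    pairs
}

fn main() {
    let pairs = generate(1000, 5000, 5, 10, 20, 42);
    println!("generated {} unique interactions", pairs.len());
}
```

Because duplicate draws collapse in the set, the number of rows per user can fall slightly below the requested count.
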
### Matrix Factorization Arguments

- `--source-table`: Name of the table containing user-item interactions
- `--user-id-column`: Name of the column containing user IDs (must be integer)
- `--item-id-column`: Name of the column containing item IDs (must be integer)
- `--target-table`: Name of the table where item embeddings will be saved
- `--factors`: Number of factors for matrix factorization (default: 8)
- `--batch-size`: Number of rows to load in each batch (default: 10000)

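Since the table and column names arrive as command-line strings, they cannot be bound as query parameters and have to be checked before being spliced into SQL. The following sketch shows one way the arguments above could be combined into a per-batch query. The `batch_query` and `is_safe_identifier` helpers and the `LIMIT`/`OFFSET` paging strategy are assumptions of this example; the application may load batches differently, for instance with a cursor.

```rust
/// Accept only plain identifiers: ASCII letters, digits, and underscores,
/// not starting with a digit. Anything else is rejected rather than quoted.
fn is_safe_identifier(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
        _ => return false,
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Build the SELECT for one batch (hypothetical helper).
/// A stable ORDER BY keeps pages from overlapping between calls.
fn batch_query(
    table: &str,
    user_col: &str,
    item_col: &str,
    batch_size: usize,
    batch_index: usize,
) -> Result<String, String> {
    for ident in [table, user_col, item_col] {
        if !is_safe_identifier(ident) {
            return Err(format!("unsafe identifier: {:?}", ident));
        }
    }
    let offset = batch_size * batch_index;
    Ok(format!(
        "SELECT {u}, {i} FROM {t} ORDER BY {u}, {i} LIMIT {n} OFFSET {o}",
        u = user_col,
        i = item_col,
        t = table,
        n = batch_size,
        o = offset
    ))
}

fn main() {
    // Third batch (index 2) with the default batch size of 10000.
    let sql = batch_query("user_interactions", "user_id", "item_id", 10_000, 2)
        .expect("identifiers are valid");
    println!("{}", sql);
}
```

`OFFSET` paging rescans skipped rows on every call, so for very large tables a keyset condition or a server-side cursor would usually scale better.
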
## Database Schema

### Input Table (user_interactions)

```sql
CREATE TABLE user_interactions (
    user_id INTEGER,
    item_id INTEGER,
    PRIMARY KEY (user_id, item_id)
);
```

### Output Table (item_embeddings)

```sql
CREATE TABLE item_embeddings (
    item_id INTEGER PRIMARY KEY,
    embedding FLOAT[]
);
```

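In PostgreSQL, `FLOAT` without a precision means `double precision`, so the `embedding` column holds an array of 64-bit floats. As an illustration of what one stored row looks like, the sketch below formats a factor vector as a PostgreSQL array literal. The `to_pg_array_literal` helper is hypothetical, and the `f32` input type is an assumption; a real driver would normally bind the vector as a typed parameter instead of building text.

```rust
/// Format an embedding as a PostgreSQL array literal such as `{0.5,-1.25,2}`.
/// Values are widened to f64 to match the `double precision` element type.
/// Non-finite values are rejected here to keep the sketch simple.
fn to_pg_array_literal(embedding: &[f32]) -> Result<String, String> {
    let mut parts = Vec::with_capacity(embedding.len());
    for (index, value) in embedding.iter().enumerate() {
        if !value.is_finite() {
            return Err(format!("non-finite value at index {}", index));
        }
        parts.push(f64::from(*value).to_string());
    }
    // `{{` and `}}` are escaped braces in format strings.
    Ok(format!("{{{}}}", parts.join(",")))
}

fn main() {
    let literal = to_pg_array_literal(&[0.5, -1.25, 2.0]).expect("all values are finite");
    println!("{}", literal); // prints {0.5,-1.25,2}
}
```

With `--factors 32`, each row's array would contain 32 elements.
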
## Development

The project uses a VS Code devcontainer setup with:

- PostgreSQL 16 with pgvector extension
- Rust development environment
- Automatic environment configuration

To modify the PostgreSQL setup:

1. Edit `Dockerfile.postgres` for database customization
2. Edit `init-pgvector.sql` for initialization scripts
3. Rebuild the container: `docker-compose up -d --build`

## Dependencies

- Rust 2021 edition
- PostgreSQL 16
- libmf 0.3
- tokio for async I/O
- clap for CLI argument parsing
- deadpool-postgres for connection pooling

## License

This project is licensed under the MIT License - see the LICENSE file for details.