25 Commits

Author SHA1 Message Date
Dylan Knutson
0dadf2654c Update dependencies and enhance SQL formatting in fit_model
- Added new dependencies: `colored`, `lazy_static`, `sqlformat`, and `unicode_categories` to improve code readability and SQL formatting capabilities.
- Introduced a new module `format_sql` for better SQL query formatting in `fit_model.rs`.
- Updated `fit_model.rs` to utilize the new `format_sql` function for logging SQL commands, enhancing clarity in database operations.
- Adjusted the `Cargo.toml` and `Cargo.lock` files to reflect the new dependencies and their versions.

These changes improve the maintainability and readability of the code, particularly in the context of SQL operations.
2024-12-28 21:46:14 +00:00
Dylan Knutson
4651b96785 write to temp table and atomic swap with old table 2024-12-28 21:02:44 +00:00
Dylan Knutson
b255f40ac7 add progress bar 2024-12-28 20:51:11 +00:00
Dylan Knutson
b3ba58723c Enhance fit model functionality and argument handling
- Updated `fit_model_args.rs` to allow optional factors for matrix factorization and added an index name argument for index management.
- Modified `fit_model.rs` to handle index creation and dropping during data upsert, improving database interaction.
- Adjusted schema validation to infer vector dimensions and validate against specified factors.
- Enhanced `generate_test_data.rs` to create an IVFFlat index on the embeddings column.

These changes improve the flexibility and robustness of the fit model process, allowing for better management of database indices and more intuitive argument handling.
2024-12-28 20:37:12 +00:00
Dylan Knutson
5430fdd501 remove float[] array support, only use vector 2024-12-28 19:58:46 +00:00
Dylan Knutson
75e7a4538d Refactor Dockerfile and add fit_model functionality
- Updated the Dockerfile to rename the built binary from `mf-fitter` to `fit_model` for clarity.
- Introduced a new `fit_model_args.rs` file to define command-line arguments for the fit model process, including parameters for matrix factorization.
- Added `pg_types.rs` and `pgvector.rs` files to handle PostgreSQL type interactions and vector serialization/deserialization.
- Implemented the main logic for the fit model in `fit_model.rs`, including data loading, model training, and embedding saving.
- Enhanced `visualize_embeddings.rs` to load embeddings and clusters more efficiently.

These changes improve the organization and functionality of the model fitting process, making it more intuitive and maintainable.
2024-12-28 19:50:24 +00:00
Dylan Knutson
bc88c54cb0 write vector binary type to database 2024-12-28 19:04:43 +00:00
Dylan Knutson
2b1865f3d4 use COPY for exporting data into temp table 2024-12-28 18:32:18 +00:00
Dylan Knutson
c4e79a36f9 Add argument parsing for data loading configuration
- Introduced a new `args.rs` file to define command-line arguments for data loading parameters, including source and target table details, matrix factorization settings, and optional interaction limits.
- Refactored `main.rs` to utilize the new argument structure, enhancing code organization and readability.
- Removed the previous inline argument definitions, streamlining the main application logic.

These changes improve the configurability and maintainability of the data loading process.
2024-12-28 18:16:39 +00:00
Dylan Knutson
428ca89c92 use COPY for importing data 2024-12-28 17:55:56 +00:00
Dylan Knutson
857cbf5d1f add max interactions flag 2024-12-28 17:41:42 +00:00
Dylan Knutson
350c61c313 Refactor data loading and embedding saving process
- Updated `.cargo/config.toml` to optimize compilation flags for performance.
- Enhanced `main.rs` by:
  - Renaming user and item ID columns for clarity.
  - Adding validation functions to ensure the existence of tables and columns in the database schema.
  - Implementing immediate exit handling during data loading.
  - Modifying the `save_embeddings` function to accept item IDs for processing.
  - Improving error handling with context messages for database operations.

These changes improve code readability, robustness, and performance during data processing.
2024-12-28 06:42:28 +00:00
Dylan Knutson
c791203d1c dockerfile for building release app 2024-12-28 05:01:10 +00:00
Dylan Knutson
66165a7eee batch loading for computed rows 2024-12-28 04:40:09 +00:00
Dylan Knutson
9aece9c740 make libmf multithreading work 2024-12-28 04:19:00 +00:00
Dylan Knutson
2738b8469b cargo clippy 2024-12-28 03:46:30 +00:00
Dylan Knutson
6ebbd6aaa9 better visualization 2024-12-28 03:39:24 +00:00
Dylan Knutson
ab5f379b94 use cluster affinities 2024-12-28 03:32:38 +00:00
Dylan Knutson
9b4316e819 different way of giving clusters an x, y, z 2024-12-28 03:11:37 +00:00
Dylan Knutson
32a7292481 more fixes 2024-12-28 03:04:50 +00:00
Dylan Knutson
56b6604142 improve embedding visualization 2024-12-28 02:09:32 +00:00
Dylan Knutson
e21541af46 embeddings visualization 2024-12-28 01:59:11 +00:00
Dylan Knutson
61b9728fd8 better test data generation 2024-12-28 01:51:33 +00:00
Dylan Knutson
00b30ac285 cluster validation 2024-12-28 01:46:48 +00:00
Dylan Knutson
f7bb5b0cdd initial commit 2024-12-28 01:28:33 +00:00