Initial Commit of the PDM project (ready for DWS migration)

2026-04-20 08:42:38 -05:00
commit dda7b664e7
2721 changed files with 442772 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,469 @@
+# PDM Vault Data Migration Tool
+
+A collection of Python scripts for migrating and managing SOLIDWORKS PDM
+Professional vault data. The project began as a pair of SQL-to-SQL migration
+scripts (for folder and file variable values) and has grown to include a
+suite of PDM API helpers for batch workflow transitions, Copy Tree exports,
+and interactive SQL tasks.
+
+## Overview
+
+The project is split into two layers:
+
+1. **Top-level migration scripts** — one-shot SQL migrations between source
+   and target PDM databases (`migrate_folderdata.py`, `migrate_filedata.py`,
+   `rollback_filedata.py`), plus verification utilities
+   (`check_var_clashing.py`, `check_paths.py`).
+2. **`helpers/` toolkit** — a newer set of scripts for live vault operations
+   and ad-hoc SQL work:
+   - Batch workflow state transitions via the PDM COM API
+   - Batch "Copy Tree" exports
+   - Interactive SQL helper with named query files and preview-and-confirm
+     on every write
+
+## Prerequisites
+
+- Python 3.x (3.12 tested)
+- SQL Server access to both source and target databases (ODBC Driver 17+)
+- SOLIDWORKS PDM Professional client (for `helpers/batch_*` scripts only —
+  these use the local `ConisioLib.EdmVault` COM component)
+- Required packages (see `requirements.txt`):
+  - `pyodbc` — SQL Server connections
+  - `pywin32` — PDM COM API via `win32com.client` / `pythoncom`
+  - `comtypes` — vtable-level COM calls (used for `ChangeState3`)
+
+Install dependencies:
+```bash
+pip install -r requirements.txt
+```
+
+## Project Structure
+
+```
+data_migration_project/
+├── config.json                 # Your real config (DB credentials, mappings)
+├── config.json.template        # Template for new installs — copy to config.json
+├── db_utils.py                 # Shared SQL Server connection wrapper
+├── migrate_folderdata.py       # Folder/project variable value migration
+├── migrate_filedata.py         # File variable value migration (latest revision)
+├── rollback_filedata.py        # Rolls back a filedata migration
+├── check_var_clashing.py       # Finds variable name conflicts before migration
+├── check_paths.py              # Verifies folder path mapping
+├── requirements.txt            # Python dependencies
+├── BATCH_NOTES.md              # Deep-dive on PDM COM ChangeState3 internals
+├── README.md                   # This file
+│
+├── helpers/
+│   ├── batch_workflows_paths.py   # Batch workflow state transitions
+│   ├── batch_copy_tree.py         # Batch Copy Tree export
+│   ├── db_helper.py               # Interactive SQL helper + tasks
+│   ├── test_batch_api.py          # Dev-time prototype for PDM COM bridging
+│   └── queries/                   # Reusable named SQL queries (.sql files)
+│
+├── documentation/              # PDM API reference (.chm files)
+└── logs/                       # Migration log files
+```
+
+## Configuration Setup
+
+### 1. Create `config.json`
+
+Copy the template and fill in your real values:
+
+```bash
+cp config.json.template config.json
+```
+
+The project supports both **SQL Authentication** and **Windows Authentication**:
+
+```json
+{
+  "source_db": {
+    "driver": "{ODBC Driver 17 for SQL Server}",
+    "server": "your-server",
+    "database": "source-database",
+    "username": "sql-user",
+    "password": "sql-password",
+    "trusted_connection": false
+  },
+  "target_db": {
+    "driver": "{ODBC Driver 17 for SQL Server}",
+    "server": "your-server",
+    "database": "target-database",
+    "trusted_connection": true
+  }
+}
+```
+
+- Set `trusted_connection: true` to use Windows Authentication (ignores
+  username/password).
+- Set `trusted_connection: false` and provide `username`/`password` for SQL
+  Server Authentication.
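
As a rough illustration of how the two authentication modes above could translate into an ODBC connection string, here is a minimal sketch. The function name `build_connection_string` is hypothetical and is not taken from `db_utils.py`; only the config keys come from the example above.

```python
def build_connection_string(cfg: dict) -> str:
    """Build an ODBC connection string from one config.json section.

    Hypothetical helper for illustration; the real db_utils.py may differ.
    """
    parts = [
        f"DRIVER={cfg['driver']}",
        f"SERVER={cfg['server']}",
        f"DATABASE={cfg['database']}",
    ]
    if cfg.get("trusted_connection"):
        # Windows Authentication: username/password are not used.
        parts.append("Trusted_Connection=yes")
    else:
        # SQL Server Authentication: both fields are required.
        parts.append(f"UID={cfg['username']}")
        parts.append(f"PWD={cfg['password']}")
    return ";".join(parts) + ";"


target_cfg = {
    "driver": "{ODBC Driver 17 for SQL Server}",
    "server": "your-server",
    "database": "target-database",
    "trusted_connection": True,
}
source_cfg = {
    "driver": "{ODBC Driver 17 for SQL Server}",
    "server": "your-server",
    "database": "source-database",
    "username": "sql-user",
    "password": "sql-password",
    "trusted_connection": False,
}

print(build_connection_string(target_cfg))
print(build_connection_string(source_cfg))
```

The resulting string can be passed to `pyodbc.connect(...)`.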
+
+### 2. Path Mapping (for folder data)
+
+```json
+{
+  "path_mapping": {
+    "target_root_folder": "DWS",
+    "case_sensitive": false
+  }
+}
+```
+
+- `target_root_folder`: Root folder name in your target vault. This name is
+  prepended to every source folder path.
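
The prefixing rule can be sketched as below. This assumes folder paths are backslash-separated with leading and trailing separators, and that `case_sensitive` controls how mapped paths are compared; both are assumptions, and the function names are hypothetical.

```python
def map_folder_path(source_path: str, target_root: str) -> str:
    """Prepend target_root to a backslash-separated source folder path.

    Assumes paths look like "\\Parts\\Widgets\\" (hypothetical format).
    """
    trimmed = source_path.strip("\\")
    if not trimmed:
        # The source vault root maps to the target root folder itself.
        return f"\\{target_root}\\"
    return f"\\{target_root}\\{trimmed}\\"


def paths_equal(a: str, b: str, case_sensitive: bool = False) -> bool:
    """Compare two folder paths, ignoring case unless case_sensitive is set."""
    if case_sensitive:
        return a == b
    return a.casefold() == b.casefold()


mapped = map_folder_path("\\Parts\\Widgets\\", "DWS")
print(mapped)
print(paths_equal(mapped, "\\dws\\parts\\widgets\\"))
```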
+
+### 3. Migration Settings
+
+```json
+{
+  "migration": {
+    "duplicate_handling": "ignore",
+    "batch_size": 500,
+    "commit_interval": 10,
+    "document_status_batch_size": 5000
+  }
+}
+```
+
+### 4. Configuration Mapping Overrides (file data migration only)
+
+```json
+{
+  "configuration_mapping_overrides": {
+    "165": 11250,
+    "167": 11359
+  }
+}
+```
+
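+Required when duplicate configuration names exist in the target database.
+Format: `"source_id": target_id`, where the key is the source
+ConfigurationID (written as a JSON string) and the value is the target
+ConfigurationID.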
+
+**To find duplicates:**
+```sql
+SELECT ConfigurationID, ConfigurationName
+FROM DocumentConfiguration
+WHERE ConfigurationName IN (
+    SELECT ConfigurationName
+    FROM DocumentConfiguration
+    GROUP BY ConfigurationName
+    HAVING COUNT(*) > 1
+)
+ORDER BY ConfigurationName, ConfigurationID;
+```
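
One plausible way the overrides could interact with name-based mapping is sketched here: an override always wins, and a name match is only trusted when it is unambiguous. The function and its arguments are hypothetical and not taken from `migrate_filedata.py`.

```python
def resolve_configuration_id(source_id, source_name, target_ids_by_name, overrides):
    """Return the target ConfigurationID for a source configuration, or None.

    target_ids_by_name maps ConfigurationName -> list of target IDs
    (hypothetical structure for this sketch).
    """
    # JSON object keys are always strings, so look up by str(source_id).
    override = overrides.get(str(source_id))
    if override is not None:
        return int(override)
    candidates = target_ids_by_name.get(source_name, [])
    if len(candidates) == 1:
        return candidates[0]
    # No match, or duplicate names with no override: leave unmapped.
    return None


overrides = {"165": 11250, "167": 11359}
targets = {"Default": [11250, 11251], "Flat Pattern": [42]}

print(resolve_configuration_id(165, "Default", targets, overrides))
print(resolve_configuration_id(200, "Default", targets, overrides))
print(resolve_configuration_id(300, "Flat Pattern", targets, overrides))
```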
+
+---
+
+## Top-Level Migration Scripts
+
+### Folder Data Migration
+
+Migrates variable values for folders/projects (DocumentID = 1).
+
+```bash
+python migrate_folderdata.py
+```
+
+**What it does:**
+- Maps ProjectIDs based on folder paths
+- Maps VariableIDs based on variable names
+- Migrates all revisions of folder-level variable values
+- Validates the migration
+- Creates mapping CSV files for review
+
+**Output files:**
+- `mapping_projects_{timestamp}.csv`
+- `mapping_variables_{timestamp}.csv`
+- `folderdata_migration_{timestamp}.log`
+- `validation_missing_folderdata_{timestamp}.csv` (if issues)
+
+---
+
+### File Data Migration
+
+Migrates variable values for files (ProjectID = 2, DocumentID != 1).
+
+**IMPORTANT:** Only migrates the **latest revision** of each variable for
+each file configuration.
+
+```bash
+python migrate_filedata.py
+```
+
+**What it does:**
+- Maps VariableIDs by name, DocumentIDs by full file path, ConfigurationIDs
+  (with manual overrides from `config.json` if specified)
+- Fetches only the latest revision per VariableID+DocumentID+ConfigurationID
+- Inserts all records with `RevisionNo = 1`
+- Validates the migration and emits mapping CSVs
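
The "latest revision only, inserted as `RevisionNo = 1`" rule can be illustrated in plain Python. This is a sketch of the selection logic only; the actual script presumably does this in SQL, and the row type below is hypothetical.

```python
from typing import NamedTuple


class VariableValueRow(NamedTuple):
    """Hypothetical in-memory shape of a variable value row."""
    variable_id: int
    document_id: int
    configuration_id: int
    revision_no: int
    value: str


def latest_revisions(rows):
    """Keep the highest revision per (variable, document, configuration)."""
    latest = {}
    for row in rows:
        key = (row.variable_id, row.document_id, row.configuration_id)
        if key not in latest or row.revision_no > latest[key].revision_no:
            latest[key] = row
    # Every surviving row is written to the target with RevisionNo = 1.
    return [row._replace(revision_no=1) for row in latest.values()]


rows = [
    VariableValueRow(57, 1001, 3, 1, "draft"),
    VariableValueRow(57, 1001, 3, 3, "released"),
    VariableValueRow(57, 1001, 3, 2, "review"),
    VariableValueRow(50, 1001, 3, 1, "PN-001"),
]
for row in latest_revisions(rows):
    print(row)
```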
+
+**You will be prompted** to confirm configuration mapping overrides before
+the migration runs — review carefully.
+
+**Output files:**
+- `mapping_variables_filedata_{timestamp}.csv`
+- `mapping_documents_filedata_{timestamp}.csv`
+- `filedata_migration_{timestamp}.log`
+- `validation_missing_filedata_{timestamp}.csv` (if issues)
+- Progress files (auto-deleted on success)
+
+---
+
+### Rollback File Data Migration
+
+```bash
+python rollback_filedata.py mapping_documents_filedata_YYYYMMDD_HHMMSS.csv
+```
+
+Reads the document mapping CSV from a previous migration and deletes all
+VariableValue records for those documents from the target database.
+**Shows a preview and prompts for confirmation before deleting.**
+
+> **WARNING:** This permanently deletes data from the target database.
+> Always back up first.
+
+---
+
+## Helper Scripts (`helpers/`)
+
+The helpers are live-vault tools that talk to PDM directly via the COM API,
+plus an interactive SQL runner. They are independent of the top-level
+migration scripts and can be used any time.
+
+### `batch_workflows_paths.py` — Batch Workflow Transitions
+
+Drives `IEdmFile13::ChangeState3` against hundreds or thousands of files at
+once, transitioning each through a named workflow transition. Implements
+escalating-backoff retries and vault reconnect to handle PDM's in-process
+DLL state corruption on large batches.
+
+```bash
+python helpers/batch_workflows_paths.py -v "Drilling_Test" -c files.csv -t "AA"
+```
+
+**Options:**
+- `-v, --vault` — PDM vault name
+- `-c, --csv` — Path to a text/CSV file with one full vault path per line
+  (e.g. `C:\PDM\Drilling_Test\DWS\Parts\widget.sldprt`)
+- `-t, --transition` — Name of the workflow transition (e.g. `"AA"`)
+- `--comment` — Optional transition comment
+- `-u, --username` — PDM username (prompts if omitted)
+
+**Output files:**
+- `batch_workflow_paths_{timestamp}.log` — detailed log
+- `failed_transitions_{timestamp}.txt` — real failures worth retrying
+- `not_available_{timestamp}.txt` — files whose transition wasn't valid
+  (typically already in the target state from a prior run)
+
+For implementation details on the restricted `ChangeState3` COM method and
+why it requires ctypes/comtypes vtable access, see
+[BATCH_NOTES.md](BATCH_NOTES.md).
+
+---
+
+### `batch_copy_tree.py` — Batch Copy Tree Export
+
+Reads part numbers from a CSV, runs PDM's Copy Tree function for each, and
+exports each part's file tree to its own subfolder.
+
+```bash
+python helpers/batch_copy_tree.py -c parts.csv -o "C:\Temp\Output" --vault "Drilling_Test"
+```
+
+---
+
+### `db_helper.py` — Interactive SQL Helper
+
+Runs SELECT queries, multi-step tasks, and confirmed INSERTs against either
+database from `config.json`. Queries are stored as `.sql` files in
+`helpers/queries/` and referenced by name.
+
+**List saved queries:**
+```bash
+python helpers/db_helper.py --list-queries
+```
+
+**Run a saved query by name:**
+```bash
+python helpers/db_helper.py --db target_db --query get_var47
+```
+
+**Run raw SQL (anything with a space in it is treated as a literal query):**
+```bash
+python helpers/db_helper.py --db target_db --query "SELECT TOP 10 * FROM Documents"
+```
+
+**Run a predefined task:**
+```bash
+python helpers/db_helper.py --db target_db --task copy_57_to_50 --dry-run
+python helpers/db_helper.py --db target_db --task copy_57_to_50
+```
+
+#### Safety features
+
+- Every INSERT or UPDATE goes through `preview_and_confirm` — you see the
+  SQL, the row count, and a sample of the data and must type `y` before it
+  executes.
+- `--dry-run` shows the preview but skips execution entirely.
+- All writes run inside a transaction. On any per-row error you're asked
+  whether to commit or roll back.
+- Every query, parameter set, and decision is logged to
+  `db_helper_{timestamp}.log`.
+
+#### Saved SQL Queries
+
+Drop a `.sql` file into `helpers/queries/` and it becomes callable by its
+filename (without extension). Leave a comment on the first line for an
+inline description — it shows up in `--list-queries`.
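
The naming convention above could be implemented roughly as follows. This is a sketch, not the code in `db_helper.py`; it assumes the description is a `--` SQL comment on the first line, and `list_queries` is a hypothetical name.

```python
from pathlib import Path


def list_queries(query_dir):
    """Map each query name (filename without .sql) to its description."""
    result = {}
    for sql_file in sorted(Path(query_dir).glob("*.sql")):
        lines = sql_file.read_text(encoding="utf-8").splitlines()
        first_line = lines[0].strip() if lines else ""
        # Assumption: the description is a "--" comment on the first line.
        if first_line.startswith("--"):
            description = first_line[2:].strip()
        else:
            description = ""
        result[sql_file.stem] = description
    return result


if __name__ == "__main__":
    for name, description in list_queries("helpers/queries").items():
        print(f"{name:30} {description}")
```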
+
+Current queries:
+- `DWS_GET_VV-57.sql` — Documents in DWS paths that have VariableID=57
+- `DWS_VV-57_FullList.sql` — Full VariableValue rows for VV-57 in DWS paths
+- `Get_All_VV_Per_DocID.sql` — All distinct VariableIDs for a given
+  DocumentID (parameterized with `?`)
+- `INSERT_VV50_Copy.sql` — Inserts a VV-50 copy of a VV-57 row
+
+#### Tasks
+
+Tasks are Python functions in `db_helper.py` that chain multiple queries
+and transforms together — e.g. run a SELECT, loop the results, run a
+second parameterized SELECT per row, validate, then INSERT filtered rows
+with confirmation.
+
+Each task is registered in `TASK_REGISTRY` near the bottom of the file.
+Current tasks:
+
+| Task | Purpose |
+|------|---------|
+| `check_vv50` | For every doc with VV-57, check whether it also has VV-50. Writes `has_vv50_{timestamp}.txt`. |
+| `copy_57_to_50` | Insert VV-50 rows mirroring existing VV-57 rows, skipping any DocumentIDs already in a `has_vv50_*.txt` file. |
+| `copy_with_new_id` | Example/template task — copy rows with a transformed ID. |
+
+**Adding a new task:** write a function `def task_foo(db, args): ...` and
+add it to `TASK_REGISTRY`. The building blocks `run_select`, `load_query`,
+`preview_and_confirm`, and `run_insert` are all at the top of the file.
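
A minimal sketch of the registration pattern is shown below. It assumes `TASK_REGISTRY` is a plain dict from task name to function; `task_hello` and `run_task` are hypothetical, and the real building blocks are deliberately not called because their signatures are not documented here.

```python
from types import SimpleNamespace


def task_hello(db, args):
    """Hypothetical task: report which database it would run against."""
    return f"hello from {args.db} (dry_run={args.dry_run})"


# Assumed shape: task name -> function taking (db, args).
TASK_REGISTRY = {
    "hello": task_hello,
}


def run_task(name, db, args):
    """Look up a task by name and run it, failing clearly if it is unknown."""
    try:
        task = TASK_REGISTRY[name]
    except KeyError:
        available = ", ".join(sorted(TASK_REGISTRY))
        raise SystemExit(f"Unknown task {name!r}; available: {available}")
    return task(db, args)


# Stand-in for parsed command-line arguments; db is unused by this task.
args = SimpleNamespace(db="target_db", dry_run=True)
print(run_task("hello", db=None, args=args))
```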
+
+---
+
+## Understanding the Logs
+
+### Migration Progress
+```
+Processing batch 10/100 (500 records)...
+Batch 10 complete: inserted=450, updated=50, errors=0
+[COMMIT] Transaction committed at batch 10
+```
+
+### Validation Results
+```
+==================================================
+$  Migration Validation Completed!
+==================================================
+Gross Success rate: 93.60%
+Success rate w/o Ignored Files: 100.00%
+371630 of 397043 Rows were found
+--------------------------------------------------
+MISSING ROW COUNT: 0 - See CSV output for details
+We ignored a total of 25413 rows. We couldn't map these to the TargetDB
+```
+
+- **Gross Success rate** — % of all source records found in target
+- **Success rate w/o Ignored Files** — % of mappable records found (should
+  be 100%)
+- **MISSING ROW COUNT** — Records that should exist but don't (should be 0)
+- **Ignored** — Records that couldn't be mapped (unmapped variables,
+  documents, or configurations)
+
+## Important Notes
+
+### File Data Migration Behavior
+
+1. **Only Latest Revisions** — File data migration only migrates the most
+   recent revision of each variable for each file configuration. Historical
+   revisions are not migrated.
+2. **RevisionNo Reset** — All migrated file data is inserted with
+   `RevisionNo = 1` in the target database.
+3. **Configuration Mapping** — You MUST verify manual overrides in
+   `config.json` before running.
+
+### Progress Tracking and Resume
+
+Both migration scripts support automatic resume:
+- Progress is saved every 10 batches.
+- If a migration fails, re-run the script and it will offer to resume.
+- Progress files are automatically cleaned up on success.
+
+### Validation
+
+All migrations include automatic validation:
+- Compares source records (after mapping) to target records using set-based
+  comparison.
+- Reports any missing records to CSV.
+- Should show 100% success rate for mappable records.
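
The set-based comparison and the two success rates reported in the validation summary can be sketched like this. The key layout (tuples of mapped IDs) and the function name are assumptions for illustration only.

```python
def validate(source_keys, target_keys, ignored_count):
    """Compare mapped source records with target records.

    source_keys / target_keys: sets of tuples identifying a record
    (hypothetical layout). ignored_count: source rows that could not
    be mapped to the target at all.
    """
    missing = source_keys - target_keys
    found = len(source_keys) - len(missing)
    total = len(source_keys) + ignored_count
    gross = 100.0 * found / total if total else 100.0
    mappable = 100.0 * found / len(source_keys) if source_keys else 100.0
    return {
        "found": found,
        "total": total,
        "missing": missing,
        "gross_success_rate": gross,
        "success_rate_without_ignored": mappable,
    }


source = {(57, 1001, 3), (50, 1001, 3), (57, 1002, 3)}
target = source | {(99, 2000, 1)}  # extra target rows do not count against us
report = validate(source, target, ignored_count=1)
print(report)
```

Extra rows in the target do not affect either rate; only mapped source records absent from the target count as missing.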
+
+## Troubleshooting
+
+### "Migration failed at batch X"
+Check the log file, then re-run and choose `y` to resume from the last
+checkpoint.
+
+### "We ignored a total of X rows"
+Expected for unmapped variables, documents, or configurations. Check the
+mapping CSV files to see what was skipped.
+
+### "MISSING ROW COUNT: X" (where X > 0)
+Indicates a real problem:
+1. Check `validation_missing_*.csv` for details.
+2. Verify ID mappings in the mapping CSV files.
+3. Check the migration log for insert errors.
+
+### Configuration Mapping Issues
+If you see warnings about duplicate ConfigurationNames:
+1. Run the duplicate-finding SQL query from the Configuration Mapping
+   Overrides section.
+2. Determine the correct target ID for each source configuration.
+3. Add manual overrides to `config.json`.
+4. Re-run the migration.
+
+### Database Connection Timeouts
+- Progress is saved automatically — re-run to resume.
+- Consider reducing `batch_size` in `config.json`.
+
+### Batch Workflow Transition Failures
+
+If you see `[CS3] Phase-2 access violation ...` warnings in
+`batch_workflow_paths_*.log`:
+
+- The script automatically retries with escalating backoff (3s → 10s → 30s).
+- After 3 consecutive persistent failures it automatically reconnects the
+  vault to reset PDM's in-process DLL state.
+- Genuine failures end up in `failed_transitions_{timestamp}.txt` — feed
+  that file straight back in to retry just the failures.
+- Files that appear in `not_available_{timestamp}.txt` aren't really
+  failures; they were already in the target state (e.g. from a previous
+  successful run).
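
The retry behaviour described above can be sketched generically. This is not the code from `batch_workflows_paths.py`: the function name, the callback parameters, and the use of `OSError` as the failure signal are all assumptions, and the exact retry semantics in the real script may differ.

```python
import time

BACKOFF_SECONDS = (3, 10, 30)   # delays before the 1st, 2nd, 3rd retry
RECONNECT_AFTER = 3             # consecutive failed items before reconnecting


def run_batch(items, attempt, reconnect, sleep=time.sleep):
    """Try each item with escalating backoff; return the items that failed.

    attempt(item) performs one transition and raises OSError on failure
    (a stand-in for whatever the real COM call raises). reconnect() is
    called after RECONNECT_AFTER items in a row exhaust their retries.
    """
    failed = []
    consecutive_failures = 0
    for item in items:
        succeeded = False
        for delay in (0, *BACKOFF_SECONDS):
            if delay:
                sleep(delay)
            try:
                attempt(item)
                succeeded = True
                break
            except OSError:
                continue
        if succeeded:
            consecutive_failures = 0
            continue
        failed.append(item)
        consecutive_failures += 1
        if consecutive_failures >= RECONNECT_AFTER:
            reconnect()
            consecutive_failures = 0
    return failed
```

Passing `sleep` as a parameter keeps the loop testable without real delays.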
+
+See [BATCH_NOTES.md](BATCH_NOTES.md) for full background on why
+`ChangeState3` is difficult to call and how the COM bridging works.
+
+## Best Practices
+
+1. **Always back up** the target database before running migrations.
+2. **Test on a dev/test environment first**.
+3. **Review mapping CSV files** to verify ID mappings are correct.
+4. **Check validation results** — 100% success for mappable records.
+5. **Keep `config.json`** with any manual overrides for future reference.
+6. **Use `--dry-run`** with `db_helper.py` tasks before real runs.
+7. **Save the `has_vv50_*.txt` / `failed_transitions_*.txt` output files**
+   — they let you incrementally mop up residual work without re-processing
+   everything.
+
+## Support
+
+For issues or questions:
+1. Check the log files for detailed error messages.
+2. Review the mapping CSV files to verify ID mappings.
+3. Ensure `config.json` is properly configured.
+4. Verify database connectivity and permissions.
+5. For PDM COM API internals, see [BATCH_NOTES.md](BATCH_NOTES.md).