# PDM Vault Data Migration Tool

A collection of Python scripts for migrating and managing SOLIDWORKS PDM
Professional vault data. The project began as a pair of SQL-to-SQL migration
scripts (for folder and file variable values) and has grown to include a
suite of PDM API helpers for batch workflow transitions, Copy Tree exports,
and interactive SQL tasks.

## Overview

The project is split into two layers:

1. **Top-level migration scripts** — one-shot SQL migrations between source
   and target PDM databases (`migrate_folderdata.py`, `migrate_filedata.py`,
   `rollback_filedata.py`), plus verification utilities
   (`check_var_clashing.py`, `check_paths.py`).
2. **`helpers/` toolkit** — a newer set of scripts for live vault operations
   and ad-hoc SQL work:
   - Batch workflow state transitions via the PDM COM API
   - Batch "Copy Tree" exports
   - Interactive SQL helper with named query files and preview-and-confirm
     on every write

## Prerequisites

- Python 3.x (3.12 tested)
- SQL Server access to both source and target databases (ODBC Driver 17+)
- SOLIDWORKS PDM Professional client (for `helpers/batch_*` scripts only —
  these use the local `ConisioLib.EdmVault` COM component)
- Required packages (see `requirements.txt`):
  - `pyodbc` — SQL Server connections
  - `pywin32` — PDM COM API via `win32com.client` / `pythoncom`
  - `comtypes` — vtable-level COM calls (used for `ChangeState3`)

Install dependencies:
```bash
pip install -r requirements.txt
```

## Project Structure

```
data_migration_project/
├── config.json                 # Your real config (DB credentials, mappings)
├── config.json.template        # Template for new installs — copy to config.json
├── db_utils.py                 # Shared SQL Server connection wrapper
├── migrate_folderdata.py       # Folder/project variable value migration
├── migrate_filedata.py         # File variable value migration (latest revision)
├── rollback_filedata.py        # Rolls back a filedata migration
├── check_var_clashing.py       # Finds variable name conflicts before migration
├── check_paths.py              # Verifies folder path mapping
├── requirements.txt            # Python dependencies
├── BATCH_NOTES.md              # Deep-dive on PDM COM ChangeState3 internals
├── README.md                   # This file
│
├── helpers/
│   ├── batch_workflows_paths.py   # Batch workflow state transitions
│   ├── batch_copy_tree.py         # Batch Copy Tree export
│   ├── db_helper.py               # Interactive SQL helper + tasks
│   ├── test_batch_api.py          # Dev-time prototype for PDM COM bridging
│   └── queries/                   # Reusable named SQL queries (.sql files)
│
├── documentation/              # PDM API reference (.chm files)
└── logs/                       # Migration log files
```

## Configuration Setup

### 1. Create `config.json`

Copy the template and fill in your real values:

```bash
cp config.json.template config.json
```

The project supports both **SQL Authentication** and **Windows Authentication**:

```json
{
  "source_db": {
    "driver": "{ODBC Driver 17 for SQL Server}",
    "server": "your-server",
    "database": "source-database",
    "username": "sql-user",
    "password": "sql-password",
    "trusted_connection": false
  },
  "target_db": {
    "driver": "{ODBC Driver 17 for SQL Server}",
    "server": "your-server",
    "database": "target-database",
    "trusted_connection": true
  }
}
```

- Set `trusted_connection: true` to use Windows Authentication (ignores
  username/password).
- Set `trusted_connection: false` and provide `username`/`password` for SQL
  Server Authentication.

### 2. Path Mapping (for folder data)

```json
{
  "path_mapping": {
    "target_root_folder": "DWS",
    "case_sensitive": false
  }
}
```

- `target_root_folder`: Root folder name in your target vault. Source folder
  paths are prepended with this folder name.

### 3. Migration Settings

```json
{
  "migration": {
    "duplicate_handling": "ignore",
    "batch_size": 500,
    "commit_interval": 10,
    "document_status_batch_size": 5000
  }
}
```

### 4. Configuration Mapping Overrides (file data migration only)

```json
{
  "configuration_mapping_overrides": {
    "165": 11250,
    "167": 11359
  }
}
```

Required when duplicate configuration names exist in the target database.
Format: `"source_id": target_id`.

**To find duplicates:**
```sql
SELECT ConfigurationID, ConfigurationName
FROM DocumentConfiguration
WHERE ConfigurationName IN (
    SELECT ConfigurationName
    FROM DocumentConfiguration
    GROUP BY ConfigurationName
    HAVING COUNT(*) > 1
)
ORDER BY ConfigurationName, ConfigurationID;
```

---

## Top-Level Migration Scripts

### Folder Data Migration

Migrates variable values for folders/projects (DocumentID = 1).

```bash
python migrate_folderdata.py
```

**What it does:**
- Maps ProjectIDs based on folder paths
- Maps VariableIDs based on variable names
- Migrates all revisions of folder-level variable values
- Validates the migration
- Creates mapping CSV files for review

**Output files:**
- `mapping_projects_{timestamp}.csv`
- `mapping_variables_{timestamp}.csv`
- `folderdata_migration_{timestamp}.log`
- `validation_missing_folderdata_{timestamp}.csv` (if issues)

---

### File Data Migration

Migrates variable values for files (ProjectID = 2, DocumentID != 1).

**IMPORTANT:** Only migrates the **latest revision** of each variable for
each file configuration.

```bash
python migrate_filedata.py
```

**What it does:**
- Maps VariableIDs by name, DocumentIDs by full file path, ConfigurationIDs
  (with manual overrides from `config.json` if specified)
- Fetches only the latest revision per VariableID+DocumentID+ConfigurationID
- Inserts all records with `RevisionNo = 1`
- Validates the migration and emits mapping CSVs

**You will be prompted** to confirm configuration mapping overrides before
the migration runs — review carefully.

**Output files:**
- `mapping_variables_filedata_{timestamp}.csv`
- `mapping_documents_filedata_{timestamp}.csv`
- `filedata_migration_{timestamp}.log`
- `validation_missing_filedata_{timestamp}.csv` (if issues)
- Progress files (auto-deleted on success)

---

### Rollback File Data Migration

```bash
python rollback_filedata.py mapping_documents_filedata_YYYYMMDD_HHMMSS.csv
```

Reads the document mapping CSV from a previous migration and deletes all
VariableValue records for those documents from the target database.
**Shows preview and prompts for confirmation before deleting.**

> **WARNING:** This permanently deletes data from the target database.
> Always back up first.

---

## Helper Scripts (`helpers/`)

The helpers are live-vault tools that talk to PDM directly via the COM API,
plus an interactive SQL runner. They are independent of the top-level
migration scripts and can be used any time.

### `batch_workflows_paths.py` — Batch Workflow Transitions

Drives `IEdmFile13::ChangeState3` against hundreds or thousands of files at
once, transitioning each through a named workflow transition. Implements
escalating-backoff retries and vault reconnect to handle PDM's in-process
DLL state corruption on large batches.

```bash
python helpers/batch_workflows_paths.py -v "Drilling_Test" -c files.csv -t "AA"
```

**Options:**
- `-v, --vault` — PDM vault name
- `-c, --csv` — Path to a text/CSV file with one full vault path per line
  (e.g. `C:\PDM\Drilling_Test\DWS\Parts\widget.sldprt`)
- `-t, --transition` — Name of the workflow transition (e.g. `"AA"`)
- `--comment` — Optional transition comment
- `-u, --username` — PDM username (prompts if omitted)

**Output files:**
- `batch_workflow_paths_{timestamp}.log` — detailed log
- `failed_transitions_{timestamp}.txt` — real failures worth retrying
- `not_available_{timestamp}.txt` — files whose transition wasn't valid
  (typically already in the target state from a prior run)

For implementation details on the restricted `ChangeState3` COM method and
why it requires ctypes/comtypes vtable access, see
[BATCH_NOTES.md](BATCH_NOTES.md).

---

### `batch_copy_tree.py` — Batch Copy Tree Export

Reads part numbers from a CSV, runs PDM's Copy Tree function for each, and
exports each part's file tree to its own subfolder.

```bash
python helpers/batch_copy_tree.py -c parts.csv -o "C:\Temp\Output" --vault "Drilling_Test"
```

---

### `db_helper.py` — Interactive SQL Helper

Runs SELECT queries, multi-step tasks, and confirmed INSERTs against either
database from `config.json`. Queries are stored as `.sql` files in
`helpers/queries/` and referenced by name.

**List saved queries:**
```bash
python helpers/db_helper.py --list-queries
```

**Run a saved query by name:**
```bash
python helpers/db_helper.py --db target_db --query get_var47
```

**Run raw SQL (anything with a space in it is treated as a literal query):**
```bash
python helpers/db_helper.py --db target_db --query "SELECT TOP 10 * FROM Documents"
```

**Run a predefined task:**
```bash
python helpers/db_helper.py --db target_db --task copy_57_to_50 --dry-run
python helpers/db_helper.py --db target_db --task copy_57_to_50
```

#### Safety features

- Every INSERT or UPDATE goes through `preview_and_confirm` — you see the
  SQL, the row count, and a sample of the data and must type `y` before it
  executes.
- `--dry-run` shows the preview but skips execution entirely.
- All writes run inside a transaction. On any per-row error you're asked
  whether to commit or rollback.
- Every query, parameter set, and decision is logged to
  `db_helper_{timestamp}.log`.

#### Saved SQL Queries

Drop a `.sql` file into `helpers/queries/` and it becomes callable by its
filename (without extension). Leave a comment on the first line for an
inline description — it shows up in `--list-queries`.

Current queries:
- `DWS_GET_VV-57.sql` — Documents in DWS paths that have VariableID=57
- `DWS_VV-57_FullList.sql` — Full VariableValue rows for VV-57 in DWS paths
- `Get_All_VV_Per_DocID.sql` — All distinct VariableIDs for a given
  DocumentID (parameterized with `?`)
- `INSERT_VV50_Copy.sql` — Inserts a VV-50 copy of a VV-57 row

#### Tasks

Tasks are Python functions in `db_helper.py` that chain multiple queries
and transforms together — e.g. run a SELECT, loop the results, run a
second parameterized SELECT per row, validate, then INSERT filtered rows
with confirmation.

Each task is registered in `TASK_REGISTRY` near the bottom of the file.
Current tasks:

| Task | Purpose |
|------|---------|
| `check_vv50` | For every doc with VV-57, check whether it also has VV-50. Writes `has_vv50_{timestamp}.txt`. |
| `copy_57_to_50` | Insert VV-50 rows mirroring existing VV-57 rows, skipping any DocumentIDs already in a `has_vv50_*.txt` file. |
| `copy_with_new_id` | Example/template task — copy rows with a transformed ID. |

**Adding a new task:** write a function `def task_foo(db, args): ...` and
add it to `TASK_REGISTRY`. The building blocks `run_select`, `load_query`,
`preview_and_confirm`, and `run_insert` are all at the top of the file.

---

## Understanding the Logs

### Migration Progress
```
Processing batch 10/100 (500 records)...
Batch 10 complete: inserted=450, updated=50, errors=0
[COMMIT] Transaction committed at batch 10
```

### Validation Results
```
==================================================
$  Migration Validation Completed!
==================================================
Gross Success rate: 95.39%
Success rate w/o Ignored Files: 100.00%
371630 of 397043 Rows were found
--------------------------------------------------
MISSING ROW COUNT: 0 - See CSV output for details
We ignored a total of 25413 rows. We couldn't map these to the TargetDB
```

- **Gross Success rate** — % of all source records found in target
- **Success rate w/o Ignored Files** — % of mappable records found (should
  be 100%)
- **MISSING ROW COUNT** — Records that should exist but don't (should be 0)
- **Ignored** — Records that couldn't be mapped (unmapped variables,
  documents, or configurations)

## Important Notes

### File Data Migration Behavior

1. **Only Latest Revisions** — File data migration only migrates the most
   recent revision of each variable for each file configuration. Historical
   revisions are not migrated.
2. **RevisionNo Reset** — All migrated file data is inserted with
   `RevisionNo = 1` in the target database.
3. **Configuration Mapping** — You MUST verify manual overrides in
   `config.json` before running.

### Progress Tracking and Resume

Both migration scripts support automatic resume:
- Progress is saved every 10 batches.
- If a migration fails, re-run the script and it will offer to resume.
- Progress files are automatically cleaned up on success.

### Validation

All migrations include automatic validation:
- Compares source records (after mapping) to target records using set-based
  comparison.
- Reports any missing records to CSV.
- Should show 100% success rate for mappable records.

## Troubleshooting

### "Migration failed at batch X"
Check the log file, then re-run and choose `y` to resume from the last
checkpoint.

### "We ignored a total of X rows"
Expected for unmapped variables, documents, or configurations. Check the
mapping CSV files to see what was skipped.

### "MISSING ROW COUNT: X" (where X > 0)
Indicates a real problem:
1. Check `validation_missing_*.csv` for details.
2. Verify ID mappings in the mapping CSV files.
3. Check the migration log for insert errors.

### Configuration Mapping Issues
If you see warnings about duplicate ConfigurationNames:
1. Run the SQL query above to find duplicates.
2. Determine the correct target ID for each source configuration.
3. Add manual overrides to `config.json`.
4. Re-run the migration.

### Database Connection Timeouts
- Progress is saved automatically — re-run to resume.
- Consider reducing `batch_size` in `config.json`.

### Batch Workflow Transition Failures

If you see `[CS3] Phase-2 access violation ...` warnings in
`batch_workflow_paths_*.log`:

- The script automatically retries with escalating backoff (3s → 10s → 30s).
- After 3 consecutive persistent failures it automatically reconnects the
  vault to reset PDM's in-process DLL state.
- Genuine failures end up in `failed_transitions_{timestamp}.txt` — feed
  that file straight back in to retry just the failures.
- Files that appear in `not_available_{timestamp}.txt` aren't really
  failures; they were already in the target state (e.g. from a previous
  successful run).

See [BATCH_NOTES.md](BATCH_NOTES.md) for full background on why
`ChangeState3` is difficult to call and how the COM bridging works.

## Best Practices

1. **Always back up** the target database before running migrations.
2. **Test on a dev/test environment first**.
3. **Review mapping CSV files** to verify ID mappings are correct.
4. **Check validation results** — 100% success for mappable records.
5. **Keep `config.json`** with any manual overrides for future reference.
6. **Use `--dry-run`** with `db_helper.py` tasks before real runs.
7. **Save the `has_vv50_*.txt` / `failed_transitions_*.txt` output files**
   — they let you incrementally mop up residual work without re-processing
   everything.

## Support

For issues or questions:
1. Check the log files for detailed error messages.
2. Review the mapping CSV files to verify ID mappings.
3. Ensure `config.json` is properly configured.
4. Verify database connectivity and permissions.
5. For PDM COM API internals, see [BATCH_NOTES.md](BATCH_NOTES.md).