File Upload Script Documentation

This script uploads a folder of files to a server using SAS/Signed URLs. It supports both Azure and Google Cloud Storage as providers and includes concurrent uploads, checksum verification, and recursive folder processing.


Prerequisites

  1. Download the script from the Karya Language Server GitHub repository (path: /scripts/upload_script.py)
  2. Install the dependencies by running:
pip install aiofiles aiohttp requests
  3. Ensure any other missing dependencies are installed on your system

How It Works

The upload process follows these steps:

  1. File Scanning: The script scans your specified folder for files
  2. Checksum Calculation: MD5 checksums are calculated for each file
  3. MIME Type Detection: File types are automatically detected
  4. URL Generation: The server generates SAS/Signed URLs for each file
  5. Concurrent Upload: Files are uploaded to cloud storage (max 5 concurrent uploads)
  6. Completion: The upload process is finalized and a dataset ID is returned
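
The first three steps can be sketched in a few lines of Python. This is illustrative only, not the script's actual code; the function names are hypothetical, though the standard-library calls behave as shown:

import hashlib
import mimetypes
from pathlib import Path

def scan_files(folder, recursive=False):
    # Step 1: collect regular files, optionally descending into subfolders
    pattern = "**/*" if recursive else "*"
    return [p for p in Path(folder).glob(pattern) if p.is_file()]

def md5_checksum(path, chunk_size=8192):
    # Step 2: hash in chunks so large files never load fully into memory
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def detect_mime(path):
    # Step 3: guess the MIME type from the filename extension
    mime, _ = mimetypes.guess_type(str(path))
    return mime or "application/octet-stream"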

Key Features

  • Concurrent Uploads: Up to 5 files are uploaded simultaneously for faster processing (see the sketch after this list)
  • Checksum Verification: MD5 checksums ensure file integrity
  • Recursive Processing: Option to process files in nested folders
  • Multiple Providers: Support for both Azure and Google Cloud Storage
  • Error Handling: Comprehensive error reporting and retry logic
  • Progress Logging: Detailed logging with optional verbose mode
  • File Count Limit: For Sarvam ASR, each upload batch is limited to 1–20 audio files; larger batches may fail.
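
The 5-way concurrency cap is the sort of thing asyncio's Semaphore handles directly. The sketch below shows one common way to combine it with aiohttp and aiofiles; it is an assumption about the implementation, not the script's actual code:

import asyncio
import aiofiles
import aiohttp

MAX_CONCURRENT_UPLOADS = 5  # the documented limit

async def upload_one(session, semaphore, signed_url, file_path):
    # The semaphore guarantees at most 5 uploads are in flight at once
    async with semaphore:
        async with aiofiles.open(file_path, "rb") as f:
            data = await f.read()
        # Note: Azure block-blob SAS uploads typically also require an
        # "x-ms-blob-type: BlockBlob" header; omitted here for brevity.
        async with session.put(signed_url, data=data) as resp:
            resp.raise_for_status()

async def upload_all(urls_by_path):
    # urls_by_path maps local file paths to their SAS/Signed URLs
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_UPLOADS)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(
            upload_one(session, semaphore, url, path)
            for path, url in urls_by_path.items()
        ))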

Command Line Arguments

Required Arguments

Argument       Description
folder_path    Path to the folder containing files to upload (for Sarvam ASR, 1–20 audio files per batch)
--api-key      API key for server authentication
--user-email   User email address
--server-url   Base server URL (e.g., https://dev-server.com)

Optional Arguments

Argument        Default         Description
--provider      Azure           Storage provider (Azure or Google)
--project-name  Individual      Project name for the upload
--title         Auto-generated  Custom title for the upload
--description   Auto-generated  Custom description for the upload
--recursive     False           Process files recursively from nested folders
--verbose, -v   False           Enable verbose logging (debug level)
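
For reference, the documented interface maps naturally onto Python's argparse. The sketch below is reconstructed from the tables above rather than taken from the script itself, so the real parser may differ in detail:

import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Upload a folder of files to a server via SAS/Signed URLs."
    )
    parser.add_argument("folder_path",
                        help="Path to the folder containing files to upload")
    parser.add_argument("--api-key", required=True,
                        help="API key for server authentication")
    parser.add_argument("--user-email", required=True,
                        help="User email address")
    parser.add_argument("--server-url", required=True,
                        help="Base server URL")
    parser.add_argument("--provider", default="Azure",
                        choices=["Azure", "Google"],
                        help="Storage provider")
    parser.add_argument("--project-name", default="Individual",
                        help="Project name for the upload")
    parser.add_argument("--title",
                        help="Custom title (auto-generated if omitted)")
    parser.add_argument("--description",
                        help="Custom description (auto-generated if omitted)")
    parser.add_argument("--recursive", action="store_true",
                        help="Process files from nested folders")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Enable debug-level logging")
    return parser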

Usage Examples

Basic Upload (Top-level files only)

python upload_script.py /path/to/files \
  --api-key YOUR_KEY \
  --user-email user@example.com \
  --server-url https://dev-server.com

Recursive Upload (Include nested folders)

python upload_script.py /path/to/files \
  --api-key YOUR_KEY \
  --user-email user@example.com \
  --server-url https://dev-server.com \
  --recursive

Google Cloud Storage Upload

python upload_script.py /path/to/files \
  --api-key YOUR_KEY \
  --user-email user@example.com \
  --server-url https://dev-server.com \
  --provider Google

Complete Example with All Options

python upload_script.py ./my-dataset/ \
  --api-key "your-api-key-here" \
  --user-email "user@example.com" \
  --server-url "https://api.yourcompany.com" \
  --provider Google \
  --project-name "ML-Project" \
  --title "Audio Dataset v2" \
  --description "Updated audio files for ML training" \
  --recursive \
  --verbose

Output and Results

Success Output

When the upload completes successfully, you'll see:

🎉 SUCCESS! Dataset ID: abc123-def456-ghi789

Progress Logging

The script provides detailed logging:

2024-01-15 10:30:00 - INFO - === File Upload Configuration ===
2024-01-15 10:30:00 - INFO - Folder Path: ./my-dataset/
2024-01-15 10:30:00 - INFO - Server URL: https://api.yourcompany.com
2024-01-15 10:30:00 - INFO - Provider: Google
2024-01-15 10:30:00 - INFO - User Email: user@example.com
2024-01-15 10:30:00 - INFO - Project Name: ML-Project
2024-01-15 10:30:00 - INFO - Recursive: Yes
2024-01-15 10:30:00 - INFO - ===================================
2024-01-15 10:30:01 - INFO - Step 1: Scanning files...
2024-01-15 10:30:01 - INFO - Found 25 files
2024-01-15 10:30:02 - INFO - Step 2: Creating JSON payload...
2024-01-15 10:30:02 - INFO - Step 3: Getting upload URLs...
2024-01-15 10:30:03 - INFO - Received task ID: task_12345
2024-01-15 10:30:03 - INFO - Step 4: Uploading files...
2024-01-15 10:30:05 - INFO - ✓ Successfully uploaded file1.wav (1024000 bytes)
2024-01-15 10:30:06 - INFO - ✓ Successfully uploaded file2.wav (2048000 bytes)
2024-01-15 10:30:10 - INFO - Upload Results: 25/25 files uploaded successfully
2024-01-15 10:30:10 - INFO - Step 5: Completing upload...
2024-01-15 10:30:11 - INFO - ✓ Upload process completed! Dataset ID: abc123-def456-ghi789
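
The timestamped lines above suggest a standard logging.basicConfig setup. A minimal sketch, assuming the --verbose flag simply lowers the log level to DEBUG:

import logging

def configure_logging(verbose):
    logging.basicConfig(
        level=logging.DEBUG if verbose else logging.INFO,
        format="%(asctime)s - %(levelname)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )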

Error Handling

Common Error Scenarios

  1. Dataset Name Conflict (409 Error)

    ❌ ERROR: Dataset name already exists. Please use a different name or delete the existing dataset first.
    

  2. Authentication Failure

    ❌ ERROR: Server error 401: Unauthorized
    

  3. File Not Found

    ❌ ERROR: Folder path does not exist: /invalid/path
    

  4. Upload Failures

    ⚠️ WARNING: 3 files failed to upload
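
The retry logic mentioned under Key Features is not spelled out in this documentation; a common pattern is exponential backoff around each upload attempt, as in this hypothetical sketch:

import asyncio
import aiohttp

async def with_retry(make_attempt, attempts=3, base_delay=1.0):
    # Retry a coroutine factory with exponential backoff: 1s, 2s, 4s, ...
    for attempt in range(attempts):
        try:
            return await make_attempt()
        except aiohttp.ClientError:
            if attempt == attempts - 1:
                raise  # surface the failure after the final attempt
            await asyncio.sleep(base_delay * 2 ** attempt)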
    

Troubleshooting

  • Check API Key: Ensure your API key is valid and has upload permissions
  • Verify Server URL: Make sure the server URL is correct and accessible
  • File Permissions: Ensure the script has read access to your files
  • Network Connectivity: Check your internet connection for large file uploads
  • Use Verbose Mode: Add --verbose flag for detailed debugging information

Best Practices

  1. File Organization: Keep related files in the same folder for easier management
  2. Naming Conventions: Use descriptive filenames and avoid special characters
  3. File Sizes: Large files may take longer to upload; consider splitting very large files
  4. Recursive Uploads: Use --recursive only when you need files from subdirectories
  5. Verbose Logging: Use --verbose for troubleshooting upload issues
  6. Backup: Always keep backups of your original files before uploading

Next Steps

After successful upload, you can:

  • Use the returned dataset_id to create transcription tasks
  • Share the dataset with team members
  • Create additional datasets using the same process
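
For example, creating a transcription task from the returned dataset ID might look like the request below. The endpoint path, header name, and payload fields are hypothetical placeholders; consult your server's API documentation for the real ones:

import requests

response = requests.post(
    "https://api.yourcompany.com/tasks",         # hypothetical endpoint
    headers={"api-key": "your-api-key-here"},    # header name is an assumption
    json={"dataset_id": "abc123-def456-ghi789",  # ID returned by the upload
          "task_type": "transcription"},         # hypothetical field
    timeout=30,
)
response.raise_for_status()
print(response.json())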

Note

The script automatically generates titles and descriptions if not provided. For better organization, consider providing custom titles and descriptions that clearly identify your datasets.