File Upload Script Documentation

This script uploads a folder of files to a server using SAS/Signed URLs. It supports both Azure and Google Cloud Storage as providers and includes concurrent uploads, checksum verification, and recursive folder processing.


Prerequisites

  1. Download the script from the Karya Language Server GitHub repository (path: /scripts/upload_script.py)
  2. Install the dependencies by running:
pip install aiofiles aiohttp requests
  3. Ensure any other missing dependencies are installed on your system

How It Works

The upload process follows these steps:

  1. File Scanning: The script scans your specified folder for files
  2. Checksum Calculation: MD5 checksums are calculated for each file
  3. MIME Type Detection: File types are automatically detected
  4. URL Generation: The server generates SAS/Signed URLs for each file
  5. Concurrent Upload: Files are uploaded to cloud storage (max 5 concurrent uploads)
  6. Completion: The upload process is finalized and a dataset ID is returned
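
The first three steps can be sketched in a few lines of Python. This is illustrative only, not the script's actual code; the function names are hypothetical, though the standard-library calls behave as shown:

import hashlib
import mimetypes
from pathlib import Path

def scan_files(folder, recursive=False):
    # Step 1: collect regular files, optionally descending into subfolders
    pattern = "**/*" if recursive else "*"
    return [p for p in Path(folder).glob(pattern) if p.is_file()]

def md5_checksum(path, chunk_size=8192):
    # Step 2: hash in chunks so large files never load fully into memory
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def detect_mime(path):
    # Step 3: guess the MIME type from the filename extension
    mime, _ = mimetypes.guess_type(str(path))
    return mime or "application/octet-stream"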

Key Features

  • Concurrent Uploads: Up to 5 files are uploaded simultaneously for faster processing (see the sketch after this list)
  • Checksum Verification: MD5 checksums ensure file integrity
  • Recursive Processing: Option to process files in nested folders
  • Multiple Providers: Support for both Azure and Google Cloud Storage
  • Error Handling: Comprehensive error reporting and retry logic
  • Progress Logging: Detailed logging with optional verbose mode
  • File Count Limit: For Sarvam ASR, each upload batch is limited to 1–20 audio files; larger batches may fail.
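
The 5-way concurrency cap is the sort of thing asyncio's Semaphore handles directly. The sketch below shows one common way to combine it with aiohttp and aiofiles; it is an assumption about the implementation, not the script's actual code:

import asyncio
import aiofiles
import aiohttp

MAX_CONCURRENT_UPLOADS = 5  # the documented limit

async def upload_one(session, semaphore, signed_url, file_path):
    # The semaphore guarantees at most 5 uploads are in flight at once
    async with semaphore:
        async with aiofiles.open(file_path, "rb") as f:
            data = await f.read()
        # Note: Azure block-blob SAS uploads typically also require an
        # "x-ms-blob-type: BlockBlob" header; omitted here for brevity.
        async with session.put(signed_url, data=data) as resp:
            resp.raise_for_status()

async def upload_all(urls_by_path):
    # urls_by_path maps local file paths to their SAS/Signed URLs
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_UPLOADS)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(
            upload_one(session, semaphore, url, path)
            for path, url in urls_by_path.items()
        ))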

Command Line Arguments

Required Arguments

Argument       Description
folder_path    Path to the folder containing files to upload (for Sarvam ASR, 1–20 audio files per batch)
--api-key      API key for server authentication
--user-email   User email address
--server-url   Base server URL (e.g., https://dev-server.com)

Optional Arguments

Argument        Default         Description
--provider      Azure           Storage provider (Azure or Google)
--project-name  Individual      Project name for the upload
--title         Auto-generated  Custom title for the upload
--description   Auto-generated  Custom description for the upload
--recursive     False           Process files recursively from nested folders
--verbose, -v   False           Enable verbose logging (debug level)
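
For reference, the documented interface maps naturally onto Python's argparse. The sketch below is reconstructed from the tables above rather than taken from the script itself, so the real parser may differ in detail:

import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Upload a folder of files to a server via SAS/Signed URLs."
    )
    parser.add_argument("folder_path",
                        help="Path to the folder containing files to upload")
    parser.add_argument("--api-key", required=True,
                        help="API key for server authentication")
    parser.add_argument("--user-email", required=True,
                        help="User email address")
    parser.add_argument("--server-url", required=True,
                        help="Base server URL")
    parser.add_argument("--provider", default="Azure",
                        choices=["Azure", "Google"],
                        help="Storage provider")
    parser.add_argument("--project-name", default="Individual",
                        help="Project name for the upload")
    parser.add_argument("--title",
                        help="Custom title (auto-generated if omitted)")
    parser.add_argument("--description",
                        help="Custom description (auto-generated if omitted)")
    parser.add_argument("--recursive", action="store_true",
                        help="Process files from nested folders")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Enable debug-level logging")
    return parser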

Usage Examples

Basic Upload (Top-level files only)

python upload_script.py /path/to/files \
  --api-key YOUR_KEY \
  --user-email user@example.com \
  --server-url https://dev-server.com

Recursive Upload (Include nested folders)

python upload_script.py /path/to/files \
  --api-key YOUR_KEY \
  --user-email user@example.com \
  --server-url https://dev-server.com \
  --recursive

Google Cloud Storage Upload

python upload_script.py /path/to/files \
  --api-key YOUR_KEY \
  --user-email user@example.com \
  --server-url https://dev-server.com \
  --provider Google

Complete Example with All Options

python upload_script.py ./my-dataset/ \
  --api-key "your-api-key-here" \
  --user-email "user@example.com" \
  --server-url "https://api.yourcompany.com" \
  --provider Google \
  --project-name "ML-Project" \
  --title "Audio Dataset v2" \
  --description "Updated audio files for ML training" \
  --recursive \
  --verbose

Output and Results

Success Output

When the upload completes successfully, you'll see:

🎉 SUCCESS! Dataset ID: abc123-def456-ghi789

Progress Logging

The script provides detailed logging:

2024-01-15 10:30:00 - INFO - === File Upload Configuration ===
2024-01-15 10:30:00 - INFO - Folder Path: ./my-dataset/
2024-01-15 10:30:00 - INFO - Server URL: https://api.yourcompany.com
2024-01-15 10:30:00 - INFO - Provider: Google
2024-01-15 10:30:00 - INFO - User Email: user@example.com
2024-01-15 10:30:00 - INFO - Project Name: ML-Project
2024-01-15 10:30:00 - INFO - Recursive: Yes
2024-01-15 10:30:00 - INFO - ===================================
2024-01-15 10:30:01 - INFO - Step 1: Scanning files...
2024-01-15 10:30:01 - INFO - Found 25 files
2024-01-15 10:30:02 - INFO - Step 2: Creating JSON payload...
2024-01-15 10:30:02 - INFO - Step 3: Getting upload URLs...
2024-01-15 10:30:03 - INFO - Received task ID: task_12345
2024-01-15 10:30:03 - INFO - Step 4: Uploading files...
2024-01-15 10:30:05 - INFO - ✓ Successfully uploaded file1.wav (1024000 bytes)
2024-01-15 10:30:06 - INFO - ✓ Successfully uploaded file2.wav (2048000 bytes)
2024-01-15 10:30:10 - INFO - Upload Results: 25/25 files uploaded successfully
2024-01-15 10:30:10 - INFO - Step 5: Completing upload...
2024-01-15 10:30:11 - INFO - ✓ Upload process completed! Dataset ID: abc123-def456-ghi789
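
The timestamped lines above suggest a standard logging.basicConfig setup. A minimal sketch, assuming the --verbose flag simply lowers the log level to DEBUG:

import logging

def configure_logging(verbose):
    logging.basicConfig(
        level=logging.DEBUG if verbose else logging.INFO,
        format="%(asctime)s - %(levelname)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )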

Error Handling

Common Error Scenarios

  1. Dataset Name Conflict (409 Error)

    ❌ ERROR: Dataset name already exists. Please use a different name or delete the existing dataset first.
    

  2. Authentication Failure

    ❌ ERROR: Server error 401: Unauthorized
    

  3. File Not Found

    ❌ ERROR: Folder path does not exist: /invalid/path
    

  4. Upload Failures

    ⚠️ WARNING: 3 files failed to upload
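
The retry logic mentioned under Key Features is not spelled out in this documentation; a common pattern is exponential backoff around each upload attempt, as in this hypothetical sketch:

import asyncio
import aiohttp

async def with_retry(make_attempt, attempts=3, base_delay=1.0):
    # Retry a coroutine factory with exponential backoff: 1s, 2s, 4s, ...
    for attempt in range(attempts):
        try:
            return await make_attempt()
        except aiohttp.ClientError:
            if attempt == attempts - 1:
                raise  # surface the failure after the final attempt
            await asyncio.sleep(base_delay * 2 ** attempt)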
    

Troubleshooting

  • Check API Key: Ensure your API key is valid and has upload permissions
  • Verify Server URL: Make sure the server URL is correct and accessible
  • File Permissions: Ensure the script has read access to your files
  • Network Connectivity: Check your internet connection for large file uploads
  • Use Verbose Mode: Add --verbose flag for detailed debugging information

Best Practices

  1. File Organization: Keep related files in the same folder for easier management
  2. Naming Conventions: Use descriptive filenames and avoid special characters
  3. File Sizes: Large files may take longer to upload; consider splitting very large files
  4. Recursive Uploads: Use --recursive only when you need files from subdirectories
  5. Verbose Logging: Use --verbose for troubleshooting upload issues
  6. Backup: Always keep backups of your original files before uploading

Next Steps

After successful upload, you can:

  • Use the returned dataset_id to create transcription tasks
  • Share the dataset with team members
  • Create additional datasets using the same process
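
For example, creating a transcription task from the returned dataset ID might look like the request below. The endpoint path, header name, and payload fields are hypothetical placeholders; consult your server's API documentation for the real ones:

import requests

response = requests.post(
    "https://api.yourcompany.com/tasks",         # hypothetical endpoint
    headers={"api-key": "your-api-key-here"},    # header name is an assumption
    json={"dataset_id": "abc123-def456-ghi789",  # ID returned by the upload
          "task_type": "transcription"},         # hypothetical field
    timeout=30,
)
response.raise_for_status()
print(response.json())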

Note

The script automatically generates titles and descriptions if not provided. For better organization, consider providing custom titles and descriptions that clearly identify your datasets.