# File Upload Script Documentation
This script allows you to upload a folder containing files to a server using SAS/Signed URLs. It supports both Azure and Google Cloud Storage providers and includes features like concurrent uploads, checksum verification, and recursive folder processing.
## Prerequisites

- Download the script from the Karya Language Server GitHub repository, at the path `/scripts/upload_script.py`
- Install the script's dependencies
- Ensure any other missing dependencies are installed on your system
## How it Works

The upload process follows these steps (a hedged code sketch follows this list):

- **File Scanning**: The script scans your specified folder for files
- **Checksum Calculation**: An MD5 checksum is calculated for each file
- **MIME Type Detection**: File types are detected automatically
- **URL Generation**: The server generates a SAS/Signed URL for each file
- **Concurrent Upload**: Files are uploaded to cloud storage (at most 5 concurrent uploads)
- **Completion**: The upload process is finalized and a dataset ID is returned
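For a concrete picture of the checksum, MIME-detection, and concurrent-upload steps, here is a minimal Python sketch. It is an illustration under stated assumptions, not the script's actual code: the helper names, the `requests` dependency, and the direct `PUT` to each signed URL are assumptions of mine; the real implementation lives at `/scripts/upload_script.py`.

```python
import hashlib
import mimetypes
from concurrent.futures import ThreadPoolExecutor

import requests  # assumed HTTP client; check the repository's dependency list

MAX_CONCURRENT_UPLOADS = 5  # matches the documented limit


def md5_checksum(path: str) -> str:
    """Compute the file's MD5 checksum without loading it fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def detect_mime(path: str) -> str:
    """Guess the MIME type from the filename."""
    mime, _ = mimetypes.guess_type(path)
    return mime or "application/octet-stream"


def upload_one(path: str, signed_url: str) -> None:
    """PUT the file body to its SAS/Signed URL (the HTTP verb is an assumption).

    Note: real providers may require extra headers (e.g. a blob-type header
    for Azure); those are omitted here.
    """
    with open(path, "rb") as f:
        resp = requests.put(signed_url, data=f,
                            headers={"Content-Type": detect_mime(path)})
    resp.raise_for_status()


def upload_all(files_to_urls: dict) -> None:
    """Upload at most five files concurrently, as the script does."""
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_UPLOADS) as pool:
        futures = [pool.submit(upload_one, p, u) for p, u in files_to_urls.items()]
        for future in futures:
            future.result()  # re-raise any upload error
```

Capping the thread pool at five workers keeps memory use and open connections bounded while still overlapping network latency across uploads.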
## Key Features

- **Concurrent Uploads**: Up to 5 files uploaded simultaneously for faster processing
- **Checksum Verification**: MD5 checksums ensure file integrity
- **Recursive Processing**: Option to process files in nested folders
- **Multiple Providers**: Support for both Azure and Google Cloud Storage
- **Error Handling**: Comprehensive error reporting and retry logic (a sketch follows this list)
- **Progress Logging**: Detailed logging with an optional verbose mode
- **File Count Limit**: For Sarvam ASR, only 1–20 audio files can be uploaded at a time to prevent failures
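The error-handling feature mentions retry logic, but this page does not document the script's actual policy (attempt count or delays). As a hedged sketch, retrying a single upload with exponential backoff could look like this; all the numbers are illustrative:

```python
import time

import requests  # assumed HTTP client


def upload_with_retries(path: str, signed_url: str, attempts: int = 3) -> None:
    """Retry a single upload with exponential backoff (numbers are illustrative)."""
    for attempt in range(1, attempts + 1):
        try:
            with open(path, "rb") as f:
                requests.put(signed_url, data=f).raise_for_status()
            return  # success
        except requests.RequestException:
            if attempt == attempts:
                raise  # out of retries; surface the error to the caller
            time.sleep(2 ** attempt)  # back off: 2s, 4s, ...
```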
## Command Line Arguments

### Required Arguments
| Argument | Description |
|---|---|
| `folder_path` | Path to the folder containing files to upload (for Sarvam ASR, 1–20 audio files per batch) |
| `--api-key` | API key for server authentication |
| `--user-email` | User email address |
| `--server-url` | Base server URL (e.g., `https://dev-server.com`) |
### Optional Arguments

| Argument | Default | Description |
|---|---|---|
| `--provider` | `Azure` | Storage provider (`Azure` or `Google`) |
| `--project-name` | `Individual` | Project name for the upload |
| `--title` | Auto-generated | Custom title for the upload |
| `--description` | Auto-generated | Custom description for the upload |
| `--recursive` | `False` | Process files recursively from nested folders |
| `--verbose`, `-v` | `False` | Enable verbose logging (debug level) |
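Taken together, the two tables imply a command-line interface along the lines of the `argparse` sketch below. This is a reconstruction for illustration only; the actual parser is defined in `upload_script.py`.

```python
import argparse

parser = argparse.ArgumentParser(description="Upload a folder via SAS/Signed URLs")
parser.add_argument("folder_path", help="Folder containing the files to upload")
parser.add_argument("--api-key", required=True, help="API key for server authentication")
parser.add_argument("--user-email", required=True, help="User email address")
parser.add_argument("--server-url", required=True, help="Base server URL")
parser.add_argument("--provider", default="Azure", choices=["Azure", "Google"],
                    help="Storage provider")
parser.add_argument("--project-name", default="Individual", help="Project name")
parser.add_argument("--title", help="Custom title (auto-generated if omitted)")
parser.add_argument("--description", help="Custom description (auto-generated if omitted)")
parser.add_argument("--recursive", action="store_true",
                    help="Also process files in nested folders")
parser.add_argument("--verbose", "-v", action="store_true",
                    help="Enable debug-level logging")
args = parser.parse_args()
```

`--recursive` and `--verbose` are modelled as boolean flags (`store_true`), which matches their `False` defaults in the table above.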
## Usage Examples

### Basic Upload (Top-level files only)
```bash
python upload_script.py /path/to/files \
  --api-key YOUR_KEY \
  --user-email user@example.com \
  --server-url https://dev-server.com
```
### Recursive Upload (Include nested folders)
```bash
python upload_script.py /path/to/files \
  --api-key YOUR_KEY \
  --user-email user@example.com \
  --server-url https://dev-server.com \
  --recursive
```
### Google Cloud Storage Upload
```bash
python upload_script.py /path/to/files \
  --api-key YOUR_KEY \
  --user-email user@example.com \
  --server-url https://dev-server.com \
  --provider Google
```
### Complete Example with All Options
```bash
python upload_script.py ./my-dataset/ \
  --api-key "your-api-key-here" \
  --user-email "user@example.com" \
  --server-url "https://api.yourcompany.com" \
  --provider Google \
  --project-name "ML-Project" \
  --title "Audio Dataset v2" \
  --description "Updated audio files for ML training" \
  --recursive \
  --verbose
```
## Output and Results

### Success Output

When the upload completes successfully, the script prints a confirmation along with the new dataset ID (see the final line of the log below).
### Progress Logging

The script provides detailed logging:
```
2024-01-15 10:30:00 - INFO - === File Upload Configuration ===
2024-01-15 10:30:00 - INFO - Folder Path: ./my-dataset/
2024-01-15 10:30:00 - INFO - Server URL: https://api.yourcompany.com
2024-01-15 10:30:00 - INFO - Provider: Google
2024-01-15 10:30:00 - INFO - User Email: user@example.com
2024-01-15 10:30:00 - INFO - Project Name: ML-Project
2024-01-15 10:30:00 - INFO - Recursive: Yes
2024-01-15 10:30:00 - INFO - ===================================
2024-01-15 10:30:01 - INFO - Step 1: Scanning files...
2024-01-15 10:30:01 - INFO - Found 25 files
2024-01-15 10:30:02 - INFO - Step 2: Creating JSON payload...
2024-01-15 10:30:02 - INFO - Step 3: Getting upload URLs...
2024-01-15 10:30:03 - INFO - Received task ID: task_12345
2024-01-15 10:30:03 - INFO - Step 4: Uploading files...
2024-01-15 10:30:05 - INFO - ✓ Successfully uploaded file1.wav (1024000 bytes)
2024-01-15 10:30:06 - INFO - ✓ Successfully uploaded file2.wav (2048000 bytes)
2024-01-15 10:30:10 - INFO - Upload Results: 25/25 files uploaded successfully
2024-01-15 10:30:10 - INFO - Step 5: Completing upload...
2024-01-15 10:30:11 - INFO - ✓ Upload process completed! Dataset ID: abc123-def456-ghi789
```
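The timestamp/level/message layout above matches Python's standard `logging` module. Assuming the script uses it (an assumption about its internals), the equivalent configuration looks like the snippet below, which is handy if you want your own tooling to emit matching lines:

```python
import logging

logging.basicConfig(
    level=logging.INFO,  # --verbose would presumably switch this to DEBUG
    format="%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logging.info("Step 1: Scanning files...")
```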
## Error Handling

### Common Error Scenarios

- **Dataset Name Conflict (409 Error)**: a dataset with the same title already exists; retry with a different `--title`
- **Authentication Failure**: the API key is invalid or lacks upload permissions
- **File Not Found**: the specified `folder_path` does not exist or is not readable
- **Upload Failures**: individual files can fail to upload, e.g. due to network interruptions
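As a rough guide to how these scenarios surface over HTTP, here is a hedged sketch of inspecting the server's response when requesting upload URLs. The endpoint path, auth header, and payload shape are hypothetical placeholders; only the 409 status code is taken from the list above.

```python
import requests

resp = requests.post(
    "https://dev-server.com/upload-urls",  # hypothetical endpoint
    headers={"X-API-Key": "YOUR_KEY"},     # hypothetical auth header
    json={"files": []},                    # real payload shape not documented here
)
if resp.status_code == 409:
    print("Dataset name conflict: retry with a different --title")
elif resp.status_code in (401, 403):
    print("Authentication failure: check --api-key and its permissions")
else:
    resp.raise_for_status()  # raise on any other HTTP error
```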
### Troubleshooting

- **Check API Key**: Ensure your API key is valid and has upload permissions
- **Verify Server URL**: Make sure the server URL is correct and accessible
- **File Permissions**: Ensure the script has read access to your files
- **Network Connectivity**: Check your internet connection, especially for large file uploads
- **Use Verbose Mode**: Add the `--verbose` flag for detailed debugging information
## Best Practices

- **File Organization**: Keep related files in the same folder for easier management
- **Naming Conventions**: Use descriptive filenames and avoid special characters
- **File Sizes**: Large files take longer to upload; consider splitting very large files
- **Recursive Uploads**: Use `--recursive` only when you need files from subdirectories
- **Verbose Logging**: Use `--verbose` for troubleshooting upload issues
- **Backup**: Always keep backups of your original files before uploading
## Next Steps

After a successful upload, you can:

- Use the returned `dataset_id` to create transcription tasks (a hypothetical sketch follows this list)
- Share the dataset with team members
- Create additional datasets using the same process
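As a purely hypothetical illustration of the first item, creating a transcription task from the returned `dataset_id` might look like the call below. The endpoint, auth header, and payload are placeholders, not the server's documented API; consult the server's API reference for the real interface.

```python
import requests

dataset_id = "abc123-def456-ghi789"  # printed at the end of a successful upload

resp = requests.post(
    "https://dev-server.com/tasks",     # hypothetical endpoint
    headers={"X-API-Key": "YOUR_KEY"},  # hypothetical auth header
    json={"dataset_id": dataset_id, "type": "transcription"},  # hypothetical payload
)
resp.raise_for_status()
print(resp.json())
```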
> **Note**: The script automatically generates titles and descriptions if they are not provided. For better organization, consider supplying custom titles and descriptions that clearly identify your datasets.