Feature #7241
Optimize PDF Processing Pipeline and Large File Uploads to AWS S3
Description
Currently, a single API endpoint is responsible for handling multiple operations, including:
- Downloading a PDF file temporarily from AWS S3
- Reading and extracting data from the PDF using a Python script
- Processing extracted data (e.g., BOM extraction, matching, persistence)
- Handling additional unrelated events within the same request lifecycle
This design has led to high API response times, increased memory usage, and poor scalability.
Additionally, large file uploads to AWS S3 are not optimized, resulting in slow uploads and potential failures under load.
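A minimal sketch of one way to take the PDF work off the request path, assuming a Celery worker with a Redis broker (neither is confirmed by this ticket); `extract_bom.py` is a placeholder name for the existing extraction script:

```python
# Hypothetical decoupling sketch: the API enqueues a task and returns;
# the worker downloads the PDF from S3 and runs the extraction step.
import subprocess
import tempfile

import boto3
from celery import Celery

# Broker URL and app name are assumptions for illustration only.
app = Celery("pdf_pipeline", broker="redis://localhost:6379/0")
s3 = boto3.client("s3")

@app.task(bind=True, max_retries=3)
def process_pdf(self, bucket: str, key: str) -> None:
    """Download the PDF to a temp file and run extraction outside the API request."""
    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        s3.download_fileobj(bucket, key, tmp)
        tmp.flush()
        # "extract_bom.py" stands in for the existing Python extraction script;
        # BOM matching and persistence would follow in the same worker.
        subprocess.run(["python", "extract_bom.py", tmp.name], check=True)
```

With this shape, the endpoint would call `process_pdf.delay(bucket, key)` and return immediately instead of blocking on the full pipeline.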
Updated by Kalyan Ravula about 16 hours ago
- Status changed from New to In Progress
- % Done changed from 0 to 90
Worked on optimizing the AWS S3 file upload flow to address performance and scalability issues caused by handling large files in a synchronous API.
Reviewed the existing implementation and identified that large files were being uploaded in a blocking manner, increasing API response time and memory usage. Refactored the upload logic to use a streaming-based approach instead of loading entire files into memory. Implemented multipart upload support for large files to improve upload speed, reliability, and fault tolerance.
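A minimal sketch of the streaming and multipart approach described above, using boto3's `upload_fileobj` with a `TransferConfig`; the size thresholds, bucket, and key names are illustrative assumptions, not the values used in the actual change:

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Thresholds below are illustrative; tune per workload.
MULTIPART_CONFIG = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,   # switch to multipart above 64 MB
    multipart_chunksize=16 * 1024 * 1024,   # 16 MB parts
    max_concurrency=4,                      # upload parts in parallel
    use_threads=True,
)

s3 = boto3.client("s3")

def upload_stream(fileobj, bucket: str, key: str) -> None:
    """Stream a file-like object to S3 without buffering it fully in memory.

    boto3 reads the object in chunks and switches to a multipart upload
    automatically once the size exceeds multipart_threshold.
    """
    s3.upload_fileobj(fileobj, bucket, key, Config=MULTIPART_CONFIG)
```

Usage would look like `with open(path, "rb") as f: upload_stream(f, "my-bucket", "uploads/drawing.pdf")`, so the file is read in chunks rather than loaded whole.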
Added basic logging to capture file size and upload duration to help measure performance improvements and identify bottlenecks. Ensured proper error handling and cleanup to avoid resource leaks during failed or interrupted uploads.
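A hedged sketch of the logging and cleanup wrapper, assuming the upload is a boto3 `upload_fileobj` call over a temporary local file; the logger name and message format are placeholders:

```python
import logging
import os
import time

import boto3
from botocore.exceptions import BotoCoreError, ClientError

logger = logging.getLogger("s3_upload")
s3 = boto3.client("s3")

def upload_with_metrics(local_path: str, bucket: str, key: str) -> None:
    """Upload a local file to S3, logging size and duration, and always clean up."""
    size_mb = os.path.getsize(local_path) / (1024 * 1024)
    start = time.monotonic()
    try:
        with open(local_path, "rb") as fileobj:
            # Streams the file; multipart behaviour is governed by a
            # TransferConfig as in the previous sketch.
            s3.upload_fileobj(fileobj, bucket, key)
        logger.info("Uploaded %s (%.1f MB) in %.2fs", key, size_mb, time.monotonic() - start)
    except (BotoCoreError, ClientError):
        logger.exception("Upload of %s failed after %.2fs", key, time.monotonic() - start)
        raise
    finally:
        # Remove the temporary local file whether the upload succeeded or not,
        # so interrupted uploads do not leak disk space.
        if os.path.exists(local_path):
            os.remove(local_path)
```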
This optimization reduces API latency, improves stability under concurrent uploads, and makes the upload flow more scalable for large files.