Textify.img

Overview

Serverless alt-text generation for better web accessibility using AI, built mostly on Amazon Web Services. It started as a proof of concept, and I ended up building it out :). The frontend is built with React; the backend is entirely serverless and secured using AWS Cognito. The resulting caption is returned to the user through a fully decoupled, secure, and scalable architecture.

Demo

Here is a quick video demo: https://youtu.be/_dbX_a1pUL4

Try it out! https://textify.alialjaffer.com

Disclaimer: Cold start of the Machine Learning model could lead to some extra waiting time while the image processes...

GitHub repository: aliAljaffer/alt-text-generator

The Problem

I wanted to build an easy way to caption images, mainly for use in alt text during web development projects or social media posting (on platforms that support alt text inclusion).

The main concern to me was the cost of building this project. However, knowing that most of the services I decided on were included in the AWS free tier encouraged me to actually go for it. Also, can't forget the peace of mind from setting up a budget limit.

Approach

I began whiteboarding an architecture that I thought would work. In practice, the main services I included did end up being used, but I used many more services than planned. I started with building the frontend using React, a JavaScript framework I'm familiar with. While I realize the website is simple and doesn't require much reactivity, I just wanted the ease-of-use and the ability to separate the app into components for reusability.

Once that was done, I set up two S3 buckets: textify-website for hosting the website files, and textify-img for hosting user uploads. In textify-img, I set up a bucket-wide Lifecycle Rule to expire objects after 24 hours, which is the minimum expiration period available. I did not plan on storing user uploads, so this rule handles the cleanup automatically. Next, I created a GitHub Action that fires on push events to the website repository. It runs a 4-step job:

builder: Builds the React code into a simple HTML, CSS, JS structure that can be hosted statically anywhere. The results are saved as a job artifact for later use in step 3.
remove-old: Removes the previous website files in textify-website.
upload-new: Uploads the artifacts generated by builder onto textify-website.
invalidate_cache: Invalidates the CloudFront CDN cache so that the newly-built version is served to the user.

And thanks to this process, development and deployment are automated and taken care of.
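The 24-hour expiry on textify-img described above can also be expressed in code. The following is a minimal sketch using boto3, not the configuration actually used for this project (the rule may well have been created in the console). The rule ID is a hypothetical name, and the client is passed in so the function can be exercised without AWS credentials.

```python
def build_expiry_rule(days: int = 1) -> dict:
    """Build a bucket-wide lifecycle configuration that expires objects."""
    if days < 1:
        raise ValueError("S3 lifecycle expiration is set in whole days (minimum 1)")
    return {
        "Rules": [
            {
                "ID": "expire-user-uploads",  # hypothetical rule name
                "Filter": {"Prefix": ""},     # empty prefix matches every object
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_expiry_rule(s3_client, bucket: str = "textify-img", days: int = 1) -> None:
    """Attach the rule to the bucket; s3_client is a boto3 S3 client."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=build_expiry_rule(days),
    )
```

With credentials configured, this would be called as apply_expiry_rule(boto3.client("s3")).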

Next, I created two Lambda functions. The first receives an image file (MIME type image/*) from the frontend, uploads it to textify-img/uploads, and passes the object key to the second Lambda, which sends the image to SageMaker AI for processing. The model used for captioning is Salesforce/blip-image-captioning-large running on an ml.m5.xlarge instance.
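To make the second Lambda's job concrete, here is a rough sketch of fetching the uploaded object and sending it to a SageMaker endpoint. Several things here are assumptions rather than details from the project: the shape of object_info, the endpoint name passed in, and the output format (a list of {"generated_text": ...} entries, as image-to-text models commonly return). Clients are injected so the logic can be tested with stand-ins.

```python
import json


def caption_object(s3_client, runtime_client, object_info: dict, endpoint_name: str) -> str:
    """Fetch an uploaded image from S3 and return the caption from SageMaker."""
    # Assumed payload shape: {"bucket": "textify-img", "key": "uploads/..."}
    obj = s3_client.get_object(Bucket=object_info["bucket"], Key=object_info["key"])
    image_bytes = obj["Body"].read()

    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=obj["ContentType"],  # e.g. "image/png", as stored on upload
        Body=image_bytes,
    )
    # response["Body"] is a StreamingBody, not a str:
    # read the bytes, decode them, then parse the JSON.
    predictions = json.loads(response["Body"].read().decode("utf-8"))
    # Assumed output shape: [{"generated_text": "a caption"}]
    return predictions[0]["generated_text"]
```

In a real handler, s3_client would be boto3.client("s3") and runtime_client would be boto3.client("sagemaker-runtime").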

To handle uploads, I set up an API Gateway with a /upload POST route. This is what triggers the first Lambda to receive the user image.
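As an illustration of what the first Lambda sees when /upload is called, the sketch below pulls the image out of an API Gateway proxy event. It assumes the frontend sends the raw image as the request body with an image/* Content-Type header; the actual request format is not stated here, and the key naming scheme is hypothetical.

```python
import base64
import mimetypes
import uuid
from typing import Tuple


def extract_upload(event: dict) -> Tuple[bytes, str, str]:
    """Return (image_bytes, content_type, object_key) from a proxy event."""
    # Header casing differs between API Gateway flavors, so normalize it.
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    content_type = headers.get("content-type", "").split(";")[0].strip()
    if not content_type.startswith("image/"):
        raise ValueError(f"unsupported content type: {content_type!r}")

    body = event.get("body") or ""
    # API Gateway base64-encodes binary bodies and sets isBase64Encoded.
    if event.get("isBase64Encoded"):
        data = base64.b64decode(body)
    else:
        data = body.encode("utf-8")

    # Hypothetical naming scheme: random key under the uploads/ prefix.
    extension = mimetypes.guess_extension(content_type) or ""
    key = f"uploads/{uuid.uuid4().hex}{extension}"
    return data, content_type, key
```

The returned values map directly onto an S3 put_object call (Bucket="textify-img", Key=key, Body=data, ContentType=content_type).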

CloudFront was used as a CDN, with textify-website S3 bucket as the origin. Finally, I added DNS A and AAAA records to the subdomain textify.alialjaffer.com as ALIAS to the CloudFront distribution.

Security

To keep the image upload feature from being abused, I set up authentication using Cognito as a JWT authorizer, and used oidc-client-ts for session management and the authorization flow on the frontend. Moreover, CORS was enabled and configured for the API Gateway routes.
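Once the JWT authorizer has accepted a token, API Gateway forwards the validated claims to the Lambda. The sketch below reads them, assuming an HTTP API with the version 2.0 payload format, where claims appear under requestContext.authorizer.jwt.claims. Whether the project's Lambda inspects claims at all is not stated, so treat this as an optional extra check rather than a description of the actual code.

```python
import json
from typing import Optional


def caller_id(event: dict) -> Optional[str]:
    """Return the Cognito user's `sub` claim, or None if it is absent."""
    claims = (
        event.get("requestContext", {})
        .get("authorizer", {})
        .get("jwt", {})
        .get("claims", {})
    )
    return claims.get("sub")


def require_caller(event: dict) -> Optional[dict]:
    """Return a 401 response if no caller is present, else None."""
    if caller_id(event) is None:
        return {
            "statusCode": 401,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"error": "unauthenticated"}),
        }
    return None
```

The authorizer already rejects requests without a valid token before the Lambda runs, so the check here only guards against a route being deployed without the authorizer attached.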

Size limits enforced on both the frontend and the Lambda backend should prevent malicious actors from uploading massive, disruptive files into the object storage.
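A backend size check of this kind might look like the sketch below. The 5 MiB cap is a hypothetical value chosen for illustration; the actual limit used in the project is not stated.

```python
import json
from typing import Optional

# Hypothetical cap for illustration; the project's real limit is not stated.
MAX_UPLOAD_BYTES = 5 * 1024 * 1024


def check_size(data: bytes, limit: int = MAX_UPLOAD_BYTES) -> Optional[dict]:
    """Return a 413 response if the upload is too large, else None."""
    if len(data) > limit:
        return {
            "statusCode": 413,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(
                {"error": f"image exceeds the {limit} byte limit", "size": len(data)}
            ),
        }
    return None
```

The handler would call check_size on the decoded body before writing anything to S3 and return the response as-is when it is not None. Checking on the backend matters because a frontend-only limit can be bypassed by calling the API directly.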

Architectural Design

Figure: System design for the Image Captioning project, showing the services used.

Key Outcomes

Scalable System: With a serverless backbone, this solution scales well for the most part:
- API Gateway supports burst rates and throttling.
- Lambda scales with concurrent requests.
- AWS support can increase quotas as demand grows.
- S3 and CloudFront offer virtually unlimited throughput.
Great Developer Experience (DX): As the sole developer, automation saves me a lot of the headache of doing repetitive tasks, and ensures fewer human errors are made!
Improved Web Accessibility at Scale: By automating alt-text generation, this system promotes accessible design for websites and social media content — a benefit that becomes more impactful the more it's used.
Cost Efficiency for Early-Stage Projects: Leveraging AWS free tier services (Lambda, S3, API Gateway, Cognito) makes this a cost-effective solution, especially valuable for prototypes, personal tools, and low-traffic use cases.
Security and Abuse Mitigation: Integration with Cognito for user auth and API protection ensures only authenticated users can generate captions, preventing misuse of compute-intensive resources.

Challenges

Asynchronous Execution: For the Lambdas, I envisioned a workflow for the image like this: upload-to-s3 -> send-to-sagemaker, but I was receiving only the response from the first Lambda. I had two options: either use Step Functions, or call the second Lambda from the first and wait for its response. I chose the latter, which ended up looking like: upload-to-s3 (uploads, then triggers the next step and waits) -> send-to-sagemaker (awaits the SageMaker response) -> upload-to-s3 (returns the response from SageMaker).

import json
import boto3
lambda_client = boto3.client('lambda')

response = lambda_client.invoke(
    FunctionName='send-to-sagemaker',
    # 'RequestResponse' invokes synchronously: the call blocks until
    # send-to-sagemaker returns. 'Event' would be the asynchronous type.
    InvocationType='RequestResponse',
    Payload=json.dumps(object_info)
)
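A synchronous invoke returns the second Lambda's result as a streamed payload, and a failure inside that function does not raise on the caller's side. It is reported through a FunctionError field instead. The wrapper below is a sketch of handling both; the function name invoke_and_wait and the error handling policy are assumptions, not the project's actual code.

```python
import json


def invoke_and_wait(lambda_client, object_info: dict):
    """Invoke send-to-sagemaker synchronously and return its parsed result."""
    response = lambda_client.invoke(
        FunctionName="send-to-sagemaker",
        InvocationType="RequestResponse",  # block until the function returns
        Payload=json.dumps(object_info),
    )
    # Payload is a StreamingBody: read it once, then parse the JSON.
    payload = json.loads(response["Payload"].read())
    # FunctionError is only present when the invoked function raised.
    if "FunctionError" in response:
        message = payload.get("errorMessage", "unknown error")
        raise RuntimeError(f"send-to-sagemaker failed: {message}")
    return payload
```

Raising here lets the first Lambda turn the failure into a proper error response instead of passing an error document back to the frontend as if it were a caption.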

Mobile sign-in issues: The sign-in callback function provided by oidc-client-ts was not working as intended on mobile. Resolved by redirecting directly using window.location.href after building the OIDC request.
StreamingBody decoding: This confused me a little at first. Handled in Lambda by decoding and parsing the payload from SageMaker.
CORS and Auth coordination: Ensured consistent headers and token handling between CloudFront -> API Gateway -> Lambda.
Cold Starts: Due to the ML model being hosted on a serverless endpoint, cold starts can take some time. Nothing can be done about this, however, except provisioning a dedicated instance, but then it wouldn't be serverless... 😜

Conclusion

This was an overall fun (and slightly frustrating at times) project to build. It came at a time when I was craving a break from studying by watching theory lectures, and it actually got me to take a hands-on approach to AWS.