Building an AI-Powered Image Captioning App with React and Flask

In this tutorial, we'll walk through building a full-stack application that generates descriptive captions for uploaded images using AI. The application combines a React frontend with a Flask backend and leverages Salesforce's BLIP (Bootstrapped Language-Image Pretraining) model via Hugging Face's transformers library.

What We'll Build


We'll create an application that allows users to:

  1. Upload an image from their device
  2. Send the image to a Flask backend
  3. Process the image with the BLIP AI model
  4. Display the generated caption
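
Under the hood, the two sides communicate through a single REST endpoint. The request and response shapes below mirror the Flask code later in the tutorial (the caption text is only an illustrative example):

POST /caption
{"image": "<base64-encoded image bytes>"}

200 OK
{"caption": "a dog sitting on a beach"}
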
System Architecture


At a high level, the React frontend converts an uploaded image to a base64 string and sends it to the Flask backend over HTTP. The backend decodes the image, runs it through the BLIP model, and returns the generated caption as JSON for the frontend to display.

Tech Stack Overview

Frontend

  • React: For building the user interface
  • Axios: For making HTTP requests to the backend
  • Vite: For fast development and bundling

Backend

  • Flask: For creating the REST API
  • Flask-CORS: For handling cross-origin requests
  • Transformers: Hugging Face's library for using pre-trained models
  • Pillow: For image processing

AI Model

  • BLIP (Bootstrapped Language-Image Pretraining): Salesforce's model for generating image captions

Step 1: Setting Up the Backend


Let's start by creating our Flask backend, which will handle image decoding and caption generation.

First, we need to install the necessary dependencies:


pip install flask flask-cors transformers torch torchvision pillow

Next, create a file called app.py:


import logging
import base64
import io

from flask import Flask, request, jsonify
from flask_cors import CORS
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

# Initialize the Flask application
app = Flask(__name__)
# Enable Cross-Origin Resource Sharing (CORS) so the React dev server can call the API
CORS(app)

# Configure logging to display information-level logs
logging.basicConfig(level=logging.INFO)

# Configuration for the model name
MODEL_NAME = "Salesforce/blip-image-captioning-base"

# Load the BLIP model and processor once at startup (the first call downloads the weights)
captioning_model = BlipForConditionalGeneration.from_pretrained(MODEL_NAME)
image_processor = BlipProcessor.from_pretrained(MODEL_NAME)


def decode_image(base64_image):
    """Decode a base64-encoded string to a PIL image."""
    try:
        # Decode the base64 string to bytes
        image_bytes = base64.b64decode(base64_image)
        # Convert the bytes to a PIL image; convert to RGB so PNGs with alpha channels work
        return Image.open(io.BytesIO(image_bytes)).convert("RGB")
    except Exception as e:
        logging.error("Failed to decode image: %s", e)
        raise ValueError("Invalid image data")


def generate_caption(image):
    """Generate a caption for the given image using the BLIP model."""
    try:
        # Process the image and prepare it for the model
        model_inputs = image_processor(image, return_tensors="pt")
        # Generate a caption using the model
        model_output = captioning_model.generate(**model_inputs)
        # Decode the model output to a human-readable string
        return image_processor.decode(model_output[0], skip_special_tokens=True)
    except Exception as e:
        logging.error("Failed to generate caption: %s", e)
        raise RuntimeError("Caption generation failed")


@app.route('/caption', methods=['POST'])
def caption_image():
    """Endpoint to generate a caption for a given image."""
    try:
        # Retrieve JSON data from the request; silent=True returns None instead of raising
        request_data = request.get_json(silent=True) or {}
        # Extract the base64-encoded image data
        base64_image = request_data.get("image", "")
        if not base64_image:
            return jsonify({"error": "No image data provided"}), 400

        # Decode the image and generate a caption
        image = decode_image(base64_image)
        generated_caption = generate_caption(image)
        # Return the generated caption as a JSON response
        return jsonify({"caption": generated_caption})

    except ValueError as ve:
        # Handle invalid image data
        return jsonify({"error": str(ve)}), 400
    except RuntimeError as re:
        # Handle caption generation failure
        return jsonify({"error": str(re)}), 500
    except Exception as error:
        # Handle unexpected errors
        logging.error("Unexpected error: %s", error)
        return jsonify({"error": "An unexpected error occurred"}), 500


if __name__ == '__main__':
    # Run the Flask application in debug mode (development only)
    app.run(debug=True)

This backend performs three main functions:

  1. Decodes base64-encoded image data received from the frontend
  2. Processes the image with the BLIP model
  3. Returns the generated caption as a JSON response
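
Before building the frontend, it's worth sanity-checking the endpoint directly. Here's a minimal test script (a sketch that assumes the requests package is installed, the server is running on Flask's default http://localhost:5000, and a test.jpg image sits in the working directory):


import base64
import requests

# Read a local image and base64-encode it, mirroring what the frontend will send
with open("test.jpg", "rb") as f:
    encoded_image = base64.b64encode(f.read()).decode("utf-8")

# POST the payload to the /caption endpoint and print the JSON response
response = requests.post("http://localhost:5000/caption", json={"image": encoded_image})
print(response.status_code, response.json())
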
Step 2: Creating the React Frontend


Now, let's build our React frontend with Vite. First, set up a new React project:


npm create vite@latest frontend -- --template react
cd frontend
npm install
npm install axios

Now, let's create our main App component in src/App.jsx:


import React, { useState } from "react";
import axios from "axios";

/**
 * App component for the Image Captioning application.
 * Allows users to upload an image and generate a caption using the backend service.
 */
function App() {
  // State to store the selected image as a base64 data URL
  const [selectedImage, setSelectedImage] = useState(null);
  // State to store the generated caption for the image
  const [generatedCaption, setGeneratedCaption] = useState("");
  // State to store any error messages
  const [errorMessage, setErrorMessage] = useState("");

  // Styles object to manage inline styles for the component
  const styles = {
    container: { padding: "20px", maxWidth: "600px", margin: "0 auto" },
    imagePreview: { width: "100%", maxHeight: "300px" },
    button: { padding: "10px", marginTop: "20px", cursor: "pointer" },
    errorText: { marginTop: "20px", color: "red" },
    captionText: { marginTop: "20px" }
  };

  /**
   * Handles the image upload event.
   * Reads the uploaded file and converts it to a base64 data URL.
   * @param {Object} event - The file input change event.
   */
  const handleImageUpload = (event) => {
    const [uploadedFile] = event.target.files; // Destructure to get the first file
    if (uploadedFile) {
      const fileReader = new FileReader();
      // Set the selected image state when file reading is complete
      fileReader.onloadend = () => setSelectedImage(fileReader.result);
      // Set an error message if file reading fails
      fileReader.onerror = () => setErrorMessage("Failed to read file.");
      // Read the file as a data URL (base64 string)
      fileReader.readAsDataURL(uploadedFile);
    }
  };

  /**
   * Sends the selected image to the backend to generate a caption.
   * Updates the generated caption or error message based on the response.
   */
  const handleGenerateCaption = async () => {
    // Guard against clicking the button before an image is selected
    if (!selectedImage) {
      setErrorMessage("Please select an image first.");
      return;
    }
    try {
      setErrorMessage(""); // Clear any previous error messages
      setGeneratedCaption("Generating caption..."); // Indicate caption generation is in progress

      // Strip the "data:image/...;base64," prefix to get the raw base64 payload
      const base64ImageData = selectedImage.split(",")[1];
      // Send a POST request to the backend with the image data
      const response = await axios.post("http://localhost:5000/caption", {
        image: base64ImageData
      });

      // Update the generated caption with the response or a default message
      setGeneratedCaption(response.data?.caption || "No caption generated.");
    } catch (err) {
      // Clear the progress text and show an error message if the request fails
      setGeneratedCaption("");
      setErrorMessage("Failed to generate caption. Please try again.");
    }
  };

  return (
    <div style={styles.container}>
      <h1>Image Captioning App</h1>
      {/* File input for uploading images */}
      <input type="file" accept="image/*" onChange={handleImageUpload} />
      {/* Display the selected image if available */}
      {selectedImage && (
        <div style={{ marginTop: "20px" }}>
          <img src={selectedImage} alt="Preview" style={styles.imagePreview} />
        </div>
      )}
      {/* Button to trigger caption generation */}
      <button onClick={handleGenerateCaption} style={styles.button}>
        Generate Caption
      </button>
      {/* Display the generated caption if available */}
      {generatedCaption && <p style={styles.captionText}>Caption: {generatedCaption}</p>}
      {/* Display an error message if available */}
      {errorMessage && <p style={styles.errorText}>{errorMessage}</p>}
    </div>
  );
}

export default App;

This frontend provides:

  1. An input for uploading images
  2. A preview of the selected image
  3. A button to trigger caption generation
  4. Display areas for the generated caption and any error messages

How It Works: The Data Flow


Here's how data moves through the application, end to end:

  1. The user selects an image, and FileReader converts it to a base64 data URL.
  2. The frontend strips the data-URL prefix and sends the raw base64 payload in a POST request to /caption.
  3. Flask decodes the base64 string back into bytes and opens it as a PIL image.
  4. The BLIP processor turns the image into tensors, and the model generates caption tokens.
  5. The tokens are decoded into a string and returned as a JSON response.
  6. React renders the caption (or an error message) to the user.

Understanding the BLIP Model


The BLIP (Bootstrapped Language-Image Pretraining) model from Salesforce is a powerful vision-language model that can perform various tasks including image captioning.

Key Features of BLIP


  1. Multimodal Learning: BLIP understands both images and text, allowing it to generate coherent captions that describe the content of images.


  2. Bootstrapped Learning: It uses a bootstrapped approach that helps clean noisy image-text pairs from the web, resulting in better performance.


  3. Versatility: Beyond image captioning, BLIP can also perform visual question answering, image-text retrieval, and more.

BLIP was introduced in the paper "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation" by Li et al. (2022) [1].
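
If you want to experiment with BLIP outside the web app, you can drive it directly from a few lines of Python. This is a standalone sketch (it assumes a local test.jpg; the optional text argument demonstrates BLIP's conditional captioning, where generation continues from a prompt):


from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Load the image and hand it to the processor together with an optional text prompt
image = Image.open("test.jpg").convert("RGB")
inputs = processor(image, "a photograph of", return_tensors="pt")

# Generate caption tokens and decode them into a readable string
output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0], skip_special_tokens=True))
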

Error Handling and Optimization


Our application includes several error-handling measures:


  1. Frontend Error Handling:
    • Checks for valid image uploads
    • Displays user-friendly error messages
    • Shows loading states during caption generation

  2. Backend Error Handling:
    • Validates input data
    • Catches and logs exceptions
    • Returns appropriate HTTP status codes

Potential Enhancements


Here are some ways to extend this application:


  1. Multiple Caption Generation: Generate multiple captions with different decoding parameters (see the sketch after this list).


  2. User Feedback Loop: Allow users to rate captions and use this feedback to fine-tune the model.


  3. Style Transfer: Add image filters or style transfer options before captioning.


  4. Progressive Web App (PWA): Convert to a PWA for offline capabilities.


  5. Advanced UI: Implement drag-and-drop functionality and animations.
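
For the first enhancement, the transformers generate() API already supports sampling several candidates. Here's a sketch reusing the names from our backend code (the decoding parameters are illustrative, not tuned):


# Sample three candidate captions instead of one greedy caption
model_inputs = image_processor(image, return_tensors="pt")
outputs = captioning_model.generate(
    **model_inputs,
    do_sample=True,          # sample tokens instead of greedy decoding
    top_p=0.9,               # nucleus sampling trims unlikely tokens
    num_return_sequences=3,  # return three candidate captions
    max_new_tokens=30,
)
captions = [image_processor.decode(o, skip_special_tokens=True) for o in outputs]
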
Performance Considerations


When working with ML models like BLIP, consider the following:


  1. Model Size: The BLIP model is large (several hundred MB). Consider loading strategies or serving options to optimize initial load time.


  2. Caching: Implement caching for repeated requests with the same images (a minimal sketch follows this list).


  3. Batching: If supporting multiple users, implement request batching to increase throughput.
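
For the caching idea, a minimal in-memory sketch keyed on a hash of the raw image bytes could look like this, reusing generate_caption and the imports from app.py (illustration only; a production deployment would more likely use Redis or another store with eviction):


import hashlib

caption_cache = {}

def caption_with_cache(image_bytes):
    # Identical uploads hash to the same key, so the model runs once per unique image
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in caption_cache:
        image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
        caption_cache[key] = generate_caption(image)
    return caption_cache[key]
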
Conclusion


In this tutorial, we've built a complete image captioning application using React, Flask, and the BLIP model. This project demonstrates how to:

  1. Set up a Flask backend with a machine learning model
  2. Create a React frontend for image upload and display
  3. Implement communication between frontend and backend
  4. Process and transform data for AI model consumption

The combination of modern web technologies with powerful AI models opens up endless possibilities for creative applications. The techniques shown here can be extended to other vision-language tasks like visual question answering, image generation, and more.

Resources and References



  1. Li, J., Li, D., Xiong, C., & Hoi, S. (2022). BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In ICML 2022.
