🎯 Transform your PDF documents into an intelligent, searchable knowledge base using AI
- 🌟 Features
- 🏗️ Architecture
- 🚀 Quick Start
- 📋 Prerequisites
- 🔧 Manual Setup
- 📱 Usage Guide
- 🛠️ Troubleshooting
- 💰 Cost Optimization
- 🔒 Security
- 🤝 Contributing
| 📄 PDF Processing | 🧠 AI-Powered Search | 🔐 Secure Authentication | ☁️ Cloud Native |
|---|---|---|---|
| Upload and extract text from PDFs | Natural language queries using Gemini AI | JWT-based auth via AWS Cognito | Serverless architecture on AWS |
| Automatic text chunking | Vector similarity search with FAISS | User registration & login | Auto-scaling Lambda functions |
| Metadata extraction | Semantic understanding | Session management | Pay-per-use pricing |
- 📥 Smart Document Upload: Direct-to-S3 upload with presigned URLs (supports files up to 5GB)
- 🔍 Intelligent Search: Ask questions in natural language and get accurate answers from your documents
- 🧮 Vector Embeddings: Advanced FAISS-based similarity search for contextual understanding
- 🔄 Real-time Processing: Background document processing with status updates
- 📊 User Dashboard: Manage your document collection with an intuitive React interface
- 🛡️ Enterprise Security: AWS Cognito authentication with JWT tokens
- 📈 Scalable Architecture: Serverless design that scales from 0 to millions of requests
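The "automatic text chunking" feature can be pictured with a minimal sketch like the one below. It is an illustration only: the function name `chunk_text`, the character-based window, and the default sizes are assumptions, not the project's actual implementation.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(1200))
    for i, chunk in enumerate(chunk_text(sample)):
        print(i, len(chunk))
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, which tends to help retrieval at the cost of some duplicated storage.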
graph TB
subgraph "🌐 Frontend Layer"
A[React App<br/>📱 localhost:3000]
end
subgraph "🔐 Authentication"
B[AWS Cognito<br/>🔑 User Pool]
end
subgraph "🚪 API Gateway"
C[HTTP API Gateway<br/>🌍 REST Endpoints]
end
subgraph "⚡ Processing Layer"
D[Upload λ<br/>📤 PDF Upload]
E[Query λ<br/>🔍 AI Search]
F[Presigned URL λ<br/>🔗 S3 Links]
G[Process Upload λ<br/>⚙️ Background Processing]
end
subgraph "💾 Storage Layer"
H[S3 Bucket<br/>📁 PDF Storage]
I[DynamoDB<br/>🗄️ Metadata & Embeddings]
J[Secrets Manager<br/>🔐 API Keys]
end
subgraph "🤖 AI Services"
K[Google Gemini API<br/>🧠 Text Generation]
L[FAISS Vector DB<br/>📊 Similarity Search]
end
A --> B
A --> C
C --> D
C --> E
C --> F
C --> G
D --> H
D --> I
E --> I
E --> K
E --> L
F --> H
G --> I
G --> K
G --> L
D --> J
E --> J
G --> J
sequenceDiagram
participant U as 👤 User
participant F as 📱 Frontend
participant API as 🚪 API Gateway
participant L as ⚡ Lambda
participant S3 as 📁 S3
participant DB as 🗄️ DynamoDB
participant AI as 🤖 Gemini AI
Note over U,AI: 📄 Document Upload Flow
U->>F: Upload PDF
F->>API: Request presigned URL
API->>L: Get upload link
L->>S3: Generate presigned URL
S3-->>L: Return URL
L-->>F: Presigned URL
F->>S3: Direct upload PDF
S3->>L: Trigger processing
L->>AI: Extract & embed text
AI-->>L: Text embeddings
L->>DB: Store metadata & embeddings
Note over U,AI: 🔍 Query Flow
U->>F: Ask question
F->>API: Send query
API->>L: Process query
L->>AI: Generate query embedding
AI-->>L: Query vector
L->>DB: Similarity search
DB-->>L: Relevant chunks
L->>AI: Generate answer
AI-->>L: Final answer
L-->>F: Response
F-->>U: Display answer
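The "similarity search" step of the query flow ranks stored chunks by how close their embeddings are to the query embedding. The sketch below shows that ranking idea with plain Python cosine similarity as a stand-in for FAISS; the record shape (`text`, `embedding`) and the function names are assumptions made for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec: list[float], chunks: list[dict], k: int = 3) -> list[tuple[float, str]]:
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, c["embedding"]), c["text"]) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings; real embeddings have hundreds of dimensions.
    stored = [
        {"text": "a", "embedding": [1.0, 0.0]},
        {"text": "b", "embedding": [0.0, 1.0]},
        {"text": "c", "embedding": [0.9, 0.1]},
    ]
    print(top_k_chunks([1.0, 0.0], stored, k=2))
```

A brute-force scan like this is linear in the number of chunks; an index such as FAISS exists to make the same lookup fast at larger scale.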
The fastest way to get started is using our automated deployment script:
# 1️⃣ Clone the repository
git clone <your-repo-url>
cd cc-internship
# 2️⃣ Configure your API key
# Edit deploy.sh and replace YOUR_GEMINI_KEY_HERE with your actual Gemini API key
nano deploy.sh # Line 27: GEMINI_API_KEY="your_actual_key_here"
# 3️⃣ Make the script executable
chmod +x deploy.sh
# 4️⃣ Deploy everything (takes 5-10 minutes)
./deploy.sh
🎉 That's it! The script will:
- ✅ Create all AWS resources (IAM, S3, DynamoDB, Cognito, Lambda, API Gateway)
- ✅ Deploy Lambda functions with proper configurations
- ✅ Set up the React frontend with environment variables
- ✅ Start the development server at http://localhost:3000
After running the deployment script, you'll see a status summary:
==================================
DEPLOYMENT SUMMARY
==================================
IAM: ✅
Secrets: ✅
S3: ✅
DynamoDB: ✅
Cognito: ✅
Lambda: ✅
API Gateway: ✅
Dependencies: ✅
==================================
🎉 Frontend started at http://localhost:3000
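If the deployment runs unattended (for example in CI), the summary can be checked programmatically. The sketch below parses the block shown above and lists any component whose status is not ✅. It assumes the summary keeps this exact layout, with the component rows sitting between the second and third ruler lines; `failed_components` is a hypothetical helper, not part of `deploy.sh`.

```python
def failed_components(summary: str) -> list[str]:
    """Return component names whose status is not the ✅ marker."""
    rulers_seen = 0
    failed = []
    for line in summary.splitlines():
        stripped = line.strip()
        if stripped and set(stripped) == {"="}:
            rulers_seen += 1
            continue
        # Component rows sit between the second and third ruler lines.
        if rulers_seen != 2:
            continue
        name, sep, status = stripped.partition(":")
        if sep and not status.strip().startswith("✅"):
            failed.append(name.strip())
    return failed


if __name__ == "__main__":
    example = """\
==================================
DEPLOYMENT SUMMARY
==================================
IAM: ✅
Lambda: ❌
API Gateway: ✅
==================================
🎉 Frontend started at http://localhost:3000
"""
    print(failed_components(example))
```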
Before running the deployment script, ensure you have:
| Tool | Version | Purpose | Installation |
|---|---|---|---|
| 🐍 Python | 3.12+ | Lambda runtime | Download |
| ☁️ AWS CLI | Latest | AWS resource management | pip install awscli |
| 🔨 SAM CLI | Latest | Serverless deployment | Install Guide |
| 📦 Node.js | 16+ | React frontend | Download |
| 📊 jq | Latest | JSON parsing | apt install jq / brew install jq |
| 📁 zip | Latest | Lambda packaging | Pre-installed on most systems |
- Create an AWS account (if you don't have one)
- Configure the AWS CLI:
aws configure # Enter your Access Key ID, Secret Access Key, Region (ap-south-1), Output format (json)
- Get a Gemini API key from Google AI Studio
Run this command to verify your setup:
# Check all required tools
for tool in python3 aws sam node npm jq zip; do
if command -v $tool &> /dev/null; then
echo "✅ $tool: $(command -v $tool)"
else
echo "❌ $tool: Not found"
fi
done
# Check AWS configuration
aws sts get-caller-identity
If you prefer manual deployment or need to troubleshoot, follow these detailed steps:
📂 Step 1: Project Structure
cc-internship/
├── 📁 backend/ # Lambda functions
│ ├── 📁 upload/ # PDF upload handler
│ ├── 📁 query/ # AI query processor
│ ├── 📁 presigned-url/ # S3 URL generator
│ └── 📁 process-upload/ # Background processor
├── 📁 frontend/ # React application
│ ├── 📁 src/ # Source code
│ └── 📄 package.json # Dependencies
├── 📁 infra/ # SAM template
│ └── 📄 template.yaml # Infrastructure as code
├── 🚀 deploy.sh # Automated deployment
├── 🗑️ teardown.sh # Resource cleanup
└── 📖 REDEPLOY.md # Manual deployment guide
🔐 Step 2: IAM User Creation
# Create deployment user
aws iam create-user --user-name pai-deployment-user
# Create access key
aws iam create-access-key --user-name pai-deployment-user
# Create policy
cat > pai-deployment-policy.json << 'EOF'
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"cloudformation:*", "s3:*", "lambda:*",
"apigateway:*", "dynamodb:*", "cognito-idp:*",
"iam:GetRole", "iam:CreateRole", "iam:AttachRolePolicy",
"iam:PassRole", "secretsmanager:*", "logs:*"
],
"Resource": "*"
}
]
}
EOF
# Create and attach the policy (replace ACCOUNT with your AWS account ID)
aws iam create-policy --policy-name pai-deployment-policy --policy-document file://pai-deployment-policy.json
aws iam attach-user-policy --user-name pai-deployment-user --policy-arn arn:aws:iam::ACCOUNT:policy/pai-deployment-policy
🔒 Step 3: Secrets Management
# Store Gemini API key securely
aws secretsmanager create-secret \
--name pai-gemini-api-key \
--description "Gemini API key for PAI platform" \
--secret-string "YOUR_GEMINI_API_KEY" \
--region ap-south-1
# Verify secret creation
aws secretsmanager describe-secret --secret-id pai-gemini-api-key --region ap-south-1
📦 Step 4: S3 Bucket Setup
# Create bucket with timestamp for uniqueness
BUCKET_NAME="pai-pdf-storage-$(date +%s)"
aws s3 mb "s3://$BUCKET_NAME" --region ap-south-1
# Enable versioning
aws s3api put-bucket-versioning \
--bucket $BUCKET_NAME \
--versioning-configuration Status=Enabled
# Configure CORS for frontend uploads
aws s3api put-bucket-cors --bucket $BUCKET_NAME --cors-configuration '{
"CORSRules": [{
"AllowedHeaders": ["*"],
"AllowedMethods": ["GET", "POST", "PUT"],
"AllowedOrigins": ["*"],
"ExposeHeaders": ["ETag"],
"MaxAgeSeconds": 3000
}]
}'
🗄️ Step 5: DynamoDB Table
# Create table for embeddings and metadata
aws dynamodb create-table \
--table-name pai-embeddings-metadata \
--attribute-definitions AttributeName=doc_id,AttributeType=S \
--key-schema AttributeName=doc_id,KeyType=HASH \
--billing-mode PAY_PER_REQUEST \
--region ap-south-1
# Wait for table to become active
aws dynamodb wait table-exists --table-name pai-embeddings-metadata --region ap-south-1
🔑 Step 6: Cognito User Pool
# Create user pool
aws cognito-idp create-user-pool \
--pool-name pai-user-pool \
--policies '{"PasswordPolicy": {"MinimumLength": 8}}' \
--auto-verified-attributes email \
--alias-attributes email \
--region ap-south-1
# Create app client (save the IDs)
aws cognito-idp create-user-pool-client \
--user-pool-id YOUR_POOL_ID \
--client-name pai-client \
--no-generate-secret \
--explicit-auth-flows ALLOW_USER_PASSWORD_AUTH ALLOW_REFRESH_TOKEN_AUTH \
--region ap-south-1
# Create domain
aws cognito-idp create-user-pool-domain \
--domain "pai-auth-$(date +%s)" \
--user-pool-id YOUR_POOL_ID \
--region ap-south-1
🚀 SAM Deployment
# Navigate to infrastructure directory
cd infra
# Build SAM application
sam build
# Deploy with guided setup
sam deploy --guided
# Or use automated deployment
sam deploy \
--stack-name pai-stack \
--s3-bucket your-deployment-bucket \
--capabilities CAPABILITY_IAM \
--region ap-south-1
⚙️ Environment Configuration
# Navigate to frontend directory
cd frontend
# Create environment file
cat > .env << EOF
REACT_APP_API_URL=https://your-api-id.execute-api.ap-south-1.amazonaws.com
REACT_APP_COGNITO_USER_POOL_ID=ap-south-1_xxxxxxxxx
REACT_APP_COGNITO_USER_POOL_CLIENT_ID=xxxxxxxxxxxxxxxxxxxxxxxxxx
EOF
# Install dependencies
npm install --legacy-peer-deps
# Start development server
npm start
- 🌐 Open your browser and navigate to http://localhost:3000
- 👤 Create an account:
  - Click "Create Account"
  - Enter your email and password
  - Verify your email (check your inbox)
- 📄 Upload your first document:
  - Click "Upload PDF" or drag & drop
  - Wait for processing to complete
  - You'll see a success message when ready
- 🔍 Start querying:
  - Type your question in natural language
  - Example: "What are the main points discussed in this document?"
  - Get AI-powered answers instantly
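The same query can also be sent without the UI. The sketch below only builds the HTTP request so it can be inspected before sending. The `/query` path, the `question` field, and the use of a Cognito token in a `Bearer` header are all assumptions for illustration; check the deployed API for the real route and payload.

```python
import json
import urllib.request


def build_query_request(api_url: str, id_token: str, question: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a question and a JWT."""
    body = json.dumps({"question": question}).encode("utf-8")  # assumed payload shape
    return urllib.request.Request(
        url=f"{api_url.rstrip('/')}/query",  # assumed route
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {id_token}",
        },
    )


if __name__ == "__main__":
    req = build_query_request(
        "https://your-api-id.execute-api.ap-south-1.amazonaws.com",
        "YOUR_ID_TOKEN",
        "What are the main points discussed in this document?",
    )
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```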
| Query Type | Example Question | Expected Response |
|---|---|---|
| 📊 Summarization | "Summarize the main points of this document" | Structured summary with key highlights |
| 🔍 Fact Finding | "What is the budget mentioned for Q4?" | Specific numbers and context |
| 📈 Analysis | "What are the risks mentioned in the report?" | Risk analysis with details |
| 📋 Lists | "List all the recommendations made" | Bullet-pointed recommendations |
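Each of these query types ends up as a prompt that combines the retrieved chunks with the user's question. The sketch below shows one plausible way to assemble such a prompt. The wording of the instruction, the `max_chars` budget, and the function name `build_prompt` are assumptions, not the prompt the platform actually sends to Gemini.

```python
def build_prompt(question: str, chunks: list[str], max_chars: int = 4000) -> str:
    """Combine numbered context chunks and a question into one prompt string."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        # Stop adding context once the character budget would be exceeded.
        if used + len(entry) > max_chars:
            break
        context_parts.append(entry)
        used += len(entry)

    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    print(build_prompt("What is the budget mentioned for Q4?", ["The Q4 budget is listed in section 3."]))
```

Numbering the chunks makes it possible to ask the model to cite which passage an answer came from.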
🖥️ Dashboard Overview
┌─────────────────────────────────────────────┐
│ 🚀 Personal AI Knowledge Platform │
├─────────────────────────────────────────────┤
│ 📤 Upload PDF | 🔍 Search Documents │
├─────────────────────────────────────────────┤
│ 📁 My Documents | ⚙️ Settings │
│ ┌─────────────┐ | 👤 Profile │
│ │ Document 1 │ | 🚪 Logout │
│ │ ✅ Processed │ | │
│ └─────────────┘ | │
└─────────────────────────────────────────────┘
❌ Deployment Fails
Issue: CloudFormation deployment fails with transform errors
Solution:
# Check for SCP restrictions
aws iam simulate-principal-policy \
--policy-source-arn arn:aws:iam::ACCOUNT:user/pai-deployment-user \
--action-names cloudformation:CreateChangeSet \
--resource-arns "arn:aws:cloudformation:ap-south-1:aws:transform/Serverless-2020-10-31"
# Fall back to the deployment script
./deploy.sh # Script automatically handles SCP issues
🔧 Lambda Function Errors
Issue: Functions return "Internal Server Error"
Solution:
# List the function log groups, then inspect the relevant one in CloudWatch
aws logs describe-log-groups --log-group-name-prefix /aws/lambda/pai
# Fix common issues
./deploy.sh --fix-issues
# Update environment variables manually (this replaces the function's existing variables)
aws lambda update-function-configuration \
--function-name pai-query \
--environment '{"Variables": {
"DYNAMODB_TABLE":"pai-embeddings-metadata",
"GEMINI_SECRET_NAME":"pai-gemini-api-key"
}}'
🌐 Frontend Connection Issues
Issue: Frontend can't connect to backend
Solution:
- Check the .env file in the frontend directory
- Verify that the API Gateway URL is correct
- Check CORS configuration:
aws apigatewayv2 get-api --api-id YOUR_API_ID
- Test API directly:
curl "https://your-api.execute-api.ap-south-1.amazonaws.com/presigned-url?filename=test.pdf"
📄 PDF Upload Issues
Issue: Large files fail to upload
Causes & Solutions:
- File > 2MB via API Gateway: ✅ Uses S3 direct upload automatically
- File > 50MB: ✅ Supported via presigned URLs
- CORS errors: Check S3 bucket CORS configuration
- Timeout errors: Large files process in background
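The size rules above amount to a small routing decision. The sketch below encodes them using the 2MB API Gateway threshold from this section and the 5GB presigned-URL ceiling from the feature list. The function name and the returned labels are hypothetical, and the real frontend may decide differently.

```python
MB = 1024 * 1024
GB = 1024 * MB

API_GATEWAY_LIMIT = 2 * MB    # above this, the upload bypasses API Gateway
PRESIGNED_PUT_LIMIT = 5 * GB  # ceiling quoted for presigned-URL uploads


def choose_upload_route(size_bytes: int) -> str:
    """Pick an upload path for a file of the given size (illustrative labels)."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    if size_bytes <= API_GATEWAY_LIMIT:
        return "api"
    if size_bytes <= PRESIGNED_PUT_LIMIT:
        return "presigned-url"
    return "too-large"


if __name__ == "__main__":
    for size in (500 * 1024, 50 * MB, 6 * GB):
        print(size, choose_upload_route(size))
```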
# Run comprehensive health check
./deploy.sh --health-check
# Check individual services
aws lambda list-functions --query 'Functions[?contains(FunctionName, `pai`)]'
aws s3 ls | grep pai-pdf-storage
aws dynamodb list-tables --query 'TableNames[?contains(@, `pai`)]'
aws cognito-idp list-user-pools --max-results 20 --query 'UserPools[?contains(Name, `pai`)]'
| Issue Type | Action |
|---|---|
| 🚨 Critical Error | Check CloudWatch logs: aws logs describe-log-groups --log-group-name-prefix /aws/lambda/pai |
| 📊 Monitoring | Monitor DynamoDB and Lambda metrics in the AWS Console |
| 🔧 Configuration | Run ./deploy.sh --fix-issues for common problems |
| 📖 Documentation | Check REDEPLOY.md for detailed manual steps |
This project is designed to work within AWS Free Tier limits:
| Service | Free Tier Limit | Expected Usage |
|---|---|---|
| 🔧 Lambda | 1M requests/month | ~10K requests |
| 📦 S3 | 5GB storage | ~1GB PDFs |
| 🗄️ DynamoDB | 25GB storage | ~100MB metadata |
| 🚪 API Gateway | 1M API calls | ~10K calls |
| 🔐 Cognito | 50K MAU | ~10 users |
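To see how much headroom the expected usage leaves, the table can be turned into a quick calculation. The limits below are copied from the table; the dictionary keys and the helper name `free_tier_usage` are assumptions made for this sketch, and actual AWS limits should be confirmed against current pricing pages.

```python
# Monthly free-tier limits as listed in the table above.
FREE_TIER = {
    "lambda_requests": 1_000_000,
    "s3_storage_gb": 5,
    "dynamodb_storage_gb": 25,
    "api_gateway_calls": 1_000_000,
    "cognito_mau": 50_000,
}


def free_tier_usage(usage: dict) -> dict:
    """Return the percentage of each free-tier limit consumed by `usage`."""
    return {
        name: round(100 * usage.get(name, 0) / limit, 2)
        for name, limit in FREE_TIER.items()
    }


if __name__ == "__main__":
    expected = {
        "lambda_requests": 10_000,
        "s3_storage_gb": 1,
        "dynamodb_storage_gb": 0.1,
        "api_gateway_calls": 10_000,
        "cognito_mau": 10,
    }
    for name, pct in free_tier_usage(expected).items():
        print(f"{name}: {pct}% of free tier")
```

With the expected usage from the table, S3 storage is the closest to its limit at roughly a fifth of the allowance.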
# Enable cost monitoring
aws budgets create-budget --account-id ACCOUNT --budget '{
"BudgetName": "PAI-Platform-Budget",
"BudgetLimit": {"Amount": "5", "Unit": "USD"},
"TimeUnit": "MONTHLY",
"BudgetType": "COST"
}'
- 📄 Document Limit: Stay under 1000 documents for free tier
- 🔄 Query Frequency: Batch queries when possible
- 📁 File Size: Compress PDFs before upload
- 🗑️ Cleanup: Regularly delete unused documents
- 🔐 Authentication: AWS Cognito with JWT tokens
- 🔑 Authorization: Per-user document isolation
- 🚪 API Security: CORS configured, API key protection
- 📁 Data Encryption: S3 and DynamoDB encryption at rest
- 🌐 Network Security: VPC endpoints for sensitive operations
- 🔒 Secrets Management: API keys stored in AWS Secrets Manager
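Per-user document isolation is often implemented by namespacing every stored object under the authenticated user's ID. The sketch below illustrates that idea with an S3 key prefix derived from the Cognito `sub` claim. The `users/<sub>/` layout and both helper names are assumptions, not a description of how this project stores files.

```python
import re


def user_object_key(user_sub: str, filename: str) -> str:
    """Build an object key scoped to one user, with a sanitized file name."""
    if not re.fullmatch(r"[A-Za-z0-9-]+", user_sub):
        raise ValueError("unexpected characters in user id")
    # Drop any directory components so a crafted name cannot escape the prefix.
    base = filename.replace("\\", "/").rsplit("/", 1)[-1]
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", base)
    if not safe.strip("._"):
        raise ValueError("file name is empty after sanitizing")
    return f"users/{user_sub}/{safe}"


def owns_object(user_sub: str, key: str) -> bool:
    """Check that a key lives under the given user's prefix."""
    return key.startswith(f"users/{user_sub}/")


if __name__ == "__main__":
    key = user_object_key("abc-123", "../../etc/report Q4.pdf")
    print(key, owns_object("abc-123", key), owns_object("abc-12", key))
```

Including the trailing slash in the prefix check matters: without it, user `abc-12` would appear to own objects belonging to `abc-123`.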
🔒 Enhanced Security Setup
# Enable S3 encryption
aws s3api put-bucket-encryption \
--bucket $BUCKET_NAME \
--server-side-encryption-configuration '{
"Rules": [{
"ApplyServerSideEncryptionByDefault": {
"SSEAlgorithm": "AES256"
}
}]
}'
# Enable DynamoDB encryption
aws dynamodb update-table \
--table-name pai-embeddings-metadata \
--sse-specification Enabled=true
# Configure API Gateway throttling
aws apigatewayv2 update-stage \
--api-id $API_ID \
--stage-name '$default' \
--default-route-settings ThrottlingBurstLimit=100,ThrottlingRateLimit=50
- 🔐 Cognito password policy enabled (8+ characters)
- 🚪 API Gateway CORS properly configured
- 📁 S3 bucket not publicly accessible
- 🔑 Secrets stored in AWS Secrets Manager
- 📊 CloudTrail logging enabled for auditing
- 🛡️ Lambda functions use least-privilege IAM roles
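The least-privilege item can be partly checked by scanning a policy document for wildcard actions. The sketch below flags `Allow` statements that grant `*` or `service:*`. It is a rough audit aid under the assumption that the policy is plain JSON in the standard IAM format; `wildcard_actions` is a hypothetical helper and does not replace tools such as IAM Access Analyzer.

```python
import json


def wildcard_actions(policy_json: str) -> list[str]:
    """Return the sorted wildcard actions granted by Allow statements."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM permits a single statement object
        statements = [statements]

    found = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # IAM permits a single action string
            actions = [actions]
        found.extend(a for a in actions if a == "*" or a.endswith(":*"))
    return sorted(found)


if __name__ == "__main__":
    example = json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:*", "iam:GetRole", "lambda:*"],
            "Resource": "*",
        }],
    })
    print(wildcard_actions(example))
```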
- 🍴 Fork the repository
- 🌿 Create a feature branch: git checkout -b feature/amazing-feature
- 💾 Commit your changes: git commit -m 'Add amazing feature'
- 📤 Push to the branch: git push origin feature/amazing-feature
- 🔄 Open a Pull Request
When reporting bugs, please include:
- 📋 Steps to reproduce
- 💻 Expected vs actual behavior
- 🖥️ Environment details (OS, browser, versions)
- 📄 Relevant logs or screenshots
We welcome suggestions for:
- 🔍 Enhanced search capabilities
- 📊 New document formats (Word, Excel, etc.)
- 🤖 Additional AI model integrations
- 🎨 UI/UX improvements
- 📖 Detailed Setup Guide - Step-by-step manual deployment
- 🗑️ Cleanup Guide - Remove all AWS resources
- 🚀 Deployment Script - Automated deployment tool
- 📊 Architecture Diagrams - System overview
| Need Help With | Contact Method |
|---|---|
| 🐛 Bugs | Open a GitHub issue with error details |
| 💡 Features | Start a GitHub discussion |
| 🔧 Setup | Check troubleshooting guide above |
| 💬 General | Create a GitHub discussion |