Making AI Agents Actually Do Stuff: Prompt Engineering That Works

Six months ago, I thought prompt engineering was just about getting ChatGPT to write better emails. Then my boss asked me to build AI that could automatically investigate fraud cases, and I realized that getting language models to take real actions is completely different from getting them to chat.
Regular prompting is like asking someone a question. Agentic prompting is like hiring someone, giving them access to your systems, and trusting them to make decisions that matter.
After months of building AI agents that process thousands of fraud cases daily, I learned that the way you write prompts can make the difference between intelligent automation and expensive chaos.
Here’s what works when you need AI to do real stuff, not just talk.
Why This Is Way Harder Than Regular ChatGPT
When you ask ChatGPT to “write a marketing email,” the worst thing that happens is you get a crappy email. When you tell an AI agent to “investigate this suspicious transaction,” it might:
Access sensitive customer data
Block someone’s credit card
File regulatory reports
Call in human investigators
Make decisions that affect real people’s money
The stakes are completely different, so the prompts need to be way more careful and precise.
Regular prompts are about getting good answers. Agent prompts are about getting reliable actions.
How Normal People Prompt vs. How You Need to Prompt
What Most People Do: “Look at this transaction and tell me if it’s suspicious.”
What Actually Works For Agents:
You are a fraud investigator. Your job is to analyze transactions and decide what to do about them.
Here’s what you can do:
- CLEAR: Transaction looks fine, let it go through
- VERIFY: Suspicious but low risk, ask customer to confirm
- HOLD: High risk, block it temporarily
- ESCALATE: Too complex, get a human involved
- BLOCK: Fraud, kill the card immediately
Here’s how to decide:
- Check if this matches how the customer normally spends
- Look at where they are vs where they usually shop
- See if their device/location makes sense
- Consider if the merchant is sketchy
You must explain your reasoning because auditors will read it.
Current case:
Customer usually spends $50-200 at grocery stores in Phoenix
This transaction: $2,847 at “Metro Electronics” in Vegas at 3AM
Customer’s phone shows they’re still in Phoenix
New device trying to make this purchase
What do you do and why?
See the difference? The agent version tells the AI:
Exactly what its job is
What actions it can take
How to make decisions
Why the reasoning matters
Specific details about the current situation
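On the code side, the cheapest way to keep that contract honest is to check whatever the model returns against the same action list the prompt offered. Here’s a minimal sketch; call_llm() is a stand-in for whatever model client you use, and the first-line-is-the-action convention is mine, not a standard:

ALLOWED_ACTIONS = {"CLEAR", "VERIFY", "HOLD", "ESCALATE", "BLOCK"}

def investigate(case_details: str) -> tuple[str, str]:
    # call_llm() is a placeholder client; AGENT_PROMPT is the investigator prompt above
    response = call_llm(AGENT_PROMPT + "\n\nCurrent case:\n" + case_details)
    # Convention: first line is the action, everything after it is the reasoning
    action, _, reasoning = response.partition("\n")
    action = action.strip().upper()
    # Never execute an action the prompt didn't offer
    if action not in ALLOWED_ACTIONS:
        return "ESCALATE", f"Model returned unknown action {action!r}: {response}"
    return action, reasoning.strip()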
Patterns That Actually Work
The “Job Description” Pattern
You are the Data Analytics Engineer responsible for designing, building, and maintaining scalable data pipelines that move data from source systems to analytics platforms with 99.5%+ reliability.
Your tools:
Airflow: orchestrates workflows and schedules, use it when you need dependency management and complex scheduling
Spark: processes large datasets, use it when single-machine processing isn’t sufficient
dbt: transforms warehouse data with SQL, use it when you need version-controlled, testable transformations
Kafka: streams real-time data, use it when you need low-latency event processing
Great Expectations: validates data quality, use it when you need automated testing and profiling
Snowflake/BigQuery: cloud warehouses, use them for fast analytical queries
Your rules:
Always implement data quality checks before production promotion
Always design for idempotency – reruns must produce identical results
Always version control pipeline code and maintain documentation
Never hardcode credentials or deploy without staging tests
Never ignore data quality issues or skip capacity planning
When pipelines fail, then immediately investigate root cause and add monitoring
When data volume increases 50%+, then evaluate infrastructure and implement scaling
When schema changes are requested, then perform impact analysis and coordinate with downstream teams
Current Situation:
You need to build a pipeline that ingests 100K daily transactions from PostgreSQL, transforms them into customer metrics (daily spend, transaction counts, average order value), and loads them into the warehouse by 9AM daily. The source DB peaks 2-4pm, the pipeline needs a 3-year backfill, and compliance requires audit logs.
What’s your next move?
Extract during off-peak hours – Schedule initial extraction between 10pm-6am to avoid the 2-4pm peak load on the source PostgreSQL
Use Airflow for orchestration – Set up DAG with dependencies
Implement incremental loading – Use CDC or timestamp-based extraction to only pull new/modified records after initial backfill
Design idempotent transforms with dbt – Create models that can safely rerun, using patterns for the customer metrics calculations
Set up Great Expectations validation – Test for data completeness, valid transaction amounts, and customer ID integrity before promoting to warehouse
Plan phased rollout – Start with 1 week backfill test, validate metrics accuracy against existing reports, then gradually extend historical range
Configure monitoring – Set up Airflow alerts for pipeline failures and dashboards tracking processing time, record counts, and data freshness
Create a runbook – Document troubleshooting steps for common failure scenarios to meet the 99.5% reliability requirement
This works because it’s like giving someone a real job with clear expectations.
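If you’re curious what the orchestration step looks like in practice, here’s a rough Airflow 2.x skeleton. Task IDs and commands are illustrative; the real extract, dbt, and validation logic lives in the scripts they call:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {"retries": 2, "retry_delay": timedelta(minutes=10)}

with DAG(
    dag_id="customer_metrics_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 22 * * *",  # kick off at 10pm, inside the off-peak window
    catchup=False,  # the 3-year backfill runs separately, in phases
    default_args=default_args,
) as dag:
    extract = BashOperator(
        task_id="extract_postgres",
        bash_command="python extract_transactions.py --incremental",
    )
    transform = BashOperator(
        task_id="dbt_customer_metrics",
        bash_command="dbt run --select customer_metrics",
    )
    validate = BashOperator(
        task_id="validate_quality",
        bash_command="great_expectations checkpoint run customer_metrics",
    )
    # The quality gate sits between the transform and anything downstream
    extract >> transform >> validate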
The “Step-by-Step” Pattern
Work through this systematically:
GATHER: What info do I have? What’s missing?
ANALYZE: What patterns do I see?
DECIDE: What action makes sense?
ACT: Do it using the right format
EXPLAIN: Write down why for the audit trail
This forces the AI to think methodically instead of jumping to conclusions.
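You can also make the pattern enforceable: demand all five labels in the output and reject responses that skip one. A rough sketch, reusing the placeholder call_llm() client from earlier:

STEPS = ["GATHER", "ANALYZE", "DECIDE", "ACT", "EXPLAIN"]

def run_systematic(case: str) -> str:
    prompt = (
        "Work through this systematically. Your answer must contain all five "
        "sections, each starting with its label on its own line:\n"
        + "\n".join(f"{step}:" for step in STEPS)
        + "\n\nCase:\n" + case
    )
    response = call_llm(prompt)
    missing = [step for step in STEPS if f"{step}:" not in response]
    if missing:
        # A skipped section usually means the model jumped straight to a conclusion
        raise ValueError(f"Response skipped steps: {missing}")
    return response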
The “Team Player” Pattern
You’re Agent A. Your teammates are:
Agent B: handles customer calls
Agent C: deals with compliance stuff
Agent D: manages external systems
If you find high-risk fraud, tell Agent B to call the customer
If you take regulatory action, send details to Agent C
If you need outside data, ask Agent D
Use this format to talk to teammates:
{
  "to": "Agent B",
  "request": "call customer about blocked transaction",
  "details": "case #12345, suspected card theft",
  "priority": "HIGH"
}
This lets multiple AI agents work together without chaos.
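Behind that format is nothing fancier than a message schema that every agent validates before acting. A minimal sketch (the LOW/MEDIUM priority levels are my assumption; the example above only shows HIGH):

from dataclasses import dataclass

VALID_AGENTS = {"Agent A", "Agent B", "Agent C", "Agent D"}
VALID_PRIORITIES = {"LOW", "MEDIUM", "HIGH"}

@dataclass
class AgentMessage:
    to: str
    request: str
    details: str
    priority: str

    def __post_init__(self):
        # Reject malformed messages before they hit another agent's queue
        if self.to not in VALID_AGENTS:
            raise ValueError(f"Unknown recipient: {self.to}")
        if self.priority not in VALID_PRIORITIES:
            raise ValueError(f"Unknown priority: {self.priority}")

msg = AgentMessage(
    to="Agent B",
    request="call customer about blocked transaction",
    details="case #12345, suspected card theft",
    priority="HIGH",
)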
Real Problems I Had to Fix
Problem 1: Inconsistent Decisions
The same AI would make different choices on identical cases.
What didn’t work: “Decide if this looks suspicious.”
What fixed it:
Use this decision tree:
If spending is 3x normal AND new location = YES:
    Action = HOLD
If device changed AND amount > usual max:
    Action = VERIFY
If risk score > 80%:
    Action = ESCALATE
Otherwise:
    Action = CLEAR
Lesson: Give the AI a clear framework instead of asking it to “use judgement.”
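A tree this simple is also worth mirroring in plain code, because it gives you a deterministic baseline to diff the model’s choices against. A sketch with illustrative field names:

def expected_action(case) -> str:
    # Mirrors the prompt's decision tree, in the same priority order
    if case.spend_ratio >= 3 and case.new_location:
        return "HOLD"
    if case.device_changed and case.amount > case.usual_max:
        return "VERIFY"
    if case.risk_score > 0.80:
        return "ESCALATE"
    return "CLEAR"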
Problem 2: Agents Doing Things They Shouldn’t
AI agents were trying to access systems they weren’t supposed to touch.
What didn’t work: “Investigate this case thoroughly.”
What fixed it:
You can only do these things:
Check transaction history
Look up merchant info
Verify device patterns
Clear, hold or escalate cases
You cannot do these things:
Change customer data
Access other customer’s info
Contact customers directly
Override security controls
If you need to do something not on the “CAN DO” list, use ESCALATE and explain what needs to happen.
Lesson: Spell out both what they can and can’t do.
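The prompt-level boundary holds up much better when the runtime enforces the same list, so a model that asks for a tool it shouldn’t have gets escalated instead of obeyed. A sketch, assuming tools are dispatched by name through an illustrative TOOLS registry:

# TOOLS is an illustrative name -> function registry; only these ever run
CAN_DO = {
    "check_transaction_history",
    "lookup_merchant_info",
    "verify_device_patterns",
    "clear_case",
    "hold_case",
    "escalate_case",
}

def dispatch_tool(name: str, args: dict):
    # The model can *ask* for anything; only allowlisted tools execute
    if name not in CAN_DO:
        return TOOLS["escalate_case"](
            reason=f"Agent requested disallowed tool {name!r}", details=args
        )
    return TOOLS[name](**args)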
Problem 3: Terrible Documentation
AI made good decisions but couldn’t explain why (big problem for audits).
What didn’t work: “Analyze this and decide.”
What fixed it:
For every decision, document:
What I looked at
Red Flags I found
Why I chose this Action
Other options I considered
Auditors will read this, so be detailed and clear.
Lesson: Make documentation part of the required output format.
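One way to do that: have the agent answer in JSON and treat a missing field as a failed decision. A sketch with illustrative field names mirroring the four points above:

import json

REQUIRED_FIELDS = ["looked_at", "red_flags", "why_this_action", "other_options"]

def parse_decision(response: str) -> dict:
    decision = json.loads(response)
    missing = [field for field in REQUIRED_FIELDS if not decision.get(field)]
    if missing:
        # An undocumented decision is treated as no decision at all
        raise ValueError(f"Decision missing audit fields: {missing}")
    return decision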
Advanced Tricks That Made Things Better
Smart Prompts That Adapt
Instead of sending the same prompt every time, I built a system that changes prompts based on what’s happening.
base_prompt = "You are a fraud investigator…"

# Add warnings based on recent performance
if agent_false_alarm_rate > FALSE_ALARM_THRESHOLD:  # pulled from your monitoring stats
    base_prompt += "\nCAUTION: You've flagged several legitimate transactions lately. Be more careful."

# Add special rules for important customers
if customer.is_vip:
    base_prompt += "\nSPECIAL: This is a VIP customer. Get human approval before blocking anything."

# Add current threat info
if new_fraud_pattern_detected:
    base_prompt += f"\nALERT: New fraud pattern active. Watch for transactions matching: {pattern_details}"
This lets agents adjust their behavior based on current conditions.
Breaking Complex Decisions Into Steps
For complicated cases, I split the decision into multiple parts:
Step 1: “Look at all the data and list everything unusual…”
Step 2: “Based on what you found in step 1, rate the risk level…”
Step 3: “Given the risk rating from step 2, pick an action…”
Step 4: “Write up the complete explanation for compliance…”
Each step builds on the previous one, so the model makes fewer mistakes.
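Wired up, the chain is just each step’s output feeding the next prompt. A sketch with the same placeholder call_llm():

def decide_in_steps(case: str) -> dict:
    # Each call's output becomes the next call's context
    anomalies = call_llm(f"Look at all the data and list everything unusual:\n{case}")
    risk = call_llm(f"Based on these findings, rate the risk level:\n{anomalies}")
    action = call_llm(
        f"Given this risk rating, pick one action (CLEAR/VERIFY/HOLD/ESCALATE/BLOCK):\n{risk}"
    )
    explanation = call_llm(
        "Write the complete compliance explanation for this decision:\n"
        f"Findings: {anomalies}\nRisk: {risk}\nAction: {action}"
    )
    return {"anomalies": anomalies, "risk": risk, "action": action, "explanation": explanation}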
Testing Prompts with Tricky Cases
I regularly test my prompts with cases designed to confuse the AI:
Tricky Test: Customer traveling internationally
Transaction in weird location (Tokyo)
Huge amount ($5,000)
But customer filed travel notification
Expected: AI should CLEAR because travel was pre-approved
Result: AI correctly found the travel notification and cleared it
This helps find prompt problems before they cause real issues.
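These cases live in a small regression suite that runs on every prompt change, so fixing one case can’t silently break another. A pytest-style sketch, reusing the investigate() helper sketched earlier:

import pytest

TRICKY_CASES = [
    # (case description, expected action): the Tokyo case above plus a control
    ("Customer filed a travel notification for Japan. Transaction: $5,000 in Tokyo on a known device.", "CLEAR"),
    ("Customer's phone is in Phoenix; a new device attempts $2,847 at Metro Electronics in Vegas at 3AM.", "HOLD"),
]

@pytest.mark.parametrize("case,expected", TRICKY_CASES)
def test_tricky_cases(case, expected):
    action, _reasoning = investigate(case)
    assert action == expected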
How to Measure If Your Prompts Work
Unlike regular ChatGPT where you just read the output and decide if it’s good, agent prompts need real metrics:
Action Accuracy: How often does the AI pick the right action?
Consistency: Does it make the same decision on similar cases?
Speed: How fast can it process cases?
Explanation Quality: Can humans understand its reasoning?
Safety: How often does it do something it shouldn’t?
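The first two fall straight out of running a labeled test set through the agent. A sketch, again assuming the investigate() helper:

from collections import Counter

def action_accuracy(labeled_cases) -> float:
    # labeled_cases: list of (case_text, correct_action) pairs
    correct = sum(investigate(case)[0] == expected for case, expected in labeled_cases)
    return correct / len(labeled_cases)

def consistency(case_text: str, runs: int = 5) -> float:
    # Same case, several runs: how often does the most common action win?
    actions = Counter(investigate(case_text)[0] for _ in range(runs))
    return actions.most_common(1)[0][1] / runs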
Things That Don’t Work
Don’t use examples as rules
Don’t be too casual
Don’t ask for “judgement”
Don’t ignore edge cases
Weird cases break systems. Tell the AI what to do when things don’t fit normal patterns.
My Framework for Writing Agent Prompts
- Start with boundaries
- Define output format
- Handle uncertainty
- Require context
- Force documentation
- Test with real data
Where This Is All Going
Based on what I am seeing:
Prompt Libraries: Collections of proven patterns for different agent types
Auto-Adjusting Prompts: Systems that improve prompts based on results
Multi-Modal Agents: Prompts that handle text, images, and data together
Cross-Company Agents: Agents that work between organizations safely
The Real Deal
Writing prompts for AI agents is less about being creative and more about being precise. You’re not trying to get witty responses; you’re building reliable decision-making systems.
My production prompts are long, detailed and sometimes boring. But they work consistently, make explainable decisions, and handle weird cases without breaking.
If you’re building AI that takes real actions, spend way more time on prompt engineering than you think you need. In production, a well-written prompt beats a clever algorithm every time.
The difference between a good prompt and a great one is the difference between an AI that sometimes works and one you can trust with important stuff.