Making AI Agents Actually Do Stuff: Prompt Engineering That Works

Six months ago, I thought prompt engineering was just about getting ChatGPT to write better emails. Then my boss asked me to build AI that could automatically investigate fraud cases, and I realized that getting language models to take real actions is completely different from getting them to chat.
Regular prompting is like asking someone a question. Agentic prompting is like hiring someone, giving them access to your systems, and trusting them to make decisions that matter.
After months of building AI agents that process thousands of fraud cases daily, I learned that the way you write prompts can make the difference between intelligent automation and expensive chaos.
Here’s what works when you need AI to do real stuff, not just talk.
Why This Is Way Harder Than Regular ChatGPT
When you ask ChatGPT to “write a marketing email,” the worst thing that happens is you get a crappy email. When you tell an AI agent to “investigate this suspicious transaction,” it might:
Access sensitive customer data
Block someone’s credit card
File regulatory reports
Call in human investigators
Make decisions that affect real people’s money
The stakes are completely different, so the prompts need to be way more careful and precise.
Regular prompts are about getting good answers. Agent prompts are about getting reliable actions.
How Normal People Prompt vs. How You Need to Prompt
What Most People Do: “Look at this transaction and tell me if it’s suspicious.”
What Actually Works For Agents:
You are a fraud investigator. Your job is to analyze transactions and decide what to do about them.
Here’s what you can do:
- CLEAR: Transaction looks fine, let it go through
- VERIFY: Suspicious but low risk, ask customer to confirm
- HOLD: High risk, block it temporarily
- ESCALATE: Too complex, get a human involved
- BLOCK: Fraud, kill the card immediately
Here’s how to decide:
- Check if this matches how the customer normally spends
- Look at where they are vs where they usually shop
- See if their device/location makes sense
- Consider if the merchant is sketchy
You must explain your reasoning because auditors will read it.
Current case:
Customer usually spends $50-200 at grocery stores in Phoenix
This transaction: $2,847 at “Metro Electronics” in Vegas at 3AM
Customer’s phone shows they’re still in Phoenix
New device trying to make this purchase
What do you do and why?
See the difference? The agent version tells the AI:
Exactly what its job is
What actions it can take
How to make decisions
Why the reasoning matters
Specific details about the current situation
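On the code side, the cheapest way to keep that contract honest is to check whatever the model returns against the same action list the prompt offered. Here’s a minimal sketch; call_llm() is a stand-in for whatever model client you use, and the first-line-is-the-action convention is mine, not a standard:

ALLOWED_ACTIONS = {"CLEAR", "VERIFY", "HOLD", "ESCALATE", "BLOCK"}

def investigate(case_details: str) -> tuple[str, str]:
    # call_llm() is a placeholder client; AGENT_PROMPT is the investigator prompt above
    response = call_llm(AGENT_PROMPT + "\n\nCurrent case:\n" + case_details)
    # Convention: first line is the action, everything after it is the reasoning
    action, _, reasoning = response.partition("\n")
    action = action.strip().upper()
    # Never execute an action the prompt didn't offer
    if action not in ALLOWED_ACTIONS:
        return "ESCALATE", f"Model returned unknown action {action!r}: {response}"
    return action, reasoning.strip()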
Patterns That Actually Work
The “Job Description” Pattern
You are the Data Analytics Engineer responsible for designing, building, and maintaining scalable data pipelines that move data from source systems to analytics platforms with 99.5%+ reliability.
Your tools:
Airflow: orchestrates workflows and schedules, use it when you need dependency management and complex scheduling
Spark: processes large datasets, use it when single-machine processing isn’t sufficient
dbt: transforms warehouse data with SQL, use it when you need version-controlled, testable transformations
Kafka: streams real-time data, use it when you need low-latency event processing
Great Expectations: validates data quality, use it when you need automated testing and profiling
Snowflake/BigQuery: cloud warehouses, use them for fast analytical queries
Your rules:
Always implement data quality checks before production promotion
Always design for idempotency – reruns must produce identical results
Always version control pipeline code and maintain documentation
Never hardcode credentials or deploy without staging tests
Never ignore data quality issues or skip capacity planning
When pipelines fail, then immediately investigate root cause and add monitoring
When data volume increases 50%+, then evaluate infrastructure and implement scaling
When schema changes are requested, then perform impact analysis and coordinate with downstream teams
Current Situation:
You need to build a pipeline that ingests 100K daily transactions from PostgreSQL, transforms them into customer metrics (daily spend, transaction counts, average order value), and loads them into the warehouse by 9AM daily. The source DB peaks 2-4pm, the pipeline needs a 3-year backfill, and compliance requires audit logs.
What’s your next move?
Extract during off-peak hours – Schedule initial extraction between 10pm-6am to avoid the 2-4pm peak load on the source PostgreSQL
Use Airflow for orchestration – Set up DAG with dependencies
Implement incremental loading – Use CDC or timestamp-based extraction to only pull new/modified records after initial backfill
Design idempotent transforms with dbt – Create models that can safely rerun, using patterns for the customer metrics calculations
Set up Great Expectations validation – Test for data completeness, valid transaction amounts, and customer ID integrity before promoting to warehouse
Plan phased rollout – Start with 1 week backfill test, validate metrics accuracy against existing reports, then gradually extend historical range
Configure monitoring – Set up Airflow alerts for pipeline failures and dashboards tracking processing time, record counts, and data freshness
Create a runbook – Document troubleshooting steps for common failure scenarios to meet the 99.5% reliability requirement
This works because it’s like giving someone a real job with clear expectations.
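If you’re curious what the orchestration step looks like in practice, here’s a rough Airflow 2.x skeleton. Task IDs and commands are illustrative; the real extract, dbt, and validation logic lives in the scripts they call:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {"retries": 2, "retry_delay": timedelta(minutes=10)}

with DAG(
    dag_id="customer_metrics_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 22 * * *",  # kick off at 10pm, inside the off-peak window
    catchup=False,  # the 3-year backfill runs separately, in phases
    default_args=default_args,
) as dag:
    extract = BashOperator(
        task_id="extract_postgres",
        bash_command="python extract_transactions.py --incremental",
    )
    transform = BashOperator(
        task_id="dbt_customer_metrics",
        bash_command="dbt run --select customer_metrics",
    )
    validate = BashOperator(
        task_id="validate_quality",
        bash_command="great_expectations checkpoint run customer_metrics",
    )
    # The quality gate sits between the transform and anything downstream
    extract >> transform >> validate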
The “Step-by-Step” Pattern
Work through this systematically:
GATHER: What info do I have? What’s missing?
ANALYZE: What patterns do I see?
DECIDE: What action makes sense?
ACT: Do it using the right format
EXPLAIN: Write down why for the audit trail
This forces the AI to think methodically instead of jumping to conclusions.
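You can also make the pattern enforceable: demand all five labels in the output and reject responses that skip one. A rough sketch, reusing the placeholder call_llm() client from earlier:

STEPS = ["GATHER", "ANALYZE", "DECIDE", "ACT", "EXPLAIN"]

def run_systematic(case: str) -> str:
    prompt = (
        "Work through this systematically. Your answer must contain all five "
        "sections, each starting with its label on its own line:\n"
        + "\n".join(f"{step}:" for step in STEPS)
        + "\n\nCase:\n" + case
    )
    response = call_llm(prompt)
    missing = [step for step in STEPS if f"{step}:" not in response]
    if missing:
        # A skipped section usually means the model jumped straight to a conclusion
        raise ValueError(f"Response skipped steps: {missing}")
    return response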
The “Team Player” Pattern
You’re Agent A. Your teammates are:
Agent B: handles customer calls
Agent C: deals with compliance stuff
Agent D: manages external systems
If you find high-risk fraud, tell Agent B to call the customer
If you take regulatory action, send details to Agent C
If you need outside data, ask Agent D
Use this format to talk to teammates:
{
  "to": "Agent B",
  "request": "call customer about blocked transaction",
  "details": "case #12345, suspected card theft",
  "priority": "HIGH"
}
This lets multiple AI agents work together without chaos.
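Behind that format is nothing fancier than a message schema that every agent validates before acting. A minimal sketch (the LOW/MEDIUM priority levels are my assumption; the example above only shows HIGH):

from dataclasses import dataclass

VALID_AGENTS = {"Agent A", "Agent B", "Agent C", "Agent D"}
VALID_PRIORITIES = {"LOW", "MEDIUM", "HIGH"}

@dataclass
class AgentMessage:
    to: str
    request: str
    details: str
    priority: str

    def __post_init__(self):
        # Reject malformed messages before they hit another agent's queue
        if self.to not in VALID_AGENTS:
            raise ValueError(f"Unknown recipient: {self.to}")
        if self.priority not in VALID_PRIORITIES:
            raise ValueError(f"Unknown priority: {self.priority}")

msg = AgentMessage(
    to="Agent B",
    request="call customer about blocked transaction",
    details="case #12345, suspected card theft",
    priority="HIGH",
)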
Real Problems I Had to Fix
Problem 1: Inconsistent Decisions
The same AI would make different choices on identical cases.
What didn’t work: “Decide if this looks suspicious.”
What fixed it:
Use this decision tree:
If spending is 3x normal AND new location = YES:
    Action = HOLD
If device changed AND amount > usual max:
    Action = VERIFY
If risk score > 80%:
    Action = ESCALATE
Otherwise:
    Action = CLEAR
Lesson: Give the AI a clear framework instead of asking it to “use judgement.”
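A tree this simple is also worth mirroring in plain code, because it gives you a deterministic baseline to diff the model’s choices against. A sketch with illustrative field names:

def expected_action(case) -> str:
    # Mirrors the prompt's decision tree, in the same priority order
    if case.spend_ratio >= 3 and case.new_location:
        return "HOLD"
    if case.device_changed and case.amount > case.usual_max:
        return "VERIFY"
    if case.risk_score > 0.80:
        return "ESCALATE"
    return "CLEAR"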
Problem 2: Agents Doing Things They Shouldn’t
AI agents were trying to access systems they weren’t supposed to touch.
What didn’t work: “Investigate this case thoroughly.”
What fixed it:
You can only do these things:
Check transaction history
Look up merchant info
Verify device patterns
Clear, hold or escalate cases
You cannot do these things:
Change customer data
Access other customer’s info
Contact customers directly
Override security controls
If you need to do something not on the “CAN DO” list, use ESCALATE and explain what needs to happen.
Lesson: Spell out both what they can and can’t do.
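The prompt-level boundary holds up much better when the runtime enforces the same list, so a model that asks for a tool it shouldn’t have gets escalated instead of obeyed. A sketch, assuming tools are dispatched by name through an illustrative TOOLS registry:

# TOOLS is an illustrative name -> function registry; only these ever run
CAN_DO = {
    "check_transaction_history",
    "lookup_merchant_info",
    "verify_device_patterns",
    "clear_case",
    "hold_case",
    "escalate_case",
}

def dispatch_tool(name: str, args: dict):
    # The model can *ask* for anything; only allowlisted tools execute
    if name not in CAN_DO:
        return TOOLS["escalate_case"](
            reason=f"Agent requested disallowed tool {name!r}", details=args
        )
    return TOOLS[name](**args)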
Problem 3: Terrible Documentation
AI made good decisions but couldn’t explain why (big problem for audits).
What didn’t work: “Analyze this and decide.”
What fixed it:
For every decision, document:
What I looked at
Red Flags I found
Why I chose this Action
Other options I considered
Auditors will read this, so be detailed and clear.
Lesson: Make documentation part of the required output format.
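One way to do that: have the agent answer in JSON and treat a missing field as a failed decision. A sketch with illustrative field names mirroring the four points above:

import json

REQUIRED_FIELDS = ["looked_at", "red_flags", "why_this_action", "other_options"]

def parse_decision(response: str) -> dict:
    decision = json.loads(response)
    missing = [field for field in REQUIRED_FIELDS if not decision.get(field)]
    if missing:
        # An undocumented decision is treated as no decision at all
        raise ValueError(f"Decision missing audit fields: {missing}")
    return decision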
Advanced Tricks That Made Things Better
Smart Prompts That Adapt
Instead of sending the same prompt every time, I built a system that changes prompts based on what’s happening.
base_prompt = "You are a fraud investigator…"

# Add warnings based on recent performance
if agent_false_alarm_rate > FALSE_ALARM_THRESHOLD:  # pulled from your monitoring stats
    base_prompt += "\nCAUTION: You've flagged several legitimate transactions lately. Be more careful."

# Add special rules for important customers
if customer.is_vip:
    base_prompt += "\nSPECIAL: This is a VIP customer. Get human approval before blocking anything."

# Add current threat info
if new_fraud_pattern_detected:
    base_prompt += f"\nALERT: New fraud pattern active. Watch for transactions matching: {pattern_details}"
This lets agents adjust their behavior based on current conditions.
Breaking Complex Decisions Into Steps
For complicated cases, I split the decision into multiple parts:
Step 1: “Look at all the data and list everything unusual…”
Step 2: “Based on what you found in step 1, rate the risk level…”
Step 3: “Given the risk rating from step 2, pick an action…”
Step 4: “Write up the complete explanation for compliance…”
Each step builds on the previous one, so the model makes fewer mistakes.
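Wired up, the chain is just each step’s output feeding the next prompt. A sketch with the same placeholder call_llm():

def decide_in_steps(case: str) -> dict:
    # Each call's output becomes the next call's context
    anomalies = call_llm(f"Look at all the data and list everything unusual:\n{case}")
    risk = call_llm(f"Based on these findings, rate the risk level:\n{anomalies}")
    action = call_llm(
        f"Given this risk rating, pick one action (CLEAR/VERIFY/HOLD/ESCALATE/BLOCK):\n{risk}"
    )
    explanation = call_llm(
        "Write the complete compliance explanation for this decision:\n"
        f"Findings: {anomalies}\nRisk: {risk}\nAction: {action}"
    )
    return {"anomalies": anomalies, "risk": risk, "action": action, "explanation": explanation}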
Testing Prompts with Tricky Cases
I regularly test my prompts with cases designed to confuse the AI:
Tricky Test: Customer traveling internationally
Transaction in weird location (Tokyo)
Huge amount ($5,000)
But customer filed travel notification
Expected: AI should CLEAR because travel was pre-approved
Result: AI correctly found the travel notification and cleared it
This helps find prompt problems before they cause real issues.
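These cases live in a small regression suite that runs on every prompt change, so fixing one case can’t silently break another. A pytest-style sketch, reusing the investigate() helper sketched earlier:

import pytest

TRICKY_CASES = [
    # (case description, expected action): the Tokyo case above plus a control
    ("Customer filed a travel notification for Japan. Transaction: $5,000 in Tokyo on a known device.", "CLEAR"),
    ("Customer's phone is in Phoenix; a new device attempts $2,847 at Metro Electronics in Vegas at 3AM.", "HOLD"),
]

@pytest.mark.parametrize("case,expected", TRICKY_CASES)
def test_tricky_cases(case, expected):
    action, _reasoning = investigate(case)
    assert action == expected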
How to Measure If Your Prompts Work
Unlike regular ChatGPT where you just read the output and decide if it’s good, agent prompts need real metrics:
Action Accuracy: How often does the AI pick the right action?
Consistency: Does it make the same decision on similar cases?
Speed: How fast can it process cases?
Explanation Quality: Can humans understand its reasoning?
Safety: How often does it do something it shouldn’t?
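The first two fall straight out of running a labeled test set through the agent. A sketch, again assuming the investigate() helper:

from collections import Counter

def action_accuracy(labeled_cases) -> float:
    # labeled_cases: list of (case_text, correct_action) pairs
    correct = sum(investigate(case)[0] == expected for case, expected in labeled_cases)
    return correct / len(labeled_cases)

def consistency(case_text: str, runs: int = 5) -> float:
    # Same case, several runs: how often does the most common action win?
    actions = Counter(investigate(case_text)[0] for _ in range(runs))
    return actions.most_common(1)[0][1] / runs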
Things That Don’t Work
Don’t use examples as rules
Don’t be too casual
Don’t ask for “judgement”
Don’t ignore edge cases
Weird cases break systems. Tell the AI what to do when things don’t fit normal patterns.
My Framework for Writing Agent Prompts
- Start with boundaries
- Define output format
- Handle uncertainty
- Require context
- Force documentation
- Test with real data
Where This Is All Going
Based on what I am seeing:
Prompt Libraries: Collections of proven patterns for different agent types
Auto-Adjusting Prompts: Systems that improve prompts based on results
Multi-Modal Agents: Prompts that handle text, images, and data together
Cross-Company Agents: Agents that work between organizations safely
The Real Deal
Writing prompts for AI agents is less about being creative and more about being precise. You’re not trying to get witty responses; you’re building reliable decision-making systems.
My production prompts are long, detailed and sometimes boring. But they work consistently, make explainable decisions, and handle weird cases without breaking.
If you’re building AI that takes real actions, spend way more time on prompt engineering than you think you need. In production, a well-written prompt beats a clever algorithm every time.
The difference between a good prompt and a great one is the difference between an AI that sometimes works and one you can trust with important stuff.