Effective Data Chunking and Querying with Pinecone and GPT-4o

After releasing the first part of the article, I kept querying the vector database with different queries. I was unhappy with the results, so I wanted to try to optimize the data ingestion on Pinecone a bit.
Improving the data quality
First of all, I tried to clean the markdown of image links, extra newlines, separators, and other noise, so that the files passed to Pinecone are more readable.
So, I created a small function with regular expressions (thanks, ChatGPT!) to preprocess the markdown extracted by Firecrawl before passing it to Pinecone.
import re

def clean_markdown(md_text):
    """Cleans Markdown text by removing images and dividers."""
    md_text = re.sub(r"!\[.*?\]\(.*?\)", "", md_text)  # Remove markdown images
    md_text = re.sub(r"<img[^>]*>", "", md_text)  # Remove HTML images
    md_text = re.sub(r"(\*{3,}|-{3,})\n(.*?)\n(\*{3,}|-{3,})", "", md_text, flags=re.DOTALL)  # Remove dividers
    md_text = re.sub(r"\n\s*\n", "\n", md_text).strip()  # Remove extra newlines
    return md_text
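As a quick check, this is how the function behaves on a small, made-up snippet of Firecrawl output (the markdown below is purely illustrative, not a real article):

```python
raw_md = """# Scraping notes
![hero image](https://example.com/hero.png)
<img src="https://example.com/banner.png">

***
Divider content we don't need
***

First paragraph of the article.


Second paragraph of the article."""

print(clean_markdown(raw_md))
# # Scraping notes
# First paragraph of the article.
# Second paragraph of the article.
```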
Splitting into fixed-length chunks
Another technique that can improve the relevance of the retrieved data is to split articles into chunks. Instead of ingesting the whole article as a single entry of the index, it is split into chunks that are inserted as several entries.
This way, a single entry should contain a single concept instead of an entire article, making it easier to calculate its relevance to the user’s query. You can find this approach in the file firecrawl_get_data_with_chunks.py in the GitHub repository of The Lab.
I’m well aware that this is far from perfect. I’m simply splitting the content into chunks of fixed length and ignoring the content of the chunks. The same paragraph could be split into different chunks, which is quite approximate.
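Just to illustrate the idea, here is a minimal sketch of this naive splitting. The real implementation lives in firecrawl_get_data_with_chunks.py; the chunk size below and the word-based approximation of tokens are assumptions on my part.

```python
def split_fixed_length(md_text, chunk_size=300):
    """Minimal sketch of naive fixed-length chunking.

    Length is approximated with whitespace-separated words here; the script
    in the repository works on tokens and may use a different chunk size.
    """
    words = md_text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# Each chunk then becomes its own record in the Pinecone index,
# carrying the same article metadata (author, title, and URL).
```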
A smarter approach could be to have ChatGPT read the article, summarize its different paragraphs, and then load each summary as a separate chunk. In this way, we get clean data and chunks that each contain an entire paragraph.
Splitting into chunks with GPT-4o
That’s exactly what I did in my last attempt: I passed the markdown files of the articles to GPT-4o and asked it to rewrite them, splitting each post into distinct paragraphs.
Every paragraph became a chunk in Pinecone. In this case, each chunk has a beginning and an end, being a fully developed concept instead of a string of X tokens.
You can find this chunking method in the repository file firecrawl_get_data_with_chunks_openai.py.
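The actual implementation is in that file; the snippet below is only a minimal sketch of the idea, reusing the same OpenAI client used later in this article. The prompt wording and the '---' separator are my illustrative choices, not necessarily the ones used in the repository.

```python
def split_with_gpt(article_md):
    """Minimal sketch: ask GPT-4o to rewrite the article as self-contained
    paragraphs and use each paragraph as a chunk. Prompt and separator are
    illustrative, not the exact ones in firecrawl_get_data_with_chunks_openai.py."""
    prompt = (
        "Rewrite the following article as a series of self-contained paragraphs, "
        "each covering a single concept. Separate the paragraphs with a line "
        "containing only '---'.\n\n"
        f"{article_md}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    text = response.choices[0].message.content
    # Each paragraph has a clear beginning and end and becomes one Pinecone chunk
    return [p.strip() for p in text.split("---") if p.strip()]
```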
It took me several hours to develop and test these ideas, which were not part of the initial article; that’s why this episode is published on Friday instead of the usual Thursday.
Querying the Pinecone database
At the end of the chunking tests, we have three different Pinecone indexes that used the same input data but split it in different ways:
- article-index, with one entry per article
- article-index-with-chunks, with the articles split into different chunks based on the number of tokens
- article-index-with-chunks-openai, with the articles split into chapters written by OpenAI
All three indexes share the same structure:
- a values field, containing the vector representation of the text we passed. This will be used to find the most relevant text for the input query using a proximity algorithm.
- a chunk_text field, where we store the text that will be used as the output of the query (the full article or the selected chunk of it)
- three metadata fields (author, title, and URL) that we’ll use to cite the articles used to answer the query, as shown in the sketch below
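To make this structure concrete, here is a hedged sketch of how a single chunk could be upserted into one of these indexes. The field names come from the description above; the variables (chunk, article_url, author, title) and the ID scheme are illustrative, and the actual upsert code in the repository scripts may differ.

```python
# Illustrative only: chunk, article_url, author, and title are assumed to
# come from the ingestion script; the real code in the repository may differ.
embedding = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=[chunk],
    parameters={"input_type": "passage"}
)[0]["values"]

index.upsert(
    vectors=[{
        "id": f"{article_url}#chunk-1",   # hypothetical ID scheme
        "values": embedding,              # vector used for the proximity search
        "metadata": {
            "chunk_text": chunk,          # text returned as the query output
            "author": author,
            "title": title,
            "url": article_url,
        },
    }],
    namespace="articles",
)
```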
But how can we query these indexes to retrieve the results?
The theory is quite simple, at least on a surface level. When we write a prompt, we’re basically writing a query in natural language. This query is then embedded with the same algorithm used to embed the articles inserted in Pinecone. The query itself becomes a series of numbers, so that Pinecone (or any other vector database) can perform a proximity search between the query values and the article values. The nearest results are then returned, each with a proximity score, and we can filter the returned records to keep only the nearest ones.
def retrieve_articles(query, top_k=3, confidence_threshold=0.3):
    """Retrieve the most relevant articles from Pinecone for a user query."""
    # Generate query embedding
    query_embedding = pc.inference.embed(
        model="llama-text-embed-v2",
        inputs=[query],
        parameters={"input_type": "query"}
    )[0]["values"]

    # Query Pinecone
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        namespace="articles",
        include_metadata=True
    )

    # Extract relevant articles
    retrieved_docs = []
    for match in results["matches"]:
        score = match["score"]
        metadata = match["metadata"]

        # Add article details
        retrieved_docs.append({
            "title": metadata["title"],
            "url": metadata["url"],
            "author": metadata["author"],
            "content": metadata["chunk_text"],
            "score": score
        })

    # Compute highest confidence score
    max_score = max([doc["score"] for doc in retrieved_docs], default=0)

    # Decide whether to use Pinecone or fallback to GPT-4o
    use_pinecone = max_score >= confidence_threshold

    return retrieved_docs if use_pinecone else None, use_pinecone
Once the records are returned, it’s just a matter of prompt engineering. We need to append the values contained in the chunk_text field of the records to the context window of the prompt and try to find the best way to describe the desired output.
def generate_answer(query):
    """Generates a long-form instructional answer using retrieved articles."""
    retrieved_docs, use_pinecone = retrieve_articles(query)

    if use_pinecone:
        # Extract full text from relevant articles
        context_text = "\n\n".join([
            f"Title: {doc['title']}\nAuthor: {doc['author']}\nContent:\n{doc['content']}..."
            for doc in retrieved_docs
        ])

        # Construct the GPT prompt
        prompt = (
            "Using the following extracted content from expert-written articles, "
            "provide a long-form, step-by-step, detailed answer with practical instructions. "
            "Make sure to extract key information and structure the answer properly.\n\n"
            f"{context_text}\n\n"
            f"📌 **User's Question**: {query}\n\n"
            f"💡 **Detailed Answer**:"
        )
    else:
        # No relevant articles, fall back to GPT-4o general knowledge
        prompt = (
            "Provide a long-form, detailed answer with step-by-step instructions "
            "based on your general knowledge:\n\n"
            f"📌 **User's Question**: {query}\n\n💡 **Detailed Answer**:"
        )

    # Query GPT-4o
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )

    # Return the answer text along with the retrieval info and sources
    return response.choices[0].message.content.strip(), use_pinecone, retrieved_docs
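To tie the two functions together, here is a quick usage sketch; the query matches the first test below, and the way the sources are printed is just an example of how the citation metadata can be used.

```python
# Example call; the sources formatting is illustrative.
answer, used_pinecone, docs = generate_answer("How can I bypass Datadome protected websites?")
print(answer)

if used_pinecone:
    print("\n📚 Sources:")
    for doc in docs:
        print(f"🔗 {doc['title']} - {doc['author']} ({doc['url']})")
```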
This prompt can probably be improved, but the results are quite good with all three indexes, even if I think I’m getting the best answers with the index that used GPT to chunk the articles.
Tests
After all this introduction, it’s time to test our assistant and see which index gives the best answers. The answers are quite long, so I’ll report only the most interesting excerpts.
Test 1 – article-index
How can I bypass Datadome protected websites?
This answer draws on older articles and does not mention Camoufox, the latest solution I used for this task, while it does mention Ghost-Cursor.
Other parts of the answer are quite good but a bit generic and not so interesting, in pure GPT style.
### Step 2: Detect Datadome on Target Website
- **Wappalyzer**: Use this tool to inspect the website's tech stack and detect the presence of Datadome.
- **Cookie Inspection**: Recognize Datadome by looking for specific cookies in the website's response headers, e.g., a "Datadome" cookie.
### Step 3: Initial Scraping Attempts
- **Avoid Basic Tools**: Basic HTTP libraries like 'requests' or simple Scrapy setups will likely be blocked due to lack of JavaScript execution and human-like behavior.
2. **Human-like Interactions**:
- Use tools like `python_ghost_cursor` to simulate human-like mouse movements (not compatible with Firefox).
- Randomize waits between actions to mimic human browsing patterns.
- Avoid direct navigation to deep links; instead, simulate a user's path through the site.
### Step 5: Experiment with Different Browsers
- **Switch to Brave Browser**: If Firefox does not work, try using Brave (a Chromium-based browser) with Playwright and see if it bypasses detection better than Chrome.
### Step 6: Commercial Proxy Solutions
- **Proxy Rotation**: Use a service like Proxyempire to rotate IPs frequently. Proxies can help bypass IP-based detection by altering your apparent origin.
- **Ensure Quality Proxies**: Use residential proxies for a more natural browsing fingerprint.
📚 **Sources:**
🔗 [How to scrape Datadome protected websites (early 2023 version)](https://substack.thewebscraping.club/p/how-to-scrape-datadome-2023) - Pierluigi Vinciguerra
🔗 [Bypassing Datadome with Web Scraping - End of 2023 Version](https://substack.thewebscraping.club/p/bypassing-datadome-2023-scraping) - Pierluigi Vinciguerra
🔗 [THE LAB #2: scraping data from a website with Datadome and xsrf tokens](https://substack.thewebscraping.club/p/scraping-datadome-api-hermes) - Pierluigi Vinciguerra
What are the best open-source tools for scraping in 2025?
This is quite accurate. The answer starts with traditional tools like Scrapy and Playwright and then moves on to ScrapegraphAI, Nodriver and Camoufox.
The same question asked to the standard GPT-4o provides a much less relevant answer.
How can I change my browser fingerprint? Which tools should I use?
Again, this is quite a good answer: not perfect, but at least it references the correct articles.
Test 2 – article-index-with-chunks
Let’s see how the answers change (if they do) by using this index made with split articles.
How can I bypass Datadome protected websites?
This is quite interesting. Apart from the introduction about detecting Datadome, the answer doesn’t come from my articles. In fact, there’s some Node code that I didn’t write, which is quite generic and won’t work for the task. Not a good answer.
What are the best open-source tools for scraping in 2025?
Again, it is a mixture of general GPT knowledge and something added from my articles. I think the Scrapoxy part was taken from my articles, but I definitely didn’t recommend Selenium and BeautifulSoup in them.
How can I change my browser fingerprint? Which tools should I use?
The answer again came partly from GPT’s own knowledge and partly from the chunks of my latest article about browser fingerprinting.
On top of that, it also added a bit of self-promotion:
Stay Informed and Adaptive:
Understand that modifying fingerprints is an ongoing arms race. Keep updated on new fingerprinting tactics, and regularly adapt your strategies.
Subscribing to resources like The Web Scraping Club can provide ongoing insights into the latest developments in web scraping and browser fingerprinting.
Let’s see if the third index performs better.
Test 3 – article-index-with-chunks-openai
Given the disappointment with the previous index’s responses, let’s hope these tests produce better results.
How can I bypass Datadome protected websites?
The answer is OK, similar to the one from the first test. It added the approach of reverse engineering the website’s calls but didn’t mention Ghost-Cursor or Camoufox.
I should probably improve the prompt and use more items from the results.
What are the best open-source tools for scraping in 2025?
Again, the answer is OK but not as good as in the first test. Looking at the mentioned articles, I suppose it extracted some chunks that are not really spot on.
How can I change my browser fingerprint? Which tools should I use?
Keeping the limit of twenty chunks, we again got a brilliant answer.
I kept testing for some time, increasing the number of chunks used with the first index, the one with a full article per record, but the results were not great; it seems like I added more confusion to the context. I’ll keep playing with this configuration, and soon, I’ll write the final article of the series, in which we’ll see how to add a UX interface to our assistant.