Hey, Microsoft! 'Bing me' a Text Extractor!

Part 1 - How to GPT

(This article covers how to get Bing Chat to analyze full documents, along with some background on prompt writing and text extraction. The next one will include empirical data scoring, as well as conclusions.)

Last week, I wrote a “silly” article about building a text extractor for legal corpus data using ChatGPT 3.5. This week, I’m writing a much less silly article about the text extraction capabilities of GPT-4 via Microsoft Bing, and what it can accomplish with a bit of prompt wrangling.

(I don’t like the term ‘prompt engineering,’ at least not with respect to my own prompt writing journey. I don’t have much of a foundation or methodology to my approach, and I proceed mostly through basic trial and error. I don’t find that particularly engineerical. Yes, I made that word up.)

My goal is to determine empirically the quality of structured data that can be readily extracted from legal documents and filings utilizing LLM tools such as ChatGPT. A lot of digital ink has already been spilled regarding the seemingly miraculous capabilities of ChatGPT, but much of it on the social media side is anecdotal, without rigorous scoring and data quality evaluation. 

To that end, I am “simulating” a number of data sets with specific text extraction goals to test its capabilities. I’ve taken “inspiration” from some existing analytics products that I find particularly creative or interesting, and posed the question, “how would I build an underlying dataset to power a feature of this kind?” This is very far from obsoleting any of these products, as there is a huge difference between building a dataset and having a go-to-market strategy to deliver and maintain an analytics product. Still, I hope this helps to spark the imagination of legal data nerds who want a concrete example of how GPT might contribute to text extraction and data gathering today. 

Background on Structured Data Prompts

I took inspiration for structured data prompt writing from the following sources:

https://www.youtube.com/watch?v=NVZbIsWX-_k

https://www.linkedin.com/pulse/data-mining-enrichment-chatgpt-ryan-rentfro/ 

There are plenty of sources out there on prompt-writing, but these both cover writing a structured model, almost like a scaffold for a GAI language model to fill out. In my very brief experience, this seems to provide a template for the model to work from, reducing its unpredictability or “creativity” so that it is less likely to stray from the data model you want.

There are also a lot of cool examples online of processing text in novel ways using ChatGPT. In many of these, the precise structure of the output isn’t critical. By contrast, I’m attempting to generate structured JSON to build a theoretical dataset, so the formatting is crucial. If GPT starts reorganizing and restructuring the JSON, we’ll end up with messy data that requires normalizing or heavy editing, which defeats the purpose of the whole exercise.
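To make that concrete, here is a minimal Python sketch (my own illustration, with placeholder field names, not part of any Bing or GPT tooling) of the kind of downstream check that only works if the model keeps returning the same structure:

import json

EXPECTED_KEYS = {"Expert Name", "Compensation", "Education"}  # placeholder fields

def accept(raw_text):
    # Keep a response only if it parses as JSON and matches the expected keys.
    try:
        record = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or set(record) != EXPECTED_KEYS:
        return None  # the model strayed from the data model; set aside for review
    return record

print(accept('{"Expert Name": "Jane Doe", "Compensation": "$500/hour", "Education": []}'))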


Bing Chat >> ChatGPT 3.5

I got this idea from a post by Isaac Wong, who wrote about how Bing Chat can process full documents and has the more powerful GPT-4 model driving its answers.


In my previous testing, I was using the web-based ChatGPT 3.5, which was annoying: I had to find text-only examples to feed into the chat prompt, it was limited to about 4,000 tokens, and it would periodically just freak out. Copy-pasting segments of text doesn’t provide as useful a benchmark for GPT’s text processing capabilities (because I have already trimmed the text to the operative portions), nor does it really resemble a real-world data pipeline from document to dataset powered by an LLM.


While Bing Chat does have a 2,000-character prompt limit, you can point it at a website or a document on your computer, and it will happily churn through hundreds of pages while executing your prompt. Wow!


Get Bing Chat! (Setup)

In order to get started, you will need to download Microsoft Edge and create a Microsoft account. Once you do this, you’ll have access to the “Bing Chat” icon in Edge.


You also need to enable “Page context” to allow the Bing Chat application to read your browser page information and incorporate that into its answers. This allows you to open a local PDF in Edge Browser, then have Bing Chat use the document as input. 


Bing Chat also has a setting for modulating among “Creative”, “Balanced”, or “Precise” answers. This corresponds to the “temperature” parameter in GPT-4, which I understand to be a value that controls how often the model picks outputs other than the most statistically favored ones. I don’t know exactly how Bing’s settings correspond to specific temperature values, but choosing “More Precise” predictably improves the uniformity of structured data output.
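Bing doesn’t publish how its three modes map onto specific temperature values. Purely for reference, and purely as an assumption about what sits underneath, this is what the temperature knob looks like when calling the GPT-4 API directly through OpenAI’s Python library (something Bing Chat itself never exposes):

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Lower temperature favors the most statistically likely output (more "Precise");
# higher temperature allows less likely choices (more "Creative").
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Generate a JSON object from this text..."}],
    temperature=0.2,  # my guess at the "More Precise" end of the dial
)
print(response["choices"][0]["message"]["content"])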


Now we are ready to download PTAB documents and start running them through GPT-4-powered Bing Chat!


Now for the Good Stuff - Text Extraction Methodology


The inspiration for this text extraction exercise is a PTAB analytics feature from Juristat called “PTAB Expert Search”. I think it’s a really interesting and creative application of PTAB data, namely the Expert Declarations that are required under PTAB procedures. Juristat has aggregated this easily-overlooked data into an interesting and useful tool for evaluating experts. I find it delightfully clever.


Let’s consider how we would build this dataset. At a minimum, we need something like the following:

  • PTAB Petition #
  • Expert Name
  • Who retained them
  • Rate of Compensation
  • “Level” of expertise (perhaps some aggregate of work history, education, experience)

The “Red / Yellow / Green” model of expertise is some proprietary ranking, and I won’t get too caught up in what exactly that might be. I’m not trying to “clone” this feature, but build a dataset that would enable us to evaluate PTAB Experts in a similar manner.

Let’s consider the Expert Declarations themselves and what kind of data we can extract. If you’ve seen a PTAB expert declaration before, you know they are fairly dense discussions of the patent in question, the skill of a “PHOSITA” (“person having ordinary skill in the art”), proposed claim constructions, and a deep technical analysis of the invention’s claims in light of existing technology. Nestled in the first few pages is a “Background / Qualifications” section where the expert discloses their field of study/expertise, education, publications, rate of compensation, and their relationship to the matter at hand. Sometimes you get a CV attached at the end, although this is also commonly a separately filed exhibit.


After reading the intro to a few of these, my idea for a dataset is something like this:

  • PTAB Petition #
  • Patent Number
  • Expert Name
  • Field of Expertise
  • Compensation
  • Education
  • Qualifications

With this data, we can search by 1) expert field, 2) patent, 3) specific IPR #, and 4) education and credentials. We can use the “qualifications” summary for keyword searching, and in aggregate we can see how often an expert has represented patent owners versus petitioners. With the patent number, we can match against USPTO Technology Center or Cooperative Patent Classification codes to understand which industries an expert is associated with. This is a good basis for building PTAB Expert analytics! Now we just need to prompt-engineer (er, wrangle) our way to some data!
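As a quick illustration of that aggregation step, here is a Python sketch of my own (with made-up records, not real extraction output) counting party posture per expert:

from collections import Counter

# Hypothetical extracted records, one per Expert Declaration.
records = [
    {"Expert Name": "Jane Doe", "Retaining Party": "Petitioner"},
    {"Expert Name": "Jane Doe", "Retaining Party": "Patent Owner"},
    {"Expert Name": "John Roe", "Retaining Party": "Petitioner"},
]

# How often has each expert appeared for each side?
posture = Counter((r["Expert Name"], r["Retaining Party"]) for r in records)
for (expert, party), count in posture.items():
    print(f"{expert}: {count} declaration(s) retained by {party}")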

Prompt Wrangling



Here, I’m going to outline the general process I took towards a workable extraction prompt. I’m no expert. I’ve read a couple of blog articles, and I’ve had a few hours of tinkering. Mainly, I’m just wrangling until the results are produced in a “stable” format, and the outputs look usable.

Prompt version 1

My first attempt was just to ask for a JSON of all the things I wanted:

for this page only, generate a json object that includes,
"IPR Numbers" as an array of strings, 
"Patent Numbers" as an array of strings, 
"Expert Name", 
"Retaining Party" which is either Petitioner or Patent Owner, 
"Field of Expertise" consisting of only a few words, "Compensation", 
"Education" as an array of strings only including the degree, year, and school, and 
"Qualifications" which briefly summarizes the expert's experience and accomplishments in their field        

I chose to make “IPR Numbers” and “Patent Numbers” arrays because the first few Declarations I encountered were cross-filed across related IPR proceedings. This can happen when a series of IPRs get filed that all challenge different claims in the same patent. Several keys required additional context to get GPT to produce an answer conforming to the desired format:

  • “Retaining Party” would usually return the actual party’s name (e.g. “Google”) when I was looking for the party posture.
  • “Field of Expertise” tended to go on for several sentences, so I had to cut GPT off.
  • “Education” would auto-format into a bunch of different JSON conventions, so I was trying to guide it towards a particular format.
  • “Qualifications” would similarly rattle on for paragraphs, or sometimes explode into improvisational nested structures that were not helpful.

You might notice a lot of these prompts refer to “this page” or “this page only”. Sometimes this is helpful to direct GPT to apply the prompt instructions to the tab you are viewing, as opposed to searching the internet to find the answer. 

GPT produces this:

{
    "IPR Numbers": ["IPR2023-00799"],
    "Patent Numbers": ["9,092,616"],
    "Expert Name": "Seth James Nielson",
    "Retaining Party": "Petitioner",
    "Field of Expertise": "Cybersecurity, computer networks, and software engineering",
    "Compensation": null,
    "Education": [
        "B.S. in Computer Science, April 2000, Brigham Young University",
        "M.S. in Computer Science, Brigham Young University",
        "Ph.D. in Computer Science, Rice University"
    ],
    "Qualifications": "Seth James Nielson is a subject matter expert in cybersecurity, computer networks, and software engineering. He has over two decades of experience in the computer industry and has worked as a software engineer, security analyst, research scientist and technical consultant. He has also held academic appointments at Johns Hopkins University and the University of Texas at Austin."
}        

Already pretty close! Over multiple trials, though, I was getting inconsistent “Education” results, so I tried to be more explicit by first including a data model. I also added a “Years of Experience” key in several versions, without being entirely sure what I expected it to return. It’s just an experiment for now.
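(A quick aside before iterating further: once a response is this close, a few lines of Python, my own post-processing sketch rather than anything Bing provides, can flatten each JSON reply into one row per declaration for a spreadsheet or database.)

import json

# Abbreviated copy of the JSON string pasted out of Bing Chat.
response_text = '{"IPR Numbers": ["IPR2023-00799"], "Expert Name": "Seth James Nielson", "Education": ["B.S. in Computer Science, April 2000, Brigham Young University"]}'

record = json.loads(response_text)

# Join list-valued fields so each declaration becomes a single flat row.
row = {key: "; ".join(value) if isinstance(value, list) else value
       for key, value in record.items()}
print(row)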

Prompt version 3

using the following model


IPR Numbers: array of strings;
Patent Numbers: array of strings;
Expert Name: string;
Retaining Party: either Patent Owner or Petitioner;
Years of Experience: number;
Compensation;
Field of Expertise: string including only a few words;
Education: as an array of json that contains Degree, Year and
School;
Qualifications: string summarizing the expert;


for this page only, generate a json object.        

I was trying out different versions of this “model”-based prompt when I accidentally ran it against a tab that didn’t contain a PTAB declaration. GPT just produced a mostly-blank JSON object that was very faithful to the structure of my model. This points to a helpful tip: if you don’t know how to write JSON (or XML, or some other data structure), you can ask GPT to produce that first, then feed it back to GPT in your actual prompt. Based on my previous error, I rewrote the prompt using the blank JSON I received:
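(If you’d rather not ask GPT for the scaffold, a few lines of Python produce the same blank template; this is just an alternative of my own, not part of the Bing workflow.)

import json

# Build the empty scaffold programmatically instead of asking GPT for it.
template = {
    "IPR Numbers": [],
    "Patent Numbers": [],
    "Expert Name": "",
    "Retaining Party": "",
    "Years of Experience": "",
    "Compensation": "",
    "Field of Expertise": "",
    "Education": [],
    "Qualifications": "",
}
print(json.dumps(template, indent=4))  # paste the printed JSON into the prompt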

Prompt version 4

{
    "IPR Numbers": [],
    "Patent Numbers": [],
    "Expert Name": "",
    "Retaining Party": "",
    "Years of Experience": “”,
    "Compensation": "",
    "Field of Expertise": "",
    "Education": [],
    "Qualifications": ""
}
summarize this page.
        

(For those of you who do not read JSON: it behaves like a tiny database where you define “keys” and “values,” each pair separated by a colon (“:”). Brackets (“[ ]”) mean the value stored is an array, which contains a list of values, and quotes mean the value is a single string of text. JSON can also be nested, meaning a value can be another chunk of JSON itself, which we will see soon.)

This prompt was actually better across different Expert Declarations, but in my haste, I’d reintroduced the problems around “Retaining Party” and also the “Education” array was once again unpredictable. That’s ok, you can just explain your model to GPT after you introduce it.

Prompt version 7

using the following json model


 { 
"IPR Numbers": [],
"Patent Numbers": [], 
"Expert Name": "",
"Retaining Party": "", 
"Years of Experience": "", 
"Compensation": "", 
"Field of Expertise": "", 
"Education": ["Degree": "", "Year": "", "School": ""], 
"Qualifications": "" 
}
summarize this page. 

Years of Experience is a number. 
“Retaining Party” is either Patent Owner or Petitioner. 
Compensation is $/hour. 
Qualifications is a brief text summary of about 30 words other than field of expertise        

(Note that the “Education” section is now an array of little bundles of nested JSON.) I found out that you can ask GPT to be “brief” or give it a word limit, and it generally treats it like a suggestion. Rude, but I’ll live.

In response, GPT produces this:

{
    "IPR Numbers": ["IPR2023-00799"],
    "Patent Numbers": ["9,092,616"],
    "Expert Name": "Seth James Nielson",
    "Retaining Party": "Petitioner",
    "Years of Experience": 20,
    "Compensation": null,
    "Field of Expertise": "Cybersecurity, computer networks, and software engineering",
    "Education": [
        {"Degree": "B.S. in Computer Science", "Year": 2000, "School": "Brigham Young University"},
        {"Degree": "M.S. in Computer Science", "Year": null, "School": "Brigham Young University"},
        {"Degree": "Ph.D. in Computer Science", "Year": null, "School": "Rice University"}
    ],
    "Qualifications": "Seth James Nielson has worked as a software engineer, security analyst, research scientist and technical consultant. He has also held academic appointments at Johns Hopkins University and the University of Texas at Austin."
}        

Due to our “Precise” temperature setting, and possibly due to the strict JSON prompting, GPT is producing relatively consistent formatting, and even returns “null” wherever it can’t find values. This is GOOD! We don’t want made-up data. Missing data can be filled in through secondary review, but made-up data is very hard to deal with.
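Those nulls are also easy to catch programmatically. As a final sketch of my own (hypothetical post-processing, not a Bing feature), flagging records with missing fields for secondary review might look like this:

def needs_review(record):
    # Return the field names that came back null or empty.
    return [key for key, value in record.items() if value in (None, "", [])]

record = {
    "IPR Numbers": ["IPR2023-00799"],
    "Expert Name": "Seth James Nielson",
    "Compensation": None,  # GPT couldn't find an hourly rate in this declaration
}
print(needs_review(record))  # ['Compensation']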

Now we are ready to start testing! Scoring, Data, Results, Conclusions coming in Part 2...
