Using GenAI APIs to extract info from text fields in a proprietary db

Spent half a day tinkering with GenAI (OpenAI) APIs to solve a real-world problem. Felt good to be coding after so many months.

This was my 2nd attempt at understanding GenAI better with a real-world problem statement. First one was this.

The Problem Statement – Opportunity to deploy OpenAI

At mTuzo we have collated all the card-based offers available to customers across multiple partners. We then provide these offers as B2B APIs to our fintech partners, so that they can showcase relevant, targeted offers to their end-customers.

The most powerful value that mTuzo provides is that we have taken multiple unstructured data sources and turned them into standardized, uniform, structured data.

Many offers have a set of terms and conditions (T&Cs) around who or which transactions qualify for the offer. We capture these in multiple fields such as expiry, available outlets, card-level eligibility etc.

One of the partners asked us if we could provide a separate data-field only for “minimum spends” needed to unlock the offer. In our APIs this is available prominently for showcasing to end consumers, but the partner’s use-case needed this as a separate field.

And we have 1000s of live offers.

In the past we have used regex (& related techniques) to do something like this, but this time around I raised my hand to try & see if we could solve this via OpenAI.
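For comparison, here is a rough sketch of what a regex-based baseline could look like. It is written in Python purely for illustration and is not the code we actually used; the pattern and the function name `extract_min_spend` are assumptions, and it only covers phrasings like the ones quoted later in this post.

```python
import re

# Hypothetical pattern: a phrase such as "minimum spend" or "spend of",
# followed within a few characters by a rupee amount.
MIN_SPEND_RE = re.compile(
    r"(?:min(?:imum)?\.?\s+(?:spends?|purchase|transaction)|spends?\s+of)"
    r"[^0-9₹]{0,40}?"
    r"(?:Rs\.?|INR|₹)\s*([0-9][0-9,]*)",
    re.IGNORECASE,
)


def extract_min_spend(tnc: str) -> str:
    """Return the minimum spend as 'Rs <amount>' (commas removed), or 'NA'."""
    match = MIN_SPEND_RE.search(tnc)
    if not match:
        return "NA"
    return "Rs " + match.group(1).replace(",", "")
```

A pattern like this is cheap to run, but every new way of wording a T&C needs another alternative in the regex, which is the maintenance burden that made an LLM worth trying.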

Setting up GenAI API + own db

  • I can code only in PHP, so that decision was simple.
  • Next, pulled sample code for making a cURL call to the OpenAI APIs from the official website
  • Plugged in my API keys (created a new set just for this)
  • Model used – gpt-3.5-turbo-instruct (is this a good choice?)
  • Created a prompt with a few worked examples to show the model the expected replies (am sure this can be optimized further. Need to understand how) – "This AI tool helps credit/debit card customers understand terms and conditions(Tnc) of offers to identify any minimum spend criteria to avail the offer TnC- Get flat 20% discount on minimum spend of Rs. 2,499/- Should Reply- Rs 2,499\ TnC- Amazing offer of 1+1 when you spend of Rs. 4500/-\n Should Reply- Rs 4500\n\n TnC- Get upto60% discount upto Rs 100 Should Reply- NA"
  • Wrote code to connect to the db, pull one record at a time, extract the text field that has the minimum-spend data.
  • Created a dynamic prompt for each record by appending the record's text to the end of the generic prompt above
  • Called the OpenAI API for each record, with its unique prompt
  • Captured the JSON object returned, which looks like below. The actual response text is in array['choices'][0]['text']
  • Used the response to update a new column (for minimum spends) in the db/table.

[Screenshot: Using GenAI (OpenAI) API to connect to a database via PHP.]

[Screenshot: OpenAI API response to the prompt asking it to find the MinSpend criteria in an offer TnC.]
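
The steps above can be sketched roughly as follows. My actual code was in PHP; this is a Python re-sketch for illustration only, not the original. The table and column names (`offers`, `tnc_text`, `min_spend`) are hypothetical, `sqlite3` stands in for the real database, and `max_tokens` and `temperature` are my guesses at sensible settings. The endpoint is the OpenAI completions endpoint, which is the one `gpt-3.5-turbo-instruct` uses.

```python
import json
import sqlite3
import urllib.request

API_URL = "https://api.openai.com/v1/completions"
MODEL = "gpt-3.5-turbo-instruct"

# Generic few-shot prompt (lightly tidied); each record's text is appended to it.
BASE_PROMPT = (
    "This AI tool helps credit/debit card customers understand terms and "
    "conditions (TnC) of offers to identify any minimum spend criteria to "
    "avail the offer.\n\n"
    "TnC- Get flat 20% discount on minimum spend of Rs. 2,499/-\n"
    "Should Reply- Rs 2,499\n\n"
    "TnC- Amazing offer of 1+1 when you spend of Rs. 4500/-\n"
    "Should Reply- Rs 4500\n\n"
    "TnC- Get upto60% discount upto Rs 100\n"
    "Should Reply- NA\n\n"
)


def build_prompt(tnc_text):
    """Append one record's T&C text to the generic prompt."""
    return f"{BASE_PROMPT}TnC- {tnc_text.strip()}\nShould Reply-"


def call_openai(prompt, api_key):
    """POST the prompt to the completions endpoint and return the parsed JSON."""
    payload = json.dumps({
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": 20,   # assumption: the reply is only a short amount
        "temperature": 0,   # assumption: we want repeatable answers
    }).encode("utf-8")
    request = urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def extract_reply(response):
    """The model's answer sits in choices[0].text."""
    return response["choices"][0]["text"].strip()


def update_min_spend(conn, complete):
    """Fill the (hypothetical) min_spend column, one API call per offer."""
    rows = conn.execute("SELECT id, tnc_text FROM offers").fetchall()
    for offer_id, tnc_text in rows:
        reply = extract_reply(complete(build_prompt(tnc_text)))
        conn.execute(
            "UPDATE offers SET min_spend = ? WHERE id = ?", (reply, offer_id)
        )
    conn.commit()
    return len(rows)
```

Passing the API call in as `complete` (for example `lambda p: call_openai(p, api_key)`) keeps the database loop testable with a stub, which is handy for a dry run on a few records before spending tokens on the whole table.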

The final result

  • Got almost accurate results. For example, the model returned responses like “No minimum spend” instead of “NA”. Maybe a fine-tuned model will help here
  • Speed – Ran through the whole db across all the offers in no time.
  • Cost – this was a token-heavy process (as seen above – spending almost 140 tokens for each record). Most of the tokens are used by the prompt, and since each API call carries no context from earlier calls, the prompt has to be repeated in every call. Nonetheless, it took less than 0.01 USD for 50 calls (tested before running on the whole db). The cost may come down with a fine-tuned model
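
Short of fine-tuning, the “No minimum spend” vs “NA” mismatch could probably be handled with a small clean-up step before the db update. Here is a minimal sketch in Python, which I have not run against the real data; the list of "no minimum" phrasings and the function name `normalize_reply` are assumptions.

```python
import re

# Assumed phrasings that mean "no minimum spend" (NA, N/A, none, nil, no..., not...).
NO_MINIMUM_RE = re.compile(r"(?:n/?a\b|none\b|nil\b|no\b|not\b)", re.IGNORECASE)
AMOUNT_RE = re.compile(r"(?:Rs\.?|INR|₹)\s*([0-9][0-9,]*)", re.IGNORECASE)


def normalize_reply(reply):
    """Map a raw model reply to 'NA', 'Rs <amount>', or None if unrecognised."""
    text = reply.strip()
    # Check the "no minimum" phrasings first, so that a reply such as
    # "No minimum spend, discount upto Rs 100" is not read as an amount.
    if NO_MINIMUM_RE.match(text):
        return "NA"
    amount = AMOUNT_RE.search(text)
    if amount:
        return "Rs " + amount.group(1).replace(",", "")
    return None  # unrecognised reply: leave for manual review
```

Returning `None` for anything unrecognised keeps odd replies out of the new column, so they can be looked at by hand.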

Next Steps

  • Build a fine-tuned model and explore if the accuracy improves and the cost comes down significantly
  • Run similar models for extracting other information from the offer-tnc-text
  • Check if we can ask the model to return multiple pieces of info in one go. If yes, and the response comes back as text, how do we split the text and extract multiple fields (can we ask for a separator in the response)? This may be very efficient, as most of the tokens in each call are just for the prompt. Might as well get a richer response back.
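
On the last point, one option I may try is asking for a fixed separator and splitting on it. A rough, untested sketch in Python follows; the field names in `FIELDS`, the `|` separator, and both function names are hypothetical, and whether the model reliably sticks to the format is exactly what needs checking.

```python
# Hypothetical set of fields to extract in a single call.
FIELDS = ("min_spend", "max_discount", "expiry")
SEPARATOR = "|"


def build_multi_field_instruction():
    """Instruction text to add to the prompt so the reply is easy to split."""
    return (
        f"Reply on one line with exactly {len(FIELDS)} values separated by "
        f"'{SEPARATOR}', in this order: {', '.join(FIELDS)}. "
        "Use NA for any value not present in the TnC."
    )


def parse_multi_field_reply(reply):
    """Split the model's reply on the separator and label each value."""
    parts = [part.strip() for part in reply.strip().split(SEPARATOR)]
    if len(parts) != len(FIELDS):
        # The model did not follow the format; better to fail than to mislabel.
        raise ValueError(
            f"expected {len(FIELDS)} values, got {len(parts)}: {reply!r}"
        )
    return dict(zip(FIELDS, parts))
```

Rejecting replies with the wrong number of parts matters here, because a missing value would otherwise shift every later field into the wrong column.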

