LLM – Soko https://sokosolutions.com Innovation & Talent Tue, 10 Oct 2023 00:46:56 +0000
Corpus Analyzer https://sokosolutions.com/2023/09/28/corpus-analyzer/ Thu, 28 Sep 2023 23:58:37 +0000 https://sokosolutions.com/?p=1562
large language models

Corpus analyzer

The goal: Identify and summarize the contents of a set of PDF documents and present them in a conversational format.

Technical procedure

1. Document preprocessing:

Text is extracted from each uploaded file, skipping any media embedded in the document.

2. Splitting:

The text content is split into smaller chunks so that each piece keeps meaningful information and fits within the context window of the language model.
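
The splitting step can be sketched as follows. This is a minimal illustration, not the app's actual splitter: the function name `split_text`, the character-based sizes and the overlap between chunks are all assumptions. A context window is measured in tokens, so a real splitter would likely count tokens rather than characters.

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so text near a boundary
    appears in both neighbouring chunks. Sizes are hypothetical defaults.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

For example, `split_text("abcdefghij", chunk_size=4, overlap=1)` yields three chunks, each starting with the last character of the previous one.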

3. Document summarization:

Every set of chunks is recursively summarized using GPT-3.5 Turbo until a single summary remains. A prefix stating that the text is a summary is prepended to the result.
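
One way to picture the recursive summarization is the sketch below. It assumes a hypothetical `summarize` callable standing in for the GPT-3.5 Turbo call, plus an illustrative `group_size` and prefix text; none of these names come from the app itself.

```python
from typing import Callable


def summarize_recursively(
    chunks: list[str],
    summarize: Callable[[str], str],
    group_size: int = 4,
    prefix: str = "Summary: ",
) -> str:
    """Collapse chunks into one summary by summarizing groups level by level.

    `summarize` is a placeholder for the model call; `group_size` and
    `prefix` are hypothetical values chosen for illustration.
    """
    if not chunks:
        raise ValueError("at least one chunk is required")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    level = list(chunks)
    while True:
        # Summarize each group of neighbouring texts into a single text.
        level = [
            summarize("\n\n".join(level[i:i + group_size]))
            for i in range(0, len(level), group_size)
        ]
        if len(level) == 1:
            return prefix + level[0]
```

Because every pass shrinks the list by a factor of `group_size`, the number of model calls stays roughly proportional to the number of chunks.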

4. Vectorization:

Each chunk is vectorized using OpenAI’s embeddings to obtain a numerical representation that captures its semantics and main keywords.

5. Storage:

All chunks and their summaries are stored in Chroma. This vector database makes it easy to find content similar to a query.

6. Clustering tool:

Summaries are clustered to find the topics they have in common. First, features are extracted from the text using term frequency-inverse document frequency (TF-IDF); then truncated singular value decomposition (SVD) is applied to reduce the dimensionality. Clustering uses the DBSCAN algorithm. Once clusters are detected, a brief summary is built for each group, emphasizing what its members have in common. Along with these descriptions, the tool reports the number of documents and the document names in each category.
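
The sketch below illustrates the clustering idea in plain Python: TF-IDF features followed by DBSCAN with cosine distance. It is a simplified stand-in under stated assumptions. The truncated SVD step is left out for brevity, whitespace tokenization is used, and `eps` and `min_samples` are hypothetical values. The actual tool presumably relies on a library implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Return one unit-length sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF: rare terms weigh more, common terms keep a small weight.
        vec = {t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a: dict[str, float], b: dict[str, float]) -> float:
    # Vectors are unit length, so the dot product is the cosine similarity.
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def dbscan(vectors, eps: float = 0.8, min_samples: int = 2) -> list[int]:
    """Label each vector with a cluster id; -1 marks noise."""
    n = len(vectors)
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels
```

DBSCAN suits this task because the number of topics does not have to be fixed in advance, and summaries that match no group are reported as noise instead of being forced into a cluster.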

7. Agent:

A conversational agent is built on GPT-4 with access to the vector database and the clustering tool. It can formulate several internal questions or queries to compose an answer to a user's question.

User instructions

1. Document upload:

The user can upload the PDF documents to analyze in the “Input” box. If needed, more documents can be uploaded once the previous upload has completed.

2. Chatbot interaction:

After the documents are uploaded, the user can ask questions about them in natural language. The user can clear the conversation at any time and, optionally, clear the files as well to start over.

Download the example document

Documents certifying the establishment of companies in Chile, car financing promotions and resumes

Click here to upload the documents

Questions

  1. What are these documents about?
  2. How can these documents be grouped? Make a detailed list
  3. Reduce the amount of categories to three
  4. Which file names belong to each of these categories?
  5. What is the price for each car model offered?
  6. Which model has the lowest interest rate?
  7. What are the main differences between the developers?
  8. Make a brief description of Chahuan y Filippi Limitada
  9. Which Chilean company has the largest capital?

Web research https://sokosolutions.com/2023/09/28/web-research/ Thu, 28 Sep 2023 23:34:14 +0000 https://sokosolutions.com/?p=1550
large language models

Web research

The goal: Research the companies specified by the user and present the results as a report in table format. The information provided by the tool includes website address, company logo, funding, annual revenue, most-starred GitHub repository and a summary of the company's activities.

Technical procedure

1. Context retrieval:

The app performs multiple Google searches across specialized websites to gather the relevant data.

2. Document summarization:

Every chunk of the web page content is summarized using GPT-3.5 Turbo, then all the summaries are combined into a final summary.

3. Webpage rendering:

Each web page is rendered using Selenium to avoid losing information on JavaScript-rendered pages.

4. Data agents:

There are seven agents, each specialized in obtaining one piece of information (e.g. summary, logo, annual revenue). Each agent can run Google searches, render web pages and work out the correct answer over one or more iterations.

5. Generate response:

The user’s question and the retrieved context are sent as input to the GPT model to generate a response in natural language based on this input. The generated response is presented to the user through the chatbot interface.

User instructions

1. Chatbot interaction:

The user can give company names to the chatbot interface, and the app displays all the information it can find in table format. The user can also specify which data fields they need.

2. Batch request:

The user can provide an email address and a list of companies to process. Once processing has finished, the results are sent to the provided email address.

Questions

  1. Openai
  2. Openai, Flair, Facet ai
  3. Give me the annual revenue of Openai, Flair, Facet ai
  4. Give me the Logo and funding of Openai, Flair, Facet ai

Chat with your data https://sokosolutions.com/2023/09/28/chat-with-your-data-2/ Thu, 28 Sep 2023 22:55:01 +0000 https://sokosolutions.com/?p=1537
large language models

Chat with your data

The goal: Provide users with a convenient and interactive way to access information within PDF documents.

Technical procedure

Document preprocessing: When a user uploads a PDF document, the app preprocesses it to extract and structure the text content.

Chunking: The text content of the document is divided into smaller, manageable chunks. The chunk size multiplied by the number of relevant chunks selected must not exceed the maximum context window supported by the underlying language model (GPT-3.5).
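
The chunk budget is simple arithmetic, sketched below. The function name `max_chunks` and the `reserved_tokens` parameter (room kept for the question, the instructions and the answer) are assumptions added for illustration, as are the numbers in the usage example.

```python
def max_chunks(context_window: int, chunk_tokens: int, reserved_tokens: int = 0) -> int:
    """Return how many chunks of chunk_tokens fit in the context window.

    reserved_tokens is a hypothetical allowance for the prompt text and the
    generated answer, which share the same window as the retrieved chunks.
    """
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    available = context_window - reserved_tokens
    return max(available // chunk_tokens, 0)
```

With an illustrative window of 4,096 tokens, chunks of 500 tokens and 1,000 tokens reserved, `max_chunks(4096, 500, 1000)` returns 6, so at most six chunks can be passed as context.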

Vectorization: Each chunk of text is converted into a numerical vector representation using text embedding techniques. These vectors capture the semantic meaning of the text and allow for efficient searching and retrieval.

Storage: The vector representations of the text chunks are stored in a vector database. This database serves as a repository of contextual information from the documents.

Context retrieval: The app performs a similarity search within the database of vectorized documents to find the most relevant chunks based on the vector representation of the user’s question. These retrieved chunks serve as context for generating a relevant response.
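
As an illustration of the similarity search, the sketch below ranks stored chunks by cosine similarity to the question vector. It uses a plain in-memory list as a stand-in for the vector database; the names `cosine_similarity` and `top_k` and the toy two-dimensional vectors are assumptions, and real embeddings have many more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the texts of the k stored chunks most similar to the query.

    `store` holds (chunk_text, vector) pairs, a hypothetical in-memory
    replacement for the vector database.
    """
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy example with made-up vectors.
store = [
    ("capital", [1.0, 0.0]),
    ("notary", [0.0, 1.0]),
    ("partners", [0.7, 0.7]),
]
best = top_k([0.9, 0.1], store, k=2)
```

A vector database does the same ranking, but with indexes that avoid comparing the query against every stored vector.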

Chat history: Because the app is a chatbot, it saves the chat history and uses it to rewrite each new question as a standalone question that can be understood without the earlier conversation.
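
The rewriting of a follow-up into a standalone question can be sketched as below. The prompt wording, the `condense_question` name and the `llm` callable (a placeholder for the GPT call) are all assumptions; the app's real prompt is not shown in this post.

```python
from typing import Callable

# Hypothetical prompt template; the wording is illustrative only.
CONDENSE_TEMPLATE = (
    "Given the conversation below, rewrite the follow-up question so that it "
    "can be understood on its own.\n\n"
    "Conversation:\n{history}\n\n"
    "Follow-up question: {question}\n"
    "Standalone question:"
)


def condense_question(
    history: list[tuple[str, str]],
    question: str,
    llm: Callable[[str], str],
) -> str:
    """Turn a follow-up into a standalone question using the chat history.

    `history` is a list of (user_message, assistant_reply) pairs and `llm`
    is a placeholder for the model call.
    """
    if not history:
        # The first question has no earlier context to resolve.
        return question
    transcript = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in history)
    prompt = CONDENSE_TEMPLATE.format(history=transcript, question=question)
    return llm(prompt).strip()
```

The standalone question, rather than the raw follow-up, is then the text that gets vectorized for context retrieval, so a question such as "What is its capital?" still finds the right chunks.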

Generate response: The user’s question and the retrieved context are sent as input to the GPT model to generate a response in natural language based on this input. The generated response is presented to the user through the chatbot interface.

User instructions

1. Document upload:

The user begins by uploading one or more PDF documents into the app. These documents contain the information the user wants to access and query.

2. Chatbot interaction:

After the documents are uploaded, the user interacts with the chatbot interface provided by the app. The user can type questions in natural language to the chatbot. The user can continue to ask questions and receive responses, creating an iterative conversation with the chatbot.

Download the example document

Click here to upload the document

Questions

  1. What is the full name of the notary public who certified this document?
  2. Who are the individuals involved in establishing the 'CHAHUAN Y FILIPPI LIMITADA' company?
  3. What is the registered capital of 'CHAHUAN Y FILIPPI LIMITADA' and how was it contributed?
  4. What is the stated business objective or purpose of the company?
  5. Where is the registered office of the company?
  6. What is the extent of liability for the partners in this Limited Liability Company (Sociedad de Responsabilidad Limitada)?
  7. Who can administrate, represent, and use the company's business name according to the document?
  8. Is there a specified term for the existence of 'CHAHUAN Y FILIPPI LIMITADA'?
  9. What is the date of the document's certification?
