
updated readme to reflect current functionality

emcf committed Oct 30, 2024
1 parent 5183396 commit 59e50f4
Showing 2 changed files with 62 additions and 30 deletions.
84 changes: 58 additions & 26 deletions README.md
@@ -20,16 +20,16 @@
</a>
</div>

-### Extract clean markdown from PDFs, URLs, slides, videos, and more, ready for any LLM.
+### Extract clean data from tricky documents

-thepi.pe is a package that can scrape clean markdown and extract structured data from tricky sources, like PDFs. It uses vision-language models (VLMs) under the hood, and works out-of-the-box with any LLM, VLM, or vector database. It can be used right away on a [hosted cloud](https://thepi.pe), or it can be run locally.
+thepi.pe is a package that can scrape clean markdown or accurately extract structured data from complex documents. It uses vision-language models (VLMs) under the hood, and works out-of-the-box with any LLM, VLM, or vector database. It can be used right away on a [hosted cloud](https://thepi.pe), or it can be run locally.

## Features 🌟

- Scrape clean markdown, tables, and images from any document or webpage
- Works out-of-the-box with LLMs, vector databases, and RAG frameworks
- AI-native filetype detection, layout analysis, and structured data extraction
-- Accepts a wide range of sources, including Word docs, PowerPoints, Python notebooks, GitHub repos, videos, audio, and more
+- Accepts a wide range of sources, including PDFs, URLs, Word docs, PowerPoints, Python notebooks, GitHub repos, videos, audio, and more

## Get started in 5 minutes 🚀

@@ -43,6 +43,8 @@ pip install thepipe-api

You can get an API key by signing up for a free account at [thepi.pe](https://thepi.pe). It is completely free to try out. Then, simply set the `THEPIPE_API_KEY` environment variable to your API key.
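
For example, one way to set the key from Python (a minimal sketch; `your-api-key-here` is a placeholder, and exporting `THEPIPE_API_KEY` in your shell works just as well):

```python
import os

# Make the API key available to thepipe before any scraping calls
os.environ["THEPIPE_API_KEY"] = "your-api-key-here"
```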

## Scrape Function

```python
from thepipe.scraper import scrape_file
from thepipe.core import chunks_to_messages
@@ -59,6 +61,57 @@
response = client.chat.completions.create(
)
```
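
For reference, here is a complete version of this flow as a minimal sketch (the OpenAI client setup, the example file name, and the `gpt-4o-mini` model name are assumptions, not part of the original snippet):

```python
from openai import OpenAI

from thepipe.scraper import scrape_file
from thepipe.core import chunks_to_messages

# Scrape markdown, tables, and images from the document into chunks
chunks = scrape_file(filepath="example.pdf")

# Convert the chunks to multimodal messages and send them to a model
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=chunks_to_messages(chunks),
)
print(response.choices[0].message.content)
```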

The output from thepi.pe is a list of chunks containing all content within the source document. These chunks can easily be converted to a prompt format compatible with any LLM or multimodal model via `thepipe.core.chunks_to_messages`, which gives the following format:
```json
[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "..."
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "data:image/jpeg;base64,..."
        }
      }
    ]
  }
]
```

## Extract Function

The extract function allows you to extract structured data from documents. You can use it as follows:

```python
import json

from thepipe.extract import extract_from_chunk
from thepipe.scraper import scrape_file

# First, scrape the document
chunks = scrape_file(filepath="document.pdf", ai_extraction=True)

# Define your schema
schema = {
    "name": "string",
    "age": "int",
    "is_student": "bool"
}

# Extract data from each chunk
for chunk in chunks:
    result, tokens_used = extract_from_chunk(
        chunk=chunk,
        schema=json.dumps(schema),
        ai_model="gpt-4o-mini",
        multiple_extractions=True
    )
    print(result)
    print(f"Tokens used: {tokens_used}")
```
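
With `multiple_extractions=True`, results nest their matches under an `extraction` key (see the repository's tests below for the exact shape). A small aggregation loop, sketched here under that assumption about the result shape, flattens everything into one list of records:

```python
# Flatten every extraction from every chunk into a single list of records
all_records = []
for chunk in chunks:
    result, _ = extract_from_chunk(
        chunk=chunk,
        schema=json.dumps(schema),
        ai_model="gpt-4o-mini",
        multiple_extractions=True
    )
    # Each result item may nest its matches under the 'extraction' key
    items = result if isinstance(result, list) else [result]
    for item in items:
        all_records.extend(item.get("extraction", []))

print(f"Extracted {len(all_records)} records")
```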

### Local Installation (Python)

For a local installation, you can use the following command:
Expand Down Expand Up @@ -92,7 +145,7 @@ thepipe path/to/folder --include_regex .*\.tsx --local
| Source | Input types | Multimodal | Notes |
|--------------------------|----------------------------------------------------------------|---------------------|----------------------|
| Webpage | URLs starting with `http`, `https`, `ftp` | ✔️ | Scrapes markdown, images, and tables from web pages. `ai_extraction` available for AI content extraction from the webpage's screenshot |
-| PDF                       | `.pdf`                                                         | ✔️                  | Extracts page markdown and page images. `ai_extraction` available for AI layout analysis |
+| PDF                       | `.pdf`                                                         | ✔️                  | Extracts page markdown and page images. `ai_extraction` available to use a VLM for complex or scanned documents |
| Word Document | `.docx` | ✔️ | Extracts text, tables, and images |
| PowerPoint | `.pptx` | ✔️ | Extracts text and images from slides |
| Video | `.mp4`, `.mov`, `.wmv` | ✔️ | Uses Whisper for transcription and extracts frames |
@@ -109,28 +162,7 @@ thepipe path/to/folder --include_regex .*\.tsx --local

## How it works 🛠️

-thepi.pe uses computer vision models and heuristics to extract clean content from the source and process it for downstream use with [language models](https://en.wikipedia.org/wiki/Large_language_model), or [vision transformers](https://en.wikipedia.org/wiki/Vision_transformer). The output from thepi.pe is a list of chunks containing all content within the source document. These chunks can easily be converted to a prompt format that is compatible with any LLM or multimodal model with `thepipe.core.chunks_to_messages`, which gives the following format:
-```json
-[
-  {
-    "role": "user",
-    "content": [
-      {
-        "type": "text",
-        "text": "..."
-      },
-      {
-        "type": "image_url",
-        "image_url": {
-          "url": "data:image/jpeg;base64,..."
-        }
-      }
-    ]
-  }
-]
-```
-
-You can feed these messages directly into the model, or alternatively you can use `chunker.chunk_by_document`, `chunker.chunk_by_page`, `chunker.chunk_by_section`, `chunker.chunk_semantic` to chunk these messages for a vector database such as ChromaDB or a RAG framework. A chunk can be converted to LlamaIndex Document/ImageDocument with `.to_llamaindex`.
+thepi.pe uses computer vision models and heuristics to extract clean content from the source and process it for downstream use with [language models](https://en.wikipedia.org/wiki/Large_language_model), or [vision transformers](https://en.wikipedia.org/wiki/Vision_transformer). You can feed these messages directly into the model, or alternatively you can use `chunker.chunk_by_document`, `chunker.chunk_by_page`, `chunker.chunk_by_section`, `chunker.chunk_semantic` to chunk these messages for a vector database such as ChromaDB or a RAG framework. A chunk can be converted to LlamaIndex Document/ImageDocument with `.to_llamaindex`.
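
As a rough sketch of that retrieval path (the function and module names follow the README above; whether `chunk_by_page` accepts the scraped chunk list, and whether `.to_llamaindex` returns a list, are assumptions):

```python
from thepipe.scraper import scrape_file
from thepipe.chunker import chunk_by_page

chunks = scrape_file(filepath="example.pdf")

# Re-chunk the scraped content page-by-page for retrieval granularity
pages = chunk_by_page(chunks)

# Convert each chunk to LlamaIndex Document/ImageDocument objects for indexing
documents = []
for page in pages:
    documents.extend(page.to_llamaindex())
```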

> ⚠️ **It is important to be mindful of your model's token limit.**
GPT-4o does not work with too many images in the prompt (see discussion [here](https://community.openai.com/t/gpt-4-vision-maximum-amount-of-images/573110/6)). To remedy this issue, either use an LLM with a larger context window, extract larger documents with `text_only=True`, or embed the chunks into a vector database.
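
For instance, a text-only scrape sidesteps the image limit entirely (a sketch; this assumes the `text_only` flag is passed to `scrape_file`):

```python
# Skip images so large documents stay within the model's context window
chunks = scrape_file(filepath="large_document.pdf", text_only=True)
```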
8 changes: 4 additions & 4 deletions tests/test_api.py
@@ -34,13 +34,13 @@ def test_extract_from_file_with_multiple_extractions(self):
        self.assertIsInstance(result, list)
        self.assertGreater(len(result), 0)
        # Check if the extracted data matches the schema
-        # since multiple extractions is enabled, we have the 'extractions' key for each chunk
+        # since multiple extractions is enabled, we have the 'extraction' key for each chunk
        # containing all the extractions.
        # the result looks like: [{'chunk_index': 0, 'source': 'example.pdf', 'extraction': [{'document_topic': 'Density PDFs in Supersonic Turbulence', 'document_sentiment': None}]}]
        for item in result:
            self.assertIsInstance(item, dict)
-            if 'extractions' in item:
-                for extraction in item['extractions']:
+            if 'extraction' in item:
+                for extraction in item['extraction']:
                    self.assertIsInstance(extraction, dict)
                    for key in self.schema:
                        self.assertIn(key, extraction)
@@ -70,7 +70,7 @@ def test_extract_from_url_with_one_extraction(self):
        self.assertGreater(len(result), 0)
        # Check if the extracted data matches the schema
-        # since multiple extractions is disabled, we don't have the 'extractions' key for each chunk
+        # since multiple extractions is disabled, we don't have the 'extraction' key for each chunk
        # [{'chunk_index': 0, 'source': 'https://thepi.pe/', 'document_topic': 'AI document extraction and data processing', 'document_sentiment': 0.8}]
        for item in result:
            self.assertIsInstance(item, dict)
