update documentation

raphaelmansuy · Jun 28, 2024 · 256133a
1 parent 31b4123
commit 256133a
Showing 4 changed files with 508 additions and 38 deletions.
diff --git a/README.md b/README.md
@@ -5,7 +5,7 @@ Code2Prompt is a powerful command-line tool that simplifies the process of provi
 
 With Code2Prompt, you can easily create a well-structured and informative document that serves as a valuable resource for feeding questions to LLMs, enabling them to better understand and assist with your code-related queries.
 
-![Illustration](./docs/code2Prompt.jpg)
+![](./docs/code2Prompt.jpg)
 
 ## Features
 
@@ -17,13 +17,14 @@ With Code2Prompt, you can easily create a well-structured and informative docume
 - Optionally strips comments from code files to focus on the core code
 - Includes the actual code content of each file in fenced code blocks
 - Handles binary files and files with encoding issues gracefully
+- Supports custom Jinja2 templates for flexible output formatting
+- Offers token counting functionality for generated prompts
 
 ## How It Works
 
 The following diagram illustrates the high-level workflow of Code2Prompt:
 
-![Diagram](./docs/code2prompt.process.excalidraw.png)
-
+Diagram
 
 1. The tool starts by parsing the command-line options provided by the user.
 2. It then parses the .gitignore file (if specified) to obtain a set of patterns for excluding files and directories.
@@ -36,8 +37,39 @@ The following diagram illustrates the high-level workflow of Code2Prompt:
 9. The file summary, code block, and metadata are appended to the Markdown content.
 10. Steps 4-9 are repeated for each file in the directory and its subdirectories.
 11. After processing all files, the tool generates a table of contents based on the file paths.
-12. If an output file is specified, the generated Markdown content is written to the file. Otherwise, it is printed to the console.
-13. The tool ends its execution.
+12. If a custom template is provided, the tool processes the template with the collected data.
+13. If token counting is enabled, the tool counts the tokens in the generated content.
+14. If an output file is specified, the generated Markdown content is written to the file. Otherwise, it is printed to the console.
+15. The tool ends its execution.
+
+## Project Structure
+
+The Code2Prompt project is organized as follows:
+
+- `code2prompt/`: Main package directory
+  - `__init__.py`: Package initialization
+  - `main.py`: Entry point of the application
+  - `process_file.py`: File processing logic
+  - `template_processor.py`: Custom template processing
+  - `write_output.py`: Output writing functionality
+  - `utils/`: Utility functions
+    - `add_line_numbers.py`: Function to add line numbers to code
+    - `generate_markdown_content.py`: Markdown content generation
+    - `is_binary.py`: Binary file detection
+    - `is_filtered.py`: File filtering logic
+    - `is_ignored.py`: Gitignore pattern matching
+    - `language_inference.py`: Programming language inference
+    - `parse_gitignore.py`: Gitignore file parsing
+  - `comment_stripper/`: Comment removal functionality
+    - `__init__.py`: Subpackage initialization
+    - `strip_comments.py`: Main comment stripping logic
+    - `c_style.py`: C-style comment removal
+    - `html_style.py`: HTML-style comment removal
+    - `python_style.py`: Python-style comment removal
+    - `r_style.py`: R-style comment removal
+    - `shell_style.py`: Shell-style comment removal
+    - `sql_style.py`: SQL-style comment removal
+    - `matlab_style.py`: MATLAB-style comment removal
 
 ## Installation
 
@@ -99,11 +131,20 @@ To generate a Markdown file with the content of your codebase, use the following
 code2prompt --path /path/to/your/codebase --output output.md
 ```
 
-- `--path` (required): Path to the directory containing your codebase.
-- `--output` (optional): Name of the output Markdown file. If not provided, the output will be displayed in the console.
-- `--gitignore` (optional): Path to a custom .gitignore file. If not provided, the tool will look for a .gitignore file in the specified directory.
-- `--filter` (optional): Filter pattern to include specific files (e.g., "*.py" to include only Python files).
-- `--suppress-comments` (optional): Strip comments from the code files. If not provided, comments will be included.
+### Command-line Options
+
+- `--path` or `-p` (required): Path to the directory containing your codebase.
+- `--output` or `-o` (optional): Name of the output Markdown file. If not provided, the output will be displayed in the console.
+- `--gitignore` or `-g` (optional): Path to a custom .gitignore file. If not provided, the tool will look for a .gitignore file in the specified directory.
+- `--filter` or `-f` (optional): Comma-separated filter patterns to include specific files (e.g., "*.py,*.js" to include only Python and JavaScript files).
+- `--exclude` or `-e` (optional): Comma-separated patterns to exclude files (e.g., "*.txt,*.md" to exclude text and Markdown files).
+- `--case-sensitive` (optional): Perform case-sensitive pattern matching.
+- `--suppress-comments` or `-s` (optional): Strip comments from the code files. If not provided, comments will be included.
+- `--line-number` or `-ln` (optional): Add line numbers to source code blocks.
+- `--no-codeblock` (optional): Disable wrapping code inside markdown code blocks.
+- `--template` or `-t` (optional): Path to a Jinja2 template file for custom prompt generation.
+- `--tokens` (optional): Display the token count of the generated prompt.
+- `--encoding` (optional): Specify the tokenizer encoding to use (default: 'cl100k_base').
 
 ### Examples
 
@@ -127,6 +168,110 @@ code2prompt --path /path/to/your/codebase --output output.md
    code2prompt --path /path/to/your/project --output project.md --suppress-comments
    ```
 
+5. Generate a Markdown file using a custom template:
+   ```
+   code2prompt --path /path/to/your/project --output project.md --template /path/to/custom/template.jinja2
+   ```
+
+6. Generate a Markdown file and display token count:
+   ```
+   code2prompt --path /path/to/your/project --output project.md --tokens
+   ```
+
+## Templating System
+
+Code2Prompt includes a powerful templating system that allows you to customize the output format using Jinja2 templates. This feature provides flexibility in generating prompts tailored to specific use cases or LLM requirements.
+
+### How It Works
+
+1. **Template Loading**: When you specify a template file using the `--template` option, Code2Prompt loads the Jinja2 template from the specified file.
+
+2. **Variable Extraction**: The system extracts user-defined variables from the template. These are placeholders in the template that you want to fill with custom values.
+
+3. **User Input**: For each extracted variable, Code2Prompt prompts the user to enter a value.
+
+4. **Data Preparation**: The system prepares a context dictionary containing:
+   - `files`: A list of dictionaries, each representing a processed file with its metadata and content.
+   - User-defined variables and their input values.
+
+5. **Template Rendering**: The Jinja2 template is rendered using the prepared context, producing the final output.
+
+### Example
+
+Let's say you have a template file named `custom_prompt.jinja2` with the following content:
+
+````jinja2
+You are a {{ role }} tasked with analyzing the following codebase:
+
+{% for file in files %}
+## File: {{ file.path }}
+Language: {{ file.language }}
+Content:
+```{{ file.language }}
+{{ file.content }}
+```
+
+{% endfor %}
+
+Based on this codebase, please {{ task }}.
+````
+
+You can use this template with Code2Prompt as follows:
+
+```bash
+code2prompt --path /path/to/your/project --template custom_prompt.jinja2
+```
+
+When you run this command, Code2Prompt will:
+
+1. Load the `custom_prompt.jinja2` template.
+2. Detect the user-defined variables: `role` and `task`.
+3. Prompt you to enter values for these variables:
+   ```
+   Enter value for role: senior software engineer
+   Enter value for task: identify potential security vulnerabilities
+   ```
+4. Process the files in the specified path.
+5. Render the template with the file data and user inputs.
+
+The resulting output might look like this:
+
+````
+You are a senior software engineer tasked with analyzing the following codebase:
+
+## File: /path/to/your/project/main.py
+Language: python
+Content:
+```python
+import os
+
+def read_sensitive_file(filename):
+    with open(filename, 'r') as f:
+        return f.read()
+
+secret = read_sensitive_file('secret.txt')
+print(f"The secret is: {secret}")
+```
+
+## File: /path/to/your/project/utils.py
+Language: python
+Content:
+```python
+import base64
+
+def encode_data(data):
+    return base64.b64encode(data.encode()).decode()
+
+def decode_data(encoded_data):
+    return base64.b64decode(encoded_data).decode()
+```
+
+Based on this codebase, please identify potential security vulnerabilities.
+````
+
+This templating system allows you to create custom prompts that can be easily adapted for different analysis tasks, code review scenarios, or any other purpose where you need to present code to an LLM in a structured format.
+
+
 ## Build
 
 To build a distributable package of Code2Prompt using Poetry, follow these steps:
@@ -161,4 +306,4 @@ Code2Prompt was inspired by the need to provide better context to LLMs when aski
 
 If you have any questions or need further assistance, please don't hesitate to reach out. Happy coding!
 
-Made with ❤️ by Raphël MANSUY.
+Made with ❤️ by Raphaël MANSUY
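
Note on the templating flow described in the README changes (load template, extract user-defined variables, render with `files` plus user inputs): the steps can be sketched roughly as below. This is a minimal illustration and not the code in `template_processor.py`, which is not part of this diff. The helper names `extract_user_variables` and `render`, the shortened template text, and the choice to treat `files` as the only reserved variable are all assumptions made for the example. The interactive prompt is replaced by a plain dictionary.

```python
from jinja2 import Environment, meta

# Hypothetical template text, shortened from the README example.
TEMPLATE = """You are a {{ role }} reviewing this codebase:
{% for file in files %}
## File: {{ file.path }}
{{ file.content }}
{% endfor %}
Please {{ task }}."""


def extract_user_variables(source, reserved=("files",)):
    """Return the variables the user must supply, sorted by name.

    Assumption: anything undeclared in the template, other than the
    reserved names filled in by the tool, is a user-defined variable.
    """
    env = Environment()
    ast = env.parse(source)
    undeclared = meta.find_undeclared_variables(ast)
    return sorted(undeclared - set(reserved))


def render(source, files, user_inputs):
    """Render the template with the file data and the user's answers."""
    env = Environment()
    return env.from_string(source).render(files=files, **user_inputs)


variables = extract_user_variables(TEMPLATE)
files = [{"path": "main.py", "content": "print('hi')"}]
# In the real tool these values would come from interactive prompts.
inputs = {"role": "senior engineer", "task": "list the risks"}
output = render(TEMPLATE, files, inputs)
```

Here `variables` holds the names a user would be asked about, and `output` is the rendered prompt text.
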
diff --git a/code2prompt/main.py b/code2prompt/main.py
@@ -1,7 +1,6 @@
 import click
 from pathlib import Path
-from jinja2 import Template, Environment, FileSystemLoader
-from prompt_toolkit import prompt
+import tiktoken
 from code2prompt.utils.is_binary import is_binary
 from code2prompt.utils.generate_markdown_content import generate_markdown_content
 from code2prompt.utils.is_filtered import is_filtered
@@ -22,6 +21,9 @@
 @click.option("--line-number", "-ln", is_flag=True, help="Add line numbers to source code blocks.", default=False)
 @click.option("--no-codeblock", is_flag=True, help="Disable wrapping code inside markdown code blocks.")
 @click.option("--template", "-t", type=click.Path(exists=True), help="Path to a Jinja2 template file for custom prompt generation.")
+@click.option("--tokens", is_flag=True, help="Display the token count of the generated prompt.")
+@click.option("--encoding", type=click.Choice(['cl100k_base', 'p50k_base', 'p50k_edit', 'r50k_base']),
+              default='cl100k_base', help="Specify the tokenizer encoding to use.")
 def create_markdown_file(**options):
     """
     Creates a Markdown file based on the provided options.
@@ -33,54 +35,59 @@ def create_markdown_file(**options):
 
     Args:
         **options (dict): Key-value pairs of options to customize the behavior of the function.
-            Possible keys include 'path', 'output', 'gitignore', 'filter', 'exclude',
-            'case_sensitive', 'suppress_comments', 'line_number', 'no_codeblock', and 'template'.
+                          Possible keys include 'path', 'output', 'gitignore', 'filter', 'exclude',
+                          'case_sensitive', 'suppress_comments', 'line_number', 'no_codeblock',
+                          'template', 'tokens', and 'encoding'.
 
     Returns:
         None
     """
     files_data = process_files(options)
     content = generate_content(files_data, options)
+
+    if options['tokens']:
+        token_count = count_tokens(content, options['encoding'])
+        click.echo(f"Token count: {token_count}")
+
     write_output(content, options['output'])
 
 def process_files(options):
     """
-    Processes files within a specified directory, applying filters and transformations based on the provided options.
+    Processes files within a specified directory, applying filters and transformations
+    based on the provided options.
 
     Args:
-        options (dict): A dictionary containing options such as path, gitignore patterns, and flags for processing files.
+        options (dict): A dictionary containing options such as path, gitignore patterns,
+                        and flags for processing files.
 
     Returns:
         list: A list of dictionaries containing processed file data.
     """
     path = Path(options['path'])
     gitignore_patterns = get_gitignore_patterns(path, options['gitignore'])
-
-
     files_data = []
     for file_path in path.rglob("*"):
         if should_process_file(file_path, gitignore_patterns, path, options):
             result = process_file(file_path, options['suppress_comments'], options['line_number'], options['no_codeblock'])
             if result:
                 files_data.append(result)
-
-
     return files_data
 
 def get_gitignore_patterns(path, gitignore):
     """
-    Retrieve gitignore patterns from a specified path or a default.gitignore file.
+    Retrieve gitignore patterns from a specified path or a default .gitignore file.
 
-    This function reads the.gitignore file located at the specified path or uses the default
-   .gitignore file in the project root if no specific path is provided. It then parses the file
-    to extract ignore patterns and adds a default pattern to ignore the.git directory itself.
+    This function reads the .gitignore file located at the specified path or uses
+    the default .gitignore file in the project root if no specific path is provided.
+    It then parses the file to extract ignore patterns and adds a default pattern
+    to ignore the .git directory itself.
 
-    Parameters:
-    - path (Path): The root path of the project where the default.gitignore file is located.
-    - gitignore (Optional[str]): An optional path to a specific.gitignore file to use instead of the default.
+    Args:
+        path (Path): The root path of the project where the default .gitignore file is located.
+        gitignore (Optional[str]): An optional path to a specific .gitignore file to use instead of the default.
 
     Returns:
-    - Set[str]: A set of gitignore patterns extracted from the.gitignore file.
+        Set[str]: A set of gitignore patterns extracted from the .gitignore file.
     """
     gitignore_path = Path(gitignore) if gitignore else path / ".gitignore"
     patterns = parse_gitignore(gitignore_path)
@@ -116,24 +123,42 @@ def generate_content(files_data, options):
     Generate content based on the provided files data and options.
 
     This function either processes a Jinja2 template with the given files data and user inputs
-    or generates markdown content directly from the files data, depending on whether a template
-    option is provided.
+    or generates markdown content directly from the files data, depending on whether a
+    template option is provided.
 
     Args:
         files_data (list): A list of dictionaries containing processed file data.
-        options (dict): A dictionary containing options such as template path and whether to wrap
-                        code inside markdown code blocks.
+        options (dict): A dictionary containing options such as template path and whether
+                        to wrap code inside markdown code blocks.
 
     Returns:
-        str: The generated content as a string, either from processing a template or directly
-             generating markdown content.
+        str: The generated content as a string, either from processing a template or
+             directly generating markdown content.
     """
     if options['template']:
         template_content = load_template(options['template'])
         user_inputs = get_user_inputs(template_content)
         return process_template(template_content, files_data, user_inputs)
     return generate_markdown_content(files_data, options['no_codeblock'])
 
+def count_tokens(text: str, encoding: str) -> int:
+    """
+    Count the number of tokens in the given text using the specified encoding.
+
+    Args:
+        text (str): The text to tokenize and count.
+        encoding (str): The encoding to use for tokenization.
+
+    Returns:
+        int: The number of tokens in the text.
+    """
+    try:
+        encoder = tiktoken.get_encoding(encoding)
+        return len(encoder.encode(text))
+    except Exception as e:
+        click.echo(f"Error counting tokens: {str(e)}", err=True)
+        return 0
+
 if __name__ == "__main__":
     # pylint: disable=no-value-for-parameter
-    create_markdown_file()
+    create_markdown_file()
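
Note on the `--filter`, `--exclude`, and `--case-sensitive` options documented in the README changes: the diff does not show how `should_process_file` or `is_filtered` apply them, so the following is only a sketch of one plausible reading, using the standard-library `fnmatch` module. The function names `matches_any` and `is_selected`, matching on the file name rather than the full path, and letting exclude patterns take precedence over include patterns are assumptions, not the project's confirmed behavior.

```python
from fnmatch import fnmatchcase
from pathlib import PurePath


def split_patterns(raw):
    """Split a comma-separated option value into clean patterns."""
    return [p.strip() for p in raw.split(",") if p.strip()]


def matches_any(name, patterns, case_sensitive):
    """Return True if the name matches at least one glob pattern."""
    if not case_sensitive:
        # Lower-case both sides so matching is platform independent.
        name = name.lower()
        patterns = [p.lower() for p in patterns]
    return any(fnmatchcase(name, p) for p in patterns)


def is_selected(path, include="", exclude="", case_sensitive=False):
    """Decide whether a file passes the include and exclude filters.

    Assumption: exclude wins over include, and an empty include
    list means every file is a candidate.
    """
    name = PurePath(path).name
    include_patterns = split_patterns(include)
    exclude_patterns = split_patterns(exclude)
    if exclude_patterns and matches_any(name, exclude_patterns, case_sensitive):
        return False
    if include_patterns:
        return matches_any(name, include_patterns, case_sensitive)
    return True
```

For example, `is_selected("src/app.py", include="*.py,*.js")` keeps the file, while `is_selected("notes.TXT", exclude="*.txt")` drops it unless `case_sensitive=True` is passed.
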