Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* First Website Draft
* Implemented Feedback from supervisor
* Logo Upload
* Edits for Release 2
* Removed Tech Stack
* Unnecessary files removed
* Revert submodule version
* compressed images
* Rename Filenames
* Final Website Version
* Picture resize
Showing 13 changed files with 91 additions and 84 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
+++ | ||
title = "Challenges" | ||
weight = 20 | ||
draft = false | ||
+++ | ||
|
||
|
||
{{<section title="Large Amounts of Data">}} | ||
According to the Federal Office of Administration, there are a total of **966 federal authorities and institutions in Germany** ([Statista, 20.07.2024](https://de.statista.com/statistik/daten/studie/1128113/umfrage/bundesbehoerden-in-deutschland-nach-behoerdenart/)). The actual number of authorities is higher, because these statistics do not include offices at state level. One of the challenges of the project was to create a tool that fits as many different organizational charts as possible, in order to capture as much data as feasible. | ||
{{</section>}} | ||
|
||
{{<section title="Extracting Text from PDF">}} | ||
PDF is a format for displaying documents, not for storing structured data. Extracting the required data from these documents is therefore a challenge: extracted text is unstructured and often out of order, and semantic information is missing entirely. There are also many **edge cases**, which makes it difficult to write an algorithm that handles all of them and extracts the text in a meaningful way. | ||
{{</section>}} | ||
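To illustrate the ordering problem, here is a minimal sketch that restores reading order from word boxes. It assumes input in the tuple shape that PyMuPDF's `page.get_text("words")` returns, `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. The sample data, the function name `group_into_lines`, and the `y_tolerance` value are illustrative assumptions, not part of the project's code.

```python
# Hypothetical sample in the shape of PyMuPDF's page.get_text("words") output:
# (x0, y0, x1, y1, word, block_no, line_no, word_no).
# The words arrive out of reading order, as often happens with real PDFs.
words = [
    (120.0, 50.0, 160.0, 62.0, "Finance", 1, 0, 1),
    (40.0, 80.2, 70.0, 92.0, "Unit", 0, 0, 0),
    (40.0, 50.3, 115.0, 62.0, "Department", 1, 0, 0),
    (75.0, 80.0, 90.0, 92.0, "Z1", 0, 0, 1),
]


def group_into_lines(words, y_tolerance=3.0):
    """Group words into visual lines by top coordinate, then sort left to right.

    y_tolerance is an assumed threshold in PDF points; words whose top edge
    differs by at most this much from the first word of a line share that line.
    """
    lines = []  # list of (reference_y, [word tuples])
    for word in sorted(words, key=lambda w: (w[1], w[0])):
        if lines and abs(word[1] - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(word)
        else:
            lines.append((word[1], [word]))
    return [
        " ".join(w[4] for w in sorted(line_words, key=lambda w: w[0]))
        for _, line_words in lines
    ]


print(group_into_lines(words))
```

A fixed tolerance like this breaks down on multi-column layouts and rotated text, which is exactly the kind of edge case described above.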
|
||
{{<section title="No Content Standards">}} | ||
Every organizational chart is different. Not only do the people who work in different departments differ, but so do their job titles. Even equivalent positions may have different titles in different authorities. For example, it is not a given that the Ministry of Education in Berlin uses the same job titles as its counterpart in Baden-Württemberg. Abbreviations are also common, but they are not standardized either and may vary from authority to authority. | ||
{{</section>}} | ||
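One way to cope with varying abbreviations is a lookup table that maps them to canonical terms. The sketch below is only an illustration: the table entries and the function name `expand_abbreviations` are assumptions, and a real table would have to be maintained per authority.

```python
# Illustrative mapping only (assumed entries); real charts need a much larger
# table, possibly one per authority.
ABBREVIATIONS = {
    "abt.": "Abteilung",
    "ref.": "Referat",
    "al": "Abteilungsleitung",
    "rl": "Referatsleitung",
}


def expand_abbreviations(text, table=ABBREVIATIONS):
    """Replace known abbreviations token by token; unknown tokens stay as-is."""
    return " ".join(table.get(token.lower(), token) for token in text.split())


print(expand_abbreviations("Abt. Z Ref. 12"))
```

Passing a different `table` per authority keeps the matching logic unchanged while the vocabulary varies.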
|
||
{{<section title="Semantic Analysis">}} | ||
The challenge for semantic analysis is that the text in organizational charts is not natural language, so classic natural language processing (NLP) models alone are not sufficient. NLP models can extract meaning from coherent sentences, but not from isolated words. Large language models (LLMs) are needed for that. Therefore, a **combination of LLMs and NLP models** was used. | ||
{{</section>}} | ||
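The text above does not describe how the work is split between the two kinds of models. One possible arrangement, sketched here purely as an assumption, is to try a cheap NLP step first and fall back to an LLM only when it gives no answer. All names (`classify_fragment`, `toy_nlp`, `toy_llm`) are hypothetical stand-ins; a real setup might wrap a spaCy pipeline and an LLM call behind the same two callables.

```python
from typing import Callable, Optional


def classify_fragment(
    fragment: str,
    nlp_classifier: Callable[[str], Optional[str]],
    llm_classifier: Callable[[str], str],
) -> str:
    """Return a label for a text fragment, trying the cheap NLP step first."""
    label = nlp_classifier(fragment)
    if label is not None:
        return label
    # The NLP step could not decide, so defer to the (more expensive) LLM.
    return llm_classifier(fragment)


# Stand-ins for demonstration only; they do not call any real model.
def toy_nlp(fragment: str) -> Optional[str]:
    """Label fragments that start with an academic title as a person."""
    return "person" if fragment.startswith(("Dr.", "Prof.")) else None


def toy_llm(fragment: str) -> str:
    """Placeholder for an LLM call; always answers 'unknown' here."""
    return "unknown"


print(classify_fragment("Dr. Example", toy_nlp, toy_llm))
print(classify_fragment("Z 12", toy_nlp, toy_llm))
```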
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,22 +1,24 @@ | ||
+++ | ||
title = "Features" | ||
weight = 20 | ||
draft = true | ||
weight = 30 | ||
draft = false | ||
+++ | ||
|
||
{{<section title="Features">}} | ||
|
||
* **Roles** | ||
|
||
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla facilisis neque id vulputate malesuada. Quisque dignissim finibus urna sed sagittis. | ||
{{<section title="Python Library">}} | ||
The library contains primitives for extracting content from PDFs and for detecting shapes and words. It also includes semantic analysis of organizational charts. The library can be used in your own Python projects and customized for your organizational charts by providing your own datasets. | ||
{{</section>}} | ||
|
||
{{<section title="Command Line Tool">}} | ||
The command line tool converts PDF files into JSON files. It offers a simple interface for quickly extracting data from one or multiple files. | ||
{{</section>}} | ||
|
||
* **Game-Flow** | ||
{{<section title="LLM Integration">}} | ||
The integration of various large language models (LLMs) enables advanced text analysis. For our demo, we used an LLM from OpenAI, but any supported LLM can be integrated to perform the text analysis. | ||
|
||
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla facilisis neque id vulputate malesuada. Quisque dignissim finibus urna sed sagittis. | ||
{{</section>}} | ||
|
||
* **Voting Systems** | ||
{{<section title="Web Interface">}} | ||
A web interface has been created to visualize the results of the text extraction. At the moment it serves demonstration purposes only, but it could be developed into a user-friendly web view in the future. | ||
{{</section>}} | ||
|
||
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla facilisis neque id vulputate malesuada. Quisque dignissim finibus urna sed sagittis. | ||
|
||
{{</section>}} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
+++ | ||
title = "Future" | ||
weight = 50 | ||
draft = false | ||
+++ | ||
|
||
{{<section title="Open Data Formats">}} | ||
For now, the PDF files are only converted into JSON files. As the project develops further, support for more open data formats (RDF, CSV, etc.) could open up additional use cases. | ||
{{</section>}} | ||
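As a rough idea of what CSV support could look like, the following sketch converts a JSON array of flat records into CSV using only the Python standard library. The record fields (`name`, `position`, `unit`) and the function name `json_to_csv` are hypothetical; the actual JSON output of the tool may be structured differently.

```python
import csv
import io
import json

# Hypothetical extraction result; the field names are illustrative only.
extracted_json = """
[
  {"name": "Example Person A", "position": "Head of Department", "unit": "Z"},
  {"name": "Example Person B", "position": "Head of Unit", "unit": "Z 1"}
]
"""


def json_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects into CSV text with a header row."""
    records = json.loads(json_text)
    # Collect every key that occurs so records with missing fields still fit.
    fieldnames = sorted({key for record in records for key in record})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


print(json_to_csv(extracted_json))
```

Nested JSON would need flattening first, which is one reason a graph-oriented format such as RDF may suit hierarchical data better than CSV.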
|
||
{{<section title="Integration with External Tool">}} | ||
The Technologiestiftung Berlin has developed an [organizational chart tool](https://organigramme.odis-berlin.de/) that makes it easy to create organizational charts in a browser. The foundation has also defined a JSON format called orgjson. Files in this format can be uploaded via the web interface and are then rendered as an organizational chart. In the future, orgXtract could be linked to this tool by outputting the extracted data in orgjson format. | ||
More information about the project can be found here: https://odis-berlin.de/projekte/2023-07-organigramm-tool/. | ||
{{</section>}} | ||
|
||
{{<section title="Recognition of Hierarchies">}} | ||
The tool currently focuses on extracting the content of individual organization nodes. A future step could be to link those nodes by recognizing the hierarchy. This was not a priority for the current project, since the research of the Open Knowledge Foundation focuses on people and their positions rather than on the hierarchical structure of the document. | ||
{{</section>}} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,72 +1,40 @@ | ||
+++ | ||
title = "Tech Stack" | ||
weight = 30 | ||
draft = true | ||
hasMermaid = true | ||
weight = 40 | ||
draft = false | ||
+++ | ||
|
||
{{<section title="Tech Stack">}} | ||
Hugo offers a couple of options to include diagrams right in the source code, see | ||
[https://gohugo.io/content-management/diagrams/](https://gohugo.io/content-management/diagrams/) | ||
{{<image src="techstack_logos.png" alt="Techstack">}} | ||
|
||
Here's a mermaid example: (note the hasMermaid = true parameter in the front matter!) | ||
|
||
```mermaid | ||
classDiagram | ||
Animal <|-- Duck | ||
Animal <|-- Fish | ||
Animal <|-- Zebra | ||
Animal : +int age | ||
Animal : +String gender | ||
Animal: +isMammal() | ||
Animal: +mate() | ||
class Duck{ | ||
+String beakColor | ||
+swim() | ||
+quack() | ||
} | ||
class Fish{ | ||
-int sizeInFeet | ||
-canEat() | ||
} | ||
class Zebra{ | ||
+bool is_wild | ||
+run() | ||
} | ||
``` | ||
{{<section title="Organization">}} | ||
* **[GitHub](https://github.com/FDS-HTW-2024/fds_orgchart)** for code collaboration and distribution | ||
* **[Trello](https://trello.com/)** for task organization | ||
* **[Figma](https://www.figma.com/)** for visualization prototyping | ||
{{</section>}} | ||
|
||
{{<section title="Future">}} | ||
{{<section title="Core">}} | ||
* **[Python](https://www.python.org/)** | ||
|
||
Python was selected due to its status as the most prominent programming language for data science. Given the extensive range of available packages, it proved to be the most effective language for our problem. | ||
|
||
* **[PyMuPDF](https://pymupdf.readthedocs.io/en/latest/)** | ||
|
||
The PyMuPDF library was used to extract data from PDF documents, as it is the best-performing Python library for this purpose. | ||
* **[spaCy](https://spacy.io/)** | ||
|
||
spaCy is a high-quality package for natural language processing tasks. We chose spaCy because it provides many components that work without much configuration. | ||
* **[Python LLM](https://llm.datasette.io/en/stable/)** | ||
|
||
```goat | ||
+-------------------+ ^ .---. | ||
| A Box |__.--.__ __.--> | .-. | | | ||
| | '--' v | * |<--- | | | ||
+-------------------+ '-' | | | ||
Round *---(-. | | ||
.-----------------. .-------. .----------. .-------. | | | | ||
| Mixed Rounded | | | / Diagonals \ | | | | | | | ||
| & Square Corners | '--. .--' / \ |---+---| '-)-' .--------. | ||
'--+------------+-' .--. | '-------+--------' | | | | / Search / | ||
| | | | '---. | '-------' | '-+------' | ||
|<---------->| | | | v Interior | ^ | ||
' <---' '----' .-----------. ---. .--- v | | ||
.------------------. Diag line | .-------. +---. \ / . | | ||
| if (a > b) +---. .--->| | | | | Curved line \ / / \ | | ||
| obj->fcn() | \ / | '-------' |<--' + / \ | | ||
'------------------' '--' '--+--------' .--. .--. | .-. +Done?+-' | ||
.---+-----. | ^ |\ | | /| .--+ | | \ / | ||
| | | Join \|/ | | Curved | \| |/ | | \ | \ / | ||
| | +----> o --o-- '-' Vertical '--' '--' '-- '--' + .---. | ||
<--+---+-----' | /|\ | | 3 | | ||
v not:line 'quotes' .-' '---' | ||
.-. .---+--------. / A || B *bold* | ^ | ||
| | | Not a dot | <---+---<-- A dash--is not a line v | | ||
'-' '---------+--' / Nor/is this. --- | ||
Python LLM provides a simple API to wrap around the many existing large language models. | ||
{{</section>}} | ||
|
||
{{<section title="Web View">}} | ||
* **HTML** | ||
* **JavaScript** | ||
* **CSS** | ||
|
||
``` | ||
Goat Diagrams are also supported :) | ||
{{</section>}} | ||
|
||
{{<section title="Project Architecture">}} | ||
{{</section>}} | ||
{{<image src="architecture.png" alt="Architecture">}} |