M4 - final website version (#389)
* First Website Draft

* Implemented Feedback from supervisor

* Logo Upload

* Edits for Release 2

* Removed Tech Stack

* Unnecessary files removed

* Revert submodule version

* compressed images

* Rename Filenames

* Final Website Version

* Picture resize
lauralgh authored Jul 26, 2024
1 parent 29266a4 commit 4ede622
Showing 13 changed files with 91 additions and 84 deletions.
6 changes: 3 additions & 3 deletions content/ss24/master/m4-orgxtract/_index.md
@@ -2,8 +2,8 @@
project_id = "M4"
title = "OrgXtract"
subtitle = ""
claim = ""
abstract = "In our quest to challenge German bureaucracy, our project transforms organizational charts stored as PDFs into machine-readable open data formats, enhancing research capabilities and promoting transparency of German authorities."
claim = "In our quest to challenge German bureaucracy, our project transforms organizational charts stored as PDFs into machine-readable open data formats, enhancing research capabilities and promoting transparency of German authorities."
abstract = ""

# Properties for displaying the project in the project list
card_image = "logo_orgxtract.jpg"
@@ -34,7 +34,7 @@ study_focus = []


{{<section title="Our Goal">}}
The project aims to transform organizational charts, traditionally stored as PDFs, into a machine-readable format to enhance transparency in German authorities personnel structure. It focuses on converting these static documents into data that can be easily analyzed, to provide deeper insights into public administration.
The project aims to **transform organizational charts**, traditionally stored as PDFs, into a machine-readable format to enhance transparency in the personnel structures of German authorities. It focuses on converting these static documents into data that can be easily analyzed, providing deeper insights into public administration.
The goal was to create a tool for the Open Knowledge Foundation team to extract data from organizational charts of public authorities in PDF and use the output data to conduct their research.
{{</section>}}

Binary file added content/ss24/master/m4-orgxtract/architecture.png
25 changes: 25 additions & 0 deletions content/ss24/master/m4-orgxtract/challenges.md
@@ -0,0 +1,25 @@
+++
title = "Challenges"
weight = 20
draft = false
+++


{{<section title="Large Amounts of Data">}}
According to the Federal Office of Administration, there are a total of **966 federal authorities and institutions in Germany** ([Statista, 20.07.2024](https://de.statista.com/statistik/daten/studie/1128113/umfrage/bundesbehoerden-in-deutschland-nach-behoerdenart/)). The actual number of authorities is higher still, as this figure does not include offices at state level. One of the challenges of the project was to create a tool that fits as many different organizational charts as possible in order to capture as much data as feasible.
{{</section>}}

{{<section title="Extracting Text from PDF">}}
PDF is a format for displaying documents, not for storing structured data. Extracting the required information is therefore a challenge: the extracted text is unstructured and often out of reading order, and semantic information is missing entirely. There are also many **edge cases**, which makes it difficult to write a single algorithm that extracts all the text in a meaningful way.
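
To illustrate the ordering problem, here is a minimal sketch that rebuilds visual lines from word bounding boxes. It assumes the words arrive as `(x0, y0, x1, y1, text)` tuples, similar in spirit to what PyMuPDF's `page.get_text("words")` returns; the function name, the tolerance value and the sample data are made up for illustration and are not taken from the project.

```python
# Word boxes as (x0, y0, x1, y1, text); the sample values below are invented.
Word = tuple[float, float, float, float, str]


def words_to_lines(words: list[Word], y_tolerance: float = 3.0) -> list[str]:
    """Group words into visual lines and order each line left to right."""
    lines: list[list[Word]] = []
    for word in sorted(words, key=lambda w: (w[1], w[0])):
        # Same line if the top edge is close to the top edge of the line's first word.
        if lines and abs(word[1] - lines[-1][0][1]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w[4] for w in sorted(line, key=lambda w: w[0])) for line in lines]


sample = [
    (120.0, 50.5, 160.0, 62.0, "Z"),
    (40.0, 50.0, 110.0, 62.0, "Abteilung"),
    (40.0, 70.0, 90.0, 82.0, "Leitung:"),
    (95.0, 70.2, 150.0, 82.0, "N.N."),
]
print(words_to_lines(sample))
```

A sketch like this still merges words from neighbouring boxes that happen to share a baseline, which is one of the edge cases that make organizational charts hard to process.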
{{</section>}}

{{<section title="No Content Standards">}}
Every organizational chart is different. Not only do the people differ from department to department, but so do their job titles, and even equivalent positions may be named differently. For example, it is not a given that the Ministry of Education in Berlin uses the same job titles as its counterpart in Baden-Württemberg. Abbreviations are common as well, but they are not standardized and vary from authority to authority.
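
One common way to cope with inconsistent abbreviations is a lookup table that maps variants to a canonical title. The sketch below shows the idea; the table entries and the function name are illustrative assumptions and do not reflect the project's actual datasets.

```python
# Illustrative lookup table; the entries are assumptions, not the project's data.
CANONICAL_TITLES = {
    "al": "Abteilungsleitung",
    "abtl": "Abteilungsleitung",
    "rl": "Referatsleitung",
    "refl": "Referatsleitung",
    "sts": "Staatssekretär",
}


def normalize_title(raw: str) -> str:
    """Map a raw abbreviation to a canonical title, or return it unchanged."""
    # Drop surrounding whitespace, trailing punctuation and inner dots before lookup.
    key = raw.strip().rstrip(".:").replace(".", "").lower()
    return CANONICAL_TITLES.get(key, raw.strip())


print(normalize_title("RL:"))
print(normalize_title(" Abt.L. "))
print(normalize_title("Pressesprecherin"))
```

Unknown titles pass through unchanged, so a table like this can grow authority by authority without breaking existing output.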
{{</section>}}

{{<section title="Semantic Analysis">}}
The challenge for semantic analysis is that the text in organizational charts is not natural language: it consists of isolated words and fragments rather than coherent sentences. Classical natural language processing (NLP) models can extract meaning from coherent sentences, but not from individual words, so they are not sufficient on their own. This is where large language models (LLMs) help. Therefore, a **combination of LLMs and NLP models** was used.
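
A combination like this can be pictured as a cheap rule that runs first, with a language model as fallback for everything the rule cannot handle. The following sketch is only an assumption about how such a dispatch might look: the pattern, the function names and the `fake_llm` stand-in are hypothetical, and no real model is called.

```python
import re
from typing import Callable, Optional

# Hypothetical pattern: "<role>: <name>" on a single line.
ROLE_PATTERN = re.compile(r"^(?P<role>[^:]+):\s*(?P<name>.+)$")


def classify_line(
    line: str, llm_fallback: Callable[[str], Optional[dict]]
) -> Optional[dict]:
    """Try a cheap rule first; hand the line to a language model only if it fails."""
    match = ROLE_PATTERN.match(line.strip())
    if match:
        return {
            "role": match["role"].strip(),
            "name": match["name"].strip(),
            "source": "rule",
        }
    return llm_fallback(line)


def fake_llm(line: str) -> Optional[dict]:
    # Stand-in for a real model call so that the sketch runs offline.
    return {"role": None, "name": line.strip(), "source": "llm"}


print(classify_line("Leitung: Dr. Muster", fake_llm))
print(classify_line("Dr. Muster", fake_llm))
```

Keeping the fallback as a plain callable means the model behind it can be swapped without touching the rule-based part.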
{{</section>}}



Binary file modified content/ss24/master/m4-orgxtract/chi.jpg
26 changes: 14 additions & 12 deletions content/ss24/master/m4-orgxtract/features.md
@@ -1,22 +1,24 @@
+++
title = "Features"
weight = 20
draft = true
weight = 30
draft = false
+++

{{<section title="Features">}}

* **Roles**

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla facilisis neque id vulputate malesuada. Quisque dignissim finibus urna sed sagittis.
{{<section title="Python Library">}}
The library contains primitives for extracting content from PDFs and for detecting shapes and words, and it includes semantic analysis of organizational charts. It can be used in your own Python projects and customized for your organizational charts by providing your own datasets.
{{</section>}}

{{<section title="Command Line Tool">}}
The command line tool converts PDF files into JSON files. It offers a simple interface for quickly extracting data from a single file or from multiple files.
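
As a rough picture of what such a tool involves, the sketch below wires a placeholder extraction step to `argparse`. It is not the project's actual command line interface: the argument names, the `extract` stub and the output layout are assumptions made for illustration.

```python
import argparse
import json
from pathlib import Path
from typing import Optional


def extract(pdf_path: Path) -> dict:
    # Placeholder for the real extraction step; returns only metadata here.
    return {"source": pdf_path.name, "nodes": []}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Convert org chart PDFs to JSON (sketch).")
    parser.add_argument("pdfs", nargs="+", type=Path, help="one or more PDF files")
    parser.add_argument("-o", "--out-dir", type=Path, default=Path("."), help="output directory")
    return parser


def main(argv: Optional[list[str]] = None) -> list[Path]:
    """Write one JSON file per input PDF and return the written paths."""
    args = build_parser().parse_args(argv)
    args.out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in args.pdfs:
        target = args.out_dir / (pdf.stem + ".json")
        target.write_text(json.dumps(extract(pdf), ensure_ascii=False, indent=2), encoding="utf-8")
        written.append(target)
    return written
```

Calling `main(["chart.pdf", "-o", "out"])` would write `out/chart.json`; in a real package, `main` would be registered as a console entry point.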
{{</section>}}

* **Game-Flow**
{{<section title="LLM Integration">}}
The integration of various large language models facilitates advanced text analysis. For our demo, we used an LLM from OpenAI. However, any supported LLM can be integrated to perform text analysis.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla facilisis neque id vulputate malesuada. Quisque dignissim finibus urna sed sagittis.
{{</section>}}

* **Voting Systems**
{{<section title="Web Interface">}}
A web interface has been created to visualize the results of the text extraction. It is currently built for demonstration purposes, but it could be developed into a user-friendly web view in the future.
{{</section>}}

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla facilisis neque id vulputate malesuada. Quisque dignissim finibus urna sed sagittis.

{{</section>}}
18 changes: 18 additions & 0 deletions content/ss24/master/m4-orgxtract/future.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
+++
title = "Future"
weight = 50
draft = false
+++

{{<section title="Open Data Formats">}}
For now, the PDF files are converted only into JSON files. A possible next step is to support further open data formats (RDF, CSV, etc.) in order to enable more use cases.
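
To give an idea of how small such a step can be for a tabular format, here is a sketch that flattens a JSON list of node records into CSV using only the standard library. The field names `unit` and `head` are hypothetical and do not describe the tool's real output schema.

```python
import csv
import io
import json


def nodes_to_csv(json_text: str, fields: list[str]) -> str:
    """Flatten a JSON list of node records into CSV with a fixed column order."""
    nodes = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for node in nodes:
        # Missing fields become empty cells; fields not listed are dropped.
        writer.writerow({field: node.get(field, "") for field in fields})
    return buffer.getvalue()


sample = '[{"unit": "Referat 1", "head": "N.N."}, {"unit": "Referat 2"}]'
print(nodes_to_csv(sample, ["unit", "head"]))
```

Graph-oriented formats such as RDF would need an explicit vocabulary for units and positions, which goes beyond a flat conversion like this one.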
{{</section>}}

{{<section title="Integration with External Tool">}}
The Technologiestiftung Berlin has developed an [organizational chart tool](https://organigramme.odis-berlin.de/) that makes it easy to create organizational charts in a browser. The foundation has also developed a JSON format called orgjson. Files in this format can be uploaded via the tool's web interface and are then rendered as an organizational chart. In the future, OrgXtract could be linked to this tool by outputting the extracted data in orgjson format.
More information about the project can be found here: https://odis-berlin.de/projekte/2023-07-organigramm-tool/.
{{</section>}}

{{<section title="Recognition of Hierarchies">}}
The tool currently focuses on extracting the content of organization nodes individually. A future step could be linking those nodes by recognizing the hierarchy. This was not a priority for our current project, since the research of the Open Knowledge Foundation focuses on the people and their positions and not the hierarchical structure of the document.
{{</section>}}
Binary file modified content/ss24/master/m4-orgxtract/jimmy.jpg
Binary file modified content/ss24/master/m4-orgxtract/laura.jpg
Binary file modified content/ss24/master/m4-orgxtract/lorenzo.jpg
Binary file modified content/ss24/master/m4-orgxtract/niklas.jpg
14 changes: 4 additions & 10 deletions content/ss24/master/m4-orgxtract/process.md
@@ -5,20 +5,14 @@ draft = false
+++

{{<section title="Defining our Problem">}}

To understand the problem better, we took a deep dive into the composition of organizational charts of German authorities. Every organizational chart not only looked different but also did not follow the same structure. The only thing they had in common was their format: PDF. We had a lot of discussion with the Open Knowledge Foundation about their objective to make these organizational charts machine-readable. Their answer was straightforward: To make research easier and, above all, faster. Organizational charts contain a great deal of information about the distribution of power positions. Automatically reading this data allows us to trace personnel structures and clarify any potential grievances.
To understand the problem better, we took a deep dive into the composition of organizational charts of German authorities. Every organizational chart not only looked different but also followed a different structure. The only thing they had in common was their format: PDF. We had many discussions with the Open Knowledge Foundation about why they want to make these organizational charts machine-readable. Their answer was straightforward: **To make research easier and, above all, faster.** Organizational charts contain a great deal of information about the distribution of positions of power. Reading this data automatically makes it possible to trace personnel structures and uncover potential grievances.
We needed to find an efficient way to make large numbers of organizational charts machine-readable and to put them into a format that the Open Knowledge Foundation could easily use for research.

{{</section>}}

{{<section title="Research">}}

It is relatively easy to read text out of PDFs. However, the difficult part is getting the information structured and ready to use. The research phase involved a great deal of trial and error. During the research phase, we tested different technologies to identify the best solution for our problem. We tested a number of natural language processing solutions to retrieve the names of organizational units, the people leading them and other responsibilities. Using a simple language model was not sufficient to deal with the output from PDF, which is why we needed to combine different solutions. To get all the necessary information, we must pre-process the text by removing unnecessary content from the documents, match patterns to extract data and then use a named entity recognition system.

It is relatively easy to read text out of PDFs. The difficult part is getting the information structured and ready to use. The research phase involved a great deal of trial and error: we **tested different technologies** to identify the best solution for our problem, including a number of natural language processing solutions for retrieving the names of organizational units, the people leading them and other responsibilities. A simple language model alone was not sufficient to deal with the output from PDFs, which is why we needed to combine different solutions. To get all the necessary information, we had to pre-process the text by removing unnecessary content from the documents, match patterns to extract data and then use a named entity recognition system.
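
The pre-processing step can be sketched as a filter that normalizes whitespace and drops lines matching known noise patterns before any pattern matching or named entity recognition runs. The patterns below are hypothetical examples of typical chart residue; the lists actually needed depend on the documents, and the later steps are left out here.

```python
import re

# Hypothetical noise patterns; real charts would need their own list.
NOISE_PATTERNS = [
    re.compile(r"^Seite \d+( von \d+)?$"),               # page counters
    re.compile(r"^Stand:?\s*\d{1,2}\.\d{1,2}\.\d{4}$"),  # "as of" dates
    re.compile(r"^(Tel\.?|Fax):?\s*[\d\s/+()-]+$"),      # phone and fax numbers
]


def preprocess(lines: list[str]) -> list[str]:
    """Collapse whitespace and drop empty lines and lines that match a noise pattern."""
    cleaned = []
    for line in lines:
        text = " ".join(line.split())
        if text and not any(pattern.match(text) for pattern in NOISE_PATTERNS):
            cleaned.append(text)
    return cleaned


raw = ["Seite 1 von 2", "  Referat  Z 1 ", "Stand: 01.07.2024", "Tel.: 030 1234-5", ""]
print(preprocess(raw))
```

Removing this residue early keeps the later pattern matching and entity recognition focused on lines that can actually describe a unit or a person.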
{{</section>}}

{{<section title="Project planning">}}

At the beginning of our project, we divided our group into distinct sub-teams, each responsible for conducting research in a specific area. We then planned our project using a Trello board, which enabled us to split all our tasks into an organized format and assign each task to different project members. We split into groups to tackle different project areas. One team was responsible for implementing the algorithm for extracting structured data, while the other team worked on the visualization and interactive web interface to demonstrate the capabilities of our tool.

{{<section title="Project Planning">}}
At the beginning of our project, we divided our group into distinct sub-teams, each responsible for conducting research in a specific area. We then planned our project using a Trello board, which enabled us to **split all our tasks into an organized format** and assign each task to a project member. For the implementation, one team was responsible for the algorithm that extracts structured data, while the other team worked on the visualization and the interactive web interface that demonstrates the capabilities of our tool.
{{</section>}}
86 changes: 27 additions & 59 deletions content/ss24/master/m4-orgxtract/techstack.md
@@ -1,72 +1,40 @@
+++
title = "Tech Stack"
weight = 30
draft = true
hasMermaid = true
weight = 40
draft = false
+++

{{<section title="Tech Stack">}}
Hugo offers a couple of options to include diagrams right in the source code, see
[https://gohugo.io/content-management/diagrams/](https://gohugo.io/content-management/diagrams/)
{{<image src="techstack_logos.png" alt="Techstack">}}

Here's a mermaid example: (note the hasMermaid = true parameter in the front matter!)

```mermaid
classDiagram
Animal <|-- Duck
Animal <|-- Fish
Animal <|-- Zebra
Animal : +int age
Animal : +String gender
Animal: +isMammal()
Animal: +mate()
class Duck{
+String beakColor
+swim()
+quack()
}
class Fish{
-int sizeInFeet
-canEat()
}
class Zebra{
+bool is_wild
+run()
}
```
{{<section title="Organization">}}
* **[GitHub](https://github.com/FDS-HTW-2024/fds_orgchart)** for code collaboration and distribution
* **[Trello](https://trello.com/)** for task organization
* **[Figma](https://www.figma.com/)** for visualization prototyping
{{</section>}}

{{<section title="Future">}}
{{<section title="Core">}}
* **[Python](https://www.python.org/)**

Python was selected because it is one of the most widely used programming languages for data science. Given the extensive range of available packages, it proved to be the most effective language for our problem.

* **[PyMuPDF](https://pymupdf.readthedocs.io/en/latest/)**

The PyMuPDF library was used to extract data from PDF documents, as it is a high-performance Python library for this purpose.
* **[spaCy](https://spacy.io/)**

spaCy is a high-quality package for natural language processing tasks. We chose to work with spaCy because it provides many components without much configuration.
* **[Python LLM](https://llm.datasette.io/en/stable/)**

```goat
+-------------------+ ^ .---.
| A Box |__.--.__ __.--> | .-. | |
| | '--' v | * |<--- | |
+-------------------+ '-' | |
Round *---(-. |
.-----------------. .-------. .----------. .-------. | | |
| Mixed Rounded | | | / Diagonals \ | | | | | |
| & Square Corners | '--. .--' / \ |---+---| '-)-' .--------.
'--+------------+-' .--. | '-------+--------' | | | | / Search /
| | | | '---. | '-------' | '-+------'
|<---------->| | | | v Interior | ^
' <---' '----' .-----------. ---. .--- v |
.------------------. Diag line | .-------. +---. \ / . |
| if (a > b) +---. .--->| | | | | Curved line \ / / \ |
| obj->fcn() | \ / | '-------' |<--' + / \ |
'------------------' '--' '--+--------' .--. .--. | .-. +Done?+-'
.---+-----. | ^ |\ | | /| .--+ | | \ /
| | | Join \|/ | | Curved | \| |/ | | \ | \ /
| | +----> o --o-- '-' Vertical '--' '--' '-- '--' + .---.
<--+---+-----' | /|\ | | 3 |
v not:line 'quotes' .-' '---'
.-. .---+--------. / A || B *bold* | ^
| | | Not a dot | <---+---<-- A dash--is not a line v |
'-' '---------+--' / Nor/is this. ---
Python LLM provides a simple API to wrap around the many existing large language models.
{{</section>}}

{{<section title="Web View">}}
* **HTML**
* **JavaScript**
* **CSS**

```
Goat Diagrams are also supported :)
{{</section>}}

{{<section title="Project Architecture">}}
{{</section>}}
{{<image src="architecture.png" alt="Architecture">}}
