OCR pipeline for extracting the data from daily reports #2

Open
pedrocruzio opened this issue Apr 2, 2020 · 3 comments

Comments

@pedrocruzio
Collaborator

Create a pipeline to upload a report from the log. I'm thinking the easiest way would be with a small web app that does the following:

  1. Home page shows instructions to upload a report and an upload box
  2. Images and PDFs can be dragged and dropped onto the upload box
  3. After the report has been uploaded, the server will extract the text and display it in CSV, JSON, and plain text.
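As a rough illustration of step 3, here is a minimal sketch of turning already-extracted text into the three output formats. The function name `render_outputs` and the one-row-per-line CSV layout are assumptions made for this example, not something decided in this issue.

```python
import csv
import io
import json


def render_outputs(text):
    """Return the extracted text as plain text, JSON, and CSV strings.

    Hypothetical helper: the output layout is an assumption, not a spec.
    """
    # Keep non-empty lines only; OCR output tends to contain many blank lines.
    lines = [line.strip() for line in text.splitlines() if line.strip()]

    # JSON: a list of lines, keeping accented characters readable.
    as_json = json.dumps({"lines": lines}, ensure_ascii=False, indent=2)

    # CSV: one row per line, with a line-number column.
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["line_number", "text"])
    for number, line in enumerate(lines, start=1):
        writer.writerow([number, line])

    return {"text": "\n".join(lines), "json": as_json, "csv": buffer.getvalue()}
```

A web handler could call `render_outputs(extracted_text)` after the upload finishes and show each of the three strings in its own tab or download link.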

Afterwards, we might be able to start adding the data to the API with a cron job.

pedrocruzio added the enhancement, help wanted, and good first issue labels on Apr 2, 2020

@froi
Member

froi commented Apr 8, 2020

I'll take a stab at this.

@froi
Member

froi commented Apr 9, 2020

After a quick look, the text can be extracted from the images using the Python libraries Pillow and Pytesseract. I took the image from 3-24-2020.md.

The extracted text looks like this:

RESULTADOS DE PRUEBAS PARA COVID-19

Fecha de actualización de datos: 24 de marzo de 2020
Total de casos nuevos desde último informe: 12
* Departamento de Salud ú
* Administración de Veteranos:
* Laboratorios Privados: 1

RESUMEN DE RESULTADOS DE PRUEB)

 

 

Total Total Laboratorios Total PR

 

 

 

 

 

 

 

 

Resultado Salud Veteranos — privados e
Positivos 34 16 1 51 88
Negativos 254 48 15 317 545
Pendientes 70 36 108 214 368
Total 358 100 124 582 100.0
DESCRIPCIÓN DE CASOS POSITIVOS:
7 Frecuencia Porciento
Característica a eS
Sexo
* Femenino 16 320
* Masculino 34 68.0
*No disponible 1
¡Grupo de edad
*20-29 3 60
*30-39 9 18.0
* 40-49 7 14.0
* 50-59 7 14.0
* 60-69 10 20.0
*70-79 9 18.0
* 80-89 5 10.0
Promediotd.e. 56.3 118.0
*No disponible 1
¡Región
* Arecibo 0 0.0
* Bayamón 2 4.0
* Caguas 3 60
“Fajardo 1 20
* Mayagúez 9 18.0
* Metro 34 68.0
* Ponce 1 20
*No disponible 1
Sintomático
«sí 20 952
*No 1 48
*No disponible 30

Horrible but workable. I'm going to build a small PoC for this. Comments and ideas welcome.
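To make "workable" a bit more concrete, here is a minimal sketch of the extraction call plus a parser for the summary table in the output above. It is not taken from the PoC. The column names are assumptions inferred from the garbled header lines, `lang="spa"` assumes the Spanish Tesseract language data is installed, and the helper names are hypothetical.

```python
import re

# Row labels as they appear in the summary table of the OCR output.
ROW_LABELS = ("Positivos", "Negativos", "Pendientes", "Total")

# Hypothetical column names, inferred from the garbled header lines.
COLUMNS = ("salud", "veteranos", "privados", "total")


def ocr_image(path):
    """Run Tesseract on an image file and return the raw text."""
    # Imported lazily so the parser below works without the OCR dependencies.
    from PIL import Image
    import pytesseract

    # Assumption: the Spanish language data ("spa") is installed.
    return pytesseract.image_to_string(Image.open(path), lang="spa")


def parse_results_table(text):
    """Pull the four count columns out of each summary-table row."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0] not in ROW_LABELS:
            continue
        key = parts[0].lower()
        counts = parts[1:1 + len(COLUMNS)]
        # Skip header lines such as "Total Total Laboratorios Total PR".
        if len(counts) < len(COLUMNS):
            continue
        if not all(re.fullmatch(r"\d+", c) for c in counts):
            continue
        # Keep the first match per label; the trailing percent column is
        # ignored because the OCR drops its decimal point.
        rows.setdefault(key, dict(zip(COLUMNS, map(int, counts))))
    return rows


def totals_consistent(rows):
    """Check that the three result rows add up to the Total row."""
    return all(
        sum(rows[r][c] for r in ("positivos", "negativos", "pendientes"))
        == rows["total"][c]
        for c in COLUMNS
    )
```

Calling `parse_results_table(ocr_image("report.png"))` (with a hypothetical file name) would yield one dictionary per row. The consistency check is a cheap way to flag a report where a digit was misread.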

@froi
Member

froi commented Apr 9, 2020

Quick and dirty example here: https://github.com/Code4PuertoRico/ocr_poc
