Add original Spanish Dish Title corpus. #43

Open · wants to merge 8 commits into base: main
1 change: 1 addition & 0 deletions datasets.csv
@@ -17,3 +17,4 @@ Spanish Skip-Gram Word Embeddings in FastText,"modelado del lenguaje,FastText",g
TDX Thesis Spanish Corpus,modelado del lenguaje,academico,"catalán, español",España,https://doi.org/10.5281/zenodo.7313149,,,,David Arias
WikiCorpus,"modelado del lenguaje,POS (Part of Speech)",general,"catalán, español, inglés",Varios,https://www.cs.upc.edu/~nlp/wikicorpus/,,https://www.cs.upc.edu/~nlp/papers/reese10.pdf,wikicorpus,Albert Villanova @Hugging Face
eHealth-KD,reconocimiento de entidades nombradas (NER),clinico,es,España,https://knowledge-learning.github.io/ehealthkd-2020/,https://github.com/knowledge-learning/ehealthkd-2020,http://ceur-ws.org/Vol-2664/eHealth-KD_overview.pdf,ehealth_kd,María Grandury
Spanish Dish title,Imagen a texto,general,español,Varios,https://huggingface.co/datasets/hacktoberfest-corpus-es/spanish_dish_title,,,,Fredy Orozco
@@ -0,0 +1,6 @@
{

Review comment (Member): You don't need to include this file; add `.ipynb_checkpoints` to the `.gitignore` :)

"cells": [],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 5
}
878 changes: 878 additions & 0 deletions datasets/spanish_dish_title/EDA.ipynb

Large diffs are not rendered by default.

42 changes: 42 additions & 0 deletions datasets/spanish_dish_title/README.md
@@ -0,0 +1,42 @@
# Food dishes

Review comment (Member): Proposal for the bias study: where are the recipes from? Do they include recipes from different countries/continents?

## Description
This dataset consists of images of food dishes together with their titles. It was built by scraping the website <a href="https://www.recetasgratis.net/">Recetas gratis</a>, following this methodology (a condensed code sketch appears after the list):

Review comment (Member), suggested change: in the original Spanish, "imagenes" → "imágenes" and "titulo" → "título" (missing accents).

1. Get the link to the main page of each food category.
2. Get the link to each recipe page.
3. Get the link to the recipe image.
4. Get the recipe title.

Review comment (Member), suggested change: "titulo" → "título" (missing accent).
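
A condensed sketch of those four steps, assuming only `requests` and `BeautifulSoup` (the actual script in `src/scraper_images_dish.py` downloads pages with newspaper3k instead):

```python
import requests
from bs4 import BeautifulSoup

def scrape_category(category_url):
    # Steps 1-2: fetch a category listing page.
    soup = BeautifulSoup(requests.get(category_url).text, "html.parser")
    # Steps 3-4: each recipe card carries the image URL in src
    # and the recipe title in the alt attribute.
    for img in soup.find_all("img", {"class": "imagen"}):
        yield img.get("src"), img.get("alt")
```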

## Images

Review comment (Member), suggested change: heading "Imagenes" → "Imágenes" (missing accent).

The images are 300x300 pixels and come in JPG format.

Review comment (Member), suggested change: "imagenes" → "imágenes" (missing accent).
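
A quick way to verify that spec on a downloaded file; the path below is hypothetical:

```python
from PIL import Image

# Spot-check one image against the documented 300x300 JPG spec;
# "images/example.jpg" is a hypothetical path.
img = Image.open("images/example.jpg")
print(img.format, img.size)  # expected: JPEG (300, 300)
```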

## Metadata
The dataset contains the following metadata fields:
+ **prompt**: Recipe title.

Review comment (Member), suggested change: "Titulo" → "Título" (missing accent).

+ **source**: Path to the image.

Review comment (Member), suggested change: "path de la imagen" → "Path de la imagen" (capitalization).

+ **uuid**: Unique identifier for the image.

Note 1: The dataset is provided in CSV format.
Note 2: The image file names also include the title.

Review comment (Member), suggested change: rephrase Nota 2 as "En el nombre de las imágenes tambien va el título."
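
A minimal sketch of reading these fields, assuming the column names in the CSV match the fields documented above:

```python
import pandas as pd

# Load the metadata and inspect the three documented fields.
df = pd.read_csv("dataset.csv")
print(df[["prompt", "source", "uuid"]].head())
```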


## Directory

Review comment (Member): Please include all the files and an explanation of each.

Review comment (Member): Please make the notebook's purpose explicit in the name of Untitled.ipynb.

```bash
|-- README.md - This file
|-- dataset.csv - Dataset
|-- images - Images
|-- src - Source code, notably the scraping script
```
## Exploratory data analysis

The exploratory analysis focuses on the text; for the images, computer-vision tools such as CLIP would have to be applied to derive useful classifications.

Review comment (Member): Also add a sentence saying that the notebook is available, with a link to EDA.ipynb.

Review comment (Member), suggested change: "imagenes" → "imágenes" (missing accent).


### Text analysis

<img src="nube_de_palabras.png">
The image above shows the most frequent words in the titles; below is a boxplot of the text lengths:
<img src="box_plot.png">
Here we can see that there are both very short and very long titles, so users should check whether the text lengths suit their use case.
<img src="distribution.png">

Review comment (Member): Here, size_distribution.png might be a more specific name :)

The histogram above shows the distribution of text lengths: most titles are shorter than 78 characters, and 75% of the dataset has titles of 31 characters or fewer.
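
The quoted figures can be reproduced with a short sketch, assuming the titles live in the prompt column of dataset.csv:

```python
import pandas as pd

# Distribution of title lengths, in characters.
lengths = pd.read_csv("dataset.csv")["prompt"].str.len()
print(lengths.describe())      # count, mean, quartiles, max
print(lengths.quantile(0.75))  # ~31 characters according to the README
```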

### Image analysis
We recommend analyzing the images with neural networks, both to get more value out of the data and to verify the correspondence between prompt and image (one idea is to do this with CLIP).

<img src="dishes_prompt.png">
935 changes: 935 additions & 0 deletions datasets/spanish_dish_title/Untitled.ipynb

Large diffs are not rendered by default.

Binary file added datasets/spanish_dish_title/box_plot.png
Binary file added datasets/spanish_dish_title/dishes_prompt.png
Binary file added datasets/spanish_dish_title/distribution.png
Binary file added datasets/spanish_dish_title/nube_de_palabras.png
14 changes: 14 additions & 0 deletions datasets/spanish_dish_title/src/creates_prompt.py
@@ -0,0 +1,14 @@
from pathlib import Path
import pandas as pd
import re
import uuid

# File stems end in a numeric suffix of the form _<n>_<n>_<n>; strip it and
# turn the remaining underscores back into spaces to recover the title.
patron = r"_\d+_\d+_\d+"
values = list(Path("images").glob("*.jpg"))
images = [value.stem for value in values]
images = list(map(lambda x: re.sub(patron, "", x), images))
prompts = list(map(lambda x: x.replace("_", " "), images))

# One row per image: recovered title, image path and a freshly generated UUID.
df = pd.DataFrame({"prompt": prompts, "image": values})
df["uuid"] = df["image"].apply(lambda x: str(uuid.uuid4()))
df.to_csv("final_dataset.csv", index=False)
73 changes: 73 additions & 0 deletions datasets/spanish_dish_title/src/scraper_images_dish.py
@@ -0,0 +1,73 @@
import pandas as pd
import newspaper as ns
from bs4 import BeautifulSoup
import requests
from pathlib import Path


def donwload(url, directory):

Review comment (Member): Please include a short description of the functions in docstrings, e.g.:

Suggested change:
def donwload(url, directory):
    """
    Descargar ...
    """

print("t...Descargando la imagen: " + url)
image = requests.get(url)
is_save = False
if image.status_code == 200:
is_save = True
name_save = directory / url.split("/")[-1]
with open(name_save, 'wb') as f:
f.write(image.content)
return name_save, is_save


def next_link(soup):
    # Anchor pointing at the next results page; None on the last page.
    link = soup.find("a", {"class": "next ga"})
    if link is None:
        return None
    return link.get("href")

def scrapy_page(link):
    directorio = Path("images")
    # Pages scraped in earlier runs already have a CSV named after them.
    nombres_de_archivos = [archivo.name for archivo in directorio.iterdir() if archivo.is_file()]
    nombres_de_archivos = list(map(lambda x: x.replace(".csv", ""), nombres_de_archivos))
    images_ns = ns.Article(link)
    images_ns.download()
    soup = BeautifulSoup(images_ns.html, 'html.parser')
    images = soup.find_all("img", {"class": "imagen"})
    print(f"Scraping {link} ... ")
    list_values = []
    name_save = link.split("/")[-1].replace(".html", "")
    path = directorio / name_save
    if path.name in nombres_de_archivos:
        # Already scraped: skip straight to the next page.
        return next_link(soup)
    path.mkdir(parents=True, exist_ok=True)
    for image in images:
        url = image.get('src')
        title = image.get('alt')  # the recipe title lives in the alt text
        path_image, is_save = donwload(url, path)
        values_dict = {'url': url, 'title': title, 'path': path_image, 'is_save': is_save}
        list_values.append(values_dict)
    pd.DataFrame(list_values).to_csv(f'images/{name_save}.csv', index=False)
    # Link to the next paginated page of this category, if any.
    return next_link(soup)

def main():
    # Links already processed in previous runs.
    read_list = pd.read_csv("links_scrapped_images_2.csv")
    url = "https://www.recetasgratis.net"
    d = ns.Article(url)
    d.download()
    soup = BeautifulSoup(d.html, 'html.parser')
    # Collect every food-category link from the home page.
    links = soup.find('div', {'class': 'categorias-home'}).find_all('a', {'class': 'titulo'})
    links = [link.get('href') for link in links]
    links_scrapped = []
    for link in links:
        if link in read_list.values:
            continue
        try:
            print(link)
            # scrapy_page returns the next paginated page of the category.
            link = scrapy_page(link)
            links_scrapped.append(link)
        except Exception as e:
            # Log the error and move on to the next category link.
            print(e)
    pd.DataFrame(links_scrapped).to_csv('links_scrapped_images_2.csv', index=False)


if __name__ == '__main__':
    main()
110 changes: 110 additions & 0 deletions datasets/spanish_dish_title/upload_hugginface.ipynb
@@ -0,0 +1,110 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"from datasets import load_dataset, Image\n",
"from datasets import DatasetDict\n",
"dataset = load_dataset(\"csv\", data_files=\"dataset.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "3103220ff4d24d5694a129bfb2349ef4",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VBox(children=(HTML(value='<center> <img\\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from huggingface_hub import login\n",
"login()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"dataset = dataset.cast_column(\"image\", Image())"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"train_testvalid = dataset[\"train\"].train_test_split(test_size=0.2)\n",
"test_valid = train_testvalid [\"test\"].train_test_split(test_size=0.2)\n",
"ds = DatasetDict({\n",
" 'train': train_testvalid['train'],\n",
" 'test': test_valid['test'],\n",
" 'valid': test_valid['train']})"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Map: 100%|██████████| 13170/13170 [00:05<00:00, 2297.54 examples/s]\n",
"Creating parquet from Arrow format: 100%|██████████| 132/132 [00:00<00:00, 153.55ba/s]\n",
"Pushing dataset shards to the dataset hub: 100%|██████████| 1/1 [00:11<00:00, 11.70s/it]\n",
"Map: 100%|██████████| 659/659 [00:00<00:00, 2920.74 examples/s]\n",
"Creating parquet from Arrow format: 100%|██████████| 7/7 [00:00<00:00, 167.03ba/s]\n",
"Pushing dataset shards to the dataset hub: 100%|██████████| 1/1 [00:01<00:00, 1.57s/it]\n",
"Map: 100%|██████████| 2634/2634 [00:01<00:00, 2519.60 examples/s]\n",
"Creating parquet from Arrow format: 100%|██████████| 27/27 [00:00<00:00, 183.84ba/s]\n",
"Pushing dataset shards to the dataset hub: 100%|██████████| 1/1 [00:03<00:00, 3.01s/it]\n",
"Downloading metadata: 100%|██████████| 21.0/21.0 [00:00<?, ?B/s]\n"
]
}
],
"source": [
"ds.push_to_hub(\"hacktoberfest-corpus-es/spanish_dish_title\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
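
Once pushed, the dataset can be loaded back from the Hub; a minimal sketch:

```python
from datasets import load_dataset

# Load the published splits back from the Hub.
ds = load_dataset("hacktoberfest-corpus-es/spanish_dish_title")
print(ds)              # DatasetDict with train, test and valid splits
print(ds["train"][0])  # one example: prompt text plus a decoded PIL image
```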