- 
                Notifications
    You must be signed in to change notification settings 
- Fork 2
hollisterco.com scraper #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| page.setUserAgent( | ||
| 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36', | ||
| ) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
por que hace falta este UA? no anda con el default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
El UA default tiene la string HeadlessChrome, si lo corremos headless es necesario cambiar el Header. Más info: https://jsoverson.medium.com/how-to-bypass-access-denied-pages-with-headless-chrome-87ddd5f3413c
| const client = await page.target().createCDPSession() | ||
| await client.send('Network.clearBrowserCookies') | ||
| await client.send('Network.clearBrowserCache') | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Para que se necesita esto?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
El sitio por algun motivo solo acepta la primer request y después marca el browser, hace falta borrar cookies y cache.
|  | ||
| await page.goto(request.pageUrl) | ||
|  | ||
| const data = await page.evaluate(() => { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
como comentario general para los proximos scrapers, hacer llamadas mas pequeñas al dom usando .$eval y .$$eval porque sino es mas dificil entender de donde salio cada dato
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lo voy a tener en cuenta, gracias
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opino lo mismo, para poder entender y mantener mejor el scraper hacer llamadas mas atomicas ayuda a la legibilidad.
Por ejemplo:
const title = await page.$evanl('.product-title-main-header', e => e.textContent?.trim() || '')There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
En general esta bien, me gustaron varias cosas y conseguiste los datos de size chart! solo un par de comentarios
|  | ||
| await page.goto(request.pageUrl) | ||
|  | ||
| const data = await page.evaluate(() => { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opino lo mismo, para poder entender y mantener mejor el scraper hacer llamadas mas atomicas ayuda a la legibilidad.
Por ejemplo:
const title = await page.$evanl('.product-title-main-header', e => e.textContent?.trim() || '')| await page.goto(request.pageUrl) | ||
|  | ||
| const data = await page.evaluate(() => { | ||
| let title = document.querySelector('.product-title-main-header')?.textContent?.trim() || '' | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prefer const cuando no se reasigna el valor (inmutable)
| let p = new Product(`${v.idx}-${v.productid}`, data.title, v.producturl) | ||
|  | ||
| p.subTitle = data.subtitle | ||
| p.metadata = {} | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no es necesario agregar metadata si no tiene nada, ya es el valor por defecto del Product
| const client = await page.target().createCDPSession() | ||
| await client.send('Network.clearBrowserCookies') | ||
| await client.send('Network.clearBrowserCache') | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agregar comentario de que es lo que resuelve
| p.color = v.swatch | ||
| p.colorFamily = v.swatchColorFamily | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debería ser al revés
p.color = v.swatchColorFamily
p.colorFamily = v.swatch| p.metadata = {} | ||
| p.images = v.images | ||
| p.videos = [] | ||
| p.description = v.description | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
La descripción no debería tener HTML
| p.sku = v.sku | ||
| p.brand = v.brand | ||
| p.size = v.sizePrimary + (v.sizeSecondary ? ` - ${v.sizeSecondary}` : '') | ||
| p.realPrice = v.offerPrice | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just in case. Generalmente es mas seguro suponer que cuando no hay oferta puede venir empty el dato, por lo que se podría contemplar de la siguiente manera:
p.realPrice = v.offerPrice || v.listPrice
No description provided.