
I'm having trouble with my ETL process. The scraping step uses this code:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup  # required: BeautifulSoup is used in the scraping loop below
import pandas as pd
import chromedriver_autoinstaller

chromedriver_autoinstaller.install()

options = webdriver.ChromeOptions()
driver = webdriver.Chrome( options = options )

######################## WEBSCRAPING 
table = None
attempts = 0
max_attempts = 1000
wait_time = 10

while not table and attempts < max_attempts:
    driver.get("https://www.congreso.gob.pe/pleno/congresistas/?=undefined&m1_idP=13")
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    table = soup.find('table')
    attempts += 1
    if not table:
        print(f"Tabla no encontrada. Intento {attempts}/{max_attempts}. Reintentando en {wait_time} segundos...")
        time.sleep(wait_time)
    else:
        print("Tabla encontrada.")

if table:
    headers = [th.text.strip() for th in table.find_all('tr')[0].find_all('th')]  
    data = [[td.text.strip() for td in row.find_all('td')] for row in table.find_all('tr')[1:]]
    df = pd.DataFrame(data, columns=headers)
else:
    print("No se encontró ninguna tabla después de múltiples intentos.")
    driver.quit()
    exit()

With a visible browser window this works, and the output is the table from that page:

print(df.head(3))
Apellidos y Nombres Grupo Parlamentario e-mail
Acuña Peralta María Grimaneza ALIANZA PARA EL PROGRESO [email protected]
Acuña Peralta Segundo Héctor HONOR Y DEMOCRACIA [email protected]

I'm now trying to automate this process in another environment that has no graphical interface. For that reason, I want to add these options:

options.add_argument('--no-sandbox')
options.add_argument('--start-maximized')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--incognito')
options.add_argument('--headless')

So the code before the WEBSCRAPING section becomes:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup  # required by the scraping section that follows
import pandas as pd
import chromedriver_autoinstaller

chromedriver_autoinstaller.install()

options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--start-maximized')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

######################## WEBSCRAPING...

But when I run this version, the table is never found: every attempt fails, the loop ends after the maximum number of attempts, and no data frame is created.

Can anyone help me? I need to keep the options that disable the graphical interface.

  • "Python simply doesn't recognize the table": show us the actual result from the code, so we don't have to guess what this means. Commented Sep 2 at 20:41
  • Just this: "Tabla no encontrada. Intento 1/1000. Reintentando en 10 segundos... Tabla no encontrada. Intento 2/1000. Reintentando en 10 segundos... …. Tabla no encontrada. Intento 1000/1000. Reintentando en 10 segundos... No se encontró ninguna tabla después de múltiples intentos." (In English: "Table not found. Attempt n/1000. Retrying in 10 seconds..." for every attempt, then "No table was found after multiple attempts.") Commented Sep 2 at 20:46
  • "The output is the table of this link": if you're getting output, then the table was found. So I don't understand why you say the table was not recognized...? Commented Sep 2 at 20:56
  • I'm not positive, but I have found that when running headless the page is often different, e.g. it shows a prompt to accept cookies or something similar. A hack I used was to save the raw HTML to a file when running headless and inspect it manually to check that the content was the same. Commented Sep 2 at 21:08
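The debugging approach from the last comment could be sketched roughly as follows. This is only an illustration: the helper name `dump_page_source` and the default file name are made up here, and `driver` is assumed to be any object exposing a `page_source` attribute, such as the Selenium driver from the question.

```python
from pathlib import Path


def dump_page_source(driver, path="headless_page.html"):
    """Save the HTML the browser actually received and report whether it has a <table>.

    `driver` is assumed to expose `page_source` (as a Selenium WebDriver does).
    The function name and default path are hypothetical, chosen for this sketch.
    """
    html = driver.page_source
    # Write the raw HTML so it can be opened and inspected by hand.
    Path(path).write_text(html, encoding="utf-8")
    # Crude check: is there any <table> tag at all in what the server sent back?
    return "<table" in html.lower()
```

Calling `dump_page_source(driver)` right after `driver.get(...)` in the headless run, and then opening the saved file, should show whether the server returned the real page or something else (a block page, a consent prompt, an empty shell).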

1 Answer


In the end, I got it working with these additional options:

options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36")
options.add_argument("accept-language=es-ES,es;q=0.9")
options.add_argument("accept-encoding=gzip, deflate, br")
options.add_argument("referer=https://www.google.com/")
options.add_argument('--disable-gpu')
options.add_argument('--disable-software-rasterizer')
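A possible explanation, which the answer itself does not confirm, is that headless Chrome announces itself as "HeadlessChrome" in its default user agent and some sites respond differently to it, so the `user-agent` override may be the option doing most of the work. Under that assumption, a minimal way to keep the headless flags together in one place might look like this (the helper name `build_headless_args` is hypothetical):

```python
def build_headless_args(user_agent):
    """Return Chrome command-line arguments for a headless run.

    Hypothetical helper for this sketch; it only builds strings and does not
    start a browser. The flags are the ones already used in the question,
    plus the user-agent override from the answer.
    """
    return [
        "--headless",
        "--no-sandbox",
        "--disable-dev-shm-usage",
        f"--user-agent={user_agent}",
    ]


# Usage with the driver setup from the question (not executed here):
#   options = webdriver.ChromeOptions()
#   for arg in build_headless_args("Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."):
#       options.add_argument(arg)
#   driver = webdriver.Chrome(options=options)
```

Adding the options one at a time and re-running would show which of them the site actually requires.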

Thanks!
