I’m having trouble with my ETL process. Here is the code I have:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup  # required by the BeautifulSoup(...) call below
import pandas as pd
import chromedriver_autoinstaller
chromedriver_autoinstaller.install()
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)
######################## WEBSCRAPING
table = None
attempts = 0
max_attempts = 1000
wait_time = 10
# Reload the page until a <table> shows up or the attempts run out
while not table and attempts < max_attempts:
    driver.get("https://www.congreso.gob.pe/pleno/congresistas/?=undefined&m1_idP=13")
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    table = soup.find('table')
    attempts += 1
    if not table:
        print(f"Table not found. Attempt {attempts}/{max_attempts}. Retrying in {wait_time} seconds...")
        time.sleep(wait_time)
    else:
        print("Table found.")
if table:
    # The first row holds the headers; the remaining rows hold the data
    headers = [th.text.strip() for th in table.find_all('tr')[0].find_all('th')]
    data = [[td.text.strip() for td in row.find_all('td')] for row in table.find_all('tr')[1:]]
    df = pd.DataFrame(data, columns=headers)
else:
    print("No table found after multiple attempts.")
    driver.quit()
    exit()
The output is the table from that page:
print(df.head(3))
Apellidos y Nombres | Grupo Parlamentario | | |
---|---|---|---|
Acuña Peralta María Grimaneza | ALIANZA PARA EL PROGRESO | [email protected] | |
Acuña Peralta Segundo Héctor | HONOR Y DEMOCRACIA | [email protected] | |
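The retry loop in the scraping code can be pulled out into a small helper, so that it can be exercised without a browser. This is only a minimal sketch: `retry_until`, `fetch`, and the injectable `sleep` parameter are hypothetical names that do not appear in the original script.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_until(
    fetch: Callable[[], Optional[T]],
    max_attempts: int = 5,
    wait_time: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fetch() until it returns something other than None, or give up."""
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if result is not None:
            return result
        print(f"Not found. Attempt {attempt}/{max_attempts}.")
        if attempt < max_attempts:
            sleep(wait_time)  # skip the pause after the final attempt
    return None


# Hypothetical usage with the scraper (assumes `driver` and BeautifulSoup exist):
# def fetch_table():
#     driver.get("https://www.congreso.gob.pe/pleno/congresistas/?=undefined&m1_idP=13")
#     return BeautifulSoup(driver.page_source, "html.parser").find("table")
#
# table = retry_until(fetch_table, max_attempts=1000, wait_time=10)
```

Passing `sleep` as a parameter is a design choice made here so the loop can be tested without actually waiting.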
Now I'm trying to automate this process in another environment, one without a graphical interface. For that reason, I want to add these options:
options.add_argument('--no-sandbox')
options.add_argument('--start-maximized')
options.add_argument('--disable-dev-shm-usage')
options.add_argument( '--incognito' )
options.add_argument( '--headless' )
With those options, my code (everything before the ####WEBSCRAPING section) becomes:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup  # required by the scraping section
import pandas as pd
import chromedriver_autoinstaller
chromedriver_autoinstaller.install()
options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--start-maximized')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
######################## WEBSCRAPING...
But when I run this, the table is never found, and my code leaves the loop without building the data frame.
Can anyone help me? I want to keep the options that disable the graphical interface.
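One way to narrow this down might be to look at what the headless browser actually received, since a page without a table could be something other than the expected page (for example an error page, which is only a guess here). The following is a minimal sketch using only the standard library; `summarize_page` and `_PageSummary` are hypothetical helper names, not part of Selenium or BeautifulSoup.

```python
from html.parser import HTMLParser


class _PageSummary(HTMLParser):
    """Collects the <title> text and counts <table> start tags."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.table_count = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "table":
            self.table_count += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize_page(html: str) -> dict:
    """Return the size, title, and number of tables of an HTML document."""
    parser = _PageSummary()
    parser.feed(html)
    parser.close()
    return {
        "length": len(html),
        "title": parser.title.strip(),
        "tables": parser.table_count,
    }


# Hypothetical usage inside the retry loop (assumes `driver` exists):
# print(summarize_page(driver.page_source))
```

Comparing this summary between the run with `--headless` and the run without it should show whether the two runs receive the same page.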