All Questions
Tagged with beautifulsoup html-parsing
161 questions with no upvoted or accepted answers
2 votes, 0 answers, 628 views
Read local html files and convert to dataframe with python
I have a local directory on my machine containing multiple HTML files, all named in the following format:
> XXXXXXXX_XXXX-XX-XX.html
where each X represents a numeric character (the number of numeric ...
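A minimal sketch of one possible first step, using only the standard library. The excerpt is truncated, so treating the first numeric block as an ID of variable length and the second block as a date is an assumption, and `parse_name` and `collect_records` are hypothetical helper names. The resulting list of dictionaries could then be handed to `pandas.DataFrame`.

```python
import re
from pathlib import Path

# Assumed pattern: digits, an underscore, then a YYYY-MM-DD style block.
NAME_RE = re.compile(r"^(\d+)_(\d{4}-\d{2}-\d{2})\.html$")


def parse_name(filename):
    """Split a filename into its numeric parts, or return None if it does not match."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    return {"id": match.group(1), "date": match.group(2), "file": filename}


def collect_records(directory):
    """Build one record per matching .html file in the given directory."""
    records = []
    for path in sorted(Path(directory).glob("*.html")):
        record = parse_name(path.name)
        if record is not None:
            records.append(record)
    return records
```

Files whose names do not fit the pattern are skipped rather than raising an error, which seems the safer default when a directory may hold unrelated files.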
2 votes, 2 answers, 189 views
How can I extract the links from HTML?
I'm trying to get the link to every article in this category on the SF Chronicle, but I'm not sure where to begin extracting the URLs. Here is my progress so far:
from urllib.request ...
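A minimal sketch of link extraction with Beautiful Soup, run here on an inline HTML string rather than a downloaded page. The sample markup, the base URL, and the helper name `extract_links` are illustrative assumptions, not taken from the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return the absolute URLs of all <a href=...> tags, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors that have no href attribute at all.
    for anchor in soup.find_all("a", href=True):
        url = urljoin(base_url, anchor["href"])
        if url not in links:
            links.append(url)
    return links


# Hypothetical sample markup standing in for a category page.
sample = (
    '<div><a href="/a/1">One</a>'
    '<a href="https://example.com/a/2">Two</a>'
    '<a name="x">No href</a>'
    '<a href="/a/1">Duplicate</a></div>'
)
links = extract_links(sample, "https://example.com/news/")
```

Resolving each `href` against the page URL with `urljoin` turns relative paths into absolute links, which is usually what a follow-up request needs.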
2 votes, 1 answer, 2k views
Beautiful Soup cannot find all image tags in HTML (stops exactly at 5)
I am trying to use BeautifulSoup to get all the images on a site that have a certain class. My issue is that when I run the code just to check whether it can find each image, it only gets images 1-5. I ...
2 votes, 2 answers, 966 views
Web parsing with Python BeautifulSoup producing inconsistent results
I am trying to parse the table on this site. I am using Python Beautiful Soup to do that. While it produces correct output on my Ubuntu 14.04 machine, it produces wrong output on my friend's ...
2 votes, 1 answer, 96 views
How to click one of the href links from output that doesn't have a particular word in it?
I've parsed a list of href links and their titles from a webpage. I want to click all the links that don't contain the string "[$]". Here is my code.
from selenium.webdriver.common.keys import Keys
from ...
2 votes, 1 answer, 928 views
How to parse a web page containing CSS and HTML using Python
I am trying to parse and extract some information from a web page that contains CSS and, of course, HTML. I am using cssutils and BeautifulSoup for this. Let's say I want to find out the font size used for ...
2 votes, 1 answer, 2k views
Using lxml with Beautiful Soup
I'm having trouble making lxml work with Beautiful Soup. I'm running OS X 10.8.4. To install lxml, I ran port install py25-lxml and it installed fine. Now I'm getting this error when I try to use lxml ...
1 vote, 1 answer, 47 views
Beautiful Soup only gets header of table
I am trying to import the data from a table on this website into a CSV file: http://www.ameren.com/illinois/residential/supply-choice/renewables/interconnection-queue.
I have tried many different solutions, ...
1 vote, 1 answer, 32 views
Python: How can I get a list of li tags in BeautifulSoup4
I'm trying to scrape a Persian webpage, and I want to get 3 li tags from a ul containing 6 of them. My problem is that every li has nested li tags in it, and when I use soup.find_all('li'), it finds ...
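One way to avoid picking up nested items is to search only the direct children of the outer list by passing `recursive=False` to `find_all`. The sketch below uses made-up markup (the `id="menu"` attribute and the item text are assumptions), since the original page is not shown.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a list whose items contain nested lists.
html = """
<ul id="menu">
  <li>One<ul><li>One-a</li></ul></li>
  <li>Two</li>
  <li>Three<ul><li>Three-a</li></ul></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("ul", id="menu")

# recursive=False restricts the search to direct children of the outer <ul>.
top_level = outer.find_all("li", recursive=False)

# The default (recursive) search also returns the nested <li> tags.
all_items = soup.find_all("li")
```

Taking only some of the top-level items is then an ordinary list slice, for example `top_level[:3]`.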
1 vote, 1 answer, 45 views
Why is my code giving me an AttributeError?
I am trying to iterate through a couple of levels of HTML to retrieve links associated with legislation. However, once I reach the 2nd level of links, instead of retrieving a list of links associated ...
1 vote, 1 answer, 360 views
Trying to use pd.read_html to extract information and export data to a Pandas dataframe
I am trying to extract the information from the table on this Wikipedia page to automate data collection.
Link to webpage: https://en.wikipedia.org/wiki/List_of_members_of_the_17th_Lok_Sabha
I am ...
1 vote, 0 answers, 284 views
Word count of text extracted from URL in Python
I am working on an NLP project that takes a URL as input and summarizes its content using the gensim library. As a metric for the output summary, I want to calculate the word count of the ...
1 vote, 0 answers, 40 views
segmenting bs4.element.Tag
Is it possible to segment a bs4.element.Tag into several bs4.element.Tag?
You can think of an application as the following:
1- The original bs4.element.Tag contains a paragraph.
2- We want to segment ...
1 vote, 1 answer, 487 views
How to parse HTML with source mapping?
I want to use Python to parse HTML markup, and given one of the resultant DOM tree elements, get the start and end offsets of that element within the original, unmodified markup.
For example, given ...
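A rough sketch of one approach, using the standard library's `html.parser` rather than Beautiful Soup: `HTMLParser.getpos()` reports the line and column where the current tag begins, which can be converted to an absolute offset into the original string. The class name `OffsetParser` and the sample markup are assumptions, and the sketch expects the whole document to be passed to `feed` in a single call.

```python
from html.parser import HTMLParser


class OffsetParser(HTMLParser):
    """Record (tag, start, end) offsets of elements within the unmodified source."""

    def __init__(self, source):
        super().__init__(convert_charrefs=True)
        self.source = source
        # Absolute offset at which each line begins, for converting getpos().
        self.line_starts = [0]
        for index, char in enumerate(source):
            if char == "\n":
                self.line_starts.append(index + 1)
        self.open_tags = []  # stack of (tag, start_offset)
        self.spans = []      # completed (tag, start_offset, end_offset)

    def _offset(self):
        line, column = self.getpos()  # line is 1-based, column is 0-based
        return self.line_starts[line - 1] + column

    def handle_starttag(self, tag, attrs):
        self.open_tags.append((tag, self._offset()))

    def handle_endtag(self, tag):
        # The element ends just after the '>' that closes the end tag.
        end = self.source.index(">", self._offset()) + 1
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                start = self.open_tags[i][1]
                del self.open_tags[i:]  # also drops unclosed tags nested inside
                self.spans.append((tag, start, end))
                break


source = "<p>Hi <b>there</b></p>"
parser = OffsetParser(source)
parser.feed(source)
parser.close()
spans = parser.spans
```

Each recorded span can be checked by slicing the original string, for example `source[start:end]`. Elements that are never closed produce no span in this sketch.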
1 vote, 0 answers, 144 views
Beautifulsoup only returning metadata
Can someone help me understand why beautifulsoup seems to only be returning metadata?
Here's my code:
import requests
from bs4 import BeautifulSoup
#create a session
client = requests.Session()
#...