I scraped some data from the web using:
import requests
from bs4 import BeautifulSoup
def get_lines_from_url(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        lines = soup.get_text("\n").strip().splitlines()
        return lines
    # Return an empty list on a non-200 response so callers can still iterate.
    return []
while printing them with:
all_lines = get_lines_from_url(link)
for line in all_lines:
    print(line)
I encounter UnicodeEncodeError: 'charmap' codec can't encode character '\u2588' in position 0: character maps to <undefined>
I printed the encoded version of each line with:
for line in all_lines:
    print(line.encode('utf-8'))
The culprit was a line containing b'\xe2\x96\x88'.
I did some research and found that all lines print when I include sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach()), as in:
import sys
import codecs
# other imports...
# `get_lines_from_url` declaration...
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
all_lines = get_lines_from_url(link)
for line in all_lines:
    print(line)
What does sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach()) do?
Why couldn't Python print b'\xe2\x96\x88' directly?
print(b"\xe2\x96\x88") works just fine, because it prints the ASCII repr of a bytes object rather than the character itself.
U+2588 is FULL BLOCK, and E2 96 88 is that character encoded in UTF-8. It was correctly read from the web and interpreted as the correct Unicode character, but when written to stdout, the character is encoded in the terminal's encoding (a non-UTF-8 encoding that doesn't support that character).
codecs.getwriter wrapped the stdout byte stream in a UTF-8 encoder, so it supported the character.
Alternatively, set "PYTHONIOENCODING=utf-8" before running the interpreter (and/or enable the Python UTF-8 Mode).
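The decode and encode steps described above can be reproduced without a terminal. This is a minimal sketch, not the asker's exact setup: it assumes the console encoding was cp1252 (a common Windows "charmap" code page; the question does not say which one was in use), and it uses an in-memory io.BytesIO as a stand-in for the byte stream returned by sys.stdout.detach().

```python
import codecs
import io
import unicodedata

# The three bytes seen in the scraped line decode to one Unicode character.
raw = b"\xe2\x96\x88"
char = raw.decode("utf-8")
print(ascii(char), unicodedata.name(char))

# Assumption: the terminal used cp1252. That code page has no mapping for
# U+2588, so encoding the character fails, just as print() did.
try:
    char.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# codecs.getwriter("utf-8") returns a StreamWriter class; instantiating it
# around a byte stream encodes every str written to it as UTF-8.
byte_stream = io.BytesIO()  # stand-in for sys.stdout.detach()
writer = codecs.getwriter("utf-8")(byte_stream)
writer.write(char + "\n")
written = byte_stream.getvalue()
```

Here written ends up holding the same E2 96 88 bytes that came from the web page, followed by a newline. On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") is another way to get the same effect without replacing sys.stdout; whether the character then displays correctly still depends on the terminal and its font.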