I scraped some data from the web using:

import requests
from bs4 import BeautifulSoup


def get_lines_from_url(url):
    response = requests.get(url)

    # Return an empty list on a non-200 response so `lines` is never undefined.
    if response.status_code != 200:
        return []

    soup = BeautifulSoup(response.text, 'html.parser')
    return soup.get_text("\n").strip().splitlines()

When printing them with:

all_lines = get_lines_from_url(link)
for line in all_lines:
    print(line)

I get UnicodeEncodeError: 'charmap' codec can't encode character '\u2588' in position 0: character maps to <undefined>

I printed the encoded version of each line with:

for line in all_lines:
    print(line.encode('utf-8'))

The culprit was a line containing b'\xe2\x96\x88'.
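To identify which character those bytes stand for, a minimal sketch like the following can help. It assumes the bytes are UTF-8 (which is how they were produced above); the variable names are only illustrative.

```python
import unicodedata

# The byte sequence printed for the offending line.
culprit = b"\xe2\x96\x88"

# Decode the UTF-8 bytes back into a single Unicode character.
char = culprit.decode("utf-8")

# Format the code point in the usual U+XXXX notation and look up its name.
codepoint = f"U+{ord(char):04X}"
name = unicodedata.name(char)

print(codepoint, name)
```

This matches the '\u2588' reported in the error message, so the character itself was read correctly and the failure happens only when it is written out.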

I did some research and found that all lines print when I include sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach()), as in:

import sys
import codecs
# other imports...

# `get_lines_from_url` declaration...

sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())

all_lines = get_lines_from_url(link)
for line in all_lines:
    print(line)

What does sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach()) do? Why couldn't Python print b'\xe2\x96\x88' directly?

  • Which version of python is this? I just tried with 3.12 and print(b"\xe2\x96\x88") works just fine
    – Sayse
    Commented Sep 11 at 14:20
  • @Sayse Printing a byte string doesn't try to decode it as utf-8.
    – Barmar
    Commented Sep 11 at 14:39
  • That byte sequence can be decoded using a variety of codecs (without exception). What you need to tell us is what output you were expecting - i.e., what should b"\xe2\x96\x88" represent in its printed form?
    – SIGHUP
    Commented Sep 11 at 15:00
  • The terminal/IDE you were using was using a code page that didn't support that Unicode character. U+2588 is █ FULL BLOCK and E2 96 88 is that character encoded in UTF-8. It was correctly read from the web and interpreted as the correct Unicode character, but when written to stdout, the character is encoded in the terminal's encoding (a non-UTF8 encoding that didn't support that character). codecs.getwriter wrapped the stdout byte stream in a UTF-8 encoder so it supported the character.
    Commented Sep 11 at 17:06
  • If you are on Windows, try set "PYTHONIOENCODING=utf-8" before running the interpreter (and/or enable the Python UTF-8 Mode.).
    – JosefZ
    Commented Sep 11 at 17:41
