web scraping - Why is sys.stdout adjustment needed to print Unicode in Python?

I scraped some data from the web using:

import requests
from bs4 import BeautifulSoup


def get_lines_from_url(url):
    response = requests.get(url)

    # Default to an empty list so `lines` is always bound on a non-200 response.
    lines = []
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        lines = soup.get_text("\n").strip().splitlines()
    return lines

When printing them with:

all_lines = get_lines_from_url(link)
for line in all_lines:
    print(line)

I encounter UnicodeEncodeError: 'charmap' codec can't encode character '\u2588' in position 0: character maps to <undefined>

I printed the encoded version of each line with:

for line in all_lines:
    print(line.encode('utf-8'))

The culprit was a line containing b'\xe2\x96\x88'.

I did some research and found that all lines print when I include sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach()), as in:

import sys
import codecs
# other imports...

# `get_lines_from_url` declaration...

sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())

all_lines = get_lines_from_url(link)
for line in all_lines:
    print(line)

What does sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach()) do? Why couldn't Python print b'\xe2\x96\x88' directly?

asked Sep 11 at 14:15

Kun.tito


Which version of python is this? I just tried with 3.12 and print(b"\xe2\x96\x88") works just fine
– Sayse
Commented Sep 11 at 14:20

@Sayse Printing a byte string doesn't try to decode it as utf-8.
– Barmar
Commented Sep 11 at 14:39
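
To illustrate the point about byte strings, here is a minimal sketch. The names `FULL_BLOCK` and `printed_form` are only illustrative and do not come from the question. Printing a `bytes` object writes its ASCII-only repr, so the terminal never has to encode U+2588 itself, which is why `print(b"\xe2\x96\x88")` succeeds on any code page.

```python
FULL_BLOCK = "\u2588"  # the str that BeautifulSoup returned (one code point)


def printed_form(value):
    """Return the text that print() would send to stdout for `value`."""
    return str(value)


encoded = FULL_BLOCK.encode("utf-8")

# For bytes, print() writes the repr: plain ASCII escapes, no block character.
print(printed_form(encoded))
print(printed_form(encoded).isascii())

# For str, print() must encode the actual character for the terminal,
# and that is the step that can raise UnicodeEncodeError.
print(printed_form(FULL_BLOCK).isascii())
```

So the failing call in the question is `print(line)` on a `str`, not a print of the bytes.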

That byte sequence can be decoded using a variety of codecs (without exception). What you need to tell us is what output you were expecting - i.e., what should b"\xe2\x96\x88" represent in its printed form?
– SIGHUP
Commented Sep 11 at 15:00

The terminal/IDE you were using was using a code page that didn't support that Unicode character. U+2588 is █ FULL BLOCK and E2 96 88 is that character encoded in UTF-8. It was correctly read from the web and interpreted as the correct Unicode character, but when written to stdout, the character is encoded in the terminal's encoding (a non-UTF8 encoding that didn't support that character). codecs.getwriter wrapped the stdout byte stream in a UTF-8 encoder so it supported the character.
– Mark Tolonen
Commented Sep 11 at 17:06
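
The explanation above can be reproduced without a terminal. This is a sketch under an assumption: the question does not say which code page was active, so `cp1252` (a common Windows default that uses Python's 'charmap' codec) stands in for it. An in-memory `io.BytesIO` stands in for the detached stdout byte stream, and `encode_error_reason` is a hypothetical helper name.

```python
import codecs
import io

FULL_BLOCK = "\u2588"  # U+2588 FULL BLOCK


def encode_error_reason(text, encoding):
    """Return None if `text` encodes cleanly, else the codec's error reason."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc.reason
    return None


# Assumed legacy code page: cp1252 has no mapping for U+2588.
print(encode_error_reason(FULL_BLOCK, "cp1252"))

# UTF-8 can encode every Unicode code point, so this reports no error.
print(encode_error_reason(FULL_BLOCK, "utf-8"))

# What the codecs.getwriter line does, with BytesIO in place of
# sys.stdout.detach(): wrap a byte stream in a UTF-8 encoding writer.
raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)
writer.write(FULL_BLOCK)
print(raw.getvalue())
```

In the question's code, `sys.stdout.detach()` returns the underlying binary buffer, and the writer built by `codecs.getwriter("utf-8")` encodes every string to UTF-8 before writing it there, bypassing the terminal's default encoding.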
If you are on Windows, try set "PYTHONIOENCODING=utf-8" before running the interpreter (and/or enable the Python UTF-8 Mode).
– JosefZ
Commented Sep 11 at 17:41
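
If changing the environment is not an option, a similar effect can probably be had from inside the script. The sketch below assumes Python 3.7 or later, where text streams such as `sys.stdout` offer `reconfigure()`. The helper name `force_utf8` is hypothetical, and the `cp1252` stream is a stand-in for a console using a legacy code page.

```python
import io


def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure().

    Returns True when the encoding was changed, False otherwise
    (for example when the stream is a replacement object without it).
    """
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is None:
        return False
    reconfigure(encoding="utf-8")
    return True


# Stand-in for a console stream opened with a legacy code page (assumption).
raw = io.BytesIO()
fake_stdout = io.TextIOWrapper(raw, encoding="cp1252")

changed = force_utf8(fake_stdout)
fake_stdout.write("\u2588")  # would raise UnicodeEncodeError under cp1252
fake_stdout.flush()
print(changed, raw.getvalue())
```

In a real script the call would be `force_utf8(sys.stdout)` near the top, before any printing. Whether the block character then displays correctly still depends on the terminal and its font.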
