I am a moderator of /r/sweepstakes on Reddit, which lets users post their referral links to contests/sweepstakes. One main rule is that a user may not post their link to a contest if another user has already done so. Checking for reposts isn't straightforward, since every referral link has a different URL (e.g. contest.com/?ref=Kevin and contest.com/?ref=Steve).

I thought a good way to find reposts would be to retrieve the title of the webpage (the <title> tag) and store it in a database along with some other vital information.

The bot scans the subreddit every 15 minutes for new posts. For each post it does the following:

  1. See if we have already looked at the post by searching the DB for the pid (PostId). If we have it, skip and move on to the next post.

  2. Get the final URL using urllib2, since some URLs redirect to another webpage (i.e. bit.ly links).

  3. Get the title (<title>) of the webpage by using BeautifulSoup.

  4. Search the DB for the Title. If the title is in the database, then that means the submitted post is a repost and we want to retrieve some information on the original post (permalink, submitter). We add this information to a string that will be sent to the moderators.

  5. If the submitted post's title does not already exist in the database, then it is a unique post and we will add it to the database.

  6. Once all posts have been processed, send the message of all reposts to the moderators for them to manually inspect.

I ran into a lot of issues, predominantly with resolving the post's final URL and fetching the page's title. To keep things simple, I may end up removing the redirect-resolution function, since it isn't very important.

In particular, I ran into ASCII/Unicode issues and kept getting UnicodeEncodeError/UnicodeDecodeError exceptions.

Suggestions on how to improve the code would be appreciated.

import traceback
import praw # simple interface to the reddit API, also handles rate limiting of requests
import time
import sqlite3
import re
from urlparse import urlparse
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
import requests

'''USER CONFIGURATION'''

APP_ID = 'XXXX'
APP_SECRET = 'XXXX'
APP_URI = 'XXXX'
APP_REFRESH = 'XXXX'
USERAGENT = 'XXXX'
SUBREDDIT = "XXXX"
MAXPOSTS = 30
WAIT = 900 #15m This is how many seconds you will wait between cycles. The bot is completely inactive during this time.

# Resolve redirects for a URL, i.e. bit.ly/XXXX --> somesite.com/blahblah
# Also takes a retry count in case we hit a rate limit
def resolve_redirects(url, tries):
    tries -= 1
    try:
        req = urllib2.Request(url, headers={'User-Agent' : "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.125 Safari/537.36"}) # User agent since some sites block python/urllib2 useragent
        return urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        print('HTTPError: ' + str(e.code) + ': ' + domain)
        if (e.code == 403 or e.code == 429) and tries > 0:
            time.sleep(5)
            resolve_redirects(url, tries)
    except urllib2.URLError, e:
        print('URLError: ' + str(e.reason) + ': ' + domain)
    except Exception:
        import traceback
        print('Generic Exception: ' + traceback.format_exc())

# Get title of webpage if possible. Otherwise just set the page title equal to the pages URL        
def get_title(url):
    try:
        title = BeautifulSoup(url).title.string.strip()
    except AttributeError:
        title = url.geturl()
    return title.encode('utf-8').strip()

# Load Database
sql = sqlite3.connect('Reddit_DB.db')
print('Loaded SQL Database')
cur = sql.cursor()

# Create Table and Login to Reddit
cur.execute('CREATE TABLE IF NOT EXISTS duplicates(id TEXT, permalink TEXT, domain TEXT, url TEXT, title TEXT, submitter TEXT)')
sql.commit()
print('Logging in...')
r = praw.Reddit(USERAGENT)
r.set_oauth_app_info(APP_ID, APP_SECRET, APP_URI)
r.refresh_access_information(APP_REFRESH)

# Main portion of code
def replybot():
    print('Searching %s @ %s' % (SUBREDDIT, time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))))
    subreddit = r.get_subreddit(SUBREDDIT)
    submissions = list(subreddit.get_new(limit=MAXPOSTS))
    msg = ""
    for post in submissions:
        global domain # Need to be global to use in resolve_redirects()
        pid = post.id

        try:
            author = post.author.name
        except AttributeError:
            print "AttributeError: Author is deleted"
            continue

        # See if we have already looked at this post before. If we have, skip it.
        cur.execute('SELECT * FROM duplicates WHERE ID=?', [pid])
        sql.commit()
        if cur.fetchone(): # Post is already in the database
            continue

        url = post.url
        domain = post.domain
        if domain == "self." + str(SUBREDDIT): # Skip self posts
            continue

        # Get the final url after redirects (i.e. in case URL redirects to a different URL)
        try:
            post_url = resolve_redirects(url, 3)
            effective_url = post_url.geturl()
        except AttributeError:
            print "AttributeError: Post URL/Effective URL"
            continue

        # Get Title of webpage in Final URL
        try:    
            post_title = get_title(post_url).encode('utf-8').strip()
        except UnicodeDecodeError:
            post_title = unicode(get_title(post_url).strip(),"utf-8")
        except UnicodeEncodeError:
            print "UnicodeError: " + post.title
            continue

        # Check if the post is a repost by seeing if the Title already exists. If it does, get the Repost's permalink, title, submitter and create the message. Otherwise post is unique and is added to DB
        cur.execute('SELECT * FROM duplicates where TITLE=?', [post_title])
        sql.commit()
        row = cur.fetchone()
        if row:
            repost_permalink = row[1]
            repost_title = row[4]
            repost_submitter = row[5]
            print "Found repost of %s by %s" % (post.title, author)
            msg += 'Repost: [%s](%s) by /u/%s. Original: [Here](%s) by /u/%s.\n\n' % (post.title, post.permalink, author, repost_permalink, repost_submitter)
        else:
            cur.execute('INSERT INTO duplicates VALUES(?,?,?,?,?,?)', [pid, post.permalink, domain, effective_url, post_title, author])
            sql.commit()

    # If message exists (meaning there was a repost), send message to moderators
    if len(msg) > 0:
        r.send_message('/r/sweepstakes', 'Possible Repost', msg)
        print "Sent message"
    else:
        print "Nothing to send"

cycles = 0
while True:
    try:
        # Keep refresh alive by refreshing every 45m
        if cycles % 3 == 0:
            r.refresh_access_information(APP_REFRESH)
            print "Refreshed OAuth"
        replybot()
        cycles += 1
    except Exception as e:
        traceback.print_exc()
    time.sleep(WAIT)
  • You mention some errors; can you clarify whether or not this code is currently working as intended, please? – jonrsharpe, Jul 30, 2015 at 7:50
  • The code works; it's just that I keep getting random exceptions that I didn't expect would come up (i.e. AttributeError, UnicodeError). – Bijan, Jul 30, 2015 at 15:57

1 Answer


Use modern versions of things

The most obvious thing here is use Python 3. This will help massively with your Unicode problems, because Python 3 maintains a stricter separation between things that Python 2 conflated. In some cases, your errors will just be artifacts of Python 2's way of doing things, and will just go away. In others, you'll get errors that give you a much better idea of what the problem is.
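
For instance, Python 3 keeps text (str) and bytes strictly apart, so mixing them fails immediately with a clear TypeError instead of Python 2's implicit ASCII conversion blowing up somewhere far from the real problem:

title = 'Café sweepstakes'   # str: unicode text
raw = title.encode('utf-8')  # bytes: text encoded for storage or transmission
text = raw.decode('utf-8')   # explicit round-trip back to str
# raw + title                # TypeError in Python 3; Python 2 would attempt an
#                            # implicit ASCII decode and raise UnicodeDecodeError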

In general, the only reason to use Python 2 for new code these days is if you have to use one of an increasingly small number of libraries that hasn't been ported. You use three non-stdlib packages; of those, requests and praw both support Python 3.

Which leaves: BeautifulSoup. The fact that you are importing it as BeautifulSoup implies you're using bs3, which only works on Python 2.x and hasn't had an update since 2012. Upgrade to BeautifulSoup 4 - it is actively maintained (at the time of this post, the last release was just shy of 4 weeks ago), and supports all current versions of Python.
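
With bs4, only the import (plus an explicit parser choice) changes; a minimal sketch:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # bs4 wants the parser named explicitly
title = soup.title.string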

Use requests

You import requests, but you also import urllib and urllib2. Of those, the easiest to use for what you want is requests, and the only one you actually use is urllib2.
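
requests follows redirects for you and exposes the final URL on the response, which is most of what resolve_redirects is doing by hand; a short sketch (user-agent string shortened):

import requests

headers = {'User-Agent': 'Mozilla/5.0 ...'}  # some sites block the default user agent
response = requests.get('http://bit.ly/XXXX', headers=headers, timeout=10)
response.raise_for_status()
print(response.url)  # the final URL after any redirects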

General Pythonisms

e.code == 403 or e.code == 429

can be shortened to:

e.code in (403, 429)

In general, Python style prefers iteration to recursion. So, instead of retrying like this:

def resolve_redirects(url, tries):
    tries -= 1
    # Several lines of code unrelated to tries
    ...
    except urllib2.HTTPError, e:
        time.sleep(5)
        resolve_redirects(url, tries)

do this (also converted to use requests, and string formatting instead of concatenation):

def resolve_redirects(url, tries):
    for _ in range(tries):
        response = requests.get(url, headers=...)
        if response.status_code in (403, 429):
            print('HTTP Error: {}: {}'.format(response.status_code, url))
            time.sleep(5)  # back off before retrying, as before
            continue
        elif response.status_code != 200:
            # Generic error
            response.raise_for_status()
        else:
            return response
    response.raise_for_status()  # out of retries; surface the last error

I've also removed your exception handling for generic errors in here, because I don't think this is the right place to handle them. Instead, let them bubble up to the main line and deal with them there.

This has a flow-on implication down here:

try:
    post_url = resolve_redirects(url, 3)
    effective_url = post_url.geturl()
except AttributeError:
    print "AttributeError: Post URL/Effective URL"
    continue

That AttributeError was almost certainly coming up because of your previous exception handling. You were printing the error and then ignoring it and continuing on, which made resolve_redirects return None by falling off the end. So now, you can change this guard to except requests.RequestException: (the base class of the errors requests raises), so it gives you a better idea of what's going on.

You should probably also rename post_url, since it's not really a URL anymore (it's a Response, so for lack of a better name, let's call it post_response). Its final URL after redirects is then post_response.url rather than .geturl().

This is the right place to handle that error. But instead of calling print here, consider using the logging module.
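
A minimal sketch of that, inside the post loop and assuming the requests-based resolve_redirects above (the logger name is arbitrary):

import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger('repostbot')  # arbitrary name for this sketch

try:
    post_response = resolve_redirects(url, 3)
    effective_url = post_response.url  # requests Responses carry the final URL
except requests.RequestException:
    log.exception('Could not resolve %s', url)
    continue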

Above this:

submissions = list(subreddit.get_new(limit=MAXPOSTS))

There's no need to turn that result into a list. Anything you can pass to list you can also iterate over directly. Only bother turning it into a list if you need to iterate over it more than once (you don't).

url = post.url
domain = post.domain

Just use post.url and post.domain directly.

try:    
    post_title = get_title(post_url).encode('utf-8').strip()
except UnicodeDecodeError:
    post_title = unicode(get_title(post_url).strip(),"utf-8")
except UnicodeEncodeError:
    print "UnicodeError: " + post.title
    continue

That is a lovely abomination. It looks like you're trying to handle the page being in an arbitrary encoding and standardise it to UTF-8? If that's the case, do this:

title = get_title(post_response.text).strip().encode('utf8')

In Python 3, encode will not raise a UnicodeDecodeError, because someone realised that that was a little odd. Encoding to utf8 should not raise a UnicodeEncodeError, because there are no unicode codepoints that utf8 can't encode.

If you're happy with the raw bytes in whatever encoding they happen to be, do this:

title = get_title(post_response.content).strip()

For reposts, you gradually build a string message to send to someone. It would be better (and probably a little faster) to build a list of the pertinent information:

reposts = []
for post in posts:
    ...
    if row:
        # There's a repost
        reposts.append((tuple of the things you currently make a string for))
    ...
if reposts:
    template = 'Repost: [{}]({}) by /u/{}. Original: [Here]({}) by /u/{}.'
    msg = '\n\n'.join(template.format(*repost) for repost in reposts)
    r.send_message(...)

Sqlite row objects can be accessed by column name once you set sql.row_factory = sqlite3.Row on the connection - rename your row variable to repost, and you can write, e.g., repost['permalink'] instead of having to create variables to keep track of what each one is.
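
Enabling that takes one line on the connection; a small sketch using your table's columns:

sql = sqlite3.connect('Reddit_DB.db')
sql.row_factory = sqlite3.Row  # rows become addressable by column name
cur = sql.cursor()

cur.execute('SELECT * FROM duplicates WHERE title=?', [post_title])
repost = cur.fetchone()
if repost:
    print(repost['permalink'], repost['submitter'])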

A more Pythonic way to manage your cycles counter down the bottom is like this:

import itertools as it

for cycle in it.count(1):
    ...
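
Applied to your main loop, that might look like this (starting the count at 1 is fine here, since the login code has just refreshed the token anyway):

for cycle in it.count(1):
    try:
        if cycle % 3 == 0:  # refresh OAuth roughly every 45m (every third 15m cycle)
            r.refresh_access_information(APP_REFRESH)
        replybot()
    except Exception:
        traceback.print_exc()
    time.sleep(WAIT)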
  • Oops. Fixed. In fact, that wouldn't have shadowed a builtin, it would actually be an outright error (try is a keyword). – lvc, Jul 30, 2015 at 12:44
