Deploy a very simple 'always on' web scraper

Question:
I have been happily using the always-on feature for years on Replit, and now I have been told I need to use a deployment. I don’t have a web app or a complicated bot; it’s just a simple program that is supposed to run every 60 minutes.

The deployment options are confusing. I tried one for a similar program and it just ended in a black box: no console to check the print statements, and no up-to-date csv file visible!

Please show me how to use a deployment as a simple replacement for this very simple ‘always on’ program.
Thanks for your help!

Repl link:
https://replit.com/@KatharinaNi/webscraping-bot-always-on

import csv
import datetime as dt
import time

import bs4
import requests  # lxml must also be installed: bs4 uses it as the parser below


'''simplified version of my always-on program:
1. the program runs every 60 minutes
2. the program reads and writes a csv file
3. I need to easily access the console to check the print statements from time to time
'''

# loop forever so the script keeps running inside a deployment
# instead of relying on always-on restarting it
while True:
  # scrape website and get all article links
  articles = []
  date_time = dt.datetime.now()
  response = requests.get("https://www.thechinastory.org/blog/",
                          headers={'User-agent': 'Mozilla/5.0'})
  soup = bs4.BeautifulSoup(response.text, 'lxml')
  for mydiv in soup.find_all("article"):  # findAll is the deprecated spelling
    articles.append(mydiv.find('a').get('href'))

  # read csv file with all already scraped article links;
  # a set of exact urls avoids false matches when one url is a prefix of another
  with open('article_db.csv', newline='') as c:
    known_links = {row[0] for row in csv.reader(c) if row}

  # create article list with all new links that are not in article_db.csv
  article_list = [a for a in articles if a not in known_links]

  # append all new article links to article_db.csv and print to console
  if article_list:
    print(date_time, len(article_list), 'new articles found')
    with open('article_db.csv', mode='a', newline='') as data_file:
      csv_file = csv.writer(data_file,
                            quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)
      for article in article_list:
        csv_file.writerow([article, date_time])
  else:
    print(date_time, 'no new articles found')

  # wait 60 minutes until scraping starts again
  waiting_minutes = 60
  print(f'wait for {waiting_minutes} minutes, {date_time}')
  time.sleep(waiting_minutes * 60)

Hi @KatharinaNi , welcome to the forums!
Since it’s a bot, you should be deploying as a Reserved VM.
Hope this helps!


And where do I access the current version of the csv file?
And where do I access the console to read the print statements?


I’m not sure about the csv file, but to access the print statements: the Deployments pane has a Logs section, and there you can access the deployment’s console.


I found the logs now, but I still need access to the csv, which I use as a simple database.


@KatharinaNi You could periodically print out the csv data and copy and paste it. Keep in mind that deployments currently do not have a persistent file system, so when you redeploy, data could be lost. If I were you, I’d switch to using a database to store your info, and you can create a function that pulls that data from the database and compiles it into a csv so that you can download the csv file. If you want a recommendation, I’d use neon.tech for free database storage.
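A minimal sketch of that approach, assuming a Neon Postgres database whose connection string is stored in a DATABASE_URL secret. The table name `articles`, its columns, and the secret name are made up for illustration; Neon is standard Postgres, so the common `psycopg2` driver works even though Neon’s docs don’t show Python:

```python
import csv
import os
import datetime as dt


def get_conn():
    # psycopg2-binary must be installed; imported lazily so the rest of the
    # module loads even where the driver isn't available
    import psycopg2
    return psycopg2.connect(os.environ["DATABASE_URL"])


def init_db():
    # one row per article link; the PRIMARY KEY makes duplicates impossible
    with get_conn() as conn, conn.cursor() as cur:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS articles ("
            " url TEXT PRIMARY KEY,"
            " scraped_at TIMESTAMP NOT NULL)"
        )


def save_new_links(links):
    # ON CONFLICT DO NOTHING replaces the manual "already in the csv?" check;
    # the `with conn` block commits the transaction on success
    with get_conn() as conn, conn.cursor() as cur:
        for link in links:
            cur.execute(
                "INSERT INTO articles (url, scraped_at) VALUES (%s, %s)"
                " ON CONFLICT (url) DO NOTHING",
                (link, dt.datetime.now()),
            )


def export_csv(path="article_db.csv"):
    # pull everything back out of the database and write it as a csv file
    with get_conn() as conn, conn.cursor() as cur:
        cur.execute("SELECT url, scraped_at FROM articles ORDER BY scraped_at")
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(cur.fetchall())
```

Running `export_csv()` from the workspace (rather than inside the deployment) recreates article_db.csv locally, where it can be downloaded from the file tree.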

If you’d like, you can invite me to your repl and I can help with some of the integration with neon.tech, if that’s what you decide to do.


Thanks for your advice! I have zero experience with databases, since a csv has ALL the functionality I need.
But I realised that this is something to try to keep my project running. I signed up for Neon and invited you to my “small project example repl” (the original repl is private and still running). Now I have no idea how to use Neon; I am stuck since the documentation doesn’t mention Python :woozy_face: Neon documentation - Neon Docs

Ahh yes. You must use a separate library to integrate it. I can’t help much today, as I’m rather busy, but tomorrow I may have some more time to help. Sorry!

@KatharinaNi I can help a tad bit right now, but not in an hour.