Add cron functionality

Refactor the code to get rid of global variables. Also improve the docs.
Stefan Bethke 2024-05-20 12:09:04 +02:00
parent 3c1a44c4ee
commit 5575cc9156
3 changed files with 278 additions and 204 deletions


@ -6,17 +6,25 @@ can be emailed to the author as a backup.
## Expiring old notes and revisions ## Expiring old notes and revisions
Hedgedoc keeps notes and revisions (versions) of those notes forever. This might not be desirable, for example because Hedgedoc keeps notes and revisions (versions) of those notes forever. This might not be desirable, for example because
of data protection reasons. With this utility, you can remove old revisions and old notes from the database. of data protection reasons. With this utility, you can remove old revisions and old notes from the
database. `hedgedoc-expire` works by talking directly to a Postgres database; no API access to Hedgedoc is required.
Currently, it only works with Postgres.
### Expiring old revisions ### Expiring old revisions
All revisions that have been created before the specified age will be removed. If all revisions are expired, the note remains available, it just won't have any revisions to revert to. Once you continue editing it, new revisions will be added. All revisions that have been created before the specified age will be removed. If all revisions are expired, the note
remains available, it just won't have any revisions to revert to. Once you continue editing it, new revisions will be
added.
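The cutoff rule for revisions can be sketched as follows. This is only an illustration under stated assumptions: the utility itself deletes rows directly in Postgres, and the `Revision` dataclass and the `expired_revisions` helper are hypothetical names that do not appear in the code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Revision:
    """Hypothetical stand-in for one row of the Revisions table."""
    id: str
    created_at: datetime


def expired_revisions(revisions, max_age, now=None):
    """Return the revisions that were created before now - max_age."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - max_age
    return [rev for rev in revisions if rev.created_at < cutoff]


now = datetime(2024, 5, 20, 12, 0, tzinfo=timezone.utc)
revisions = [
    Revision('old', now - timedelta(days=10)),
    Revision('recent', now - timedelta(days=1)),
]
# With a limit of 7 days, only the 10 day old revision is selected.
print([rev.id for rev in expired_revisions(revisions, timedelta(days=7), now)])
```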
### Expiring old notes ### Expiring old notes
Notes that are being expired will be emailed to the account that initially created the note. This allows that user to restore the note, if necessary. Expiring a note will also remove all existing revisions for the note. Notes that are being expired will be emailed to the account that initially created the note. This allows that user to
restore the note, if necessary. Expiring a note will also remove all existing revisions for the note.
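How the recipient address might be chosen can be sketched like this. The sketch assumes that a user row carries an `email` column and a JSON `profile` column with an `emails` array, as the docstring in `hedgedoc-expire.py` describes; the helper name `creator_address` and the sample addresses are made up for illustration.

```python
import json


def creator_address(email, profile):
    """Prefer the email column; otherwise use the first address in the profile JSON."""
    if email is not None:
        return email
    return json.loads(profile)['emails'][0]


# Assumed profile shape: a JSON object with an "emails" array of addresses.
print(creator_address('foo@example.com', None))
print(creator_address(None, '{"emails": ["bar@example.com"]}'))
```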
You will need to configure your environment for the `hedgedoc-expire` to be able to send mail. If the mail is not accepted by the mail server, the note will not be removed. You will need to configure your environment for `hedgedoc-expire` to be able to send mail. If the mail is not accepted
by the mail server, the note will not be removed. Note however that this utility has no idea if the mail server has
successfully delivered that mail to the intended recipient; if the mail gets lost somewhere on the way, the note cannot
be recovered.
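The ordering described above (send the backup first, delete only if the mail server accepted it) might look like the following minimal sketch. `expire_note`, `send_mail` and `delete_note` are hypothetical names; the real code sends the message through SMTP and runs a `DELETE` statement against the database.

```python
def expire_note(note, send_mail, delete_note):
    """Delete the note only if the backup mail was handed to the mail server."""
    try:
        send_mail(note)
    except Exception as error:
        # The server did not accept the mail, so the note is kept.
        print(f'Unable to send email for {note["title"]}: {error}')
        return False
    delete_note(note)
    return True


deleted = []


def rejecting_server(note):
    raise ConnectionError('mail server refused the message')


print(expire_note({'title': 'foo'}, rejecting_server, deleted.append))  # False
print(expire_note({'title': 'bar'}, lambda note: None, deleted.append))  # True
```

Acceptance by the server is the only signal this pattern can observe; it says nothing about final delivery.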
## Running `hedgedoc-expire.py` ## Running `hedgedoc-expire.py`
@ -38,22 +46,43 @@ From Docker Compose:
- database - database
``` ```
Running against a local setup with one note, with times set to a fraction of a day:
```shell
$ poetry run python ./hedgedoc-expire.py -n .001 -r .001
Revisions to be deleted created before 2024-05-20 09:02:46.407671+00:00 (a minute):
foo@example.com http://localhost:3000/foo: hedgedoc-expire - remove old notes
a day: 328dd778-6914-4a04-be58-940b1fe83e01
a day: 5acca085-51da-4463-ad20-74dc03e68255
Notes to be deleted not changed since 2024-05-20 09:02:46.416294+00:00 (a minute):
foo@example.com (a day) http://localhost:3000/foo: hedgedoc-expire - remove old notes
```
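The values passed to `-n` and `-r` are days and may be fractional. As a quick check of what `.001` means, the sketch below converts it with the standard library only; the result is roughly a minute and a half, which the report above rounds to "a minute".

```python
from datetime import timedelta

# -n .001 and -r .001 are fractions of a day
note_age = timedelta(days=0.001)
print(note_age.total_seconds())  # 86.4
```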
## Arguments and Environment Variables ## Arguments and Environment Variables
There are two main modes to run `hedgedoc-expire`: check and expire. With `-c`, a report is generated on standard out. There are two main modes to run `hedgedoc-expire`: check and expire. With `-c`, a report is generated on standard out.
Without it, the expiry action will be taken. Without it, the expiry action will be taken.
| Option | Default | Description | | Option | Default | Description |
|--------|---------|------------------------------------------------------------| |--------|---------|-------------------------------------------------------------------|
| -c | | Create a report which revisions and notes will be expired | | -n | 95 | remove all notes not changed in these many days |
| -n | 90 | remove all notes not changed in these many days | | -r | 14 | remove all revisions created more than these many days ago |
| -r | 7 | remove all revisions created more than these many days ago | | -v | false | Print info on current action during `cron` and `expire` commands |
Command is one of:
| Command | Description |
|---------|------------------------------------------------------------------------------------------------------------|
| check | Print a list of revisions and notes that would be expired, based on the given parameters for `-n` and `-r` |
| cron | Run `expire` at 2 am local time each day. Will run until killed. |
| expire | Expire old revisions and notes. |
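Computing the time of the next 2 am run can be sketched as below. This is one possible variant, not the loop from this commit: the commit always schedules 2 am of the following day, while the hypothetical `next_run` helper here picks today's 2 am if it has not passed yet.

```python
from datetime import datetime, timedelta


def next_run(now, hour=2):
    """Return the next occurrence of hour:00 local time strictly after now."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


print(next_run(datetime(2024, 5, 20, 1, 30)))  # 2024-05-20 02:00:00
print(next_run(datetime(2024, 5, 20, 12, 9)))  # 2024-05-21 02:00:00
```

A `cron` loop would then sleep for `(next_run(datetime.now()) - datetime.now()).total_seconds()` before calling the expire action.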
### Environment variables ### Environment variables
To configure the Postgres database connection and the SMTP parameters to send mail, set these variables. To configure the Postgres database connection and the SMTP parameters to send mail, set these variables.
For the SMTP connection, the code assumes a standard submission protocol setup with StartTLS enabled and authentication, so you will need to configure a username and password. For the SMTP connection, the code assumes a standard submission protocol setup with StartTLS enabled and authentication,
so you will need to configure a username and password.
| Variable | Default | Description | | Variable | Default | Description |
|-------------------|-----------------------|-------------------------------------| |-------------------|-----------------------|-------------------------------------|


@ -46,7 +46,7 @@ services:
hedgedoc-expire: hedgedoc-expire:
image: hedgedoc-expire image: hedgedoc-expire
command: "-c -r .001 -n .001" command: "-v -r .001 -n .001 check"
environment: environment:
- "POSTGRES_HOSTNAME=database" - "POSTGRES_HOSTNAME=database"
depends_on: depends_on:


@ -11,6 +11,7 @@ from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText from email.mime.text import MIMEText
from os import getenv from os import getenv
from textwrap import dedent from textwrap import dedent
from time import sleep
import humanize import humanize
import pgsql import pgsql
@ -22,6 +23,10 @@ class Config:
""" """
def __init__(self): def __init__(self):
self.verbose = False
self.revision_age = timedelta(days=14)
self.note_age = timedelta(days=95)
self.postgres_hostname = getenv('POSTGRES_HOSTNAME', 'localhost') self.postgres_hostname = getenv('POSTGRES_HOSTNAME', 'localhost')
self.postgres_username = getenv('POSTGRES_USERNAME', 'hedgedoc') self.postgres_username = getenv('POSTGRES_USERNAME', 'hedgedoc')
self.postgres_password = getenv('POSTGRES_PASSWORD', 'geheim') self.postgres_password = getenv('POSTGRES_PASSWORD', 'geheim')
@ -62,210 +67,250 @@ class EmailSender:
smtp_server.send_message(message) smtp_server.send_message(message)
def email_from_email_or_profile(row) -> str: class HedgedocExpire:
""" def __init__(self, config: Config, email_sender: EmailSender):
Get the email address of the creator from a database row. If the email column is populated, use that, otherwise self.config = config
try to extract it from the login profile. The profile is a JSON object that has an emails array. We're using the self.email_sender = email_sender
first address from there.
:param row: database row as a dict with email and profile columns @staticmethod
:return: email address def email_from_email_or_profile(row) -> str:
""" """
if row['email'] is not None: Get the email address of the creator from a database row. If the email column is populated, use that, otherwise
return row['email'] try to extract it from the login profile. The profile is a JSON object that has an emails array. We're using the
profile = json.loads(row['profile']) first address from there.
return profile['emails'][0] :param row: database row as a dict with email and profile columns
:return: email address
"""
if row['email'] is not None:
return row['email']
profile = json.loads(row['profile'])
return profile['emails'][0]
def notes_to_be_expired(self, db) -> list[dict]:
"""
Get a list of all notes to be expired.
:return:
"""
notes = []
cutoff = datetime.now(timezone.utc) - self.config.note_age
with db.prepare('''SELECT
"Notes"."alias",
"Notes"."content",
"Notes"."createdAt",
"Notes"."ownerId",
"Notes"."shortid",
"Notes"."id",
"Notes"."title",
"Notes"."updatedAt",
"Users"."email",
"Users"."profile"
FROM "Notes", "Users"
WHERE "Notes"."updatedAt" < $1
AND "Notes"."ownerId" = "Users"."id"
ORDER BY "Notes"."updatedAt"
''') as notes_older_than:
for row in notes_older_than(cutoff):
notes.append({
'alias': row.alias if row.alias is not None else row.shortid,
'content': row.content,
'createdAt': row.createdAt,
'email': row.email,
"id": row.id,
'ownerId': row.ownerId,
'profile': row.profile,
'shortid': row.shortid,
'title': row.title,
'updatedAt': row.updatedAt
})
return notes
def revisions_to_be_expired(self, db) -> list[dict]:
"""
Obtain a list of revisions to be expired.
:param db: the database connection
:return:
"""
revisions = []
cutoff = datetime.now(timezone.utc) - self.config.revision_age
with db.prepare('''SELECT
"Notes"."alias",
"Revisions"."createdAt",
"Users"."email",
"Users"."profile",
"Revisions"."id" as "revisionId",
"Notes"."id" as "noteId",
"Notes"."shortid" as "shortid",
"Notes"."title"
FROM "Revisions", "Notes", "Users"
WHERE "Revisions"."createdAt" < $1
AND "Revisions"."noteId" = "Notes"."id"
AND "Notes"."ownerId" = "Users"."id"
ORDER BY "Notes"."createdAt", "Revisions"."createdAt"
''') as revs_older_than:
for row in revs_older_than(cutoff):
revisions.append({
'alias': row.alias,
'createdAt': row.createdAt,
'email': row.email,
'noteId': row.noteId,
'profile': row.profile,
'revisionId': row.revisionId,
'shortid': row.shortid,
'title': row.title
})
return revisions
def check_notes_to_be_expired(self, db) -> None:
"""
Print a list of notes that will be expired.
:param db: the database connection
:return:
"""
cutoff = datetime.now(timezone.utc) - self.config.note_age
print(f'Notes to be deleted not changed since {cutoff} ({humanize.naturaldelta(self.config.note_age)}):')
for note in self.notes_to_be_expired(db):
age = datetime.now(timezone.utc) - datetime.fromisoformat(note['updatedAt'])
url = self.config.url + '/' + (note["alias"] if note["alias"] is not None else note["shortid"])
print(f' {self.email_from_email_or_profile(note)} ({humanize.naturaldelta(age)}) {url}: {note["title"]}')
def check_revisions_to_be_expired(self, db) -> None:
"""
Print a list of revisions that will be expired.
:return:
"""
cutoff = datetime.now(timezone.utc) - self.config.revision_age
print(f'Revisions to be deleted created before {cutoff} ({humanize.naturaldelta(self.config.revision_age)}):')
notes = {}
for row in self.revisions_to_be_expired(db):
row['age'] = datetime.now(timezone.utc) - datetime.fromisoformat(row['createdAt'])
if row['noteId'] not in notes:
notes[row['noteId']] = []
notes[row['noteId']].append(row)
for note_id, revisions in notes.items():
addr = self.email_from_email_or_profile(revisions[0])
url = self.config.url + '/' + (
revisions[0]["alias"] if revisions[0]["alias"] is not None else revisions[0]["shortid"])
print(f' {addr} {url}: {revisions[0]["title"]}')
for rev in revisions:
print(f' {humanize.naturaldelta(rev["age"])}: {rev["revisionId"]}')
def expire_old_notes(self, db) -> None:
"""
Email old notes to their owners, then delete them.
:param db: the database connection
:return:
"""
with db.prepare('DELETE FROM "Notes" WHERE "id" = $1') as delete_statement:
for note in self.notes_to_be_expired(db):
try:
note_age = datetime.now(timezone.utc) - datetime.fromisoformat(note['updatedAt'])
msg = MIMEMultipart()
msg['From'] = self.email_sender.mail_from
msg['To'] = self.email_from_email_or_profile(note)
msg['Subject'] = f'Your HedgeDoc Note "{note["title"]}" has expired'
msg.attach(MIMEText(dedent(f'''\
You created the note titled "{note["title"]}" on {note["createdAt"]}.
It was last updated {note['updatedAt']}, {humanize.naturaldelta(note_age)} ago. We expire all notes
that have not been updated within {humanize.naturaldelta(self.config.note_age)}.
Please find attached the contents of the latest revision of your note.
The admin team for {self.config.url}
''')))
md = MIMEBase('text', 'markdown')
md.add_header('Content-Disposition', f'attachment; filename={note["title"]}.md')
md.set_payload(note["content"])
msg.attach(md)
self.email_sender.send(msg)
# email backup of the note sent, now we can delete it
delete_statement(note["id"])
if self.config.verbose:
url = self.config.url + '/' + (note["alias"] if note["alias"] is not None else note["shortid"])
print(f'Note "{note["title"]}" ({url}) emailed to {msg["To"]}')
except Exception as e:
print(f'Unable to send email to {note["email"]}: {e}', file=sys.stderr)
def expire_old_revisions(self, db) -> None:
"""
Remove all revisions, on all notes, that were created before the configured revision age.
:param db: the database connection
:return:
"""
cutoff = datetime.now(timezone.utc) - self.config.revision_age
with db.prepare('DELETE FROM "Revisions" WHERE "createdAt" < $1 RETURNING id') as delete:
rows = list(delete(cutoff))
if self.config.verbose:
print(f'Deleted {len(rows)} old revisions')
def cmd_check(self) -> None:
with pgsql.Connection((self.config.postgres_hostname, self.config.postgres_port),
self.config.postgres_username, self.config.postgres_password) as db:
self.check_revisions_to_be_expired(db)
self.check_notes_to_be_expired(db)
def cmd_expire(self) -> None:
with pgsql.Connection((self.config.postgres_hostname, self.config.postgres_port),
self.config.postgres_username, self.config.postgres_password) as db:
self.expire_old_revisions(db)
self.expire_old_notes(db)
def notes_to_be_expired(cutoff: datetime) -> list[any]: def main():
"""
Get a list of all notes to be expired.
:param cutoff: notes that have last been updated before this date are designated to be expired.
:return:
"""
notes = []
with db.prepare('''SELECT
"Notes"."alias",
"Notes"."content",
"Notes"."createdAt",
"Notes"."ownerId",
"Notes"."shortid",
"Notes"."id",
"Notes"."title",
"Notes"."updatedAt",
"Users"."email",
"Users"."profile"
FROM "Notes", "Users"
WHERE "Notes"."updatedAt" < $1
AND "Notes"."ownerId" = "Users"."id"
ORDER BY "Notes"."updatedAt"
''') as notes_older_than:
for row in notes_older_than(cutoff):
notes.append({
'alias': row.alias if row.alias is not None else row.shortid,
'content': row.content,
'createdAt': row.createdAt,
'email': row.email,
"id": row.id,
'ownerId': row.ownerId,
'profile': row.profile,
'shortid': row.shortid,
'title': row.title,
'updatedAt': row.updatedAt
})
return notes
def revisions_to_be_expired(cutoff: datetime) -> list[any]:
"""
Obtain a list of revisions to be expired.
:param cutoff:
:return:
"""
revisions = []
with db.prepare('''SELECT
"Notes"."alias",
"Revisions"."createdAt",
"Users"."email",
"Users"."profile",
"Revisions"."id" as "revisionId",
"Notes"."id" as "noteId",
"Notes"."shortid" as "shortid",
"Notes"."title"
FROM "Revisions", "Notes", "Users"
WHERE "Revisions"."createdAt" < $1
AND "Revisions"."noteId" = "Notes"."id"
AND "Notes"."ownerId" = "Users"."id"
ORDER BY "Notes"."createdAt", "Revisions"."createdAt"
''') as revs_older_than:
for row in revs_older_than(cutoff):
revisions.append({
'alias': row.alias,
'createdAt': row.createdAt,
'email': row.email,
'noteId': row.noteId,
'profile': row.profile,
'revisionId': row.revisionId,
'shortid': row.shortid,
'title': row.title
})
return revisions
def check_notes_to_be_expired(age: timedelta, config: Config) -> None:
"""
Print a list of notes that will be expired.
:param age: expire notes not updated in this timespan
:param config: configuration parameters used in output
:return:
"""
cutoff = datetime.now(timezone.utc) - age
print(f'Notes to be deleted older than {cutoff} ({humanize.naturaldelta(age)}):')
for note in notes_to_be_expired(cutoff):
age = datetime.now(timezone.utc) - datetime.fromisoformat(note['updatedAt'])
url = config.url + '/' + (note["alias"] if note["alias"] is not None else note["shortid"])
print(f' {email_from_email_or_profile(note)} ({humanize.naturaldelta(age)}) {url}: {note["title"]}')
def check_revisions_to_be_expired(age: timedelta, config: Config) -> None:
"""
Print a list of revisions that will be expired.
:param age: expire revisions created before this timespan
:param config: configuration parameters used in output
:return:
"""
cutoff = datetime.now(timezone.utc) - age
print(f'Revisions to be deleted older than {cutoff} ({humanize.naturaldelta(age)}):')
notes = {}
for row in revisions_to_be_expired(cutoff):
row['age'] = datetime.now(timezone.utc) - datetime.fromisoformat(row['createdAt'])
if row['noteId'] not in notes:
notes[row['noteId']] = []
notes[row['noteId']].append(row)
for id, revisions in notes.items():
email = email_from_email_or_profile(revisions[0])
url = config.url + '/' + (
revisions[0]["alias"] if revisions[0]["alias"] is not None else revisions[0]["shortid"])
print(f' {email} {url}: {revisions[0]["title"]}')
for rev in revisions:
print(f' {humanize.naturaldelta(rev["age"])}: {rev["revisionId"]}')
def expire_old_notes(age: timedelta, config: Config, mail: EmailSender) -> None:
"""
Email old notes to their owners, then delete them.
:param age: expire notes not updated in this timespan
:param config: configuration parameters used in output
:param mail: how to send the mail
:return:
"""
cutoff = datetime.now(timezone.utc) - age
with db.prepare('DELETE FROM "Notes" WHERE "id" = $1') as delete_statement:
for note in notes_to_be_expired(cutoff):
try:
note_age = datetime.now(timezone.utc) - datetime.fromisoformat(note['updatedAt'])
msg = MIMEMultipart()
msg['From'] = mail.mail_from
msg['To'] = email_from_email_or_profile(note)
msg['Subject'] = f'Your HedgeDoc Note "{note["title"]}" has been expired'
msg.attach(MIMEText(dedent(f'''\
You created the note titled "{note["title"]}" on {note["createdAt"]}.
It was last updated {note['updatedAt']}, {humanize.naturaldelta(note_age)} ago. We expire all notes
that have not been updated within {humanize.naturaldelta(age)}.
Please find attached the contents of the latest revision of your note.
The admin team for {config.url}
'''
)))
md = MIMEBase('text', "markdown")
md.add_header('Content-Disposition', f'attachment; filename={note["title"]}')
md.set_payload(note["content"])
msg.attach(md)
mail.send(msg)
# email backup of the note sent, now we can delete it
delete_statement(note["id"])
except Exception as e:
print(f'Unable to send email to {note["email"]}: {e}', file=sys.stderr)
def expire_old_revisions(age: timedelta) -> None:
"""
Removes all revision on all notes that have been modified earlier than age.
:param age:
:return:
"""
cutoff = datetime.now(timezone.utc) - age
with db.prepare('DELETE FROM "Revisions" WHERE "createdAt" < $1') as delete:
delete(cutoff)
if __name__ == '__main__':
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(
prog='hedgedoc-expire', prog='hedgedoc-expire',
description='Remove old notes and revisions from Hedgedoc', formatter_class=argparse.RawDescriptionHelpFormatter,
epilog='See https://git.hamburg.ccc.de/CCCHH/hedgedoc-expire') description=dedent('''\
parser.add_argument('-c', '--check', action='store_true', Remove old notes and revisions from Hedgedoc
help='print what would be done, then exit')
Notes that have not been updated in the specified time will be emailed to the creator and then deleted.
Revisions of notes that have been created before the specified time will be deleted.
'''),
epilog=dedent('''\
command is one of:
- check: print a list of revisions and notes to be expired
- cron: run expire at 2 am local time each day, until killed
- expire: expire old revisions and untouched notes
See https://git.hamburg.ccc.de/CCCHH/hedgedoc-expire
''')
)
parser.add_argument('-n', '--notes', metavar='DAYS', type=float, default=95, parser.add_argument('-n', '--notes', metavar='DAYS', type=float, default=95,
help='remove all notes not changed in these many days') help='remove all notes not changed in these many days')
parser.add_argument('-r', '--revisions', metavar='DAYS', type=float, default=14, parser.add_argument('-r', '--revisions', metavar='DAYS', type=float, default=14,
help='remove all revisions created more than these many days ago') help='remove all revisions created more than these many days ago')
parser.add_argument('command', choices=['check', 'cron', 'expire'], default='check', nargs='?',
help='action to perform')
parser.add_argument('-v', '--verbose', action='store_true', default=False,
help='print more info while running')
args = parser.parse_args() args = parser.parse_args()
revisions_delta = timedelta(days=args.revisions)
notes_delta = timedelta(days=args.notes)
config = Config() config = Config()
config.note_age = timedelta(days=args.notes)
config.revision_age = timedelta(days=args.revisions)
config.verbose = args.verbose
mail = EmailSender(config.smtp_hostname, config.smtp_port, config.smtp_username, config.smtp_password, mail = EmailSender(config.smtp_hostname, config.smtp_port, config.smtp_username, config.smtp_password,
config.smtp_from) config.smtp_from)
hedgedoc_expire = HedgedocExpire(config, mail)
with pgsql.Connection((config.postgres_hostname, config.postgres_port), config.postgres_username, if args.command == 'check':
config.postgres_password) as db: hedgedoc_expire.cmd_check()
if args.check: elif args.command == 'cron':
check_revisions_to_be_expired(revisions_delta, config) while True:
check_notes_to_be_expired(notes_delta, config) next_expire = datetime.now().replace(hour=2, minute=0, second=0, microsecond=0) + timedelta(days=1)
sys.exit(0) if args.verbose:
expire_old_revisions(revisions_delta) print(f'Next expire execution: {next_expire}')
expire_old_notes(notes_delta, config, mail) seconds = (next_expire - datetime.now()).total_seconds()
if seconds > 0:
sleep(seconds)
hedgedoc_expire.cmd_expire()
elif args.command == 'expire':
hedgedoc_expire.cmd_expire()
else:
parser.print_help()
if __name__ == '__main__':
main()