First public release of the python rewrite

Very late due to me being very busy + procrastinating
This commit is contained in:
Xerbo 2020-04-06 23:27:20 +01:00
parent 4f3c9eb6bb
commit 972dacb5bd
5 changed files with 198 additions and 276 deletions

27
.gitignore vendored
View file

@@ -1,22 +1,7 @@
# Pictures
*.jpg
*.png
*.bmp
*.gif
# Meta
*.meta
# Audio
*.mp3
*.wav
*.wmv
# Flash
*.swf
# Documents
*.txt
*.pdf
*.doc
*.docx
# Swap files
*.swp
# Cookies
# My cookies, committing them to a public repo would not be a good idea
cookies.txt
# Downloaded types
*.png
*.jpg
*.json

29
LICENSE
View file

@@ -1,28 +1,7 @@
Copyright (c) 2015, Sergey "Shnatsel" Davidoff
All rights reserved.
Copyright 2020 Xerbo (xerbo@protonmail.com)
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* Neither the name of furaffinity-dl nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

README.md
View file

@@ -1,38 +1,39 @@
This branch is the development version of furaffinity-dl rewritten in python.
# FurAffinity Downloader
**furaffinity-dl** is a bash script for batch downloading of galleries and favorites from furaffinity.net users.
**furaffinity-dl** is a python script for batch downloading of galleries (and scraps/favourites) from furaffinity.net users.
It was written for preservation of culture, to counter people nuking their galleries every once in a while.
Supports all known submission types: images, texts and audio.
## Requirements
Coreutils, bash and wget are the only dependencies. However if you want to embed metadata into files you will need eyed3 and exiftool
The exact requirements are not pinned down yet since this is still early in development; you should only need beautifulsoup4 installed, though. I will put a `requirements.txt` in the repo soon
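Until `requirements.txt` exists, a minimal pre-flight check is sketched below. It is hypothetical (not part of the script) and only assumes the two third-party imports the rewrite actually makes (`bs4` from beautifulsoup4, plus `requests`); the pip package names are inferred from those imports.

```python
# Hypothetical dependency check, not part of furaffinity-dl.py itself.
# The script's only third-party imports are bs4 (beautifulsoup4) and requests.
try:
    import bs4       # noqa: F401
    import requests  # noqa: F401
except ImportError as err:
    raise SystemExit('Missing dependency "{}"; try: pip install beautifulsoup4 requests'.format(err.name))
```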
furaffinity-dl was tested only on Linux. It should also work on Mac and BSDs.
Windows users can get it to work via Microsoft's [WSL](https://docs.microsoft.com/en-us/windows/wsl/install-win10). Cygwin is not supported.
furaffinity-dl was tested only on Linux. It should also work on Mac, Windows and any other platform that supports python.
## Usage
Make it executable with
`chmod +x furaffinity-dl`
And then run it with
`./furaffinity-dl section/username`
Run it with
`./furaffinity-dl.py category username`
or:
`python3 furaffinity-dl.py category username`
All files from the given section and user will be downloaded to the current directory.
### Examples
`./furaffinity-dl gallery/mylafox`
`python3 fadl.py gallery koul`
`./furaffinity-dl -o mylasArt gallery/mylafox`
`python3 fadl.py -o koulsArt gallery koul`
`./furaffinity-dl -o koulsFavs favorites/koul`
`python3 fadl.py -o mylasFavs favorites mylafox`
For a full list of command line arguemnts use `./furaffinity-dl -h`.
For a full list of command line arguments use `./furaffinity-dl -h`.
You can also log in to download restricted content. To do that, log in to FurAffinity in your web browser, export cookies to a file from your web browser in Netscape format (there are extensions to do that [for Firefox](https://addons.mozilla.org/en-US/firefox/addon/ganbo/) and [for Chrome/Vivaldi](https://chrome.google.com/webstore/detail/cookiestxt/njabckikapfpffapmjgojcnbfjonfjfg)) and pass them to the script as a second parameter, like this:
`./furaffinity-dl -c /path/to/your/cookies.txt gallery/gonnaneedabiggerboat`
`python3 fadl.py -c cookies.txt gallery letodoesartt`
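Under the hood the script reads that file with Python's `http.cookiejar`; a trimmed sketch of the same cookie handling found in `furaffinity-dl.py` (the `cookies.txt` path is whatever you pass with `-c`):

```python
import http.cookiejar as cookielib
import requests

# Load a Netscape-format cookies.txt and attach it to the requests session,
# the same way furaffinity-dl.py does when -c is given.
session = requests.Session()
cookies = cookielib.MozillaCookieJar('cookies.txt')
cookies.load()
session.cookies = cookies
```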
## TODO
* Download user bio, post tags and ideally user comments
* Download user information.
## Disclaimer
It is your own responsibility to check whether batch downloading is allowed by FurAffinity's terms of service and to abide by them. For further disclaimers see LICENSE.

furaffinity-dl
View file

@@ -1,216 +0,0 @@
#!/bin/bash
# shellcheck disable=SC2001
set -e
# Default options
outdir="."
prefix="https:"
metadata=true
rename=true
maxsavefiles="0"
overwrite=false
textmeta=false
classic=false
# Helper functions
help() {
echo "Usage: $0 [ARGUMENTS] SECTION/USER
Downloads the entire gallery/scraps/favorites of any furaffinity user.
Arguments:
-h (H)elp screen
-i Use an (I)nsecure connection when downloading
-o The (O)utput directory to put files in
-c If you need to download restricted content
you can provide a path to a (C)ookie file
-p (P)lain file without any additional metadata
-r Don't (R)ename files, just give them the same
filename as on facdn
-n (N)umber of images to download, starting from
the most recent submission
-w Over(Write) files if they already exist
-s (S)eparate metadata files, to make sure all
metadata is downloaded regardless of file
-t Not using the \"beta\" (T)heme
Examples:
$0 gallery/mylafox
$0 -o mylasArt gallery/mylafox
$0 -o koulsFavs favorites/koul
You can also log in to FurAffinity to download restricted content, like this:
$0 -c /path/to/your/cookies.txt gallery/gonnaneedabiggerboat
DISCLAIMER: It is your own responsibility to check whether batch downloading is allowed by FurAffinity terms of service and to abide by them."
exit 1
}
# Display help if no arguments given
[[ $# -eq 0 ]] && help
# Options via arguments
while getopts 'o:c:n:iphrwst' flag; do
case "${flag}" in
t) classic=true;;
w) overwrite=true;;
o) outdir=${OPTARG};;
c) cookiefile=${OPTARG};;
i) prefix="http:";;
p) metadata=false;;
r) rename=false;;
n) maxsavefiles=${OPTARG};;
h) help;;
s) textmeta=true;;
*) help;;
esac
done
# Detect installed metadata injectors
eyed3=true
if [ -z "$(command -v eyeD3)" ]; then
eyed3=false
echo "INFO: eyed3 is not installed, no metadata will be injected into music files."
fi
exiftool=true
if [ -z "$(command -v exiftool)" ]; then
exiftool=false
echo "INFO: exiftool is not installed, no metadata will be injected into pictures."
fi
cleanup() {
rm -f "$tempfile"
}
# Attempt to create the output directory
mkdir -p -- "$outdir"
# Set up temporary file with 600 perms
tempfile="$(umask u=rwx,g=,o= && mktemp --suffix=_fa-dl)"
# Call cleanup function on exit
trap cleanup EXIT
if [ -z "$cookiefile" ]; then
# Set wget with a custom user agent
fwget() {
wget --quiet --user-agent="Mozilla/5.0 furaffinity-dl (https://github.com/Xerbo/furaffinity-dl)" "$@"
}
else
# Set wget with a custom user agent and cookies
fwget() {
wget --quiet --user-agent="Mozilla/5.0 furaffinity-dl (https://github.com/Xerbo/furaffinity-dl)" --load-cookies "$cookiefile" "$@"
}
fi
url="https://www.furaffinity.net/${*: -1}"
download_count="0"
# Iterate over the gallery pages with thumbnails and links to artwork view pages
while true; do
fwget "$url" -O "$tempfile"
if [ -n "$cookiefile" ] && grep -q 'furaffinity.net/login/' "$tempfile"; then
echo "ERROR: You have provided a cookies file, but it does not contain valid cookies.
If this file used to work, this means that the cookies have expired;
you will have to log in to FurAffinity from your web browser and export the cookies again.
If this is the first time you're trying to use cookies, make sure you have exported them
in Netscape format (this is normally done through \"cookie export\" browser extensions)
and supplied the correct path to the cookies.txt file to this script.
If that doesn't resolve the issue, please report the problem at
https://github.com/Xerbo/furaffinity-dl/issues" >&2
exit 1
fi
# Get URL for next page out of "Next" button. Required for favorites, pages of which are not numbered
if [ $classic = true ]; then
next_page_url="$(grep '<a class="button-link right" href="' "$tempfile" | grep '">Next &nbsp;&#x276f;&#x276f;</a>' | cut -d '"' -f 4 | sort -u)"
else
next_page_url="$(grep -B 1 --max-count=1 'type="submit">Next' "$tempfile" | grep form | cut -d '"' -f 2)"
fi
# Extract links to pages with individual artworks and iterate over them
artwork_pages="$(grep '<a href="/view/' "$tempfile" | grep -E --only-matching '/view/[[:digit:]]+/' | uniq)"
for page in $artwork_pages; do
# Download the submission page
fwget -O "$tempfile" "https://www.furaffinity.net$page"
if grep -q "System Message" "$tempfile"; then
echo "WARNING: $page seems to be inaccessible, skipping."
continue
fi
# Get the full size image URL.
# This will be a facdn.net link, we will default to HTTPS
# but this can be disabled with -i or --http for specific reasons
image_url="$prefix$(grep --only-matching --max-count=1 ' href="//d.facdn.net/art/.\+">Download' "$tempfile" | cut -d '"' -f 2)"
# Get metadata
description="$(grep 'og:description" content="' "$tempfile" | cut -d '"' -f 4)"
if [ $classic = true ]; then
title="$(grep -Eo '<h2>.*</h2>' "$tempfile" | awk -F "<h2>" '{print $2}' | awk -F "</h2>" '{print $1}')"
else
title="$(grep -Eo '<h2><p>.*</p></h2>' "$tempfile" | awk -F "<p>" '{print $2}' | awk -F "</p>" '{print $1}')"
fi
file_type="${image_url##*.}"
file_name="$(echo "$image_url" | cut -d "/" -f 7)"
if [[ "$file_name" =~ ^[0-9]{0,12}$ ]]; then
file_name="$(echo "$image_url" | cut -d "/" -f 8)"
fi
# Choose the output path
if [ $rename = true ]; then
# FIXME titles that are just a single emoji get changed to " " and overwrite each other
file="$outdir/$(echo "$title" | sed -e 's/[^A-Za-z0-9._-]/ /g').$file_type"
else
file="$outdir/$file_name"
fi
# Download the image
if [ ! -f "$file" ] || [ $overwrite = true ] ; then
wget --quiet --show-progress "$image_url" -O "$file"
else
echo "File already exists, skipping. Use -w to skip this check"
fi
mime_type="$(file -- "$file")"
if [ $textmeta = true ]; then
echo -ne "Title: $title\nURL: $page\nFilename: $file_name\nDescription: $description" > "$file.meta"
fi
# Add metadata
if [[ $mime_type == *"audio"* ]]; then
# Use eyeD3 for injecting metadata into audio files (if it's installed)
if [ $eyed3 = true ] && [ $metadata = true ]; then
if [ -z "$description" ]; then
eyeD3 -t "$title" -- "$file" || true
else
# HACK: eyeD3 throws an error if a description contains a ":"
eyeD3 -t "$title" --add-comment "${description//:/\\:}" -- "$file" || true
fi
fi
elif [[ $mime_type == *"image"* ]]; then
# Use exiftool for injecting metadata into pictures (if it's installed)
if [ $exiftool = true ] && [ $metadata = true ]; then
cat -- "$file" | exiftool -description="$description" -title="$title" -overwrite_original - > "$tempfile" && mv -- "$tempfile" "$file" || true
fi
fi
# If there is a file download limit then keep track of it
if [ "$maxsavefiles" -ne "0" ]; then
download_count="$((download_count + 1))"
if [ "$download_count" -ge "$maxsavefiles" ]; then
echo "Reached set file download limit."
exit 0
fi
fi
done
[ -z "$next_page_url" ] && break
url='https://www.furaffinity.net'"$next_page_url"
done

173
furaffinity-dl.py Executable file
View file

@@ -0,0 +1,173 @@
#!/usr/bin/python3
import argparse
from argparse import RawTextHelpFormatter
import json
from bs4 import BeautifulSoup
import requests
import urllib.request
import http.cookiejar as cookielib
import urllib.parse
import re
import os
'''
Please refer to LICENSE for licensing conditions.
current ideas / things to do:
-r replenish, keep downloading until it finds an already downloaded file
-n number of posts to download
file renaming to title
metadata injection (gets messy easily)
sqlite database
support for beta theme
using `requests` instead of `urllib`
turn this into a module
'''
# Argument parsing
parser = argparse.ArgumentParser(formatter_class=RawTextHelpFormatter, description='Downloads the entire gallery/scraps/favorites of a furaffinity user', epilog='''
Examples:
python3 fadl.py gallery koul
python3 fadl.py -o koulsArt gallery koul
python3 fadl.py -o mylasFavs favorites mylafox\n
You can also log in to FurAffinity in a web browser and load cookies to download restricted content:
python3 fadl.py -c cookies.txt gallery letodoesart\n
DISCLAIMER: It is your own responsibility to check whether batch downloading is allowed by FurAffinity terms of service and to abide by them.
''')
parser.add_argument('category', metavar='category', type=str, nargs='?', default='gallery',
help='the category to download, gallery/scraps/favorites')
parser.add_argument('username', metavar='username', type=str, nargs='?',
help='username of the furaffinity user')
parser.add_argument('-o', metavar='output', dest='output', type=str, default='.', help="output directory")
parser.add_argument('-c', metavar='cookies', dest='cookies', type=str, default='', help="path to a Netscape cookies file")
parser.add_argument('-s', metavar='start', dest='start', type=int, default=1, help="page number to start from")
args = parser.parse_args()
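# Example: `python3 furaffinity-dl.py -o koulsArt gallery koul` parses to
# args.category == 'gallery', args.username == 'koul', args.output == 'koulsArt',
# with args.cookies == '' and args.start == 1 left at their defaults.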
if args.username is None:
    parser.print_help()
    exit()
# Create output directory if it doesn't exist
if args.output != '.':
    os.makedirs(args.output, exist_ok=True)

# Check validity of category
valid_categories = ['gallery', 'favorites', 'scraps']
if args.category not in valid_categories:
    raise Exception('Category is not valid', args.category)

# Check validity of username
if re.compile(r'[^a-zA-Z0-9\-~._]').search(args.username):
    raise Exception('Username contains non-valid characters', args.username)
# Initialise a session
session = requests.Session()
session.headers.update({'User-Agent': 'furaffinity-dl redevelopment'})

# Load cookies from a Netscape cookie file (if provided)
if args.cookies != '':
    cookies = cookielib.MozillaCookieJar(args.cookies)
    cookies.load()
    session.cookies = cookies
base_url = 'https://www.furaffinity.net'
# Build the listing URL from the requested category (gallery/scraps/favorites)
gallery_url = '{}/{}/{}'.format(base_url, args.category, args.username)
page_num = args.start
# The cursed function that handles downloading
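# download_file() takes a submission path (e.g. '/view/12345678/'), scrapes the
# submission page for metadata (author, date, tags, comments, ...), writes it to a
# <filename>.json next to the download, then fetches the full-resolution file.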
def download_file(path):
    page_url = '{}{}'.format(base_url, path)
    response = session.get(page_url)
    s = BeautifulSoup(response.text, 'html.parser')

    image = s.find(class_='download').find('a').attrs.get('href')
    title = s.find(class_='submission-title').find('p').contents[0]
    filename = image.split("/")[-1:][0]

    data = {
        'id': int(path.split('/')[-2:-1][0]),
        'filename': filename,
        'author': s.find(class_='submission-id-sub-container').find('a').find('strong').text,
        'date': s.find(class_='popup_date').attrs.get('title'),
        'title': title,
        'description': s.find(class_='submission-description').text.strip().replace('\r\n', '\n'),
        'tags': [],
        'views': int(s.find(class_='views').find(class_='font-large').text),
        'favorites': int(s.find(class_='favorites').find(class_='font-large').text),
        'rating': s.find(class_='rating-box').text.strip(),
        'comments': []
    }

    # Extract tags
    for tag in s.find(class_='tags-row').findAll(class_='tags'):
        data['tags'].append(tag.find('a').text)

    # Extract comments
    for comment in s.findAll(class_='comment_container'):
        temp_ele = comment.find(class_='comment-parent')
        parent_cid = None if temp_ele is None else int(temp_ele.attrs.get('href')[5:])

        # Comment deleted or hidden
        if comment.find(class_='comment-link') is None:
            continue

        data['comments'].append({
            'cid': int(comment.find(class_='comment-link').attrs.get('href')[5:]),
            'parent_cid': parent_cid,
            'content': comment.find(class_='comment_text').contents[0].strip(),
            'username': comment.find(class_='comment_username').text,
            'date': comment.find(class_='popup_date').attrs.get('title')
        })

    # Write a UTF-8 encoded JSON file for metadata
    with open(os.path.join(args.output, '{}.json'.format(filename)), 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

    print('Downloading "{}"...'.format(title))

    # Because for some god forsaken reason FA keeps the original filename in the upload, if it
    # contains non-ASCII characters it can make this thing blow up. So we have to do some
    # annoying IRI stuff to make it work. Maybe consider `requests` instead of urllib.
    def strip_non_ascii(s): return ''.join(i for i in s if ord(i) < 128)

    url = 'https:{}'.format(image)
    url = urllib.parse.urlsplit(url)
    url = list(url)
    url[2] = urllib.parse.quote(url[2])
    url = urllib.parse.urlunsplit(url)
    urllib.request.urlretrieve(url, os.path.join(args.output, strip_non_ascii(filename)))
# Main downloading loop
while True:
    page_url = '{}/{}'.format(gallery_url, page_num)
    response = session.get(page_url)
    s = BeautifulSoup(response.text, 'html.parser')

    # Account status
    if page_num == 1:
        if s.find(class_='loggedin_user_avatar') is not None:
            account_username = s.find(class_='loggedin_user_avatar').attrs.get('alt')
            print('Logged in as', account_username)
        else:
            print('Not logged in, some users\' galleries may be inaccessible and NSFW content is not downloadable')

    # System messages
    if s.find(class_='notice-message') is not None:
        message = s.find(class_='notice-message').find('div')
        for ele in message:
            if ele.name is not None:
                ele.decompose()
        raise Exception('System Message', message.text.strip())

    # End of gallery
    if s.find(id='no-images') is not None:
        print('End of gallery')
        break

    # Download all images on the page
    for img in s.findAll('figure'):
        download_file(img.find('a').attrs.get('href'))

    page_num += 1
    print('Downloading page', page_num)

print('Finished downloading')