Gutenberg
About
Projekt Gutenberg-DE was founded in 1994, in the early days of the Internet. We wanted to use this »Neuland« (new territory) right away to set something more lasting alongside the often fleeting and shallow content, and to contribute, as far as we are able, to promoting and strengthening German culture and language. With Gutenberg-DE we offer the world's largest German-language full-text literature collection free of charge to everyone: to pupils, teachers and students, to people who want to learn German, and to those who simply enjoy reading.
MANY THANKS TO PROJEKT GUTENBERG!
Import Projekt Gutenberg in Calibre
Importing Projekt Gutenberg-DE into Calibre is tricky!
A nav-bar header is included in every HTML file. During the import, Calibre parses the document structure, finds the nav-bar and tries to import all the authors, too. Roughly 10,000 × 250 MiB ≈ 2.4 TiB easily exceeds the RAM capacity, and the out-of-memory (OOM) killer reaps the import processes.
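The estimate above can be reproduced and compared with the installed memory. This is only a sketch: `estimate_tib` is a made-up helper name, and the inputs 10000 and 250 MiB are simply the figures quoted above, which work out to a little under 2.5 TiB.

```shell
#!/bin/bash
# Convert "count x MiB each" into TiB (1 TiB = 1024 * 1024 MiB).
# estimate_tib is a hypothetical helper, not part of any tool.
estimate_tib() {
    LC_ALL=C awk -v n="$1" -v mib="$2" \
        'BEGIN { printf "%.2f\n", n * mib / (1024 * 1024) }'
}

echo "Estimated demand: $(estimate_tib 10000 250) TiB"

# Installed RAM for comparison (Linux only; MemTotal is given in KiB).
if [ -r /proc/meminfo ]; then
    LC_ALL=C awk '/^MemTotal:/ { printf "Installed RAM: %.1f GiB\n", $2 / (1024 * 1024) }' /proc/meminfo
fi
```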
What we need is to ensure that no document contains the nav-bar. We simply remove the <div class="navi-gb-ed15"> … </div> from every HTML file, using an XSL transformation and a wrapper script.
Preparation
- Create a copy of the USB medium on disk and unplug the USB medium (working on the copy is faster and you won't destroy the stick).
- Fix the Unix permissions in the clone.
Optionally, mount a tmpfs on /tmp.
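The preparation steps above might look like the following sketch. The helper name `clone_medium`, the example paths and the tmpfs size are assumptions for illustration; adjust them to your system.

```shell
#!/bin/bash
# clone_medium SRC DST: copy a mounted medium to disk and make the copy
# readable and writable for the current user (hypothetical helper).
clone_medium() {
    local src="$1" dst="$2"
    mkdir -p "$dst" || return 1
    # -a keeps timestamps and structure; the trailing /. copies the
    # contents of SRC rather than the SRC directory itself.
    cp -a "$src"/. "$dst"/ || return 1
    # Media are often read-only: give the owner read/write access and
    # keep directories traversable (capital X).
    chmod -R u+rwX "$dst"
}

# Example (both paths are assumptions):
#   clone_medium "/media/$USER/GUTENBERG" "$HOME/gutenberg_clone"
#
# Optional tmpfs on /tmp (needs root, hides whatever is currently in /tmp,
# and the size is a placeholder):
#   sudo mount -t tmpfs -o size=16G tmpfs /tmp
```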
remove_div.xsl
XSL transformation that removes the div containing the navigation
~/workspace/gutenberg/remove_div.xsl
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="yes" encoding="ISO-8859-1"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="div[@class='navi-gb-ed15']"/>
</xsl:stylesheet>
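Before running the stylesheet over the whole collection, it may be worth trying it on a single throwaway file. The sketch below assumes `xsltproc` is installed; the function name `check_stylesheet` and the sample markup are invented for illustration, only the class name `navi-gb-ed15` is taken from the text above.

```shell
#!/bin/bash
# check_stylesheet XSL: transform a small sample document and succeed only
# if the body text survived and the nav-bar div did not.
check_stylesheet() {
    local xsl="$1" tmp rc
    tmp="$(mktemp -d)" || return 1
    cat > "$tmp/sample.html" <<'EOF'
<html><body>
<div class="navi-gb-ed15"><a href="index.html">Navigation</a></div>
<p>Kapitel 1</p>
</body></html>
EOF
    # --html makes xsltproc parse the input as HTML rather than XML.
    xsltproc --html "$xsl" "$tmp/sample.html" > "$tmp/out.html" 2>/dev/null
    grep -q 'Kapitel 1' "$tmp/out.html" && ! grep -q 'navi-gb-ed15' "$tmp/out.html"
    rc=$?
    rm -rf "$tmp"
    return "$rc"
}

# Example:
#   check_stylesheet ~/workspace/gutenberg/remove_div.xsl && echo "stylesheet OK"
```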
A script that
- removes the navigation header from each HTML file using the XSL transformation above
- selects the directories and files to be imported
gutenberg_prepare.sh
~/workspace/gutenberg/gutenberg_prepare.sh
#!/bin/bash

CPUS="$(nproc)"
SELF="$(basename "$0")"

# Note that we use "$@" to let each command-line parameter expand to a
# separate word. The quotes around "$@" are essential!
# We need TEMP as the 'eval set --' would nuke the return value of getopt.
TEMP=$(getopt \
    -o 'c:d:s:x:' \
    --long 'cpus:,destination:,source:,xslt-file:' \
    -n "$SELF" -- "$@")

if [ $? -ne 0 ]; then
    echo 'Error during option parsing. Terminating...' >&2
    exit 1
fi

# Note the quotes around "$TEMP": they are essential!
eval set -- "$TEMP"
unset TEMP

while true; do
    case "$1" in
        '-c'|'--cpus')
            CPUS="$2"
            shift 2
            continue
            ;;
        '-d'|'--destination')
            DIR_DST="$2"
            shift 2
            continue
            ;;
        '-s'|'--source')
            DIR_SRC="$2"
            shift 2
            continue
            ;;
        # Must match the long option declared for getopt above.
        '-x'|'--xslt-file')
            XSLT="$2"
            shift 2
            continue
            ;;
        '--')
            shift
            break
            ;;
        *)
            echo 'Unknown option. Terminating …' >&2
            exit 1
            ;;
    esac
done

### SANITY CHECKS
if [ ! -d "$DIR_SRC" ]; then
    echo "Source must be an existing directory." >&2
    exit 1
fi

if [ -z "$DIR_DST" ]; then
    echo "Destination may not be null." >&2
    exit 1
fi

if [ ! -f "$XSLT" ]; then
    echo "Please specify an XSLT file." >&2
    exit 1
fi

### MAIN
# Resolve paths before changing directory, so that relative arguments
# still point to the intended places afterwards.
DIR_DST="$(realpath -m "$DIR_DST")"
XSLT="$(realpath "$XSLT")"
cd "$DIR_SRC" || exit 1

# Directories (at most two levels deep) that contain HTML files.
DIRS_HTML="$(find . \
    -mindepth 2 \
    -regextype posix-extended \
    -regex '\./.*\.html?' \
    | cut -d/ -f -3 \
    | sort -u \
    | sed 's#^\./##')"

# Directories (at most two levels deep) that contain PDF files.
DIRS_PDF="$(find . \
    -mindepth 2 \
    -regextype posix-extended \
    -regex '\./.*\.pdf' \
    | cut -d/ -f -3 \
    | sort -u \
    | sed 's#^\./##')"

SPECIAL=( 'css' 'js' 'bin' )
BLOCKLIST=( 'autoren' 'info' 'hirschbe' )

# Build grep arguments of the form: -e 'css$' -e 'js$' ...
declare -a FILTER=()
for NAME in "${SPECIAL[@]}" "${BLOCKLIST[@]}"; do
    FILTER+=( -e "${NAME}\$" )
done

# Quote the here-strings so that the newlines between entries survive.
DIRS_HTML_FILTERED="$(grep -v "${FILTER[@]}" <<< "$DIRS_HTML")"
DIRS_PDF_FILTERED="$(grep -v "${FILTER[@]}" <<< "$DIRS_PDF")"

[ -d "$DIR_DST" ] || mkdir -vp "$DIR_DST"

### COPY SPECIALS
echo "Copying special directories"
rsync -aP "${SPECIAL[@]}" "$DIR_DST"

### PROCESS ALL FILES OF AN AUTHOR IN PARALLEL
# Directory names are assumed to contain no whitespace.
for DIR in ${DIRS_HTML_FILTERED}; do
    echo -e "\nProcessing '$DIR'"
    [ -d "$DIR_DST/$DIR" ] || mkdir -vp "$DIR_DST/$DIR"
    echo "Copying contained files."
    # We are already inside $DIR_SRC, so $DIR is a valid relative path.
    rsync -qa "$DIR/" "$DIR_DST/$DIR"
    echo "Removing nav-bars."
    FILES_HTML="$(find "$DIR" \
        -regextype posix-extended \
        -regex '.*\.html?')"
    # One xsltproc per file, up to $CPUS in parallel; the output overwrites
    # the copy in the destination.
    <<< "$FILES_HTML" xargs -P "$CPUS" -I {} -- \
        xsltproc --html \
            --encoding ISO-8859-1 \
            -o "$DIR_DST"/"{}" \
            "$XSLT" "{}" \
        2>/dev/null
done

echo "Copying EPUBs/PDFs"
if [ -n "$DIRS_PDF_FILTERED" ]; then
    # The parentheses make -exec apply to both patterns; --parents (GNU cp)
    # recreates the relative directory below the destination.
    find $DIRS_PDF_FILTERED \
        \( -name '*.pdf' -o -name '*.epub' \) \
        -exec cp -v --parents {} "$DIR_DST" \;
fi

echo "done"
Call it like this (runs ~6min on my machine)
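A possible invocation, followed by a quick check that no nav-bar is left in the output, is sketched below. The source path `$HOME/gutenberg_clone` is an assumption, as is the helper name `count_navbar_files`; `/tmp/gutenberg_tmp` and the workspace path are the ones used elsewhere in this document. Short options are used throughout.

```shell
#!/bin/bash
SRC="$HOME/gutenberg_clone"        # assumed location of the on-disk clone
DST="/tmp/gutenberg_tmp"           # output directory, imported later on
WORK="$HOME/workspace/gutenberg"   # where the script and stylesheet live

# Count HTML files below a directory that still contain the nav-bar div.
count_navbar_files() {
    grep -rl --include='*.html' --include='*.htm' \
        'class="navi-gb-ed15"' "$1" 2>/dev/null | wc -l
}

# Run only if the script and the clone are actually present.
if [ -x "$WORK/gutenberg_prepare.sh" ] && [ -d "$SRC" ]; then
    time "$WORK/gutenberg_prepare.sh" \
        -s "$SRC" \
        -d "$DST" \
        -x "$WORK/remove_div.xsl" \
        -c "$(nproc)"
    echo "HTML files still containing a nav-bar: $(count_navbar_files "$DST")"
fi
```

If the count is not zero, the affected files can be listed by running the same `grep -rl` without the `wc -l`.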
Setup Calibre
Create a new empty library in /tmp/gutenberg, which is located in RAM. This speeds up the import.
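Whether /tmp really lives in RAM depends on the system, so it can help to check the filesystem type first. A minimal sketch, assuming GNU coreutils `df`; the helper name `fs_type` is made up for illustration.

```shell
#!/bin/bash
# Print the type of the filesystem that backs a path (GNU df).
fs_type() {
    df --output=fstype "$1" | awk 'NR == 2 { print $1 }'
}

if [ "$(fs_type /tmp)" = "tmpfs" ]; then
    echo "/tmp is a tmpfs, so a library created there is kept in RAM."
else
    echo "/tmp is on $(fs_type /tmp); a library there is not kept in RAM."
fi
```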
Options > Extended > Misc
- Disable the limitation to the max. number of CPUs
Set the number to CPUs × 1.5 (48 in my case)
Options > Import/Export > Add Books > Read Metadata
- Read metadata from the file contents instead of the file name
Options > Import/Export > Add Books > Add by action (tab)
- Disable "Automatically convert added books to the preferred output format"
- Enable "Mark added books"
- Restart Calibre
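The job limit of CPUs × 1.5 mentioned above can be computed for the local machine as follows. `suggested_jobs` is a hypothetical helper name; a result of 48 corresponds to 32 logical CPUs.

```shell
#!/bin/bash
# Integer form of "CPUs * 1.5": multiply first, then divide, so that the
# half is not lost to integer division.
suggested_jobs() {
    echo $(( $1 * 3 / 2 ))
}

echo "CPUs: $(nproc), suggested job limit: $(suggested_jobs "$(nproc)")"
```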
Import
- Add from directories and sub-directories
Assume all e-books in a single directory are the same book in different formats? -> Yes
Choose the input directory that was produced by the shell script: /tmp/gutenberg_tmp/Edition15
Wait, and watch the progress with: sudo inotifywait -r --monitor -e create /tmp/gutenberg
- You probably do not want to import duplicates.
If you created the new library in /tmp:
- Remove the library in Calibre.
- Move the library to its final destination.
- Import the library into Calibre again.
- Convert all created ZIPs to EPUB.
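Moving the library out of /tmp can be done on the file system once it has been removed from Calibre. The sketch below copies first, compares file counts, and deletes the original only if they match. The helper name `move_library` and the destination path are assumptions.

```shell
#!/bin/bash
# move_library SRC DST: copy a library directory to its final place,
# verify the number of files, then remove the original.
move_library() {
    local src="$1" dst="$2" n_src n_dst
    [ -d "$src" ] || { echo "No such library: $src" >&2; return 1; }
    mkdir -p "$dst" || return 1
    cp -a "$src"/. "$dst"/ || return 1
    n_src="$(find "$src" -type f | wc -l)"
    n_dst="$(find "$dst" -type f | wc -l)"
    if [ "$n_src" -ne "$n_dst" ]; then
        echo "File count mismatch ($n_src vs $n_dst); keeping $src" >&2
        return 1
    fi
    rm -rf "$src"
}

# Example (the destination is an assumption):
#   move_library /tmp/gutenberg "$HOME/Calibre/gutenberg"
```

The destination should be empty or not yet exist; otherwise the file counts differ and the original is kept.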
Have fun reading on your ebook reader.
done.