I'm a rabid Linux fan. I write books about it, I have servers running it, and I even have various flavors of Linux as dual-boot defaults on my PCs. But keeping up with Linux news can take a bit of effort, particularly if I want that up-to-date news on a Web page rather than in an RSS aggregator.
Fortunately, it's a matter of ten minutes of shell script programming to remedy this. In this article, I'll show you step-by-step exactly how to create a cron job that'll automatically create an HTML file that contains the latest headlines from LinuxWorld.com. Just don't tell their Webmaster! :-)
Getting to the Right Page
Like many sites, LinuxWorld.com has "XML" buttons on its various category pages, so it takes only a few seconds to identify that http://www.linuxworld.com/topic_content/c_news.rss is the URL of the RSS feed for LinuxWorld.com's news.
Now to tap into that feed (RSS files are written in XML). I'll use the fast, simple curl program, which makes it very easy to get files from Web servers, FTP servers, and much more. It's well worth knowing if you want to script anything Internet-related, and you should find it on your Linux box too. For step one, here's a simple shell script I'll call get-linuxworld-news.sh:
#!/bin/sh
# Get the latest Linux news from LinuxWorld.com
url="http://www.linuxworld.com/topic_content/c_news.rss"
/usr/bin/curl --silent "$url"
That's it. When I run this script, feeding the output to head so as not to be overwhelmed, here's what I see:
$ sh get-linuxworld-news.sh | head
<?xml version="1.0" encoding="ISO-8859-1"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://purl.org/rss/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
><channel rdf:about="http://www.linuxworld.com/rss/default.rss">
<title>LinuxWorld: News</title>
<description>The latest articles from News @ LinuxWorld</description>
<link>http://www.linuxworld.com/</link>
<image rdf:resource="http://www.linuxworld.com/images/aa_logo.gif"/>
Lots of weird XML stuff, but a little more examination shows that the key XML fields we want are <title>, <link>, and <description>, so we'll slip in a grep call to look for just those:
$ sh get-linuxworld-news.sh | grep -E '(<title|<link|<desc)' | head -6
<title>LinuxWorld: News</title>
<description>The latest articles from News @ LinuxWorld</description>
<link>http://www.linuxworld.com/</link>
<title>Flash To be Ported to Linux?</title>
<link>http://www.linuxworld.com/story/43917.htm</link>
<description>Hard on the heels on the Linux version of Flash Player, the free player for displaying Flash content, and Linux flavors of several server applications, Macromedia is contemplating a more comprehensive move toward Linux, said its chief software architect this week - starting, he thinks, with Flash MX.</description>
Lots better. The problem now, though, is that we don't need the first three lines of the output, which describe the feed itself rather than a story. A quick call to sed solves this. You might not know it, but in addition to changing old to new, sed can also extract blocks of the input stream by line number or pattern. To see lines 4 through the end, for example, use sed -n '4,$p' as shown:
$ sh get-linuxworld-news.sh | grep -E '(<title|<link|<desc)' | sed -n '4,$p' | head -3
<title>Flash To be Ported to Linux?</title>
<link>http://www.linuxworld.com/story/43917.htm</link>
<description>Hard on the heels on the Linux version of Flash Player, the free player for displaying Flash content, and Linux flavors of several server applications, Macromedia is contemplating a more comprehensive move toward Linux, said its chief software architect this week - starting, he thinks, with Flash MX.</description>
Looks like we're getting somewhere, finally.
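If you'd like to experiment with the filter and the line-range trick without hitting the network, the same two commands can be run against a small stand-in feed. This is only a minimal sketch: the sample data and the function names filter_feed and sample_feed are made up for illustration and are not taken from LinuxWorld's real feed.

```shell
#!/bin/sh
# filter_feed: keep only title/link/description lines, then drop the
# first three (the channel's own header lines). Reads XML on stdin.
# (filter_feed is a hypothetical helper name used for this sketch.)
filter_feed() {
  grep -E '(<title|<link|<desc)' | sed -n '4,$p'
}

# Made-up sample feed: one channel header followed by one item.
sample_feed() {
  cat << "EOF"
<channel>
<title>Sample: News</title>
<description>Channel description</description>
<link>http://example.com/</link>
</channel>
<item>
<title>First story</title>
<link>http://example.com/1.htm</link>
<description>Summary of the first story.</description>
</item>
EOF
}

# Prints only the three lines that belong to the item.
sample_feed | filter_feed
```

Changing the 4 in '4,$p' to another number is a quick way to see how the starting line affects what survives.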
Changing the Order of Lines
The next step is to swap the first and second lines of each three-line sequence so that the link appears before the title. This sounds fairly daunting, but it turns out to be a perfect job for awk, a simple interpreted programming language that's been included with Unix since the late 1970s. You could certainly use Perl for this too, though if you were going to crack open Perl, you'd probably just write the entire script in it. But that wouldn't be anywhere near as interesting as a nice handy shell script, would it?
So here's another version of the script, but with the necessary awk syntax tucked in so we can change the order of lines in the output stream:
#!/bin/sh
# Get the latest Linux news from LinuxWorld.com
url="http://www.linuxworld.com/topic_content/c_news.rss"
temp="/tmp/$(basename $0).$$" ; trap "/bin/rm -f $temp" 0
cat << "EOF" > $temp
{ if (NR % 3 == 1) {
title=$0
} else if (NR % 3 == 2) {
link=$0
} else {
print link; print title ; print $0
}
}
EOF
/usr/bin/curl --silent "$url" | \
grep -E '(<title|<link|<desc)' | \
sed -n '4,$p' | \
awk -f $temp
This is really close to the final format, believe it or not. Here's the output, so you can see for yourself:
$ sh get-linuxworld-news.sh | head -3
<link>http://www.linuxworld.com/story/43917.htm</link>
<title>Flash To be Ported to Linux?</title>
<description>Hard on the heels on the Linux version of Flash Player, the free player for displaying Flash content, and Linux flavors of several server applications, Macromedia is contemplating a more comprehensive move toward Linux, said its chief software architect this week - starting, he thinks, with Flash MX.</description>
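For a program this short, awk can also take its script directly on the command line in single quotes, which avoids the temporary file. What follows is a sketch of that alternative, not the approach the script above uses. The function name reorder_lines and the sample lines are made up.

```shell
#!/bin/sh
# reorder_lines: for every group of three input lines (title, link,
# description), print link, then title, then description.
# (reorder_lines is a hypothetical name used for this sketch.)
reorder_lines() {
  awk '
    NR % 3 == 1 { title = $0; next }   # first line of a group: remember it
    NR % 3 == 2 { link = $0; next }    # second line: remember it too
    { print link; print title; print } # third line: emit in the new order
  '
}

printf '%s\n' \
  '<title>First story</title>' \
  '<link>http://example.com/1.htm</link>' \
  '<description>Summary of the first story.</description>' |
  reorder_lines
```

The temporary-file version has the advantage that the awk program can grow without any quoting headaches. The inline version keeps everything in one place.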
All that's left is to turn the XML tags into HTML tags, which can be done with sed in a much more traditional and typical application of the utility:
/usr/bin/curl --silent "$url" | \
grep -E '(<title|<link|<desc)' | \
sed -n '4,$p' | \
awk -f $temp | \
sed -e 's/<link>/<li><a href="/' -e 's/<\/link>/">/' \
-e 's/<title>//' -e 's/<\/title>/<\/a><br>/' \
-e 's/<description>//' -e 's/<\/description>/<\/li>/'
The result of this updated script is almost exactly what I'd like:
$ sh get-linuxworld-news.sh | head -3
<li><a href="http://www.linuxworld.com/story/43917.htm">
Flash To be Ported to Linux?</a><br>
Hard on the heels on the Linux version of Flash Player, the free player for displaying Flash content, and Linux flavors of several server applications, Macromedia is contemplating a more comprehensive move toward Linux, said its chief software architect this week - starting, he thinks, with Flash MX.</li>
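Because sed accepts almost any character as the delimiter of an s command, the same substitutions can be written with | in place of /, which removes the need to escape the slashes in the closing tags. Here's a sketch of that variant, with a made-up function name (tags_to_html) and made-up sample input:

```shell
#!/bin/sh
# tags_to_html: the same six substitutions, written with "|" as the
# s-command delimiter so "/" needs no backslash.
# (tags_to_html is a hypothetical name used for this sketch.)
tags_to_html() {
  sed -e 's|<link>|<li><a href="|'  -e 's|</link>|">|' \
      -e 's|<title>||'              -e 's|</title>|</a><br>|' \
      -e 's|<description>||'        -e 's|</description>|</li>|'
}

printf '%s\n' \
  '<link>http://example.com/1.htm</link>' \
  '<title>First story</title>' \
  '<description>Summary of the first story.</description>' |
  tags_to_html
```

Either spelling produces the same HTML, so which one to use is purely a matter of readability.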
The only thing still missing is a <ul> at the top and a </ul> at the bottom, which is easily done with two additional echo statements. Put it all together and here's the final script:
#!/bin/sh
# Get the latest Linux news from LinuxWorld.com
url="http://www.linuxworld.com/topic_content/c_news.rss"
temp="/tmp/$(basename $0).$$" ; trap "/bin/rm -f $temp" 0
cat << "EOF" > $temp
{ if (NR % 3 == 1) {
title=$0
} else if (NR % 3 == 2) {
link=$0
} else {
print link; print title ; print $0
}
}
EOF
echo "<ul>" # assuming you want a bullet list
/usr/bin/curl --silent "$url" | \
grep -E '(<title|<link|<desc)' | \
sed -n '4,$p' | \
awk -f $temp | \
sed -e 's/<link>/<li><a href="/' -e 's/<\/link>/">/' \
-e 's/<title>//' -e 's/<\/title>/<\/a><br>/' \
-e 's/<description>//' -e 's/<\/description>/<\/li>/'
echo "</ul>"
exit 0
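One caveat: the awk step relies on the filtered lines arriving in strict title, link, description groups. As a hedged sketch, a check like the following could be run on the filtered lines to flag a feed that doesn't fit that pattern. The function name check_triplets is made up for this illustration and is not part of the script above.

```shell
#!/bin/sh
# check_triplets: read filtered feed lines on stdin and succeed only if
# they form complete title / link / description groups, in that order.
# (check_triplets is a hypothetical helper, not part of the script above.)
check_triplets() {
  awk '
    NR % 3 == 1 && !/<title/ { bad = 1 }   # 1st line of a group
    NR % 3 == 2 && !/<link/  { bad = 1 }   # 2nd line of a group
    NR % 3 == 0 && !/<desc/  { bad = 1 }   # 3rd line of a group
    END { exit (bad || NR % 3 != 0) }      # also reject a partial group
  '
}

# A well-formed group passes; a group with no <link> line does not.
printf '%s\n' '<title>A</title>' '<link>http://example.com/a</link>' \
  '<description>About A.</description>' | check_triplets &&
  echo "regular feed: ok"
printf '%s\n' '<title>B</title>' '<description>About B.</description>' |
  check_triplets || echo "irregular feed: rejected"
```

A check like this won't repair an irregular feed, but it does make the failure visible instead of silently pairing the wrong link with the wrong headline.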
Adding the Headlines to Your Web Page
Creating an HTML fragment with this script is straightforward:
$ sh get-linuxworld-news.sh > headlines.html
To include that fragment in a Web page, use server-side includes (SSI), which would look something like this:
<!--#include virtual="headlines.html"-->
and every time that page is served up to a visitor, they'll see the contents of the headlines.html file.
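How to keep the headlines up-to-date? Put the get-linuxworld-news.sh invocation into a cron job. This entry rebuilds the HTML output file twice a day, at 6:09 a.m. and 6:09 p.m. (in a real crontab, use full paths for both the script and the output file, since cron doesn't run from your Web directory):
9 6,18 * * * get-linuxworld-news.sh > headlines.html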
That's it. Not too bad, was it?
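Since the file is rebuilt while visitors may be loading the page, one common precaution is to write the new list to a temporary file and move it into place only once it's complete. The sketch below assumes that approach. The function name update_headlines and its arguments are made up for illustration, and the generator command is passed in as an argument so the idea can be tried without network access.

```shell
#!/bin/sh
# update_headlines OUTFILE COMMAND [ARGS...]
# Runs COMMAND, capturing its output in a temporary file next to OUTFILE,
# and replaces OUTFILE only if COMMAND succeeded and produced output.
# (update_headlines is a hypothetical helper, not part of the article's script.)
update_headlines() {
  out=$1; shift
  tmp="$out.$$"
  if "$@" > "$tmp" && [ -s "$tmp" ]; then
    mv "$tmp" "$out"      # a rename within one directory swaps the file in one step
  else
    rm -f "$tmp"          # on failure, keep the previous headlines
    return 1
  fi
}

# Example (in a crontab you would use full paths):
# update_headlines headlines.html sh get-linuxworld-news.sh
```

Note that the final script as written always exits 0 and always prints the <ul> wrapper, so with it this helper mainly guards against serving a half-written file. To keep the old headlines when the download fails, the script itself would need to report that failure through its exit status.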
It's also worth noting that using shell scripts to parse and format XML has more applications than just a bullet list of headlines from this one site. For example, go to http://www.casino-bookstore.com/ and have a close look at the "Latest Gambling News" box: it's using an almost identical script to keep track of the gambling news XML feed from about.com.
Another example? Go to http://www.healthy-bookstore.com/ and look at the medicinenet news feed. Again, it's using curl and sed to turn the XML data into HTML data.