The Worlds of RSS, XML, HTML, and Linux Meet @ JDJ
Author: unknown  Date: 2005-08-10 19:01  Source: Java channel  Editor: chinaitpower
I'm a rabid Linux fan. I write books about it, I have servers running it, and I even have various flavors of Linux as dual-boot defaults on my PCs. But keeping up with Linux news can be a bit of effort, particularly if I want to have that up-to-date news on a Web page, rather than in an RSS Aggregator.

Fortunately, it's a matter of ten minutes of shell script programming to remedy this. In this article, I'll show you step-by-step exactly how to create a cron job that'll automatically create an HTML file that contains the latest headlines from LinuxWorld.com. Just don't tell their Webmaster! :-)

Getting to the Right Page

Like many sites, LinuxWorld.com has "XML" buttons on its various category pages, so it takes only a few seconds to identify that http://www.linuxworld.com/topic_content/c_news.rss is the URL of the RSS feed for LinuxWorld.com's news.

Now to tap into that XML feed (RSS files are written in XML). I'll use the fast, simple curl program, which makes it very easy to get files from Web servers, FTP servers, and much more. It's well worth knowing if you want to script anything Internet-related, and it should already be on your Linux box. Step one is a simple shell script I'll call get-linuxworld-news.sh:


#!/bin/sh

# Get the latest Linux news from LinuxWorld.com

url="http://www.linuxworld.com/topic_content/c_news.rss"

/usr/bin/curl --silent "$url"

 

 

That's it. When I run this script, feeding the output to head so as not to be overwhelmed, here's what I see:

$ sh get-linuxworld-news.sh | head
<?xml version="1.0" encoding="ISO-8859-1"?>
<rdf:RDF  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
          xmlns="http://purl.org/rss/1.0/"
          xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
><channel rdf:about="http://www.linuxworld.com/rss/default.rss">
    <title>LinuxWorld: News</title>
    <description>The latest articles from News @ LinuxWorld</description>
    <link>http://www.linuxworld.com/</link>
    <image rdf:resource="http://www.linuxworld.com/images/aa_logo.gif"/>
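The fetch script above prints whatever curl returns, including nothing at all if the download fails. As a minimal sketch, assuming you would rather have the script stop on a failed download, curl's --fail, --show-error, and --max-time options can be combined with an exit-status check. The function name fetch_feed and the 30-second limit are illustrative choices of mine, not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper (not in the original article): fetch a URL and
# report failure through the exit status instead of printing nothing.
fetch_feed() {
  # --fail:        treat HTTP error responses (404, 500, ...) as failures
  # --show-error:  still print an error message despite --silent
  # --max-time 30: give up after 30 seconds (an arbitrary choice)
  curl --silent --show-error --fail --max-time 30 "$1"
}

# Example use inside the script (commented out because it needs network access):
# fetch_feed "$url" || { echo "could not fetch $url" >&2; exit 1; }
```

With a check like this in place, a cron run that cannot reach the server can exit early rather than overwrite the headlines with an empty list.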

 


The raw feed is full of XML markup, but a closer look shows that the key fields we want are <title>, <link>, and <description>, so we'll slip in a grep call to look for just those:

 

$ sh get-linuxworld-news.sh | grep -E '(<title|<link|<desc)' | head -6
    <title>LinuxWorld: News</title>
    <description>The latest articles from News @ LinuxWorld</description>
    <link>http://www.linuxworld.com/</link>
    <title>Flash To be Ported to Linux?</title>
    <link>http://www.linuxworld.com/story/43917.htm</link>
    <description>Hard on the heels on the Linux version of Flash Player, the free player for displaying Flash content, and Linux flavors of several server applications, Macromedia is contemplating a more comprehensive move toward Linux, said its chief software architect this week - starting, he thinks, with Flash MX.</description>
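One thing worth keeping in mind about this filter: grep selects whole lines, so the approach relies on the feed placing each element on its own line, as the LinuxWorld feed does. The sketch below illustrates that assumption with two made-up inputs; the function name filter is mine, chosen only for the demonstration.

```shell
#!/bin/sh
# grep works line by line, so the filter depends on one element per line.
# Both inputs below are invented sample data.
filter() {
  grep -E '(<title|<link|<desc)'
}

# One element per line: exactly the three wanted lines come out.
printf '<item>\n<title>T</title>\n<link>L</link>\n<description>D</description>\n</item>\n' | filter

# Everything on one line: the whole line matches and is printed unchanged,
# so later line-based stages would no longer see separate fields.
printf '<item><title>T</title><link>L</link></item>\n' | filter
```

If a feed is formatted the second way, a line-oriented pipeline like this one is probably the wrong tool and an XML-aware parser would be the safer choice.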

 

 

The filtered output is much easier to read. The remaining problem is that we don't need its first three lines, which describe the channel rather than a story. A quick call to sed solves this: you might not know it, but in addition to substituting one string for another, sed can also extract blocks of the input stream. To print line 4 through the end of the filtered output, for example, add sed -n '4,$p' after the grep as shown:

$ sh get-linuxworld-news.sh | grep -E '(<title|<link|<desc)' | sed -n '4,$p' | head -3
    <title>Flash To be Ported to Linux?</title>
    <link>http://www.linuxworld.com/story/43917.htm</link>
    <description>Hard on the heels on the Linux version of Flash Player, the free player for displaying Flash content, and Linux flavors of several server applications, Macromedia is contemplating a more comprehensive move toward Linux, said its chief software architect this week - starting, he thinks, with Flash MX.</description>
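To see the line-range form of sed in isolation, here is a minimal sketch on five short lines. The tail -n +4 command alongside it is an equivalent I'm adding for comparison; it is not used in the original script.

```shell
#!/bin/sh
# Print line 4 through the last line ($) of the input. The -n option
# suppresses sed's default behavior of echoing every line.
printf 'one\ntwo\nthree\nfour\nfive\n' | sed -n '4,$p'

# Same result with tail: start the output at line 4.
printf 'one\ntwo\nthree\nfour\nfive\n' | tail -n +4
```

Both commands print the lines "four" and "five". Either way, the number 4 is tied to the three channel lines at the top of this particular feed.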


Looks like we're getting somewhere, finally.


Changing the Order of Lines

The next step is to flip the first and second lines of each three-line sequence so that the link appears before the title. This sounds fairly daunting, but it turns out to be a perfect job for awk, a simple interpreted programming language that has shipped with Unix since the late 1970s. You could certainly use Perl for this too, though if you were going to crack open Perl, you'd probably just write the entire script in it. But that wouldn't be anywhere near as interesting as a nice handy shell script, would it?

So here's another version of the script, but with the necessary awk syntax tucked in so we can change the order of lines in the output stream:

#!/bin/sh

# Get the latest Linux news from LinuxWorld.com

url="http://www.linuxworld.com/topic_content/c_news.rss"
temp="/tmp/$(basename "$0").$$"; trap '/bin/rm -f "$temp"' 0

# awk program: reorder each title/link/description triple so the link comes first
cat << "EOF" > "$temp"
{ if (NR % 3 == 1) {
    title=$0
  } else if (NR % 3 == 2) {
    link=$0
  } else {
    print link; print title ; print $0
  }
}
EOF

/usr/bin/curl --silent "$url" | \
  grep -E '(<title|<link|<desc)' | \
  sed -n '4,$p' | \
  awk -f "$temp"

 

 

This is really close to the final format, believe it or not. Here's the output, so you can see for yourself:


$ sh get-linuxworld-news.sh | head -3
    <link>http://www.linuxworld.com/story/43917.htm</link>
    <title>Flash To be Ported to Linux?</title>
    <description>Hard on the heels on the Linux version of Flash Player, the free player for displaying Flash content, and Linux flavors of several server applications, Macromedia is contemplating a more comprehensive move toward Linux, said its chief software architect this week - starting, he thinks, with Flash MX.</description>
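Writing the awk program to a temporary file is one way to hand it to awk. As an alternative sketch, the same reordering logic can be passed inline in single quotes, which removes the need for the temp file and the trap. The function name reorder and the three sample lines are made up for illustration.

```shell
#!/bin/sh
# Reorder each title/link/description triple so the link comes first.
reorder() {
  awk '
    NR % 3 == 1 { title = $0 }                          # 1st line of a triple: title
    NR % 3 == 2 { link = $0 }                           # 2nd line: link
    NR % 3 == 0 { print link; print title; print $0 }   # 3rd line: emit all three
  '
}

printf '<title>T</title>\n<link>L</link>\n<description>D</description>\n' | reorder
```

The pattern-action form used here is equivalent to the if/else chain in the article's version; which one reads better is largely a matter of taste.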

 

All that's left is to turn the XML tags into HTML tags, which can be done with sed in a much more traditional and typical application of the utility:

/usr/bin/curl --silent "$url" | \
  grep -E '(<title|<link|<desc)' | \
  sed -n '4,$p' | \
  awk -f "$temp" | \
  sed -e 's/<link>/<li><a href="/' -e 's/<\/link>/">/' \
      -e 's/<title>//' -e 's/<\/title>/<\/a><br>/' \
      -e 's/<description>//' -e 's/<\/description>/<\/li>/'

 


The result of this updated script is almost exactly what I'd like:

$ sh get-linuxworld-news.sh | head -3
    <li><a href="http://www.linuxworld.com/story/43917.htm">
    Flash To be Ported to Linux?</a><br>
    Hard on the heels on the Linux version of Flash Player, the free player for displaying Flash content, and Linux flavors of several server applications, Macromedia is contemplating a more comprehensive move toward Linux, said its chief software architect this week - starting, he thinks, with Flash MX.</li>
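The \/ sequences in the sed call are needed because / is also the delimiter of the s command. sed accepts other delimiter characters, so, as a sketch of an equivalent formulation, the same substitutions can be written with | and no escaping. The function name to_html and the sample input are invented for the demonstration.

```shell
#!/bin/sh
# Convert reordered link/title/description lines into an HTML list item.
# Using | as the s-command delimiter avoids escaping the / in closing tags.
to_html() {
  sed -e 's|<link>|<li><a href="|' -e 's|</link>|">|' \
      -e 's|<title>||'             -e 's|</title>|</a><br>|' \
      -e 's|<description>||'       -e 's|</description>|</li>|'
}

printf '<link>http://example.com/1.htm</link>\n<title>Headline</title>\n<description>Summary.</description>\n' | to_html
```

This only works as long as the replacement text itself contains no | character, which holds for the tags used here.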

 


The only problem here is that we need to add a <ul> to the top and a </ul> to the bottom, which is easily done with two additional echo statements. Put it all together and here's the final script:

#!/bin/sh

# Get the latest Linux news from LinuxWorld.com

url="http://www.linuxworld.com/topic_content/c_news.rss"
temp="/tmp/$(basename "$0").$$"; trap '/bin/rm -f "$temp"' 0

# awk program: reorder each title/link/description triple so the link comes first
cat << "EOF" > "$temp"
{ if (NR % 3 == 1) {
    title=$0
  } else if (NR % 3 == 2) {
    link=$0
  } else {
    print link; print title ; print $0
  }
}
EOF

echo "<ul>"             # assuming you want a bullet list

/usr/bin/curl --silent "$url" | \
  grep -E '(<title|<link|<desc)' | \
  sed -n '4,$p' | \
  awk -f "$temp" | \
  sed -e 's/<link>/<li><a href="/' -e 's/<\/link>/">/' \
      -e 's/<title>//' -e 's/<\/title>/<\/a><br>/' \
      -e 's/<description>//' -e 's/<\/description>/<\/li>/'

echo "</ul>"

exit 0
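To try the whole pipeline without network access, the stages can be wrapped in a function that reads the feed from standard input. What follows is a sketch under that assumption: the sample feed, its example.com URLs, and the function names sample_feed and feed_to_list are all made up, and the awk program is passed inline rather than through a temp file.

```shell
#!/bin/sh
# Made-up feed: three channel lines followed by two items, mirroring
# the layout of the LinuxWorld feed shown earlier.
sample_feed() {
  cat << 'EOF'
<channel>
    <title>Sample: News</title>
    <description>Sample channel</description>
    <link>http://example.com/</link>
</channel>
<item>
    <title>First headline</title>
    <link>http://example.com/1.htm</link>
    <description>First summary.</description>
</item>
<item>
    <title>Second headline</title>
    <link>http://example.com/2.htm</link>
    <description>Second summary.</description>
</item>
EOF
}

# Same stages as the final script, reading the feed from standard input.
feed_to_list() {
  echo "<ul>"
  grep -E '(<title|<link|<desc)' |
    sed -n '4,$p' |
    awk 'NR % 3 == 1 { title = $0 }
         NR % 3 == 2 { link = $0 }
         NR % 3 == 0 { print link; print title; print $0 }' |
    sed -e 's/<link>/<li><a href="/' -e 's/<\/link>/">/' \
        -e 's/<title>//' -e 's/<\/title>/<\/a><br>/' \
        -e 's/<description>//' -e 's/<\/description>/<\/li>/'
  echo "</ul>"
}

sample_feed | feed_to_list
```

The same function could be fed from curl in place of sample_feed, which keeps the fetching and the formatting separately testable.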

Adding the Headlines to your Web Page

Creating a Web page fragment with this script is straightforward:

$ sh get-linuxworld-news.sh > headlines.html

 

To include that fragment in a Web page, use server-side includes (SSI), which would look something like this:

<!--#include virtual="headlines.html"-->

Every time that page is served up to a visitor, they'll see the contents of the headlines.html file.

How to keep them up-to-date? Put the get-linuxworld-news.sh invocation into a cron job. The entry below rebuilds the HTML output file twice a day, at 6:09 a.m. and 6:09 p.m.:

9 6,18 * * *            get-linuxworld-news.sh > headlines.html
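cron typically runs jobs with a minimal environment and from the user's home directory, so relative names like the ones above may not resolve as expected. As a hedged sketch, a small wrapper can take absolute paths and write to a temporary file first, so visitors never get a half-written headlines.html. The function name update_file and every path in the usage comment are hypothetical placeholders.

```shell
#!/bin/sh
# Hypothetical wrapper: run a generator command and replace the target
# file only if the command exits successfully.
# usage: update_file TARGET COMMAND [ARGS...]
update_file() {
  target=$1; shift
  tmp="$target.tmp.$$"
  if "$@" > "$tmp"; then
    mv "$tmp" "$target"      # rename in the same directory: readers see old or new
  else
    rm -f "$tmp"             # keep the previous file on failure
    return 1
  fi
}

# Example call for a script run from cron (paths are placeholders):
# update_file /var/www/html/headlines.html sh /home/me/bin/get-linuxworld-news.sh
```

Because the final script always ends with exit 0, the failure branch only helps if the generator is changed to report errors through its exit status.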


 

That's it. Not too bad, was it?

It's also worth noting that this use of shell scripts to parse and format XML has more applications than just a bullet list of headlines from this site. For example, go to http://www.casino-bookstore.com/ and have a close look at the "Latest Gambling News" box: it's using an almost identical script to keep track of the gambling news XML feed from about.com.

Another example? Go to http://www.healthy-bookstore.com/ and look at the medicinenet news feed. Again, it's using curl and sed to turn the XML data into HTML data.
