说明:

  1. 本人极度讨厌中英文混杂,因为会增加理解难度,对读者极不友好,阅读体验差。所以,所有词汇全部中文化。
  2. 第一次出现的中文名词都附带原英文;不喜欢某个译名的,可自行替换。
  3. 版权:本文放弃所有版权,可以任意分发修改

致歉:总是做事比翻译快,等哪天有空,再完成翻译吧……

前言

一、名词介绍

1、草根计算集群或者叫草根计算机阵列(英文名为Beowulf cluster,也就是装逼的人说的“贝奥武夫机群”),是一种使用廉价个人电脑硬件组装而成的并行计算机集群,具有极好的性能/价格比,比较典型的大规模使用是谷歌的服务器。

注:这个集群在国内流行,一般认为是因为大家在研究谷歌成功经验的时候,发现其性能极高而又极其廉价的硬件系统,而开始获得比较广泛的关注和普及。其实国内早就有人这样做,不过都被人鄙视,觉得太粗陋。

2、个人小超算,或者叫小超算/小机群/小集群(英文名为Microwulf Cluster):主要是个人使用,注重性价比、高性能、可快速增减部件。一般提供高于260亿次的性能。

二、发展历史:

最早由美国航空航天局(NASA)的唐纳德·贝克尔(Donald Becker)等人于1994年开发。

三、方式

四、追求:

  1. 追求:更便宜、运算速度更快、更小巧。
  2. 衡量单位:美元/亿次。英文记作 $/Gflops,即美元/每秒十亿次(10^9 次)浮点运算;本文统一换算为中文习惯的“每秒亿次”,而不是“每秒十亿次”,换算示例见下方。
  3. 节能。
  4. 大小:主要取决于主板和芯片,基本要求是小型化,可方便搬动。
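
换算说明(示意性的单位换算,数值取自下文的实测结果):

$$ 1\ \text{Gflops} = 10\ \text{亿次/秒},\qquad 94.10\ \text{美元/Gflops} \div 10 = 9.41\ \text{美元/亿次} $$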

五、衡量参照物:

  1. 价格参考:
  2. 性能参考:

2013年全球最快的大型超算——广州超算中心的天河二号(Tianhe-2)的结构和性能:

  3. 功耗参考:

第一部分、个人电脑阵列

此部分原文来自卡尔文学院的Joel Adams

一、作者简介:

乔尔·亚当斯(Joel Adams)是卡尔文学院(Calvin College)计算机科学(computer science)教授,1988年在匹兹堡大学获得博士学位,主要研究超算的内部互连,是几本计算机编程教材的作者,曾两次获得富布赖特学者(Fulbright Scholar)称号(1998年毛里求斯,2005年冰岛)。

缇姆·布伦姆(Tim Brom)是卡内基梅隆大学计算机科学专业的研究生,2007年5月在卡尔文学院获得计算机科学学士学位。

二、说明:

此小超算运算速度超过每秒260亿次,价格低于2500美元,重量不到31磅,外观尺寸为11” x 12” x 17”——刚好够小,足够放在桌面上或者柜子里。

更新:2007年8月1日,这个小超算已经可以用1256美元构建,性价比达到约4.8美元/亿次——这样的话,同样的预算可以用更多的芯片来提升性能,让其更接近21世纪初的大型超算。

此小超算由卡尔文学院的计算机科学教授乔尔·亚当斯和助教缇姆·布伦姆设计和构建。


三、介绍

作为一个典型的超算用户,我原先需要到计算中心排队,而且能用的计算资源也受限制,这对开发新的分布式软件来说很麻烦。所以,我需要一台自己的小超算。梦想中的小超算可以小到放在我的桌面上,就像普通个人电脑一样,只需要普通的电源,不需要特殊的冷却装置,就可以在室温下运行……

2006年末, 两个硬件发展,让我这个梦想接近了现实:

于是我就设想了一个小型系统:4个节点,每个节点使用多核芯片,节点之间用高速网线连接。

2006年秋天, 卡尔文学院计算机系给了我们一笔小钱——就是2500美元,去构建这么一个系统,我们当时设定的目标:

我们参考了当时所知的一些小型超算,或者是性价比不错的超算,主要是下面几个:

下面是历年的性价比之王:

在同一时期,还有其他更廉价或者更具性价比的超算集群,不过这些记录都在2007年被改写了:最具性价比的就是下文介绍的小超算(2007年1月,9.41美元/亿次),而这一记录半年后又被它自己打破(2007年8月,4.784美元/亿次)。

架构设计:

个人小超算一般做法是使用多核芯片,集中安装到一个小的空间里,集中供电。

1960年代末,吉恩·阿姆达尔(Gene Amdahl)提出了一个关于系统均衡设计的经验法则(Amdahl’s Other Law),大意是:

为了让系统各部分均衡、不互相拖后腿,下面几个数值应该大致相当:

  • 每个核心的主频(GHz)
  • 每个核心的内存容量(GB)
  • 每个核心的网络带宽(Gbps)

高性能计算一般有三个瓶颈:芯片运算速度,运算所需内存,吞吐带宽。 本小超算里面,带宽主要是指网络带宽。我们预算是2500美元,在设定了每核内存量,每核的带宽之后,其中芯片运算速度当然是越快越好。

  1. 内部使用千兆以太网(GigE),意味着每条链路的带宽只有1Gbps;如果要更快,可以用Myrinet之类,不过那会超预算。按“每核1吉赫兹主频 + 1吉字节内存 + 1吉比特带宽”的比例来配,看起来比较均衡,哈哈。最终决定用2.0GHz的双核芯片,每核配1GB内存。

  2. 芯片:AMD Athlon 64 X2 3800+(AM2接口)。2007年1月时每片价格165美元,这种2.0GHz的双核芯片是当时能找到的性价比最好的(2007年8月更便宜了,每片只要65美元)。一共用了4块。

  3. 为了尽量减少体积,主板选用紧凑型的Micro-ATX主板:微星(MSI)K9N6PGM-F(和Little Fe项目用的一样)。此主板的特点是小(9.6” × 8.2”),并且带有一个AM2插槽,可支持AMD的Athlon多核芯片;如果以后预算更充裕,可以直接换上AMD的四核Athlon64芯片,系统其余部分都不用改动。主板两两相对安装,以减少体积。

  4. 此主板上已经内建一个千兆网卡,还有一个PCI-e扩展插槽,在PCI-e插槽插入另一块网卡(41美元),用于平衡芯片运算速度和网络带宽。这样,四块主板总共就有内嵌的4个网卡,外加PCI-e插槽的4张网卡,一共8个网络通道,用网线把它们都连接到8口交换机(100美元)上。

我们的意图是让每个核心都拥有自己独立的千兆通道,使系统在芯片运算速度(2个2GHz核心)和网络带宽(2块1Gbps网卡)之间不至于太失衡。这样的安排也方便我们做各种实验:把两块网卡做通道绑定(channel bonding);用不同的MPI库、分别以单网卡和双网卡运行HPL;或者让一块网卡专门跑“计算”流量、另一块专门跑“管理/文件服务”流量,等等。
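
每块主板的资源配比可以粗略核算如下(示意性计算,数值取自上文选型):

$$ \frac{2 \times 1\ \text{GB}}{2\ \text{核}} = 1\ \text{GB/核},\qquad \frac{2 \times 1\ \text{Gbps}}{2\ \text{核}} = 1\ \text{Gbps/核},\qquad \text{主频为}\ 2.0\ \text{GHz/核} $$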

  1. 每块主板插两根1GB内存,共2GB;4块主板合计8GB,这8GB内存消耗了预算的40%!!

  2. 为了更小巧,本小超算没有使用机箱,而是像Little Fe等集群那样,用一个非封闭的外架:一共三层有机玻璃,把主板直接安装在有机玻璃上,再用几根螺纹杆撑起来;最底部的夹层里,放着8口交换机、光驱,还有250GB的硬盘。

结构图如下:


图一: 小超算的硬件结构

如图所示,最顶层的下方放着一块主板,而中间层则两面都放主板,底层则上方放主板,这样做的目的是尽可能减少高度。

另外,主板之间的距离也不能太近:PCI-e网卡和风扇的高度不一,为了压缩总高度,两块面对面的主板在放置时相互错开,最终两块相对主板最高部件之间的间距为0.5”,相邻两层主板之间的间距为6”,如图所示:


图二:小超算塔架特写(主板之间的间距)

(说明:这些主板还有一个单独 PCI-e x16插槽,以后想提升性能的时候,可以插上一块GPU)

电源:350瓦的电源供电(每块主板一个),使用双面胶固定在有机玻璃上,电源插座放在最上面的有机玻璃上,如图所示:

图三:本小超算的电源和风扇特写

(此处用胶水固定硬盘、光驱和交换机)

最靠近夹层的底部主板作为“主节点”——主控主板,接有硬盘、光驱(可选)等,系统启动/关机/重启的时候也是从这个部分操作。其他的主板当作“分支节点”,使用网络启动方式(PXE方式)启动。

对最底部的主控主板做特殊设置,接上一块250GB硬盘,并且作为启动分区。插入光驱,主要是用于安装初始系统。

插入另一块网卡10/100 NIC到PCI插槽中,用于连接外部网络。

顶部三个节点都是无硬盘的,通过NFS使用主控板上的250GB硬盘。

下图显示了本小超算各个部分的连接关系(节点0为中心,接有硬盘、光驱以及连接外部网络的接口;内部核心是千兆交换机,用于连接其他节点):


小超算的节点连接示意图(Microwulf schematic)

说明:每个节点都有两条独立的通讯线路。

风扇:为了散热,我们购买了4个Zalman 120毫米机箱风扇(每个8美元)和配套的风扇网罩(每个1.5美元)。一共四个风扇,两个进风,两个出风。如图:

四个风扇

目前效果还不错,运转状态下,测到的温度只比室温高4度。

四个电源统一接到顶层的电源插排上供电。

操作系统使用的是有奔头(Ubuntu Linux)

软件:开源通用信道(Open MPI)会自动识别每个节点上的所有网络适配器,并在它们之间轮流(round-robin)收发信息。为了便于开源通用信道区分收发端,把内置网卡配置为192.168.2.x网段,把PCI-e插槽上的网卡配置为192.168.3.x网段。

价格参考(2007年一月):

部件 产品名称 单价 数量 总价
主板 微星 K9N6PGM-F MicroATX $80.00 4 $320.00
芯片 AMD Athlon 64 X2 3800+ AM2 CPU $165.00 4 $660.00
内存 金士顿 DDR2-667 1GB RAM $124.00 8 $992.00
电源 Echo Star 325W MicroATX 电源 $19.00 4 $76.00
网卡 Intel PRO/1000 PT PCI-Express 网卡(节点连接交换机) $41.00 4 $164.00
网卡 Intel PRO/100 S PCI 网卡(主控主板连接外部网络) $15.00 1 $15.00
交换机 Trendware TEG-S80TXE 8口千兆交换机 $75.00 1 $75.00
硬盘 希捷7200转 250GB SATA硬盘 $92.00 1 $92.00
光驱 Liteon SHD-16S1S 16X $19.00 1 $19.00
风扇 Zalman ZM-F3 120mm 机箱风扇 $8.00 4 $32.00
风扇网罩 Generic NET12 风扇网罩(120mm) $1.50+运费 4 $10.00
硬件支架 36” x 0.25” 螺纹杆 $1.68 3 $5.00
硬件固定 若干 0.25” 螺母和垫圈     $10.00
机箱或外壳 12” x 11” 有机玻璃(物理实验室的废料) $0.00 4 $0.00
总价       $2,470.00

非必需的硬件:KVM切换器(键盘/显示器/鼠标切换器)

部件 产品名称 单价 数量 总价
KVM切换器 Linkskey LKV-S04ASK $50.00 1 $50.00
总价       $50.00

除了支架和固定件(购买自Lowes五金店)、风扇和风扇网罩(购买自新蛋)之外,其余部件都购买自下面这家(量多有折扣,呵呵):

N F P Enterprises 1456 10 Mile Rd NE Comstock Park, MI 49321-9666 (616) 887-7385

这样省来省去,总造价就控制在2500美元以下了,平均每核308.75美元。

2007年8月配件价格:

各个部件的价格下降很快。芯片、内存、网络、硬盘等,都降了好多价格。2007年8月在 新蛋(Newegg) 中的价格:

部件 产品名称 单价 数量 总价
主板 微星 K9N6PGM-F MicroATX $50.32 4 $201.28
芯片 AMD Athlon 64 X2 3800+ AM2 CPU $65.00 4 $260.00
内存 海盗船(Corsair)DDR2-667 2 x 1GB RAM $75.99 4 $303.96
电源 LOGISYS PS350MA MicroATX 350W 电源 $24.53 4 $98.12
网卡 Intel PRO/1000 PT PCI-Express 网卡(节点连接交换机) $34.99 4 $139.96
网卡 Intel PRO/100 S PCI 网卡(主控主板连接外部网络) $15.30 1 $15.30
交换机 SMC SMCGS8 10/100/1000Mbps 8口非网管千兆交换机 $47.52 1 $47.52
硬盘 希捷7200转 250GB SATA硬盘 $64.99 1 $64.99
光驱 Liteon SHD-16S1S 16X $23.83 1 $23.83
风扇 Zalman ZM-F3 120mm 机箱风扇 $14.98 4 $59.92
风扇网罩 Generic NET12 风扇网罩(120mm) $6.48 4 $25.92
硬件支架 36” x 0.25” 螺纹杆 $1.68 3 $5.00
硬件固定 若干 0.25” 螺母和垫圈     $10.00
机箱或外壳 12” x 11” 有机玻璃(物理实验室的废料) $0.00 4 $0.00
总价     $1,255.80

(现在价格应该更低了!而且性能方面应该更强悍了!!!)

2007年8月,这个性价比已经达到了4.784美元/亿次,跌破5美元/亿次大关!!!!!

性耗比则保持不变。

如果把价格、性能、功耗三者综合起来看:每百万次/瓦/美元为0.04645,是原来小超算(2007年1月配置)的两倍;美元/瓦/百万次为73,255,也是原来的两倍。
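
其中 0.04645 这个数字可以这样推算(假定该综合指标的定义是“每瓦性能再除以整机价格”,这属于笔者根据数值反推的理解):

$$ \frac{58.33\ \text{Mflops/W}}{1255.80\ \text{美元}} \approx 0.04645\ \text{Mflops/W/美元} $$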

构建配置:

软件系统构建说明,有详细的介绍文件下载——建议想自己构建的人下载下来,然后按照其说明,逐步完成。

细节是魔鬼

首先是选用哪个你牛叉(Linux)发行版:我们曾经一度使用Gentoo,但后来觉得Gentoo太消耗能量了(包括系统管理员的精力和系统的耗电),于是试了试有奔头。一开始安装的是6.10版本的桌面版,其内核是2.6.17;美中不足的是,主板内置网卡的驱动要到2.6.18内核才收入,所以开始的两个月里,我们的小超算用的是7.04的测试版(内核2.6.20),等稳定版发布后就换成了稳定版。

在其他三个计算节点上,安装的是有奔头的服务器版,因为它们不需要桌面功能。

也就是: 有奔头桌面版+3个有奔头服务器版

集群管理软件:试过一些集群管理软件:ROCKS、Oscar和Warewulf。但ROCKS和Oscar不支持无盘节点;Warewulf工作良好,只是本小超算实在太小,看不出它的优势来。受一篇论文的影响,我们曾想使用iSCSI;不过为了尽快让集群跑起来,还是决定使用NFSroot,因为其配置非常简单:只需修改 /etc/initramfs.conf,让系统生成一个支持NFSroot的启动内存镜像(initial ramdisk),然后像通常的无盘启动那样,在主控节点上配好DHCP/TFTP/PXELinux即可。

网络配置:我们给每块主板的内置网卡分配一个192.168.2.x网段的地址,给每块PCI-e网卡分配一个192.168.3.x网段的地址,并让NFS流量走192.168.2.x网段,以便把“管理”流量和计算流量分开。后来发现Open MPI会同时使用两个网络接口(见下文),所以这样做也把通信负载分摊到了两块网卡上。

我们遇到的一个问题是主板内置的Nvidia网卡不太稳定。在那次破纪录的测试(见下一节)之后,内置网卡开始出问题。在网上搜索一番之后,我们给forcedeth驱动模块加了如下参数:

forcedeth max_interrupt_work=35

问题有所缓解,但没有根除。最初我们是用内置的Nvidia千兆网卡挂载存储的;不幸的是,Nvidia网卡一出问题就会自行复位,这会中断NFS挂载,把“计算”节点挂起。这个问题我们还在设法彻底解决,但它并不妨碍我们对小超算做基准测试。


性能表现:

小超算建好并正常运转之后,我们自然想知道它到底有多“快”。“快”可以有很多种定义,但既然HPL基准测试是最强500超算榜单采用的标准,我们就先用它来衡量性能。

本超算安装了Ubuntu当时的开发版(gcc-4.1.2),并编译安装了Open MPI和MPICH。一开始我们选用Open MPI作为MPI库,并把两块千兆网卡(内置网卡和插在x16 PCI-e插槽上的Intel网卡)都配置好;然后编译了GOTO BLAS库和HPL(High Performance Linpack,高性能Linpack基准测试)。

Goto BLAS库编译得很顺利,但编译HPL(它依赖BLAS)时遇到了链接错误:/usr/lib/libgfortranbegin.a 里有个名为 main.f 的模块带了一个 main() 函数,与HPL自己的 main() 冲突。库文件本不应该包含 main(),于是我们用 ar 把这个模块从 /usr/lib/libgfortranbegin.a 中删掉,之后一切编译正常。

接下来,我们开始试验运行HPL的各种参数,主要是问题规模和进程布局:PxQ在{1x8, 2x4}之间变化,NB在{100, 120, 140, 160, 180, 200}之间变化,N(问题规模)逐步增大,直到内存耗尽。作为测试的一个例子,下面的图六画出了HPL性能(GFLOPS)随问题规模N变化的曲线。

图六:HPL WR00R2R4(NB=160)的测试结果

图六选用PxQ=2x4、NB=160,N从一个很小的值一直增加到30,000。可以看到,N超过10,000之后小超算达到20 GFLOPS,N大于25,000后超过25 GFLOPS;而N一旦超过30,000,就会出现“内存不足”错误。
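
顺带一提,HPL有一个常用的经验公式,可以估算内存允许的最大问题规模N(假设矩阵占用约80%的总内存,双精度每个数占8字节;这个公式不是原文内容,仅作参考):

$$ N \approx \sqrt{\frac{0.8 \times M}{8}} = \sqrt{\frac{0.8 \times 8\times 10^{9}}{8}} \approx 28{,}000 $$

其中 M 为集群总内存的字节数(此处为8GB),与上文“N超过30,000就内存不足”的现象基本吻合。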

此小超算实测的峰值性能为26.25 GFLOPS,而其理论峰值性能是32 GFLOPS(8核 × 2 GHz × 每核每周期2次双精度浮点运算),也就是说实测达到了理论峰值的82%。注意:能有这么高的利用率,一个原因是使用了Open MPI,它会同时使用两个千兆网口;除非明确指定只用某些网口,否则它会在各个网口之间轮流传输数据。
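
理论峰值与利用率的计算如下:

$$ 8\ \text{核} \times 2\ \text{GHz} \times 2\ \text{次双精度浮点/周期} = 32\ \text{GFLOPS},\qquad \frac{26.25}{32} \approx 82\% $$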

值得注意的是,这个成绩是在系统和以太网全部使用默认设置的情况下取得的。特别是,我们没有调整Doug Eadline和Jeff Layton那篇集群优化文章里提到的任何以太网参数,基本上就是“开箱即用”的状态。

为了评估网卡的表现,Tim后来又跑了几次HPL,并用netpipe测量网卡的延迟:内置网卡的延迟为16-20微秒,PCI-e网卡为20-25微秒,比我们预期的要低(更好)。

作为性能验证,我们还做了另一个实验:把两个千兆网口做通道绑定(channel bonding),合成一个逻辑网口,然后用MPICH2配合这个绑定网口,采用与Open MPI相同的那组HPL参数。得到的最好成绩是24.89 GFLOPS(效率77.8%)。看来“Open MPI + 双网口”胜过“MPICH2 + 绑定网口”。

我们还试过只用PCI-e千兆网卡跑Open MPI。在相同的HPL参数下得到26.03 GFLOPS(效率81.3%),与同时使用两个网口的成绩相当接近。这说明内置网卡干的活可能没有我们想象的多。

下面看看历年最强500超算里面的本小超算性能方面的排名:

1993年11月: #6
1994年11月: #12
1995年11月: #31
1996年11月: #60
1997年11月: #122
1998年11月: #275
1999年6月: #439
1999年11月: 被踢出名单了

1993年11月,本小超算可以排名世界第6;1999年6月,还能排在第439位。相比于一般超算要占一个大机房、用掉众多处理器,这个4片芯片、8个核心、只有11” x 12” x 17”大小的集群,能有如此表现,已经很不错了。

更进一步挖掘一下这个列表:1993年11月的排名中,排在第5位的是使用512颗处理器的Thinking Machines CM-5/512,运算速度达到300亿次。本小超算用4片芯片(8个核心)就相当于当年的512颗处理器啊,哈哈。

1996年11月,此小超算可以排在第60位,紧随其后的是使用256颗处理器的Cray T3D MC256-8。现在8个核心的性能就超过了11年前的256颗处理器,这还没算价格差异呢:T3D可是花费了上百万美元!

超算性能一般以每秒浮点运算次数(flops)来衡量。早期超算使用百万次(Mflops)来衡量,随着硬件飞跃,十亿次(Gflops)已经是很落后的指标了,现在都流行用万亿次(Tflops),甚至千万亿次(Pflops)来表示了。(Mflops:10^6 flops;Gflops:10^9 flops;Tflops:10^12 flops;Pflops:10^15 flops)

另外,需要注意区分:

一般计算机生产商会标示峰值,但实际检测一般只有峰值的50%-60%左右。

另一个要注意的是精度,一般高性能运算都是用的双精度,所以不可混淆了单精度和双精度运算。

衡量超算性能的标准基准(也就是top500.org最强500超算榜单所用的),是高性能Linpack(HPL):一个测试并报告超算双精度浮点性能的程序。要安装和运行HPL,必须先装好一套基础线性代数子程序库(BLAS),因为HPL依赖它。

2007年3月,我们用HPL和Goto BLAS对小超算做了基准测试。编译安装好这两个软件包之后,我们运行标准的双精度HPL,并按如下方式变化参数:PxQ在{1x8, 2x4}之间变化;NB在{100, 120, 140, 160, 180, 200}之间变化;N从1,000开始逐步增大。在下列参数下:

PxQ = 2x4; NB = 160; N = 30,000

HPL在其WR00R2R4这一项上报告了26.25 Gflops。小超算在其他项目上也超过了26 Gflops,但26.25 Gflops是我们得到的最高值。

在最强500超算中,1996年的Cray T3D-256也才达到253亿次,所以我们这个260亿次的性能,是足够用来做很多事情的了。

在我们完成基准测试之后,Advanced Clustering Technologies发布了一个方便的网页计算器,可以省去HPL调参中的大量试错工作。

性价比:

本小超算价格2470美元,性能26.25 Gflops,性价比为94.10美元/Gflop,也就是不到0.10美元/Mflop。这是第一台性价比跌破100美元/Gflop的超算。
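
性价比的计算过程如下:

$$ \frac{2470\ \text{美元}}{26.25\ \text{Gflops}} \approx 94.10\ \text{美元/Gflops} = 9.41\ \text{美元/亿次} $$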

下面列表可作为参考,了解下这个性价比的意义:

1976年,Cray-1花费约800万美元,峰值250 MFlops,平均32,000美元/MFlop;如果按持续运算性能算,单价还会更高。

1985年,Cray-2花费约1700万美元,运算峰值3.9 GFlops,平均约4350美元/MFlop(约4,358,974美元/GFlop)。

1997年,打败国际象棋世界冠军卡斯帕罗夫的IBM深蓝,价格是500万美元,性能是113.8亿次,其性价比是43,936.7美元/亿次。

2003年,肯塔基大学的草根集群KASY0造价39,454美元,在双精度HPL上跑出187.3 Gflops,性价比约为210美元/Gflop。同年,伊利诺伊大学厄巴纳-香槟分校的国家超算应用中心(NCSA)用大约50,000美元搭了一个PS2集群。该集群没有公开的实测性能数据;这并不奇怪,因为PS2的硬件不支持双精度浮点运算。其理论峰值约为500 Gflops(单精度);但有研究表明,PS2做双精度运算的耗时是单精度的17倍以上。即便按被高估的单精度峰值来算,其性价比也超过100美元/Gflop;按实测双精度性能算,这个数字很可能还要高出17倍以上。

2004年,弗吉尼亚理工建造了System X,造价570万美元,实测性能12.25 Tflops,性价比约为465美元/Gflop。

2007年,Sun的Sparc Enterprise M9000基础售价511,385美元,实测性能1.03 Tflops,性价比超过496美元/Gflop。(基础售价是32颗CPU的型号,而基准测试用的是64颗CPU的型号,后者想必更贵。)

按9.41美元/亿次计算,我们的小超算可以说是超算里面性价比最好的一个了。当然,它还提供不了千万亿次级别的运算;若有需要,可以放宽价格限制,换取更高的性能。

效能(世界纪录)与功耗:

以2007年一月的价格,本小超算用了2470美元,获得262.5亿次的运算速度,平均9.41美元/亿次。这个已经成为新的世界纪录了。

另外,节能方面最近也是热点话题,电能比(耗电量/性能)也需要测一测。性耗比对集群非常重要,尤其是成片的集群(比如谷歌的服务器农场)。我们测了一下本小超算,满载运行时的功耗约为450瓦。

算了下运行时的性耗比就是1.714瓦/亿次。
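
核算一下(示意性计算,450瓦为上文的满载功耗):

$$ \frac{450\ \text{瓦}}{262.5\ \text{亿次/秒}} \approx 1.714\ \text{瓦/亿次} = 17.14\ \text{W/Gflops},\qquad \frac{26{,}250\ \text{Mflops}}{450\ \text{瓦}} \approx 58.3\ \text{Mflops/W} $$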

对比下其他的超算。

专门按节能思路设计的超算Green Destiny使用了非常省电的芯片,只需要很少的冷却,240个核心总共耗电3.2千瓦,获得约1010亿次的运算性能,性耗比为3.1瓦/亿次——差不多是我们这台自制小超算的两倍(即每单位性能多耗近一倍的电)!!!
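
对比核算一下:

$$ \frac{3200\ \text{瓦}}{1010\ \text{亿次/秒}} \approx 3.17\ \text{瓦/亿次} \approx 1.8 \times 1.714\ \text{瓦/亿次} $$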

再看Orion Multisystems的集群。Orion公司已经不在了,但几年前他们卖过两款商用集群:12节点的桌面集群DS-12和96节点的桌边集群DS-96,都使用Transmeta CPU。DS-12满载功耗170瓦,性能约13.8 GFLOPS,性耗比为12.31瓦/GFLOP(比本小超算好不少);DS-96满载功耗1580瓦,性能109.4 GFLOPS,性耗比为14.44瓦/GFLOP,也胜过本小超算。

绿色500强:按每瓦性能产出(MFLOPS/Watt)衡量(数字越大越好),本小超算为58.33,DS-12为81.18,DS-96(桌边机型)为69.24。初步看来,Orion系统比我们造的系统单位用电效率高。但我们再深入看一下:

以瓦/GFLOP衡量,Orion系统表现很好,考虑到Transmeta芯片的年代,这很不容易。但再看价格/性能:DS-12桌面机型定价约10,000美元,性价比约为724美元/GFLOP;DS-96桌边机型定价约100,000美元,性价比约为914美元/GFLOP。也就是说,虽然Orion系统的用电效率高得多,但其每GFLOP的价格也比本小超算高得多,资金效率差得多。

既然Orion用电效率高,而Microwulf资金效率高,我们就用一个综合指标来对比:MFLOP/Watt/$,即单位功耗、单位价格下的运算产出。Microwulf系统追求的就是最小的价格、最小的功耗、更大的运算能力!!!!

计算结果对比悬殊。

再来算算 $/Watt/MFLOP,即获得单位运算能力、单位功耗所需要的资金量——可以用它来估算以自己的财力能构建出多大运算能力的系统。

结果还是很悬殊。

一些值得注意的例外是:

Green Destiny:2002年在洛斯阿拉莫斯国家实验室(Los Alamos National Labs)建造的实验性刀片集群。它使用240颗Transmeta TM560 CPU,耗电3.2千瓦,Linpack运算速度101 Gflops,性耗比约为31瓦/Gflop;上面那台自造的Microwulf只有17.14瓦/Gflop,比它好。

(似乎已经停业的)Orion Multisystems公司的DS-12和DS-96系统:

DS-12“桌面”系统满载耗电170瓦,速度为13.8 Gflops(Linpack),性耗比 = 12.31瓦/Gflop(但其价格约10,000美元,性价比 = 724美元/Gflop)。

DS-96“桌边”系统满载耗电1580瓦,性能109.4 Gflops(Linpack),性耗比为14.44瓦/Gflop。(DS-96定价约100,000美元,性价比约914美元/Gflop。)


我们的小超算在性价比上远超这些商业机器,性耗比也有相当的竞争力。

节能500超算名单是基于最强500超算榜单的(本小超算没有被列入,呵呵),按每瓦运算次数排名。我们的小超算是1.714瓦/亿次(17.14瓦/Gflop),换算如下:

1 / 17.14 W/Gflop * 1000 Mflops/Gflop= 58.34 Mflops/W

2007年8月,我们的小超算超越了节能500超算的第二位,Mare Nostrum (58.23 Mflops/W) – 可惜啊,和排名第一BlueGene/L (112.24 Mflops/W)的距离有点远。

结论

此小超算用了4块芯片、8核集群,大小为11” x 12” x 17”,适合放在桌面上,也适合打包放到飞机上运输。

除了小巧,HPL检测本超算有262.5亿次的运算性能,总花费是2470美元(2007年1月),性价比为9.41美元/亿次。

本小超算能有如此神力的原因是:

我们不打算把技术细节当成秘密,而是希望大家都来尝试着玩玩;其实很多部件都是可以替换的。

比如,随着固态硬盘的降价,可以试试固态硬盘替换掉机械硬盘,看看对性能有何影响。

比如内存:因为内存降价,可以把内存条都换成2GB的,这样每核就有2GB内存。回想一下,HPL在N超过30,000时总是内存不足;加大内存之后还能再多榨出多少FLOPS,会很有意思。从图六的曲线看,性能已经开始趋于平缓,但似乎仍有提升的空间。

比如主板和芯片:此微星主板使用AM2插槽,刚好支持AMD新的四核Athlon64芯片,可以用它替换掉上文的双核芯片,使整个系统变成16核,性能更加强劲。有兴趣的同学可以测测这么做能提升多少性能?性价比会发生什么变化?千兆内部网络的效率又如何变化……

等等……

应用:

和其他超算一样,本小超算可以运行一些并行运算软件——需要特别设计,以利用系统的并行运算能力。

这些软件一般会使用通用信道(MPI)或并行虚拟机(PVM,Parallel Virtual Machine)。这类库提供了分布式计算的最基础功能:一是让进程可以跨网络通信和同步,二是提供一种“分发执行、最后汇总”的机制,使同一个程序可以被复制成多份,分别在各个节点上运行。
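
下面是一个极简的C语言MPI示例(仅用于示意消息传递的基本形式,并非本小超算实际运行的程序):每个进程各自算出1到100中属于自己的那一段的和,再通过MPI_Reduce汇总到0号进程。

    /* sum_mpi.c —— 示意性的MPI程序:各进程求部分和,再汇总到0号进程 */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);               /* 初始化MPI环境 */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* 本进程编号(0到size-1) */
        MPI_Comm_size(MPI_COMM_WORLD, &size); /* 进程总数(比如8核就是8) */

        /* 每个进程按步长size认领1..100中的一部分并求和 */
        long local = 0, total = 0;
        for (long i = rank + 1; i <= 100; i += size)
            local += i;

        /* 各进程的部分和通过网络(消息传递)累加到0号进程 */
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("1..100 的总和 = %ld(由 %d 个进程协作算出)\n", total, size);

        MPI_Finalize();
        return 0;
    }

编译和运行通常用 mpicc 和 mpirun(具体命令因MPI实现而略有差异);配合节点列表(hostfile),进程就会被分派到各个节点上,通过网络来通信。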

有很多应用软件已经可以在本小超算上使用,大部分是由特定领域的科学家写的,用于解决特定问题:

并行有限元分析(FEA)程序,包括:

这是我们使用小超算的领域:

常见问题回答:

  1. 是不是小超算运行 大型软件或者是游戏 比较快?

    不一定,如果该软件或者游戏是可以通过网络进行并行运行的,(例如该软件使用了message passing interface (MPI)),则会非常得益于此设计。不过现实中很少有公司会开发这样的游戏,因为超算一般不是普通人用的,应用软件则有一些,但也不是普通人会经常接触的。

    普通多核电脑,内存对每个处理器是共用的,每个运行在不同核心上的 进程或线程 可在这共享的内存层面进行 通讯。

    而本小超算不同:每块主板都有自己独用的内存,程序无法通过这些分散在不同主板上的内存直接通讯,只能依赖网络进行通讯(使用MPI)。由于内存分散在集群的各个处理器上,这样的集群被称为分布式内存多处理机(distributed memory multiprocessor)。

  2. 可以使用视窗系统来驱动小超算么?

    小超算的关键是用让集群互联,并行协作,目前最常用的软件是 MPI。

    视窗系统下也有若干个版本的 MPI 可用,(可搜索 ‘windows mpi’.)因此,也可以用视窗操作系统来组小超算。微软已经发布有若干个用于 高性能运算的 操作系统版本,叫 Windows Compute Cluster Server (Windows CCS),里面带有所有相关集群用的软件,包括MPI。

  3. 我也要搞部小超算,可到哪里学习?

    下面链接可以学习:

    • Building a Beowulf System(Jan Lindheim):快速入门概览
    • Jacek Radajewski 和 Douglas Eadline 的 HowTo:更详细的综述
    • Kurt Swendson 的 HowTo:用Redhat Linux和LAM-MPI搭建集群的分步指南
    • Engineering a Beowulf-style Compute Cluster(Robert Brown):一本关于构建草根集群的在线书籍,信息量很大
    • Beowulf 邮件列表 FAQ(Don Becker 等):Beowulf.org邮件列表上常见问题的答案汇总,该列表还有可搜索的存档
    • Beowulf.org 的 Projects 页面:列出了最早一百来个草根集群项目的站点链接,其中很多站点的信息对自建集群很有帮助
  4. 怎么把主板按到有机玻璃上?

    供应商随主板附送了螺丝和黄铜支撑柱(standoff)。支撑柱一端是螺纹(公头),通常拧进机箱;另一端是螺母(母头),用来拧主板螺丝。要用它们把主板装到有机玻璃上,只需要:

    • 在有机玻璃板上,按照主板安装孔的位置钻孔;
    • 把黄铜支撑柱拧进有机玻璃上的孔里;
    • 再用螺丝把主板固定到支撑柱上。

    为了给每块有机玻璃板定位,我们把主板平放在玻璃板上,用记号笔透过主板的安装孔在玻璃上描点。技巧就是:

    • 有一块有机玻璃板上下两面都要装主板,所以两面都要做标记;
    • 有两块主板是倒挂的,另外两块是正放的,做标记时要把这一点考虑进去。

    我们用红色记号笔标记朝上主板的孔位,用蓝色记号笔标记朝下主板的孔位。

    标记好之后,我们把这些有机玻璃板拿到学校的机加工车间,用台钻把每块板上的孔钻出来。

    主板孔都钻好后,再把几块有机玻璃板按它们在小超算里的顺序叠起来,在四角钻出穿螺纹杆的孔。

    然后把支撑柱拧进有机玻璃,注意不要拧得太紧:黄铜很软,非常容易拧断。万一拧断了,把那块有机玻璃拿回台钻,把留在孔里的铜螺丝钻掉即可(如果只断了一个,也可以不管它,装主板时少拧一颗螺丝)。

    支撑柱都装好后,把主板放到支撑柱上,用螺丝固定,就大功告成了!

    还有一个值得一提的细节:在把主板拧紧之前,我们在每块主板上选了一个支撑柱用来做防静电接地。做法是找一些旧电话线,一端绕在该支撑柱上,再拧紧那个支撑柱上的螺丝;然后把每根线的另一端接到其中一根螺纹杆上,再把那根螺纹杆接到其中一个电源上接地。

  5. 这小超算可以卖么?

    否,主要是我们对商业不在行。

    不过,我们正在筹建一个基金,用于在校内资助像小超算这样的学生项目。如果你觉得这个网站对你有帮助,欢迎向该基金捐款(可抵税):

    CS Hardware Endowment Fund Department of Computer Science Calvin College 3201 Burton SE Grand Rapids, MI 49546