Extending a multi-agent system for genomic annotation

Keith Decker, Salim Khan, Carl Schmidt, and Dennis Michaud

Computer and Information Sciences Department, University of Delaware, Newark, DE 19716

decker@cis.udel.edu

Abstract. The explosive growth in genomic (and soon, expression and proteomic) data, exemplified by the Human Genome Project, is a fertile domain for the application of multi-agent information gathering technologies. Furthermore, hundreds of smaller-profile, yet still economically important organisms are being studied that require the efficient and inexpensive automated analysis tools that multi-agent approaches can provide. In this paper we give a progress report on the use of the DECAF multi-agent toolkit to build reusable information gathering systems for bioinformatics. We will briefly summarize why bioinformatics is a classic application for information gathering, how DECAF supports it, and recent extensions underway to support new analysis paths for genomic information.

1 Introduction

Massive amounts of raw data are currently being generated by biologists while sequencing organisms. Most of this raw data must be analyzed through the piecemeal application of various computer programs and hand-searches of various public web databases. Typically both the raw data and any valuable derived knowledge will remain generally unavailable except in published natural language texts such as journal articles. However, it is important to note that a tremendous amount of genetic material is similar from organism to organism, even when they are as outwardly different as a yeast, fruit fly, mouse, or human being. This means that if a biologist studying the yeast can figure out what a certain gene does (its function), then other biologists can at least guess that similar genes in other organisms play similar roles. Thus huge databases are being populated with sequence data and functional annotations [2]. All new sequences are routinely compared to known sequences for clues as to their functions.

A large amount of work in bioinformatics over the past ten years has gone into developing algorithms (pattern matching, statistical, and/or heuristic/knowledge-based) to support the work of hypothesizing gene function. Many of these are available to biologists in various implementations, and now many are available over the web. Meta-sites combine many published algorithms, and sites specialize in information about particular topics such as protein motifs.

From a computer science perspective, several problems have arisen, as we have described elsewhere [6]. To summarize, what we have is a large set of heterogeneous and dynamically changing databases, all of which have information to bring to bear on the biological problem of determining genomic function. We have biologists producing thousands of possible genes, for which functions must be hypothesized. For the case of all but the largest and well-funded sequencing projects, this must be done by hand by a single researcher and their students.

Multi-agent information gathering systems have a lot to contribute to these efforts. Several features make a multi-agent approach to this problem particularly attractive: information is available from many distinct locations; information content is heterogeneous; information content is constantly changing; much of the annotation work for each gene can be done independently; biologists wish to both make their findings widely available, yet retain control over the data; new types of analysis and sources of data are appearing constantly.

We have used DECAF, a multi-agent system toolkit based on RETSINA [21, 10, 7] and TAEMS [9, 23], to construct a prototype multi-agent system for automated annotation and database storage of sequencing data for herpesviruses [6]. The resulting system eliminates tedious and always out-of-date hand analyses, makes the data and annotations available for other researchers (or agent systems), and provides a level of query processing beyond even some high-profile web sites.

Since that initial system, we have used the distributed, open nature of our multi-agent solution to expand the system in several ways that will make it useful for biologists studying more organisms, and in different ways. This paper will briefly describe our approach to information gathering, based on our work on RETSINA; the DECAF toolkit; our initial annotation system; and our new extensions for functional annotation, EST processing, and metabolic pathway reasoning.

2 DECAF

DECAF (Distributed, Environment-Centered Agent Framework) is a Java-based toolkit for creating multi-agent systems [13]. In particular, several tools have been developed specifically for prototyping information gathering systems. Also, the internal architecture of each DECAF agent has been designed much like an operating system: as a set of services for the “intelligent” (resource-efficient, adaptively-scheduled, soft real-time, objective-persistent) execution of agent actions. DECAF consists of a set of well defined control modules (initialization, dispatching, planning, scheduling, and execution, each in a separate, concurrent thread) that work in concert to control an agent’s life cycle. There is one core task structure representation that is shared between all of the control modules. This has meant that even non-reusable domain-dependent agents can be developed more quickly than by the API approach where the programmer has to, in effect, create and orchestrate the agent’s architecture as well as its domain-oriented agent actions. This section will first discuss the internal architecture of a generic DECAF agent, and then discuss the tools (such as middle agents, system debugging aids, and the information extraction agent shell) we have built to implement multi-agent information gathering systems. The overall internal architecture of DECAF is shown in Figure 1. These modules run concurrently, each in their own thread. Details of the DECAF implementation can be found elsewhere [13].

Fig. 1. DECAF Architecture Overview. (Figure not reproduced. It shows the Plan File and Incoming KQML Messages feeding the Agent Initialization, Dispatcher, Planner, Scheduler, and Executor modules; these are connected by the Incoming Message Queue, Objectives Queue, Task Queue, Agenda Queue, Pending Action Queue, and Action Results Queue, and share the Task Templates Hashtable and Domain Facts and Beliefs; Action Modules produce Outgoing KQML Messages.)

2.1 DECAF Support for Info Gathering

DECAF provides core internal architectural support for secondary user utility. Thus DECAF plans can include alternatives, and these alternatives can be chosen dynamically at runtime depending on user constraints on answer timeliness or other resource constraints. DECAF also supports building information gathering systems by providing useful middle agents and a shell for quickly building information extraction agents for wrapping web sites. The Agent Name Server (ANS) (“white pages”) is an essential component for agent communication. It works in a fashion similar to DNS (Domain Name System) by resolving agent names to host and port addresses. The Matchmaker serves as a “yellow pages” to assist agents in finding services needed for task completion. The Broker agent acts as a kind of “middle manager” to assist an agent with collections of services. The broker can provide a larger service than any single provider can, and can often manage a large group of agents more effectively [8]. A Proxy agent allows web page Java applets to communicate with DECAF agents that are not located on the same server as the applet. The Agent Management Agent (AMA) allows MAS designers a look at the entire running set of agents spread out across the Internet that share a single agent name server. This allows designers to query the status of individual agents and watch or record message passing traffic.
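The difference between the “white pages” and “yellow pages” roles can be illustrated with a minimal Python sketch. This is not DECAF's Java API: the class names, method names, service name, host, and port below are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AgentNameServer:
    """White pages: maps an agent name to a (host, port) address."""
    _addresses: dict = field(default_factory=dict)

    def register(self, name: str, host: str, port: int) -> None:
        self._addresses[name] = (host, port)

    def resolve(self, name: str):
        # Unknown agents resolve to None instead of raising.
        return self._addresses.get(name)


@dataclass
class Matchmaker:
    """Yellow pages: maps a service description to the agents offering it."""
    _providers: dict = field(default_factory=dict)

    def advertise(self, service: str, agent_name: str) -> None:
        self._providers.setdefault(service, []).append(agent_name)

    def find(self, service: str) -> list:
        return list(self._providers.get(service, []))


ans = AgentNameServer()
ans.register("GenBankIEA", "localhost", 9001)  # hypothetical host and port
mm = Matchmaker()
mm.advertise("blast-search", "GenBankIEA")     # hypothetical service name

# A requester first asks *who* offers a service, then *where* that agent lives.
for agent in mm.find("blast-search"):
    print(agent, ans.resolve(agent))
```

Keeping the two lookups separate is what lets a new provider join at runtime: it registers an address once and advertises as many services as it likes.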

Information Extraction Agent Shell. The main functions of an information extraction agent (IEA) are [7]: fulfilling requests from external sources in response to a one-shot query (e.g. “What is the price of IBM?”); monitoring external sources for periodic information (e.g. “Give me the price of IBM every 30 minutes.”); and monitoring sources for patterns, called information monitoring requests (e.g. “Notify me if the price of IBM goes below $50.”). These functions can be written in a general way so that the code can be shared for agents in any domain.
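To make the claim of domain independence concrete, the three request types can be sketched as generic functions over any callable data source. This is an illustrative Python sketch, not the IEA shell itself; the function names and the fake price source are assumptions, and real periodic requests would wait between polls rather than loop immediately.

```python
def one_shot(source, key):
    """Answer a single query against the source."""
    return source(key)


def periodic(source, key, ticks):
    """Poll the source once per tick and collect every answer."""
    return [source(key) for _ in range(ticks)]


def monitor(source, key, condition, ticks):
    """Poll until the condition holds; return (tick, value) or None."""
    for tick in range(ticks):
        value = source(key)
        if condition(value):
            return tick, value
    return None


def make_fake_source(values):
    """Hypothetical stand-in for a wrapped web site: replays canned values."""
    it = iter(values)
    return lambda key: next(it)


prices = make_fake_source([55, 52, 49, 51])
print(one_shot(prices, "IBM"))                      # first canned price
print(monitor(prices, "IBM", lambda p: p < 50, 3))  # fires when price < 50
```

Nothing in these functions mentions stock prices or genes, which is the sense in which the code can be shared across domains; only the source and the condition change.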

Since our IEA operates on the Web, the information gathered is from external information sources. The agent uses a set of wrappers and the wrapper induction algorithm STALKER [18] to extract relevant information from the web pages after being shown several marked-up examples. When the information is gathered it is stored in the local IEA “infobase” using Java wrappers on a PARKA [15] knowledge base. This makes new IEAs fairly easy to create, and forces the difficult parts of this problem back onto KB ontology creation, rather than the production of tools to wrap web pages and dynamically answer queries. Currently, there are some proposals for XML-based page annotations which, if adopted, will make site wrapping easier syntactically (although they still do not solve the ontology problem; but see projects such as OIL).

3 A DECAF Multi-Agent System for Genomic Analysis

These tools can be put to use to create a prototype multi-agent system for various types of genomic analysis. In the prototype, we have chosen to simplify the query subsystem by materializing all annotations locally, thus removing the need for sophisticated query planning (e.g. [16]). This is a reasonable simplification since most of our work is with viruses that have fairly small genomes (around 100 genes for a herpesvirus and around 30 herpesviruses) or with larger organisms (e.g. chickens) for which we are constructing a consensus database explicitly.

Figure 2 shows an overview of the system as four overlapping multi-agent organizations. The first, Basic Sequence Annotation, is charged with integrating remote gene sequence annotations from various sources with the gene sequences at the Local Knowledge Base Management Agent (LKBMA). The second, Query, allows complex queries on the LKBMAs via a web interface. The third, Functional Annotation, is responsible for collecting information needed to make an informed guess as to the function of a gene, specifically using the three-part Gene Ontology [22]. The fourth organization, EST Processing, enables the analysis of expressed sequence tags (ESTs) to produce gene sequences that can be annotated by the other organizations.

An important feature to note is that we are focusing on annotation and analysis services that are not organism specific. In this way, the resulting system can be used to build and query knowledge bases from several different organisms. The original subsystems (basic annotation and the simple query system) were built to annotate the newly sequenced Herpesvirus of Turkey (the bird), and then to compare it to the other known sequenced herpesviruses. Work is just beginning to build a new knowledge base from chicken ESTs, and to extend the depth of the herpesvirus KB for Epstein-Barr Virus (human herpesvirus 4), which has clinical significance for pediatric organ transplant patients.

Fig. 2. Overview of DECAF Multi-Agent System for Genomic Analysis. (Figure not reproduced. It shows four overlapping organizations. Interface applets for Sequence Addition, Functional Annotation, User Query, and EST Entry [Chromatograph/FASTA] each connect through a Proxy Agent. Basic Sequence Annotation: Annotation Agent, Sequence Source Processing Agent, and the ProDomain, SwissProt/ProSite, PSort, and GenBank IEAs. Query: Query Processing Agent and Sequence LKBMA. Functional Annotation: Ontology Agent, Ontology Reasoning Agent, and the Flybase, Mouse Genome DB, and SGD (yeast) IEAs. EST Processing: Chromatograph Processing, Consensus Sequence, SNP-Finder, and EST LKBMA.)

3.1 Basic Sequence Annotation and Query Processing

Figure 3 shows the interaction details for the basic sequence annotation and query subsystems. We will describe the agents by their RETSINA classification.

Fig. 3. Basic Annotation and Query Agent Organizations. (Figure not reproduced. Interface Agents: Sequence Addition Applet, User Query Applet. Domain-Independent Task Agents: Proxy Agent, Matchmaker Agent, Agent Name Server Agent. Task Agents: Annotation Agent, Query Processing Agent, Sequence Source Processing Agent. Information Extraction Agents: Local Knowledgebase Management Agents and the GenBank, ProDomain, SwissProt/ProSite, and PSort Info Extraction Agents.)

Information Extraction Agents. Currently 4 agents based on the IEA shell wrap public web sites. The GenBank wrapper primarily supplies “BLAST” services: given the sequence of a herpesvirus gene, what are the most similar genes known in the world (called “homologs”)? The answer here can give the biologist a clue as to the possible function of a gene, and for any gene that the biologist does not know the function of, a change in the answer to this query might be significant. The SwissProt wrapper primarily provides protein motif pattern searches. If we view a protein as a one-dimensional string of amino acids, then a motif is a regular expression matching part of the string that may indicate a particular kind of function for the protein (e.g. a prenylation motif indicates a place where the protein may be modified after translation by the addition of another group of molecules). The PSort wrapper accesses a knowledge-based system for estimating the likely sub-cellular location where a sequence’s encoded protein will be used. The ProDomain wrapper allows access to other information about the encoded protein; a protein domain is similar to a motif but larger. As we move to new organisms, many more resources could be wrapped at this level (almost all biologists have a “favorite” here).
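The view of a motif as a regular expression over the amino-acid string can be shown directly. The sketch below uses Python's `re` module with a deliberately simplified, CaaX-style prenylation pattern (a cysteine, two aliphatic residues, then any residue at the C-terminus). The pattern is an illustrative assumption rather than the curated pattern a motif database would supply, and the protein string is made up.

```python
import re

# Simplified, illustrative prenylation-like motif (assumption, not a curated
# database pattern): C, two aliphatic residues, any residue, end of protein.
CAAX_LIKE = re.compile(r"C[AILMV]{2}[A-Z]$")


def find_motif(protein, pattern):
    """Return (start_index, matched_text) for each motif occurrence."""
    return [(m.start(), m.group()) for m in pattern.finditer(protein)]


protein = "MTEYKLVVVGAGGVGKSCVLS"  # made-up sequence for illustration
print(find_motif(protein, CAAX_LIKE))
```

Because the pattern is anchored with `$`, the same four residues occurring in the middle of a protein would not be reported, which mirrors the positional nature of this kind of motif.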

The local knowledge base management agent (LKBMA) is a slightly different member of this class because, unlike most IEAs, it actually stores data via agent messages rather than only querying external data sources. It is here that the annotations of the genetic information are materialized, and from which most queries are answered. Each LKBMA is updated with raw sequencing data indirectly from a user sequence addition interface; the data is then automatically annotated under the control of an annotation task agent. LKBMAs can be “owned” by different parties, and queried separately or together. In this way, researchers with limited computer knowledge can create sharable annotated sequence databases using the existing wrappers and other analysis tools as they are developed, without necessarily having to download and install them themselves. Using a PARKA-DB knowledge base allows efficient, modern relational data storage and query on the backend as well as limited KB inferencing [15].

Task Agents. There are two domain task agents; the rest are generic middle agents described earlier. The Annotation Agent directs exactly what information should be annotated for each sequence. It is responsible for storing the raw sequence data, making queries to the various wrapped web sites, storing those annotations, and also indicating the provenance of the data (meta-information regarding where an annotation came from). The Sequence Source Processing Agent takes almost raw sequence data in ASN.1 format as output by typical sequence estimation programs or stored in GenBank. The main function of this agent is to test this input for internal consistency.
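One way to picture annotations that carry their provenance is a record that stores the source alongside the value. The Python sketch below is only an illustration: the field names, the example gene identifier, and the annotation values are hypothetical, and the actual system stores this information in a PARKA knowledge base rather than in Python objects.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Annotation:
    gene_id: str     # identifier of the annotated sequence
    kind: str        # e.g. "homolog", "motif", "localization"
    value: str       # the annotation itself
    source: str      # provenance: which wrapped site produced it
    retrieved: date  # provenance: when it was fetched


def from_source(annotations, source):
    """Select only the annotations that came from one provider."""
    return [a for a in annotations if a.source == source]


# Hypothetical records for illustration only.
records = [
    Annotation("gene-001", "motif", "prenylation site", "SwissProt/ProSite",
               date(2001, 5, 1)),
    Annotation("gene-001", "localization", "nuclear", "PSort",
               date(2001, 5, 1)),
]
print(from_source(records, "PSort"))
```

Recording the source and retrieval date per annotation is what makes it possible to re-run a single provider later and see whether its answer has changed.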

Interface Agents. There are two interface applets that communicate via the proxy agent with other agents in the system. One is oriented towards adding new sequences to a local knowledge base (secured by a password) and the other allows anyone to query the complete annotated KB (or even multiple KBs). The interface hardly scratches the surface of the queries that are actually possible, but a big problem is that most biologists are not comfortable with complex query languages. Indeed, the simple interface that allows simple conjunctive and disjunctive queries over dynamic menus of annotations (constructed by the applet at runtime from the actual local KB) is quite advanced as compared to most of the existing public sites that allow textual keyword searches only.

3.2 Functional Annotation

This subsystem is responsible for assisting the biologist in the difficult problem of making functional annotations of each gene. Unfortunately, many of the millions of genes sequenced so far have fairly haphazard (from a computer scientist’s perspective) functional annotation: simply free natural language descriptions. Recently, a fairly large group representing at least some of the primary organism databases has created a consortium dedicated to creating a gene ontology for annotating gene function in three basic areas: the biological process in which a gene plays a part, the molecular function of the gene product, and the cellular localization [22]. The subsystem described here supports the use of this ontology by biologists as sequences are added to the system, eventually leading to even more powerful analysis of the resulting KBs.

Information Extraction Agents. Besides the gene sequence LKBMA and the GenBank IEA, we are wrapping three new organism-specific gene sequence DBs, for Drosophila (fruit fly), Mus (mouse), and Saccharomyces cerevisiae (yeast). Each of these organism databases is part of the Gene Ontology (GO) consortium, and has spent considerable time in making the proper functional annotation. Each of these agents, then, finds GO-annotated, close homologs of the unannotated gene and proposes the annotation of the homologs for the annotation of the new gene.

Task Agents. There are two new task agents. One is a domain-independent ontology agent using the FIPA ontology agent specification as a starting point. The ontology agent contains both the GO ontologies and several mappings from other symbologies (e.g. SwissProt terms) to GO terms. In fact, the Mouse IEA uses the Ontology agent to map some non-GO terms for certain records to GO terms. Although not indicated on the figure, some of the other organism DB IEA agents must map from GO ontology descriptive strings to the actual unique GO ID. The other service provided by the ontology agent (and not explicitly mentioned in the experimental FIPA Ontology Agent specification) is for the ontology reasoning agent to ask how two terms are related in an ontology. The Ontology Reasoning Agent uses this query to build a minimum spanning tree (in each of the three GO ontologies) between all the terms returned in all the homologies from all of the GO organism databases. This information can then be used to propose a likely annotation, and to display all of the information graphically for the biologist via the interface agent.
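The “how are two terms related” query can be sketched on a toy ontology. The Python example below finds the deepest term shared by the ancestries of several terms, which is a simplification of the spanning-tree construction described above rather than a reimplementation of it. The mini-ontology and the function names are hypothetical; real GO terms carry unique IDs and may have several parents.

```python
# Hypothetical mini-ontology: each term maps to its list of parent terms.
PARENTS = {
    "process": [],
    "metabolism": ["process"],
    "signaling": ["process"],
    "lipid metabolism": ["metabolism"],
    "protein metabolism": ["metabolism"],
}


def ancestors(term, parents):
    """Return the term itself plus everything reachable via parent links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def depth(term, parents):
    """Length of the longest parent chain from the term up to a root."""
    above = parents[term]
    return 0 if not above else 1 + max(depth(p, parents) for p in above)


def deepest_common_ancestor(terms, parents):
    """Most specific term that subsumes every term in the list."""
    common = set.intersection(*(ancestors(t, parents) for t in terms))
    # Ties on depth are broken alphabetically so the result is deterministic.
    return max(common, key=lambda t: (depth(t, parents), t))


print(deepest_common_ancestor(["lipid metabolism", "protein metabolism"], PARENTS))
```

When homologs from different organism databases return different but related terms, a shared ancestor of this kind is one plausible candidate for a cautious annotation of the new gene.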

Interface Agents. The functional interface agent/applet consists of two columnar panes: on the left, the top pane displays the gene being annotated, and the bottom displays the general homologies from GenBank with their natural language annotations. On the right, three panes display the subtrees from the three GO ontologies (biological process, molecular function, cellular location) marked in color with the homologs from the three organism databases.

3.3 EST Processing

One way to broaden the applicability of the system is to accept more kinds of basic input data to the annotation process. For example, we could broaden the reach of the system by starting with ESTs (Expressed Sequence Tags) instead of complete sequences. Agents could wrap the standard software for creating sequences from this data, at which point the existing system can be used. The use of ESTs is part of a relatively inexpensive approach to sequencing where instead of directly sequencing genomic DNA, we instead use a method that produces many short sequences that partially overlap. By finding the overlaps in the short sequences, we can eventually reconstruct the entire sequence of each expressed gene. Essentially, this is a “shotgun” approach that relies on statistics and the sheer number of experiments to eventually produce complete sequences.
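The idea of reconstructing a longer sequence from partially overlapping short ones can be illustrated with a toy greedy merge. This Python sketch assumes error-free reads and exact suffix/prefix overlaps; production assemblers use alignment and base-quality information, so this is an illustration of the principle only. The reads and function names are made up.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest exact overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # remaining reads do not overlap; leave them separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Reads that never overlap anything stay in the result as separate fragments, which corresponds to the early-project situation where many ESTs do not yet form contiguous sequences.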

As a side effect of this processing, information is produced that can be used to find Single Nucleotide Polymorphisms (SNPs). SNPs indicate a change of one nucleotide (A, T, C, G) in a single gene between different individuals (often, conserved across strains or subspecies). These markers are very important for identification even if they do not have functional effects.

Information Extraction Agents. The process of consensus sequence building and SNP identification does not require any external information, so the only IEAs are the LKBMAs. Up until now, there has only been one LKBMA, responsible for the gene sequences and annotations. EST processing adds a second LKBMA responsible for storing the ESTs themselves and the associated information discussed below. Primarily, this is because (especially early on in a sequencing project) there will be thousands of ESTs that do not overlap to form contiguous sequences, and because ESTs may be added and processed almost daily.

Task Agents. There are three new domain-level task agents. The first deals with processing chromatographs. Essentially the chromatograph is a set of signals that indicate the relative strengths of the wavelengths associated with each luminous nucleotide tag. Several standard Unix analysis programs exist to process this data, essentially “calling” the best nucleotide for each position. The chromatograph processing agent wraps three analysis programs: Phred, which “calls” the chromatograph and also separately produces an uncertainty score for each nucleotide in the sequence; phd2fasta, which converts this output into a standard (FASTA) format; and x-match, which removes a part of the sequence that is a byproduct of the sequencing method, and not actually part of the organism sequence. The consensus sequence assembly agent uses two more programs (Phrap and consed) on all the ESTs found so far to produce a set of candidate genes by appropriately splicing together the short EST sequences. These candidate genes can then be added to the gene sequence LKBMA, after which the various annotation processes described earlier may commence. Finally, a SNP-finder agent operates the PolyBayes program, which uses the EST and Sequence KBs and the uncertainty scores produced by Phred to nominate possible single nucleotide polymorphisms.
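The role of the per-nucleotide uncertainty scores in SNP nomination can be illustrated with a small Python sketch. It relies on the standard Phred convention that a quality score Q corresponds to an error probability of 10^(-Q/10). The column scan below is a plain threshold filter, a much simpler stand-in for the Bayesian analysis that PolyBayes performs; the reads, quality values, threshold, and function names are all illustrative assumptions.

```python
def error_probability(q):
    """Phred convention: quality Q means error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def candidate_snp_columns(reads, quals, min_q=20):
    """Columns where two or more distinct bases are each called confidently.

    reads: equal-length aligned strings; quals: per-base Phred scores.
    Low-quality disagreements are ignored as probable sequencing errors.
    """
    columns = []
    for pos in range(len(reads[0])):
        confident = {r[pos] for r, q in zip(reads, quals) if q[pos] >= min_q}
        if len(confident) > 1:
            columns.append(pos)
    return columns


reads = ["ACGT", "ACTT", "ACGT"]            # made-up aligned ESTs
quals = [[30] * 4, [30, 30, 35, 30], [30] * 4]
print(candidate_snp_columns(reads, quals))  # disagreement at a well-called base
```

This is also why FASTA-only input is insufficient for SNP finding: without quality scores, a disagreement between reads cannot be distinguished from a miscalled base.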

Interface Agents. There is only one simple interface agent, to allow participants to enter data in the system. Preferably, this is chromatograph data from the sequencers, because the original chromatograph allows Phred to calculate the uncertainty associated with each nucleotide call. However, FASTA-format (simple “ATCG...” named strings) ESTs called from the original chromatographs can be accommodated. These can be used to build consensus sequences, but not for finding SNPs.
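For readers unfamiliar with the format, FASTA input consists of a header line beginning with `>` followed by one or more lines of sequence. A minimal Python reader might look like the sketch below; the function name and the example records are hypothetical, and the sketch ignores the validation a real entry interface would need.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of {header: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()  # header without the leading '>'
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may span several lines
    return {header: "".join(parts) for header, parts in records.items()}


example = ">est1 chicken\nATCG\nGGA\n>est2\nTTAA\n"  # made-up ESTs
print(parse_fasta(example))
```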

4 Gene Expression Processing

A new kind of genomic data is now being produced that may swamp even the amount of sequencing data. This is so-called gene expression data, which indicates quantitatively how much a gene product is expressed in some location, under some conditions, at some point in time. We are developing a multi-agent system that uses available online genomic and metabolic pathway knowledge to extend gene expression analysis. By incorporating known relationships between genes, knowledge-based analysis of experimental expression data is significantly improved over purely statistical methods. Although this system has not yet been integrated into the existing agent community, eventually relevant genomic information will be made available to the system through the existing GenBank and SwissProt IEAs. Metabolic pathways of interest to the investigator are identified through a KEGG (Kyoto Encyclopedia of Genes and Genomes) database wrapper. Analysis of the gene expression data is performed through an agent that executes SAS, a statistical package that includes clustering and PCA analysis methods. Results are to be presented to the user through web pages hyperlinked to relevant database entries.

Current techniques for gene expression analysis have primarily focused on the use of clustering algorithms, which group genes of similar expression patterns together [12]. However, experimental gene expression data can be very noisy and the complicated pathways within organisms can generate coincidental expression patterns, which can significantly limit the benefits of standard cluster analysis. In order to separate gene co-regulation patterns from co-expression, the gene expression processing organization was developed to gather available pathway-level information in order to presort the expression data into functional categories. Thus, clustering of the reduced data set is much more likely to find genes that are actually regulated together. The system also promises to be useful in discovering regulatory connections between different pathways. One advantage of using the KEGG database is that its gene/enzyme entries are organized by the EC (Enzyme Commission) ontology, and so are easily mapped to gene names specific to the organism of interest.
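The presorting step can be sketched as grouping expression profiles by pathway before any similarity measure is applied. In the Python sketch below, the gene names, pathway names, and expression values are invented, and Pearson correlation stands in for whatever clustering method the statistical package would actually apply.

```python
from collections import defaultdict
from math import sqrt


def presort_by_pathway(expression, gene_to_pathways):
    """Group expression profiles by the pathways each gene belongs to."""
    groups = defaultdict(dict)
    for gene, profile in expression.items():
        for pathway in gene_to_pathways.get(gene, []):
            groups[pathway][gene] = profile
    return dict(groups)


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Invented data: geneA and geneC correlate perfectly but share no pathway.
expression = {"geneA": [1, 2, 3], "geneB": [2, 4, 6], "geneC": [1, 2, 3]}
pathways = {"geneA": ["glycolysis"], "geneB": ["glycolysis"],
            "geneC": ["lipid synthesis"]}

groups = presort_by_pathway(expression, pathways)
print(sorted(groups["glycolysis"]))
```

Comparing profiles only within a group means a coincidental match such as geneA and geneC is never scored, which is the intended effect of presorting into functional categories.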

5 Related Work

There has been significant work on general algorithms for query planning, selective materialization, and the optimization of these from the AI perspective, for example TSIMMIS [4], Infosleuth [19], SIMS [1], etc., and of course on applying agents as the way to embody these algorithms [16, 21, 10].

In biology, compared to the work being done to create the raw data, the work on how to organize and retrieve it is relatively small. Most of the work in computer science directed to biological data has been in the area of heterogeneous databases, focusing on the semi-structured nature of much of the data that makes it very difficult to store usefully in commercial relational databases [5]. Some work has begun in applying the work on wrappers and mediators to biological databases, for example TAMBIS [20]. These systems differ from ours in that they are pure implementations of wrapper/mediator technology that are centralized and do not allow for dynamic changes in sources, support persistent queries, or consider secondary user utility in the form of time or other resource limitations.

Agent technology has been making some inroads in the area. The word “agent” with the popular connotation of a single computer program to do a user’s bidding is found in the promotional material for Doubletwist (www.doubletwist.com). Here, an “agent” stands for a persistent query (e.g. “tell me if a new homolog is found in your database for the following sequence”). There is no collaboration or communication between agents.

We know of a few truly multi-agent projects in this domain. First, InfoSleuth has been used to annotate livestock genetic samples [11]. The flow of information is very similar to our system. However, the system is not set up for noticing changes in the public databases, for integrating new data sources on the fly, or for consideration of secondary user utility. Second, the EDITtoTrEMBL system [17] is another automated annotation system, based on the wrapper and mediator concept, for annotating proteins awaiting manual annotation and entry to SwissProt. Dispatcher agents control the application of potentially complex sequences of wrappers. Most importantly, this system supports the detection and possible revision of inconsistencies revealed between different annotations. Third, the GeneWeaver project [3] is another true multi-agent system for annotation of genomes. GeneWeaver has as a primary design criterion the observation that the source data is always changing, and so annotations need to be constantly updated. They also express the idea that new sources or analysis tools should be easy to integrate into the system, which plays to the open systems requirement, although they do not describe details. The primary differences are the way in which an open system is achieved (it is not clear that they use agent-level matchmaking, but rather possibly CORBA specifications) and that GeneWeaver is not based on a shared architecture that supports reasoning about secondary user utility. In comparison to the DECAF implementation, GeneWeaver uses CORBA/RMI rather than TCP/IP communication, and a simplified KQML-like language called BAL.

6 Discussion

The system described here is operational and normally available on the web at http://udgenome.ags.udel.edu/herpes/. This is a working prototype, and so the interface is strongly oriented to biologists only. In general, computational support for the processes that biologists use in analyzing data is primitive (Perl scripts) or non-existent. In less than 10 minutes, we were able to annotate the HVT-1 sequence, as well as store it in a queryable and web-publishable form. This impressed the biologists we work with, compared to manual annotation and flat ASCII files. Furthermore, we have recently added approximately 15 other publicly available herpesvirus sequences (e.g. several strains of Human herpesvirus, African swine fever virus, etc.). The resulting knowledge base almost immediately resulted in queries by our local biologists that indicated possible interesting relationships that may result in future biological work. This summer we will begin testing with viral biologists from other universities.

Other things about the system which have excited our biologist co-workers are the relative ease by which we can add new types of annotation or analysis information, and the fact that the system can be used to build similar systems for other organisms, such as the chicken. For example, the use of open system concepts such as a matchmaker allows the annotation agent to access and use new annotation services that were not available when it was initially written. Secondary user utility will become useful for the biologist when faced with making a simple office query vs. checking results before publication.

The underlying DECAF system has been evaluated in several ways, especially with respect to the use of parallel computational resources by a single agent (all of the DECAF components and all of the executable actions are run in parallel threads), and the efficacy of the DRU scheduler which efficiently solves a restricted subset of the design-to-criteria scheduling problem [14]. Running the gene annotation system as a truly multi-agent system results in true speedups, although most of the time is currently spent in remote database access. Parallel hardware for each agent will be useful for some of the more locally computationally intensive tasks involving EST processing.

7 Conclusions and Future Work

In this paper we have discussed the very real problem of making some use of the tremendous amounts of genetic sequence information that are being produced. While there is much information publicly available over the web, accessing such information is different for each source and the results can only be used by a single researcher. Furthermore, the contents of these primary sources are changing all the time, and new sources and techniques for analysis are constantly being developed.

We cast this sequence annotation problem as a general information gathering problem, and proposed the use of multi-agent systems for implementation. Beyond the basic heterogeneous database problem that this problem represents, an MAS solution gives us mechanisms for dealing with changing data, the dynamic appearance of new sources, minding secondary utility characteristics for users, and of course the obvious distributed processing achievements of parallel development, concurrent processing, and the possibility for handling certain security or other organizational concerns (where part of the agent organization can mirror the human organization).

We currently are offering the system publicly on the web, with the known herpesvirus sequences. A second system based on chicken ESTs should be available by the end of 2001. We intend to broaden the annotation coverage and add more complex analyses. An example would be the estimation of the physical location of the gene as well as its function. Because biologists have long recorded certain QTLs (Quantitative Trait Loci) that indicate that a certain physical region is responsible for a trait (such as chickens with resistance to a certain disease), being able to see what genes are physically located in the QTL region is a strong indicator as to their high-level genetic function.

In general, we have not yet designed an interface that allows biologists to take full advantage of the materialized data; they are uncomfortable with complex query languages. We believe that it may be possible to build a graphical interface to allow a biologist, after some training, to create a commonly needed analysis query and to then save this for use in the future by that scientist, or others sharing the agent namespace.

Finally, the next major subsystem will be agents to link and analyze gene expression data (which will in turn interoperate with the metabolic pathway analysis systems described above). This data needs to be linked with sequence and function data, to allow more powerful analysis. For example, linked to QTL data, this allows us to ask questions such as “what chemicals might prevent club root disease in cabbage?”.

References

1. Y. Arens and C. A. Knoblock. Intelligent caching: Selecting, representing, and reusing data in an information server. In Proc. 3rd Intl. Conf. on Info. and Know. Mgmt., 1994.

2. D. A. Benson et al. Genbank. Nucleic Acids Res., 28:15–18, 2000.

3. K. Bryson, M. Luck, M. Joy, and D. T. Jones. Applying agents to bioinformatics in GeneWeaver. In Proc. 4th Int. Wksp. Collab. Info. Agents, 2000.

4. S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. The TSIMMIS project: integration of heterogeneous information sources. In Proc. 10th Mtg. Info. Proc. Soc. Japan, Dec. 1994.

5. S. B. Davidson et al. Biokleisli: a digital library for biomedical researchers. Intnl. J. on Digital Libraries, 1(1):36–53, 1997.

6. K. Decker, X. Zheng, and C. Schmidt. A multi-agent system for automated genomic annotation. In Proceedings of the 5th Intl. Conf. on Autonomous Agents, Montreal, 2001.

7. K. S. Decker, A. Pannu, K. Sycara, and M. Williamson. Designing behaviors for information agents. In Proc. 1st Intl. Conf. on Autonomous Agents, pages 404–413, 1997.

8. K. S. Decker, K. Sycara, and M. Williamson. Middle-agents for the internet. In Proc. 15th IJCAI, pages 578–583, 1997.

9. K. S. Decker and V. R. Lesser. Quantitative modeling of complex computational task environments. In Proc. 11th AAAI, pages 217–224, 1993.

10. K. S. Decker and K. Sycara. Intelligent adaptive information agents. Journal of Intelligent Information Systems, 9(3):239–260, 1997.

11. L. Deschaine, R. Brice, and M. Nodine. Use of InfoSleuth to coordinate information acquisition, tracking, and analysis in complex applications. MCC-INSL–008-00, 2000.

12. M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. Nat. Acad. Sci.

13. J. Graham and K. S. Decker. Towards a distributed, environment-centered agent framework. In Intelligent Agents VI, LNAI-1757, pages 290–304. Springer Verlag, 2000.

14. J. Graham. Real-time Scheduling in Multi-agent Systems. PhD thesis, University of Delaware, 2001.

15. J. Hendler, M. Taylor, and K. Stoffel. Advances in high performance knowledge representation. Technical Report CS-TR-3672, University of Maryland Institute for Advanced Computer Studies, 1996. Also cross-referenced as UMIACS-TR-96-56.

16. C. A. Knoblock, Y. Arens, and C. Hsu. Cooperating agents for information retrieval. In Proc. 2nd Intl. Conf. on Cooperative Information Systems. Univ. of Toronto Press, 1994.

17. S. Möller and M. Schroeder. Consistent integration of non-reliable heterogeneous information applied to the annotation of transmembrane proteins. Journal of Computing and Chemistry, to appear, 2001.

18. I. Muslea, S. Minton, and C. Knoblock. Stalker: Learning extraction rules for semistructured web-based information sources. In Papers from the 1998 Workshop on AI and Information Gathering, 1998. Also Technical Report ws-98-14, University of Southern California.

19. M. Nodine and A. Unruh. Facilitating open communication in agent systems: the InfoSleuth infrastructure. In Intelligent Agents IV, pages 281–295. Springer-Verlag, 1998.

20. R. Stevens et al. Tambis: Transparent access to multiple bioinformatics information sources. Bioinformatics, 16(2):184–185, 2000.

21. K. Sycara, K. S. Decker, A. Pannu, M. Williamson, and D. Zeng. Distributed intelligent agents. IEEE Expert, 11(6):36–46, December 1996.

22. The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nature Genetics, 25(1):25–29, May 2000.

23. T. Wagner, A. Garvey, and V. Lesser. Complex goal criteria and its application in design-to-criteria scheduling. In Proc. 14th AAAI, 1997.
