
Extending a multi-agent system for genomic annotation


Keith Decker, Salim Khan, Carl Schmidt, and Dennis Michaud

Computer and Information Sciences Department, University of Delaware, Newark, DE 19716

decker@cis.udel.edu

Abstract. The explosive growth in genomic (and soon, expression and proteomic) data, exemplified by the Human Genome Project, is a fertile domain for the application of multi-agent information gathering technologies. Furthermore, hundreds of smaller-profile, yet still economically important organisms are being studied that require the efficient and inexpensive automated analysis tools that multi-agent approaches can provide. In this paper we give a progress report on the use of the DECAF multi-agent toolkit to build reusable information gathering systems for bioinformatics. We will briefly summarize why bioinformatics is a classic application for information gathering, how DECAF supports it, and recent extensions underway to support new analysis paths for genomic information.

1 Introduction

Massive amounts of raw data are currently being generated by biologists while sequencing organisms. Most of this raw data must be analyzed through the piecemeal application of various computer programs and hand-searches of various public web databases. Typically both the raw data and any valuable derived knowledge will remain generally unavailable except in published natural language texts such as journal articles. However, it is important to note that a tremendous amount of genetic material is similar from organism to organism, even when they are as outwardly different as a yeast, fruit fly, mouse, or human being. This means that if a biologist studying the yeast can figure out what a certain gene does (its function), other biologists can at least guess that similar genes in other organisms play similar roles. Thus huge databases are being populated with sequence data and functional annotations [2]. All new sequences are routinely compared to known sequences for clues as to their functions.

A large amount of work in bioinformatics over the past ten years has gone into developing algorithms (pattern matching, statistical, and/or heuristic/knowledge-based) to support the work of hypothesizing gene function. Many of these are available to biologists in various implementations, and now many are available over the web. Meta-sites combine many published algorithms, and sites specialize in information about particular topics such as protein motifs.

From a computer science perspective, several problems have arisen, as we have described elsewhere [6]. To summarize, what we have is a large set of heterogeneous and dynamically changing databases, all of which have information to bring to bear on the biological problem of determining genomic function. We have biologists producing thousands of possible genes, for which functions must be hypothesized. For all but the largest and best-funded sequencing projects, this must be done by hand by a single researcher and their students.

Multi-agent information gathering systems have a lot to contribute to these efforts. Several features make a multi-agent approach to this problem particularly attractive: information is available from many distinct locations; information content is heterogeneous; information content is constantly changing; much of the annotation work for each gene can be done independently; biologists wish both to make their findings widely available and to retain control over the data; new types of analysis and sources of data are appearing constantly.

We have used DECAF, a multi-agent system toolkit based on RETSINA [21, 10, 7] and TAEMS [9, 23], to construct a prototype multi-agent system for automated annotation and database storage of sequencing data for herpesviruses [6]. The resulting system eliminates tedious and always out-of-date hand analyses, makes the data and annotations available for other researchers (or agent systems), and provides a level of query processing beyond even some high-profile websites.

Since that initial system, we have used the distributed, open nature of our multi-agent solution to expand the system in several ways that will make it useful for biologists studying more organisms, and in different ways. This paper will briefly describe our approach to information gathering, based on our work on RETSINA; the DECAF toolkit; our initial annotation system; and our new extensions for functional annotation, EST processing, and metabolic pathway reasoning.

2 DECAF

DECAF (Distributed, Environment-Centered Agent Framework) is a Java-based toolkit for creating multi-agent systems [13]. In particular, several tools have been developed specifically for prototyping information gathering systems. Also, the internal architecture of each DECAF agent has been designed much like an operating system: as a set of services for the “intelligent” (resource-efficient, adaptively-scheduled, soft real-time, objective-persistent) execution of agent actions. DECAF consists of a set of well-defined control modules (initialization, dispatching, planning, scheduling, and execution, each in a separate, concurrent thread) that work in concert to control an agent’s life cycle. There is one core task structure representation that is shared between all of the control modules. This has meant that even non-reusable domain-dependent agents can be developed more quickly than by the API approach, where the programmer has to, in effect, create and orchestrate the agent’s architecture as well as its domain-oriented agent actions. This section will first discuss the internal architecture of a generic DECAF agent, and then discuss the tools (such as middle agents, system debugging aids, and the information extraction agent shell) we have built to implement multi-agent information gathering systems. The overall internal architecture of DECAF is shown in Figure 1. These modules run concurrently, each in their own thread. Details of the DECAF implementation can be found elsewhere [13].

[Figure 1 diagram. Labeled components: Plan File; Incoming KQML Messages; Incoming Message Queue; Objectives Queue; Task Queue; Agenda Queue; Pending Action Queue; Action Results Queue; Task Templates Hashtable; Domain Facts and Beliefs; Action Modules; Outgoing KQML Messages; and the Agent Initialization, Dispatcher, Planner, Scheduler, and Executor modules, grouped as DECAF Task and Control Structures.]

Fig. 1. DECAF Architecture Overview

2.1 DECAF Support for Info Gathering

DECAF provides core internal architectural support for secondary user utility. Thus DECAF plans can include alternatives, and these alternatives can be chosen dynamically at runtime depending on user constraints on answer timeliness or other resource constraints. DECAF also supports building information gathering systems by providing useful middle agents and a shell for quickly building information extraction agents for wrapping websites. The Agent Name Server (ANS) (“white pages”) is an essential component for agent communication. It works in a fashion similar to DNS (Domain Name System) by resolving agent names to host and port addresses. The Matchmaker serves as a “yellow pages” to assist agents in finding services needed for task completion. The Broker agent acts as a kind of “middle manager” to assist an agent with collections of services. The broker can thus provide a larger service than any single provider can, and can often manage a large group of agents more effectively [8]. A Proxy agent allows web page Java applets to communicate with DECAF agents that are not located on the same server as the applet. The Agent Management Agent (AMA) gives MAS designers a look at the entire running set of agents spread out across the Internet that share a single agent name server. This allows designers to query the status of individual agents and watch or record message passing traffic.
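The division of labor between the white pages and the yellow pages can be illustrated with a small sketch. This is not DECAF's Java API; the class names, the service name, and the host and port below are hypothetical and serve only to show the two lookups a requesting agent performs.

```python
from dataclasses import dataclass, field


@dataclass
class AgentNameServer:
    """White pages: maps an agent name to a (host, port) address."""
    addresses: dict = field(default_factory=dict)

    def register(self, name, host, port):
        self.addresses[name] = (host, port)

    def resolve(self, name):
        # Returns None for an unknown agent, much like a failed name lookup.
        return self.addresses.get(name)


@dataclass
class Matchmaker:
    """Yellow pages: maps a service description to the agents offering it."""
    providers: dict = field(default_factory=dict)

    def advertise(self, service, agent_name):
        self.providers.setdefault(service, []).append(agent_name)

    def find(self, service):
        return list(self.providers.get(service, []))


ans = AgentNameServer()
mm = Matchmaker()
ans.register("GenBankIEA", "localhost", 9001)  # hypothetical host and port
mm.advertise("blast-homologs", "GenBankIEA")   # hypothetical service name

# A requester first asks the matchmaker who offers a service,
# then asks the name server where each of those agents can be reached.
for agent in mm.find("blast-homologs"):
    print(agent, ans.resolve(agent))
```

Keeping the two directories separate means a new provider only has to register and advertise itself; requesters written earlier can find it without being changed.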

Information Extraction Agent Shell. The main functions of an information extraction agent (IEA) are [7]: fulfilling requests from external sources in response to a one-shot query (e.g. “What is the price of IBM?”); monitoring external sources for periodic information (e.g. “Give me the price of IBM every 30 minutes.”); and monitoring sources for patterns, called information monitoring requests (e.g. “Notify me if the price of IBM goes below $50.”). These functions can be written in a general way so that the code can be shared for agents in any domain.
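The three request types can be written once and reused across domains, as the following minimal sketch suggests. It is an illustration only, not the DECAF IEA shell: the class, its method names, and the `tick` polling hook are assumptions, and the price table stands in for a wrapped external source.

```python
class InfoExtractionAgent:
    """Domain-independent handling of the three IEA request types."""

    def __init__(self, fetch):
        self.fetch = fetch   # callable: key -> current value at the source
        self.periodic = []   # (key, callback) pairs
        self.monitors = []   # (key, predicate, callback) triples

    def one_shot(self, key):
        # Answer a single query immediately.
        return self.fetch(key)

    def subscribe_periodic(self, key, callback):
        self.periodic.append((key, callback))

    def subscribe_monitor(self, key, predicate, callback):
        self.monitors.append((key, predicate, callback))

    def tick(self):
        """Called by some scheduler once per polling interval."""
        for key, callback in self.periodic:
            callback(key, self.fetch(key))
        for key, predicate, callback in self.monitors:
            value = self.fetch(key)
            if predicate(value):
                callback(key, value)


prices = {"IBM": 55.0}  # stand-in for an external source
agent = InfoExtractionAgent(prices.get)
log = []
agent.subscribe_periodic("IBM", lambda k, v: log.append(("periodic", k, v)))
agent.subscribe_monitor("IBM", lambda v: v < 50,
                        lambda k, v: log.append(("alert", k, v)))

agent.tick()          # price is 55.0: only the periodic report fires
prices["IBM"] = 48.0
agent.tick()          # price is 48.0: periodic report and alert both fire
print(agent.one_shot("IBM"), log)
```

Only the `fetch` callable is domain specific; swapping in a gene-sequence lookup would leave the request handling untouched.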

Since our IEA operates on the Web, the information gathered is from external information sources. The agent uses a set of wrappers and the wrapper induction algorithm STALKER [18] to extract relevant information from the web pages after being shown several marked-up examples. When the information is gathered it is stored in the local IEA “infobase” using Java wrappers on a PARKA [15] knowledge base. This makes new IEAs fairly easy to create, and forces the difficult parts of this problem back onto KB ontology creation, rather than the production of tools to wrap web pages and dynamically answer queries. Currently, there are some proposals for XML-based page annotations which, if adopted, will make site wrapping easier syntactically (although they still do not solve the ontology problem; but see projects such as OIL).

3 A DECAF Multi-Agent System for Genomic Analysis

These tools can be put to use to create a prototype multi-agent system for various types of genomic analysis. In the prototype, we have chosen to simplify the query subsystem by materializing all annotations locally, thus removing the need for sophisticated query planning (e.g. [16]). This is a reasonable simplification since most of our work is with viruses that have fairly small genomes (around 100 genes for a herpesvirus and around 30 herpesviruses) or with larger organisms (e.g. chickens) for which we are constructing a consensus database explicitly.

Figure 2 shows an overview of the system as four overlapping multi-agent organizations. The first, Basic Sequence Annotation, is charged with integrating remote gene sequence annotations from various sources with the gene sequences at the Local Knowledge Base Management Agent (LKBMA). The second, Query, allows complex queries on the LKBMAs via a web interface. The third, Functional Annotation, is responsible for collecting information needed to make an informed guess as to the function of a gene, specifically using the three-part Gene Ontology [22]. The fourth organization, EST Processing, enables the analysis of expressed sequence tags (ESTs) to produce gene sequences that can be annotated by the other organizations.

An important feature to note is that we are focusing on annotation and analysis services that are not organism specific. In this way, the resulting system can be used to build and query knowledge bases from several different organisms. The original subsystems (basic annotation and the simple query system) were built to annotate the newly sequenced Herpesvirus of Turkey (the bird), and then to compare it to the other known sequenced herpesviruses. Work is just beginning to build a new knowledge base from chicken ESTs, and to extend the depth of the herpesvirus KB for Epstein-Barr Virus (human herpesvirus 4), which has clinical significance for pediatric organ transplant patients.

[Figure 2 diagram. Labeled components: Sequence Addition Applet; Functional Annotation Applet; User Query Applet; EST Entry [Chromatograph/FASTA]; Proxy Agents; Ontology Agent; Ontology Reasoning Agent; Query Processing Agent; Sequence Source Processing Agent; Annotation Agent; ProDomain IEA; SwissProt/ProSite IEA; PSort IEA; GenBank Info Extraction Agent; Sequence LKBMA; EST LKBMA; Chromatograph Processing; Consensus Sequence; SNP-Finder; Flybase IEA; Mouse Genome DB IEA; SGD (yeast) IEA. Organizations: Basic Sequence Annotation; Functional Annotation; Query; EST Processing.]

Fig. 2. Overview of DECAF Multi-Agent System for Genomic Analysis

3.1 Basic Sequence Annotation and Query Processing

Figure 3 shows the interaction details for the basic sequence annotation and query subsystems. We will describe the agents by their RETSINA classification.

[Figure 3 diagram. Interface Agents: Sequence Addition Applet; User Query Applet. Domain-Independent Task Agents: Proxy Agent; Matchmaker Agent; Agent Name Server Agent. Task Agents: Annotation Agent; Query Processing Agent; Sequence Source Processing Agent. Information Extraction Agents: Local Knowledgebase Management Agents; GenBank Info Extraction Agent; ProDomain Info Extraction Agent; SwissProt/ProSite Info Extraction Agent; PSort Info Extraction Agent.]

Fig. 3. Basic Annotation and Query Agent Organizations

Information Extraction Agents. Currently 4 agents based on the IEA shell wrap public websites. The GenBank wrapper primarily supplies “BLAST” services: given the sequence of a herpesvirus gene, what are the most similar genes known in the world (called “homologs”)? The answer here can give the biologist a clue as to the possible function of a gene, and for any gene whose function the biologist does not know, a change in the answer to this query might be significant. The SwissProt wrapper primarily provides protein motif pattern searches. If we view a protein as a one-dimensional string of amino acids, then a motif is a regular expression matching part of the string that may indicate a particular kind of function for the protein (e.g. a prenylation motif indicates a place where the protein may be modified after translation by the addition of another group of molecules). The PSort wrapper accesses a knowledge-based system for estimating the likely sub-cellular location where a sequence’s encoded protein will be used. The ProDomain wrapper allows access to other information about the encoded protein; a protein domain is similar to a motif but larger. As we move to new organisms, many more resources could be wrapped at this level (almost all biologists have a “favorite” here).
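Viewing a motif as a regular expression over the amino-acid string can be made concrete with a short sketch. The pattern below is a simplified, illustrative CaaX-style prenylation pattern (a cysteine, two aliphatic residues, then any residue at the C-terminus), not the curated ProSite entry, and the protein sequence is made up.

```python
import re

# Simplified, illustrative CaaX-style pattern; an assumption for this sketch,
# not the pattern used by ProSite or by the SwissProt wrapper.
CAAX = re.compile(r"C[AVLIM]{2}[A-Z]$")


def find_motif(protein, pattern=CAAX):
    """Return (start_index, matched_text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in pattern.finditer(protein)]


protein = "MTEYKLVVVGAGGVGKSALTIQCVLS"  # made-up sequence ending in CVLS
print(find_motif(protein))
```

A wrapper agent would run many such patterns against each new sequence and store every hit, with its position, as an annotation.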

The local knowledge base management agent (KBMA) is a slightly different member of this class because unlike most IEAs it actually stores data via agent messages rather than only querying external data sources. It is here that the annotations of the genetic information are materialized, and from which most queries are answered. Each KBMA is updated with raw sequencing data indirectly from a user sequence addition interface that is then automatically annotated under the control of an annotation task agent. KBMAs can be “owned” by different parties, and queried separately or together. In this way, researchers with limited computer knowledge can create sharable annotated sequence databases using the existing wrappers and other analysis tools as they are developed, without necessarily having to download and install them themselves. Using a PARKA-DB knowledge base allows efficient, modern relational data storage on the back end and query as well as limited KB inferencing [15].

Task Agents. There are two domain task agents; the rest are generic middle agents described earlier. The Annotation Agent directs exactly what information should be annotated for each sequence. It is responsible for storing the raw sequence data, making queries to the various wrapped websites, storing those annotations, and also indicating the provenance of the data (meta-information regarding where an annotation came from). The Sequence Source Processing Agent takes almost raw sequence data in ASN.1 format as output by typical sequence estimation programs or stored in GenBank. The main function of this agent is to test this input for internal consistency.

Interface Agents. There are two interface applets that communicate via the proxy agent with other agents in the system. One is oriented towards adding new sequences to a local knowledge base (secured by a password) and the other allows anyone to query the complete annotated KB (or even multiple KBs). The interface hardly scratches the surface of the queries that are actually possible, but a big problem is that most biologists are not comfortable with complex query languages. Indeed, the simple interface that allows simple conjunctive and disjunctive queries over dynamic menus of annotations (constructed by the applet at runtime from the actual local KB) is quite advanced as compared to most of the existing public sites that allow textual keyword searches only.

3.2 Functional Annotation

This subsystem is responsible for assisting the biologist in the difficult problem of making functional annotations of each gene. Unfortunately, many of the millions of genes sequenced so far have fairly haphazard (from a computer scientist’s perspective) functional annotation: simply free natural language descriptions. Recently, a fairly large group representing at least some of the primary organism databases has created a consortium dedicated to creating a gene ontology for annotating gene function in three basic areas: the biological process in which a gene plays a part, the molecular function of the gene product, and the cellular localization [22]. The subsystem described here supports the use of this ontology by biologists as sequences are added to the system, eventually leading to even more powerful analysis of the resulting KBs.

Information Extraction Agents. Besides the gene sequence LKBMA and the GenBank IEA, we are wrapping three new organism-specific gene sequence DBs, for Drosophila (fruit fly), Mus (mouse), and Saccharomyces cerevisiae (yeast). Each of these organism databases is part of the Gene Ontology (GO) consortium, and has spent considerable time in making the proper functional annotation. Each of these agents, then, finds GO-annotated, close homologs of the unannotated gene and proposes the annotation of the homologs for the annotation of the new gene.

Task Agents. There are two new task agents. One is a domain-independent ontology agent using the FIPA ontology agent specification as a starting point. The ontology agent contains both the GO ontologies and several mappings from other symbologies (e.g. SwissProt terms) to GO terms. In fact, the Mouse IEA uses the Ontology agent to map some non-GO terms for certain records to GO terms. Although not indicated on the figure, some of the other organism DB IEA agents must map from GO ontology descriptive strings to the actual unique GO ID. The other service provided by the ontology agent (and not explicitly mentioned in the experimental FIPA Ontology Agent specification) is for the ontology reasoning agent to ask how two terms are related in an ontology. The Ontology Reasoning Agent uses this query to build a minimum spanning tree (in each of the three GO ontologies) between all the terms returned for all the homologs from all of the GO organism databases. This information can then be used to propose a likely annotation, and to display all of the information graphically for the biologist via the interface agent.
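The idea of connecting all the terms returned for the homologs can be sketched as follows. This is a simplification under stated assumptions, not the Ontology Reasoning Agent's algorithm: it treats the ontology as a tree in which every term has a single parent (real GO terms may have several parents), and the term names form a made-up toy hierarchy rather than actual GO structure.

```python
def path_to_root(term, parent):
    """List of terms from `term` up to the root, following parent links."""
    path = [term]
    while term in parent:
        term = parent[term]
        path.append(term)
    return path


def connecting_subtree(terms, parent):
    """Edges (child, parent) of the smallest subtree joining all `terms`."""
    paths = [path_to_root(t, parent) for t in terms]
    # Terms shared by every path are ancestors common to all input terms.
    common = set(paths[0]).intersection(*map(set, paths[1:]))
    edges = set()
    for path in paths:
        for child, par in zip(path, path[1:]):
            if child in common:
                break  # reached the shared ancestor; nothing above is needed
            edges.add((child, par))
    return edges


# Toy hierarchy (hypothetical names, single parent per term).
parent = {
    "catalysis": "function",
    "binding": "function",
    "kinase": "catalysis",
    "hydrolase": "catalysis",
    "protein_kinase": "kinase",
}

edges = connecting_subtree(["protein_kinase", "hydrolase"], parent)
print(sorted(edges))
```

The lowest term shared by all paths ("catalysis" in this toy case) is a natural candidate for a conservative proposed annotation, while the edges give the subtree to display to the biologist.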

Interface Agents. The functional interface agent/applet consists of two columnar panes: on the left, the top pane displays the gene being annotated, and the bottom displays the general homologs from GenBank with their natural language annotations. On the right, three panes display the subtrees from the three GO ontologies (biological process, molecular function, cellular location) marked in color with the homologs from the three organism databases.

3.3 EST Processing

One way to broaden the applicability of the system is to accept more kinds of basic input data to the annotation process. For example, we could broaden the reach of the system by starting with ESTs (Expressed Sequence Tags) instead of complete sequences. Agents could wrap the standard software for creating sequences from this data, at which point the existing system can be used. The use of ESTs is part of a relatively inexpensive approach to sequencing where, instead of directly sequencing genomic DNA, we use a method that produces many short sequences that partially overlap. By finding the overlaps in the short sequences, we can eventually reconstruct the entire sequence of each expressed gene. Essentially, this is a “shotgun” approach that relies on statistics and the sheer number of experiments to eventually produce complete sequences.
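Reconstructing a longer sequence from partially overlapping short reads can be illustrated with a naive greedy merge. This is only a sketch of the overlap idea; it ignores sequencing errors, quality scores, and reverse complements, all of which real assemblers must handle, and the reads and the `min_len` threshold are made up.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # the remaining reads do not overlap; keep them separate
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"])
print(contigs)
```

Reads that never overlap are returned unmerged, which mirrors the situation early in a sequencing project when many ESTs do not yet form contiguous sequences.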

As a side effect of this processing, information is produced that can be used to find Single Nucleotide Polymorphisms (SNPs). SNPs indicate a change of one nucleotide (A, T, C, G) in a single gene between different individuals (often, conserved across strains or subspecies). These markers are very important for identification even if they do not have functional effects.
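At its simplest, a candidate SNP is a column where otherwise matching, aligned sequences disagree. The sketch below shows only that comparison; it assumes the sequences are already aligned to equal length and it ignores base-call uncertainty, so it is not a substitute for a statistical SNP caller. The input strings are made up.

```python
def snp_candidates(aligned):
    """Return (position, bases) for columns where aligned sequences differ."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")
    hits = []
    for pos in range(length):
        bases = {s[pos] for s in aligned}
        if len(bases) > 1:
            hits.append((pos, "".join(sorted(bases))))
    return hits


candidates = snp_candidates(["ATCGA", "ATTGA", "ATCGA"])
print(candidates)
```

A disagreement seen in a single low-quality read is more likely a sequencing error than a polymorphism, which is why per-nucleotide uncertainty scores matter in practice.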

Information Extraction Agents. The process of consensus sequence building and SNP identification does not require any external information, so the only IEAs are the LKBMAs. Up until now, there has only been one LKBMA, responsible for the gene sequences and annotations. EST processing adds a second LKBMA responsible for storing the ESTs themselves and the associated information discussed below. Primarily, this is because (especially early on in a sequencing project) there will be thousands of ESTs that do not overlap to form contiguous sequences, and because ESTs may be added and processed almost daily.

Task Agents. There are three new domain-level task agents. The first deals with processing chromatographs. Essentially the chromatograph is a set of signals that indicate the relative strengths of the wavelengths associated with each luminous nucleotide tag. Several standard Unix analysis programs exist to process this data, essentially “calling” the best nucleotide for each position. The chromatograph processing agent wraps three analysis programs: Phred, which “calls” the chromatograph and also separately produces an uncertainty score for each nucleotide in the sequence; phd2fasta, which converts this output into a standard (FASTA) format; and x-match, which removes a part of the sequence that is a byproduct of the sequencing method, and not actually part of the organism sequence. The consensus sequence assembly agent uses two more programs (Phrap and consed) on all the ESTs found so far to produce a set of candidate genes by appropriately splicing together the short EST sequences. These candidate genes can then be added to the gene sequence LKBMA, from which the various annotation processes described earlier may commence. Finally, a SNP-finder agent operates the PolyBayes program, which uses the EST and Sequence KBs and the uncertainty scores produced by Phred to nominate possible single nucleotide polymorphisms.

Interface Agents. There is only one simple interface agent, to allow participants to enter data in the system. Preferably, this is chromatograph data from the sequencers, because the original chromatograph allows Phred to calculate the uncertainty associated with each nucleotide call. However, FASTA-format (simple “ATCG...” named strings) ESTs called from the original chromatographs can be accommodated. These can be used to build consensus sequences, but not for finding SNPs.
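FASTA input is simple enough that an entry agent could parse it directly. The following sketch assumes the common convention of a `>` header line followed by one or more sequence lines; the function name and the sample records are illustrative.

```python
def parse_fasta(text):
    """Map each record name (first word after '>') to its full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped; collect and join them later.
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


sample = ">est1 clone A\nATCG\natcg\n>est2\nGGCC\n"
ests = parse_fasta(sample)
print(ests)
```

Because the format carries only a name and called bases, nothing in such a record preserves the per-nucleotide uncertainty that the chromatograph provides.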

4 Gene Expression Processing

A new kind of genomic data is now being produced that may swamp even the amount of sequencing data. This is so-called gene expression data, which indicates quantitatively how much a gene product is expressed in some location, under some conditions, at some point in time. We are developing a multi-agent system that uses available online genomic and metabolic pathway knowledge to extend gene expression analysis. By incorporating known relationships between genes, knowledge-based analysis of experimental expression data is significantly improved over purely statistical methods. Although this system has not yet been integrated into the existing agent community, eventually relevant genomic information will be made available to the system through the existing GenBank and SwissProt IEAs. Metabolic pathways of interest to the investigator are identified through a KEGG (Kyoto Encyclopedia of Genes and Genomes) database wrapper. Analysis of the gene expression data is performed through an agent that executes SAS, a statistical package that includes clustering and PCA methods. Results are to be presented to the user through web pages hyperlinked to relevant database entries.

Current techniques for gene expression analysis have primarily focused on the use of clustering algorithms, which group genes of similar expression patterns together [12]. However, experimental gene expression data can be very noisy and the complicated pathways within organisms can generate coincidental expression patterns, which can significantly limit the benefits of standard cluster analysis. In order to separate gene co-regulation patterns from co-expression, the gene expression processing organization was developed to gather available pathway-level information in order to presort the expression data into functional categories. Thus, clustering of the reduced data set is much more likely to find genes that are actually regulated together. The system also promises to be useful in discovering regulatory connections between different pathways. One advantage of using the KEGG database is that its gene/enzyme entries are organized by the EC (Enzyme Commission) ontology, and so are easily mapped to gene names specific to the organism of interest.
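The presorting step can be pictured as grouping expression profiles by pathway membership before any clustering is run. The sketch below is an assumption-laden illustration: the gene names, pathway assignments, and expression values are made up, and the mapping stands in for what a pathway database wrapper would return.

```python
from collections import defaultdict


def presort_by_pathway(expression, gene_to_pathways):
    """Group expression profiles by pathway; unmapped genes are left out."""
    groups = defaultdict(dict)
    for gene, profile in expression.items():
        for pathway in gene_to_pathways.get(gene, []):
            groups[pathway][gene] = profile
    return dict(groups)


# Made-up expression profiles (one value per experimental condition).
expression = {
    "geneA": [1.0, 2.1, 3.9],
    "geneB": [0.9, 2.0, 4.1],
    "geneC": [5.0, 1.0, 0.2],
    "geneD": [0.1, 0.1, 0.1],
}
# Hypothetical pathway membership, as a wrapper might report it.
gene_to_pathways = {
    "geneA": ["glycolysis"],
    "geneB": ["glycolysis", "TCA cycle"],
    "geneC": ["TCA cycle"],
}

groups = presort_by_pathway(expression, gene_to_pathways)
for pathway, members in sorted(groups.items()):
    print(pathway, sorted(members))
```

Each group can then be clustered separately; a gene that belongs to two pathways appears in both groups, which is one place where connections between pathways might surface.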

5 Related Work

There has been significant work on general algorithms for query planning, selective materialization, and the optimization of these from the AI perspective, for example TSIMMIS [4], Infosleuth [19], SIMS [1], etc., and of course on applying agents as the way to embody these algorithms [16, 21, 10].

In biology, compared to the work being done to create the raw data, all the work on how to organize and retrieve it is relatively small. Most of the work in computer science directed to biological data has been in the area of heterogeneous databases, focusing on the semi-structured nature of much of the data that makes it very difficult to store usefully in commercial relational databases [5]. Some work has begun in applying the work on wrappers and mediators to biological databases, for example TAMBIS [20]. These systems differ from ours in that they are pure implementations of wrapper/mediator technology that are centralized and do not allow for dynamic changes in sources, support persistent queries, or consider secondary user utility in the form of time or other resource limitations.

Agent technology has been making some inroads in the area. The word “agent”, with the popular connotation of a single computer program to do a user’s bidding, is found in the promotional material for Doubletwist (www.doubletwist.com). Here, an “agent” stands for a persistent query (e.g. “tell me if a new homolog is found in your database for the following sequence”). There is no collaboration or communication between agents.

We know of a few truly multi-agent projects in this domain. First, InfoSleuth has been used to annotate livestock genetic samples [11]. The flow of information is very similar to our system. However, the system is not set up for noticing changes in the public databases, for integrating new data sources on the fly, or for consideration of secondary user utility. Second, the EDITtoTrEMBL system [17] is another automated annotation system, based on the wrapper and mediator concept, for annotating proteins awaiting manual annotation and entry to SwissProt. Dispatcher agents control the application of potentially complex sequences of wrappers. Most importantly, this system supports the detection and possible revision of inconsistencies revealed between different annotations. Third, the GeneWeaver project [3] is another true multi-agent system for annotation of genomes. GeneWeaver has as a primary design criterion the observation that the source data is always changing, and so annotations need to be constantly updated. They also express the idea that new sources or analysis tools should be easy to integrate into the system, which plays to the open systems requirement, although they do not describe details. The primary differences are the way in which an open system is achieved (it is not clear that they use agent-level matchmaking, but rather possibly CORBA specifications) and that GeneWeaver is not based on a shared architecture that supports reasoning about secondary user utility. In comparison to the DECAF implementation, GeneWeaver uses CORBA/RMI rather than TCP/IP communication, and a simplified KQML-like language called BAL.

6 Discussion

The system described here is operational and normally available on the web at http://udgenome.ags.udel.edu/herpes/. This is a working prototype, and so the interface is strongly oriented to biologists only. In general, computational support for the processes that biologists use in analyzing data is primitive (Perl scripts) or non-existent. In less than 10 min, we were able to annotate the HVT-1 sequence, as well as store it in a queryable and web-publishable form. This impressed the biologists we work with, compared to manual annotation and flat ASCII files. Furthermore, we have recently added approximately 15 other publicly available herpesvirus sequences (e.g. several strains of Human herpesvirus, African swine fever virus, etc.). The resulting knowledge base almost immediately resulted in queries by our local biologists that indicated possible interesting relationships that may result in future biological work. This summer we will begin testing with viral biologists from other universities.

Other things about the system which have excited our biologist co-workers are the relative ease with which we can add new types of annotation or analysis information, and the fact that the system can be used to build similar systems for other organisms, such as the chicken. For example, the use of open system concepts such as a matchmaker allows the annotation agent to access and use new annotation services that were not available when it was initially written. Secondary user utility will become useful for the biologist when faced with making a simple office query vs. checking results before publication.
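The trade-off between a quick office query and a careful pre-publication check can be sketched as a choice among plan alternatives under a deadline. This is not the DECAF scheduler; the alternative names, durations, and quality values are invented for illustration, and the selection rule (best quality that fits the deadline) is one simple assumption among many possible ones.

```python
def choose_alternative(alternatives, deadline_s):
    """Pick the highest-quality alternative whose expected time fits."""
    feasible = [a for a in alternatives if a["expected_s"] <= deadline_s]
    if not feasible:
        return None  # nothing can finish in time
    return max(feasible, key=lambda a: a["quality"])


# Hypothetical alternatives for answering the same annotation query.
alternatives = [
    {"name": "cached-annotations", "expected_s": 2, "quality": 0.6},
    {"name": "requery-all-sources", "expected_s": 600, "quality": 0.95},
]

quick = choose_alternative(alternatives, deadline_s=10)
thorough = choose_alternative(alternatives, deadline_s=3600)
print(quick["name"], thorough["name"])
```

The same query thus maps to different actions depending on how long the user is willing to wait, which is the essence of secondary user utility as described here.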

The underlying DECAF system has been evaluated in several ways, especially with respect to the use of parallel computational resources by a single agent (all of the DECAF components and all of the executable actions are run in parallel threads), and the efficacy of the DRU scheduler, which efficiently solves a restricted subset of the design-to-criteria scheduling problem [14]. Running the gene annotation system as a truly multi-agent system results in true speedups, although most of the time is currently spent in remote database access. Parallel hardware for each agent will be useful for some of the more locally computationally intensive tasks involving EST processing.

7 Conclusions and Future Work

In this paper we have discussed the very real problem of making some use of the tremendous amounts of genetic sequence information that are being produced. While there is much information publicly available over the web, accessing such information is different for each source and the results can only be used by a single researcher. Furthermore, the contents of these primary sources are changing all the time, and new sources and techniques for analysis are constantly being developed.

We cast this sequence annotation problem as a general information gathering problem, and proposed the use of multi-agent systems for implementation. Beyond the basic heterogeneous database problem that this problem represents, an MAS solution gives us mechanisms for dealing with changing data, the dynamic appearance of new sources, minding secondary utility characteristics for users, and of course the obvious distributed processing achievements of parallel development, concurrent processing, and the possibility for handling certain security or other organizational concerns (where part of the agent organization can mirror the human organization).

We currently are offering the system publicly on the web, with the known herpesvirus sequences. A second system based on chicken ESTs should be available by the end of 2001. We intend to broaden the annotation coverage and add more complex analyses. An example would be the estimation of the physical location of the gene as well as its function. Because biologists have long recorded certain QTLs (Quantitative Trait Loci) that indicate that a certain physical region is responsible for a trait (such as chickens with resistance to a certain disease), being able to see what genes are physically located in the QTL region is a strong indicator as to their high-level genetic function.

In general, we have not yet designed an interface that allows biologists to take full advantage of the materialized data, because they are uncomfortable with complex query languages. We believe that it may be possible to build a graphical interface to allow a biologist, after some training, to create a commonly needed analysis query and to then save this for use in the future by that scientist, or others sharing the agent namespace. Finally, the next major subsystem will be agents to link and analyze gene expression data (which will in turn interoperate with the metabolic pathway analysis systems described above). This data needs to be linked with sequence and function data, to allow more powerful analysis. For example, linked to QTL data, this allows us to ask questions such as “what chemicals might prevent clubroot disease in cabbage?”.

References

1. Y. Arens and C. A. Knoblock. Intelligent caching: Selecting, representing, and reusing data in an information server. In Proc. 3rd Intl. Conf. on Info. and Know. Mgmt., 1994.

2. D. A. Benson et al. Genbank. Nucleic Acids Res., 28:15–18, 2000.

3. K. Bryson, M. Luck, M. Joy, and D. T. Jones. Applying agents to bioinformatics in GeneWeaver. In Proc. 4th Int. Wksp. Collab. Info. Agents, 2000.

4. S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. The TSIMMIS project: integration of heterogeneous information sources. In Proc. 10th Mtg. Info. Proc. Soc. Japan, Dec. 1994.

5. S. B. Davidson et al. Biokleisli: a digital library for biomedical researchers. Intnl. J. on Digital Libraries, 1(1):36–53, 1997.

6. K. Decker, X. Zheng, and C. Schmidt. A multi-agent system for automated genomic annotation. In Proceedings of the 5th Intl. Conf. on Autonomous Agents, Montreal, 2001.

7. K. S. Decker, A. Pannu, K. Sycara, and M. Williamson. Designing behaviors for information agents. In Proc. 1st Intl. Conf. on Autonomous Agents, pages 404–413, 1997.

8. K. S. Decker, K. Sycara, and M. Williamson. Middle-agents for the internet. In Proc. 15th IJCAI, pages 578–583, 1997.

9. K. S. Decker and V. R. Lesser. Quantitative modeling of complex computational task environments. In Proc. 11th AAAI, pages 217–224, 1993.

10. K. S. Decker and K. Sycara. Intelligent adaptive information agents. Journal of Intelligent Information Systems, 9(3):239–260, 1997.

11. L. Deschaine, R. Brice, and M. Nodine. Use of InfoSleuth to coordinate information acquisition, tracking, and analysis in complex applications. MCC-INSL–008-00, 2000.

12. M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. Nat. Acad. Sci.

13. J. Graham and K. S. Decker. Towards a distributed, environment-centered agent framework. In Intelligent Agents VI, LNAI-1757, pages 290–304. Springer Verlag, 2000.

14. J. Graham. Real-time Scheduling in Multi-agent Systems. PhD thesis, University of Delaware, 2001.

15. J. Hendler, M. Taylor, and K. Stoffel. Advances in high performance knowledge representation. Technical Report CS-TR-3672, University of Maryland Institute for Advanced Computer Studies, 1996. Also cross-referenced as UMIACS-TR-96-56.

16. C. A. Knoblock, Y. Arens, and C. Hsu. Cooperating agents for information retrieval. In Proc. 2nd Intl. Conf. on Cooperative Information Systems. Univ. of Toronto Press, 1994.

17. S. Möller and M. Schroeder. Consistent integration of non-reliable heterogeneous information applied to the annotation of transmembrane proteins. Journal of Computing and Chemistry, to appear, 2001.

18. I. Muslea, S. Minton, and C. Knoblock. Stalker: Learning extraction rules for semistructured web-based information sources. In Papers from the 1998 Workshop on AI and Information Gathering, 1998. Also Technical Report ws-98-14, University of Southern California.

19. M. Nodine and A. Unruh. Facilitating open communication in agent systems: the InfoSleuth infrastructure. In Intelligent Agents IV, pages 281–295. Springer-Verlag, 1998.

20. R. Stevens et al. Tambis: Transparent access to multiple bioinformatics information sources. Bioinformatics, 16(2):184–185, 2000.

21. K. Sycara, K. S. Decker, A. Pannu, M. Williamson, and D. Zeng. Distributed intelligent agents. IEEE Expert, 11(6):36–46, December 1996.

22. The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nature Genetics, 25(1):25–29, May 2000.

23. T. Wagner, A. Garvey, and V. Lesser. Complex goal criteria and its application in design-to-criteria scheduling. In Proc. 14th AAAI, 1997.
