Extending a multi-agent system for genomic annotation
Keith Decker, Salim Khan, Carl Schmidt, and Dennis Michaud
Computer and Information Sciences Department, University of Delaware, Newark, DE 19716
decker@cis.udel.edu
Abstract. The explosive growth in genomic (and soon, expression and proteomic) data, exemplified by the Human Genome Project, is a fertile domain for the application of multi-agent information gathering technologies. Furthermore, hundreds of smaller-profile, yet still economically important organisms are being studied that require the efficient and inexpensive automated analysis tools that multi-agent approaches can provide. In this paper we give a progress report on the use of the DECAF multi-agent toolkit to build reusable information gathering systems for bioinformatics. We will briefly summarize why bioinformatics is a classic application for information gathering, how DECAF supports it, and recent extensions underway to support new analysis paths for genomic information.
1 Introduction
Massive amounts of raw data are currently being generated by biologists while sequencing organisms. Most of this raw data must be analyzed through the piecemeal application of various computer programs and hand-searches of various public web databases. Typically both the raw data and any valuable derived knowledge will remain generally unavailable except in published natural language texts such as journal articles. However, it is important to note that a tremendous amount of genetic material is similar from organism to organism, even when they are as outwardly different as a yeast, fruit fly, mouse, or human being. This means that if a biologist studying the yeast can figure out what a certain gene does (its function), then other biologists can at least guess that similar genes in other organisms play similar roles. Thus huge databases are being populated with sequence data and functional annotations [2]. All new sequences are routinely compared to known sequences for clues as to their functions.
A large amount of work in bioinformatics over the past ten years has gone into developing algorithms (pattern matching, statistical, and/or heuristic/knowledge-based) to support the work of hypothesizing gene function. Many of these are available to biologists in various implementations, and now many are available over the web. Meta-sites combine many published algorithms, and sites specialize in information about particular topics such as protein motifs.
From a computer science perspective, several problems have arisen, as we have described elsewhere [6]. To summarize, what we have is a large set of heterogeneous and dynamically changing databases, all of which have information to bring to bear on the biological problem of determining genomic function. We have biologists producing thousands of possible genes, for which functions must be hypothesized. For all but the largest and best-funded sequencing projects, this must be done by hand by a single researcher and their students.
Multi-agent information gathering systems have a lot to contribute to these efforts. Several features make a multi-agent approach to this problem particularly attractive: information is available from many distinct locations; information content is heterogeneous; information content is constantly changing; much of the annotation work for each gene can be done independently; biologists wish to both make their findings widely available, yet retain control over the data; new types of analysis and sources of data are appearing constantly.
We have used DECAF, a multi-agent system toolkit based on RETSINA [21, 10, 7] and TAEMS [9, 23], to construct a prototype multi-agent system for automated annotation and database storage of sequencing data for herpesviruses [6]. The resulting system eliminates tedious and always out-of-date hand analyses, makes the data and annotations available for other researchers (or agent systems), and provides a level of query processing beyond even some high-profile web sites.
Since that initial system, we have used the distributed, open nature of our multi-agent solution to expand the system in several ways that will make it useful for biologists studying more organisms, and in different ways. This paper will briefly describe our approach to information gathering, based on our work on RETSINA; the DECAF toolkit; our initial annotation system; and our new extensions for functional annotation, EST processing, and metabolic pathway reasoning.
2 DECAF
DECAF (Distributed, Environment-Centered Agent Framework) is a Java-based toolkit for creating multi-agent systems [13]. In particular, several tools have been developed specifically for prototyping information gathering systems. Also, the internal architecture of each DECAF agent has been designed much like an operating system: as a set of services for the “intelligent” (resource-efficient, adaptively-scheduled, soft real-time, objective-persistent) execution of agent actions. DECAF consists of a set of well defined control modules (initialization, dispatching, planning, scheduling, and execution, each in a separate, concurrent thread) that work in concert to control an agent’s life cycle. There is one core task structure representation that is shared between all of the control modules. This has meant that even non-reusable domain-dependent agents can be developed more quickly than by the API approach where the programmer has to, in effect, create and orchestrate the agent’s architecture as well as its domain-oriented agent actions. This section will first discuss the internal architecture of a generic DECAF agent, and then discuss the tools (such as middle agents, system debugging aids, and the information extraction agent shell) we have built to implement multi-agent information gathering systems. The overall internal architecture of DECAF is shown in Figure 1. These modules run concurrently, each in its own thread. Details of the DECAF implementation can be found elsewhere [13].
[Figure 1: block diagram. Labeled components: Plan File; Incoming KQML Messages; DECAF Task and Control Structures (Incoming Message Queue, Objectives Queue, Task Queue, Agenda Queue, Pending Action Queue, Action Results Queue, Task Templates Hashtable, Domain Facts and Beliefs); control modules (Agent Initialization, Dispatcher, Planner, Scheduler, Executor); Action Modules; Outgoing KQML Messages.]
Fig. 1. DECAF Architecture Overview
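The arrangement of control modules and queues can be illustrated with a small sketch. This is not DECAF code (DECAF is written in Java); it is a minimal Python illustration, under our own assumptions, of modules that each run in their own thread and hand work to the next module through a queue. The function names `control_module` and `run_agent`, and the simple linear chaining of queues, are hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel that tells each module to shut down


def control_module(name, inbox, outbox, trace):
    """One control module: take an item from its inbox queue, record that
    it was handled, and pass it on to the next queue."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown downstream
            return
        trace.append((name, item))
        outbox.put(item)


def run_agent(messages):
    """Push messages through dispatcher -> planner -> scheduler -> executor."""
    modules = ["dispatcher", "planner", "scheduler", "executor"]
    # Assumed chain: incoming messages -> objectives -> tasks -> agenda -> results
    queues = [queue.Queue() for _ in range(len(modules) + 1)]
    trace = []
    threads = [
        threading.Thread(target=control_module,
                         args=(name, queues[i], queues[i + 1], trace))
        for i, name in enumerate(modules)
    ]
    for t in threads:
        t.start()
    for m in messages:
        queues[0].put(m)
    queues[0].put(STOP)
    for t in threads:
        t.join()
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    return results, trace
```

Because every module blocks on its own queue, a slow module delays only the items behind it, while the other modules keep working on whatever is already in their queues.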
2.1 DECAF Support for Info Gathering
DECAF provides core internal architectural support for secondary user utility. Thus DECAF plans can include alternatives, and these alternatives can be chosen dynamically at runtime depending on user constraints on answer timeliness or other resource constraints. DECAF also supports building information gathering systems by providing useful middle agents and a shell for quickly building information extraction agents for wrapping web sites. The Agent Name Server (ANS) (“white pages”) is an essential component for agent communication. It works in a fashion similar to DNS (Domain Name Service) by resolving agent names to host and port addresses. The Matchmaker serves as a “yellow pages” to assist agents in finding services needed for task completion. The Broker agent acts as a kind of “middle manager” to assist an agent with collections of services. The broker can now provide a larger service than any single provider can, and often manage a large group of agents more effectively [8]. A Proxy agent allows web page Java applets to communicate with DECAF agents that are not located on the same server as the applet. The Agent Management Agent (AMA) allows MAS designers a look at the entire running set of agents spread out across the Internet that share a single agent name server. This allows designers to query the status of individual agents and watch or record message passing traffic.
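The difference between the “white pages” and “yellow pages” roles can be made concrete with a short sketch. The classes below are hypothetical Python stand-ins, not the DECAF middle agents themselves, and the method names (`register`, `resolve`, `advertise`, `find`) are our own assumptions.

```python
class AgentNameServer:
    """White pages: maps an agent name to a (host, port) address."""

    def __init__(self):
        self._addresses = {}

    def register(self, name, host, port):
        self._addresses[name] = (host, port)

    def resolve(self, name):
        # Returns None when the agent name is unknown.
        return self._addresses.get(name)


class Matchmaker:
    """Yellow pages: maps a service description to the agents offering it."""

    def __init__(self):
        self._providers = {}

    def advertise(self, service, agent_name):
        self._providers.setdefault(service, []).append(agent_name)

    def find(self, service):
        # Return a copy so callers cannot alter the registry.
        return list(self._providers.get(service, []))
```

A requester would typically ask the matchmaker for the names of agents that offer a service, and then resolve one of those names through the name server before sending it a message.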
Information Extraction Agent Shell. The main functions of an information extraction agent (IEA) are [7]: fulfilling requests from external sources in response to a one-shot query (e.g. “What is the price of IBM?”); monitoring external sources for periodic information (e.g. “Give me the price of IBM every 30 minutes.”); and monitoring sources for patterns, called information monitoring requests (e.g. “Notify me if the price of IBM goes below $50.”). These functions can be written in a general way so that the code can be shared for agents in any domain.
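The three request types can be sketched in a domain-independent way. The class below is a hypothetical Python illustration, not the IEA shell itself; the names `query`, `every`, `when`, and `tick` are assumptions, and time is modeled as explicit ticks rather than a real clock.

```python
class InfoExtractionAgent:
    """Toy IEA: `fetch` is any callable that returns the current value
    for a key from the wrapped source."""

    def __init__(self, fetch):
        self.fetch = fetch
        self.periodic = []   # (key, every_n_ticks, callback)
        self.monitors = []   # (key, predicate, callback)
        self.ticks = 0

    def query(self, key):
        """One-shot query: answer immediately."""
        return self.fetch(key)

    def every(self, key, n_ticks, callback):
        """Periodic request: report the value every n_ticks ticks."""
        self.periodic.append((key, n_ticks, callback))

    def when(self, key, predicate, callback):
        """Monitoring request: report only when the predicate holds."""
        self.monitors.append((key, predicate, callback))

    def tick(self):
        """Advance time by one tick and service standing requests."""
        self.ticks += 1
        for key, n_ticks, callback in self.periodic:
            if self.ticks % n_ticks == 0:
                callback(key, self.fetch(key))
        for key, predicate, callback in self.monitors:
            value = self.fetch(key)
            if predicate(value):
                callback(key, value)
```

Only `fetch` depends on the wrapped source, which is what allows the rest of the code to be shared across domains.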
Since our IEA operates on the Web, the information gathered is from external information sources. The agent uses a set of wrappers and the wrapper induction algorithm STALKER [18] to extract relevant information from the web pages after being shown several marked-up examples. When the information is gathered it is stored in the local IEA “infobase” using Java wrappers on a PARKA [15] knowledge base. This makes new IEAs fairly easy to create, and forces the difficult parts of this problem back onto KB ontology creation, rather than the production of tools to wrap web pages and dynamically answer queries. Currently, there are some proposals for XML-based page annotations which, if adopted, will make site wrapping easier syntactically (although they still do not solve the ontology problem; but see projects such as OIL).
3 A DECAF Multi-Agent System for Genomic Analysis
These tools can be put to use to create a prototype multi-agent system for various types of genomic analysis. In the prototype, we have chosen to simplify the query subsystem by materializing all annotations locally, thus removing the need for sophisticated query planning (e.g. [16]). This is a reasonable simplification since most of our work is with viruses that have fairly small genomes (around 100 genes for a herpesvirus and around 30 herpesviruses) or with larger organisms (e.g. chickens) for which we are constructing a consensus database explicitly.
Figure 2 shows an overview of the system as four overlapping multi-agent organizations. The first, Basic Sequence Annotation, is charged with integrating remote gene sequence annotations from various sources with the gene sequences at the Local Knowledge Base Management Agent (LKBMA). The second, Query, allows complex queries on the LKBMAs via a web interface. The third, Functional Annotation, is responsible for collecting information needed to make an informed guess as to the function of a gene, specifically using the three-part Gene Ontology [22]. The fourth organization, EST Processing, enables the analysis of expressed sequence tags (ESTs) to produce gene sequences that can be annotated by the other organizations.
An important feature to note is that we are focusing on annotation and analysis services that are not organism specific. In this way, the resulting system can be used to build and query knowledge bases from several different organisms. The original subsystems (basic annotation and the simple query system) were built to annotate the newly sequenced Herpesvirus of Turkey (the bird), and then to compare it to the other known sequenced herpesviruses. Work is just beginning to build a new knowledge base from chicken ESTs, and to extend the depth of the herpesvirus KB for Epstein-Barr Virus (human herpesvirus 4) which has clinical significance for pediatric organ transplant patients.
[Figure 2: organization diagram. Labeled components: Sequence Addition Applet, Functional Annotation Applet, User Query Applet, and EST Entry [Chromatograph/FASTA], each connected through a Proxy Agent; Basic Sequence Annotation (Sequence Source Processing Agent, Annotation Agent, ProDomain IEA, SwissProt/ProSite IEA, PSort IEA, GenBank Info Extraction Agent); Query (Query Processing Agent, Sequence LKBMA); Functional Annotation (Ontology Agent, Ontology Reasoning Agent, Flybase IEA, Mouse Genome DB IEA, SGD (yeast) IEA); EST Processing (Chromatograph Processing, Consensus Sequence, SNP-Finder, EST LKBMA).]
Fig. 2. Overview of DECAF Multi-Agent System for Genomic Analysis
3.1 Basic Sequence Annotation and Query Processing
Figure 3 shows the interaction details for the basic sequence annotation and query subsystems. We will describe the agents by their RETSINA classification.
[Figure 3: organization diagram. Interface Agents: Sequence Addition Applet, User Query Applet. Domain-Independent Task Agents: Proxy Agent, Matchmaker Agent, Agent Name Server Agent. Task Agents: Annotation Agent, Query Processing Agent, Sequence Source Processing Agent. Information Extraction Agents: Local Knowledgebase Management Agents, GenBank Info Extraction Agent, ProDomain Info Extraction Agent, SwissProt/ProSite Info Extraction Agent, PSort Info Extraction Agent.]
Fig. 3. Basic Annotation and Query Agent Organizations
Information Extraction Agents. Currently 4 agents based on the IEA shell wrap public web sites. The Genbank wrapper primarily supplies “BLAST” services: given the sequence of a herpesvirus gene, what are the most similar genes known in the world (called “homologs”)? The answer here can give the biologist a clue as to the possible function of a gene, and for any gene that the biologist does not know the function of, a change in the answer to this query might be significant. The SwissProt wrapper primarily provides protein motif pattern searches. If we view a protein as a one-dimensional string of amino acids, then a motif is a regular expression matching part of the string that may indicate a particular kind of function for the protein (e.g. a prenylation motif indicates a place where the protein may be modified after translation by the addition of another group of molecules). The PSort wrapper accesses a knowledge-based system for estimating the likely sub-cellular location where a sequence’s encoded protein will be used. The ProDomain wrapper allows access to other information about the encoded protein; a protein domain is similar to a motif but larger. As we move to new organisms, many more resources could be wrapped at this level (almost all biologists have a “favorite” here).
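The view of a motif as a regular expression over the amino-acid string can be shown directly. The pattern below is a simplified, CaaX-box-like prenylation pattern chosen for illustration only; it is an assumption on our part and not necessarily the pattern served by the SwissProt/ProSite wrapper. The function name `find_motifs` and the example protein strings are likewise hypothetical.

```python
import re

# Assumed, simplified pattern: a cysteine, a residue that is not D/E/N/Q,
# one of L/I/V/M, then any residue at the very end of the protein.
CAAX_LIKE = re.compile(r"C[^DENQ][LIVM].$")


def find_motifs(protein, pattern=CAAX_LIKE):
    """Return (start_index, matched_text) for each motif occurrence."""
    return [(m.start(), m.group()) for m in pattern.finditer(protein)]


print(find_motifs("MKTAYCVLS"))  # [(5, 'CVLS')]
print(find_motifs("MKTCVLSA"))   # [] because the match must end the string
```

A hit of this kind is only a hint: as the text says, a motif may indicate a function, so an agent would store it as one annotation among several rather than as a conclusion.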
The local knowledge base management agent (KBMA) is a slightly different member of this class because unlike most IEAs it actually stores data via agent messages rather than only querying external data sources. It is here that the annotations of the genetic information are materialized, and from which most queries are answered. Each KBMA is updated with raw sequencing data indirectly from a user sequence addition interface; that data is then automatically annotated under the control of an annotation task agent. KBMAs can be “owned” by different parties, and queried separately or together. In this way, researchers with limited computer knowledge can create sharable annotated sequence databases using the existing wrappers and other analysis tools as they are developed, without necessarily having to download and install them themselves. Using a PARKA-DB knowledge base allows efficient, modern relational data storage and query on the backend, as well as limited KB inferencing [15].
Task Agents. There are two domain task agents; the rest are generic middle agents described earlier. The Annotation Agent directs exactly what information should be annotated for each sequence. It is responsible for storing the raw sequence data, making queries to the various wrapped web sites, storing those annotations, and also indicating the provenance of the data (meta-information regarding where an annotation came from). The Sequence Source Processing Agent takes almost raw sequence data in ASN.1 format as output by typical sequence estimation programs or stored in Genbank. The main function of this agent is to test this input for internal consistency.
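Storing provenance alongside each annotation can be sketched as a small record type. The structure below is a hypothetical Python illustration; the actual system stores annotations in a PARKA knowledge base, and the field and method names here (`kind`, `value`, `source`, `annotate`, `provenance`) are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    kind: str     # e.g. "homolog", "motif", "location"
    value: str    # the annotation itself
    source: str   # provenance: which wrapped source produced it


@dataclass
class SequenceRecord:
    gene_id: str
    sequence: str
    annotations: list = field(default_factory=list)

    def annotate(self, kind, value, source):
        self.annotations.append(Annotation(kind, value, source))

    def values(self, kind):
        """All annotation values of one kind, in insertion order."""
        return [a.value for a in self.annotations if a.kind == kind]

    def provenance(self, kind, value):
        """Sorted list of sources that reported this exact annotation."""
        return sorted({a.source for a in self.annotations
                       if a.kind == kind and a.value == value})
```

Keeping the source on every annotation lets a later query distinguish a claim made by one wrapped site from the same claim confirmed by several.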
Interface Agents. There are two interface applets that communicate via the proxy agent with other agents in the system. One is oriented towards adding new sequences to a local knowledge base (secured by a password) and the other allows anyone to query the complete annotated KB (or even multiple KBs). The interface hardly scratches the surface of the queries that are actually possible, but a big problem is that most biologists are not comfortable with complex query languages. Indeed, the simple interface that allows simple conjunctive and disjunctive queries over dynamic menus of annotations (constructed by the applet at runtime from the actual local KB) is quite advanced as compared to most of the existing public sites that allow textual keyword searches only.
3.2 Functional Annotation
This subsystem is responsible for assisting the biologist in the difficult problem of making functional annotations of each gene. Unfortunately, many of the millions of genes sequenced so far have fairly haphazard (from a computer scientist’s perspective) functional annotation: simply free natural language descriptions. Recently, a fairly large group representing at least some of the primary organism databases has created a consortium dedicated to creating a gene ontology for annotating gene function in three basic areas: the biological process in which a gene plays a part, the molecular function of the gene product, and the cellular localization [22]. The subsystem described here supports the use of this ontology by biologists as sequences are added to the system, eventually leading to even more powerful analysis of the resulting KBs.
Information Extraction Agents. Besides the gene sequence LKBMA and the GenBank IEA, we are wrapping three new organism-specific gene sequence DBs, for Drosophila (fruit fly), Mus (mouse), and Saccharomyces cerevisiae (yeast). Each of these organism databases is part of the Gene Ontology (GO) consortium, and considerable time has been spent on making their functional annotations properly. Each of these agents, then, finds GO-annotated, close homologs of the unannotated gene and proposes the annotation of the homologs for the annotation of the new gene.
Task Agents. There are two new task agents. One is a domain-independent ontology agent using the FIPA ontology agent specification as a starting point. The ontology agent contains both the GO ontologies and several mappings from other symbologies (e.g. SwissProt terms) to GO terms. In fact, the Mouse IEA uses the Ontology agent to map some non-GO terms for certain records to GO terms. Although not indicated on the figure, some of the other organism DB IEA agents must map from GO ontology descriptive strings to the actual unique GO ID. The other service provided by the ontology agent (and not explicitly mentioned in the experimental FIPA Ontology Agent specification) is for the ontology reasoning agent to ask how two terms are related in an ontology. The Ontology Reasoning Agent uses this query to build a minimum spanning tree (in each of the three GO ontologies) between all the terms returned in all the homologies from all of the GO organism databases. This information can then be used to propose a likely annotation, and to display all of the information graphically for the biologist via the interface agent.
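One way to picture the term-relationship query and the connecting tree is the sketch below. It is not the Ontology Reasoning Agent's algorithm; it assumes, for simplicity, a single-rooted ontology in which every term has at most one parent (GO itself is a directed acyclic graph in which a term may have several parents), and the term names are made up. It finds the deepest ancestor shared by all returned terms and the edges that connect them to it.

```python
def path_to_root(term, parent):
    """List of terms from `term` up to the root, following parent links."""
    path = [term]
    while term in parent:
        term = parent[term]
        path.append(term)
    return path


def connecting_subtree(terms, parent):
    """Return (deepest common ancestor, set of (child, parent) edges)
    linking every term in `terms` to that ancestor."""
    paths = [path_to_root(t, parent) for t in terms]
    common = set(paths[0]).intersection(*(set(p) for p in paths[1:]))
    # Walking upward from the first term, the first shared node is the deepest.
    ancestor = next(node for node in paths[0] if node in common)
    edges = set()
    for path in paths:
        for child in path[:path.index(ancestor)]:
            edges.add((child, parent[child]))
    return ancestor, edges


# Toy ontology: a is the root; b and c are children of a, and so on.
toy_parent = {"b": "a", "c": "a", "d": "b", "e": "b", "f": "c"}
print(connecting_subtree(["d", "e"], toy_parent))
```

When the homologs from different organism databases agree closely, the common ancestor is a specific term; when they disagree, it drifts toward the root, which is itself a useful signal for the biologist.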
Interface Agents. The functional interface agent/applet consists of two columnar panes: on the left, the top pane displays the gene being annotated, and the bottom displays the general homologies from GenBank with their natural language annotations. On the right, three panes display the subtrees from the three GO ontologies (biological process, molecular function, cellular location) marked in color with the homologs from the three organism databases.
3.3 EST Processing
One way to broaden the applicability of the system is to accept more kinds of basic input data to the annotation process. For example, we could broaden the reach of the system by starting with ESTs (Expressed Sequence Tags) instead of complete sequences. Agents could wrap the standard software for creating sequences from this data, at which point the existing system can be used. The use of ESTs is part of a relatively inexpensive approach to sequencing where, instead of directly sequencing genomic DNA, we use a method that produces many short sequences that partially overlap. By finding the overlaps in the short sequences, we can eventually reconstruct the entire sequence of each expressed gene. Essentially, this is a “shotgun” approach that relies on statistics and the sheer number of experiments to eventually produce complete sequences.
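The idea of reconstructing a longer sequence from partially overlapping short ones can be illustrated with a greedy merge. This is a textbook-style sketch under simplifying assumptions (error-free reads, exact suffix/prefix overlaps, no read contained inside another); real assemblers use quality-aware alignment, and the function names are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`,
    or 0 if no such overlap of at least `min_len` exists."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap.
    Returns the remaining contigs when no pair overlaps any more."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return reads
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads)
                 if k not in (best_i, best_j)] + [merged]


print(greedy_assemble(["ATGGCC", "GCCTTA", "TTAGGA"]))  # ['ATGGCCTTAGGA']
```

Reads that overlap nothing are returned unchanged, which mirrors the situation early in a sequencing project where many ESTs do not yet join into contiguous sequences.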
As a side effect of this processing, information is produced that can be used to find Single Nucleotide Polymorphisms (SNPs). SNPs indicate a change of one nucleotide (A, T, C, G) in a single gene between different individuals (often, conserved across strains or subspecies). These markers are very important for identification even if they do not have functional effects.
Information Extraction Agents. The process of consensus sequence building and SNP identification does not require any external information, so the only IEAs are the LKBMAs. Up until now, there has only been one LKBMA, responsible for the gene sequences and annotations. EST processing adds a second LKBMA responsible for storing the ESTs themselves and the associated information discussed below. Primarily, this is because (especially early on in a sequencing project) there will be thousands of ESTs that do not overlap to form contiguous sequences, and because ESTs may be added and processed almost daily.
Task Agents. There are three new domain-level task agents. The first deals with processing chromatographs. Essentially the chromatograph is a set of signals that indicate the relative strengths of the wavelengths associated with each luminous nucleotide tag. Several standard Unix analysis programs exist to process this data, essentially “calling” the best nucleotide for each position. The chromatograph processing agent wraps three analysis programs: Phred, which “calls” the chromatograph and also separately produces an uncertainty score for each nucleotide in the sequence; phd2fasta, which converts this output into a standard (FASTA) format; and x-match, which removes a part of the sequence that is a byproduct of the sequencing method, and not actually part of the organism sequence. The consensus sequence assembly agent uses two more programs (Phrap and consed) on all the ESTs found so far to produce a set of candidate genes by appropriately splicing together the short EST sequences. This produces a set of candidate genes that can then be added to the gene sequence LKBMA and from which the various annotation processes described earlier may commence. Finally, a SNP-finder agent operates the PolyBayes program, which uses the EST and Sequence KBs and the uncertainty scores produced by Phred to nominate possible single nucleotide polymorphisms.
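The role of the per-nucleotide uncertainty scores in nominating SNPs can be illustrated with a simple filter. A Phred quality score Q corresponds to an estimated error probability of 10^(-Q/10). The sketch below is not the PolyBayes method (which is Bayesian); it is a simplified counting heuristic over reads that are assumed to be already aligned and of equal length, and the thresholds `min_q` and `min_support` are arbitrary illustrative values.

```python
from collections import Counter


def error_probability(q):
    """Estimated probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def candidate_snps(reads, quals, min_q=20, min_support=2):
    """Return (position, alleles) for columns where at least two different
    bases are each supported by `min_support` calls of quality >= `min_q`.
    `reads` are pre-aligned strings; `quals` are matching lists of scores."""
    sites = []
    for pos in range(len(reads[0])):
        counts = Counter(r[pos] for r, q in zip(reads, quals)
                         if q[pos] >= min_q)
        alleles = sorted(b for b, c in counts.items() if c >= min_support)
        if len(alleles) >= 2:
            sites.append((pos, alleles))
    return sites
```

Low-quality calls are ignored before counting, so a disagreement that rests only on uncertain base calls is not nominated. This is why input without quality scores can support consensus building but is a poor basis for SNP finding.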
Interface Agents. There is only one simple interface agent, to allow participants to enter data in the system. Preferably, this is chromatograph data from the sequencers, because the original chromatograph allows Phred to calculate the uncertainty associated with each nucleotide call. However, FASTA-format (simple “ATCG...” named strings) ESTs called from the original chromatographs can be accommodated. These can be used to build consensus sequences, but not for finding SNPs.
4 Gene Expression Processing
A new kind of genomic data is now being produced that may swamp even the amount of sequencing data. This is so-called gene expression data, and indicates quantitatively how much a gene product is expressed in some location, under some conditions, at some point in time. We are developing a multi-agent system that uses available online genomic and metabolic pathway knowledge to extend gene expression analysis. By incorporating known relationships between genes, knowledge-based analysis of experimental expression data is significantly improved over purely statistical methods. Although this system has not yet been integrated into the existing agent community, eventually relevant genomic information will be made available to the system through the existing GenBank and SwissProt IEAs. Metabolic pathways of interest to the investigator are identified through a KEGG (Kyoto Encyclopedia of Genes and Genomes) database wrapper. Analysis of the gene expression data is performed through an agent that executes SAS, a statistical package that includes clustering and PCA analysis methods. Results are to be presented to the user through web pages hyperlinked to relevant database entries.
Current techniques for gene expression analysis have primarily focused on the use of clustering algorithms, which group genes of similar expression patterns together [12]. However, experimental gene expression data can be very noisy and the complicated pathways within organisms can generate coincidental expression patterns, which can significantly limit the benefits of standard cluster analysis. In order to separate gene co-regulation patterns from co-expression, the gene expression processing organization was developed to gather available pathway-level information in order to presort the expression data into functional categories. Thus, clustering of the reduced data set is much more likely to find genes that are actually regulated together. The system also promises to be useful in discovering regulatory connections between different pathways. One advantage of using the KEGG database is that its gene/enzyme entries are organized by the EC (Enzyme Commission) ontology, and so are easily mapped to gene names specific to the organism of interest.
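The presorting step can be illustrated as follows. This is a minimal sketch under our own assumptions, not the SAS-based analysis: expression profiles are first grouped by a pathway label (which in the system would come from the KEGG wrapper), and similarity is then measured only within a group. Pearson correlation is used here as a simple stand-in for a clustering method; the gene names, pathway labels, and threshold are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def presort_by_pathway(expression, pathway_of):
    """Group expression profiles by pathway; genes without a known
    pathway are left out of the reduced data set."""
    groups = defaultdict(dict)
    for gene, profile in expression.items():
        if gene in pathway_of:
            groups[pathway_of[gene]][gene] = profile
    return dict(groups)


def coexpressed_pairs(group, threshold=0.9):
    """Gene pairs within one pathway whose profiles correlate strongly."""
    genes = sorted(group)
    return [(a, b) for i, a in enumerate(genes) for b in genes[i + 1:]
            if pearson(group[a], group[b]) >= threshold]
```

Comparing genes only within a functional category removes many coincidental matches between unrelated genes, at the cost of ignoring genes whose pathway is not yet known.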
5 Related Work
There has been significant work on general algorithms for query planning, selective materialization, and the optimization of these from the AI perspective, for example TSIMMIS [4], Infosleuth [19], SIMS [1], etc., and of course on applying agents as the way to embody these algorithms [16, 21, 10].
In biology, compared to the work being done to create the raw data, the work on how to organize and retrieve it is relatively small. Most of the work in computer science directed to biological data has been in the area of heterogeneous databases, focusing on the semi-structured nature of much of the data that makes it very difficult to store usefully in commercial relational databases [5]. Some work has begun in applying the work on wrappers and mediators to biological databases, for example TAMBIS [20]. These systems differ from ours in that they are pure implementations of wrapper/mediator technology that are centralized and do not allow for dynamic changes in sources, support persistent queries, or consider secondary user utility in the form of time or other resource limitations.
Agent technology has been making some inroads in the area. The word “agent” with the popular connotation of a single computer program to do a user’s bidding is found in the promotional material for Doubletwist (www.doubletwist.com). Here, an “agent” stands for a persistent query (e.g. “tell me if a new homolog is found in your database for the following sequence”). There is no collaboration or communication between agents.
We know of a few truly multi-agent projects in this domain. First, InfoSleuth has been used to annotate livestock genetic samples [11]. The flow of information is very similar to our system. However, the system is not set up for noticing changes in the public databases, for integrating new data sources on the fly, or for consideration of secondary user utility. Second, the EDITtoTrEMBL system [17] is another automated annotation system, based on the wrapper and mediator concept, for annotating proteins awaiting manual annotation and entry to SwissProt. Dispatcher agents control the application of potentially complex sequences of wrappers. Most importantly, this system supports the detection and possible revision of inconsistencies revealed between different annotations. Third, the GeneWeaver project [3] is another true multi-agent system for annotation of genomes. GeneWeaver has as a primary design criterion the observation that the source data is always changing, and so annotations need to be constantly updated. They also express the idea that new sources or analysis tools should be easy to integrate into the system, which plays to the open systems requirement, although they do not describe details. The primary differences are the way in which an open system is achieved (it is not clear that they use agent-level matchmaking, but rather possibly CORBA specifications) and that GeneWeaver is not based on a shared architecture that supports reasoning about secondary user utility. In comparison to the DECAF implementation, GeneWeaver uses CORBA/RMI rather than TCP/IP communication, and a simplified KQML-like language called BAL.
6 Discussion
The system described here is operational and normally available on the web at http://udgenome.ags.udel.edu/herpes/. This is a working prototype, and so the interface is strongly oriented to biologists only. In general, computational support for the processes that biologists use in analyzing data is primitive (Perl scripts) or non-existent. In less than 10 min, we were able to annotate the HVT-1 sequence, as well as store it in a queryable and web-publishable form. This impressed the biologists we work with, compared to manual annotation and flat ASCII files. Furthermore, we have recently added approximately 15 other publicly available herpesvirus sequences (e.g. several strains of Human herpesvirus, African swine fever virus, etc.). The resulting knowledge base almost immediately resulted in queries by our local biologists that indicated possible interesting relationships that may result in future biological work. This summer we will begin testing with viral biologists from other universities.
Other things about the system which have excited our biologist co-workers are the relative ease with which we can add new types of annotation or analysis information, and the fact that the system can be used to build similar systems for other organisms, such as the chicken. For example, the use of open system concepts such as a matchmaker allows the annotation agent to access and use new annotation services that were not available when it was initially written. Secondary user utility will become useful for the biologist when faced with making a simple office query vs. checking results before publication.
The underlying DECAF system has been evaluated in several ways, especially with respect to the use of parallel computational resources by a single agent (all of the DECAF components and all of the executable actions are run in parallel threads), and the efficacy of the DRU scheduler which efficiently solves a restricted subset of the design-to-criteria scheduling problem [14]. Running the gene annotation system as a truly multi-agent system results in true speedups, although most of the time is currently spent in remote database access. Parallel hardware for each agent will be useful for some of the more locally computationally intensive tasks involving EST processing.
7 Conclusions and Future Work
In this paper we have discussed the very real problem of making some use of the tremendous amounts of genetic sequence information that are being produced. While there is much information publicly available over the web, accessing such information is different for each source and the results can only be used by a single researcher. Furthermore, the contents of these primary sources are changing all the time, and new sources and techniques for analysis are constantly being developed.
We cast this sequence annotation problem as a general information gathering problem, and proposed the use of multi-agent systems for implementation. Beyond the basic heterogeneous database problem that this problem represents, an MAS solution gives us mechanisms for dealing with changing data, the dynamic appearance of new sources, minding secondary utility characteristics for users, and of course the obvious distributed processing achievements of parallel development, concurrent processing, and the possibility for handling certain security or other organizational concerns (where part of the agent organization can mirror the human organization).
We currently are offering the system publicly on the web, with the known herpesvirus sequences. A second system based on chicken ESTs should be available by the end of 2001. We intend to broaden the annotation coverage and add more complex analyses. An example would be the estimation of the physical location of the gene as well as its function. Because biologists have long recorded certain QTLs (Quantitative Trait Loci) that indicate that a certain physical region is responsible for a trait (such as chickens with resistance to a certain disease), being able to see what genes are physically located in the QTL region is a strong indicator as to their high-level genetic function.
In general, we have not yet designed an interface that allows biologists to take full advantage of the materialized data, because they are uncomfortable with complex query languages. We believe that it may be possible to build a graphical interface to allow a biologist, after some training, to create a commonly needed analysis query and to then save this for use in the future by that scientist, or others sharing the agent namespace.
Finally, the next major subsystem will be agents to link and analyze gene expression data (which will in turn interoperate with the metabolic pathway analysis systems described above). This data needs to be linked with sequence and function data, to allow more powerful analysis. For example, linked to QTL data, this allows us to ask questions such as “what chemicals might prevent club root disease in cabbage?”.
References
1. Y. Arens and C. A. Knoblock. Intelligent caching: Selecting, representing, and reusing data in an information server. In Proc. 3rd Intl. Conf. on Info. and Know. Mgmt., 1994.
2. D. A. Benson et al. Genbank. Nucleic Acids Res., 28:15–18, 2000.
3. K. Bryson, M. Luck, M. Joy, and D. T. Jones. Applying agents to bioinformatics in GeneWeaver. In Proc. 4th Int. Wksp. Collab. Info. Agents, 2000.
4. S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. The TSIMMIS project: integration of heterogeneous information sources. In Proc. 10th Mtg. Info. Proc. Soc. Japan, Dec. 1994.
5. S. B. Davidson et al. Biokleisli: a digital library for biomedical researchers. Intnl. J. on Digital Libraries, 1(1):36–53, 1997.
6. K. Decker, X. Zheng, and C. Schmidt. A multi-agent system for automated genomic annotation. In Proceedings of the 5th Intl. Conf. on Autonomous Agents, Montreal, 2001.
7. K. S. Decker, A. Pannu, K. Sycara, and M. Williamson. Designing behaviors for information agents. In Proc. 1st Intl. Conf. on Autonomous Agents, pages 404–413, 1997.
8. K. S. Decker, K. Sycara, and M. Williamson. Middle-agents for the internet. In Proc. 15th IJCAI, pages 578–583, 1997.
9. K. S. Decker and V. R. Lesser. Quantitative modeling of complex computational task environments. In Proc. 11th AAAI, pages 217–224, 1993.
10. K. S. Decker and K. Sycara. Intelligent adaptive information agents. Journal of Intelligent Information Systems, 9(3):239–260, 1997.
11. L. Deschaine, R. Brice, and M. Nodine. Use of InfoSleuth to coordinate information acquisition, tracking, and analysis in complex applications. MCC-INSL–008-00, 2000.
12. M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. Nat. Acad. Sci.
13. J. Graham and K. S. Decker. Towards a distributed, environment-centered agent framework. In Intelligent Agents VI, LNAI-1757, pages 290–304. Springer Verlag, 2000.
14. J. Graham. Real-time Scheduling in Multi-agent Systems. PhD thesis, University of Delaware, 2001.
15. J. Hendler, M. Taylor, and K. Stoffel. Advances in high performance knowledge representation. Technical Report CS-TR-3672, University of Maryland Institute for Advanced Computer Studies, 1996. Also cross-referenced as UMIACS-TR-96-56.
16. C. A. Knoblock, Y. Arens, and C. Hsu. Cooperating agents for information retrieval. In Proc. 2nd Intl. Conf. on Cooperative Information Systems. Univ. of Toronto Press, 1994.
17. S. Möller and M. Schroeder. Consistent integration of non-reliable heterogeneous information applied to the annotation of transmembrane proteins. Journal of Computing and Chemistry, to appear, 2001.
18. I. Muslea, S. Minton, and C. Knoblock. Stalker: Learning expectation rules for semistructured web-based information sources. In Papers from the 1998 Workshop on AI and Information Gathering, 1998. Also Technical Report ws-98-14, University of Southern California.
19. M. Nodine and A. Unruh. Facilitating open communication in agent systems: the InfoSleuth infrastructure. In Intelligent Agents IV, pages 281–295. Springer-Verlag, 1998.
20. R. Stevens et al. Tambis: Transparent access to multiple bioinformatics information sources. Bioinformatics, 16(2):184–185, 2000.
21. K. Sycara, K. S. Decker, A. Pannu, M. Williamson, and D. Zeng. Distributed intelligent agents. IEEE Expert, 11(6):36–46, December 1996.
22. The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nature Genetics, 25(1):25–29, May 2000.
23. T. Wagner, A. Garvey, and V. Lesser. Complex goal criteria and its application in design-to-criteria scheduling. In Proc. 14th AAAI, 1997.