{"id":22748,"date":"2022-11-07T11:42:09","date_gmt":"2022-11-07T10:42:09","guid":{"rendered":"https:\/\/kinit.sk\/multimodal-processing-can-artificial-intelligence-learn-the-meaning-and-relationship-between-several-different-modalities\/"},"modified":"2022-11-07T11:54:07","modified_gmt":"2022-11-07T10:54:07","slug":"multimodalne-spracovanie","status":"publish","type":"post","link":"https:\/\/kinit.sk\/sk\/multimodalne-spracovanie\/","title":{"rendered":"Multimodal processing: Can artificial intelligence learn the meaning of several different modalities and the relationships between them?"},"content":{"rendered":"<div id=\"\" class=\"element core-paragraph\">\n<p>Artificial intelligence has become a hot topic in many industries in recent years, as its use grows every day. Although we are still far from a general artificial intelligence that would be indistinguishable from a human being, computers can already solve complex tasks easily and quite reliably. 
These tasks include, for example:<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-list\">\n<ul class=\"wp-block-list\"><li><strong>translating<\/strong> a complicated text into a chosen language,<\/li><li><strong>recognizing a face<\/strong> and automatically opening the door only for a person whose photograph is in the employee database,<\/li><li><strong>detecting<\/strong> and <strong>localizing<\/strong> a free parking space from video captured by a car.<\/li><\/ul>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>All of these examples use <strong>complex, task-specific deep learning models<\/strong> to predict the desired output as accurately as possible (correctly translated sentences, permission or refusal to open the door, the location of a free parking space, &#8230;). These models take <strong>only one modality as input at a time<\/strong>, which is why we call them <strong>unimodal<\/strong>.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>A modality refers to a particular way or mechanism of encoding information. <strong>Examples of different modalities are images, video, text, and audio.<\/strong><\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>In the real world, however, these inputs usually occur together. Processing them jointly can give us more reliable information about the world. An example is shown in Figure 1. 
If we want to collect feedback from cinema visitors automatically, we can focus on their facial expressions, ask them and write down their opinion, or do both. When the input from one modality is ambiguous, we can rely on the input from the other.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-image\">\n<figure class=\"wp-block-image is-resized\"><img decoding=\"async\" data-src=\"https:\/\/lh5.googleusercontent.com\/RIEWUUAGd05tHsNhnMjuwJjMv7gxrHLsOosimEFhdSp_6DzcEi9Q97xycaSl9hHOukJBFdVwy5XOmhdmweQTUpJVbu1lhpCC_45FQJn0fg6yLo9hOklOsiS6qEuBga-17gGThXYHcIN61LbY6RDZSb3-T-hBps13JkGySaxharelwFOOXbeF-4WDjMwTVQ\" alt=\"\" width=\"467\" height=\"433\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" class=\"lazyload\" style=\"--smush-placeholder-width: 467px; --smush-placeholder-aspect-ratio: 467\/433;\" \/><figcaption><em><strong>Figure 1: <\/strong>The advantages of multimodal models over unimodal ones, illustrated on sentiment analysis. A unimodal model uses only one modality at a time to predict sentiment. The blue input represents text, the orange input a facial expression or image, and the yellow input an audio recording. A bimodal model uses two modalities and a trimodal model uses three modalities to predict sentiment. 
It is clear how using several modalities helps to predict the feedback of cinema visitors more accurately.<br>Source: <a href=\"https:\/\/arxiv.org\/pdf\/1707.07250.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">[1]<\/a><\/em><\/figcaption><\/figure>\n<\/div>\n\n<div id=\"\" class=\"element core-heading\">\n<h3 class=\"wp-block-heading\">What is multimodality, and when is a model multimodal?<\/h3>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>Multimodality in machine learning occurs when two or more inputs that are recorded on different types of media, and that cannot be unambiguously mapped onto one another by an algorithm, are processed by the same machine learning model. This means that a deep learning <strong>model<\/strong> that <strong>processes video and written text<\/strong> is <strong>multimodal. A model<\/strong> that <strong>processes images<\/strong> in PDF and JPG formats (which can be converted from one format to the other without loss of information) is <strong>unimodal<\/strong>.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p><strong>Humans are naturally good at understanding and combining several modalities at once without even being aware of it<\/strong>. While watching a film, we can localize objects and recognize scenes or activities. We can read subtitles, perceive the relationships between characters, and focus on the meaning of the words they speak. We can grasp emotions or the importance of a situation from, for example, the intensity and emphasis of a voice. 
All of these inputs and modalities are processed in our brain at the same time, and we form meaningful and comprehensible ideas from them. Put simply, <strong>the human brain is multimodal<\/strong>.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>At KInIT, we have started to focus on modelling vision and language jointly, more specifically on <strong>processing images and text with a single model<\/strong>. This area of multimodal processing has many applications, such as assisting people with visual impairments, making healthcare more efficient through automatic descriptions of X-rays or CT scans, or understanding images posted on social networks and banning those with inappropriate content.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>We are currently launching the <a href=\"https:\/\/kinit.sk\/project\/disai-improving-scientific-excellence-of-kinit\/\">DisAI<\/a> <strong>project under the Horizon Europe programme<\/strong>. The project focuses on <strong>combating disinformation with artificial intelligence<\/strong>, which includes research on disinformation that combines images and text.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-heading\">\n<h3 class=\"wp-block-heading\">The evolution of models for joint image and text processing: from statistical models to transformers<\/h3>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>Research on joint vision-and-language processing has improved greatly in recent years. 
Among the first attempts at such models, before neural networks, were <strong>statistical algorithms<\/strong> such as canonical correlation analysis. Canonical correlation analysis finds a joint representation of two sets of variables (for example, image features and text features) by searching for linear combinations of each set that are maximally correlated with one another.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p><strong>With the rise of neural networks<\/strong>, more sophisticated methods were introduced. The first approach combined a <strong>CNN<\/strong> (<strong>image processing<\/strong> with a convolutional neural network) with an <strong>LSTM or another embedding technique<\/strong> (<strong>text processing<\/strong>, for example with a recurrent neural network), fusing the two by concatenation, element-wise multiplication, or, later, an attention mechanism. One of these methods is shown in Figure 2.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-image\">\n<figure class=\"wp-block-image is-resized\"><img decoding=\"async\" data-src=\"https:\/\/lh6.googleusercontent.com\/13IJfdRF0A_GLjy07ijXtp9BJB2oZ6hqyTKls3epZ97hxA6RJFKBBXqc9v36nl4FkvJhHN7VVf9Euxifcj5-RSU-xAFamlm4qKc-Lwd-cJJp6cgxPUlP9LpcIoVLSzlzcqIMvn8bHmen_5g76U7OYImwH87F-zW-3Jgdva6qILUm2kHm-DDzzdQ5JWCzsA\" alt=\"\" width=\"627\" height=\"365\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" class=\"lazyload\" style=\"--smush-placeholder-width: 627px; --smush-placeholder-aspect-ratio: 627\/365;\" \/><figcaption><em><strong>Figure 2:<\/strong> An example of one of the first multimodal representations built with neural networks. 
The green model shows image processing with a CNN, while the blue model uses the skip-gram approach to obtain text features. The two representations are then concatenated to form a multimodal word vector.<br>Source: <a href=\"https:\/\/aclanthology.org\/D14-1005.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">[2]<\/a><\/em><\/figcaption><\/figure>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p><\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>After Vaswani et al. introduced the <strong>transformer architecture<\/strong> <a href=\"https:\/\/papers.nips.cc\/paper\/2017\/file\/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">[3]<\/a>, which was hugely successful and achieved state-of-the-art results (the best compared with other models) on NLP tasks, the attention mechanism also began to be used for joint language and vision processing.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>There are two types of transformers for modelling cross-modal interaction: <strong>single-stream and dual-stream<\/strong>.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>A single-stream transformer uses a BERT-like architecture. The vectors representing the text and the vectors representing the image (with special embeddings that mark position, such as the order of a word in the sentence) are concatenated into a single representation, which is fed into the transformer encoder. 
Examples of such models are VisualBERT <a href=\"https:\/\/arxiv.org\/pdf\/1908.03557.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">[4]<\/a>, VL-BERT <a href=\"https:\/\/arxiv.org\/pdf\/1908.08530.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">[5]<\/a>, and OSCAR <a href=\"https:\/\/arxiv.org\/pdf\/2004.06165.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">[6]<\/a>.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>Dual-stream transformers, on the other hand, first process the two representations with separate transformers and then combine them with cross-attention, where the query vectors come from one modality while the key and value vectors come from the other. Examples of such models are ViLBERT <a href=\"https:\/\/proceedings.neurips.cc\/paper\/2019\/file\/c74d97b01eae257e44aa9d5bade97baf-Paper.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">[7]<\/a> and LXMERT <a href=\"https:\/\/arxiv.org\/pdf\/1908.07490.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">[8]<\/a>. 
The difference between the single-stream and dual-stream architectures is shown in Figure 3.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-image\">\n<figure class=\"wp-block-image is-resized\"><img decoding=\"async\" data-src=\"https:\/\/lh3.googleusercontent.com\/SvMQGNKpK4dUAJTJzVlGNX-ASiOVTqFU5URKSLDCkXBqgdWTsV1mojfbsZZWw4SirnXrrGDIv5p6BZ23wV2qHlG8Xui_srzH5KYhiiaOVGGLj_sK4kD15plDrq9hGCtsCpgZYeavKTQ6hBrWPdgiYuf-JvSHI_TMhDRK7BzOz1pji73P9VZJ4MXaYEg9Yw\" alt=\"\" width=\"759\" height=\"218\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" class=\"lazyload\" style=\"--smush-placeholder-width: 759px; --smush-placeholder-aspect-ratio: 759\/218;\" \/><figcaption><em><strong>Figure 3<\/strong>: Comparison of a single-stream (left) and a dual-stream (right) transformer for image and text processing.<br>Source: <a href=\"https:\/\/arxiv.org\/pdf\/2005.07310.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">[10]<\/a><\/em><\/figcaption><\/figure>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p><\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>Besides these transformers, there are also so-called <strong>dual encoders<\/strong>, which use two separate encoders, each processing one modality on its own. The resulting representations are then projected into a shared semantic space, where a similarity score between pairs is computed with an attention mechanism or a dot product and maximized for matching pairs. The representation of a caption and the representation of the image it describes end up close together in this space, while the representation of an unrelated image ends up far away. 
The best-known dual encoder is CLIP <a href=\"http:\/\/proceedings.mlr.press\/v139\/radford21a\/radford21a.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">[11]<\/a>, whose pre-training is shown in Figure 4. More detailed descriptions of the differences, advantages, and disadvantages of the individual approaches to processing modalities can be found, for example, in this survey paper <a href=\"https:\/\/arxiv.org\/pdf\/2202.10936.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">[12]<\/a>.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-image\">\n<figure class=\"wp-block-image is-resized\"><img decoding=\"async\" data-src=\"https:\/\/lh5.googleusercontent.com\/gU6Zsp13u2JHWPw58UfhPR11U21qmzkRhZPhUX7BM2-HfbQa3uabEQIlAoFyQ9nFUZqrPTk6XAa6RVeFt0EnMctnFli_-UnyE0IhX4vtsbEzJg8s8wKPDX7sPty9H_JrU6KiQdLPYlG7zuNIfz5YHqPVaVxyg9PXFiA7riwZetY59rd6acX8nC5uMkV6aQ\" alt=\"\" width=\"631\" height=\"410\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" class=\"lazyload\" style=\"--smush-placeholder-width: 631px; --smush-placeholder-aspect-ratio: 631\/410;\" \/><figcaption><em><strong>Figure 4:<\/strong> The CLIP architecture: the image and the text are first encoded separately, then the representations are projected into the same semantic space, where the similarity of each image-text pair is computed as a dot product. 
The similarity score is maximized for captions and images that match (on the diagonal) and minimized for pairs that do not.<br>Source: <a href=\"http:\/\/proceedings.mlr.press\/v139\/radford21a\/radford21a.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">[11]<\/a><\/em><\/figcaption><\/figure>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p><\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-heading\">\n<h3 class=\"wp-block-heading\">What are image-and-text models used for?<\/h3>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>There are many more models than those mentioned above, but their architecture is often specific to the task they are used for. The scientific literature describes a range of more or less demanding vision-and-language tasks. The first, very popular task is <strong>image captioning<\/strong>, where the main goal is to generate a meaningful and grammatically correct description of the whole image. A more complex task is <strong>visual storytelling<\/strong>, where, for a sequence of several images, a description of each must be generated in order so that together they form a short story.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>The next two tasks concern referring expressions. These are phrases made up mostly of nouns and attributes that unambiguously identify a given object or person in an image (for example, the woman in the red hat next to the man with the dog). 
They can be generated (<strong>referring expression generation<\/strong>): an object in the image is selected and an unambiguous phrase is produced for it. The task can also be reversed (<strong>referring expression comprehension<\/strong>): given an image and a phrase, the model must find the location of the object or person.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>Another well-known task is <strong>visual question answering<\/strong>. The answer may be chosen from several options, or the question may be open-ended and the answer generated word by word. A harder variant additionally requires justifying the answers, because very sophisticated questions demand reasoning about the visual world. A natural extension of question answering (just as visual storytelling extends image captioning) is <strong>visual dialogue<\/strong>, where the model answers questions while remembering the previous answers and linking them into a dialogue.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>A very interesting, though less frequently discussed, task is <strong>visual entailment<\/strong>. 
The model must decide whether a sentence, or rather a hypothesis about the image, is fully supported by the scene shown in the image, fully contradicts it, or cannot be decided either way.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>In recent years, <strong>image generation<\/strong> has become a very popular task. It is the inverse of image captioning: given a text description, the model must generate a new, plausible image. Results are more accurate than ever before, thanks in part to the new DALL-E 2 model <a href=\"https:\/\/arxiv.org\/pdf\/2204.06125.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">[13]<\/a>.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>Of course, there are other tasks that involve both images and text. They are covered in various survey papers (such as this one <a href=\"https:\/\/arxiv.org\/pdf\/1907.09358.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">[14]<\/a>); the tasks above are nevertheless the best known.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-heading\">\n<h3 class=\"wp-block-heading\">Existing problems and open questions in vision-and-language processing<\/h3>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>Even with the best task-specific transformers, however, we are still far from a perfect model that understands images and text at the same time. 
Various <strong>open problems<\/strong> in this area stand in the way, which is why they are currently the subject of intensive research.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>One of them is that <strong>model size<\/strong> (the number of parameters) and <strong>dataset size<\/strong> <strong>are growing at an incredible pace<\/strong>. As a result, some researchers can no longer train competitive models because they lack access to sufficient computing power. This approach (more parameters for higher model accuracy) is also <a href=\"https:\/\/kinit.sk\/sk\/vplyv-umelej-inteligencie-na-uhlikovu-stopu\/\">hard on the environment<\/a>, since training such models requires ever more energy.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>Another problem is known as <strong>object hallucination<\/strong>. It occurs, for example, in image captioning, when the generated caption contains a word for an object that is not in the image at all. This happens because the model is used to seeing the object in a given context and relies on that learned context more than on the actual visual input. A related problem is the <strong>difficulty of evaluating generated text<\/strong>, since one image can have several different but correct captions.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>Other problems are worth mentioning as well. 
For example, <strong>datasets often contain statistical biases.<\/strong> Tasks that should require information from both modalities equally become solvable because models exploit a bias in the data of a single modality and make predictions based on that modality alone.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>Another problem, which we have already touched on, is that large <strong>transformers have hundreds of millions of parameters.<\/strong> This makes them <strong>opaque<\/strong> to humans, and their decisions and results cannot be explained directly. They also struggle with <strong>generalization<\/strong>: models learn only what they see in the training set, so they often cannot apply their knowledge in a different setting.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p>Bias, <a href=\"https:\/\/kinit.sk\/sk\/research\/explainable-artificial-intelligence\/\">explainability<\/a>, and robustness of models are some of the topics we work on at KInIT, alongside deploying models directly for practical use, for example in the fight against disinformation mentioned above.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-heading\">\n<h4 class=\"wp-block-heading\">References:<\/h4>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p><a href=\"https:\/\/arxiv.org\/pdf\/1707.07250.pdf\">[1]<\/a> Zadeh, A., Chen, M., Poria, S., Cambria, E., &amp; Morency, L. P. (2017). Tensor fusion network for multimodal sentiment analysis. 
arXiv preprint arXiv:1707.07250.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p><a href=\"https:\/\/aclanthology.org\/D14-1005.pdf\">[2]<\/a> Kiela, D., &amp; Bottou, L. (2014, October). Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 36-45).<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p><a href=\"https:\/\/papers.nips.cc\/paper\/2017\/file\/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf\">[3]<\/a> Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., &#8230; &amp; Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p><a href=\"https:\/\/arxiv.org\/pdf\/1908.03557.pdf\">[4]<\/a> Li, L. H., Yatskar, M., Yin, D., Hsieh, C. J., &amp; Chang, K. W. (2019). VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p><a href=\"https:\/\/arxiv.org\/pdf\/1908.08530.pdf\">[5]<\/a> Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., &amp; Dai, J. (2019). VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p><a href=\"https:\/\/arxiv.org\/pdf\/2004.06165.pdf\">[6]<\/a> Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., &#8230; &amp; Gao, J. (2020, August). Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision (pp. 121-137). Springer, Cham.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p><a href=\"https:\/\/proceedings.neurips.cc\/paper\/2019\/file\/c74d97b01eae257e44aa9d5bade97baf-Paper.pdf\">[7]<\/a> Lu, J., Batra, D., Parikh, D., &amp; Lee, S. 
(2019). ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 32.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p><a href=\"https:\/\/arxiv.org\/pdf\/1908.07490.pdf\">[8]<\/a> Tan, H., &amp; Bansal, M. (2019). LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p><a href=\"https:\/\/arxiv.org\/pdf\/1909.11942.pdf\">[9]<\/a> Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., &amp; Soricut, R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p><a href=\"https:\/\/arxiv.org\/pdf\/2005.07310.pdf\">[10]<\/a> Cao, J., Gan, Z., Cheng, Y., Yu, L., Chen, Y. C., &amp; Liu, J. (2020, August). Behind the scene: Revealing the secrets of pre-trained vision-and-language models. In European Conference on Computer Vision (pp. 565-580). Springer, Cham.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p><a href=\"http:\/\/proceedings.mlr.press\/v139\/radford21a\/radford21a.pdf\">[11]<\/a> Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., &#8230; &amp; Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763). PMLR.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p><a href=\"https:\/\/arxiv.org\/pdf\/2202.10936.pdf\">[12]<\/a> Du, Y., Liu, Z., Li, J., &amp; Zhao, W. X. (2022). A survey of vision-language pre-trained models. arXiv preprint arXiv:2202.10936.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p><a href=\"https:\/\/arxiv.org\/pdf\/2204.06125.pdf\">[13]<\/a> Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., &amp; Chen, M. (2022). 
Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.<\/p>\n<\/div>\n\n<div id=\"\" class=\"element core-paragraph\">\n<p><a href=\"https:\/\/arxiv.org\/pdf\/1907.09358.pdf\">[14]<\/a> Mogadala, A., Kalimuthu, M., &amp; Klakow, D. (2021). Trends in integration of vision and language research: A survey of tasks, datasets, and methods. Journal of Artificial Intelligence Research, 71, 1183-1317.<\/p>\n<\/div>","protected":false},"excerpt":{"rendered":"<p>Artificial intelligence has become a hot topic in many industries in recent years, as its use grows every day. Although we are still far from a general artificial intelligence that would be&#8230;<\/p>\n","protected":false},"author":26,"featured_media":22739,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[76,88,349],"tags":[187],"class_list":["post-22748","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-natural-language-processing-sk","category-pop-science-sk","category-2022-sk","tag-nlp-sk"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.5 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Multimodal processing: Can artificial intelligence learn the meaning of several different modalities and the relationships between them? - KInIT<\/title>\n<meta name=\"description\" content=\"Multimodal processing is one of our topics at KInIT because it currently poses a significant challenge in the research world. We believe that multimodality is the future of artificial intelligence, so we focus on it in various areas of research, for example in tackling disinformation. 
Tento \u010dl\u00e1nok vysvet\u013euje, \u010do to vlastne multimod\u00e1lne spracovanie je. Poskytuje tie\u017e preh\u013ead o tom, ako AI v s\u00fa\u010dasnosti spracov\u00e1va obr\u00e1zky a text, a opisuje prek\u00e1\u017eky multimod\u00e1lneho spracovania.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/kinit.sk\/sk\/multimodalne-spracovanie\/\" \/>\n<meta property=\"og:locale\" content=\"sk_SK\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Multimod\u00e1lne spracovanie: Dok\u00e1\u017ee sa umel\u00e1 inteligencia nau\u010di\u0165 v\u00fdznam a vz\u0165ahy medzi viacer\u00fdmi r\u00f4znymi modalitami? - KInIT\" \/>\n<meta property=\"og:description\" content=\"Multimod\u00e1lne spracovanie je jednou z na\u0161ich t\u00e9m v KInIT, preto\u017ee v s\u00fa\u010dasnosti predstavuje v\u00fdznamn\u00fa v\u00fdzvu vo svete v\u00fdskumu. Ver\u00edme, \u017ee multimodalita je bud\u00facnos\u0165ou umelej inteligencie, preto sa na \u0148u zameriavame v r\u00f4znych oblastiach v\u00fdskumu, napr. v oblasti rie\u0161enia dezinform\u00e1ci\u00ed. Tento \u010dl\u00e1nok vysvet\u013euje, \u010do to vlastne multimod\u00e1lne spracovanie je. 
Poskytuje tie\u017e preh\u013ead o tom, ako AI v s\u00fa\u010dasnosti spracov\u00e1va obr\u00e1zky a text, a opisuje prek\u00e1\u017eky multimod\u00e1lneho spracovania.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/kinit.sk\/sk\/multimodalne-spracovanie\/\" \/>\n<meta property=\"og:site_name\" content=\"KInIT\" \/>\n<meta property=\"article:published_time\" content=\"2022-11-07T10:42:09+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2022-11-07T10:54:07+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/kinit.sk\/wp-content\/uploads\/2022\/11\/202209_web_news_multimodal_processing_Feature.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1500\" \/>\n\t<meta property=\"og:image:height\" content=\"785\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Marianna Palkova\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@kinit\" \/>\n<meta name=\"twitter:site\" content=\"@kinit\" \/>\n<meta name=\"twitter:label1\" content=\"Autor\" \/>\n\t<meta name=\"twitter:data1\" content=\"Marianna Palkova\" \/>\n\t<meta name=\"twitter:label2\" content=\"Predpokladan\u00fd \u010das \u010d\u00edtania\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 min\u00fat\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/kinit.sk\\\/sk\\\/multimodalne-spracovanie\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/kinit.sk\\\/sk\\\/multimodalne-spracovanie\\\/\"},\"author\":{\"name\":\"Marianna Palkova\",\"@id\":\"https:\\\/\\\/kinit.sk\\\/#\\\/schema\\\/person\\\/8b175aaaf3267b5bbbbb97e4a6db8cea\"},\"headline\":\"Multimod\u00e1lne spracovanie: Dok\u00e1\u017ee sa umel\u00e1 inteligencia nau\u010di\u0165 v\u00fdznam a vz\u0165ahy medzi viacer\u00fdmi r\u00f4znymi 
modalitami?\",\"datePublished\":\"2022-11-07T10:42:09+00:00\",\"dateModified\":\"2022-11-07T10:54:07+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/kinit.sk\\\/sk\\\/multimodalne-spracovanie\\\/\"},\"wordCount\":2646,\"image\":{\"@id\":\"https:\\\/\\\/kinit.sk\\\/sk\\\/multimodalne-spracovanie\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/kinit.sk\\\/wp-content\\\/uploads\\\/2022\\\/11\\\/202209_web_news_multimodal_processing_Feature.jpg\",\"keywords\":[\"nlp\"],\"articleSection\":[\"Natural Language Processing\",\"Pop science\",\"2022\"],\"inLanguage\":\"sk-SK\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/kinit.sk\\\/sk\\\/multimodalne-spracovanie\\\/\",\"url\":\"https:\\\/\\\/kinit.sk\\\/sk\\\/multimodalne-spracovanie\\\/\",\"name\":\"Multimod\u00e1lne spracovanie: Dok\u00e1\u017ee sa umel\u00e1 inteligencia nau\u010di\u0165 v\u00fdznam a vz\u0165ahy medzi viacer\u00fdmi r\u00f4znymi modalitami? - KInIT\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/kinit.sk\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/kinit.sk\\\/sk\\\/multimodalne-spracovanie\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/kinit.sk\\\/sk\\\/multimodalne-spracovanie\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/kinit.sk\\\/wp-content\\\/uploads\\\/2022\\\/11\\\/202209_web_news_multimodal_processing_Feature.jpg\",\"datePublished\":\"2022-11-07T10:42:09+00:00\",\"dateModified\":\"2022-11-07T10:54:07+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/kinit.sk\\\/#\\\/schema\\\/person\\\/8b175aaaf3267b5bbbbb97e4a6db8cea\"},\"description\":\"Multimod\u00e1lne spracovanie je jednou z na\u0161ich t\u00e9m v KInIT, preto\u017ee v s\u00fa\u010dasnosti predstavuje v\u00fdznamn\u00fa v\u00fdzvu vo svete v\u00fdskumu. Ver\u00edme, \u017ee multimodalita je bud\u00facnos\u0165ou umelej inteligencie, preto sa na \u0148u zameriavame v r\u00f4znych oblastiach v\u00fdskumu, napr. v oblasti rie\u0161enia dezinform\u00e1ci\u00ed. 
Tento \u010dl\u00e1nok vysvet\u013euje, \u010do to vlastne multimod\u00e1lne spracovanie je. Poskytuje tie\u017e preh\u013ead o tom, ako AI v s\u00fa\u010dasnosti spracov\u00e1va obr\u00e1zky a text, a opisuje prek\u00e1\u017eky multimod\u00e1lneho spracovania.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/kinit.sk\\\/sk\\\/multimodalne-spracovanie\\\/#breadcrumb\"},\"inLanguage\":\"sk-SK\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/kinit.sk\\\/sk\\\/multimodalne-spracovanie\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"sk-SK\",\"@id\":\"https:\\\/\\\/kinit.sk\\\/sk\\\/multimodalne-spracovanie\\\/#primaryimage\",\"url\":\"https:\\\/\\\/kinit.sk\\\/wp-content\\\/uploads\\\/2022\\\/11\\\/202209_web_news_multimodal_processing_Feature.jpg\",\"contentUrl\":\"https:\\\/\\\/kinit.sk\\\/wp-content\\\/uploads\\\/2022\\\/11\\\/202209_web_news_multimodal_processing_Feature.jpg\",\"width\":1500,\"height\":785},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/kinit.sk\\\/sk\\\/multimodalne-spracovanie\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/kinit.sk\\\/sk\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Pop science\",\"item\":\"https:\\\/\\\/kinit.sk\\\/sk\\\/category\\\/pop-science-sk\\\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Multimod\u00e1lne spracovanie: Dok\u00e1\u017ee sa umel\u00e1 inteligencia nau\u010di\u0165 v\u00fdznam a vz\u0165ahy medzi viacer\u00fdmi r\u00f4znymi modalitami?\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/kinit.sk\\\/#website\",\"url\":\"https:\\\/\\\/kinit.sk\\\/\",\"name\":\"KInIT\",\"description\":\"Vyu\u017e\u00edvame v\u00fdskum pre \u013eud\u00ed a 
priemysel\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/kinit.sk\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"sk-SK\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/kinit.sk\\\/#\\\/schema\\\/person\\\/8b175aaaf3267b5bbbbb97e4a6db8cea\",\"name\":\"Marianna Palkova\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Multimod\u00e1lne spracovanie: Dok\u00e1\u017ee sa umel\u00e1 inteligencia nau\u010di\u0165 v\u00fdznam a vz\u0165ahy medzi viacer\u00fdmi r\u00f4znymi modalitami? - KInIT","description":"Multimod\u00e1lne spracovanie je jednou z na\u0161ich t\u00e9m v KInIT, preto\u017ee v s\u00fa\u010dasnosti predstavuje v\u00fdznamn\u00fa v\u00fdzvu vo svete v\u00fdskumu. Ver\u00edme, \u017ee multimodalita je bud\u00facnos\u0165ou umelej inteligencie, preto sa na \u0148u zameriavame v r\u00f4znych oblastiach v\u00fdskumu, napr. v oblasti rie\u0161enia dezinform\u00e1ci\u00ed. Tento \u010dl\u00e1nok vysvet\u013euje, \u010do to vlastne multimod\u00e1lne spracovanie je. Poskytuje tie\u017e preh\u013ead o tom, ako AI v s\u00fa\u010dasnosti spracov\u00e1va obr\u00e1zky a text, a opisuje prek\u00e1\u017eky multimod\u00e1lneho spracovania.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/kinit.sk\/sk\/multimodalne-spracovanie\/","og_locale":"sk_SK","og_type":"article","og_title":"Multimod\u00e1lne spracovanie: Dok\u00e1\u017ee sa umel\u00e1 inteligencia nau\u010di\u0165 v\u00fdznam a vz\u0165ahy medzi viacer\u00fdmi r\u00f4znymi modalitami? 
- KInIT","og_description":"Multimod\u00e1lne spracovanie je jednou z na\u0161ich t\u00e9m v KInIT, preto\u017ee v s\u00fa\u010dasnosti predstavuje v\u00fdznamn\u00fa v\u00fdzvu vo svete v\u00fdskumu. Ver\u00edme, \u017ee multimodalita je bud\u00facnos\u0165ou umelej inteligencie, preto sa na \u0148u zameriavame v r\u00f4znych oblastiach v\u00fdskumu, napr. v oblasti rie\u0161enia dezinform\u00e1ci\u00ed. Tento \u010dl\u00e1nok vysvet\u013euje, \u010do to vlastne multimod\u00e1lne spracovanie je. Poskytuje tie\u017e preh\u013ead o tom, ako AI v s\u00fa\u010dasnosti spracov\u00e1va obr\u00e1zky a text, a opisuje prek\u00e1\u017eky multimod\u00e1lneho spracovania.","og_url":"https:\/\/kinit.sk\/sk\/multimodalne-spracovanie\/","og_site_name":"KInIT","article_published_time":"2022-11-07T10:42:09+00:00","article_modified_time":"2022-11-07T10:54:07+00:00","og_image":[{"width":1500,"height":785,"url":"https:\/\/kinit.sk\/wp-content\/uploads\/2022\/11\/202209_web_news_multimodal_processing_Feature.jpg","type":"image\/jpeg"}],"author":"Marianna Palkova","twitter_card":"summary_large_image","twitter_creator":"@kinit","twitter_site":"@kinit","twitter_misc":{"Autor":"Marianna Palkova","Predpokladan\u00fd \u010das \u010d\u00edtania":"10 min\u00fat"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/kinit.sk\/sk\/multimodalne-spracovanie\/#article","isPartOf":{"@id":"https:\/\/kinit.sk\/sk\/multimodalne-spracovanie\/"},"author":{"name":"Marianna Palkova","@id":"https:\/\/kinit.sk\/#\/schema\/person\/8b175aaaf3267b5bbbbb97e4a6db8cea"},"headline":"Multimod\u00e1lne spracovanie: Dok\u00e1\u017ee sa umel\u00e1 inteligencia nau\u010di\u0165 v\u00fdznam a vz\u0165ahy medzi viacer\u00fdmi r\u00f4znymi 
modalitami?","datePublished":"2022-11-07T10:42:09+00:00","dateModified":"2022-11-07T10:54:07+00:00","mainEntityOfPage":{"@id":"https:\/\/kinit.sk\/sk\/multimodalne-spracovanie\/"},"wordCount":2646,"image":{"@id":"https:\/\/kinit.sk\/sk\/multimodalne-spracovanie\/#primaryimage"},"thumbnailUrl":"https:\/\/kinit.sk\/wp-content\/uploads\/2022\/11\/202209_web_news_multimodal_processing_Feature.jpg","keywords":["nlp"],"articleSection":["Natural Language Processing","Pop science","2022"],"inLanguage":"sk-SK"},{"@type":"WebPage","@id":"https:\/\/kinit.sk\/sk\/multimodalne-spracovanie\/","url":"https:\/\/kinit.sk\/sk\/multimodalne-spracovanie\/","name":"Multimod\u00e1lne spracovanie: Dok\u00e1\u017ee sa umel\u00e1 inteligencia nau\u010di\u0165 v\u00fdznam a vz\u0165ahy medzi viacer\u00fdmi r\u00f4znymi modalitami? - KInIT","isPartOf":{"@id":"https:\/\/kinit.sk\/#website"},"primaryImageOfPage":{"@id":"https:\/\/kinit.sk\/sk\/multimodalne-spracovanie\/#primaryimage"},"image":{"@id":"https:\/\/kinit.sk\/sk\/multimodalne-spracovanie\/#primaryimage"},"thumbnailUrl":"https:\/\/kinit.sk\/wp-content\/uploads\/2022\/11\/202209_web_news_multimodal_processing_Feature.jpg","datePublished":"2022-11-07T10:42:09+00:00","dateModified":"2022-11-07T10:54:07+00:00","author":{"@id":"https:\/\/kinit.sk\/#\/schema\/person\/8b175aaaf3267b5bbbbb97e4a6db8cea"},"description":"Multimod\u00e1lne spracovanie je jednou z na\u0161ich t\u00e9m v KInIT, preto\u017ee v s\u00fa\u010dasnosti predstavuje v\u00fdznamn\u00fa v\u00fdzvu vo svete v\u00fdskumu. Ver\u00edme, \u017ee multimodalita je bud\u00facnos\u0165ou umelej inteligencie, preto sa na \u0148u zameriavame v r\u00f4znych oblastiach v\u00fdskumu, napr. v oblasti rie\u0161enia dezinform\u00e1ci\u00ed. Tento \u010dl\u00e1nok vysvet\u013euje, \u010do to vlastne multimod\u00e1lne spracovanie je. 
Poskytuje tie\u017e preh\u013ead o tom, ako AI v s\u00fa\u010dasnosti spracov\u00e1va obr\u00e1zky a text, a opisuje prek\u00e1\u017eky multimod\u00e1lneho spracovania.","breadcrumb":{"@id":"https:\/\/kinit.sk\/sk\/multimodalne-spracovanie\/#breadcrumb"},"inLanguage":"sk-SK","potentialAction":[{"@type":"ReadAction","target":["https:\/\/kinit.sk\/sk\/multimodalne-spracovanie\/"]}]},{"@type":"ImageObject","inLanguage":"sk-SK","@id":"https:\/\/kinit.sk\/sk\/multimodalne-spracovanie\/#primaryimage","url":"https:\/\/kinit.sk\/wp-content\/uploads\/2022\/11\/202209_web_news_multimodal_processing_Feature.jpg","contentUrl":"https:\/\/kinit.sk\/wp-content\/uploads\/2022\/11\/202209_web_news_multimodal_processing_Feature.jpg","width":1500,"height":785},{"@type":"BreadcrumbList","@id":"https:\/\/kinit.sk\/sk\/multimodalne-spracovanie\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/kinit.sk\/sk\/"},{"@type":"ListItem","position":2,"name":"Pop science","item":"https:\/\/kinit.sk\/sk\/category\/pop-science-sk\/"},{"@type":"ListItem","position":3,"name":"Multimod\u00e1lne spracovanie: Dok\u00e1\u017ee sa umel\u00e1 inteligencia nau\u010di\u0165 v\u00fdznam a vz\u0165ahy medzi viacer\u00fdmi r\u00f4znymi modalitami?"}]},{"@type":"WebSite","@id":"https:\/\/kinit.sk\/#website","url":"https:\/\/kinit.sk\/","name":"KInIT","description":"Vyu\u017e\u00edvame v\u00fdskum pre \u013eud\u00ed a priemysel","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/kinit.sk\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"sk-SK"},{"@type":"Person","@id":"https:\/\/kinit.sk\/#\/schema\/person\/8b175aaaf3267b5bbbbb97e4a6db8cea","name":"Marianna 
Palkova"}]}},"_links":{"self":[{"href":"https:\/\/kinit.sk\/sk\/wp-json\/wp\/v2\/posts\/22748","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/kinit.sk\/sk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/kinit.sk\/sk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/kinit.sk\/sk\/wp-json\/wp\/v2\/users\/26"}],"replies":[{"embeddable":true,"href":"https:\/\/kinit.sk\/sk\/wp-json\/wp\/v2\/comments?post=22748"}],"version-history":[{"count":3,"href":"https:\/\/kinit.sk\/sk\/wp-json\/wp\/v2\/posts\/22748\/revisions"}],"predecessor-version":[{"id":22754,"href":"https:\/\/kinit.sk\/sk\/wp-json\/wp\/v2\/posts\/22748\/revisions\/22754"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/kinit.sk\/sk\/wp-json\/wp\/v2\/media\/22739"}],"wp:attachment":[{"href":"https:\/\/kinit.sk\/sk\/wp-json\/wp\/v2\/media?parent=22748"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/kinit.sk\/sk\/wp-json\/wp\/v2\/categories?post=22748"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/kinit.sk\/sk\/wp-json\/wp\/v2\/tags?post=22748"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}