Quality of ACL “Findings”: analysis of citations

ACL’s “Findings” tend to receive fewer citations than “main conference” proceedings (as measured by proceedings h-index and average citations per paper). Still, Findings have solid citation records and should be considered worthy venues. Findings get cited on par with venues ranked one level below the venue to which they are attached (taking the CORE ranking as reference). For example: while main ACL and EMNLP are A*-ranked, their Findings receive citations comparable to A-ranked venues.

The Association for Computational Linguistics (ACL) nowadays hosts top-level scientific events in the NLP field. Conferences such as ACL, EMNLP, or NAACL are widely recognized by NLP researchers and practitioners, as well as by international conference rankings such as CORE, which ranks the aforementioned venues as A* (the highest category). Thanks to their selective peer-review process, having a paper published in one of these venues is a prestigious achievement for authors, ensuring fair attention to their work. For readers, it guarantees a degree of quality of the presented research.

The recent boom of NLP research, prompted by advances in neural language models, has led to an increased number of submissions to ACL’s (and other NLP) venues. The venues have naturally grown larger (e.g., in 2016, ACL’s long paper proceedings featured 232 papers; in 2024, it was 865 papers), but this growth has its limits, mostly for conference organization reasons (ACL and EMNLP both welcomed more than 3000 attendees in their recent editions).

In 2020, ACL introduced a new series of proceedings, called the Findings of the Association for Computational Linguistics (or just Findings). The series was created to satisfy the increased demand for scientific publishing space in NLP and to accommodate the many still very good papers that just didn’t make it into the main proceedings.

The Findings work as a sort of adjunct (or companion) proceedings and are always attached to an ACL conference (so far: ACL, EMNLP, NAACL, EACL, and AACL). However, unlike other adjunct proceedings series, the Findings do not contain workshop, demo, or industry papers. Instead, they contain full or short papers originally submitted to the main conference, which did not make it through the sieve of reviewers and chairs for the main track, but were still good enough to be included in the companion proceedings.

The papers published in Findings are treated much the same as the main track papers (e.g., they get published in the ACL Anthology), except that they are (typically) not orally presented at the given conference. Like main track papers, they tend to be prized by authors and recognized by the community. Anecdotally, seeing ACL Findings as the publishing venue when searching for papers on Google Scholar reinforces the notion of quality rather than raising suspicion. Some claim that essentially the same level of quality can be expected from both the main proceedings and the Findings.

On the other hand, many researchers remain unconvinced about the quality of papers published in Findings. In ongoing debates, people point to acceptance rates (usually above 30%) and bring up the perceived low quality of other adjunct proceedings series. A general skepticism toward conference papers sometimes adds to this as well (in many academic funding schemes, they do not translate into much money, which in turn biases faculty members against them).

Ultimately, the judgments of Findings quality (at least those we have encountered so far) are mostly based on subjective assessment or anecdotal evidence. There is also no separate grading of Findings in recognized rankings such as CORE.

As we frequently target ACL venues with our own work, we decided to conduct a small scientometric study to quantitatively assess where the ACL Findings stand among ranked venues.

Method

In our study, we compared individual conference proceedings (i.e., collections of main papers from conference editions) on their citation performance.

We examined a selection of conferences in NLP/AI/CS that took place from 2020 to 2023. If a conference took place in a given year, we indexed its papers and, for each, retrieved its current number of citations (using Google Scholar citation data). We did the same for all Findings published within these 4 years.

The selection of conferences was primarily oriented towards ACL venues (ACL, EMNLP, NAACL, EACL, COLING). We also included other major AI/ML conferences (NeurIPS, ICML, IJCAI, ECAI). Finally, we complemented the list with other computer science venues (RecSys, UMAP, Hypertext), mostly for legacy reasons, as we used to publish there in the past. Together, we got a list in which CORE A*, A, and B conferences were all represented. As for the Findings, we identified 9 editions attached to various ACL venues in the 2020-2023 period (4x EMNLP, 3x ACL, 1x NAACL, 1x EACL).

The data scraping took place in late July 2024. We indexed 34,643 papers belonging to 49 individual proceedings. For each proceedings, we computed its h-index and average number of citations per paper. We then compared the proceedings to determine into which general ballpark (A*, A, or B) the Findings would fit with their metrics. It only made sense to compare venues that took place around the same time (with the same opportunity to receive citations); we thus compared venues within a single calendar year.

The h-index is nowadays frequently used as an indicator of scientific dissemination impact. Normally, it is computed for individual researchers, but it can also measure the impact of entire departments if computed over a set of papers published in a given time range. It can likewise be used to measure the impact of scientific venues (e.g., see the 5-year h-indices of computational linguistics conferences by Google Scholar). And it can also characterize the impact of paper collections, such as individual proceedings, as we do here.

The advantage of the h-index is its robustness: neither long tails of little-cited papers nor individual super-cited “unicorns” influence it very much; rather, a consistent impact of a non-trivial number of papers does. Still, the base number of papers in a proceedings affects the achievable h-index: a 1000-paper proceedings has a better chance of reaching an h-index of 50 than a 100-paper one. Therefore, to allow comparison between differently sized venues, we also looked at the average number of citations per paper, even though it is more dependent on the overall distribution of paper citation counts.
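To make the two metrics concrete, here is a minimal Python sketch (the function and variable names are ours for illustration, not taken from the actual scraping code):

```python
# Minimal sketch of the two per-proceedings metrics.
# `counts` is a hypothetical list holding the citation count
# of every paper in one proceedings (one integer per paper).

def h_index(citations: list[int]) -> int:
    """Largest h such that at least h papers have >= h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

def avg_citations(citations: list[int]) -> float:
    """Average number of citations per paper."""
    return sum(citations) / len(citations) if citations else 0.0

# Toy example: a tiny "proceedings" of 6 papers.
counts = [120, 34, 10, 7, 3, 0]
print(h_index(counts))        # 4 (four papers have at least 4 citations)
print(avg_citations(counts))  # 29.0
```

Note how the single highly cited paper (120 citations) dominates the average but barely moves the h-index, which is exactly the robustness discussed above.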

Having said all this, we are aware that citation-based metrics are by no means sufficient to assess the quality of papers, venues, or researchers. Far from it. But they can be pretty indicative, especially when comparing similar entities. And they are definitely better than anecdotal evidence or subjective opinions.

All the code for scraping and metric computation can be found here. Some of the data in the final results (like conference ranks) was added manually.

Results

In this table (see also below), we can see all 49 analyzed proceedings, grouped by year. For each year, the proceedings are sorted according to their h-index. Each main conference proceedings is listed with the CORE rank that was valid at the time (note, for example, the promotion of EMNLP from A to A*, or of RecSys from B to A). Also listed is the number of papers published in each proceedings, which is helpful in interpreting some of the h-index values. As some of the venues are small, it is good to also look at their average citations per paper when comparing them with much larger venues.

When we look at the h-indices of the Findings, we can see that they never match their main proceedings counterparts in the same year. The differences are substantial: more than 30% (except for the most recent EMNLP 2023), with earlier editions reaching a 40% gap. This disproves claims that the main proceedings and Findings are equally impactful. They aren’t.

But that doesn’t automatically mean they are not good. When we look at venues with similar h-indices (or average citations per paper, to account for small venues), we see that the Findings regularly match or surpass well-ranked venues, usually those one level below the rank of their “mother” conference.

If we specifically take only the EMNLP and ACL Findings (7 out of the 9 examined cases, and probably of the most interest to our readership), we can safely say that they perform a class better than B-ranked venues and that they deliver, with a high degree of confidence, at least A-rank performance. They do not, however, match A* venues (at least not the ones in our dataset).

Note: EMNLP officially attained the A* rank only recently (in 2023); however, its performance in previous years was clearly at the A* level, and we treat it as such in our interpretation of the results. Some readers may also know that CORE recently formally demoted COLING from A to B. This doesn’t impact our current analysis; however, in possible future editions, we will likely continue treating COLING as A-ranked if it keeps up its performance.

Limitations

Following the good tradition of ACL papers, which requires a limitations section in each paper, here is one for this blog.

To reiterate, citations are not everything. Even predatory journals tend to capture a lot of citations (some genuine, some through shady practices). Evaluating quality solely based on citations is unwise. The CORE ranking, for instance, also considers other aspects, such as acceptance rates, visibility, or the reputation of conference chairs and organizers. Yet, if we take the citation performance together with the fact that the Findings are curated by the same people as the main conference proceedings, we consider our conclusions reliable.

Our conclusions are based on a rather limited number of analyzed proceedings, especially of the Findings themselves. We feel that four years of history is the bare minimum needed to get some indication of how good the Findings actually are. Looking at the earlier years of our range (2020 and 2021), one sees large differences in the number of citations between different editions of the same venue. For example, EMNLP 2020 currently has 73.17 citations per paper, while EMNLP 2021 has only 41.46. It is clear that even after 3 years, citations are still flowing in, and much of the data may still lie in the future. Interpreting the data for 2022 or 2023 events is thus very preliminary and prone, e.g., to trends that will only be confirmed over a longer period.

Coincidentally, ACL created the Findings during the COVID pandemic, which affected academic publishing and conferences. We are unaware of any systematic influence on our analysis in this regard, but one never knows. Only more years of future data will tell.

More conferences could be included on the A-rank and B-rank side of the analysis, especially for the sake of comparison (to support authors deciding where to publish). However, we feel we included enough to answer the primary question of where the Findings stand.

Scraping the citation data wasn’t straightforward. For non-ACL venues, we needed to look papers up on Google Scholar through title matching and near-similarity filtering. This failed to yield matches for a very small number of papers (in the low tens), scattered randomly across the venues. However, even though we analyzed slightly incomplete data, this should have zero or marginal influence on our metrics.
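For illustration, near-similarity title matching can be as simple as the following standard-library sketch (the actual scraping code may differ; `scholar_hits` is a hypothetical list of candidate titles returned by a Scholar search):

```python
# Simplified sketch of near-similarity title matching using only the
# standard library; the real pipeline may use a different similarity measure.
from difflib import SequenceMatcher

def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't count."""
    return " ".join(title.lower().split())

def best_match(paper_title: str, scholar_hits: list[str], threshold: float = 0.9):
    """Return the most similar candidate title, or None if nothing is close enough."""
    target = normalize(paper_title)
    scored = [
        (SequenceMatcher(None, target, normalize(hit)).ratio(), hit)
        for hit in scholar_hits
    ]
    score, hit = max(scored, default=(0.0, None))
    return hit if score >= threshold else None

hits = ["Attention Is All You Need", "Attention Is Not All You Need"]
print(best_match("attention is  all you need", hits))  # Attention Is All You Need
```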

Future work

At some later point, this analysis should be repeated, possibly with more A and B venues considered.

As a complementary analysis, a study of the overlap between the authors of main proceedings papers and Findings papers could shed more light on the question. If there is a large overlap between “main” and Findings authors, it would support the notion that the Findings are a quality series, with research coming essentially from the same labs. A smaller overlap, on the other hand, would indicate a larger quality difference.
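Such an overlap could be quantified, for instance, with the Jaccard index between author sets; a minimal sketch follows (the measure is our suggestion, and the author names below are made up):

```python
# Rough sketch of the proposed author-overlap analysis via the Jaccard index.
# In practice, the author sets would be parsed from ACL Anthology metadata.

def jaccard(a: set[str], b: set[str]) -> float:
    """|A intersect B| / |A union B|; 1.0 means identical author pools."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

main_authors = {"A. Smith", "B. Jones", "C. Lee", "D. Kumar"}
findings_authors = {"B. Jones", "C. Lee", "E. Novak"}

# A high value would suggest the same labs publish in both tracks.
print(round(jaccard(main_authors, findings_authors), 2))  # 0.4
```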