Below you can see all the talks held at the Data Science Seminar in the 2022-2023 season. For forthcoming talks, see the main page.
Wednesday, April 12, 2023 at 14:30 (Popa Tatu 18)
Mihai Cucuringu (University of Oxford)
Machine learning on signed networks and time series analysis with applications to finance
Abstract:
We discuss scalable spectral methods for detecting hidden structures in large signed/directed networks, with an eye towards robustness under sampling sparsity and noise perturbation. As an application, we consider the problem of propagating news sentiment in a financial network. When considering the universe of S&P 500 instruments (stocks), only about one third of the instruments have news sentiment released on a typical trading day. This raises the question of how the disseminated news sentiment impacts the remaining set of instruments. We propose fast algorithms for understanding how news sentiment propagates through a financial correlation network. Our approaches are broadly applicable to instances where one has available a sparse signal (e.g., news sentiment for a subset of nodes) and would like to understand how the available signal measurements propagate through the network to the remaining nodes. We formulate this problem as an instance of the group synchronization problem over Z2 with anchor information. Time permitting, we discuss potential extensions that leverage directed graph clustering algorithms from the lead-lag detection literature.
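As a rough illustration of the Z2 synchronization formulation mentioned in the abstract, the sketch below recovers +1/-1 node labels from a symmetric signed matrix by a plain spectral relaxation and uses the anchors only to fix the global sign. This is a simplified stand-in, not the speaker's algorithm; the function name `sync_z2_with_anchors`, the toy matrix, and the anchor format (a dict from node index to known sign) are assumptions made for the example.

```python
import numpy as np


def sync_z2_with_anchors(W, anchors):
    """Estimate +1/-1 labels for all nodes (hypothetical helper).

    W       : symmetric signed matrix, W[i, j] > 0 suggests nodes i and j
              share a sign, W[i, j] < 0 suggests opposite signs.
    anchors : dict {node index: known sign (+1 or -1)}.
    """
    W = np.asarray(W, dtype=float)
    # Spectral relaxation: the eigenvector of the largest eigenvalue of W
    # approximates the maximizer of sum_ij W[i, j] * z[i] * z[j].
    _, eigvecs = np.linalg.eigh(W)  # eigenvalues come back in ascending order
    v = eigvecs[:, -1]

    idx = np.array(list(anchors.keys()))
    known = np.array([anchors[i] for i in idx], dtype=float)
    # An eigenvector is only defined up to a global sign; pick the
    # orientation that agrees best with the anchor labels.
    if np.dot(v[idx], known) < 0:
        v = -v

    z = np.where(v >= 0, 1, -1)
    z[idx] = known.astype(int)  # anchors are kept exactly as given
    return z


if __name__ == "__main__":
    # Toy data: a noiseless signed matrix built from made-up ground-truth signs.
    truth = np.array([1, 1, -1, -1, 1, -1])
    W = np.outer(truth, truth).astype(float)
    np.fill_diagonal(W, 0.0)
    print(sync_z2_with_anchors(W, {0: 1, 2: -1}))
```

In the setting of the abstract, the anchors would play the role of the instruments with observed news sentiment, and the recovered labels would be the inferred sentiment signs of the remaining nodes.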
Thursday, March 16, 2023 at 11:00 (FMI, Hall 214 “Google”)
Eduard C. Drăguț (Temple University)
Continuous, Gradual Entity Mining from Web Data Streams
Abstract:
Named Entity Recognition (NER) is a key component in many intelligent systems, such as knowledge graphs, question answering, information retrieval, and early prediction of emerging events. NER systems have been studied and developed for decades. Nevertheless, NER is a continuous, never-ending learning process because language and its usage evolve over time. For example, the emergence of social media with colloquial user content exposed the limitations of previous state-of-the-art NER systems, which expected long documents written in formal language. In this talk, I present our work on entity mining from microblog streams, where we advocate for continuous, gradual entity mining with revisits. It needs to be continuous because the system stays with a topic for its duration in a social media stream. It is gradual because the system begins with easy instances, which can be labeled with high accuracy, and then gradually labels more challenging instances. The system revisits difficult instances that were encountered ahead of easy instances in a stream. If these three conditions are met, then (near) real-time NER can be achieved over microblogs. I will also introduce our work on recognizing entities that follow or closely resemble a regular expression (regex) pattern, its applications to other (unexpected) domains, and how we use it to seed our work on human-in-the-loop mining.