- A general purpose community detection and network embedding library for research built on NetworkX. - benedekrozemberczki/karateclub
- In recent years there has been a growing public fascination with the complex "connectedness" of modern society. This connectedness is found in many incarnations: in the rapid growth of the Internet and the Web, in the ease with which global communication now takes place, and in the ability of news and information as well as epidemics and financial crises to spread around the world with surprising speed and intensity. These are phenomena that involve networks, incentives, and the aggregate behavior of groups of people; they are based on the links that connect us and the ways in which each of our decisions can have subtle consequences for the outcomes of everyone else. Networks, Crowds, and Markets combines different scientific perspectives in its approach to understanding networks and behavior. Drawing on ideas from economics, sociology, computing and information science, and applied mathematics, it describes the emerging field of study that is growing at the interface of all these areas, addressing fundamental questions about how the social, economic, and technological worlds are connected. The book is based on an inter-disciplinary course that we teach at Cornell. The book, like the course, is designed at the introductory undergraduate level with no formal prerequisites. To support deeper explorations, most of the chapters are supplemented with optional advanced sections.
- The Open Knowledge Network
- Hierarchical community detection algorithm based on the map equation.
- Mobile shell that supports roaming and intelligent local echo. Like SSH secure shell, but allows mobility and more responsive and robust.
- Enabling collaboration and discovery among scientists across all disciplines. The network of scientists will facilitate scholarly discovery. Institutions will participate in the network by installing VIVO, or by providing semantic web-compliant data to the network.
- Gephi is an open-source software for visualizing and analyzing large networks graphs. Gephi uses a 3D render engine to display graphs in real-time and speed up the exploration. Use Gephi to explore, analyse, spatialise, filter, cluterize, manipulate and export all types of graphs.
- Three-Toed Sloth Slow Takes from the Canopy (My Very Own Internet Tradition) June 15, 2007 « Reformatting in Progress | Main | Books to Read While the Algae Grow in Your Fur, May 2007 » So You Think You Have a Power Law — Well Isn't That Special? Regular readers who care about such things — I think there are about three of you — will recall that I have long had a thing about just how unsound many of the claims for the presence of power law distributions in real data are, especially those made by theoretical physicists, who, with some honorable exceptions, learn nothing about data analysis. (I certainly didn't.) I have even whined about how I should really be working on a paper about how to do all this right, rather than merely snarking in a weblog. As evidence that the age of wonders is not passed — and, more relevantly, that I have productive collaborators — this paper is now loosed upon the world: Aaron Clauset, CRS and M. E. J. Newman, "Power-law distributions in empirical data", arxiv:0706.1062, with code available in Matlab and R; forthcoming (2009) in SIAM Review Abstract: Power-law distributions occur in many situations of scientific interest and have significant consequences for our understanding of natural and man-made phenomena. Unfortunately, the empirical detection and characterization of power laws is made difficult by the large fluctuations that occur in the tail of the distribution. In particular, standard methods such as least-squares fitting are known to produce systematically biased estimates of parameters for power-law distributions and should not be used in most circumstances. Here we describe statistical techniques for making accurate parameter estimates for power-law data, based on maximum likelihood methods and the Kolmogorov-Smirnov statistic. We also show how to tell whether the data follow a power-law distribution at all, defining quantitative measures that indicate when the power law is a reasonable fit to the data and when it is not. We demonstrate these methods by applying them to twenty-four real-world data sets from a range of different disciplines. Each of the data sets has been conjectured previously to follow a power-law distribution. In some cases we find these conjectures to be consistent with the data while in others the power law is ruled out. The paper is deliberately aimed at physicists, so we assume some things that they know (like some of the mechanisms, e.g. critical fluctuations, which can lead to power laws), and devote extra detail to things they don't but which e.g. statisticians do know (such as how to find the cumulative distribution function of a standard Gaussian). In particular, we refrained from making a big deal about the need for an error-statistical approach to problems like this, but it definitely shaped our thinking. Aaron has already posted about the paper, but I'll do so myself anyway. Partly this is to help hammer the message home, and partly this is because I am a basically negative and critical man, and this sort of work gives me an excuse to vent my feelings of spite under the pretense of advancing truth (unlike Aaron and Mark, who are basically nice guys and constructive scholars). Here are the take-home points, none of which ought to be news, but which, taken together, would lead to a real change in the literature. (For example, half or more each issue of Physica A would disappear.) 1. Lots of distributions give you straight-ish lines on a log-log plot. True, a Gaussian or a Poisson won't, but lots of other things will. Don't even begin to talk to me about log-log plots which you claim are "piecewise linear". 2. Abusing linear regression makes the baby Gauss cry. Fitting a line to your log-log plot by least squares is a bad idea. It generally doesn't even give you a probability distribution, and even if your data do follow a power-law distribution, it gives you a bad estimate of the parameters. You cannot use the error estimates your regression software gives you, because those formulas incorporate assumptions which directly contradict the idea that you are seeing samples from a power law. And no, you cannot claim that because the line "explains" (really, describes) a lot of the variance that you must have a power law, because you can get a very high R^2 from other distributions (that test has no "power"). And this is without getting into the additional errors caused by trying to fit a line to binned histograms. It's true that fitting lines on log-log graphs is what Pareto did back in the day when he started this whole power-law business, but "the day" was the 1890s. There's a time and a place for being old school; this isn't it. 3. Use maximum likelihood to estimate the scaling exponent. It's fast! The formula is easy! Best of all, it works! The method of maximum likelihood was invented in 1922 [parts 1 and 2], by someone who studied statistical mechanics, no less. The maximum likelihood estimators for the discrete (Zipf/zeta) and continuous (Pareto) power laws were worked out in 1952 and 1957 (respectively). They converge on the correct value of the scaling exponent with probability 1, and they do so efficiently. You can even work out their sampling distribution (it's an inverse gamma) and so get exact confidence intervals. Use the MLEs! 4. Use goodness of fit to estimate where the scaling region begins. Few people pretend that the whole of their data-set follows a power law distribution; usually the claim is about the right or upper tail, the large values over some given threshold. This ought to raise the question of where the tail begins. Usually people find it by squinting at their log-log plot. Mark Handcock and James Jones, in one of the few worthwhile efforts here, suggested using Schwarz's information criterion. This isn't bad, but has trouble with with continuous data. Aaron devised an ingenious method which finds the empirically-best scaling region, by optimizing the Kolmogorov-Smirnov goodness-of-fit statistic; it performs slightly better than the information criterion. (Yes, one could imagine more elaborate semi-parametric approaches to this problem. Feel free to go ahead and implement them.) 5. Use a goodness-of-fit test to check goodness of fit. In particular, if you're looking at the goodness of fit of a distribution, use a statistic meant for distributions, not one for regression curves. This means forgetting about R^2, the fraction of variance accounted for by the curve, and using the Kolmogorov-Smirnov statistic, the maximum discrepancy between the empirical distribution and the theoretical one. If you've got the right theoretical distribution, KS statistic will converge to zero as you get more data (that's the Glivenko-Cantelli theorem). The one hitch in this case is that you can't use the usual tables/formulas for significance levels, because you're estimating the parameters of the power law from the data. (If you really want to see where the problem comes from, see Pollard, starting on p. 99.) This is why God, in Her wisdom and mercy, gave us the bootstrap. If the chance of getting data which fits the estimated distribution as badly as your data fits your power law is, oh, one in a thousand or less, you had better have some other, very compelling reason to think that you're looking at a power law. 6. Use Vuong's test to check alternatives, and be prepared for disappointment. Even if you've estimated the parameters of your parameters properly, and the fit is decent, you're not done yet. You also need to see whether other, non-power-law distributions could have produced the data. This is a model selection problem, with the complication that possibly neither the power law nor the alternative you're looking at is exactly right; in that case you'd at least like to know which one is closer to the truth. There is a brilliantly simple solution to this problem (at least for cases like this) which was first devised by Quang Vuong in a 1989 Econometrica paper: use the log-likelihood ratio, normalized by an estimate of the magnitude of the fluctuations in that ratio. Vuong showed that this test statistic asymptotically has a standard Gaussian distribution when the competing models are equally good; otherwise it will almost surely converge on picking out the better model. This is extremely clever and deserves to be much better known. And, unlike things like the fit to a log-log regression line, it actually has the power to discriminate among the alternatives. If you use sensible, heavy-tailed alternative distributions, like the log-normal or the Weibull (stretched exponential), you will find that it is often very, very hard to rule them out. In the two dozen data sets we looked at, all chosen because people had claimed they followed power laws, the log-normal's fit was almost always competitive with the power law, usually insignificantly better and sometimes substantially better. (To repeat a joke: Gauss is not mocked.) For about half the data sets, the fit is substantially improved by adding an exponential cut-off to the power law. (I'm too lazy to produce the necessary equations; read the paper.) This means that there is a characteristic scale after all, and that super-mega-enormous, as opposed to merely enormous, events are, indeed, exponentially rare. Strictly speaking, a cut-off power law should always fit the data better than a pure one (just let the cut-off scale go to infinity, if need be), so you need to be a little careful in seeing whether the improvement is real or just noise; but often it's real. 7. Ask yourself whether you really care. Maybe you don't. A lot of the time, we think, all that's genuine important is that the tail is heavy, and it doesn't really matter whether it decays linearly in the log of the variable (power law) or quadratically (log-normal) or something else. If that's all that matters, then you should really consider doing some kind of non-parametric density estimation (e.g. Markovitch and Krieger's [preprint]). Sometimes, though, you do care. Maybe you want to make a claim which depends heavily on just how common hugely large observations are. Or maybe you have a particular model in mind for the data-generating process, and that model predicts some particular distribution for the tail. Then knowing whether it really is a power law, or closer to a power law than (say) a stretched exponential, actually matters to you. In that case, you owe it to yourself to do the data analysis right. You also owe it to yourself to think carefully about whether there are other ways of checking your model. If the only testable prediction it makes is about the shape of the tail, it doesn't sound like a very good model, and it will be intrinsically hard to check it. Because this is, of course, what everyone ought to do with a computational paper, we've put our code online, so you can check our calculations, or use these methods on your own data, without having to implement them from scratch. I trust that I will no longer have to referee papers where people use GnuPlot to draw lines on log-log graphs, as though that meant something, and that in five to ten years even science journalists and editors of Wired will begin to get the message. Manual trackbacks: The Statistical Mechanic; Uncertain Principles; zs; LanguageLog; Science After Sunclipse; Philosophia Naturalis 11 (at Highly Allochthonous); Langreiter; blogs for industry ... blogs for the dead; Infectious Greed; No Free Lunch; Look Here First; TPMCafe; Science After Sunclipse; Cosmic Variance; Messy Matters Power Laws; Enigmas of Chance; Complexity Posted by crshalizi at June 15, 2007 13:00 | permanent link Three-Toed Sloth: Hosted, but not endorsed, by the Center for the Study of Complex Systems
- Brad Fitzpatrick recently wrote an elegant and important post about the Social Graph, a term used by Facebook to describe their social network. In his post, Fitzpatrick defines "social graph" as "the global mapping of everybody and how they're related". He went on to outline the problems with it, as well as a broad set of goals going forward. One problem is that currently you need to have different logins for different social networks. Another issue is portability and ownership of an individual's information, explicitly and implicitly revealed while using social networks. As was recently asserted in the Social...
- Think back to the last time you were in a crowded airport or bus terminal far from home. Did you consider that the person sitting next to you probably knew a friend of a friend of a friend of yours? In the 1960s, social psychologist Stanley Milgram’s “small world experiment” famously tested the idea that any two people in the world are separated by only a small number of intermediate connections, arguably the first experimental study to reveal the surprising structure of social networks. With the rise of modern computing, social networks are now being mapped in digital form, giving researchers the ability to study them on a much grander, even global, scale. Continuing this tradition of social network research, Facebook, in collaboration with researchers at the Università degli Studi di Milano, is today releasing two studies of the Facebook social graph. First, we measured how many friends people have, and found that this distribution differs significantly from previous studies of large-scale social networks. Second, we found that the degrees of separation between any two Facebook users is smaller than the commonly cited six degrees, and has been shrinking over the past three years as Facebook has grown. Finally, we observed that while the entire world is only a few degrees away, a user’s friends are most likely to be of a similar age and come from the same country. In our studies, performed earlier this year, we examined all 721 million active Facebook users (more than 10% of the global population), with 69 billion friendships among them. To date, these are the largest social network studies ever released.
- The main research focus of the Social Computing Laboratory at HP Labs is harvesting the collective intelligence of groups of people to optimize the interaction between users and information. HP Labs is the advanced research center for Hewlett-Packard..
- A Revised Taxonomy of Social Networking Data Lately I've been reading about user security and privacy -- control, really -- on social networking sites. The issues are hard and the solutions harder, but I'm seeing a lot of confusion in even forming the questions. Social networking sites deal with several different types of user data, and it's essential to separate them. Below is my taxonomy of social networking data, which I first presented at the Internet Governance Forum meeting last November, and again -- revised -- at an OECD workshop on the role of Internet intermediaries in June. * Service data is the data you give to a social networking site in order to use it. Such data might include your legal name, your age, and your credit-card number. * Disclosed data is what you post on your own pages: blog entries, photographs, messages, comments, and so on. * Entrusted data is what you post on other people's pages. It's basically the same stuff as disclosed data, but the difference is that you don't have control over the data once you post it -- another user does. * Incidental data is what other people post about you: a paragraph about you that someone else writes, a picture of you that someone else takes and posts. Again, it's basically the same stuff as disclosed data, but the difference is that you don't have control over it, and you didn't create it in the first place. * Behavioral data is data the site collects about your habits by recording what you do and who you do it with. It might include games you play, topics you write about, news articles you access (and what that says about your political leanings), and so on. * Derived data is data about you that is derived from all the other data. For example, if 80 percent of your friends self-identify as gay, you're likely gay yourself. There are other ways to look at user data. Some of it you give to the social networking site in confidence, expecting the site to safeguard the data. Some of it you publish openly and others use it to find you. And some of it you share only within an enumerated circle of other users. At the receiving end, social networking sites can monetize all of it: generally by selling targeted advertising. Different social networking sites give users different rights for each data type. Some are always private, some can be made private, and some are always public. Some can be edited or deleted -- I know one site that allows entrusted data to be edited or deleted within a 24-hour period -- and some cannot. Some can be viewed and some cannot. It's also clear that users should have different rights with respect to each data type. We should be allowed to export, change, and delete disclosed data, even if the social networking sites don't want us to. It's less clear what rights we have for entrusted data -- and far less clear for incidental data. If you post pictures from a party with me in them, can I demand you remove those pictures -- or at least blur out my face? (Go look up the conviction of three Google executives in Italian court over a YouTube video.) And what about behavioral data? It's frequently a critical part of a social networking site's business model. We often don't mind if a site uses it to target advertisements, but are less sanguine when it sells data to third parties. As we continue our conversations about what sorts of fundamental rights people have with respect to their data, and more countries contemplate regulation on social networking sites and user data, it will be important to keep this taxonomy in mind. The sorts of things that would be suitable for one type of data might be completely unworkable and inappropriate for another.
- ipcalc takes an IP address and netmask and calculates the resulting broadcast, network, Cisco wildcard mask, and host range. By giving a second netmask, you can design subnets and supernets. It is also intended to be a teaching tool and presents the subnetting results as easy-to-understand binary values. Enter your netmask(s) in CIDR notation (/25) or dotted decimals (255.255.255.0). Inverse netmasks are recognized. If you omit the netmask ipcalc uses the default netmask for the class of your network. Look at the space between the bits of the addresses: The bits before it are the network part of the address, the bits after it are the host part. You can see two simple facts: In a network address all host bits are zero, in a broadcast address they are all set. The class of your network is determined by its first bits. If your network is a private internet according to RFC 1918 this is remarked. When displaying subnets the new bits in the network part of the netmask are marked in a different color The wildcard is the inverse netmask as used for access control lists in Cisco routers. Do you want to split your network into subnets? Enter the address and netmask of your original network and play with the second netmask until the result matches your needs. You can have all this fun at your shell prompt. Originally ipcalc was not intended for creating HTML and still works happily in /usr/local/bin/ :-)

*Proceedings of the 14th International Conference on Knowledge Science, Engineering and Management ,**Springer,*(*2021*)- (
*2019*)*cite arxiv:1907.06902Comment: Source code available at: https://github.com/MaurizioFD/RecSys2019_DeepLearning_Evaluation.* *MIT Press,*(*2016*)*Advances in Neural Information Processing Systems ,**29,**page 3630-3638.**Curran Associates, Inc.,*(*2016*)*PLOS ONE**6 (4): 1--10*(*April 2011*)*Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) ,**page 7516--7533.**Association for Computational Linguistics,*(*November 2020*)*Proceedings of the International Conference on Learning Representations ,*(*2021*)*International Journal on Digital Libraries**19 (4): 323--337*(*Nov 1, 2018*)*World Wide Web**8 (2): 159--178*(*Jun 1, 2005*)*Nature**435 (7039): 207--211*(*2005*)*Proceedings of the 23rd international conference on World Wide Web ,**ACM Press,*(*2014*)- (
*2020*)*cite arxiv:2002.08155Comment: Accepted to Findings of EMNLP 2020. 12 pages.* *Proceedings of the 10th ACM Conference on Electronic Commerce ,**page 325--334.**New York, NY, USA,**ACM,*(*2009*)- (
*2019*)*cite arxiv:1904.01500Comment: NAACL-HLT 2019 Workshop on Evaluating Vector Space Representations for NLP (RepEval).* *Neural Networks*(*2018*)- (
*2018*)*cite arxiv:1801.03400Comment: 14 pages, 9 figures, 2 tables, 5 appendices.* *Proceedings of the International Conference on Web and Social Media ,**AAAI,*(*2012*)*Proceedings of the 24th International Conference on World Wide Web ,**page 1003--1013.**Republic and Canton of Geneva, Switzerland,**International World Wide Web Conferences Steering Committee,*(*2015*)*Journal of the Association for Information Science and Technology**68 (9): 2037--2062*(*2017*)*Journal of Informetrics**9 (4): 809 - 825*(*2015*)