When Apple introduced App Store in 2008, developers’ attention moved from web-based to native mobile apps.
A few years later the app market stabilized. Facebook, Amazon, and Google apps dominate in their verticals. Consumers don’t want to install new apps anymore. According to comScore’s mobile app report, most US smartphone owners download zero apps in a typical month, and a “staggering 42% of all app time spent on smartphones occurs on the individual’s single most used app”.
More than half of the time we spend on our phones is talking, texting or in email, according to Experian’s report.
In the end of January, RE-WORK organized a Virtual Assistant Summit, which took place in San Francisco at the same time as RE-WORK Deep Learning Summit.
Craig Villamor wrote a nice overview of key things discussed on the summit.
I didn’t attend these conferences, but I watched a few presentations which RE-Work kindly uploaded to YouTube. I would like to share notes I took while watching these videos. I could have misinterpreted something, so please keep that in mind, and watch original videos for more details.
Deep learning is computationally intensive. Model training and model querying have very different computation complexities. A query phase is fast: you apply a function to a vector of input parameters (forward pass), get results.
Model training is much more intensive. Deep learning requires large training datasets in order to produce good results. Datasets with millions of samples are common now, e.g. ImageNet dataset contains over 1 million images. Training is an iterative process: you do forward pass on each sample of the training set, do backward pass to adjust model parameters, repeat the process a few times (epochs). Thus training requires millions, or even billions more computation than one forward pass, and a model can include billions of parameters to adjust.
It’s interesting that The New York Times published an article about brain cryonics, immortality, connectomics, trans-humanism, and uploading. Kim Suozzi, who died of cancer at age 23, chose to have her brain preserved in hope to get alive sometime in future. One of the options is to scan the brain and map the connections between individual neurons.
“I can see within, say, 40 years that we would have a method to generate a digital replica of a person’s mind,” said Winfried Denk, a director at the Max Planck Institute of Neurobiology in Germany, who has invented one of several mapping techniques.”
“The mapping technique pioneered by Dr. Denk and others involves scanning brains in impossibly thin sheets with an electron microscope. Stacked together on a computer, the scans reveal a three-dimensional map of the connections between each neuron in the tissue, the critical brain anatomy known as the connectome.”
The author doesn’t dive into details of reconstructing a map of neuron connections, though. As Yan LeCunn points out, “connectomics efforts use 3D convolutional nets to analyze the volumetric brain images and to reconstruct the neural circuits.”
As strange as it may sound, neuroscientists use artificial neural networks to reconstruct models of human neural networks. Yet another good use of deep learning techniques.
I’m trying to solve one of recent Kaggle competitions: “ICDM 2015: Drawbridge Cross-Device Connections“. That competition provides data on device/browser usage and asks you to determine which cookies belong to an individual using a device.
The data for this competition is available in two formats: CSV files and SQLite database. A relational database looks more suitable for ad-hoc queries because SQL is a quite powerful tool: you can easily join tables, filter and group data. Though it lacks some of statistical analysis capabilities which you can get in R or other tools specialized in statistics.
SQLite is an awesome technology for small embedded databases, but there are certainly no good GUI applications for querying SQLite databases. I also was afraid that SQLite query execution engine is not very smart, and SQLite dialect of SQL is not rich enough, in comparison to SQL Server or Oracle, so I decided to import SQLite database into Microsoft SQL Server 2012.
A huge part of database related work is to make sure that the data is consistent. In real world data is never ideal, and whenever you need using data from existing data sources, you have to understand what is right and what is wrong there, and know how to circumvent data quality issues. Two most frequent data integrity issues in relational databases are missing date and duplicate data. A record/document is missing if it was not written to the database by an application, or was mistakenly deleted. A record/document is duplicated if it was recorded more than once.
Why does application write same record more than once? A user or an upstream code could send same document twice, and an application doesn’t handle this case. Or a user could send incorrect record the first time, and a corrected one later: an application could be designed to save all records instead of modifying existing ones.
JetBrains introduced Architecture tools in Resharper 8. It lets you visualize dependencies between projects and types.
To start using it, open a solution in Visual Studio and go to Resharper – Architecture menu. “Show Project Dependency Diagram” menu item opens a new tab with graph of project dependencies:
Human civilization history dates thousands of years. Archaeologists strive to investigate fragile clues of former cultures in order to better understand the past. It is not easy to do because many artifacts are partially or completely destroyed.
It would be a mistake to think that only archaeologists deal with information loss. It is as easy to lose information about systems and organizations which were built quite recently, just a few years ago. In “Institutional memory and reverse smuggling”, an engineer tells a story about a petrochemical company where knowledge about plant design and processes was lost after decades of operation, and they had to bring in a former engineer to smuggle the knowledge back to the company.
Even if you had perfect database design from the start, it is likely that requirements changed, and you have to change database schema accordingly. As there is legacy code, there can be legacy data, with schema designed for use cases which are no longer actual.
It is usually harder to fix data design than to fix code design, but it is doable. And as with code, you can apply refactoring techniques: improving the design without change in behavior.
A book “Refactoring Databases: Evolutionary Database Design” describes basics of database refactoring and lists different kinds of changes to relational databases. It starts from explanation why refactoring is a good thing, why it is important to make small incremental changes, and which organizational obstacles you can get on the way of implementing these techniques. You can imagine how hard it can be to make database changes if DBA and application developers are in different teams, and any database change requires coordination between these teams, sometimes using strict change management process. Basically, agile methodologies with their focus on cross-functional teams and short development cycles can help here.
How can you find if your website is fast enough? You would like to know an answer to this question before angry customers start complaining in social networks, and probably even before your code is deployed to the production environment.
There are two kinds of tests which can be useful: individual web page performance tests and load tests.
Performance of a web page is usually measured in a real web browser. One of the popular tools to automate such tests is WebPageTest. WebPageTest starts a new instance of chosen web browser, for example, Chrome, navigates to specified URL and waits for the page to complete loading. Loading here includes receiving HTTP response, CSS styles, JS scripts, images, fonts and all other stuff on the page, then rendering the page and executing scripts. WebPageTest measures time to render, time to full complete, number of HTTP requests and few other metrics; captures a timeline of loading of different resources and opening/closing HTTP connections. It can also capture video of page loading and a screenshot. Then it starts the process again to measure repeat visit: it is usually faster because some resources are already cached by the browser.