(testing signal)

Category: TRAINING

Definitions and Procedures around Data Science, Machine Learning and Blockchain.

K Means Clustering Project (Pierian Data)

For this project we will attempt to use KMeans clustering to cluster universities into two groups, Private and Public. It is very important to note that we actually have the labels for this data set, but we will NOT use them for the KMeans clustering algorithm, since it is an unsupervised learning algorithm.

Under normal circumstances you use the KMeans algorithm because you don’t have labels. In this case we will use the labels to get an idea of how well the algorithm performed, but you won’t usually do this for KMeans, so the classification report and confusion matrix at the end of this project don’t truly make sense in a real-world setting!
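To make the idea concrete, here is a minimal pure-Python sketch of the k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). The toy points, the `kmeans` and `nearest` helpers and the fixed seed are assumptions for illustration only; the project itself works on the university data set and would normally use a library implementation.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                 for centroid in centroids]
    return distances.index(min(distances))

def kmeans(points, k=2, iterations=20, seed=0):
    """Plain k-means: alternate the assignment step and the update step."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[nearest(point, centroids)].append(point)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(axis) / len(cluster)
                                     for axis in zip(*cluster))
    labels = [nearest(point, centroids) for point in points]
    return centroids, labels

# Hypothetical toy data: two well separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)  # cluster index per point; no true labels were used
```

Note that the cluster numbers (0 or 1) are arbitrary, which is one reason why comparing them against real labels needs care.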

The Data
We will use a data frame with 777 observations on the following 18 variables.…

Logistic Regression: Ad Clicks Model

Source: Pierian Data. In this project we will be working with a fake advertising data set, indicating whether or not a particular internet user clicked on an Advertisement. We will try to create a logistic regression model that will predict whether or not they will click on an ad based off the features of that user.

Logistic Regression

Logistic regression is a method for classification: the problem of identifying which label or category a new observation belongs to, such as spam vs. non-spam email, good vs. bad borrowers, etc.

The most popular model is binary classification, which means the prediction is YES/NO. This is modelled with the Sigmoid Function (SF), whose output is read as a probability. The SF is the key to LR: it maps any continuous number to a value between 0 and 1.

– LR is a method for classification: which label is assigned to a given observation.
– Binary classification: the convention is to have 2 classes, 0 and 1.
– The result is usually a probability, so we assign class 0 if it is below 0.5 and class 1 if it is above 0.5.
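The points above can be sketched in a few lines of Python. The `sigmoid` and `predict_class` names are hypothetical helpers, and the 0.5 cut-off is just the usual convention mentioned above (here a value of exactly 0.5 is assigned to class 1, which is an arbitrary choice).

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use an equivalent form that avoids overflow in exp().
    e = math.exp(z)
    return e / (1.0 + e)

def predict_class(z, threshold=0.5):
    """Turn the probability into class 0 or 1 using the threshold."""
    return 1 if sigmoid(z) >= threshold else 0

for score in (-2.0, 0.0, 2.0):
    print(score, round(sigmoid(score), 3), predict_class(score))
```

In a real LR model the input `z` would be a weighted sum of the features; here it is just a raw number.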

After training the model with LR, the way to evaluate it is with the Confusion Matrix.…
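As a rough illustration of what the confusion matrix counts, the sketch below compares true and predicted 0/1 labels. The label lists are made up and `confusion_matrix` here is a hypothetical hand-written helper, not a library function.

```python
def confusion_matrix(y_true, y_pred):
    """Count the four outcomes of a binary classifier (classes 0 and 1)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["TP"] += 1  # predicted 1, really 1
        elif truth == 0 and pred == 0:
            counts["TN"] += 1  # predicted 0, really 0
        elif truth == 0 and pred == 1:
            counts["FP"] += 1  # predicted 1, really 0
        else:
            counts["FN"] += 1  # predicted 0, really 1
    return counts

# Made-up labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
matrix = confusion_matrix(y_true, y_pred)
accuracy = (matrix["TP"] + matrix["TN"]) / len(y_true)
print(matrix, accuracy)
```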

Linear Regression

In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.

Practical definition in ML

Given a dataset, we want to predict a range of numeric (continuous) values. One or several variables of the dataset predict (are correlated with) a numerical outcome (the future), which is usually another column in the data.…
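For the simplest case (one explanatory variable), the fit can be sketched with ordinary least squares in plain Python. The numbers below are invented and `fit_line` is a hypothetical helper; this is only meant to show how one column is used to predict another.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
predicted = slope * 5.0 + intercept  # prediction for a new x value
print(slope, intercept, predicted)
```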

House price prediction using linear regression

Let’s review a classic model from Pierian Data. Your neighbor is a real estate agent and wants some help predicting housing prices for regions in the USA. It would be great if you could somehow create a model with Python and scikit-learn for her that allows her to put in a few features of a house and returns an estimate of what the house would sell for.

She has asked you if you could help her out with your new data science skills. You say yes, and decide that Linear Regression might be a good path to solve this problem!

Your neighbor then gives you some information about a bunch of houses in regions of the United States. It is all in the data set USA_Housing.csv.

The data contains the following columns:

‘Avg. Area Income’: Avg.…

Consensus Algorithms

A consensus algorithm is a process in computer science used to achieve agreement on a single data value among distributed processes or systems. Consensus algorithms are designed to achieve reliability in a network involving multiple unreliable nodes. As a result, consensus algorithms must be fault-tolerant. Let’s review some of the most popular algorithms:

PROOF OF WORK (PoW)
A proof of work is a piece of data which is difficult (costly, time-consuming) to produce but easy for others to verify and which satisfies certain requirements. Producing a proof of work can be a random process with low probability so that a lot of trial and error is required on average before a valid proof of work is generated.
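The "hard to produce, easy to verify" idea can be sketched with a toy puzzle: find a number (nonce) so that the SHA-256 hash of the data plus the nonce starts with a given number of zeros. This is a simplified illustration under my own assumptions (string concatenation, hex-digit difficulty), not the exact scheme used by Hashcash or Bitcoin.

```python
import hashlib

def hash_with_nonce(data, nonce):
    """SHA-256 of the data followed by the nonce, as a hex string."""
    return hashlib.sha256(f"{data}{nonce}".encode()).hexdigest()

def proof_of_work(data, difficulty=3):
    """Trial and error: try nonces until the hash has enough leading zeros."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hash_with_nonce(data, nonce)
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1

def verify(data, nonce, difficulty=3):
    """Verification is a single hash, however long the search took."""
    return hash_with_nonce(data, nonce).startswith("0" * difficulty)

nonce, digest = proof_of_work("hello block", difficulty=3)
print(nonce, digest, verify("hello block", nonce, difficulty=3))
```

Each extra zero in the prefix multiplies the expected number of attempts by 16, while verification stays a single hash.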

Bitcoin uses the Hashcash proof of work system.…

The DAO attack in June 2016: the Recursive Call

The exit door had a vulnerability known as the “recursive call bug”, which allowed the attacker to drain the Ether from the DAO’s account. Anyone who wished to exit the DAO could do so by sending a request. The splitting function would then follow these two steps:

– Give the user back his/her Ether in exchange for their DAO tokens.
– Register the transaction in the ledger and update the internal token balance.

So how does the hack happen? The hacker implements a recursive function in the request, and this is how the splitting function then runs:

– Take the DAO tokens from the user and give them the Ether requested.
– Before the blockchain could register the transaction, the recursive function made the code go back and transfer more Ether for the same DAO tokens… and so on.…
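The two steps above can be mimicked in plain Python to see why the order matters. This is only a toy simulation: `ToyDAO`, `Attacker` and all the amounts are hypothetical, and the real contract was written in Solidity with different details.

```python
class ToyDAO:
    """Toy model of a contract that pays out BEFORE updating balances."""

    def __init__(self, ether_pool):
        self.ether_pool = ether_pool
        self.tokens = {}

    def deposit(self, user, amount):
        self.tokens[user] = self.tokens.get(user, 0) + amount
        self.ether_pool += amount

    def split(self, user, on_receive):
        amount = self.tokens.get(user, 0)
        if amount > 0 and self.ether_pool >= amount:
            self.ether_pool -= amount
            on_receive(amount)     # step 1: send Ether (runs the receiver's code)
            self.tokens[user] = 0  # step 2: update the balance, too late


class Attacker:
    """Receiver whose callback asks for another split before step 2 runs."""

    def __init__(self, dao, max_depth=3):
        self.dao = dao
        self.max_depth = max_depth
        self.depth = 0
        self.received = 0

    def on_receive(self, amount):
        self.received += amount
        if self.depth < self.max_depth:
            self.depth += 1
            self.dao.split("attacker", self.on_receive)  # re-enter


dao = ToyDAO(ether_pool=100)
dao.deposit("attacker", 10)
attacker = Attacker(dao)
dao.split("attacker", attacker.on_receive)
print(attacker.received, dao.ether_pool)
```

In this toy version, swapping the two steps inside `split` (set the balance to zero first, then pay) stops the repeated withdrawals, because the second call sees a zero balance.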

NLP Natural Language Processing

Main ideas

The process for NLP is always similar to other classification algos:

  • Compile documents. Get the data, which is usually raw text.
  • Featurize documents. Get the text in a format that ML algorithms understand.
  • Compare features for classification. Use ML techniques to build the model.

Unstructured text ⇒ Compile documents ⇒ Featurize them ⇒ Compare features

How does the algorithm work

In NLP the featurization is done through vectorization:

  • Corpus of D documents: a = “The House is Blue”, b = “The House is Red”.
  • Build an index of relevant, meaningful keywords. Eg (house,blue,red)
  • Vectorize documents. Eg a = “The House is Blue” ⇒ (1,1,0), b = “The House is Red” ⇒ (1,0,1)
  • Compare the docs as follows:

Use cosine similarity to compare: similarity docs(a,b) = cos (θ)
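The vectorization and comparison steps above can be sketched as follows, using the two example documents and the (house, blue, red) index. The `vectorize` and `cosine_similarity` helpers are hypothetical names, and the tokenization (lowercase, split on spaces) is a simplifying assumption.

```python
import math

vocabulary = ["house", "blue", "red"]  # index of meaningful keywords

def vectorize(text):
    """1 if the keyword appears in the document, else 0."""
    words = text.lower().split()
    return [1 if term in words else 0 for term in vocabulary]

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|)"""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

a = vectorize("The House is Blue")  # [1, 1, 0]
b = vectorize("The House is Red")   # [1, 0, 1]
similarity = cosine_similarity(a, b)
print(a, b, similarity)  # similarity is about 0.5
```

The two documents share one keyword out of two each, which is why the similarity lands halfway between 0 (nothing in common) and 1 (identical vectors).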

Characterize the terms:
Term Frequency TF(t,d) ⇒ Importance of the term t within doc d
Inverse Doc Frequency IDF(t) = log (D/Dt), where Dt is the number of docs that contain t ⇒ Importance of the term within corpus D
TF-IDF = TF(t,d) × IDF(t). This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.…
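A small sketch of these weights on the same two example documents. It assumes TF is the share of the document's words equal to the term, and IDF is the natural log of (number of docs / number of docs containing the term); libraries often use smoothed variants, so treat the exact numbers as illustrative.

```python
import math

# The two example documents, already lowercased.
corpus = {"a": "the house is blue", "b": "the house is red"}

def tf(term, doc):
    """Share of the words in doc that are equal to term."""
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    """log(D / Dt); assumes the term appears in at least one document."""
    containing = sum(1 for doc in docs if term in doc.split())
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = list(corpus.values())
house_weight = tf_idf("house", corpus["a"], docs)  # in every doc, so weight 0
blue_weight = tf_idf("blue", corpus["a"], docs)    # only in doc a, so weight > 0
print(house_weight, blue_weight)
```

The word "house" appears in every document, so it does not help to tell them apart and gets weight 0, while "blue" is specific to document a.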

Ethereum Virtual Machine (EVM)

ETH aspires to be a kind of giant decentralized computer, and that is why it is called EVM Ethereum Virtual Machine. Where BTC can help users avoid banks, ETH can help them avoid Facebook, Amazon and all that layer of centralized intermediaries that has solidified and become entrenched in today’s electronic societies.

Ethereum is a programmable blockchain: instead of giving users fixed functionalities like BTC, it allows them to create their own operations. It serves as a platform for currencies and other apps.

Cryptos are just one application. Ethereum is Turing complete and developers can use many other languages besides Solidity.

Each node executes the code, and for this reason the Ethereum VM is defined as a large vault computer where multiple nodes execute code called smart contracts.…

Smart Contracts

Smart contract is a term used to describe computer program code that is capable of facilitating, executing, and enforcing the negotiation or performance of an agreement (i.e. contract) using Blockchain technology.

It’s basically a fancy term to describe code running on a blockchain (BC) that can change its state. This code can be Pascal, Python, PHP, Java, Fortran, C++. Each language has its own strengths and weaknesses. You don’t program websites in C or compress video in Ruby, but you could do it by paying an enormous price in convenience, performance and length.

DAPPS Pros and Cons

PROS OF DAPPS
• Immutability – A third party cannot make any changes to data.
• Corruption & tamper proof – Apps are based on a network formed around the principle of consensus, making censorship impossible.
• Secure – With no central point of failure and secured using cryptography, applications are well protected against hacking attacks and fraudulent activities.
• Zero downtime – Apps never go down and can never be switched off.

CONS OF DAPPS
• Decentralized apps depend completely on their code, and the code is written by humans, so it may contain bugs.
• Problem of identity management
• Lack of implementations in real life. It needs good practice.
• Because Ethereum is a platform to develop dapps, it will never be as effective as other chains that are designed specifically to be a cryptocurrency.…

Some interesting use cases of Ethereum

HEALTHCARE. Ethereum will completely revolutionize the health-care system. For example, you can go to a doctor in Thailand for a check-up when you are on holiday and to a hospital in New York when you are back home again, and both will have the same information about you. You can also share your medical data in real time with medical facilities everywhere.

SECURITY. Security from hackers. The fact that there is no central server for a hacker to attack makes it a lot harder to break into.

TRANSACTIONS. Easy and secure transactions. On Ethereum, so-called “smart contracts” can be made. These make it possible to exchange anything of value, completely risk-free. Instead of paper, use computer code.

PRIVACY: Privacy from third parties.…

Paper and Hardware Wallets

PAPER WALLET

Once your crypto is printed, store it in a safe place. When you are ready to cash in the currency, use an online or mobile wallet app to move the coins from the paper to the wallet or your bank. Some exchanges allow you to go straight from the paper wallet to the exchange’s online wallet.

Pros

– Cannot be hacked online
– Can be stored in a safe place like a safety deposit box
– No need to worry about outages or hardware issues

Cons

– Accidentally lost or destroyed
– Your password can be stolen while creating the paper wallet

HARDWARE WALLET

Comes in the form of a USB device where you can store ether or any other crypto. Trezor is the most popular one.

Pros

– You can back up your wallet and restore it later even if the hardware gets lost
– Water resistant, durable
– Can store multiple cryptos at the same time

Cons

– Vulnerable as any other electronic device
– If someone has your PIN you may lose all your cryptos
– If someone gets your recovery card, they can reset your PIN
– Can be hacked when online

Recommender Systems

Full implementation is complex: it needs advanced linear algebra.

Types of Recommender Systems:

  • Content based. Focus on the attributes of the items: the usual “related items”.
  • Collaborative filter (CF). Uses “wisdom of the crowd” to recommend items: eg Amazon. CF is more commonly used than content-based systems. It can do feature learning by itself.
    The MovieLens dataset of movies is the one to study. These methods can be:
    – Model-based CF: singular value decomposition.
    – Memory-based CF: computing cosine similarity.
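As a rough sketch of the cosine-similarity flavour of CF, the code below recommends the item whose ratings look most like those of a given item. The users, movies and ratings are invented (this is not the MovieLens data), and it assumes every user has rated every item, which real data never satisfies.

```python
import math

# Invented ratings (1 to 5); every user rated every movie for simplicity.
ratings = {
    "alice": {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "bob":   {"Movie A": 4, "Movie B": 5, "Movie C": 2},
    "carol": {"Movie A": 1, "Movie B": 2, "Movie C": 5},
}

def item_vector(item):
    """Ratings given to one item, in a fixed user order."""
    return [user_ratings[item] for user_ratings in ratings.values()]

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) *
                  math.sqrt(sum(y * y for y in v)))

def most_similar(item):
    """The other item whose rating pattern is closest to this one."""
    items = next(iter(ratings.values())).keys()
    others = [other for other in items if other != item]
    return max(others, key=lambda other: cosine_similarity(
        item_vector(item), item_vector(other)))

recommendation = most_similar("Movie A")
print(recommendation)
```

Users who liked Movie A also liked Movie B, so B comes out as the "related item" without looking at any attribute of the movies themselves.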

What is Artificial Intelligence

AI (Artificial intelligence) is a subfield of computer science that was created in the 1960s, and it was (is) concerned with solving tasks that are easy for humans but hard for computers. The term comes from the era of the massive computers of the 50s (Prof. McCarthy).

Machine Learning is where relational databases were in the early 90s. Everyone knew it would be useful for essentially every company, but very few companies had the ability to take advantage of it.

Bias vs Variance, Overfitting vs Underfitting

This is related to confusing signal with noise.

THE BIAS-VARIANCE TRADE-OFF
– Bias: distance of the results to the target.
– Variance: the spread of the results

OVERFITTING VS UNDERFITTING
– Overfitting: the model gets too complex and fits the noise in the data. This results in low error on the training set but high error on new data (test/validation sets). Low bias but high variance.
– Underfitting: a model that is too simple does not capture the underlying trend of the data and does not fit the data well enough. Low variance but high bias.
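The two definitions (distance to the target, spread of the results) can be put into numbers with a tiny sketch. The "results" below are invented predictions for a single target value, as if the same kind of model had been retrained on different samples; `bias_and_variance` is a hypothetical helper.

```python
import statistics

def bias_and_variance(results, target):
    """Bias: distance of the average result to the target.
    Variance: spread of the results around their own average."""
    bias = abs(statistics.mean(results) - target)
    variance = statistics.pvariance(results)
    return bias, variance

target = 10.0
# Invented predictions, for illustration only.
too_simple = [7.0, 7.1, 6.9, 7.0]     # consistent, but far from the target
too_complex = [8.0, 12.0, 9.0, 11.0]  # centred on the target, but scattered

simple_bias, simple_variance = bias_and_variance(too_simple, target)
complex_bias, complex_variance = bias_and_variance(too_complex, target)
print(simple_bias, simple_variance)
print(complex_bias, complex_variance)
```

The first set behaves like an underfitting model (high bias, low variance) and the second like an overfitting one (low bias, high variance).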

Ethereum vs Bitcoin

The essential difference between ETH and BTC is that in the former you can not only transmit money but also run smart contracts (SC) and build DApps. This happens on the EVM (Ethereum Virtual Machine), thanks to Solidity, the native language of ETH, which is used to write smart contracts and DApps. The ETH cryptocurrency is then used to make these apps and smart contracts work.

BTC is developing other tools such as RSK, the first open platform for smart contracts on BTC, which remunerates BTC miners by allowing them to participate in the SC revolution. If it works, it could make ETH somewhat less relevant.

The RSK project started as a fork of an Ethereum codebase in 2016 and it was developed for 2 years until it was finally launched in 2018.…